Vine Pruning for Efficient Multi-Pass Dependency Parsing Alexander M. Rush and Slav Petrov
Dependency Parsing
Styles of Dependency Parsing
speed: transition-based parsers (Nivre 2004): greedy O(n), k-best O(kn)
accuracy: graph-based parsers (Eisner 2000; McDonald 2005): first-order O(n³), second-order O(n³), third-order O(n⁴) [this work]
Preview: Coarse-to-Fine Cascades: vine → first-order → second-order
linear-size dependency representation
Representation: Heads and Modifiers
First-Order Feature Calculation
First-Order Feature Calculation [slide: full expansion of the first-order feature templates for a single arc: head/modifier POS unigrams and bigrams (e.g. [VBD], [ADP], [VBD ADP]), surrounding-tag trigrams and 4-grams (e.g. [NNS VBD ADP NNP]), each also conjoined with arc direction and bucketed length (e.g. [VBD ADP left 5])]
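The template expansion on this slide can be sketched in code. The following is a hypothetical illustration of McDonald-style first-order arc features (POS unigrams, bigrams, and context trigrams, each conjoined with direction and bucketed length); `arc_features` and the exact template set are a simplification for exposition, not the parser's actual feature code.

```python
# Sketch of first-order arc feature generation (hypothetical template set).

def arc_features(tags, head, mod):
    """Generate feature strings for the arc head -> mod.

    tags: list of POS tags for the sentence; head, mod: token indices.
    """
    direction = "left" if mod < head else "right"
    length = min(abs(head - mod), 5)          # bucket long arcs together
    h, m = tags[head], tags[mod]
    # POS tags adjacent to the head and modifier give context features.
    h_next = tags[head + 1] if head + 1 < len(tags) else "<END>"
    m_prev = tags[mod - 1] if mod > 0 else "<START>"
    feats = [
        f"[{h}]", f"[{m}]",                   # unigram head / modifier
        f"[{h} {m}]",                         # head-modifier bigram
        f"[{h} {h_next} {m}]",                # head context trigram
        f"[{h} {m_prev} {m}]",                # modifier context trigram
        f"[{direction} {length}]",            # direction + bucketed length
    ]
    # Conjoin each POS template with direction and length, as on the slide.
    feats += [f"{f[:-1]} {direction} {length}]" for f in feats[:5]]
    return feats
```

For the tag sequence `["NNS", "VBD", "ADJ", "ADP", "NNP"]` and the arc from token 1 to token 3, this produces features such as `[VBD ADP]` and `[VBD ADP right 2]`, mirroring the instantiations shown above.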
Arc Length By Part-of-Speech [plot: fraction of arcs (counts, 0.0 to 0.5) by arc length (1 to 6) for NOUN, ADP, DET, VERB, ADJ modifiers]
Arc Length Examples
The bill intends to restrict the RTC to Treasury borrowings only, unless the agency receives specific congressional authorization.
This financing system was created in the new law in order to keep the bailout spending from swelling the budget deficit.
But the RTC also requires working capital to maintain the bad assets of thrifts that are sold until the assets can be sold separately.
"It's a problem that clearly has to be resolved," said David Cooke, executive director of the RTC.
"We would have to wait until we have collected on those assets before we can move forward," he said.
The complicated language in the huge new law has muddied the fight.
"That secrecy leads to a proposal like the one from Ways and Means, which seems to me sort of draconian," he said.
The RTC is going to have to pay a price of prior consultation on the Hill if they want that kind of flexibility.
Arc Length Heat Map [plot: head position (1 to 9) vs. modifier position (1 to 9)]
Banded Matrix
Outer Arc
Coarse-to-Fine: vine → first-order → second-order
dynamic programs for parsing
Inference Questions
How do we reduce inference time to O(n)?
How do we decide which arcs to prune?
Vine Parsing (Eisner and Smith 2005)
Eisner First-Order Rules
incomplete(h, m) ← complete(h, r) + complete(m, r + 1)   [add arc h → m]
complete(h, e) ← incomplete(h, m) + complete(m, e)
First-Order Parsing
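The first-order pass runs the standard Eisner dynamic program over the rules above. A minimal sketch (scores only, no backpointers; the `score[h][m]` arc-weight convention and the name `eisner_score` are our own illustration, not the paper's code):

```python
# Minimal sketch of Eisner's O(n^3) first-order algorithm.
# score[h][m] is the weight of arc h -> m; token 0 is the root.

def eisner_score(score):
    n = len(score)                       # includes root at index 0
    NEG = float("-inf")
    # C[s][t][d]: best complete span s..t, head on side d (0=left, 1=right)
    # I[s][t][d]: best incomplete span (arc between s and t, direction d)
    C = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    I = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    for k in range(1, n):                # span length, shortest first
        for s in range(n - k):
            t = s + k
            # Incomplete: join two complete halves and add one arc.
            best = max(C[s][r][1] + C[r + 1][t][0] for r in range(s, t))
            I[s][t][0] = best + score[t][s]     # arc t -> s
            I[s][t][1] = best + score[s][t]     # arc s -> t
            # Complete: extend an incomplete span with a complete one.
            C[s][t][0] = max(C[s][r][0] + I[r][t][0] for r in range(s, t))
            C[s][t][1] = max(I[s][r][1] + C[r][t][1]
                             for r in range(s + 1, t + 1))
    return C[0][n - 1][1]                # best tree rooted at token 0
```

For a three-token example where the chain root → 1 → 2 scores highest, the returned value is the total weight of that tree.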
Vine Parsing Rules
All items keep their left endpoint at the root (position 0), so only O(n) spans exist:
(0, e) ← (0, e − 1) + (e − 1, e)   [extend the root span by one word]
(0, e) ← (0, m) + (m, e)   [attach a modifier m inside the span]
Vine Parsing
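One way to see why the vine pass is linear: for a band of width b, each word has O(b) candidate heads inside the band, plus a root arc, plus one "outer" arc per direction standing in for all longer attachments. A hypothetical enumeration of this arc set (the names and tuple representation are our own illustration, not the paper's data structures):

```python
# Sketch of the linear-size vine arc set: banded arcs, root arcs, and
# per-word "outer" arcs summarizing everything longer than the band.

def vine_arcs(n, b):
    """Candidate arcs for a sentence of n words (tokens 1..n, root = 0)."""
    arcs = []
    for m in range(1, n + 1):
        arcs.append((0, m))                       # root arcs are always kept
        for h in range(max(1, m - b), min(n, m + b) + 1):
            if h != m:
                arcs.append((h, m))               # short arcs inside the band
        arcs.append(("outer-left", m))            # head more than b to the left
        arcs.append(("outer-right", m))           # head more than b to the right
    return arcs
```

For fixed b, the size of this set grows linearly in n, which is what lets the coarse vine pass run in O(n) time.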
Arc Pruning
Prune arcs based on max-marginals:
maxmarginal(a) = max_{y : a ∈ y} w · f(y)
Can compute using the inside-outside algorithm.
Generic algorithm using hypergraph parsing.
Max-Marginals for First-Order Arcs: maxmarginal(a) > threshold?
Max-Marginals for Outer Arcs: maxmarginal(left outer arc) > threshold?
pruning and training
Max-Marginal Pruning
goal: Define a threshold on max-marginal scores.
Validation parameter α trades off between speed and accuracy.
t_α(w) = α max_y w · f(y) + (1 − α) (1/|A|) Σ_{a∈A} maxmarginal(a; w)
The highest-scoring parse upper-bounds every max-marginal.
Assume the average of the max-marginals is lower than the gold parse score.
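The threshold and pruning rule above can be sketched directly. This sketch assumes the max-marginals have already been computed, and uses the fact that the maximum over all arc max-marginals equals the best parse score (every arc in the best parse attains it); `prune` is an illustrative name, not the paper's API.

```python
# Sketch of max-marginal pruning with the interpolated threshold
# t_alpha = alpha * (best parse score) + (1 - alpha) * (average max-marginal).

def prune(max_marginals, alpha):
    """max_marginals: dict arc -> max score of any parse using that arc.

    Returns the set of arcs that survive pruning.
    """
    best = max(max_marginals.values())      # equals the best full-parse score
    avg = sum(max_marginals.values()) / len(max_marginals)
    threshold = alpha * best + (1 - alpha) * avg
    return {a for a, s in max_marginals.items() if s >= threshold}
```

At α = 1 only arcs in a best parse survive; at α = 0 everything above the average survives, so α directly trades speed against the risk of pruning a gold arc.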
Pruning Threshold [diagram: in feature space, the threshold t_α(w) interpolates between the max parse score and the average max-marginal, controlled by α]
Structured Cascade Training (Weiss and Taskar 2011)
Train a linear model with a loss function for pruning.
Regularized risk minimization with a loss based on the threshold:
min_w λ‖w‖² + (1/P) Σ_{p=1}^{P} [1 − w · f(y^(p)) + t_α^(p)(w)]_+
Can use a simple variant of perceptron/Pegasos to train.
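A heavily simplified subgradient step for this objective, with parses represented as plain feature dicts. This is our own sketch, not Weiss and Taskar's implementation: it treats each candidate parse as the max-marginal witness for one arc, and takes the subgradient of t_α(w) as the α-interpolation of the best parse's features and the average witness features.

```python
# Pegasos-style subgradient sketch of the cascade loss
# [1 - w.f(gold) + t_alpha(w)]_+  (all names hypothetical).

def dot(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def cascade_update(w, gold, parses, alpha, lam=0.01, eta=0.1):
    """One subgradient step. gold: feature dict of the gold parse.
    parses: candidate parse feature dicts, one per arc witness."""
    scores = [dot(w, f) for f in parses]
    best = max(scores)
    avg = sum(scores) / len(parses)
    threshold = alpha * best + (1 - alpha) * avg
    if 1.0 - dot(w, gold) + threshold > 0:       # margin violated: update
        best_f = parses[scores.index(best)]
        for k in set(gold) | set().union(*parses):
            # Subgradient of t_alpha: interpolate best and average features.
            g = alpha * best_f.get(k, 0.0) + \
                (1 - alpha) * sum(f.get(k, 0.0) for f in parses) / len(parses)
            w[k] = (1 - eta * lam) * w.get(k, 0.0) + eta * (gold.get(k, 0.0) - g)
    return w
```

The update pushes the gold parse's score up and the threshold down, which is exactly the geometry the following slide illustrates.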
Structured Cascade Training [diagram: the update pushes the gold parse score above the threshold (interpolation of max and average max-marginal) by a margin]
experiments
Implementation
Inference: Experiments use a highly optimized C++ implementation. The baseline first-order parser processes 2000 tokens/sec. Hypergraph parsing framework with shared inference.
Model: Final models trained with hamming-loss MIRA. Full collection of dependency parsing features (Koo 2010). First-, second-, and third-order models match the state of the art.
Baselines
NoPrune: exhaustive parsing model with no pruning
LocalShort: unstructured classifier over O(n) short arcs (Bergsma and Cherry 2010)
Local: unstructured classifier over O(n²) arcs (Bergsma and Cherry 2010)
FirstOnly: structured first-order model in cascade (Koo 2010)
VinePosterior: posterior pruning cascade trained with L-BFGS
ZhangNivre: reimplementation of the state-of-the-art k-best transition-based parser (Zhang and Nivre 2011)
Speed/Accuracy Experiments: First-Order Parsing [plot: relative speed (0 to 6) vs. accuracy (90 to 94) for NoPrune, Local, FirstOnly, VinePosterior, VineCascade, ZhangNivre(8)]
Speed/Accuracy Experiments: Second-Order Parsing [plot: relative speed (0 to 4) vs. accuracy (90 to 94) for the same systems, with ZhangNivre(16)]
Speed/Accuracy Experiments: Third-Order Parsing [plot: relative speed (0 to 2) vs. accuracy (90 to 94) for the same systems, with ZhangNivre(64)]
Empirical Complexity: First-Order Parsing [plot: time vs. sentence length (10 to 50); fitted exponents: NoPrune 2.8, VineCascade 1.4]
Empirical Complexity: Second-Order Parsing [plot: fitted exponents: NoPrune 2.8, VineCascade 1.8]
Empirical Complexity: Third-Order Parsing [plot: fitted exponents: NoPrune 3.8, VineCascade 1.9]
Multilingual Experiments: First-Order Parsing [plot: relative speed of VineCascade over NoPrune for En, Bg, De, Pt, Sw, Zh; scale 0 to 7]
Multilingual Experiments: Second-Order Parsing [plot: same languages; scale 0 to 6]
Multilingual Experiments: Third-Order Parsing [plot: same languages; scale 0 to 3]
Special thanks to: Ryan McDonald, Hao Zhang, Michael Ringgaard, Terry Koo, Keith Hall, Kuzman Ganchev, Yoav Goldberg, Andre Martins, and the rest of the Google NLP team