Dependency Parsing. Statistical NLP Fall 2016, Lecture 9 (Slav Petrov, Google)
1 Dependency Parsing. Statistical NLP Fall 2016, Lecture 9: Dependency Parsing. Slav Petrov, Google. Example dependency tree: "They solved the problem with statistics" (PRON VERB DET NOUN ADP NOUN), with arcs ROOT, nsubj(solved, They), dobj(solved, problem), det(problem, the), prep(solved, with), pobj(with, statistics). CoNLL Format. Dutch example: "Cathy zag hen zwaaien" ("Cathy saw them wave"), with labels ROOT, su, obj1, vc, mod, punct. (Non-)Projectivity: crossing arcs are needed to account for non-projective constructions; these are fairly rare in English but can be common in other languages (e.g. Czech).
2 Formal Conditions. Styles of Dependency Parsing. Transition-based (tr): fast, greedy, linear-time inference algorithms, trained for greedy search; can be extended with beam search. Graph-based (gr): slower, exhaustive dynamic-programming inference algorithms; support higher-order factorizations. Accuracy/time trade-off: greedy tr is O(n) and k-best tr is O(k n) [Nivre et al.]; 1st- and 2nd-order gr are O(n^3), 3rd-order gr is O(n^4) [McDonald et al.]. Arc-Factored Models. Graph-based parsing assumes that scores factor over the tree (arc-factored models): Score(tree) = sum over edges of the arc scores. Example: Score("I washed dishes with detergent") = Score(washed -> I) + Score(washed -> dishes) + Score(washed -> with) + Score(with -> detergent).
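Arc-factored scoring can be sketched in a few lines; the arc scores below are made-up numbers for the example sentence, not values from any trained model:

```python
# Hypothetical arc scores for "I washed dishes with detergent";
# indices: 0=ROOT, 1=I, 2=washed, 3=dishes, 4=with, 5=detergent.
scores = {
    (0, 2): 10.0,  # ROOT -> washed
    (2, 1): 7.0,   # washed -> I
    (2, 3): 6.5,   # washed -> dishes
    (2, 4): 4.0,   # washed -> with
    (4, 5): 5.5,   # with -> detergent
}

def tree_score(heads, scores):
    """Arc-factored score: the sum of the (head, modifier) arc scores.

    heads[m] is the head of modifier m, for m = 1..n; heads[0] is unused.
    """
    return sum(scores[(h, m)] for m, h in enumerate(heads) if m > 0)

heads = [None, 2, 0, 2, 2, 4]
print(tree_score(heads, scores))  # 33.0
```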
3-4 Dependency Representation. [Slides: worked examples of the dependency representation, marking heads and modifiers for each arc.]
5 Graph-Based Parsing. Arc-Factored Projective Parsing: the Eisner Algorithm.
6-8 Eisner First-Order Rules. The algorithm combines two span types:
- An incomplete span with head h and modifier m is built by joining a complete span (h, r) with a complete span (r+1, m) and adding the arc h -> m.
- A complete span (h, e) is built by joining an incomplete span (h, m) with a complete span (m, e).
In practice there is also a left-arc (mirror-image) version of each rule. [Slides: step-by-step first-order parsing example.] Eisner Algorithm Pseudo Code. Maximum Spanning Trees (MSTs): can use MST algorithms for non-projective parsing!
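The combination rules can be turned into the standard O(n^3) dynamic program. A minimal sketch (best-score only, no backpointers; score[h][m] is an assumed dense arc-score matrix with index 0 as ROOT):

```python
def eisner(score):
    """First-order Eisner algorithm for projective, arc-factored parsing.

    score[h][m] is the score of arc h -> m; index 0 is ROOT.
    Returns the score of the best projective dependency tree.
    """
    n = len(score)
    NEG = float("-inf")
    # spans [s][t][d]: d=0 means the head is on the right (t),
    #                  d=1 means the head is on the left (s).
    incomplete = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    complete = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        complete[i][i][0] = complete[i][i][1] = 0.0

    for span in range(1, n):
        for s in range(n - span):
            t = s + span
            # Incomplete spans: join complete (s, r) and (r+1, t), add an arc.
            best = max(complete[s][r][1] + complete[r + 1][t][0]
                       for r in range(s, t))
            incomplete[s][t][0] = best + score[t][s]  # arc t -> s
            incomplete[s][t][1] = best + score[s][t]  # arc s -> t
            # Complete spans: join an incomplete span with a complete one.
            complete[s][t][0] = max(complete[s][r][0] + incomplete[r][t][0]
                                    for r in range(s, t))
            complete[s][t][1] = max(incomplete[s][r][1] + complete[r][t][1]
                                    for r in range(s + 1, t + 1))
    return complete[0][n - 1][1]  # full span headed by ROOT

score = [[0.0, 1.0, 5.0],
         [0.0, 0.0, 2.0],
         [0.0, 4.0, 0.0]]
print(eisner(score))  # best projective tree: ROOT -> w2, w2 -> w1 (5 + 4 = 9.0)
```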
9-10 Chu-Liu-Edmonds. For each non-root node, greedily pick the highest-scoring incoming arc. If the result contains a cycle: find the cycle and contract it into a single node, recalculate the weights of arcs entering and leaving it, and repeat. An MST of the contracted graph expands to an MST of the original graph (theorem). Final MST. Chu-Liu-Edmonds pseudocode.
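The find-cycle / contract / recalculate loop can be sketched as follows (a dense-graph sketch; the input convention, with -inf for self-arcs and arcs into the root, is an assumption of this sketch, and no attempt is made at the optimized O(n^2) variant):

```python
def _find_cycle(heads):
    """Return one cycle in the head-pointer graph, or None."""
    n = len(heads)
    for start in range(1, n):
        seen, node = set(), start
        while node != 0 and node not in seen:
            seen.add(node)
            node = heads[node]
        if node != 0:  # the first repeated node lies on the cycle
            cycle, cur = [node], heads[node]
            while cur != node:
                cycle.append(cur)
                cur = heads[cur]
            return cycle
    return None

def chu_liu_edmonds(score):
    """Maximum spanning arborescence rooted at node 0.

    score[h][m] is the weight of arc h -> m; self-arcs and arcs into
    the root should be float('-inf'). Returns heads[m] for every node.
    """
    n = len(score)
    # 1. Greedy step: best incoming arc for every non-root node.
    heads = [0] * n
    for m in range(1, n):
        heads[m] = max((h for h in range(n) if h != m),
                       key=lambda h: score[h][m])
    cycle = _find_cycle(heads)
    if cycle is None:
        return heads
    # 2. Contract the cycle into node c and recalculate arc weights.
    cyc = set(cycle)
    cyc_score = sum(score[heads[v]][v] for v in cycle)
    rest = [v for v in range(n) if v not in cyc]   # rest[0] == 0 (root)
    c = len(rest)
    NEG = float("-inf")
    new_score = [[NEG] * (c + 1) for _ in range(c + 1)]
    in_arc, out_arc = {}, {}
    for i, u in enumerate(rest):
        for j, v in enumerate(rest):
            if u != v:
                new_score[i][j] = score[u][v]
        # Entering the cycle at v: pay for breaking the cycle arc into v.
        best_in = max(cycle, key=lambda v: score[u][v] - score[heads[v]][v])
        new_score[i][c] = (cyc_score + score[u][best_in]
                           - score[heads[best_in]][best_in])
        in_arc[i] = best_in
        # Leaving the cycle into u: best cycle-internal source.
        best_out = max(cycle, key=lambda w: score[w][u])
        new_score[c][i] = score[best_out][u]
        out_arc[i] = best_out
    # 3. Solve the contracted graph recursively, then expand.
    sub = chu_liu_edmonds(new_score)
    final = [0] * n
    for j, v in enumerate(rest):
        if v != 0:
            final[v] = rest[sub[j]] if sub[j] != c else out_arc[j]
    enter = in_arc[sub[c]]            # cycle node where the chosen arc enters
    final[enter] = rest[sub[c]]
    for v in cycle:                   # keep the other cycle arcs
        if v != enter:
            final[v] = heads[v]
    return final

NEG = float("-inf")
score = [[NEG, 6.0, 1.0],
         [NEG, NEG, 4.0],
         [NEG, 8.0, NEG]]
print(chu_liu_edmonds(score))  # greedy picks the 1<->2 cycle; result: [0, 0, 1]
```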
11 Arc Weights. Arc feature ideas for f(i, j, k):
- Identities of the words wi and wj and the label lk
- Part-of-speech tags of the words wi and wj and the label lk
- Part-of-speech tags of the words surrounding and between wi and wj
- Number of words between wi and wj, and their orientation
- Combinations of the above
First-Order Feature Computation. (Structured) Perceptron. [Slide: a long list of instantiated feature templates for one candidate arc, such as [VBD ADP], [NNS VBD ADP NNP], [VERB IN left 5], combining POS tags, words, arc direction, and binned distance.]
12-14 Transition-Based Dependency Parsing. Process the sentence left to right; different transition strategies are available; delay decisions by pushing words onto a stack.
Arc-Standard Transition Strategy [Nivre 03]. A configuration is (stack, buffer, arc set A):
- Initial configuration: ([0], [1 ... n], {})
- Terminal configuration: ([0], [], A)
- shift: (sigma, [i|beta], A) => ([sigma|i], beta, A)
- left-arc (label l): ([sigma|i|j], beta, A) => ([sigma|j], beta, A + {j -l-> i})
- right-arc (label l): ([sigma|i|j], beta, A) => ([sigma|i], beta, A + {i -l-> j})
Arc-Standard Example: "I booked a flight to Lisbon". SHIFT, SHIFT, LEFT-ARC(nsubj) (I <- booked); SHIFT, SHIFT, LEFT-ARC(det) (a <- flight); SHIFT, SHIFT, RIGHT-ARC(pobj) (to -> Lisbon); RIGHT-ARC(prep) (flight -> to); RIGHT-ARC(dobj) (booked -> flight).
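The derivation above can be replayed mechanically. A sketch with a hand-written oracle action list for the example sentence (labels follow the slides; the final RIGHT-ARC(root) attaching "booked" to ROOT is implied by the terminal configuration):

```python
def arc_standard(words, actions):
    """Run a sequence of arc-standard transitions (Nivre 2004).

    A configuration is (stack, buffer, arcs); tokens are 1-based, 0 is ROOT.
    """
    stack, buffer, arcs = [0], list(range(1, len(words) + 1)), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        else:
            name, label = act
            j = stack.pop()            # top of stack
            i = stack.pop()            # second element
            if name == "LEFT-ARC":     # j becomes the head of i
                arcs.append((j, label, i))
                stack.append(j)
            elif name == "RIGHT-ARC":  # i becomes the head of j
                arcs.append((i, label, j))
                stack.append(i)
    return stack, buffer, arcs

words = ["I", "booked", "a", "flight", "to", "Lisbon"]
actions = [
    "SHIFT", "SHIFT", ("LEFT-ARC", "nsubj"),   # I <-nsubj- booked
    "SHIFT", "SHIFT", ("LEFT-ARC", "det"),     # a <-det- flight
    "SHIFT", "SHIFT", ("RIGHT-ARC", "pobj"),   # to -pobj-> Lisbon
    ("RIGHT-ARC", "prep"),                     # flight -prep-> to
    ("RIGHT-ARC", "dobj"),                     # booked -dobj-> flight
    ("RIGHT-ARC", "root"),                     # ROOT -root-> booked
]
stack, buffer, arcs = arc_standard(words, actions)
print(stack, buffer)  # [0] [] : terminal configuration reached
```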
15 Arc-Standard Example / Features. Configuration with "a flight" and "I booked" processed and "to Lisbon" remaining: SHIFT? RIGHT-ARC? LEFT-ARC? (Arcs so far: nsubj(booked, I), det(flight, a); target tree adds dobj, prep, pobj.)
Features (ZPar parser):
# From single words
pair { stack.tag stack.word }; stack { word tag }
pair { input.tag input.word }; input { word tag }
pair { input(1).tag input(1).word }; input(1) { word tag }
pair { input(2).tag input(2).word }; input(2) { word tag }
# From word pairs
quad { stack.tag stack.word input.tag input.word }
triple { stack.tag stack.word input.word }
triple { stack.word input.tag input.word }
triple { stack.tag stack.word input.tag }
triple { stack.tag input.tag input.word }
pair { stack.word input.word }
pair { stack.tag input.tag }
pair { input.tag input(1).tag }
# From word triples
triple { input.tag input(1).tag input(2).tag }
triple { stack.tag input.tag input(1).tag }
triple { stack.head(1).tag stack.tag input.tag }
triple { stack.tag stack.child(-1).tag input.tag }
triple { stack.tag stack.child(1).tag input.tag }
triple { stack.tag input.tag input.child(-1).tag }
# Distance
pair { stack.distance stack.word }
pair { stack.distance stack.tag }
pair { stack.distance input.word }
pair { stack.distance input.tag }
triple { stack.distance stack.word input.word }
triple { stack.distance stack.tag input.tag }
Example values in the current configuration: stack top word = flight, stack top POS tag = NOUN, buffer front word = to, child of stack top word = a ...
# Valency
pair { stack.word stack.valence(-1) }
pair { stack.word stack.valence(1) }
pair { stack.tag stack.valence(-1) }
pair { stack.tag stack.valence(1) }
pair { input.word input.valence(-1) }
pair { input.tag input.valence(-1) }
# Unigrams
stack.head(1) { word tag }; stack.label
stack.child(-1) { word tag label }
stack.child(1) { word tag label }
input.child(-1) { word tag label }
# Third order
stack.head(1).head(1) { word tag }; stack.head(1).label
stack.child(-1).sibling(1) { word tag label }
stack.child(1).sibling(-1) { word tag label }
input.child(-1).sibling(1) { word tag label }
triple { stack.tag stack.child(-1).tag stack.child(-1).sibling(1).tag }
triple { stack.tag stack.child(1).tag stack.child(1).sibling(-1).tag }
triple { stack.tag stack.head(1).tag stack.head(1).head(1).tag }
triple { input.tag input.child(-1).tag input.child(-1).sibling(1).tag }
# Label set
pair { stack.tag stack.child(-1).label }
triple { stack.tag stack.child(-1).label stack.child(-1).sibling(1).label }
quad { stack.tag stack.child(-1).label stack.child(-1).sibling(1).label stack.child(-1).sibling(2).label }
pair { stack.tag stack.child(1).label }
triple { stack.tag stack.child(1).label stack.child(1).sibling(-1).label }
quad { stack.tag stack.child(1).label stack.child(1).sibling(-1).label stack.child(1).sibling(-2).label }
pair { input.tag input.child(-1).label }
triple { input.tag input.child(-1).label input.child(-1).sibling(1).label }
quad { input.tag input.child(-1).label input.child(-1).sibling(1).label input.child(-1).sibling(2).label }
SVM / Structured Perceptron hyperparameters: regularization, loss function, hand-crafted features.
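A toy sketch of instantiating a few of these templates for the configuration shown (stack top = flight, buffer front = to); the string encoding of the features is an illustrative choice, not ZPar's actual internal format:

```python
def extract_features(stack, buffer, words, tags):
    """Instantiate a handful of ZPar-style templates as feature strings.

    stack/buffer hold token indices; words/tags map index -> value.
    """
    feats = []
    if stack:
        s0 = stack[-1]
        feats.append(f"stack.word={words[s0]}")
        feats.append(f"stack.tag={tags[s0]}")
        feats.append(f"pair:stack.tag,stack.word={tags[s0]},{words[s0]}")
    if buffer:
        n0 = buffer[0]
        feats.append(f"input.word={words[n0]}")
        feats.append(f"input.tag={tags[n0]}")
    if stack and buffer:
        s0, n0 = stack[-1], buffer[0]
        feats.append(f"pair:stack.word,input.word={words[s0]},{words[n0]}")
        feats.append(f"pair:stack.distance,stack.word={n0 - s0},{words[s0]}")
    return feats

# Configuration from the running example: stack top is "flight" (index 4),
# buffer front is "to" (index 5).
words = {3: "a", 4: "flight", 5: "to"}
tags = {3: "DET", 4: "NOUN", 5: "ADP"}
print(extract_features([4], [5], words, tags))
```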
16 Neural Network Transition-Based Parser [Chen & Manning 14] and [Weiss et al. 15]. Architecture, bottom to top: atomic inputs such as f0 = 1[buffer0-word = "to"], f1 = 1[buffer1-word = "Bilbao"], f2 = 1[buffer0-pos = IN], f3 = 1[stack0-label = pobj]; an embedding layer with separate embeddings for words, POS tags, and arc labels; one hidden layer (Chen & Manning) or two hidden layers (Weiss et al.); and a softmax over parser actions.
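A minimal numpy sketch of this architecture (one hidden layer; all sizes and the random initialization are illustrative, and the ReLU activation follows Weiss et al. rather than Chen & Manning's cube activation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies for the three embedding groups (illustrative sizes).
n_words, n_pos, n_labels, dim = 100, 20, 10, 8
E_word = rng.normal(size=(n_words, dim)) * 0.01
E_pos = rng.normal(size=(n_pos, dim)) * 0.01
E_label = rng.normal(size=(n_labels, dim)) * 0.01

n_feats, hidden, n_actions = 3, 16, 4  # e.g. SHIFT, LEFT-ARC, RIGHT-ARC, ...
W1 = rng.normal(size=(hidden, n_feats * dim)) * 0.01
b1 = np.zeros(hidden)
W2 = rng.normal(size=(n_actions, hidden)) * 0.01

def action_probs(word_id, pos_id, label_id):
    """Embed the atomic inputs, concatenate, apply one hidden layer,
    and return a softmax distribution over parser actions."""
    x = np.concatenate([E_word[word_id], E_pos[pos_id], E_label[label_id]])
    h = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer
    z = W2 @ h
    p = np.exp(z - z.max())            # numerically stable softmax
    return p / p.sum()

p = action_probs(word_id=5, pos_id=3, label_id=1)
print(p.shape)  # (4,)
```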
17 English Results (WSJ section 23). Methods compared: 3rd-order graph-based (ZM2014); transition-based linear (ZN2011); NN baseline (Chen & Manning 2014); NN with better SGD (Weiss et al. 2015); NN with deeper network (Weiss et al. 2015). [Table: UAS / LAS / beam size per method; the numeric scores were not preserved in the transcription.]
NN Hyperparameters: regularization, loss function, dimensions, activation function, initialization, Adagrad, dropout.
18 NN Hyperparameters: dimensions, regularization, loss function, activation function, initialization, Adagrad, dropout, mini-batch size, initial learning rate, learning-rate schedule, momentum, stopping time, parameter averaging. Optimization matters! Use random restarts and grid search, and pick the best model on held-out data. Splits: tune on WSJ section 24, dev on WSJ section 22, test on WSJ section 23.
Random Restarts: How Much Variance? [Plot: variance of networks on the tuning/dev set; a 2nd hidden layer plus pretraining increases the tune/dev correlation.]
Effect of Embedding Dimensions. [Plot: word-embedding dimension D_words (up to 128) vs. UAS on the WSJ tune set, with D_pos = D_labels = 32, for pretrained 200x200 networks.]
19 Effect of Embedding Dimensions. [Plot: POS/label embedding dimension (D_pos, D_labels) vs. UAS on the WSJ tune set, with D_words = 64, for pretrained 200x200 networks.]
How Important is Lookahead? Running example: "Alice saw Bob eat pizza with Charlie". [Plots: UAS on WSJ section 22 for a local model as the amount of lookahead into the buffer varies.]
20 How Important is Lookahead? [Plots continued: the local feed-forward model is compared against an LSTM and a Bi-LSTM encoder [Kiperwasser & Goldberg '16] on "Alice saw Bob eat pizza with Charlie"; the Bi-LSTM sees the whole sentence and removes the need for explicit lookahead.]
21 Better Beam Search with Local Model. [Schematic plot: UAS vs. lookahead; adding beam search to the local model closes part of the gap.]
Beam Training with Early Updates. Backpropagate through all steps, paths, and layers [Collins and Roark 04; Zhou et al. 15]. The model is globally normalized with respect to the beam: p(y_i) = exp(score(y_i)) / sum_{j=1..Beam} exp(score(y_j)), where score(y_j) is the cumulative score of the j-th path on the beam.
Label Bias. In a local model every decision is normalized to [0, 1], so the model cannot penalize a decision for pushing the overall structure too far from the gold tree; a global model learns to assign credit and blame. Other views: the model needs to learn to model states not along the gold path, and to look into the future and overrule low-entropy (low-branching) states.
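The beam-level normalization can be sketched numerically (the path scores below are made-up values; numpy's logaddexp gives a stable log of the partition sum over the beam):

```python
import numpy as np

def beam_log_likelihood(path_scores, gold_index):
    """CRF-style objective over a beam: the gold path's cumulative score,
    normalized against the sum over all paths currently on the beam."""
    s = np.asarray(path_scores, dtype=float)
    log_z = np.logaddexp.reduce(s)   # log sum_j exp(score_j)
    return s[gold_index] - log_z     # log p(gold | beam)

# Beam of 4 candidate action sequences with cumulative scores;
# the gold path is the highest-scoring one at index 0.
scores = [2.0, 1.0, 0.5, -1.0]
print(beam_log_likelihood(scores, gold_index=0))
```

Maximizing this quantity (equivalently, minimizing its negative) pushes probability mass toward the gold path relative to the competing beam paths, which is what early updates backpropagate through.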
22 Globally Normalized Model. [Plot: UAS vs. lookahead for Local, Local+Beam, and Global models.]
English Results (WSJ section 23). Methods compared: 3rd-order graph-based (ZM2014); transition-based linear (ZN2011); NN baseline (Chen & Manning 2014); NN better SGD (Weiss et al. 2015); NN deeper network (Weiss et al. 2015); NN perceptron (Weiss et al. 2015); NN CRF (Andor et al. 2016); NN CRF semi-supervised (Andor et al.); S-LSTM (Dyer et al. 2015); contrastive NN (Zhou et al. 2015). [Table: UAS / LAS / beam size per method; numbers not preserved in the transcription.]
Tri-Training [Zhou et al. 05; Li et al. 14]. Parse unlabeled data with two different parsers (ZPar and the Berkeley parser) and keep the sentences on which they agree (~40% agreement).
English Out-of-Domain Results. Train on WSJ + Web Treebank + QuestionBank; evaluate on Web. [Bar chart: UAS (%), supervised vs. semi-supervised, for 3rd-order graph (ZM2014), transition-based linear (ZN2011, B=32), transition-based NN (B=1), and transition-based NN (B=8).]
23 Multilingual Results. [Table: UAS (%) on Catalan, Chinese, Czech, English, German, Japanese, and Spanish for tensor-based graph (Lei et al. '14), 3rd-order graph (Zhang & McDonald '14), transition-based NN (Weiss et al. '15), and transition-based CRF (Andor et al. '16); variants without tri-training data and with morphological features.]
SyntaxNet and Parsey McParseface. (Add screenshots once released.) LSTMs vs. SyntaxNet:
             LSTMs    SyntaxNet
Accuracy     +        ++
Efficiency   -        +
End-to-End   ++       -
Recurrence   +        - (yet)
24 Universal Dependencies: Stanford Universal++ Dependencies, Interset++ morphological features, Google Universal++ POS tags.
Summary. Constituency parsing: CKY algorithm, lexicalized grammars, latent-variable grammars, conditional random field parsing, neural network representations. Dependency parsing: Eisner algorithm, maximum spanning tree algorithm, transition-based parsing, neural network representations.
More informationMaxent Models and Discriminative Estimation
Maxent Models and Discriminative Estimation Generative vs. Discriminative models (Reading: J+M Ch6) Introduction So far we ve looked at generative models Language models, Naive Bayes But there is now much
More informationNLP Programming Tutorial 8 - Recurrent Neural Nets
NLP Programming Tutorial 8 - Recurrent Neural Nets Graham Neubig Nara Institute of Science and Technology (NAIST) 1 Feed Forward Neural Nets All connections point forward ϕ( x) y It is a directed acyclic
More informationLecture 12: Algorithms for HMMs
Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 17 October 2016 updated 9 September 2017 Recap: tagging POS tagging is a
More informationThe Infinite PCFG using Hierarchical Dirichlet Processes
S NP VP NP PRP VP VBD NP NP DT NN PRP she VBD heard DT the NN noise S NP VP NP PRP VP VBD NP NP DT NN PRP she VBD heard DT the NN noise S NP VP NP PRP VP VBD NP NP DT NN PRP she VBD heard DT the NN noise
More informationEmpirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs
Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21
More informationNeural networks CMSC 723 / LING 723 / INST 725 MARINE CARPUAT. Slides credit: Graham Neubig
Neural networks CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu Slides credit: Graham Neubig Outline Perceptron: recap and limitations Neural networks Multi-layer perceptron Forward propagation
More informationACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging
ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers
More informationStructured Prediction
Machine Learning Fall 2017 (structured perceptron, HMM, structured SVM) Professor Liang Huang (Chap. 17 of CIML) x x the man bit the dog x the man bit the dog x DT NN VBD DT NN S =+1 =-1 the man bit the
More informationMachine Learning for NLP
Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers
More informationTransition-based Dependency Parsing with Selectional Branching
Transition-based Dependency Parsing with Selectional Branching Jinho D. Choi Department of Computer Science University of Massachusetts Amherst Amherst, MA, 01003, USA jdchoi@cs.umass.edu Andrew McCallum
More informationNatural Language Processing
Natural Language Processing Info 159/259 Lecture 7: Language models 2 (Sept 14, 2017) David Bamman, UC Berkeley Language Model Vocabulary V is a finite set of discrete symbols (e.g., words, characters);
More informationGraph-based Dependency Parsing. Ryan McDonald Google Research
Graph-based Dependency Parsing Ryan McDonald Google Research ryanmcd@google.com Reader s Digest Graph-based Dependency Parsing Ryan McDonald Google Research ryanmcd@google.com root ROOT Dependency Parsing
More informationHidden Markov Models (HMMs)
Hidden Markov Models HMMs Raymond J. Mooney University of Texas at Austin 1 Part Of Speech Tagging Annotate each word in a sentence with a part-of-speech marker. Lowest level of syntactic analysis. John
More informationDeep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation
Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Steve Renals Machine Learning Practical MLP Lecture 5 16 October 2018 MLP Lecture 5 / 16 October 2018 Deep Neural Networks
More informationStatistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.
http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT
More informationA Tabular Method for Dynamic Oracles in Transition-Based Parsing
A Tabular Method for Dynamic Oracles in Transition-Based Parsing Yoav Goldberg Department of Computer Science Bar Ilan University, Israel yoav.goldberg@gmail.com Francesco Sartorio Department of Information
More informationA Supertag-Context Model for Weakly-Supervised CCG Parser Learning
A Supertag-Context Model for Weakly-Supervised CCG Parser Learning Dan Garrette Chris Dyer Jason Baldridge Noah A. Smith U. Washington CMU UT-Austin CMU Contributions 1. A new generative model for learning
More informationProbabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov
Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly
More informationArtificial Intelligence
CS344: Introduction to Artificial Intelligence Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 20-21 Natural Language Parsing Parsing of Sentences Are sentences flat linear structures? Why tree? Is
More informationStatistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.
http://goo.gl/xilnmn Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT
More informationMachine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6
Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)
More informationDynamic-oracle Transition-based Parsing with Calibrated Probabilistic Output
Dynamic-oracle Transition-based Parsing with Calibrated Probabilistic Output Yoav Goldberg Computer Science Department Bar Ilan University Ramat Gan, Israel yoav.goldberg@gmail.com Abstract We adapt the
More informationPart-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287
Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the
More informationFeature Noising. Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager
Feature Noising Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager Outline Part 0: Some backgrounds Part 1: Dropout as adaptive
More informationLecture 5: UDOP, Dependency Grammars
Lecture 5: UDOP, Dependency Grammars Jelle Zuidema ILLC, Universiteit van Amsterdam Unsupervised Language Learning, 2014 Generative Model objective PCFG PTSG CCM DMV heuristic Wolff (1984) UDOP ML IO K&M
More informationNatural Language Processing
Natural Language Processing Info 159/259 Lecture 12: Features and hypothesis tests (Oct 3, 2017) David Bamman, UC Berkeley Announcements No office hours for DB this Friday (email if you d like to chat)
More informationLatent Variable Models in NLP
Latent Variable Models in NLP Aria Haghighi with Slav Petrov, John DeNero, and Dan Klein UC Berkeley, CS Division Latent Variable Models Latent Variable Models Latent Variable Models Observed Latent Variable
More information