Improving Neural Parsing by Disentangling Model Combination and Reranking Effects. Daniel Fried*, Mitchell Stern* and Dan Klein UC Berkeley


Top-down generative models

The sentence "The man had an idea." with tree S → NP VP is generated top-down as the symbol sequence

(S (NP The man ) (VP had (NP an idea ) ) . )

G_LSTM: score the linearized tree with an LSTM language model [Parsing as Language Modeling, Choe and Charniak, 2016]
G_RNNG: generate it with a recurrent neural network grammar [Recurrent Neural Network Grammars, Dyer et al. 2016]
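
A minimal sketch (assumed tree representation, not the authors' code) of the linearization these models score: a bracketed tree is flattened into the symbol sequence above, which G_LSTM can then treat as an ordinary sentence for a language model.

```python
# Minimal sketch: linearize a (label, children) constituency tree into the
# top-down symbol sequence scored by a sequential generative model.

def linearize(tree):
    """Flatten a (label, children) tree into a top-down symbol sequence."""
    label, children = tree
    tokens = ["(" + label]
    for child in children:
        if isinstance(child, str):      # terminal word
            tokens.append(child)
        else:                           # nonterminal subtree
            tokens.extend(linearize(child))
    tokens.append(")")
    return tokens

# "The man had an idea."
tree = ("S", [("NP", ["The", "man"]),
              ("VP", ["had", ("NP", ["an", "idea"])]),
              "."])
print(" ".join(linearize(tree)))
# (S (NP The man ) (VP had (NP an idea ) ) . )
```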

Generative models as rerankers

Pipeline: a base parser B proposes candidate parses y ~ p_B(y | x) for the sentence (for "The man had an idea.", e.g. an S-INV analysis, an analysis with an ADJP, and the correct S → NP VP parse); the generative neural model G then rescores the candidates and returns argmax_y p_G(x, y).

F1 on Penn Treebank:
Choe and Charniak 2016: Charniak parser 89.7 → reranked with LSTM language model (G_LSTM) 92.6
Dyer et al. 2016: RNNG-discriminative 91.7 → reranked with RNNG-generative (G_RNNG) 93.3
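
A compact sketch of this reranking step, with hypothetical stand-ins for the two models (`candidates` for the base parser's proposals and `log_p_g` for the generative model's joint score):

```python
# Hypothetical sketch of reranking candidates from the base parser B with a
# generative model G; `candidates` and `log_p_g` stand in for real models.

def rerank(sentence, candidates, log_p_g):
    """Return the candidate parse that the generative model scores highest.

    candidates: list of parses y proposed by the base parser, y ~ p_B(y | x)
    log_p_g:    function giving log p_G(x, y) for the joint generative model
    """
    return max(candidates, key=lambda y: log_p_g(sentence, y))
```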

B: Necessary evil, or secret sauce?

Should we try to do away with the base parser B and search with G alone? No: it is better to combine B and G more explicitly, reaching 93.9 F1 on the PTB (94.7 semi-supervised).

Using standard beam search for G

True parse prefix: (S (NP The man ... After a few steps the beam is filled with competing structural prefixes — (S, (NP, (VP, (PP, (S (NP, (VP (NP, ... — and only much later does any hypothesis generate the first word "The".

Results with beam size 100: G_RNNG 29.1 F1, G_LSTM 27.4 F1.

Standard beam search in G fails: word generation is lexicalized. [Plot: log probability of each symbol in the sequence (S (NP The man ) (VP had (NP an idea ) ) . ); y-axis from 0 to -5.]
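
A toy illustration (made-up probabilities, not measured ones) of why this biases an action-synchronous beam: structural actions are cheap while generating a specific word from a large vocabulary is expensive, so prefixes that delay word generation look better at the same step count.

```python
# Toy numbers only: structure actions vs. lexicalized word generation.
from math import log

open_nt = log(0.5)      # e.g. opening a nonterminal
gen_word = log(0.001)   # generating a specific word from a large vocabulary

lazy = 6 * open_nt                                  # six structural actions, no words
eager = 2 * open_nt + gen_word + 3 * open_nt        # same length, one word generated

print(lazy > eager)  # True: the beam keeps the prefix that delays word generation
```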

Word-synchronous beam search

Group hypotheses by the number of words they have generated: one beam for prefixes with zero words (w_0: (S, (S (NP, (VP, (PP, ...), one for prefixes that have generated "The" (w_1), one for prefixes that have generated "The man" (w_2), and so on, so that prefixes only compete against others covering the same words. [Roark 2001; Titov and Henderson 2010; Charniak 2010; Buys and Blunsom 2015]
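
A minimal illustrative sketch of the idea (not the paper's implementation; `next_actions`, `is_word`, and `step_logprob` are hypothetical interfaces to a generative parser G):

```python
import heapq

def word_sync_beam_search(sentence, next_actions, is_word, step_logprob,
                          beam_size=100, max_struct_actions=8):
    """Illustrative word-synchronous beam search over a generative parser G.

    Hypotheses are (score, action_prefix) pairs and are only compared against
    hypotheses that have generated the same number of words. Simplified: the
    closing actions after the last word and the word-level fast track used in
    practice are omitted.
    """
    beam = [(0.0, ())]
    for word in sentence:
        synced = []                           # hypotheses that just emitted `word`
        frontier = beam
        for _ in range(max_struct_actions):   # bounded structural expansion
            expanded = []
            for score, prefix in frontier:
                for action in next_actions(prefix, word):
                    cand = (score + step_logprob(prefix, action),
                            prefix + (action,))
                    (synced if is_word(action) else expanded).append(cand)
            frontier = heapq.nlargest(beam_size, expanded)
            if not frontier:
                break
        beam = heapq.nlargest(beam_size, synced)
    return max(beam)                          # best-scoring synchronized prefix
```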

[Plot: F1 on the PTB (y-axis 70–95) against beam size (100–1000) for decoding with G alone using word-synchronous beam search, one curve for G_LSTM and one for G_RNNG; a second build overlays the corresponding B → G_LSTM and B → G_RNNG reranking results for comparison.]

Finding model combination effects

Add G's own search proposals (from word-synchronous beam search) to the candidate list alongside B's candidates, and rerank the combined pool with G.

F1 on PTB: reranking B's candidates alone gives 93.7 with the RNNG generative model (G_RNNG) and 93.5 with the LSTM generative model (G_LSTM); adding G's own proposals to the pool drops these to 93.5 and 92.8.
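
A small sketch of this candidate-augmentation comparison (the helpers `b_candidates`, `g_search`, `log_p_g`, and `f1` are hypothetical stand-ins for real components):

```python
# Compare reranking over B's candidates vs. the union of B's and G's candidates.

def compare_candidate_pools(sentences, gold, b_candidates, g_search, log_p_g, f1):
    rerank = lambda x, pool: max(pool, key=lambda y: log_p_g(x, y))
    from_b    = [rerank(x, b_candidates(x)) for x in sentences]
    from_both = [rerank(x, b_candidates(x) + g_search(x)) for x in sentences]
    # In the experiments above, the second pool scores lower when ranked by G alone.
    return f1(from_b, gold), f1(from_both, gold)
```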

Reranking shows implicit model combination: the B → G pipeline hides model errors in G.

Making model combination explicit

Can we do better by simply combining model scores? Replace the reranking score log p_G(x, y) with the interpolation

λ log p_G(x, y) + (1 − λ) log p_B(y | x)
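
A small sketch of this interpolation, assuming hypothetical scoring functions `log_p_g` and `log_p_b` and a weight λ that would be tuned on the development set:

```python
# Sketch of explicit score combination; log_p_g, log_p_b, and lam are
# hypothetical stand-ins (lam would be tuned on held-out data).

def combined_rerank(sentence, candidates, log_p_g, log_p_b, lam=0.5):
    def score(y):
        return lam * log_p_g(sentence, y) + (1.0 - lam) * log_p_b(y, sentence)
    return max(candidates, key=score)
```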

Making model combination explicit: scoring the same candidate pools with B + G instead of G alone improves every setting, from 92.8–93.7 F1 to 93.9–94.2 F1 on the PTB (for both G = G_RNNG and G = G_LSTM).

Explicit score combination prevents errors: [example comparing the parse chosen by G alone with the one chosen by B + G].

Comparison to past work (F1 on PTB)

Choe & Charniak 2016: 92.6 (93.8 with silver data)
Dyer et al. 2016: 93.3
Kuncoro et al. 2017: 93.6
Ours: 93.5 with B → G_RNNG; 93.9 after adding G_LSTM to the B + G_RNNG combination; 94.7 after also adding silver data.

Conclusion

A search procedure for decoding with G (a more effective version forthcoming: Stern et al., EMNLP 2017). We found model combination effects hidden inside the B → G reranking pipeline, and obtained large improvements from simple, explicit score combination: B + G.

Thanks!