Recurrent neural network grammars

Size: px

Start display at page:

Download "Recurrent neural network grammars"

Rosa Riley
5 years ago
Views:

1 Widespread phenomenon: Polarity items can only appear in certain contexts Recurrent neural network grammars lide credits: Chris Dyer, Adhiguna Kuncoro Example: anybody is a polarity item that tends to appear only in specific contexts: lecture that I gave did not appeal to anybody but not: * lecture that I gave appealed to anybody We might infer that the licensing context is the word not appearing among the preceding words, and you could use an RNN to model this. However: * lecture that I did not give appealed to anybody Language is hierarchical licensing context depends on recursive structure (syntax) One theory of hierarchy Generate symbols sequentially using an RNN Add some control symbols to rewrite the history periodically lecture that I gave did not appeal to anybody Periodically compress a sequence into a single constituent Augment RNN with an operation to compress recent history into a single vector (-> reduce ) RNN predicts next symbol based on the history of compressed elements and non-compressed terminals ( shift or generate ) RNN must also predict control symbols that decide how big constituents are lecture appealed to anybody We call such models recurrent neural network grammars. that I did not give

2 (Ordered) tree traversals are sequences Terminals tack Action NT() ( NT() ( ( GEN() ( ( GEN(hungry) hungry ( ( hungry GEN(cat) ( ( REDUCE ( ( ) ( ( ) ( meows ). ) Terminals hungry tack ( ( hungry ( ( Action NT() ( NT() ( ( GEN() ( ( ( ( ) ( ( ) Compress into a single composite symbol GEN(hungry) GEN(cat) REDUCE Terminals hungry tack Action NT() ( NT() ( ( GEN() ( ( GEN(hungry) ( ( hungry GEN(cat) ( ( REDUCE ( ( ) NT()

3 Terminals tack Action Terminals tack Action NT() NT() ( NT() ( NT() ( ( GEN() ( ( GEN() ( ( GEN(hungry) ( ( GEN(hungry) hungry ( ( hungry GEN(cat) hungry ( ( hungry GEN(cat) ( ( REDUCE ( ( REDUCE ( ( ) NT() ( ( ) NT() ( ( ) ( ( ( ) (??? Q: What information can we use to predict the next action, and how can we encode it with an RNN? Terminals hungry tack ( ( hungry ( ( ( ( ) ( ( ) ( Action NT() ( NT() ( ( GEN() ( ( A: We can use an RNN for each of: 1. Previous terminal symbols 2. Previous actions 3. Current contents GEN(hungry) GEN(cat) REDUCE NT() Terminals hungry meows meows tack Action NT() ( NT() ( ( GEN() ( ( GEN(hungry) ( ( hungry GEN(cat) ( ( REDUCE ( ( ) NT() ( ( ) ( GEN(meows) ( ( ) ( meows REDUCE ( ( ) ( meows) GEN(.) ( ( ) ( meows). REDUCE ( ( ) ( meows).)

Terminals hungry meows meows tack Action NT() ( NT() ( ( GEN() ( ( GEN(hungry) ( ( hungry GEN(cat) ( ( REDUCE Final ( symbol ( hungry is cat) NT() (a vector ( representation ( ) of) ( GEN(meows) the

4 Terminals hungry meows meows tack Action NT() ( NT() ( ( GEN() ( ( GEN(hungry) ( ( hungry GEN(cat) ( ( REDUCE Final ( symbol ( hungry is cat) NT() (a vector ( representation ( ) of) ( GEN(meows) the complete tree. ( ( ) ( meows REDUCE ( ( ) ( meows) GEN(.) ( ( ) ( meows). REDUCE ( ( ) ( meows).) yntactic Composition Need representation for: ( ) ( hungry cat ) Recursion tack symbols composed recursively mirror corresponding tree structure Need representation for: ( ) ( (ADJP very hungry) cat) {z } v Effect tack encodes -down syntactic recency, rather than left-to-right string recency ( v cat )

5 tack RNNs tack RNNs Augment a sequential RNN with a pointer Two constant-time operations push - read input, add to of, connect to current location of the pointer pop - move pointer to its parent A summary of contents is obtained by accessing the output of the RNN at location of the pointer Note: push and pop are discrete actions here (cf. Grefenstette et al., 2015) y 0 ; PUH tack RNNs tack RNNs y 0 y 1 POP y 0 y 1 ; x 1 ; x 1

6 tack RNNs tack RNNs y 0 y 1 PUH y 0 y 1 y 2 POP ; x 1 ; x 1 x 2 tack RNNs tack RNNs y 0 y 1 y 2 y 0 y 1 y 2 PUH ; x 1 x 2 ; x 1 x 2

7 tack RNNs evolution of the LTM over y 0 y 1 y 2 y 3 ; x 1 x 2 x 3 ( ( ) ( meows ). ) evolution of the LTM over evolution of the LTM over ( ( ) ( meows ). ) ( ( ) ( meows ). )

8 evolution of the LTM over evolution of the LTM over ( ( ) ( meows ). ) ( ( ) ( meows ). ) evolution of the LTM over evolution of the LTM over ( ( ) ( meows ). ) ( ( ) ( meows ). )

9 evolution of the LTM over evolution of the LTM over ( ( ) ( meows ). ) ( ( ) ( meows ). ) evolution of the LTM over evolution of the LTM over ( ( ) ( meows ). ) ( ( ) ( meows ). )

10 evolution of the LTM over Each word is conditioned on history represented by a trio of RNNs p(meows history) ( ( ) ( meows ). ) ( ( ) ( meows ). ) Train with backpropagation through structure In training, backpropagate through these three RNNs) And recursively through this structure. ( ( ) ( meows ). ) This network is dynamic. Don t derive gradients by hand that s error prone. Use automatic differentiation instead Complete model sentence tree Model is dynamic: variable number of context-dependent actions at each step allowable actions at this step equence of actions (completely defines x and y) Actions up to time t bias history action embedding embedding

11 Complete model Parameter Estimation RNNGs jointly model sequences of words together with a tree structure, p (x, y) Any parse tree can be converted to a sequence of actions (depth first traversal) and vice versa (subject to wellformedness constraints) output (buffer) action history We use trees from the Penn Treebank We could treat the non-generation actions as latent variables or learn them with RL, effectively making this a problem of grammar induction. Future work Inference An RNNG is a joint distribution p(x,y) over strings (x) and parse trees (y) We are interested in two inference questions: What is p(x) for a given x? [language modeling] What is max p(y x) for a given x? [parsing] y Unfortunately, the dynamic programming algorithms we often rely on are of no help here We can use importance sampling to do both by sampling from a discriminatively trained model English PTB (Parsing) Petrov and Klein (2007) hindo et al (2012) ingle model hindo et al (2012) Ensemble Vinyals et al (2015) PTB only Vinyals et al (2015) Ensemble Type F1 G 90.1 G 91.1 ~G 92.4 D Discriminative D 89.8 Generative (I) G 92.4

12 Importance ampling Assume we ve got a conditional distribution q(y x) s.t. (i) p(x, y) > 0 =) q(y x) > 0 (ii) y q(y x) is tractable and (iii) q(y x) is tractable Let the importance weights p(x) = X y2y(x) p(x, y) = = E y q(y x) w(x, y) X y2y(x) w(x, y) = p(x, y) q(y x) w(x, y)q(y x) Importance ampling p(x) = X y2y(x) p(x, y) = = E y q(y x) w(x, y) X y2y(x) w(x, y)q(y x) Replace this expectation with its Monte Carlo estimate. y (i) q(y x) for i 2 {1, 2,...,N} E q(y x) w(x, y) MC 1 NX w(x, y (i) ) N i=1 English PTB (LM) Perplexity 5-gram IKN LTM + Dropout Generative (I) Do we need a? Kuncoro et al., Oct 2017 Both and action history encode the same information, but expose it to the classifier in different ways. Chinese CTB (LM) Perplexity 5-gram IKN LTM + Dropout Generative (I) Leaving out is harmful; using it on its own works slightly better than complete model!

RNNG as a mini-linguist Replace composition with one that computes attention over objects in the composed sequence,

What does this learn? What does this learn?

13 RNNG as a mini-linguist Replace composition with one that computes attention over objects in the composed sequence, using embedding of NT for similarity. What does this learn? RNNG as a mini-linguist Replace composition with one that computes attention over objects in the composed sequence, using embedding of NT for similarity. What does this learn? RNNG as a mini-linguist Replace composition with one that computes attention over objects in the composed sequence, using embedding of NT for similarity. What does this learn? RNNG as a mini-linguist Replace composition with one that computes attention over objects in the composed sequence, using embedding of NT for similarity. What does this learn?

RNNG as a mini-linguist Replace composition with one that computes attention over objects in the composed sequence, using embedding of NT for similarity. What does this learn?

14 RNNG as a mini-linguist Replace composition with one that computes attention over objects in the composed sequence, using embedding of NT for similarity. What does this learn? ummary Language is hierarchical, and this inductive bias can be encoded into an RNN-style model. RNNGs work by simulating a tree traversal like a pushdown automaton, but with continuous rather than finite history. Modeled by RNNs encoding (1) previous tokens, (2) previous actions, and (3) contents. A LTM evolves with contents. final representation computed by a LTM has a down recency bias, rather than left-to-right bias, which might be useful in modeling sentences. Effective for parsing and language modeling, and seems to capture linguistic intuitions about headedness.

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford)

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) Parsing Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) S N VP V NP D N John hit the ball Levels of analysis Level Morphology/Lexical POS (morpho-synactic), WSD Elements