Tackling the Limits of Deep Learning for NLP


1 Tackling the Limits of Deep Learning for NLP. Richard Socher, Salesforce Research. With Caiming Xiong, Romain Paulus, Stephen Merity, James Bradbury, and Victor Zhong.

2 Einstein's Deep Learning. Einstein Vision: transform your apps with image recognition. Einstein Object Detection (Pilot): optimize CPG sales processes to improve retail. Next for research: advancing natural language understanding.

3 The Limits of Single-Task Learning. We have seen great performance improvements, but every project starts from random initialization, and a single unsupervised task can't fix that. How can we express different tasks in the same framework, e.g. sequence tagging, sentence-level classification, and seq2seq?

4 Framework for Tackling NLP: a joint model for comprehensive question answering.

5 QA Examples: move from {x_i, y_i} to {x_i, q_i, y_i}.
I: Mary walked to the bathroom. I: Sandra went to the garden. I: Daniel went back to the garden. I: Sandra took the milk there. Q: Where is the milk? A: garden
I: Everybody is happy. Q: What's the sentiment? A: positive
Q: What's the summary of the story? A: Happy people walk to different places.
Q: What are the part-of-speech tags? A: NNP VBZ DT NN IN NNP .
I: I think this model is incredible. Q: In French? A: Je pense que ce modèle est incroyable.
I: [image] Q: What color are the bananas? A: Green.
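To make the {x_i, q_i, y_i} framing concrete, here is a minimal sketch of how heterogeneous tasks can be stored in one QA-style format; the field names and the helper are illustrative, not from the talk:

```python
# Hypothetical (input, question, answer) triples: one format for many NLP tasks.
examples = [
    {"input": "Mary walked to the bathroom. Sandra took the milk to the garden.",
     "question": "Where is the milk?",
     "answer": "garden"},
    {"input": "Everybody is happy.",
     "question": "What's the sentiment?",
     "answer": "positive"},
    {"input": "I think this model is incredible.",
     "question": "In French?",
     "answer": "Je pense que ce modèle est incroyable."},
]

def as_triple(example):
    """Return the (x, q, y) triple a joint QA model would train on."""
    return example["input"], example["question"], example["answer"]

for ex in examples:
    x, q, y = as_triple(ex)
    print(f"x={x!r}\nq={q!r}\ny={y!r}\n")
```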

6 First of Four Major Obstacles: for NLP there is no single model architecture with consistent state-of-the-art results across tasks.
Question answering (bAbI): Strongly Supervised MemNN (Weston et al., 2015)
Sentiment analysis (SST): Tree-LSTMs (Tai et al., 2015)
Part-of-speech tagging (PTB-WSJ): Bi-directional LSTM-CRF (Huang et al., 2015)

7 Tackling Obstacle 1: the Dynamic Memory Network. [Architecture figure] The input module encodes the story as sentence representations s_1 ... s_8 ("Mary got the milk there. John moved to the bedroom. Sandra went back to the kitchen. Mary travelled to the hallway. John got the football there. John went to the hallway. John put down the football. Mary went to the garden."); the question module encodes the words w_1 ... w_T of "Where is the football?"; the episodic memory module makes multiple attention passes over the facts (episode gates e_1 ... e_8, producing memories m_1 and m_2); and the answer module generates "hallway <EOS>".
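As a rough illustration of the episodic memory idea, here is a minimal numpy sketch, assuming plain dot-product attention and a single linear memory update in place of the attention-gated GRUs the actual DMN uses; all weights are random stand-ins:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def episodic_memory(facts, question, n_passes=2, rng=np.random.default_rng(0)):
    """Simplified sketch of a DMN-style episodic memory module.

    facts:    (n_facts, d) encoded sentences from the input module
    question: (d,) encoding from the question module
    The real DMN scores facts with a gating network and summarizes each
    pass with an attention-gated GRU; dot-product attention and a linear
    update serve as stand-ins here.
    """
    d = question.shape[0]
    W = rng.standard_normal((d, 2 * d)) * 0.1  # stand-in memory-update weights
    memory = question.copy()
    for _ in range(n_passes):
        # Attend over facts, conditioned on both question and current memory.
        gates = softmax(facts @ (question + memory))
        episode = gates @ facts                 # (d,) summary of this pass
        memory = np.tanh(W @ np.concatenate([memory, episode]))
    return memory  # fed to the answer module

facts = np.random.default_rng(1).standard_normal((8, 16))
q = np.random.default_rng(2).standard_normal(16)
print(episodic_memory(facts, q).shape)  # (16,)
```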

8 Analysis of Attention for Sentiment. Attention becomes sharper when two passes are allowed; the slide shows examples that are classified incorrectly with just one pass.

9 Modularization Allows for Different Inputs. (a) Text question answering: the input module reads "John moved to the garden. John got the apple there. John moved to the kitchen. Sandra picked up the milk there. John dropped the apple. John moved to the office."; question: "Where is the apple?"; answer: "kitchen". (b) Visual question answering: the input module reads an image; question: "What kind of tree is in the background?"; answer: "palm". Dynamic Memory Networks for Visual and Textual Question Answering, Caiming Xiong, Stephen Merity, Richard Socher.

10 Attention Visualization. Example VQA pairs: "What is this sculpture made out of?" (metal); "What color are the bananas?" (green); "What is the pattern on the cat's fur on its tail?" (stripes); "What is the boy holding?" (surfboard); "Did the player hit the ball?" (yes). [Figure: qualitative examples of the attention the episodic memory module places on image regions for VQA.]


12 Obstacle 2: questions have input-independent representations, but interdependence between question and document is needed for a comprehensive QA model. Dynamic Coattention Networks for Question Answering, by Caiming Xiong, Victor Zhong, Richard Socher (ICLR 2017). [Architecture figure] A document encoder (reading, e.g., "The weight of boilers and condensers generally makes the power-to-weight... However, most electric power is generated using steam turbine plants, so that indirectly the world's industry is...") and a question encoder (reading "What plants create most electric power?") feed a coattention encoder, and a dynamic pointer decoder predicts the answer span: start index 49, end index 51, i.e. "steam turbine plants".

13 Coattention Encoder. [Figure] The document (m+1 positions, including a sentinel) and the question (n+1 positions) are each encoded with a bi-LSTM. An affinity matrix between the two encodings is normalized both ways into attention weights A^D and A^Q; products with the encodings give the question-side summaries C^Q and, after a second product and concatenation, the coattention context C^D. Each document state concatenated with its coattention context is passed through a final bi-LSTM to produce the fused representations U (states u_t).
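A minimal numpy sketch of the coattention step, assuming the document and question have already been encoded; the final fusion bi-LSTM is left out, so the function simply returns its input [D; C^D]:

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def coattention(D, Q):
    """Coattention between document and question encodings.

    D: (d, m) document encoding, Q: (d, n) question encoding.
    Follows the DCN recipe: affinity L = D^T Q, attention both ways,
    question summaries C_Q, then the coattention context C_D.
    """
    L = D.T @ Q                      # (m, n) affinity scores
    A_Q = softmax(L, axis=0)         # attention over document, per question word
    A_D = softmax(L.T, axis=0)       # attention over question, per document word
    C_Q = D @ A_Q                    # (d, n) document summaries per question word
    C_D = np.vstack([Q, C_Q]) @ A_D  # (2d, m) coattention context
    return np.vstack([D, C_D])       # (3d, m): input to the fusion bi-LSTM

D = np.random.default_rng(0).standard_normal((16, 40))  # 40 document words
Q = np.random.default_rng(1).standard_normal((16, 5))   # 5 question words
print(coattention(D, Q).shape)  # (48, 40)
```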

14 Dynamic Decoder. [Figure] An LSTM carries state h_i between iterations; at each iteration, Highway Maxout Networks (HMNs) score every position of U conditioned on h_i and the representations u_{s_{i-1}}, u_{e_{i-1}} of the previous start and end estimates, and an argmax picks the new span. In the example the decoder converges to start index s_i = 49 and end index e_i = 51, the span "steam turbine plants", shown over positions u_48 ... u_51.
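The decoder's iterative argmax loop can be sketched as follows; the scoring functions below are random linear stand-ins for the trained Highway Maxout Networks, and the LSTM state update between iterations is omitted:

```python
import numpy as np

def dynamic_decode(U, score_start, score_end, n_iters=4):
    """Iteratively re-estimate an answer span (s, e) over coattention states U.

    U: (m, k), one fused vector per document position.
    score_start/score_end: callables (U, u_s, u_e) -> (m,) scores,
    standing in for the DCN's Highway Maxout Networks.
    """
    s, e = 0, U.shape[0] - 1          # arbitrary initial span
    for _ in range(n_iters):
        s_new = int(np.argmax(score_start(U, U[s], U[e])))
        e_new = int(np.argmax(score_end(U, U[s_new], U[e])))
        if (s_new, e_new) == (s, e):  # converged: estimates stopped moving
            break
        s, e = s_new, e_new
    return s, e

rng = np.random.default_rng(0)
U = rng.standard_normal((60, 8))
W1, W2 = rng.standard_normal((2, 3 * 8))  # random stand-in scorers

def features(U, us, ue):
    # Each position scored together with the current span estimates.
    return np.concatenate([U, np.tile(us, (len(U), 1)), np.tile(ue, (len(U), 1))], axis=1)

start_fn = lambda U, us, ue: features(U, us, ue) @ W1
end_fn = lambda U, us, ue: features(U, us, ue) @ W2
print(dynamic_decode(U, start_fn, end_fn))
```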

15 Stanford Question Answering Dataset

16 Results on the SQuAD Competition (dev/test EM and F1). Ensemble: DCN (ours), followed by entries from Microsoft Research Asia, the Allen Institute, Singapore Management University, and Google NYC. Single model: DCN (ours), followed by Microsoft Research Asia, Google NYC, Singapore Management University, Carnegie Mellon University, Dynamic Chunk Reader (Yu et al., 2016), and Match-LSTM (Wang & Jiang, 2016), with the baseline and human performance from Rajpurkar et al. (2016). Results are as of the ICLR submission; see the SQuAD leaderboard for the latest.

17 Demo

18 Obstacle 3: RNNs are slow. RNNs are the basic building block of deep NLP. Idea: take the best, parallelizable parts of RNNs and CNNs. Quasi-Recurrent Neural Networks, by James Bradbury, Stephen Merity, Caiming Xiong & Richard Socher (ICLR 2017).

19 Quasi-Recurrent Neural Network. Layer comparison: an LSTM stacks LSTM/linear layers; a CNN stacks convolution and max-pool layers; a QRNN stacks convolution and fo-pool layers. Convolutions give parallelism across time:
z_t = tanh(W_z^1 x_{t-1} + W_z^2 x_t)
f_t = σ(W_f^1 x_{t-1} + W_f^2 x_t)
o_t = σ(W_o^1 x_{t-1} + W_o^2 x_t)
or, over the whole sequence, Z = tanh(W_z ∗ X), F = σ(W_f ∗ X), O = σ(W_o ∗ X). An element-wise gated recurrence gives parallelism across channels:
h_t = f_t ⊙ h_{t-1} + (1 − f_t) ⊙ z_t
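The slide's equations translate almost directly into code. A minimal numpy sketch with filter width 2 and the f-pooling recurrence shown above (weights are random stand-ins; a full QRNN layer would also compute and apply the output gate O):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qrnn_layer(X, Wz1, Wz2, Wf1, Wf2):
    """Sketch of a QRNN layer with filter width 2 and f-pooling.

    X: (T, d_in) input sequence. The convolution over (x_{t-1}, x_t) is
    computed for all timesteps at once (parallel across time); only the
    cheap element-wise gated recurrence below runs sequentially.
    """
    Xprev = np.vstack([np.zeros_like(X[:1]), X[:-1]])  # x_{t-1}, zero-padded
    Z = np.tanh(Xprev @ Wz1 + X @ Wz2)                 # (T, d_out) candidates
    F = sigmoid(Xprev @ Wf1 + X @ Wf2)                 # (T, d_out) forget gates
    H = np.zeros_like(Z)
    h = np.zeros(Z.shape[1])
    for t in range(len(X)):          # h_t = f_t * h_{t-1} + (1 - f_t) * z_t
        h = F[t] * h + (1.0 - F[t]) * Z[t]
        H[t] = h
    return H

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))
Ws = [rng.standard_normal((4, 6)) * 0.5 for _ in range(4)]
print(qrnn_layer(X, *Ws).shape)  # (10, 6)
```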

20 QRNNs for Language Modeling.
Better: Penn Treebank perplexity (validation/test) for medium models: LSTM (Zaremba et al., 2014; 20M parameters), Variational LSTM (Gal & Ghahramani, 2016; 20M), LSTM with CharCNN embeddings (Kim et al., 2016; 19M; 78.9), Zoneout + Variational LSTM (Merity et al., 2016; 20M); our models: LSTM (20M), QRNN (18M), QRNN + zoneout (p = 0.1) (18M).
Faster: speedup over a cuDNN LSTM across batch sizes and sequence lengths, from about 1.3x at large batch sizes and short sequences up to 16.9x at small batch sizes and long sequences.

21 QRNNs for Sentiment Analysis.
Better and faster than LSTMs, and more interpretable. Comparison (time per epoch and test accuracy): BSVM-bi (Wang & Manning, 2012); 2-layer sequential BoW CNN (Johnson & Zhang, 2014; 92.3%); ensemble of RNNs and NB-SVM (Mesnil et al., 2014); 2-layer LSTM (Longpre et al., 2016; 87.6%); residual 2-layer bi-LSTM (Longpre et al., 2016; 90.1%); our models: a deeply connected 4-layer LSTM (cuDNN-optimized), a deeply connected 4-layer QRNN, and a deeply connected 4-layer QRNN with a larger filter width k.
Interpretability example: a review starts out positive; at token 117: "not exactly a bad story"; at token 158: "I recommend this movie to everyone, even if you've never played the game".

22 Obstacle 4: it is hard to generate long language sequences that make sense. A Deep Reinforced Model for Abstractive Summarization, by Romain Paulus, Caiming Xiong, and Richard Socher. Two necessary new ingredients: attention during generation and reinforcement learning.

23 Ingredient 1: Intra-decoder Attention

24 Ingredient 2: Reinforcement Learning. Instead of the usual word-by-word maximum-likelihood (teacher forcing) objective:

25 Ingredient 2: Reinforcement Learning. Global learning of the entire summary with a sequence-level reward:
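The formulas on slides 24 and 25 are not legible here; for reference, the training objectives from the paper are, as best we can reconstruct them (the self-critical formulation follows Rennie et al., 2016, with ROUGE as the reward):

```latex
% Maximum-likelihood (teacher forcing) loss over ground-truth tokens y^*:
L_{\mathrm{ml}} = -\sum_{t=1}^{n} \log p\!\left(y_t^{*} \mid y_1^{*},\dots,y_{t-1}^{*},\, x\right)

% Self-critical policy-gradient loss: sampled summary y^s, greedy baseline \hat{y},
% sequence-level reward r (e.g. ROUGE):
L_{\mathrm{rl}} = \left(r(\hat{y}) - r(y^{s})\right)\sum_{t=1}^{n} \log p\!\left(y_t^{s} \mid y_1^{s},\dots,y_{t-1}^{s},\, x\right)

% Mixed training objective:
L = \gamma\, L_{\mathrm{rl}} + (1-\gamma)\, L_{\mathrm{ml}}
```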

26 Summarization Results

27 Summarization Results

28 Comprehensive Question Answering: a framework for tackling the limits of deep NLP. [Recap figures: the VQA example "What color are the bananas?" (answer: green); the coattention encoder; the QRNN convolution/fo-pool stack.] We're looking for researchers and research engineers.
