Random Coattention Forest for Question Answering

Jheng-Hao Chen, Stanford University, jhenghao@stanford.edu
Ting-Po Lee, Stanford University, tingpo@stanford.edu
Yi-Chun Chen, Stanford University, yichunc@stanford.edu

CodaLab ID: tingpo. CodaLab Group: cs224n-tingpo_jhenghao_yichun

Abstract

We build a Random Coattention Forest Network that achieved a 55% F1 score and a 41% EM score on the online test set of the SQuAD question answering dataset. We use a coattention encoder and an LSTM classification decoder to build our basic encoder-decoder pair, and train multiple loosely correlated encoder-decoder pairs in parallel to form a random-forest-style model. Our implementation is efficient to train, which allowed us to train multiple encoder-decoder pairs and run many experiments to tune hyperparameters.

1 Introduction

Teaching machines to read natural language documents and answer corresponding questions is a central, yet unsolved, goal of NLP. Such reading comprehension and question answering abilities have a variety of applications, e.g. information retrieval from commercial or legal contracts, and dialog systems and chatbots designed to simulate human conversation.

Generic neural sequence-to-sequence models suffer from two limitations in this setting: contexts are much longer than questions yet both are equally important, and the model must produce different outputs from a fixed context given different questions. These limitations make it hard to apply machine-translation-like models directly to question answering.

Given a context paragraph P = {p_1, p_2, ..., p_m} of length m and a question Q = {q_1, q_2, ..., q_n} of length n, the answer {p_{j_1}, p_{j_2}, ..., p_{j_l}} is a contiguous part of the context P with length l <= m. The problem can be formulated as: given a context-question pair (p, q), learn a function that predicts the tuple A = {a_s, a_e}, where a_s and a_e indicate the starting and ending positions of the answer in the context paragraph p.

The Stanford Question Answering Dataset (SQuAD) is the dataset we train and evaluate our model on. SQuAD comprises context paragraphs, extracted from a set of Wikipedia articles, and around 100K question-answer pairs. The ground truth answers are generated by humans who highlight the part of the context paragraph that answers the given question. SQuAD is a challenging testbed for evaluating machine comprehension because the answers do not come from a small set of candidates and they have variable lengths.

2 Random Coattention Forest Networks

Our general approach is based on a conditional recurrent language model. As a human reader would, we encode the question and paragraph into a continuous representation with an RNN-based approach, using GloVe as our distributed word representation. A coattention network is applied to learn and encode context-question interactions. An RNN-based decoder is then used to predict the probability distributions of the starting and ending positions over the context paragraph. The encoder-decoder architecture is visualized in Fig. 1. Finally, multiple encoder-decoder pairs are combined to construct a random-forest-style predictor that produces the predicted answer tuple Â = {â_s, â_e}.
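To make the answer-tuple prediction concrete, the following minimal sketch shows how a predicted pair (â_s, â_e) maps back to a contiguous answer span in a tokenized context. The whitespace tokenizer and the fixed indices are illustrative assumptions, not part of our preprocessing pipeline; the example text is the one analyzed later in Fig. 8.

```python
# Minimal sketch of the (p, q) -> (a_s, a_e) span formulation.
# Tokenization and indices here are illustrative assumptions only.

def extract_answer(context_tokens, a_s, a_e):
    """Return the answer string for a predicted (start, end) index pair."""
    return " ".join(context_tokens[a_s:a_e + 1])

context = ("Polish civilian deaths are estimated at between "
           "150,000 and 200,000 .")
question = "What is the estimated death toll for Polish civilians ?"

p = context.split()   # P = {p_1, ..., p_m}
q = question.split()  # Q = {q_1, ..., q_n}

# Suppose a model predicts a_s = 6 and a_e = 9 (0-indexed); the answer is
# the contiguous span of context tokens between those positions.
a_s, a_e = 6, 9
print(extract_answer(p, a_s, a_e))  # -> "between 150,000 and 200,000"
```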

Figure 1: System Architecture.

2.1 Encoder Model

2.1.1 Sequence Attention Mix Encoder

Context and question word embeddings are first passed through two separate one-directional LSTMs to produce the context and question representations H_P and H_Q. We then build an attention mechanism by multiplying H_P and H_Q to encode context-question interactions:

H_P = LSTM(P) = [h_1^P, ..., h_m^P]
H_Q = LSTM(Q) = [h_1^Q, ..., h_n^Q]
A = softmax(H_Q^T H_P)
C_P = H_Q A

where H_P ∈ R^{d×p} and H_Q ∈ R^{d×q} (p = m and q = n denote the context and question lengths), and d is the output dimension of the LSTMs above. The matrix A holds the attention weights of the context over the question, and the attention-enhanced context representation C_P is obtained by applying the weights A to the question representation H_Q. Finally, the context vector C_P and the original context representation H_P are concatenated and passed through a single neural layer to create the encoded output P':

P' = LSTM(W [C_P, H_P] + b)

where W ∈ R^{d×2d} is the linear weight matrix that mixes C_P and H_P.
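A minimal numpy sketch of the attention algebra above, with random stand-ins for the LSTM outputs H_P and H_Q and the final LSTM omitted. The softmax normalization axis is our assumption (the text does not state it); this is an illustrative sketch, not the actual implementation.

```python
import numpy as np

def softmax(x, axis=0):
    """Softmax along one axis, with the usual max-shift for stability."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d, p, q = 8, 30, 10                      # hidden size, context/question lengths
rng = np.random.default_rng(0)

# Stand-ins for the LSTM outputs H_P in R^{d x p} and H_Q in R^{d x q}.
H_P = rng.standard_normal((d, p))
H_Q = rng.standard_normal((d, q))

# A = softmax(H_Q^T H_P): attention of each context position over question words.
A = softmax(H_Q.T @ H_P, axis=0)         # shape (q, p); columns sum to 1
# C_P = H_Q A: attention-enhanced context representation.
C_P = H_Q @ A                            # shape (d, p)

# Mixing layer W [C_P, H_P] + b that feeds the final LSTM (the LSTM is omitted).
W = rng.standard_normal((d, 2 * d))
b = rng.standard_normal((d, 1))
mixed = W @ np.concatenate([C_P, H_P], axis=0) + b   # shape (d, p)
print(A.shape, C_P.shape, mixed.shape)
```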

Figure 2: Attention Encoder.

2.1.2 Coattention Encoder

The coattention encoder is described in [2]; here we implement a slightly different version. As in the sequence attention mix encoder, context-question interactions are captured through matrix multiplications, but the coattention encoder enforces richer interactions through more carefully designed multiplications, operations and concatenations. First, we perform a matrix multiplication and then apply column- and row-wise softmax to the product of the context and question hidden states H_P and H_Q:

A_P = softmax(H_Q^T H_P, dim = 1)
A_Q = softmax(H_Q^T H_P, dim = 2)

where A_P ∈ R^{q×p} and A_Q ∈ R^{q×p} are intermediate matrices of attention weights. We then compute the question attention vector and concatenate it with the original question hidden states H_Q to create a new context vector C_P:

C_Q = H_P A_Q^T
C_P = [H_Q, C_Q] A_P

where C_Q ∈ R^{d×q} and C_P ∈ R^{2d×p} are the context vectors for the question and the context, respectively. Finally, the concatenated vector is passed to a one-directional LSTM to compute the encoded output P', which has the same dimension as the original hidden state H_P:

P' = LSTM(W [H_P, C_P] + b)

Figure 2 shows the sequence attention mix encoder and the coattention encoder described above.
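The coattention equations can likewise be sketched in numpy. As before, the LSTMs are stubbed with random matrices, and the mapping of the "dim = 1" / "dim = 2" softmax arguments onto numpy axes is our reading of the notation rather than a detail taken from the implementation.

```python
import numpy as np

def softmax(x, axis):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d, p, q = 8, 30, 10
rng = np.random.default_rng(1)
H_P = rng.standard_normal((d, p))   # context hidden states
H_Q = rng.standard_normal((d, q))   # question hidden states

L = H_Q.T @ H_P                     # affinity matrix, shape (q, p)

# Column- vs. row-wise softmax; the axis assignment is an assumption.
A_P = softmax(L, axis=0)            # per context position, weights over question words
A_Q = softmax(L, axis=1)            # per question position, weights over context words

C_Q = H_P @ A_Q.T                                  # question context vector, (d, q)
C_P = np.concatenate([H_Q, C_Q], axis=0) @ A_P     # context vector, (2d, p)

# Mixing layer feeding the final one-directional LSTM (LSTM omitted here).
W = rng.standard_normal((d, 3 * d))
b = rng.standard_normal((d, 1))
pre_lstm = W @ np.concatenate([H_P, C_P], axis=0) + b   # (d, p)
print(A_P.shape, A_Q.shape, C_Q.shape, C_P.shape, pre_lstm.shape)
```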

2.2 LSTM Classification Decoder

The encoder output P' is used to predict the probability distribution of the starting position, f_as(i | P'), where i indicates a position in P'. P' is then passed through a one-directional LSTM to produce a transformed encoding P'', and the probability distribution of the ending position, f_ae(j | P''), is computed accordingly, where j is a position in P'':

f_as(i | P') = softmax(W_1 P')
P'' = LSTM(P')
f_ae(j | P'') = softmax(W_2 P'')
â_s = argmax_i f_as(i | P')
â_e = argmax_j f_ae(j | P'')

where W_1 ∈ R^{1×d}, W_2 ∈ R^{1×d}, and v_start ∈ R^{1×p}. The decoder operation is visualized in Fig. 3.

Figure 3: LSTM Classification Decoder.

2.3 Random Forest Network

Our implementation of the coattention encoder is deliberately weaker than the version described in [2]. This design choice makes the encoder-decoder pairs better building blocks for a random-forest-style structure: random forest predictors are formed by ensembling several loosely correlated weak predictors. Our coattention encoder and classification decoder use one-directional LSTMs, which gives better computational efficiency. It also helps decrease model complexity and prevent overfitting, which, as our experimental results later show, is a major obstacle to generalizing our model.

During the training phase, all encoder-decoder pairs are given the same input; however, each pair effectively sees different input due to random dropout, and each pair is given a different random start (initialization). These two sources of randomness make our encoder-decoder pairs only loosely correlated with each other, which is ideal for building random forest networks.
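The decoding and ensembling steps can be sketched as follows. The encoders are stubbed with random encodings P', the LSTM producing P'' is replaced by a tanh stand-in, and the averaging of the per-pair start/end distributions is our assumption: the text says the pairs are combined into a random-forest-style predictor but does not spell out the combination rule.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

d, p = 8, 30
rng = np.random.default_rng(2)

def decode(P_prime, W1, W2, lstm=lambda x: np.tanh(x)):
    """Classification decoder sketch: start/end distributions over positions.

    `lstm` is a stand-in for the one-directional LSTM mapping P' to P''.
    """
    f_start = softmax((W1 @ P_prime).ravel())   # f_as(i | P'), length p
    P_dprime = lstm(P_prime)                    # P''
    f_end = softmax((W2 @ P_dprime).ravel())    # f_ae(j | P''), length p
    return f_start, f_end

# Three loosely correlated encoder-decoder pairs with random stand-in encodings.
forest = []
for k in range(3):
    P_prime = rng.standard_normal((d, p))
    W1 = rng.standard_normal((1, d))
    W2 = rng.standard_normal((1, d))
    forest.append(decode(P_prime, W1, W2))

# Assumed combination rule: average the distributions, then take the argmax.
f_start_avg = np.mean([f for f, _ in forest], axis=0)
f_end_avg = np.mean([f for _, f in forest], axis=0)
a_s_hat = int(np.argmax(f_start_avg))
a_e_hat = int(np.argmax(f_end_avg))
print(a_s_hat, a_e_hat)
```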

3 Experiments

Our model was trained and tested on the SQuAD dataset. The 105K question-answer pairs in SQuAD were randomly divided into four parts: training (81K), validation (4K), development (10K) and test (10K). The test set was never seen by the model during training and was used for final evaluation only. The validation set was used to tune hyperparameters, and the development set was used to estimate model performance during training. The model was trained with dropout and exponential learning rate decay using the Adam optimizer.

3.1 Evaluation Metrics

We use exact match (EM) and F1 score to evaluate our experimental results. F1 loosely measures the average overlap between the prediction and the ground truth answer: we treat the prediction and ground truth as bags of tokens and compute their F1, take the maximum F1 over all ground truth answers for a given question, and then average over all questions. Exact match measures the percentage of predictions that match one of the ground truth answers exactly.

3.2 Performance

Our experiment results are shown in Table 1. The Random Coattention Forest we implemented consists of 3 Coattention Networks. The benefit of ensembling loosely correlated coattention networks is clear: a nearly 20% boost in F1 and EM scores is observed compared with our implementation of a single Coattention Network. This confirms that the correlation between different encoder-decoder pairs is small, as designed, and that weak predictors are well suited to building a random forest network.

              Single Coattention Network        Random Coattention Forest
              F1        EM                      F1        EM
Training      69.7      57.0                    94.4      90.0
Validation    45.2      35.0                    65.3      50.0
Development   31.2      21.7                    53.9      39.7
Test          31.3      20.5                    54.6      41.0

Table 1: Experiment results of a single Coattention Network and the Random Coattention Forest with 3 networks.

It took around 120 seconds to train one full epoch (81K training samples) on one Tesla M60 GPU with 8 GB of memory. This efficiency is due to our design choice of a weaker coattention encoder and a simple decoder. Training one epoch of our Random Coattention Forest with 3 encoder-decoder networks took around 480 seconds, which is still desirable computational performance.
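A minimal sketch of the token-level F1 and EM metrics described in Section 3.1, taking the maximum over ground truth answers per question and averaging over questions. The answer normalization performed by the official SQuAD evaluation script (lowercasing, stripping punctuation and articles) is omitted here, and the example data is hypothetical.

```python
from collections import Counter

def f1_score(prediction, ground_truth):
    """Token-level F1 between a predicted and a ground-truth answer string."""
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction, ground_truth):
    return float(prediction == ground_truth)

def evaluate(predictions, references):
    """`references` maps each question id to its list of ground-truth answers."""
    f1 = em = 0.0
    for qid, pred in predictions.items():
        f1 += max(f1_score(pred, gt) for gt in references[qid])
        em += max(exact_match(pred, gt) for gt in references[qid])
    n = len(predictions)
    return 100.0 * f1 / n, 100.0 * em / n

# Hypothetical example with two acceptable ground-truth answers.
preds = {"q1": "between 150,000 and 200,000"}
refs = {"q1": ["between 150,000 and 200,000", "150,000 and 200,000"]}
print(evaluate(preds, refs))   # -> (100.0, 100.0)
```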

Figure 8: A_P matrix visualization. The input context was "... expelled. Polish civilian deaths are estimated at between 150,000 and 200,000.", the input question was "What is the estimated death toll for Polish civilians?", and the corresponding answer was "between 150,000 and 200,000".

3.3 Coattention Mechanism

The coattention encoder captures context-question interactions through matrix multiplications, and Fig. 8 visualizes how this attention mechanism works. The figure shows a transposed A_P matrix, obtained by multiplying the context and question representations and applying softmax over context words; A_P activates in the context region just before the answer. Since A_P is derived from a context-question multiplication, when a word shared by the context and the question is encountered, the cosine distance between the two representations is small (the context and question word embeddings pass through different LSTMs, so they are not exactly equal) and their inner product is likely to be large. This produces a highlighted area in A_P. As the example in Fig. 8 shows, the answer span generally lies in the context near where the question words appear, so the highlighted area hints at where the true answer may reside and helps predict the answer positions.

3.4 Length Analysis

Figure 4: Performance for different context lengths.
Figure 5: Performance for different question lengths.
Figure 6: Performance for different answer lengths.
Figure 7: Performance for question types.

Figs. 4, 5 and 6 plot model performance with respect to the length of context paragraphs, questions and ground truth answers, respectively. The number of samples with question length < 4 is too small, so they are excluded from the analysis. In general, F1 and EM performance gradually decrease as the length of the context paragraph, question or ground truth answer increases. For contexts longer than 250 characters, the F1 score of our model drops to around 40% and the exact match score to around 20%, compared with roughly 57% F1 and 35% exact match for shorter contexts. For questions longer than 23 characters, the F1 score drops to around 37% and the exact match score to around 10%, compared with roughly 55% F1 and 40% exact match for shorter questions. It is worth pointing out that as the length of the ground truth answer grows, F1 and EM drop at different rates: while F1 decreases gradually, EM falls below 10% once the answer length exceeds 24. As the target answer grows longer, our model is more likely to make a mistake when predicting the starting and ending points, hence the sharp drop in EM; that F1 drops much more slowly shows our model can still make a reasonable prediction of the answer range.

3.5 Question Types Analysis

We analyze the performance of our model on different types of questions by splitting them into groups: how, what, when, where, who, and why. These question words correspond to different types of answers. According to the F1 and EM scores on the development set, our model performs best on "when" questions, perhaps because temporal expressions are much more distinguishable in the context. "How" questions are the second best performing type; this group includes questions like "how much" or "how many", which generally refer to numbers and are therefore easier to answer. The "why" questions have average F1 performance but the worst EM performance, which is a direct result of the average answer length for each question type.
The average answer lengths for the different question types are: 6.75 for "why", 3.04 for "what", 2.90 for "where", 2.59 for "who", 2.52 for "how" and 2.31 for "when". As we have seen in Fig. 6, the EM score drops sharply with answer length, so it is no surprise that "why" questions have the worst EM performance.

4 Conclusion

We built a Random Coattention Forest model that solves the question answering problem to a satisfactory level. Because the encoder-decoder pairs are only loosely correlated, ensembling multiple pairs gave a significant boost in F1 and EM performance, and our design choice of weaker encoders and decoders made a random-forest-style model feasible both theoretically and computationally.

Due to limited computing power, we did not experiment with larger numbers of ensembled encoder-decoder pairs or larger state sizes, which might boost model performance further; we leave these as future directions. It is also worth exploring the trade-off between encoder/decoder complexity and the performance of the random-forest-style setting: although such settings prefer weaker individual predictors, a more complex encoder/decoder design can be expected to perform better individually.

We observed a 40% gap between the F1 scores on the training and test sets, indicating room for improvement if we can further reduce overfitting. A more finely tuned hyperparameter set is one possible solution: the dropout rate, the learning rate decay rate, and the number of encoder-decoder pairs in the random forest network can all be tuned for better regularization. Experimental design techniques such as Kriging may provide a systematic way to search the parameter space, and given the fast-training nature of our model, such a systematic hyperparameter search is feasible.

References

[1] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[2] Shuohang Wang and Jing Jiang. Machine comprehension using Match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905, 2016.
[3] Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604, 2016.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2016.
[5] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.
[6] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), 2016.