arXiv: v1 [cs.CL] 21 Nov 2017


Cross Temporal Recurrent Networks for Ranking Question Answer Pairs

Yi Tay (1), Luu Anh Tuan (2) and Siu Cheung Hui (3)
(1, 3) Nanyang Technological University, School of Computer Science and Engineering, Singapore
(2) Institute for Infocomm Research, Singapore

Copyright (c) 2018, Association for the Advancement of Artificial Intelligence. All rights reserved.

Abstract

Temporal gates play a significant role in modern recurrent-based neural encoders, enabling fine-grained control over recursive compositional operations over time. In recurrent models such as the long short-term memory (LSTM), temporal gates control the amount of information retained or discarded over time, not only playing an important role in influencing the learned representations but also serving as a protection against vanishing gradients. This paper explores the idea of learning temporal gates for sequence pairs (question and answer), jointly influencing the learned representations in a pairwise manner. In our approach, temporal gates are learned via 1D convolutional layers and then cross applied across question and answer for joint learning. Empirically, we show that this conceptually simple sharing of temporal gates can lead to competitive performance across multiple benchmarks. Intuitively, what our network achieves can be interpreted as learning representations of question and answer pairs that are aware of what each other is remembering or forgetting, i.e., pairwise temporal gating. Via extensive experiments, we show that our proposed model achieves state-of-the-art performance on two community-based QA datasets and competitive performance on one factoid-based QA dataset.

Introduction

Learning-to-rank for QA (question answering) is a long-standing problem in NLP and IR research which benefits a wide assortment of subtasks such as community-based question answering (CQA) and factoid-based question answering. The problem is mainly concerned with computing relevance scores between questions and prospective answers and subsequently ranking them. Across the rich history of answer or document retrieval, statistical approaches based on feature engineering are commonly adopted. These models are largely based on complex lexical and syntactic features (Wang and Manning 2010; Zhou et al. 2011; Wang, Ming, and Chua 2009) and a learning-to-rank classifier such as a Support Vector Machine (SVM) (Severyn et al. 2014; Filice et al. 2016).

Today, we see a shift towards neural question answering. Specifically, end-to-end deep neural networks are used both for automatically learning features and for scoring QA pairs. Popular neural encoders for neural question answering include long short-term memory (LSTM) networks (Hochreiter and Schmidhuber 1997) and convolutional neural networks (CNN). The key idea behind neural encoders is to learn to compose (Li et al. 2015), i.e., to compress an entire sentence into a single feature vector. While it is possible to encode questions and answers independently, and later merge them with multi-layer perceptrons (MLP) (Severyn and Moschitti 2015), tensor layers (Qiu and Huang 2015) or holographic layers (Tay et al. 2017), it would be desirable for question and answer pairs to benefit from information available from their partner. Many models have been proposed for doing so, adopting techniques for jointly learning question and answer representations. Many of these recent techniques adopt soft-attention matching (Yang et al. 2016; Santos et al. 2016; Zhang et al. 2017) to learn attention weights that are jointly influenced by both question and answer. Subsequently, the joint attention weights are applied to learn a final representation of question and answer. Performance results have shown that incorporating the interactions between QA pairs can indeed improve the performance of QA systems.

Temporal gates form the cornerstone of modern recurrent neural encoders such as long short-term memory (LSTM) or gated recurrent units (GRU), serving as one of the key mitigation strategies against vanishing gradients. In these models, temporal gates control the inner recursive loop along with the amount of information being discarded and retained at each time step, allowing fine-grained control over the semantic compositionality of learned representations. Our work explores the idea of jointly learning temporal gates for sequence pairs, aiming to learn fine-grained representations of QA pairs which benefit from information pertaining to what each other is remembering or forgetting. The key idea here is as follows: by exploiting information about the question, can we learn an optimal way to semantically compose the answer (and vice versa)?

Consider the example in Table 1, which highlights the importance of semantic compositionality. It would be easy for many soft-attention and matching-based models to classify this question and answer pair with a high relevance score due to the underlined words

("deep learning" and "Tensorflow"). However, the reason why this is a negative example lies in the intricate details, which can be effectively learned only via semantic compositionality (e.g., "without programming experience", "get into"). On the other hand, by exploiting joint temporal gates, our method learns to compose the sentence given the information about its partner. For instance, without the knowledge of the answer, which contains the phrase "advanced research", the question encoder will not know whether it should retain the word "beginner". Hence, joint learning of temporal gates can help our model learn to compose by influencing what it remembers and forgets. As a result, this additional knowledge can allow the words in boldface ("beginner", "advanced research") to be strongly retained in the final representation. This is similar in spirit to neural attention. However, our approach jointly learns to compose instead of learning to attend. The difference lies in the level at which representations are influenced.

Q: What deep learning framework should I learn if I want to get into deep learning? I am a beginner without programming experience. Want to build cool apps.
A: Tensorflow. It is a pretty solid and low level deep learning framework for advanced research.
Table 1: Example of a question and answer pair. Ground truth is negative.

Our Contributions

The main contributions of this work are:

- We introduce a new method for using temporal gates to synchronously and jointly learn the interactions between text pairs. In the context of question answering, we learn which information to remember or discard in the answer while being aware of the context of the question. To the best of our knowledge, this is the first work that performs question answer matching at the temporal gate level.
- We propose a novel neural architecture for QA ranking. Our proposed Cross Temporal Recurrent Network (CTRN) model is largely inspired by the recently introduced Quasi Recurrent Neural Network (QRNN) (Bradbury et al. 2016) and can be considered a natural extension of QRNN to sequence pairs. Our model takes after QRNN in the sense that gates are first learned (via 1D convolutional layers) and then applied to temporally adjust the representations. Hence, the facilitation of information flow can be interpreted as joint pairwise gating.
- Our proposed CTRN model achieves state-of-the-art performance on two community-based QA (CQA) datasets, namely the Yahoo Answers dataset and the QatarLiving dataset from SemEval-2016. Moreover, our model also achieves highly competitive performance on the TrecQA dataset for factoid-based QA. Experimental results show that CTRN outperforms models that utilize attention-based matching while being significantly more efficient. Experimental results also confirm that CTRN improves the underlying QRNN model.

Related Work

This section introduces prior work in the field of neural QA ranking. We also introduce the Quasi Recurrent Neural Network (QRNN) model, which lives at the heart of our proposed approach.

Neural Question Answer Ranking

Convolutional neural networks (CNN) (Hu et al. 2014; Severyn and Moschitti 2015) and recurrent models like the long short-term memory (LSTM) network (Wang and Nyberg 2015) are popular neural encoders for the QA ranking problem. Yu et al. (2014) proposed using a CNN for learning features and subsequently applying logistic regression to generate QA relevance scores. Subsequently, an end-to-end neural architecture based on CNN and bilinear matching was proposed in (Severyn and Moschitti 2015). In many recent works, the key innovation is the technique used to model the interaction between question and answer pairs. The CNN model introduced in (Severyn and Moschitti 2015) uses an MLP to compose the vectors of questions and answers. Qiu and Huang (2015) adopted tensor layers for richer modeling capabilities. Tay et al. (2017) proposed holographic memory layers. He, Gimpel, and Lin (2015) proposed the Multi-Perspective CNN, which matches multiple perspectives based on variations of pooling and convolution schemes.
Recent work has shown the effectiveness of learning QA embeddings in non-Euclidean spaces such as hyperbolic space (Tay, Luu, and Hui 2017). Models based on soft-attention such as Attentive Pooling networks (AP-BiLSTM and AP-CNN) (Santos et al. 2016), AI-CNN (Zhang et al. 2017) and aNMM (attention-based neural matching model) (Yang et al. 2016) have also been proposed. These models learn weighted representations of QA pairs using similarity matrix based attentions.

Quasi Recurrent Neural Network (QRNN)

In this section, we introduce the Quasi Recurrent Neural Network (Bradbury et al. 2016) and our key motivation for basing CTRN on QRNN. QRNN is a recurrent neural network that is actually a convolutional neural network in disguise. The key intuition behind QRNN is that it first learns temporal gates via 1D convolutions and subsequently applies them sequentially. In contrast, recurrent models learn these gates sequentially. Given an input sequence of $L$ vectors $w_t \in \mathbb{R}^m$, where $L$ is the maximum sequence length and $m$ is the dimension of the vectors, the QRNN model applies three 1D convolution operations as follows:

$$Z = \tanh(W_z * X), \qquad F = \sigma(W_f * X), \qquad O = \sigma(W_o * X) \tag{1}$$

where $X$ is the input sequence of $m$-dimensional vectors with sequence length $L$, $W_z, W_f, W_o \in \mathbb{R}^{k \times d \times m}$ are the parameters of QRNN and $*$ denotes a convolution across the temporal dimension. $k$ is the filter width and $d$ is the output dimension.

[Figure 1 (image not reproduced). Layers shown, bottom to top: question and answer inputs, shared embedding layer, shared projection layer, quasi-recurrent layer (Conv Z, Conv F and Conv O for each of q and a), temporal crossing between the two CTRN cells (forward operation), Hadamard product, mean pooling, MLP layer, softmax layer.]
Figure 1: Diagram of our proposed CTRN architecture. Gates (denoted Conv F and Conv O) are prelearned via convolutions and each CTRN cell (for q and a) incorporates the gates of its partner while learning representations. The base representations (to which gates are applied in CTRN) are denoted by Conv Z. Green denotes information flow from the question and blue denotes information flow from the answer.

Subsequently, the following equations describe the forward (recursive) operation of the QRNN cell:

$$c_t = f_t \odot c_{t-1} + (1 - f_t) \odot z_t, \qquad h_t = o_t \odot c_t$$

where $c_t$ is the cell state and $h_t$ is the hidden state. $f_t, o_t$ are the forget and output gates respectively at time step $t$. $\sigma$ is the sigmoid activation, which nonlinearly projects each element of its input to $[0, 1]$. $Z$ can be regarded as the convolved base representation, similar to what a traditional CNN model learns. $F$ and $O$ are then applied recursively to temporally adjust and influence the semantic compositionality of $Z$. This is what makes the model quasi-recurrent. The key difference between QRNN and recurrent models like LSTM is that gates are prelearned via convolution, while RNN models like LSTM learn their gates sequentially during the recursive forward operation. In short, the forward operation in QRNN is still applied sequentially but is comparatively much cheaper than in traditional LSTM cells, since gates are merely applied in the case of QRNN. As such, the parallelization of gate learning improves the speed of QRNN as compared to LSTM. In the original paper, QRNN required around 4 times less computation time than LSTM models while achieving similar or better performance. For the sake of brevity, we refer interested readers to (Bradbury et al. 2016) for more details.
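The "gates first, recurrence second" structure of Equation (1) and the forward operation can be made concrete with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`conv1d_time`, `qrnn_gates`, `qrnn_forward`), the left zero-padding that keeps the output at length L, the zero initial cell state and the random weights in the usage example are all choices made here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_time(X, W):
    """Convolve X of shape (L, m) with filters W of shape (k, d, m) over time.

    Assumption: X is left-padded with k-1 zero vectors, so the output has
    length L and step t only sees inputs at steps <= t.
    """
    k, d, m = W.shape
    L = X.shape[0]
    X_pad = np.vstack([np.zeros((k - 1, m)), X])
    out = np.zeros((L, d))
    for t in range(L):
        window = X_pad[t:t + k]                      # (k, m) slice ending at step t
        out[t] = np.einsum("kdm,km->d", W, window)   # sum over width and input dims
    return out

def qrnn_gates(X, W_z, W_f, W_o):
    """Equation (1): all gates are computed at once, with no recurrence."""
    Z = np.tanh(conv1d_time(X, W_z))
    F = sigmoid(conv1d_time(X, W_f))
    O = sigmoid(conv1d_time(X, W_o))
    return Z, F, O

def qrnn_forward(Z, F, O):
    """Sequential part: c_t = f_t*c_{t-1} + (1-f_t)*z_t and h_t = o_t*c_t."""
    L, d = Z.shape
    c = np.zeros(d)            # assumed zero initial cell state
    H = np.zeros((L, d))
    for t in range(L):
        c = F[t] * c + (1.0 - F[t]) * Z[t]
        H[t] = O[t] * c
    return H

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, m, d, k = 7, 5, 4, 2    # toy sizes, chosen only for illustration
    X = rng.normal(size=(L, m))
    W_z, W_f, W_o = (rng.normal(size=(k, d, m)) for _ in range(3))
    H = qrnn_forward(*qrnn_gates(X, W_z, W_f, W_o))
    print(H.shape)             # one d-dimensional hidden state per time step
```

Only `qrnn_forward` contains a loop over time, and it performs element-wise operations only; every matrix product sits in the convolutions, which do not depend on earlier hidden states.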
Inspired by the computational benefits of QRNN, we adopt it as our base model. We also notice an attractive property of QRNN. Because gates in QRNN models are prelearned, we can align temporal gates between two QRNNs easily. Conversely, considering the fact that questions and answers might not have similar sequence lengths, trying to sequentially align temporal gates in LSTM models can be extremely cumbersome and inefficient. More importantly, temporal gates of LSTM cells do not have global information, i.e., each step is only aware of the steps that precede it. On the other hand, temporal gates from QRNNs have global information about the entire sequence.

Our Proposed Approach

In this section, we describe our novel deep learning model layer by layer. The overall architecture of our model is illustrated in Figure 1. For notational convenience, the subscripts q and a denote whether a parameter belongs to the question or the answer respectively.

Embedding + Projection Layer

Our model accepts two sequences of indices (question and answer inputs) which are passed through an embedding layer (shared between the q and a inputs) that returns a sequence of $n$-dimensional vectors. In practice, we initialize and fix this layer with pretrained embeddings while connecting it to an $n \times m$ projection layer. As such, the output of the embedding + projection layer is a sequence of $m$-dimensional vectors. Note that this layer is shared between question and answer inputs.

Quasi-Recurrent Layer

The input to the quasi-recurrent layer is a sequence of $L$ vectors $w_t \in \mathbb{R}^m$, where $L$ is the maximum sequence length.

This layer applies three 1D convolution operations as described in Equation (1). The outputs of the quasi-recurrent layer are the matrices $\{Z_s, F_s, O_s\}$ where $s \in \{q, a\}$. Note that up to this point, the quasi-recurrent layer remains functionally identical to QRNN.

Lightweight Temporal Crossing (LTC)

In this section, we introduce our novel lightweight temporal crossing (LTC) mechanism, which lives at the heart of our CTRN model. Our approach extends the QRNN model by leveraging the fact that the gates $F_s, O_s$ are learned non-sequentially. The key idea is to leverage the information in $F_q, O_q$ for $Z_a$ and vice versa. This information flow is denoted by the green and blue arrows in Figure 1. The outputs of this layer are similar to those of the LSTM model, i.e., they are a sequence of hidden states $H \in \mathbb{R}^{L \times d}$, where $L$ is the sequence length and $d$ is the number of filters. At this layer, there are two CTRN cells, namely CTRN-Q and CTRN-A, for question and answer representations respectively.

[Figure 2 (image not reproduced).]
Figure 2: Diagram of a single CTRN-Q cell. Red lines are information flow from answer gates. $\odot$ denotes element-wise multiplication, $+$ denotes element-wise addition. tanh is the hyperbolic tangent function and $\sigma$ is the sigmoid function.

In this section, we use CTRN-Q as an example, but note that CTRN-A and CTRN-Q are functionally symmetrical. Figure 2 illustrates a single CTRN-Q cell. Each CTRN-Q cell contains two cell states, denoted $c^{(q)}$ and $c^{(q')}$, and two hidden states, denoted $h^{(q)}$ and $h^{(q')}$. As such, there are two learned representations in the CTRN-Q cell, denoted by $q$ and $q'$ respectively. The first representation is learned as per normal, i.e., by applying $F_q, O_q$ to $Z_q$. The second representation is learned by applying the partner gates $F_a, O_a$ to the question representation $Z_q$. The following equations depict the forward operation of the CTRN-Q cell.
$$c^{(q)}_t = f^q_t \odot c^{(q)}_{t-1} + (1 - f^q_t) \odot z^q_t, \qquad h^{(q)}_t = o^q_t \odot c^{(q)}_t$$
$$c^{(q')}_t = f^a_{t^*} \odot c^{(q')}_{t-1} + (1 - f^a_{t^*}) \odot z^q_t, \qquad h^{(q')}_t = o^a_{t^*} \odot c^{(q')}_t$$

where $f^n_t, o^n_t$ denote the forget and output gates and $z^n_t$ the base representation for text $n \in \{q, a\}$ at time step $t$. $t^*$ is an aligned time step between question and answer sequences, as the sequence lengths of question and answer might be different. For simplicity, we consider $t^* = \max(t_q, t_a) - \min(t_q, t_a)$. Similarly, the forward operation for CTRN-A is as follows:

$$c^{(a)}_t = f^a_t \odot c^{(a)}_{t-1} + (1 - f^a_t) \odot z^a_t, \qquad h^{(a)}_t = o^a_t \odot c^{(a)}_t$$
$$c^{(a')}_t = f^q_{t^*} \odot c^{(a')}_{t-1} + (1 - f^q_{t^*}) \odot z^a_t, \qquad h^{(a')}_t = o^q_{t^*} \odot c^{(a')}_t$$

Finally, to obtain a single representation for each question and answer, we simply apply the Hadamard product between the hidden states at each time step, i.e., $h^{(q)}_t = h^{(q)}_t \odot h^{(q')}_t$ and $h^{(a)}_t = h^{(a)}_t \odot h^{(a')}_t$. This enables joint representations of temporal gates, which form the crux of our LTC mechanism.

Why does this work? Notably, since gates are learned via parameterized convolutional layers, our learned gates ($F$ and $O$) contain not only local index-specific information but also global information about the entire text sequence. This is modeled by the parameters of the convolutional layers which produce $F$ and $O$. As such, it suffices to compose them index-wise, since the goal is to enable information flow between the temporal gates of question and answer. Our intuition here is to cross apply question and answer gates to both question and answer representations so as to enable gradients to flow across question and answer during backpropagation. Since the goal is to fuse and not to match, we empirically found that soft-attention alignment of gates yields no performance benefits over a simple index-wise alignment.

Temporal Mean Pooling Layer

The output of each CTRN cell is an array of hidden states $[h^s_1, h^s_2, \ldots, h^s_L]$. In this layer, we apply temporal mean pooling for both CTRN-Q and CTRN-A. The operation of this layer is a simple element-wise average of all output hidden vectors.

Dense Layers (MLP)

The inputs of this layer are two vectors, which are the final representations of question and answer respectively.
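As a recap of how these two final vectors come about, the CTRN-Q and CTRN-A forward operations, the Hadamard fusion and the temporal mean pooling can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' code: the function names are invented here, the initial cell state is assumed to be zero, and both sequences are assumed to be padded to the same length L so that partner gates are aligned index-wise (the simple alignment that the text reports as sufficient).

```python
import numpy as np

def gated_forward(Z, F, O):
    """Apply pre-learned gates F, O recursively to the base representation Z.

    All arrays have shape (L, d). Implements c_t = f_t*c_{t-1} + (1-f_t)*z_t
    and h_t = o_t*c_t with an assumed zero initial cell state.
    """
    c = np.zeros(Z.shape[1])
    H = np.zeros_like(Z)
    for t in range(Z.shape[0]):
        c = F[t] * c + (1.0 - F[t]) * Z[t]
        H[t] = O[t] * c
    return H

def ctrn_cell(Z_own, F_own, O_own, F_partner, O_partner):
    """One CTRN cell; CTRN-Q and CTRN-A differ only in which text is 'own'."""
    H_own = gated_forward(Z_own, F_own, O_own)             # h^(q): own gates
    H_cross = gated_forward(Z_own, F_partner, O_partner)   # h^(q'): partner gates
    return H_own * H_cross                                 # Hadamard fusion per step

def ctrn_pair(Zq, Fq, Oq, Za, Fa, Oa):
    """Return mean-pooled question and answer vectors, each of shape (d,)."""
    q_vec = ctrn_cell(Zq, Fq, Oq, Fa, Oa).mean(axis=0)     # CTRN-Q + mean pooling
    a_vec = ctrn_cell(Za, Fa, Oa, Fq, Oq).mean(axis=0)     # CTRN-A + mean pooling
    return q_vec, a_vec
```

Note that `ctrn_cell` introduces no weights of its own: it reuses the gate matrices already produced by the quasi-recurrent layer, which is consistent with the claim that the crossing adds no parameters.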
In this layer, we concatenate the two vectors and pass them through a series of fully-connected dense layers (or MLP). The number of layers is a hyperparameter to be tuned.

Softmax Layer and Optimization

The final output of the hidden layer is then passed through a 2-class softmax layer. The final score of each QA pair is described as follows:

$$s(q, a) = \mathrm{softmax}(W_f\, x + b_f) \tag{2}$$

where $x$ is the output of the last hidden layer, $W_f \in \mathbb{R}^{h \times 2}$ and $b_f \in \mathbb{R}^2$. $h$ is the size of the hidden layer. Our network minimizes the standard cross entropy loss as its training objective. The choice of a pointwise model is motivated by 1) ease of implementation and 2) previous work (Severyn and Moschitti 2015; Tay et al. 2017; Zhang et al. 2017). The loss function is defined as follows:

$$L = -\sum_{i=1}^{N} \left[ y_i \log s_i + (1 - y_i) \log(1 - s_i) \right] + \lambda \lVert \theta \rVert_2^2 \tag{3}$$

where $s$ is the output of the softmax layer. $\theta$ contains all the parameters of the network and $\lambda \lVert \theta \rVert_2^2$ is the L2 regularization. The parameters of the network are updated using the Adam optimizer (Kingma and Ba 2014).

Complexity Analysis

In this section, we study the memory complexity of our model to further justify the lightweight aspect of our LTC mechanism. First, our CTRN does not incur any parameter cost over the vanilla QRNN model ($3kdm$ for a single QRNN model). As such, the memory complexity and parameter size remain equal to QRNN. This is easy to see: no additional parameters are added since our model, from a computational graph perspective, simply adds connections between nodes. Next, we consider the runtime complexity of our model (forward pass). Let $d$ be the number of filters of the convolution layer and $L$ be the maximum sequence length. The computational complexity of a single QRNN cell is $O(dL)$, excluding the convolution operations used to generate $\{F, O, Z\}$. Though the number of operations is approximately doubled (due to cross applying gates), the complexity of a CTRN cell is still $O(dL)$, i.e., our model still runs in linear time, as compared to LSTM models with quadratic time complexity. Overall, our model, though seemingly more complicated, does not increase the parameter size and only incurs a slight increase in computational cost as compared to the already efficient QRNN model. We are also able to leverage the computational benefits of QRNN over the vanilla LSTM model. Table 2 shows a simple comparison of our proposed CTRN model against the standard LSTM and AP-BiLSTM models. We observe that QRNN and CTRN are much more parameter efficient than recurrent models, taking up only 58% of the parameter size of the vanilla LSTM model and being about four times smaller than AP-BiLSTM.

Model        Memory complexity            # Params
LSTM         4(md + d^2) + 2dh + h        1.79M
AP-BiLSTM    4(md + d^2) + 4d ...         ... M
QRNN         3kdm + 2dh + h               1.05M
CTRN         3kdm + 2dh + h               1.05M

Table 2: Memory complexity analysis with shared parameters for q and a. m is the size of the input embeddings. d is the number of filters and the dimensionality of the LSTM model. h is the size of the hidden layer. The complexity highlighted in boldface is used to compose q and a. # Params gives an estimate with d = 512, m = 300, h = 128 and k = 2. Word embedding parameters are excluded from the comparison.

Experiments

To ascertain the effectiveness of our proposed approach, we conduct experiments on three popular benchmark datasets.

Experimental Setup

This section describes the datasets used, the baselines compared and the evaluation metrics.

Datasets

We select three popular benchmark datasets, which are described as follows:

- YahooQA - Yahoo Answers is a CQA platform. This is a moderately large dataset containing 142,627 QA pairs which are obtained from the CQA platform. More specifically, preprocessing and testing splits (see footnote 1) are obtained from (Tay et al. 2017). In their setting, questions and answers that are not in the range of 5-50 tokens are filtered. Additionally, 4 negative samples are generated for each question by sampling from the top 1000 hits using Lucene search.
- QatarLiving - This is another CQA dataset, obtained from the popular SemEval-2016 Task 3 Subtask A (CQA). This is a real-world dataset obtained from the Qatar Living Forums. In this dataset, there are ten answers per thread (question), which are marked as Good, Potentially Useful or Bad. Following (Zhang et al. 2017), we treat Good as positive and anything else as negative labels.
- TrecQA - This is a popular QA ranking benchmark obtained from the TREC QA Tracks. QA pairs are generally short and factoid-based, consisting of trivia-like questions. In this dataset, there are two training sets, namely TRAIN and TRAIN-ALL. TRAIN consists of QA pairs that have been manually judged and annotated. TRAIN-ALL is an automatically judged dataset of QA pairs and contains a larger number of QA pairs. TRAIN-ALL, being a larger dataset, also contains more noise. Nevertheless, both datasets enable the comparison of all models with respect to the availability and volume of training samples.
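Returning briefly to the complexity analysis, the # Params estimates in Table 2 can be checked by evaluating the closed-form expressions at the stated setting (d = 512, m = 300, h = 128, k = 2). The short script below is only an arithmetic check; the function names are hypothetical, and the AP-BiLSTM row is left out because its expression is not fully legible in this copy.

```python
def lstm_params(m, d, h):
    """Table 2, LSTM row: 4(md + d^2) + 2dh + h."""
    return 4 * (m * d + d * d) + 2 * d * h + h

def qrnn_params(m, d, h, k):
    """Table 2, QRNN and CTRN rows: 3kdm + 2dh + h."""
    return 3 * k * d * m + 2 * d * h + h

m, d, h, k = 300, 512, 128, 2        # setting stated in the Table 2 caption
lstm = lstm_params(m, d, h)          # 1,794,176
qrnn = qrnn_params(m, d, h, k)       # 1,052,800 (identical for CTRN)

print(f"LSTM: {lstm / 1e6:.2f}M")          # LSTM: 1.79M
print(f"QRNN/CTRN: {qrnn / 1e6:.2f}M")     # QRNN/CTRN: 1.05M
print(f"ratio: {qrnn / lstm:.1%}")         # ratio: 58.7%
```

The ratio of roughly 58.7% agrees with the statement that QRNN and CTRN take up about 58% of the parameter size of the vanilla LSTM.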
The statistics of all datasets, i.e., training sets, development sets and testing sets, are given in Table 3.

                   CQA                   TrecQA
               YahooQA    QL         TRAIN    TRAIN-ALL
Train Qns      50.1K      4.8K
Dev Qns        6.2K
Test Qns       6.2K
Train Pairs    253K       36K        4.7K     53K
Dev Pairs      31.7K      2.4K       1.1K     1.1K
Test Pairs     31.7K      3.2K       1.5K     1.5K

Table 3: Statistics of datasets. QL denotes the QatarLiving dataset. TRAIN and TRAIN-ALL are two settings of the TrecQA dataset.

Footnote 1: Splits can be obtained at https://github.com/vanzytay/yahooqa_splits.

Evaluation Metrics

For each dataset, we adopt the evaluation metrics used in prior work. For YahooQA, we follow (Tay et al. 2017), which uses P@1 (Precision@1) and MRR (Mean Reciprocal Rank). For QatarLiving, we follow (Zhang et al. 2017) and evaluate on P@1 and MAP (Mean Average Precision). For TrecQA, we follow the experimental procedure in (Severyn and Moschitti 2015), using the official evaluation metrics of MAP and MRR. Since the evaluation

metrics are commonplace in ranking tasks, we omit any further details for the sake of brevity.

Implementation Details and Baselines

For our CTRN model, we tune the output dimension (number of filters) within [128, 1024] in multiples of 128. A single-layered CTRN and QRNN is used. The number of dense (MLP) layers is tuned from [1, 3] and the learning rate is tuned amongst $\{10^{-3}, 10^{-4}, 10^{-5}\}$. The batch size is tuned amongst {64, 128, 256, 512}. Dropout is set to 0.5, and L2 regularization is applied. Word embedding matrices are all non-trainable; the projection layer is learned instead. For the three datasets, we adopt dataset-specific baselines largely based on prior published works.

- YahooQA - We compare against multiple state-of-the-art models. Specifically, we compare our model with the vanilla LSTM, vanilla CNN, CNTN, NTN-LSTM and HD-LSTM. Since we use the same testing splits, we report the results directly from (Tay et al. 2017). Additionally, we include further baselines such as AP-BiLSTM and AP-CNN (Santos et al. 2016), which serve as representatives of soft-attention alignment based models. Cosine similarity with pairwise ranking is used as the metric for AP-BiLSTM and AP-CNN, following the original implementation. Models carrying a superscript marker are implemented by us. We initialize our model with pretrained GloVe embeddings (Pennington, Socher, and Manning 2014) of d = 300.
- QatarLiving - The key competitors on this dataset are the CNN-based ARC-I/II architectures (Hu et al. 2014), the Attentive Pooling CNN (Santos et al. 2016), Kelp (Filice et al. 2016), a feature engineering based SVM method, ConvKN (Barrón-Cedeño et al. 2016), a combination of convolutional tree kernels with CNN, and finally AI-CNN (Attentive Interactive CNN) (Zhang et al. 2017), a tensor-based attentive pooling neural model. We initialize with pretrained GloVe embeddings of d = 200, trained using the domain-specific unannotated corpus provided by the task.
- TrecQA - We compare against published works which include both traditional models and neural models. Moreover, we compare with models reported in (Tay et al. 2017) on the TRAIN and TRAIN-ALL datasets to observe the effect of different dataset sizes. The evaluation procedure follows (Severyn and Moschitti 2015) closely. We initialize the embedding layers with the same pretrained word embeddings of d = 50 as (Severyn and Moschitti 2015) for fair comparison against competitor approaches. These embeddings are trained with the Skipgram model using the Wikipedia and AQUAINT corpora. Four word overlap features are also concatenated before the dense layers, following (Severyn and Moschitti 2015).

We train our model for 25 epochs for TRAIN and 5 epochs for TRAIN-ALL and report the test score from the best performing model on the development set. Hyperparameters are also tuned on the development set. Early stopping is adopted, and training is terminated if the validation performance does not improve after 5 epochs.

Experimental Results

In this section, we report some observations pertaining to our empirical results.

[Table 4 (numeric entries not recoverable). Columns: Model, P@1, MRR. Rows: Random Guess, BM, CNN (φ), CNTN (φ), LSTM (φ), NTN-LSTM (φ), HD-LSTM (φ), AP-CNN, AP-BiLSTM, QRNN, CTRN (this paper).]
Table 4: Experimental results on YahooQA. Models are ranked by P@1. Models marked with φ are reported directly from (Tay et al. 2017), while the remaining marked models are our own implementations. Best result is in boldface and second best is underlined.

Experimental Results on YahooQA

Table 4 reports the experimental results on the YahooQA dataset. Firstly, we observe that our proposed CTRN achieves state-of-the-art performance on this dataset. Notably, we outperform HD-LSTM (Tay et al. 2017) by 4% in terms of P@1 and 2% in terms of MRR. CTRN also outperforms attention-based models such as AP-BiLSTM and AP-CNN (Santos et al. 2016) by a considerable margin of about 2%-3%. At this juncture, we make several observations about our proposed CTRN model. Firstly, this shows that our LTC mechanism is more effective than soft-attention matching on this dataset. Secondly, the merits of this mechanism can be further observed in the performance difference between QRNN and CTRN. Our proposed CTRN comfortably outperforms QRNN by 2%-3% in terms of P@1 and MRR.
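For readers less familiar with the two metrics quoted for YahooQA, a minimal sketch of P@1 and MRR over ranked candidate lists follows. The function names and the toy labels are illustrative assumptions, not part of the paper's evaluation code; each question is represented by the relevance labels of its candidate answers, sorted by descending model score.

```python
def rank_by_score(scores, labels):
    """Order one question's relevance labels by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def precision_at_1(ranked_labels):
    """Fraction of questions whose top-ranked answer is relevant (label 1)."""
    return sum(labels[0] for labels in ranked_labels) / len(ranked_labels)

def mean_reciprocal_rank(ranked_labels):
    """Average of 1/rank of the first relevant answer per question.

    A question with no relevant answer contributes 0 to the average.
    """
    total = 0.0
    for labels in ranked_labels:
        for rank, label in enumerate(labels, start=1):
            if label:
                total += 1.0 / rank
                break
    return total / len(ranked_labels)

# Two toy questions with five candidates each (1 positive, 4 negatives),
# mirroring the YahooQA setup of 4 sampled negatives per question.
ranked = [
    [1, 0, 0, 0, 0],   # correct answer ranked first
    [0, 0, 1, 0, 0],   # correct answer ranked third
]
print(precision_at_1(ranked))         # 0.5
print(mean_reciprocal_rank(ranked))   # (1 + 1/3) / 2
```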
Surprisingly, we see that a simple baseline QRNN performs quite well on this dataset, outperforming other complex models such as NTN-LSTM and HD-LSTM (Tay et al. 2017).

Experimental Results on QatarLiving

Table 5 reports our experimental results on the QatarLiving dataset. Our CTRN model outperforms AI-CNN (see footnote 2) by 2.5% in terms of P@1 while maintaining similar performance on MRR. The CTRN model also outperforms the baseline QRNN by 3% on P@1. Similar to the YahooQA dataset, we also found that the baseline QRNN performed surprisingly well, i.e., outperforming ConvKN and other CNN-based models such as ARC-I and ARC-II. Overall, our proposed approach achieves very competitive results on this dataset.

Footnote 2: For fair comparison, we compare against the reported results of AI-CNN that does not use handcrafted features.

[Table 5 (numeric entries not recoverable). Columns: Model, MAP. Rows: ARC-I CNN, ARC-II CNN, AP, Kelp, ConvKN, QRNN, AI-CNN, CTRN (this paper).]
Table 5: Experimental results on the QatarLiving dataset. Best result is in boldface and second best is underlined. CTRN outperforms the complex AI-CNN model.

[Table 6 (numeric entries not recoverable). Columns: Model, TRAIN (MAP, MRR), TRAIN-ALL (MAP, MRR). Rows: CNN + LR, CNN, CNTN, LSTM, MV-LSTM, NTN-LSTM, HD-LSTM, QRNN, CTRN (this paper).]
Table 6: Comparisons of various neural baselines on the TREC QA task on two dataset settings, TRAIN and TRAIN-ALL. Competitors (except QRNN and CTRN) are reported from (Tay et al. 2017). Best result is in boldface and second best is underlined.

Results on TrecQA

Table 6 reports the results on the TRAIN and TRAIN-ALL settings of the TrecQA task. CTRN achieves the top results compared to the multitude of neural baselines. Notably, the performance of CTRN is about 1%-5% better than QRNN. QRNN performs very competitively on the TRAIN-ALL setting but falls behind on the TRAIN setting. This might be because QRNN, with three 1D convolutional layers, overfits on the smaller dataset. However, CTRN performs well on the smaller TRAIN setting as well, which hints at a possible regularizing effect of the LTC mechanism. The performance of the vanilla QRNN model on the TRAIN-ALL setting is also surprisingly competitive, outperforming more complex models such as HD-LSTM and NTN-LSTM. Table 7 reports the results of the CTRN model against other published competitors. We can see that CTRN outperforms many complex neural architectures such as the aNMM model (Yang et al. 2016), HD-LSTM (Tay et al. 2017) and the MP-CNN model (He, Gimpel, and Lin 2015; Rao, He, and Lin 2016).

[Table 7 (numeric entries not recoverable). Columns: Model, MAP, MRR. Rows: Wang et al. (2007); Heilman and Smith (2010); Wang and Manning (2010); Yao (2013); Wang et al. (2015) - BM; S&M (2013); Yih et al. (2013); Yu et al. (2014) - CNN+LR; Wang et al. (2015) - biLSTM; S&M (2015) - CNN; Yang et al. (2016) - aNMM; Tay et al. (2017) - HD-LSTM; Bradbury et al. (2016) - QRNN+MLP; He and Lin (2016) - MP-CNN; CTRN (TRAIN); CTRN (TRAIN-ALL).]
Table 7: Performance comparisons of all published works on the TREC QA dataset. S&M is short for Severyn and Moschitti. Best result is in boldface and second best is underlined. The MP-CNN result is reported from (Rao, He, and Lin 2016), using the pointwise model for fair comparison. Marked entries denote results reported by this work.

Runtime Analysis

[Figure 3 (image not reproduced). Bar chart titled "Runtime Comparison"; vertical axis: seconds per epoch; models: QRNN, CTRN, LSTM, BiLSTM, AP-BiLSTM.]
Figure 3: Comparisons of runtime of all recurrent models on the TRAIN-ALL dataset with d = 800.

Figure 3 shows the runtime comparison for recurrent models on the TRAIN-ALL dataset. We observe that CTRN is a very scalable and efficient model. Notably, our CTRN model benefits from the training speed of the QRNN model, which is significantly faster than LSTM models. Moreover, CTRN does not significantly increase the runtime of the base QRNN, incurring only an additional 10s per epoch. We also achieve a runtime 4 times faster than vanilla LSTM models and 8 times faster than AP-BiLSTM.

Conclusion

We introduced a novel method for jointly learning to compose QA pairs. This is achieved by aligning temporal gates. We show that our lightweight temporal crossing (LTC) mechanism is an effective method of modeling interactions between QA pairs without incurring any parameter cost. Our CTRN model performs competitively on two CQA benchmarks and one factoid QA benchmark while being much faster than LSTM and AP-BiLSTM models.

Acknowledgements

The authors thank anonymous reviewers for their hard work and feedback.

References

Barrón-Cedeño, A.; Martino, G. D. S.; Joty, S. R.; Moschitti, A.; Al-Obaidli, F.; Romeo, S.; Tymoshenko, K.; and Uva, A. Convkn at semeval-2016 task 3: Answer and question selection for question answering on arabic and english fora. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17.
Bradbury, J.; Merity, S.; Xiong, C.; and Socher, R. Quasi-recurrent neural networks. CoRR abs/
Filice, S.; Croce, D.; Moschitti, A.; and Basili, R. Kelp at semeval-2016 task 3: Learning semantic relations between questions and answers. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17.
He, H.; Gimpel, K.; and Lin, J. J. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21.
Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural computation 9(8).
Hu, B.; Lu, Z.; Li, H.; and Chen, Q. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December, Montreal, Quebec, Canada.
Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. CoRR abs/
Li, J.; Chen, X.; Hovy, E.; and Jurafsky, D. Visualizing and understanding neural models in nlp. arXiv preprint arXiv:
Pennington, J.; Socher, R.; and Manning, C. D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL.
Qiu, X., and Huang, X. Convolutional neural tensor network architecture for community-based question answering. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31.
Rao, J.; He, H.; and Lin, J. J. Noise-contrastive estimation for answer selection with deep neural networks. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28.
Santos; Tan, M.; Xiang, B.; and Zhou, B. Attentive pooling networks. CoRR abs/
Severyn, A., and Moschitti, A. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13.
Severyn, A.; Moschitti, A.; Tsagkias, M.; Berendsen, R.; and de Rijke, M. A syntax-aware re-ranker for microblog retrieval. In The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 14, Gold Coast, QLD, Australia - July 06-11.
Tay, Y.; Phan, M. C.; Luu, A. T.; and Hui, S. C. Learning to rank question answer pairs with holographic dual lstm architecture. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, August 7-11.
Tay, Y.; Luu, A. T.; and Hui, S. C. Enabling efficient question answer retrieval via hyperbolic neural networks. CoRR abs/
Wang, M., and Manning, C. D. Probabilistic tree-edit models with structured latent variables for textual entailment and question answering. In COLING 2010, 23rd International Conference on Computational Linguistics, Proceedings of the Conference, August 2010, Beijing, China.
Wang, D., and Nyberg, E. A long short-term memory model for answer sentence selection in question answering. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, ACL 2015, July 26-31, 2015, Beijing, China, Volume 2: Short Papers.
Wang, K.; Ming, Z.; and Chua, T. A syntactic tree matching approach to finding similar questions in community-based qa services. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23.
Yang, L.; Ai, Q.; Guo, J.; and Croft, W. B. anmm: Ranking short answer texts with attention-based neural matching model. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28.
Yu, L.; Hermann, K. M.; Blunsom, P.; and Pulman, S. Deep learning for answer sentence selection. CoRR abs/
Zhang, X.; Li, S.; Sha, L.; and Wang, H. Attentive interactive neural networks for answer selection in community question answering. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA.
Zhou, G.; Cai, L.; Zhao, J.; and Liu, K. Phrase-based translation model for question retrieval in community question answer archives. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, June, 2011, Portland, Oregon, USA.
