A Generalized Recurrent Neural Architecture for Text Classification with Multi-Task Learning

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

A Generalized Recurrent Neural Architecture for Text Classification with Multi-Task Learning

Honglun Zhang 1, Liqiang Xiao 1, Yongkun Wang 2, Yaohui Jin 1,2
1 State Key Lab of Advanced Optical Communication System and Network
2 Network and Information Center
Shanghai Jiao Tong University
{ykw}@sjtu.edu.cn

Abstract

Multi-task learning leverages potential correlations among related tasks to extract common features and yield performance gains. However, most previous works only consider simple or weak interactions, thereby failing to model complex correlations among three or more tasks. In this paper, we propose a multi-task learning architecture with four types of recurrent neural layers to fuse information across multiple related tasks. The architecture is structurally flexible and considers various interactions among tasks, which can be regarded as a generalized case of many previous works. Extensive experiments on five benchmark datasets for text classification show that our model can significantly improve the performances of related tasks with additional information from others.

1 Introduction

Neural network based models have been widely exploited with the prosperity of Deep Learning [Bengio et al., 2013] and have achieved inspiring performances on many NLP tasks, such as text classification [Chen et al., 2015; Liu et al., 2015a], semantic matching [Liu et al., 2016d; 2016a] and machine translation [Sutskever et al., 2014]. These models require little manual feature engineering and can represent words, sentences and documents as fixed-length vectors, which contain rich semantic information and are ideal for subsequent NLP tasks.

One formidable constraint of deep neural networks (DNN) is their strong reliance on large amounts of annotated corpus due to the substantial number of parameters to train. A DNN trained on limited data is prone to overfitting and incapable of generalizing well. However, the construction of large-scale high-quality labeled datasets is extremely labor-intensive. To mitigate the problem, these models usually employ a pre-trained lookup table, also known as Word Embedding [Mikolov et al., 2013b], to map words into vectors with semantic implications. However, this method just introduces extra knowledge and does not directly optimize the targeted task. The problem of insufficient annotated resources is not solved either.

Multi-task learning leverages potential correlations among related tasks to extract common features, implicitly increase corpus size and yield classification improvements. Inspired by [Caruana, 1997], there is a large literature dedicated to multi-task learning with neural network based models [Collobert and Weston, 2008; Liu et al., 2015b; 2016b; 2016c]. These models basically share some lower layers to capture common features and further feed them to subsequent task-specific layers, and can be classified into three types:

Type-I One dataset annotated with multiple labels, and one input with multiple outputs.

Type-II Multiple datasets with respective labels, and one input with multiple outputs, where samples from different tasks are fed into the models one by one, sequentially.

Type-III Multiple datasets with respective labels, and multiple inputs with multiple outputs, where samples from different tasks are jointly learned in parallel.

In this paper, we propose a generalized multi-task learning architecture with four types of recurrent neural layers for text classification. The architecture focuses on Type-III, which involves more complicated interactions but has not been researched yet. All the related tasks are jointly integrated into a single system and samples from different tasks are trained in parallel.
In our model, every two tasks can directly interact with each other and selectively absorb useful information, or communicate indirectly via a shared intermediate layer. We also design a global memory storage to share common features and collect interactions among all tasks. We conduct extensive experiments on five benchmark datasets for text classification. Compared to learning separately, jointly learning multiple related tasks in our model yields significant performance gains for each task.

Our contributions are three-fold:

Our model is structurally flexible and considers various interactions, so it can be regarded as a generalized case of many previous works with deliberate designs.

Our model allows for interactions among three or more tasks simultaneously, and samples from different tasks are trained in parallel with multiple inputs.

We consider different scenarios of multi-task learning and demonstrate strong results on several benchmark classification datasets. Our model outperforms most state-of-the-art baselines.

2 Problem Statements

2.1 Single-Task Learning

For a single supervised text classification task, the input is a word sequence denoted by x = {x_1, x_2, ..., x_T}, and the output is the corresponding class label y or class distribution y. A lookup layer is used first to get the vector representation x_i ∈ R^d of each word x_i. A classification model f is trained to transform each x = {x_1, x_2, ..., x_T} into a predicted distribution ŷ:

f(x_1, x_2, ..., x_T) = ŷ

and the training objective is to minimize the total cross-entropy of the predicted and true distributions over all samples:

L = - Σ_{i=1}^{N} Σ_{j=1}^{C} y_{ij} log ŷ_{ij}

where N denotes the number of training samples and C is the class number.

2.2 Multi-Task Learning

Given K supervised text classification tasks T_1, T_2, ..., T_K, a jointly learning model F is trained to transform multiple inputs into a combination of predicted distributions in parallel:

F(x^(1), x^(2), ..., x^(K)) = (ŷ^(1), ŷ^(2), ..., ŷ^(K))

where the x^(k) are sequences from each task and the ŷ^(k) are the corresponding predictions. The overall training objective of F is to minimize the weighted linear combination of the costs of all tasks:

L = - Σ_{i=1}^{N} Σ_{k=1}^{K} λ_k Σ_{j=1}^{C_k} y^(k)_{ij} log ŷ^(k)_{ij}  (4)

where N denotes the number of sample collections, and C_k and λ_k are the class number and weight of each task T_k respectively.

2.3 Three Perspectives of Multi-Task Learning

Different tasks may differ in the characteristics of the word sequences x or the labels y. We compare many benchmark tasks for text classification and identify three different perspectives of multi-task learning.

Multi-Cardinality Tasks are similar except for cardinality parameters, for example, movie review datasets with different average sequence lengths and class numbers.

Multi-Domain Tasks involve contents of different domains, for example, product review datasets on books, DVDs, electronics and kitchen appliances.

Multi-Objective Tasks are designed for different objectives, for example, sentiment analysis, topic classification and question type judgment.

The simplest multi-task learning scenario is that all tasks share the same cardinality, domain and objective while coming from different sources, so it is intuitive that they can obtain useful information from each other. However, in the most complex scenario, tasks may vary in cardinality, domain and even objective, where the interactions among different tasks can be quite complicated and implicit. We evaluate our model on different scenarios in the Experiment section.

3 Methodology

Recently, neural network based models have attracted substantial interest in many natural language processing tasks for their capability to represent variable-length text sequences as fixed-length vectors, for example, Neural Bag-of-Words (NBOW), Recurrent Neural Networks (RNN), Recursive Neural Networks (RecNN) and Convolutional Neural Networks (CNN). Most of them first map sequences of words, n-grams or other semantic units into embedding representations with a pre-trained lookup table, then fuse these vectors with different neural network architectures, and finally utilize a softmax layer to predict the categorical distribution for a specific classification task. In a recurrent neural network, input vectors are absorbed one by one in a recurrent way, which makes RNNs particularly suitable for natural language processing tasks.

3.1 Recurrent Neural Network

A recurrent neural network maintains an internal hidden state vector h_t that is recurrently updated by a transition function f. At each time step t, the hidden state h_t is updated according to the current input vector x_t and the previous hidden state h_{t-1}:
h_t = 0 if t = 0, and h_t = f(h_{t-1}, x_t) otherwise  (5)

where f is usually a composition of an element-wise nonlinearity with an affine transformation of both x_t and h_{t-1}. In this way, a recurrent neural network can encode a sequence of arbitrary length into a fixed-length vector and feed it to a softmax layer for text classification or other NLP tasks. However, the gradient of f can grow or decay exponentially over long sequences during training, known as the gradient exploding or vanishing problem, which makes it difficult for RNNs to learn long-term dependencies and correlations.

[Hochreiter and Schmidhuber, 1997] proposed the Long Short-Term Memory network (LSTM) to tackle the above problems. Apart from the internal hidden state h_t, an LSTM also maintains an internal memory cell and three gating mechanisms. While there are numerous variants of the standard LSTM, here we follow the implementation of [Graves, 2013]. At each time step t, the state of the LSTM can be fully represented by five vectors in R^n, an input gate i_t, a forget gate f_t, an output gate o_t, the hidden state h_t and the memory cell c_t, which adhere to the following transition functions:

i_t = σ(W_i x_t + U_i h_{t-1} + V_i c_{t-1} + b_i)  (6)
f_t = σ(W_f x_t + U_f h_{t-1} + V_f c_{t-1} + b_f)  (7)
o_t = σ(W_o x_t + U_o h_{t-1} + V_o c_{t-1} + b_o)  (8)
c̃_t = tanh(W_c x_t + U_c h_{t-1})  (9)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t  (10)
h_t = o_t ⊙ tanh(c_t)  (11)

where x_t is the current input, σ denotes the logistic sigmoid function and ⊙ denotes element-wise multiplication. By selectively controlling which portions of the memory cell c_t to update, erase and forget at each time step, an LSTM can better comprehend long-term dependencies with respect to the labels of whole sequences.
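To make the transition functions concrete, the following is a minimal NumPy sketch of Eqs. (6)-(11). The function name, the dictionary layout and the toy usage at the end are our own illustration, not the authors' code; the weight shapes follow the hyperparameters reported later (d = 300, n = 100).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Eqs. (6)-(11); p holds the weight matrices and biases."""
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["V_i"] @ c_prev + p["b_i"])  # Eq. (6)
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["V_f"] @ c_prev + p["b_f"])  # Eq. (7)
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["V_o"] @ c_prev + p["b_o"])  # Eq. (8)
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev)                             # Eq. (9)
    c_t = f_t * c_prev + i_t * c_tilde                                                # Eq. (10)
    h_t = o_t * np.tanh(c_t)                                                          # Eq. (11)
    return h_t, c_t

# Toy usage with the sizes of Table 2 (d = 300, n = 100) and random weights.
rng = np.random.default_rng(0)
d, n = 300, 100
shapes = [("W", (n, d)), ("U", (n, n)), ("V", (n, n)), ("b", (n,))]
p = {f"{m}_{g}": rng.normal(0, 0.1, s) for g in "ifo" for m, s in shapes}
p.update({"W_c": rng.normal(0, 0.1, (n, d)), "U_c": rng.normal(0, 0.1, (n, n))})
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(0, 1, (19, d)):  # a 19-word sequence, the average SST length in Table 1
    h, c = lstm_step(x_t, h, c, p)
```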

3.2 A Generalized Architecture

Based on the LSTM implementation of [Graves, 2013], we propose a generalized multi-task learning architecture for text classification with four types of recurrent neural layers to convey information inside and among tasks. Figure 1 illustrates the structural design and information flows of our model, where three tasks are jointly learned in parallel. As Figure 1a shows, each task owns an LSTM-based Single Layer for intra-task learning. The Pair-wise Coupling Layer and Local Fusion Layer are designed for direct and indirect inter-task interactions, and we further utilize a Global Fusion Layer to maintain a global memory for information shared among all tasks.

Single Layer

Each task owns an LSTM-based Single Layer with a collection of parameters Φ^(k); taking Eq. (9) as an example:

c̃_t^(k) = tanh(W_c^(k) x_t^(k) + U_c^(k) h_{t-1}^(k))  (12)

Input sequences of each task are transformed into vector representations (x^(1), x^(2), ..., x^(K)), which are then recurrently fed into the corresponding Single Layers. The hidden state at the last time step, h_T^(k), of each Single Layer can be regarded as a fixed-length representation of the whole sequence, and is followed by a fully connected layer and a softmax non-linear layer to produce the class distribution:

ŷ^(k) = softmax(W^(k) h_T^(k) + b^(k))  (13)

where ŷ^(k) is the predicted class distribution for x^(k).

Coupling Layer

Besides Single Layers, we design Coupling Layers to model direct pair-wise interactions between tasks. For each pair of tasks, the hidden states and memory cells of the Single Layers can obtain extra information directly from each other, as shown in Figure 1b. We re-define Eq. (12) and utilize a gating mechanism to control the portion of information flows from one task to another. The memory content of each Single Layer is updated on the leverage of pair-wise couplings:

c̃_t^(k) = tanh(W_c^(k) x_t^(k) + Σ_{j=1}^{K} g_t^(j→k) U_c^(j→k) h_{t-1}^(j))  (14)
g_t^(j→k) = σ(W_gc^(k) x_t^(k) + U_gc^(j) h_{t-1}^(j))  (15)

where g_t^(j→k) controls the portion of the information flow from T_j to T_k, based on the correlation strength between x_t^(k) and h_{t-1}^(j) at the current time step. In this way, the hidden states and memory cells of each Single Layer can obtain extra information from other tasks, and stronger relevance results in higher chances of reception.

[Figure 1: A generalized recurrent neural architecture for modeling text with multi-task learning. (a) Overall architecture with Single Layers, Coupling Layers, Local Fusion Layers and the Global Fusion Layer; (b) details of the Coupling Layer between T_1 and T_2; (c) details of the Local Fusion Layer between T_1 and T_2.]

Local Fusion Layer

Different from Coupling Layers, Local Fusion Layers introduce a shared bi-directional LSTM layer to model indirect pair-wise interactions between tasks. For each pair of tasks, we feed the Local Fusion Layer with the concatenation of both inputs, x_t^(j,k) = x_t^(j) ⊕ x_t^(k), as shown in Figure 1c.

We denote the output of the Local Fusion Layer as h_t^(j,k) = h→_t^(j,k) ⊕ h←_t^(j,k), the concatenation of the hidden states of the forward and backward LSTMs at each time step. Similar to the Coupling Layers, the hidden states and memory cells of the Single Layers can selectively decide how much information to accept from the pair-wise Local Fusion Layers. We re-define Eq. (14) by considering the interactions between the memory content and the outputs of the Local Fusion Layers as follows:

c̃_t^(k) = tanh(W_c^(k) x_t^(k) + C_t^(k) + LF_t^(k))  (16)
LF_t^(k) = Σ_{j=1, j≠k}^{K} g_t^(j,k) U_c^(j,k) h_t^(j,k)  (17)
g_t^(j,k) = σ(W_gf^(k) x_t^(k) + U_gf^(j) h_t^(j,k))  (18)

where C_t^(k) denotes the coupling term of Eq. (14) and LF_t^(k) represents the local fusion term. Again, we employ a gating mechanism g_t^(j,k) to control the portion of the information flow from the Local Fusion Layers to T_k.

Global Fusion Layer

Indirect interactions between Single Layers can be pair-wise or global, so we further propose the Global Fusion Layer as a shared memory storage among all tasks. The Global Fusion Layer consists of a bi-directional LSTM layer with inputs x_t^(g) = x_t^(1) ⊕ ... ⊕ x_t^(K) and outputs h_t^(g) = h→_t^(g) ⊕ h←_t^(g). We denote the global fusion term as GF_t^(k), and the memory content is calculated as follows:

c̃_t^(k) = tanh(W_c^(k) x_t^(k) + C_t^(k) + LF_t^(k) + GF_t^(k))  (19)
GF_t^(k) = σ(W_gg^(k) x_t^(k) + U_gg^(k) h_t^(g)) ⊙ U_c^(g) h_t^(g)  (20)

As a result, our architecture covers complicated interactions among different tasks. It is capable of mapping a collection of input sequences from different tasks into a combination of predicted class distributions in parallel, as formulated in Section 2.2.

3.3 Sampling & Training

Most previous multi-task learning models [Collobert and Weston, 2008; Liu et al., 2015b; 2016b; 2016c] belong to Type-I or Type-II. The total number of input samples is N = Σ_{k=1}^{K} N_k, where N_k is the number of samples of each task. However, our model focuses on Type-III and requires a 4-D tensor of shape N × K × T × d as input, where N, K, T, d are the total number of input collections, the number of tasks, the sequence length and the embedding size respectively. Samples from different tasks are jointly learned in parallel, so the total number of all possible input collections is N_max = Π_{k=1}^{K} N_k. We propose a Task Oriented Sampling (TOS) algorithm to generate sample collections that improve a specific task T_k.

Algorithm 1 Task Oriented Sampling
Input: N_i samples from each task T_i; k, the oriented task index; n_0, the upsampling coefficient such that N = n_0 N_k
Output: sequence collections X and label combinations Y
1: for each i ∈ [1, K] do
2:   generate a set S_i with N samples for each task:
3:   if i = k then
4:     repeat each sample n_0 times
5:   else if N_i ≥ N then
6:     randomly select N samples without replacement
7:   else
8:     randomly select N samples with replacement
9:   end if
10: end for
11: for each j ∈ [1, N] do
12:   randomly select a sample from each S_i without replacement
13:   combine their features and labels as X_j and Y_j
14: end for
15: merge all X_j and Y_j to produce the sequence collections X and label combinations Y

Given the generated sequence collections X and label combinations Y, the overall loss function can be calculated based on Eqs. (4) and (13). The training process is conducted in a stochastic manner until convergence: in each loop, we randomly select a collection from the N candidates and update the parameters by taking a gradient step.
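A runnable reading of Algorithm 1 is sketched below, under the assumption that each task's data is given as a list of (features, label) pairs; the function name and container layout are ours, not the paper's.

```python
import random

def task_oriented_sampling(tasks, k, n0):
    """Sketch of Algorithm 1 (Task Oriented Sampling).
    tasks: list of K lists, each holding (features, label) pairs for one task;
    k: index of the oriented task; n0: upsampling coefficient.
    Returns sequence collections X and label combinations Y."""
    N = n0 * len(tasks[k])
    S = []
    for i, samples in enumerate(tasks):
        if i == k:
            S_i = [s for s in samples for _ in range(n0)]      # repeat each sample n0 times
        elif len(samples) >= N:
            S_i = random.sample(samples, N)                    # N samples without replacement
        else:
            S_i = [random.choice(samples) for _ in range(N)]   # N samples with replacement
        random.shuffle(S_i)
        S.append(S_i)
    X, Y = [], []
    for j in range(N):
        collection = [S_i[j] for S_i in S]                     # one sample per task, no reuse
        X.append([features for features, _ in collection])
        Y.append([label for _, label in collection])
    return X, Y
```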
4 Experiment

In this section, we design three different scenarios of multi-task learning based on five benchmark datasets for text classification. We investigate the empirical performances of our model and compare it to existing state-of-the-art models.

4.1 Datasets

As Table 1 shows, we select five benchmark datasets for text classification and design three experiment scenarios to evaluate the performances of our model.

Multi-Cardinality Movie review datasets with different average lengths and class numbers, including SST-1 [Socher et al., 2013], SST-2 and IMDB [Maas et al., 2011].

Multi-Domain Product review datasets on different domains from the Multi-Domain Sentiment Dataset [Blitzer et al., 2007], including Books, DVDs, Electronics and Kitchen.

Multi-Objective Classification datasets with different objectives, including IMDB, RN [Apté et al., 1994] and QC [Li and Roth, 2002].

4.2 Hyperparameters and Training

The whole network is trained through back propagation with stochastic gradient descent [Amari, 1993]. We obtain a pre-trained lookup table by applying Word2Vec [Mikolov et al., 2013a] to the Google News corpus, which contains more than 100B words with a vocabulary size of about 3M. All involved parameters are randomly initialized from a truncated normal distribution with zero mean and a fixed standard deviation.
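As a rough illustration of these training mechanics, the sketch below draws a parameter matrix from a truncated normal (re-drawing entries beyond two standard deviations, one common reading of the term) and applies a plain SGD update with the initial learning rate of Table 2; the helper names and the standard deviation of 0.01 are assumptions.

```python
import numpy as np

def truncated_normal(shape, std, rng):
    """Sample from N(0, std^2), re-drawing entries beyond two standard deviations."""
    w = rng.normal(0.0, std, size=shape)
    out_of_range = np.abs(w) > 2 * std
    while out_of_range.any():
        w[out_of_range] = rng.normal(0.0, std, size=out_of_range.sum())
        out_of_range = np.abs(w) > 2 * std
    return w

def sgd_step(param, grad, lr=0.1):
    """One stochastic gradient descent update (initial learning rate from Table 2)."""
    return param - lr * grad

rng = np.random.default_rng(0)
W = truncated_normal((100, 300), std=0.01, rng=rng)  # e.g. an n x d LSTM input matrix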

Table 1: Five benchmark classification datasets: SST, IMDB, MDSD, RN and QC.

Dataset  Description                                                              Type      Length                Class  Objective
SST      Movie reviews in Stanford Sentiment Treebank, including SST-1 and SST-2  Sentence  19 / 19               5 / 2  Sentiment
IMDB     Internet Movie Database                                                  Document                               Sentiment
MDSD     Product reviews on books, DVDs, electronics and kitchen appliances       Document  176 / 189 / 115 / 97  2      Sentiment
RN       Reuters Newswire topics classification                                   Document                               Topics
QC       Question Classification                                                  Sentence  10                    6      Question Types

For each task T_k, we conduct TOS with n_0 = 2 to improve its performance. After training our model on the generated sample collections, we evaluate the performance of task T_k by comparing ŷ^(k) and y^(k) on the test set. We apply 10-fold cross-validation and investigate different combinations of hyperparameters, of which the best one, as shown in Table 2, is reserved for comparisons with state-of-the-art models.

Table 2: Hyperparameter settings
Embedding size               d = 300
Hidden layer size of LSTM    n = 100
Initial learning rate        η = 0.1
Regularization weight        λ =

4.3 Results

We compare the performance of our model with the implementation of [Graves, 2013], and the results are shown in Table 3. Our model obtains better performances in the Multi-Domain scenario, with an average improvement of 4.5%, where the datasets are product reviews on different domains with similar sequence lengths and the same class number, thus producing stronger correlations. The Multi-Cardinality scenario also achieves a significant improvement of 2.77% on average, where the datasets are movie reviews with different cardinalities. However, the Multi-Objective scenario benefits less from multi-task learning due to the lack of salient correlations among sentiment, topic and question type. The QC dataset aims to classify each question into six categories, and its performance even gets worse, which may be caused by potential noise introduced by other tasks. In practice, the structure of our model is flexible, as couplings and fusions between empirically unrelated tasks can be removed to alleviate computation costs.

Influences of n_0 in TOS

We further explore the influence of n_0 in TOS on our model, which can be any positive integer. A higher value means larger and more varied sample combinations, but requires higher computation costs. Figure 2 shows the performances of the datasets in the Multi-Domain scenario with different n_0. Compared to n_0 = 1, our model achieves considerable improvements when n_0 = 2, as more sample combinations are available. However, there are no further salient gains as n_0 gets larger, and potential noise from other tasks may lead to performance degradation. As a trade-off between efficiency and effectiveness, we determine n_0 = 2 to be the optimal value for our experiments.

[Figure 2: Influences of n_0 in TOS on different datasets.]

Pair-wise Performance Gain

In order to measure the correlation strength between two tasks T_i and T_j, we learn them jointly with our model and define the Pair-wise Performance Gain as PPG_ij = (P'_i P'_j) / (P_i P_j), where P_i, P_j and P'_i, P'_j are the performances of tasks T_i and T_j when learned individually and jointly, respectively. We calculate PPGs for every two tasks in Table 1 and illustrate the results in Figure 3, where the darkness of the colors indicates the strength of correlation. It is intuitive that the datasets of the Multi-Domain scenario obtain relatively higher PPGs with each other, as they share similar cardinalities and abundant low-level linguistic characteristics. Sentences of the QC dataset are much shorter and convey characteristics distinct from the other tasks, thus resulting in quite lower PPGs.
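As a small worked example of the PPG definition above (the accuracy values are made up for illustration, and the formula follows the reconstruction given in the text):

```python
def pairwise_performance_gain(p_i, p_j, p_i_joint, p_j_joint):
    """PPG_ij per the definition above: product of the jointly-learned performances
    of T_i and T_j divided by the product of their individually-learned performances."""
    return (p_i_joint * p_j_joint) / (p_i * p_j)

# Hypothetical accuracies for two domains learned alone vs. jointly.
print(pairwise_performance_gain(0.80, 0.81, 0.84, 0.85))  # ~1.10: joint learning helped both tasks
```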
4.4 Comparisons with State-of-the-art Models

We apply the optimal hyperparameter settings and compare our model against the following state-of-the-art models:

NBOW Neural Bag-of-Words, which simply sums up the embedding vectors of all words.

PV Paragraph Vectors followed by logistic regression [Le and Mikolov, 2014].

MT-RNN Multi-Task learning with Recurrent Neural Networks through a shared-layer architecture [Liu et al., 2016c].

MT-CNN Multi-Task learning with Convolutional Neural Networks [Collobert and Weston, 2008] in which the lookup tables are partially shared.

MT-DNN Multi-Task learning with Deep Neural Networks [Liu et al., 2015b], which utilizes bag-of-word representations and a shared hidden layer.

GRNN Gated Recursive Neural Network for sentence modeling [Chen et al., 2015].

[Table 3: Results of our model on different scenarios, reporting Single Task and Our Model for Multi-Cardinality (SST-1, SST-2, IMDB), Multi-Domain (Books, DVDs, Electronics, Kitchen) and Multi-Objective (IMDB, RN, QC).]

[Table 4: Comparisons with state-of-the-art models (NBOW, PV, MT-RNN, MT-CNN, MT-DNN, GRNN and Our Model) on SST-1, SST-2, IMDB, Books, DVDs, Electronics, Kitchen and QC.]

[Figure 3: Visualization of Pair-wise Performance Gains among SST-1, SST-2, IMDB, Books, DVDs, Electronics, Kitchen and QC.]

As Table 4 shows, our model obtains competitive or better performances on all tasks except for the QC dataset, as it has poor correlations with the other tasks. MT-RNN slightly outperforms our model on SST, as sentences from this dataset are much shorter than those from IMDB and MDSD; another possible reason may be that our model is more complex and requires more data for training. Our model proposes the designs of various interactions, including coupling, local fusion and global fusion, which can be further implemented by other state-of-the-art models to produce better performances.

5 Related Work

There is a large body of literature related to multi-task learning with neural networks in NLP [Collobert and Weston, 2008; Liu et al., 2015b; 2016b; 2016c].

[Collobert and Weston, 2008] belongs to Type-I and utilizes shared lookup tables for common features, followed by task-specific neural layers, for several traditional NLP tasks such as part-of-speech tagging and semantic parsing. They use a fixed-size window to cope with variable-length texts, which can be better handled by recurrent neural networks.

[Liu et al., 2015b; 2016b; 2016c] all belong to Type-II, where samples from different tasks are learned sequentially. [Liu et al., 2015b] applies bag-of-word representations, so information about word order is lost. [Liu et al., 2016b] introduces an external memory for information sharing with a reading/writing mechanism for communication, and [Liu et al., 2016c] proposes three different models for multi-task learning with recurrent neural networks. However, the models of these two papers only involve pair-wise interactions, which can be regarded as specific implementations of the Coupling Layer and Fusion Layers in our model.

Different from the above models, our model focuses on Type-III and utilizes recurrent neural networks to comprehensively capture various interactions among tasks, both direct and indirect, local and global. Three or more tasks are learned simultaneously, and samples from different tasks are trained in parallel, benefiting from each other and thus obtaining better sentence representations.

6 Conclusion and Future Work

In this paper, we propose a multi-task learning architecture for text classification with four types of recurrent neural layers. The architecture is structurally flexible and can be regarded as a generalized case of many previous works with deliberate designs. We explore three different scenarios of multi-task learning, and in all scenarios our model can improve the performances of most tasks with additional related information from others. In future work, we would like to investigate further implementations of couplings and fusions, and identify more multi-task learning perspectives.

References

[Amari, 1993] Shun-ichi Amari. Backpropagation and stochastic gradient descent method. Neurocomputing, 5, 1993.
[Apté et al., 1994] Chidanand Apté, Fred Damerau, and Sholom M. Weiss. Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems, 12, 1994.
[Bengio et al., 2013] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.
[Blitzer et al., 2007] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In ACL, 2007.
[Caruana, 1997] Rich Caruana. Multitask Learning. Machine Learning, 28:41-75, 1997.
[Chen et al., 2015] Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Shiyu Wu, and Xuanjing Huang. Sentence Modeling with Gated Recursive Neural Network. In EMNLP, 2015.
[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, pages 160-167, 2008.
[Graves, 2013] Alex Graves. Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850, 2013.
[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.
[Le and Mikolov, 2014] Quoc V. Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, June 2014.
[Li and Roth, 2002] Xin Li and Dan Roth. Learning Question Classifiers. In COLING, 2002.
[Liu et al., 2015a] Pengfei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuanjing Huang. Multi-Timescale Long Short-Term Memory Neural Network for Modelling Sentences and Documents. In EMNLP, 2015.
[Liu et al., 2015b] Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval. In NAACL HLT, 2015.
[Liu et al., 2016a] Pengfei Liu, Xipeng Qiu, Jifan Chen, and Xuanjing Huang. Deep Fusion LSTMs for Text Semantic Matching. In ACL, 2016.
[Liu et al., 2016b] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Deep Multi-Task Learning with Shared Memory for Text Classification. In EMNLP, 2016.
[Liu et al., 2016c] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent Neural Network for Text Classification with Multi-Task Learning. In IJCAI, 2016.
[Liu et al., 2016d] Pengfei Liu, Xipeng Qiu, Yaqian Zhou, Jifan Chen, and Xuanjing Huang. Modelling Interaction of Sentence Pair with Coupled-LSTMs. In EMNLP, 2016.
[Maas et al., 2011] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning Word Vectors for Sentiment Analysis. In NAACL HLT. Association for Computational Linguistics, June 2011.
[Mikolov et al., 2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3781, 2013.
[Mikolov et al., 2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, 2013.
[Socher et al., 2013] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In EMNLP, Stroudsburg, PA, October 2013. Association for Computational Linguistics.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.


More information

A unit root test based on smooth transitions and nonlinear adjustment

A unit root test based on smooth transitions and nonlinear adjustment MPRA Munich Personal RePEc Archive A uni roo es based on smooh ransiions and nonlinear adjusmen Aycan Hepsag Isanbul Universiy 5 Ocober 2017 Online a hps://mpra.ub.uni-muenchen.de/81788/ MPRA Paper No.

More information

Tensorial Recurrent Neural Networks for Longitudinal Data Analysis

Tensorial Recurrent Neural Networks for Longitudinal Data Analysis 1 Tensorial Recurren Neural Neworks for Longiudinal Daa Analysis Mingyuan Bai Boyan Zhang and Junbin Gao arxiv:1708.00185v1 [cs.lg] 1 Aug 2017 Absrac Tradiional Recurren Neural Neworks assume vecorized

More information

EE100 Lab 3 Experiment Guide: RC Circuits

EE100 Lab 3 Experiment Guide: RC Circuits I. Inroducion EE100 Lab 3 Experimen Guide: A. apaciors A capacior is a passive elecronic componen ha sores energy in he form of an elecrosaic field. The uni of capaciance is he farad (coulomb/vol). Pracical

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

arxiv: v1 [cs.lg] 18 Jul 2018

arxiv: v1 [cs.lg] 18 Jul 2018 General Value Funcion Neworks Mahew Schlegel Universiy of Albera mkschleg@ualbera.ca Adam Whie Universiy of Albera amw8@ualbera.ca Andrew Paerson Indiana Universiy andnpa@indiana.edu arxiv:1807.06763v1

More information

Pattern Classification (VI) 杜俊

Pattern Classification (VI) 杜俊 Paern lassificaion VI 杜俊 jundu@usc.edu.cn Ouline Bayesian Decision Theory How o make he oimal decision? Maximum a oserior MAP decision rule Generaive Models Join disribuion of observaion and label sequences

More information

Presentation Overview

Presentation Overview Acion Refinemen in Reinforcemen Learning by Probabiliy Smoohing By Thomas G. Dieerich & Didac Busques Speaer: Kai Xu Presenaion Overview Bacground The Probabiliy Smoohing Mehod Experimenal Sudy of Acion

More information

5 The fitting methods used in the normalization of DSD

5 The fitting methods used in the normalization of DSD The fiing mehods used in he normalizaion of DSD.1 Inroducion Sempere-Torres e al. 1994 presened a general formulaion for he DSD ha was able o reproduce and inerpre all previous sudies of DSD. The mehodology

More information

Fusion and Inference from Multiple Data Sources in a Commensurate Space

Fusion and Inference from Multiple Data Sources in a Commensurate Space Fusion and Inference from Muliple Daa Sources in a Commensurae Space Zhiliang Ma 1, David. Marchee 2 and Carey E. Priebe 1 1 Applied Mahemaics & Saisics, ohns Hopkins Universiy, Balimore, MD, USA 2 Naval

More information

Introduction D P. r = constant discount rate, g = Gordon Model (1962): constant dividend growth rate.

Introduction D P. r = constant discount rate, g = Gordon Model (1962): constant dividend growth rate. Inroducion Gordon Model (1962): D P = r g r = consan discoun rae, g = consan dividend growh rae. If raional expecaions of fuure discoun raes and dividend growh vary over ime, so should he D/P raio. Since

More information

References are appeared in the last slide. Last update: (1393/08/19)

References are appeared in the last slide. Last update: (1393/08/19) SYSEM IDEIFICAIO Ali Karimpour Associae Professor Ferdowsi Universi of Mashhad References are appeared in he las slide. Las updae: 0..204 393/08/9 Lecure 5 lecure 5 Parameer Esimaion Mehods opics o be

More information

Temporal Abstraction in Temporal-difference Networks

Temporal Abstraction in Temporal-difference Networks Temporal Absracion in Temporal-difference Neworks Richard S. Suon, Eddie J. Rafols, Anna Koop Deparmen of Compuing Science Universiy of Albera Edmonon, AB, Canada T6G 2E8 {suon,erafols,anna}@cs.ualbera.ca

More information

1 Review of Zero-Sum Games

1 Review of Zero-Sum Games COS 5: heoreical Machine Learning Lecurer: Rob Schapire Lecure #23 Scribe: Eugene Brevdo April 30, 2008 Review of Zero-Sum Games Las ime we inroduced a mahemaical model for wo player zero-sum games. Any

More information