Fine-grained Opinion Mining with Recurrent Neural Networks and Word Embeddings

Fine-grained Opinion Mining with Recurrent Neural Networks and Word Embeddings

Pengfei Liu 1, Shafiq Joty 2 and Helen Meng 1
1 Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong SAR, China
2 Qatar Computing Research Institute - HBKU, Doha, Qatar
{pfliu, hmmeng}@se.cuhk.edu.hk, sjoty@qf.org.qa

Abstract

The tasks in fine-grained opinion mining can be regarded as either a token-level sequence labeling problem or as a semantic compositional task. We propose a general class of discriminative models based on recurrent neural networks (RNNs) and word embeddings that can be successfully applied to such tasks without any task-specific feature engineering effort. Our experimental results on the task of opinion target identification show that RNNs, without using any hand-crafted features, outperform feature-rich CRF-based models. The RNNs based on the long short-term memory (LSTM) architecture deliver the best results, outperforming previous methods including the top performing systems in the SemEval'14 evaluation campaign.

1 Introduction

Fine-grained opinion mining involves identifying the opinion holder who expresses the opinion, detecting opinion expressions, measuring their intensity and sentiment, and identifying the target or aspect of the opinion (Wiebe et al., 2005). For example, in the sentence "John says, the hard disk is very noisy", John, the opinion holder, expresses a very negative (i.e., sentiment with intensity) opinion towards the target hard disk using the opinionated expression very noisy. A number of NLP applications can benefit from fine-grained opinion mining, including opinion summarization and opinion-oriented question answering.

The tasks in fine-grained opinion mining can be regarded as either a token-level sequence labeling problem or as a semantic compositional task at the sequence (e.g., phrase) level. For example, identifying opinion holders, opinion expressions and opinion targets can be formulated as a token-level sequence tagging problem, where the task is to label each word in a sentence using the conventional BIO tagging scheme. Table 1 shows a sentence tagged with the BIO scheme for the opinion target (middle row) and opinion expression (bottom row) identification tasks.

Table 1: An example sentence annotated with BIO labels for opinion target (TARG tags) and opinion expression (EXPR tags) extraction.

  The   hard     disk     is   very     noisy
  O     B-TARG   I-TARG   O    O        O
  O     O        O        O    B-EXPR   I-EXPR

On the other hand, characterizing the intensity and sentiment of an opinionated expression can be regarded as a semantic compositional problem, where the task is to aggregate vector representations of tokens in a meaningful way and later use them for sentiment classification (Socher et al., 2013).

Conditional random fields (CRFs) (Lafferty et al., 2001) have been quite successful for different fine-grained opinion mining tasks, e.g., opinion expression extraction (Yang and Cardie, 2012). The state-of-the-art model for opinion target extraction is also based on a CRF (Pontiki et al., 2014). However, the success of CRFs depends heavily on the use of an appropriate feature set and feature function expansion, which often requires a lot of engineering effort for each task at hand. An alternative approach based on deep learning automatically learns latent features as distributed vectors, and such models have recently been shown to outperform CRFs on similar tasks. For example, Irsoy and Cardie (2014) apply deep recurrent neural networks (RNNs) to extract opinion expressions from sentences and show that RNNs outperform CRFs.
Socher et al. (2013) propose recursive neural networks for a semantic compositional task to identify the sentiments of phrases and sentences hierarchically using syntactic parse trees.
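To make the token-level formulation concrete, here is a minimal Python sketch (illustrative only, not taken from the paper's released code) of the Table 1 sentence paired with its two BIO tag sequences:

```python
# Illustrative only: the sentence from Table 1 with its BIO labels for the
# opinion-target (TARG) and opinion-expression (EXPR) tagging tasks.
tokens = ["The", "hard", "disk", "is", "very", "noisy"]
target_tags = ["O", "B-TARG", "I-TARG", "O", "O", "O"]
expression_tags = ["O", "O", "O", "O", "B-EXPR", "I-EXPR"]

# A token-level sequence labeler predicts exactly one tag per token.
for token, targ, expr in zip(tokens, target_tags, expression_tags):
    print(f"{token:8s} {targ:8s} {expr:8s}")
```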

Meanwhile, recent advances in word embedding induction methods (Collobert and Weston, 2008; Mikolov et al., 2013b) have benefited researchers in two ways: (i) they have contributed to significant gains when used as extra word features in existing NLP systems (Turian et al., 2010; Lebret and Lebret, 2013), and (ii) they have enabled more effective training of RNNs by providing compact input representations of the words (Mesnil et al., 2013; Irsoy and Cardie, 2014).

Motivated by the recent success of deep learning, in this paper we propose a general class of models based on the RNN architecture and word embeddings that can be successfully applied to fine-grained opinion mining tasks without any task-specific feature engineering effort. We experiment with several important RNN architectures including Elman-type, Jordan-type and long short-term memory (LSTM) networks and their variations. We acquire pre-trained word embeddings from several external sources to give better initialization to our RNN models. The RNN models then fine-tune the word vectors during training to learn task-specific embeddings. We also present an architecture to incorporate other linguistic features into RNNs.

Our results on the task of opinion target extraction show that word embeddings improve the performance of state-of-the-art CRF models when included as additional features. They also improve RNNs when used as pre-trained word vectors, and fine-tuning them on the task gives the best results. A comparison between models demonstrates that RNNs significantly outperform CRFs, even when they use word embeddings as the only features. Incorporating simple linguistic features into RNNs improves the performance even further. Our best results with the LSTM RNN outperform the top performing systems in the SemEval'14 evaluation campaign. We make our source code available. [1]

[1] https://github.com/ppfliu/opinion-target

In the remainder of this paper, after discussing related work in Section 2, we present our RNN models in Section 3. In Section 4, we briefly describe the pre-trained word embeddings. The experiments and analysis of results are presented in Section 5. Finally, we summarize our contributions with future directions in Section 6.

2 Related Work

A line of previous research in fine-grained opinion mining focused on detecting opinion (subjective) expressions, e.g., (Wilson et al., 2005; Breck et al., 2007). The common approach was to formulate the problem as a sequence tagging task and use a CRF model. Later approaches extended this to jointly identify opinion holders (Choi et al., 2005), and intensity and polarity (Choi and Cardie, 2010).

Extracting aspect terms or opinion targets has been actively investigated in the past. Typical approaches include association mining to find frequent item sets (i.e., co-occurring words) as candidate aspects (Hu and Liu, 2004), classification-based methods such as hidden Markov models (Jin et al., 2009) and CRFs (Shariaty and Moghaddam, 2011; Yang and Cardie, 2012; Yang and Cardie, 2013), as well as topic modeling techniques using the Latent Dirichlet Allocation (LDA) model and its variants (Titov and McDonald, 2008; Lin and He, 2009; Moghaddam and Ester, 2012).

Conventional RNNs (e.g., Elman type) and LSTMs have been successfully applied to various sequence prediction tasks, such as language modeling (Mikolov et al., 2010; Sundermeyer et al., 2012), speech recognition (Graves and Jaitly, 2014; Sak et al., 2014) and spoken language understanding (Mesnil et al., 2013). For sentiment analysis, Socher et al. (2013) propose to use recursive neural networks to hierarchically compose semantic word vectors based on syntactic parse trees, and use the vectors to identify the sentiments of the phrases and sentences.
Le and Zuidema (2015) extended recursive neural networks with LSTM to compute a parent vector in parse trees by combining information of both the output and the LSTM memory cells from its two children.

Most relevant to our work is the recent work of Irsoy and Cardie (2014), where they apply a deep Elman-type RNN to extract opinion expressions and show that the deep RNN outperforms CRF, semi-CRF and shallow RNN models. They used word embeddings from Google without fine-tuning them. Although inspired by it, our work differs from the work of Irsoy and Cardie (2014) in many ways. (i) We experiment not only with Elman-type, but also with Jordan-type and with more advanced LSTM RNNs, and demonstrate that LSTM generally outperforms the others. (ii) We use not only Google embeddings as pre-trained word vectors, but also other embeddings including SENNA and Amazon, and show their performances. (iii) We also fine-tune the embeddings for our task, which is shown to be very crucial. (iv) We present

an RNN architecture to include other linguistic features and show its effectiveness. (v) Finally, we present a comprehensive experiment exploring different embedding dimensions and hidden layer sizes for all the variations of the RNNs (i.e., including features and bi-directionality).

3 Recurrent Neural Models

The recurrent neural models in this section compute compositional vector representations for word sequences of arbitrary length. These high-level (i.e., hidden-layer) distributed representations are then used as features to classify each token in the sentence. We first describe the common properties that the RNNs below share, followed by descriptions of the specific RNNs.

Each word in the vocabulary V is represented by a D-dimensional vector in the shared look-up table L \in R^{|V| x D}. Note that L is considered a model parameter to be learned. We can initialize L randomly or by pre-trained word embedding vectors (see Section 4). Given an input sentence s = (s_1, ..., s_T), we first transform it into a feature sequence by mapping each word token s_t \in s to an index in L. The look-up layer then creates a context vector x_t \in R^{mD} covering the m-1 neighboring tokens for each s_t by concatenating their respective vectors in L. For example, given the context size m = 3, the context vector x_t for the word disk in Figure 1 is formed by concatenating the embeddings of hard, disk and is. This window-based approach is intended to capture short-term dependencies between neighboring words in a sentence (Collobert et al., 2011).

The concatenated vector is then passed through non-linear recurrent hidden layers to learn high-level compositional representations, which are in turn fed to the output layer for classification using softmax. Formally, the probability of the k-th label in the output for classification into K classes is:

P(y_t = k | s, \theta) = \frac{\exp(w_k^T h_t)}{\sum_{k'=1}^{K} \exp(w_{k'}^T h_t)}    (1)

where h_t = \phi(x_t) defines the transformations of x_t through the hidden layers, and w_k are the weights from the last hidden layer to the output layer. We fit the models by minimizing the negative log likelihood (NLL) of the training data. The NLL for the sentence s can be written as:

J(\theta) = - \sum_{t=1}^{T} \sum_{k=1}^{K} y_t^k \log P(y_t = k | s, \theta)    (2)

where y_t^k = I(y_t = k) is an indicator variable encoding the gold labels (i.e., a one-hot vector representation): y_t^k = 1 if the gold label y_t = k, and 0 otherwise. The loss function minimizes the cross-entropy between the predicted distribution and the target distribution (i.e., the gold labels). The main difference between the models described below is how they compute h_t = \phi(x_t).

3.1 Elman-type RNN (Elman, 1990)

In an Elman-type RNN (Fig. 1a), the output of the hidden layer h_t at time t is computed from a non-linear transformation of the current input x_t and the previous hidden layer output h_{t-1}. Formally,

h_t = f(U h_{t-1} + V x_t + b)    (3)

where f is a nonlinear function (e.g., sigmoid) applied to the hidden units, U and V are the weight matrices between two consecutive hidden layers and between the input and the hidden layers, respectively, and b is the bias vector. This RNN thus creates internal states by remembering the previous hidden layer, which allows it to exhibit dynamic temporal behavior. We can interpret h_t as an intermediate representation summarizing the past, which is in turn used to make a final decision on the current input.

3.2 Jordan-type RNN (Jordan, 1997)

Jordan-type RNNs (Fig. 1b) are similar to Elman-type RNNs except that the hidden layer h_t at time t is fed from the previous output layer y_{t-1} instead of the previous hidden layer h_{t-1}. Formally,

h_t = f(U y_{t-1} + V x_t + b)    (4)

where U, V, b, and f are defined as before. Both Elman-type and Jordan-type RNNs are known as simple RNNs.
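The following NumPy sketch shows one way equations (1)-(3) could be realized for a single sentence. The names (lookup table L, weights U, V, W, bias b, context size m) mirror the notation above, but the concrete shapes, the choice of tanh for f and the padding handling are assumptions of this illustration rather than the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def elman_forward(token_ids, L, U, V, W, b, m=3):
    """Elman-type forward pass over one sentence (equations 1 and 3).

    token_ids : indices into the lookup table L of shape (|V|, D)
    U : (H, H) recurrent weights, V : (H, m*D) input weights, b : (H,) bias
    W : (K, H) output weights, m : context window size (odd)
    """
    D = L.shape[1]
    H = U.shape[0]
    pad = np.zeros(D)                       # stand-in for the PADDING word
    half = m // 2
    h_prev = np.zeros(H)
    probs = []
    for t in range(len(token_ids)):
        # Context vector x_t: concatenation of the m neighbouring embeddings.
        window = [L[token_ids[t + o]] if 0 <= t + o < len(token_ids) else pad
                  for o in range(-half, half + 1)]
        x_t = np.concatenate(window)
        h_t = np.tanh(U @ h_prev + V @ x_t + b)     # eq. (3); tanh chosen for f here
        probs.append(softmax(W @ h_t))              # eq. (1)
        h_prev = h_t
    return np.vstack(probs)                         # (T, K) label probabilities

def sentence_nll(probs, gold):
    """Negative log likelihood of equation (2) for one sentence."""
    return -sum(np.log(probs[t, k]) for t, k in enumerate(gold))
```

A Jordan-type variant (equation 4) would feed the previous softmax output back into the hidden layer in place of h_{t-1}.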
These types of RNNs are generally trained using stochastic gradient descent (SGD) with backpropagation through time (BPTT), where errors (i.e., gradients) are propagated back through the edges over time. One common issue with BPTT is that as the errors get propagated, they may soon become very small or very large, which can lead to undesired values in the weight matrices, causing the training to fail. This is known as the vanishing and exploding gradients problem (Bengio et al., 1994). One simple way to overcome this issue is to use truncated BPTT (Mikolov, 2012), restricting the backpropagation to only a few steps, such as 4 or 5. However, this solution limits the ability of the RNN to capture long-range dependencies. In the following, we describe an elegant RNN architecture that addresses this problem.
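A small numerical illustration of the problem (not from the paper): the backpropagated error is repeatedly multiplied by the recurrent weight matrix, so its norm shrinks or grows roughly geometrically with the number of unrolled time steps; truncated BPTT simply stops the unrolling after a few steps. The toy matrices below are assumptions chosen only to show the two regimes.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 50
g0 = rng.normal(size=H)                      # error vector at the last time step

for scale in (0.5, 1.5):                     # contractive vs. expansive recurrence (toy)
    U = scale * np.eye(H)                    # toy recurrent weight matrix
    g = g0.copy()
    for step in range(1, 31):                # propagate the error back 30 steps
        g = U.T @ g                          # one application of the chain rule
        if step in (5, 30):
            print(f"scale={scale}, {step:2d} steps back: |grad| = {np.linalg.norm(g):.3g}")

# Truncated BPTT avoids the extremes by stopping after a few steps
# (this paper later fixes the truncation to 6 steps).
```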

[Figure 1: Elman-type, Jordan-type and LSTM RNNs with a lookup-table layer, a hidden layer and an output layer. Panels: (a) Elman-type RNN, (b) Jordan-type RNN, (c) Long Short-Term Memory (LSTM) RNN, with one memory block (memory cell c, input gate i, output gate o, forget gate f) enlarged. The concatenated context vector for the word disk at time t is x_t = [x_hard, x_disk, x_is], with a context window of size 3.]

3.3 Long Short-Term Memory RNN

Long Short-Term Memory or LSTM (Hochreiter and Schmidhuber, 1997) is specifically designed to model long-term dependencies in RNNs. The recurrent layer in a standard LSTM is constituted of special (hidden) units called memory blocks (Fig. 1c). A memory block is composed of four elements: (i) a memory cell c (i.e., a neuron) with a self-connection, (ii) an input gate i to control the flow of the input signal into the neuron, (iii) an output gate o to control the effect of the neuron activation on other neurons, and (iv) a forget gate f to allow the neuron to adaptively reset its current state through the self-connection. The following sequence of equations describes how a layer of memory blocks is updated at every time step t:

i_t = \sigma(U_i h_{t-1} + V_i x_t + C_i c_{t-1} + b_i)    (5)
f_t = \sigma(U_f h_{t-1} + V_f x_t + C_f c_{t-1} + b_f)    (6)
c_t = i_t \odot g(U_c h_{t-1} + V_c x_t + b_c) + f_t \odot c_{t-1}    (7)
o_t = \sigma(U_o h_{t-1} + V_o x_t + C_o c_t + b_o)    (8)
h_t = o_t \odot h(c_t)    (9)

where U_k, V_k and C_k are the weight matrices between two consecutive hidden layers, between the input and the hidden layers, and between two consecutive cell activations, respectively, associated with gate k (i.e., input, output, forget and cell), and b_k is the associated bias vector. The symbol \odot denotes an element-wise product of two vectors. The gate function \sigma is the sigmoid activation, and g and h are the cell input and cell output activations, typically tanh. LSTMs are generally trained using truncated or full BPTT.
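A rough NumPy rendering of one memory-block update, equations (5)-(9), follows. The parameter dictionary P and the use of full matrices for the cell-to-gate weights C_* are assumptions of this sketch, not the released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One memory-block update, equations (5)-(9).
    P maps parameter names (U_i, V_i, C_i, b_i, ...) to NumPy arrays;
    g and h are both tanh here, matching the text."""
    i_t = sigmoid(P["U_i"] @ h_prev + P["V_i"] @ x_t + P["C_i"] @ c_prev + P["b_i"])   # (5)
    f_t = sigmoid(P["U_f"] @ h_prev + P["V_f"] @ x_t + P["C_f"] @ c_prev + P["b_f"])   # (6)
    c_t = i_t * np.tanh(P["U_c"] @ h_prev + P["V_c"] @ x_t + P["b_c"]) + f_t * c_prev  # (7)
    o_t = sigmoid(P["U_o"] @ h_prev + P["V_o"] @ x_t + P["C_o"] @ c_t + P["b_o"])      # (8)
    h_t = o_t * np.tanh(c_t)                                                           # (9)
    return h_t, c_t
```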
3.4 Bidirectionality

Notice that the RNNs defined above only get information from the past. However, information from the future could also be crucial. In our example sentence (Table 1), to correctly tag the word hard as B-TARG, it is beneficial for the RNN to know that the next word is disk. Our window-based approach, by considering the neighboring words, already captures short-term dependencies like this from the future. However, it requires tuning to find the right window size, and it disregards long-range dependencies that go beyond the context window, which is typically of size 1 (i.e., no context) to 5 (see Section 5.2). For instance, consider the two sentences: (i) "Try the crunchy tuna, it is to die for." and (ii) "Try the crunchy tuna, it is local." The phrase crunchy tuna is an aspect term in the first sentence, but not in the second. The RNN models described above will assign the same labels to the words crunchy and tuna in both sentences, since the preceding sequences and the context window (of size 1 to 5) are the same.

To capture long-range dependencies from the future as well as from the past, we propose to use bidirectional RNNs (Schuster and Paliwal, 1997), which allow bidirectional links in the network. In an Elman-type bidirectional RNN (Fig. 2a), the forward hidden layer \overrightarrow{h}_t and the backward hidden layer \overleftarrow{h}_t at time t are computed as follows:

\overrightarrow{h}_t = f(\overrightarrow{U} \overrightarrow{h}_{t-1} + \overrightarrow{V} x_t + \overrightarrow{b})    (10)
\overleftarrow{h}_t = f(\overleftarrow{U} \overleftarrow{h}_{t+1} + \overleftarrow{V} x_t + \overleftarrow{b})    (11)

where \overrightarrow{U}, \overrightarrow{V} and \overrightarrow{b} are the forward weight matrices and bias as before, and \overleftarrow{U}, \overleftarrow{V} and \overleftarrow{b} are their backward counterparts. The concatenated vector h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t] is passed to the output layer. We can thus interpret h_t as an intermediate representation summarizing the past and the future, which is then used to make a final decision on the current input.
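The forward and backward passes can be sketched as two independent recurrences whose states are concatenated per token, as in equations (10)-(11). The helper below is an illustration with assumed step functions, not the authors' code.

```python
import numpy as np

def bidirectional_states(xs, step_fwd, step_bwd, h0_f, h0_b):
    """Run independent forward and backward recurrences over the context
    vectors xs and concatenate their states per position (eqs. 10-11).

    step_fwd / step_bwd: functions (x_t, h_prev) -> h_t, each with its own weights.
    """
    T = len(xs)
    fwd, bwd = [None] * T, [None] * T
    h = h0_f
    for t in range(T):                      # left-to-right pass
        h = step_fwd(xs[t], h)
        fwd[t] = h
    h = h0_b
    for t in reversed(range(T)):            # right-to-left pass
        h = step_bwd(xs[t], h)
        bwd[t] = h
    # h_t = [h_fwd; h_bwd] is what the output layer sees at each position.
    return [np.concatenate([fwd[t], bwd[t]]) for t in range(T)]
```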

[Figure 2: (a) Bidirectional Elman-type RNN, and (b) linguistic features concatenated with the hidden layer output in an Elman-type RNN.]

Similarly, the unidirectional LSTM RNN can be extended to a bidirectional LSTM by allowing bidirectional connections in the hidden layer. This amounts to having a backward counterpart for each of the equations 5 to 9. Notice that the forward and the backward computations of bidirectional RNNs are done independently until they are combined in the output layer. This means that, during training, after backpropagating the errors from the output layer to the forward and to the backward hidden layers, two independent BPTT passes can be applied, one for each direction.

3.5 Fine-tuning of Embeddings

In our RNN framework, we intend to avoid manual feature engineering efforts by using word embeddings as the only features. As mentioned before, we can initialize the embeddings randomly and learn them as part of the model parameters by backpropagating the errors to the look-up layer. One issue with random initialization is that it may lead SGD to get stuck in local minima (Murphy, 2012). On the other hand, one can plug the readily available embeddings from external sources (Section 4) into the RNN model and use them as features without tuning them further for the task, as is done in any other machine learning model. However, this approach does not exploit the automatic feature learning capability of NN models, which is one of the main motivations for using them. In our work, we use the pre-trained word embeddings to better initialize our models, and we fine-tune them for our task during training, which turns out to be quite beneficial (see Section 5.2).

3.6 Incorporating other Linguistic Features

Although NNs learn word features (i.e., embeddings) automatically, we may still be interested in incorporating other linguistic features like part-of-speech (POS) tags and chunk information to guide the training and learn a better model. However, unlike word embeddings, we want these features to be fixed during training. As shown in Figure 2b, this can be done in our RNNs by feeding these additional features directly to the output layer and learning their associated weights in training.

4 Word Embeddings

Word embeddings are distributed representations of words, represented as real-valued, dense, and low-dimensional vectors. Each dimension potentially describes syntactic or semantic properties of the word. Here we briefly describe the three types of pre-trained embeddings that we use in our work.

4.1 SENNA Embeddings

Collobert et al. (2011) present a unified NN architecture for various NLP tasks (e.g., POS tagging, chunking, semantic role labeling, named entity recognition) with a window-based approach and a sentence-based approach (i.e., the input layer is a sentence). Each word in the input layer is represented by M features, each of which has an embedding vector associated with it in a lookup table. To give their network a better initialization, they learn word embeddings using a non-probabilistic language model, which was trained on English Wikipedia for about 2 months. They released their 50-dimensional word embeddings (vocabulary size 130K) under the name SENNA. [3]

[3] http://ronan.collobert.com/senna/
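Connecting Sections 3.5 and 4, a look-up table can be initialized from a pre-trained embedding file and then fine-tuned as an ordinary model parameter. The sketch below assumes a plain text file with one word and its vector per line and reuses the U(-0.2, 0.2) range described later for words without a pre-trained vector; both choices are assumptions of this illustration rather than the authors' procedure.

```python
import numpy as np

def build_lookup_table(vocab, embedding_path, dim, seed=0):
    """Initialize L (|V| x dim): pre-trained vectors where available,
    small random vectors otherwise (e.g., for UNKNOWN and PADDING)."""
    rng = np.random.default_rng(seed)
    pretrained = {}
    with open(embedding_path, encoding="utf-8") as f:
        for line in f:                       # assumed format: word v1 v2 ... vD
            parts = line.rstrip().split()
            if len(parts) == dim + 1:
                pretrained[parts[0]] = np.array(parts[1:], dtype=np.float64)
    L = rng.uniform(-0.2, 0.2, size=(len(vocab), dim))
    hits = 0
    for word, idx in vocab.items():
        if word in pretrained:
            L[idx] = pretrained[word]
            hits += 1
    print(f"initialized {hits}/{len(vocab)} rows from pre-trained embeddings")
    return L   # L remains a trainable parameter, so it can be fine-tuned during training
```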
4.2 Google Embeddings

Mikolov et al. (2013a) propose two simple log-linear models for computing word embeddings from large corpora efficiently: (i) a bag-of-words model, CBOW, that predicts the current word based on the context words, and (ii) a skip-gram model that predicts the surrounding words given the current word.

They released their pre-trained 300-dimensional word embeddings (vocabulary size 3M) trained with the skip-gram model on part of the Google News dataset containing about 100 billion words. [4]

4.3 Amazon Embeddings

Since we work on customer reviews, which are less formal than Wikipedia and news, we have also trained domain-specific embeddings (vocabulary size 1M) using the CBOW model of the word2vec tool (Mikolov et al., 2013b) from a large corpus of Amazon reviews. [5] The corpus contains 34,686,770 reviews (4.7B words) on Amazon products from June 1995 to March 2013 (McAuley and Leskovec, 2013). For comparison with SENNA and Google, we learn word embeddings of 50 and 300 dimensions using the word2vec tool.

[4] https://code.google.com/p/word2vec/
[5] https://snap.stanford.edu/data/web-amazon.html

5 Experiments

In this section, we present our experimental settings and results for the task of opinion target extraction from customer reviews.

5.1 Experimental Settings

Datasets: In our experiments, we use the two review datasets provided by the SemEval-2014 Task 4: Aspect-Based Sentiment Analysis evaluation campaign (Pontiki et al., 2014), namely the Laptop and the Restaurant datasets. Table 2 shows some basic statistics about the datasets. The majority of aspect terms have only one word, while about one third of them have multiple words. In both datasets, some sentences have no aspect terms and some have more than one aspect term. We use the standard train:test split to compare our results with the SemEval best systems. In addition, we show a more general performance of our models on the two datasets based on 10-fold cross validation.

Table 2: Corpora statistics: numbers of sentences, sentence lengths, one-word targets, multi-word targets and total targets for the train and test portions of the Laptop and Restaurant datasets.

Evaluation Metric: The evaluation metric measures the standard precision, recall and F1 score based on exact matches. This means that a candidate aspect term is considered correct only if it exactly matches the aspect term annotated by the human annotator. In all our experiments, when comparing two models we use a paired t-test on the F1 scores to measure statistical significance and report the corresponding p-value.

CRF Baseline: We use a linear-chain CRF (Lafferty et al., 2001) of order 2 as our baseline, which is the state-of-the-art model for opinion target extraction (Pontiki et al., 2014). Specifically, the CRF generates (binary) feature functions of order 1 and 2; see (Cuong et al., 2014) for higher order CRFs. The features used in the baseline model include the current word, its POS tag, its prefixes and suffixes of one to four characters, its position, its stylistics (e.g., case, digit, symbol, alphanumeric), and its context (i.e., the same features for the two preceding and the two following words). In addition to the hand-crafted features, we also include the three different types of word embeddings described in Section 4.
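The baseline's feature set could be generated by a per-token feature function along the following lines; the exact feature templates and the CRF toolkit (sklearn-crfsuite is assumed here) may differ from what the authors used.

```python
def token_features(words, pos_tags, t):
    """Features for the token at position t, following the baseline description:
    word, POS tag, prefixes/suffixes of length 1-4, position, simple stylistics,
    and the same features for the two preceding and two following words."""
    def basic(i, prefix):
        w = words[i]
        feats = {
            f"{prefix}word": w.lower(),
            f"{prefix}pos": pos_tags[i],
            f"{prefix}is_title": w.istitle(),
            f"{prefix}is_digit": w.isdigit(),
            f"{prefix}is_alnum": w.isalnum(),
        }
        for n in range(1, 5):                       # prefixes and suffixes, length 1-4
            feats[f"{prefix}prefix{n}"] = w[:n].lower()
            feats[f"{prefix}suffix{n}"] = w[-n:].lower()
        return feats

    feats = basic(t, "")
    feats["position"] = t
    for offset in (-2, -1, 1, 2):                   # context of two words on each side
        i = t + offset
        if 0 <= i < len(words):
            feats.update(basic(i, f"{offset:+d}:"))
    return feats

# One dictionary per token can be fed to a linear-chain CRF toolkit
# (e.g., sklearn-crfsuite); the toolkit choice is an assumption of this sketch.
```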
RNN Settings: We pre-processed each dataset by lowercasing all words and spelling out each digit number as DIGIT. We then built the vocabulary from the training set by marking rare words with only one occurrence as UNKNOWN, and adding a PADDING word to make context windows for boundary words. To implement early stopping in SGD, we prepared a validation set by randomly separating out 10% of the available training data. The remaining 90% is used for training. The weights in the network were initialized by sampling from a small random uniform distribution U(-0.2, 0.2). The time step in the truncated BPTT was fixed to 6 based on the performance on the validation set; smaller values hurt the performance, while larger values showed no significant gains but increased the training time. We use a fixed learning rate of 0.01, but we change the batch size depending on the sentence length, following Mesnil et al. (2013). The net effect is a variable step size in SGD. We run SGD for 40 epochs, calculate the F1 score on the validation set after each epoch, and stop if the accuracy starts to decrease. The size of the context window and the hidden layer are set empirically based on the performance on the validation set. We experimented with window sizes {1, 3, 5}, and found 3 to be optimal on the validation set. The hidden layer sizes we experimented with are 50, 100, 150, 200, 250 and 300; we report the optimal values in Table 3 (see the h_l and h_r columns).
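The preprocessing and context-window construction just described might look roughly as follows. The DIGIT, UNKNOWN and PADDING tokens follow the text; the digit-matching rule is an assumption of this sketch.

```python
import re
from collections import Counter

def normalize(word):
    word = word.lower()
    # assumed rule for "spelling out each digit number as DIGIT"
    return "DIGIT" if re.fullmatch(r"\d+", word) else word

def build_vocab(train_sentences):
    """Vocabulary from the training set; words seen only once become UNKNOWN."""
    counts = Counter(normalize(w) for sent in train_sentences for w in sent)
    vocab = {"PADDING": 0, "UNKNOWN": 1}
    for word, c in counts.items():
        if c > 1:
            vocab[word] = len(vocab)
    return vocab

def context_windows(sentence, vocab, m=3):
    """Index sequences for context windows of size m (3 was found optimal above)."""
    ids = [vocab.get(normalize(w), vocab["UNKNOWN"]) for w in sentence]
    half = m // 2
    padded = [vocab["PADDING"]] * half + ids + [vocab["PADDING"]] * half
    return [padded[t:t + m] for t in range(len(ids))]

# Example:
# vocab = build_vocab([["The", "hard", "disk", "is", "very", "noisy"]])
# windows = context_windows(["The", "hard", "disk", "is", "very", "noisy"], vocab)
```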

Linguistic Features in RNNs: In addition to the neural features, we also explore the contribution of simple linguistic features in our RNN models, using the architecture described in Section 3.6. Specifically, we encode four POS tag classes (noun, adjective, verb, adverb) and BIO-tagged chunk information (NP, VP, PP, ADJP, ADVP) as binary features that are directly fed to the output layer of the RNNs (a small sketch of this encoding follows Table 3 below). Part-of-speech and phrasal information are arguably the most informative features for identifying aspect terms (i.e., aspect terms are generally noun phrases). BIO tags could be useful for finding the right text spans (i.e., aspect terms are unlikely to violate phrasal boundaries).

5.2 Results and Discussion

Table 3 presents our results for aspect term extraction on the standard test set in F1 scores. In Table 4, we show the results on the whole datasets based on 10-fold cross validation. The RNNs in Table 4 are trained using SENNA embeddings. We perform significance tests on the 10-fold results. In the following, we highlight our main findings.

Contributions of Word Embeddings in CRF: From the first group of results in Table 3, we can observe that even though the CRF uses a handful of hand-designed features, including word embeddings still leads to sizable improvements on both datasets. The domain-specific Amazon embeddings (300 dim.) yield more general performance across the datasets, delivering the best gain of an absolute 3.54% on the Laptop dataset and the second best on the Restaurant dataset. Google embeddings give the best gain on the Restaurant dataset (an absolute 3.08%). The contribution of embeddings in the CRF is also validated by the 10-fold results in Table 4 (see the first two rows), where SENNA embeddings yield significant improvements: an absolute 1.47% on Laptop (p < 0.03) and an absolute 1.24% on Restaurant (p < 0.01). This demonstrates that word embeddings offer generalizations that complement other strong features, and thus should be considered.

Table 3: F1-score performance for the CRF baselines, the RNNs and the SemEval'14 best systems on the standard Laptop and Restaurant test sets (columns: embedding dimension, number of hidden units h_l for Laptop and h_r for Restaurant, and the F1 scores). The systems compared are the CRF baseline and the CRF with SENNA, Amazon and Google embeddings; the Jordan-RNN, Elman-RNN, Elman-RNN + Feat., Bi-Elman-RNN, Bi-Elman-RNN + Feat., LSTM-RNN, LSTM-RNN + Feat., Bi-LSTM-RNN and Bi-LSTM-RNN + Feat., each with SENNA, Amazon and Google embeddings; and the SemEval-14 top systems IHS RD and DLIREC.
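Concretely, the feature-augmented output layer of Section 3.6 (Figure 2b) gives the fixed binary features their own output weights, which is equivalent to concatenating them with the hidden state before the softmax. The sketch below is an illustration; in particular, the encoding of the chunk tags with separate B-/I- indicators is an assumption rather than the authors' exact scheme.

```python
import numpy as np

POS_CLASSES = ["noun", "adjective", "verb", "adverb"]
CHUNKS = ["B-NP", "I-NP", "B-VP", "I-VP", "B-PP", "I-PP",
          "B-ADJP", "I-ADJP", "B-ADVP", "I-ADVP"]

def linguistic_feature_vector(pos_class, chunk_tag):
    """Binary encoding of the POS class and BIO chunk tag of the current token.
    These features stay fixed during training; only their output weights are learned."""
    v = np.zeros(len(POS_CLASSES) + len(CHUNKS))
    if pos_class in POS_CLASSES:
        v[POS_CLASSES.index(pos_class)] = 1.0
    if chunk_tag in CHUNKS:
        v[len(POS_CLASSES) + CHUNKS.index(chunk_tag)] = 1.0
    return v

def output_with_features(h_t, feat_t, W_h, W_f):
    """Output layer of Fig. 2b: the hidden state and the feature vector each get
    their own weight matrix and are combined before the softmax."""
    z = W_h @ h_t + W_f @ feat_t
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()
```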

Table 4: 10-fold cross validation results (precision, recall and F1) of the models on the two datasets. The models compared are CRF Base and CRF +SENNA, and the Elman-RNN and LSTM-RNN together with their +Feat., +Bidir. and +Feat.+Bidir. variants. The Elman- and LSTM-RNNs are trained using SENNA embeddings.

CRF vs. RNNs: When we compare the results of the RNNs with those of the CRF in Table 3, we see that all our RNN models outperform the CRF models, with maximum absolute gains of 5.18% by Bi-LSTM-RNN+Feat. on Laptop and 4.49% by LSTM-RNN+Feat. on Restaurant. What is remarkable is that RNNs without any hand-crafted features outperform feature-rich CRF models by a good margin: maximum absolute gains of 4.65% on Laptop and 4.06% on Restaurant by the LSTM-RNN. When we compare their general performance on the 10 folds in Table 4, we observe similar gains, at a maximum of 10.84% on Laptop and 1.97% on Restaurant, which are statistically significant on both datasets. These results demonstrate that RNNs as sequence labelers are more effective than CRFs for fine-grained opinion mining tasks. This can be attributed to the RNN's ability to learn better features automatically and to capture long-range sequential dependencies between the output labels.

Comparison among RNN Models: A comparison among the RNN models in Table 3 tells us that the Elman RNN generally performs better than the Jordan RNN, and LSTM generally outperforms Elman, delivering maximum gains of 0.26% on Laptop and 1.47% on Restaurant. This is also consistent on the 10-fold Laptop dataset (Table 4), where LSTM significantly outperforms Elman with a maximum absolute gain of 4.96%. This gain could be attributed to LSTM's ability to capture long-range dependencies.

When we compare the uni-directional RNNs with their bi-directional counterparts, we do not see any gain for the bi-directional ones. In fact, bidirectionality hurts both Elman and LSTM RNNs. This finding contrasts with the finding of Irsoy and Cardie (2014) on the opinion expression detection task, where bi-directional Elman RNNs outperform their uni-directional counterparts. However, when we analyzed the data, we found this to be unsurprising, because aspect terms are generally shorter than opinion expressions. For example, the average length of an aspect term in our Restaurant dataset is 1.4, whereas the average length of an expressive subjective expression in their MPQA corpus is 3.3. Therefore, the information required to correctly identify aspect terms (e.g., hard disk) is already captured by the unidirectional link and the context window covering the neighboring words. Bi-directional links double the number of parameters in the RNNs, which might contribute to overfitting on this specific task. As a partial solution to this problem, we experimented with a bi-directional Elman-RNN where both directions share the same parameters, so that the number of parameters remains the same as in the uni-directional one. This modification improves the performance over the non-shared one slightly but not significantly. This calls for better modeling of the two sources of information rather than simple concatenation or parameter sharing.

Contributions of Linguistic Features in RNNs: Although our linguistic features are quite simple (i.e., POS tags and chunk information), they give gains on both datasets when incorporated into the Elman and LSTM RNNs. The maximum gains on the standard test set (Table 3) are 0.87% on Laptop and 1.05% on Restaurant for Elman, and 0.28% on Laptop and 0.43% on Restaurant for LSTM.
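The paired t-tests reported throughout this section compare the per-fold F1 scores of two models over the same 10 folds; a minimal sketch using SciPy (an assumed tool choice) on hypothetical scores is:

```python
from scipy import stats

# Hypothetical per-fold F1 scores for two models over the same 10 folds
# (illustrative numbers only, not results from the paper).
f1_crf  = [71.2, 70.5, 72.0, 69.8, 71.5, 70.9, 72.3, 70.1, 71.8, 70.7]
f1_lstm = [74.1, 73.2, 75.0, 72.5, 74.4, 73.8, 75.1, 72.9, 74.6, 73.5]

t_stat, p_value = stats.ttest_rel(f1_lstm, f1_crf)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")
```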

Similar gains are also observed on the 10 folds in Table 4, where the maximum gains are 1.3% on Laptop and 1.4% on Restaurant. These gains are statistically significant on Laptop and significant with p < 0.04 on Restaurant. Linguistic features thus complement word embeddings in RNNs.

Importance of Fine-tuning in RNNs: Finally, in order to show the importance of fine-tuning the word embeddings in RNNs on our task, we present in Table 5 the performance of the Elman and Jordan RNNs when the embeddings are used as they are (-tune) and when they are fine-tuned (+tune) on the task. The table also shows the contributions of pre-trained embeddings as compared to random initialization. Surprisingly, Amazon embeddings without fine-tuning deliver the worst performance, even lower than the random initialization. We found that with Amazon embeddings the network gets stuck in a local minimum from the very first epoch. Other pre-trained (untuned) embeddings improve over the Amazon and random initializations by providing a better starting point. In most cases fine-tuning makes a big difference. For example, the absolute gains for fine-tuning SENNA embeddings in the Elman RNN are 17.93% on Laptop and 10.83% on Restaurant. Remarkably, fine-tuning brings both the Random and Amazon embeddings close to the best ones.

Table 5: Effects of fine-tuning in the Elman-RNN and Jordan-RNN: performance on Laptop and Restaurant with the embeddings used as they are (-tune) and fine-tuned (+tune), for the SENNA, Amazon, Random and Google initializations.

Comparison with SemEval-2014 Systems: When our RNN results are compared with the top performing systems in SemEval-2014 (last two rows in Table 3), we see that RNNs without using any linguistic features outperform the best system (IHS RD) on Laptop, with absolute differences of 1.70% for Elman and 2.30% for LSTM. LSTM without features already outperforms the best system (DLIREC) on the Restaurant dataset by an absolute gain of 0.41%. Note that these RNNs only use word embeddings, while IHS RD and DLIREC use complex features like dependency relations, named entities, sentiment orientation of words, word clusters and many more in their CRF models, most of which are expensive to compute; see (Toh and Wang, 2014; Chernyshevich, 2014). The performance of our RNNs improves when they are given access to very simple features like POS tags and chunks, and LSTM is then able to outperform DLIREC on the Restaurant dataset with an absolute gain of 0.84%.

6 Conclusion and Future Direction

We presented a general class of discriminative models based on the recurrent neural network (RNN) architecture and word embeddings that can be successfully applied to fine-grained opinion mining tasks without any task-specific manual feature engineering effort. We used pre-trained word embeddings from three external sources in different RNN architectures including Elman-type, Jordan-type, LSTM and their several variations. Our results on the opinion target extraction task demonstrate that word embeddings improve the performance of both CRF and RNN models; however, fine-tuning them in RNNs on the task gives the best results. RNNs significantly outperform CRFs, even when they use word embeddings as the only features. Incorporating simple linguistic features into RNNs improves the performance further. Our best results with the LSTM RNN outperform the top performing systems in SemEval'14. We made our code publicly available. In the future, we would like to apply our models to other fine-grained opinion mining tasks, including opinion expression detection and characterizing the intensity and sentiment of opinion expressions. We would also like to explore to what extent these tasks can be jointly modeled in an RNN-based multi-task learning framework.
Acknowledgments

We are grateful to the anonymous reviewers for their insightful comments and suggestions to improve the paper. This research is affiliated with the Big Data Decision Analytics Research Center of The Chinese University of Hong Kong.

References

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2).

Eric Breck, Yejin Choi, and Claire Cardie. 2007. Identifying expressions of opinion in context. In Proceedings of the 20th International Joint Conference on Artificial Intelligence. Morgan Kaufmann Publishers Inc.

Maryna Chernyshevich. 2014. IHS R&D Belarus: Cross-domain extraction of product features using conditional random fields. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Yejin Choi and Claire Cardie. 2010. Hierarchical sequential learning for extracting opinions and their attributes. In Proceedings of the ACL 2010 Conference Short Papers. ACL.

Yejin Choi, Claire Cardie, Ellen Riloff, and Siddharth Patwardhan. 2005. Identifying sources of opinions with conditional random fields and extraction patterns. In Proceedings of HLT/EMNLP. ACL.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of ICML. ACM.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12.

Nguyen Viet Cuong, Nan Ye, Wee Sun Lee, and Hai Leong Chieu. 2014. Conditional random field with high-order dependencies for sequence labeling and segmentation. The Journal of Machine Learning Research, 15(1).

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2).

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of ICML.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8).

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of SIGKDD. ACM.

Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of EMNLP.

Wei Jin, Hung Hay Ho, and Rohini K. Srihari. 2009. A novel lexicalized HMM-based learning framework for web opinion mining. In Proceedings of ICML.

Michael I. Jordan. 1997. Serial order: A parallel distributed processing approach. Advances in Psychology, 121.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML.

Phong Le and Willem Zuidema. 2015. Compositional distributional semantics with long short term memory. In Proceedings of the Joint Conference on Lexical and Computational Semantics (*SEM).

Rémi Lebret and Ronan Lebret. 2013. Word embeddings through Hellinger PCA. arXiv preprint.

Chenghua Lin and Yulan He. 2009. Joint sentiment/topic model for sentiment analysis. In Proceedings of CIKM. ACM.

Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems. ACM.

Grégoire Mesnil, Xiaodong He, Li Deng, and Yoshua Bengio. 2013. Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In Proceedings of INTERSPEECH.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of INTERSPEECH.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b.
Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Tomas Mikolov. 2012. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of Technology.

Samaneh Moghaddam and Martin Ester. 2012. On the design of LDA models for aspect-based opinion mining. In Proceedings of CIKM. ACM.

Kevin Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

Maria Pontiki, Haris Papageorgiou, Dimitrios Galanis, Ion Androutsopoulos, John Pavlopoulos, and Suresh Manandhar. 2014. SemEval-2014 Task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Bishan Yang and Claire Cardie. 2013. Joint inference for fine-grained opinion extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. ACL.

Hasim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of INTERSPEECH.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11).

Shabnam Shariaty and Samaneh Moghaddam. 2011. Fine-grained opinion mining using conditional random fields. In International Conference on Data Mining Workshops (ICDMW). IEEE.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Proceedings of INTERSPEECH.

Ivan Titov and Ryan McDonald. 2008. Modeling online reviews with multi-grain topic models. In Proceedings of WWW. ACM.

Zhiqiang Toh and Wenting Wang. 2014. DLIREC: Aspect term extraction and term polarity classification system. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the ACL. ACL.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3).

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of HLT/EMNLP. ACL.

Bishan Yang and Claire Cardie. 2012. Extracting opinion expressions with semi-Markov conditional random fields. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. ACL.


20. Applications of the Genetic-Drift Model

20. Applications of the Genetic-Drift Model 0. Applicaions of he Geneic-Drif Model 1) Deermining he probabiliy of forming any paricular combinaion of genoypes in he nex generaion: Example: If he parenal allele frequencies are p 0 = 0.35 and q 0

More information

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes Represening Periodic Funcions by Fourier Series 3. Inroducion In his Secion we show how a periodic funcion can be expressed as a series of sines and cosines. We begin by obaining some sandard inegrals

More information

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates Biol. 356 Lab 8. Moraliy, Recruimen, and Migraion Raes (modified from Cox, 00, General Ecology Lab Manual, McGraw Hill) Las week we esimaed populaion size hrough several mehods. One assumpion of all hese

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Probabilisic reasoning over ime So far, we ve mosly deal wih episodic environmens Excepions: games wih muliple moves, planning In paricular, he Bayesian neworks we ve seen so far describe

More information

A Specification Test for Linear Dynamic Stochastic General Equilibrium Models

A Specification Test for Linear Dynamic Stochastic General Equilibrium Models Journal of Saisical and Economeric Mehods, vol.1, no.2, 2012, 65-70 ISSN: 2241-0384 (prin), 2241-0376 (online) Scienpress Ld, 2012 A Specificaion Tes for Linear Dynamic Sochasic General Equilibrium Models

More information

DATA DRIVEN DONOR MANAGEMENT: LEVERAGING RECENCY & FREQUENCY IN DONATION BEHAVIOR

DATA DRIVEN DONOR MANAGEMENT: LEVERAGING RECENCY & FREQUENCY IN DONATION BEHAVIOR DATA DRIVEN DONOR MANAGEMENT: LEVERAGING RECENCY & FREQUENCY IN DONATION BEHAVIOR Peer Fader Frances and Pei Yuan Chia Professor of Markeing Co Direcor, Wharon Cusomer Analyics Iniiaive The Wharon School

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

SOLUTIONS TO ECE 3084

SOLUTIONS TO ECE 3084 SOLUTIONS TO ECE 384 PROBLEM 2.. For each sysem below, specify wheher or no i is: (i) memoryless; (ii) causal; (iii) inverible; (iv) linear; (v) ime invarian; Explain your reasoning. If he propery is no

More information

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization A Forward-Backward Spliing Mehod wih Componen-wise Lazy Evaluaion for Online Srucured Convex Opimizaion Yukihiro Togari and Nobuo Yamashia March 28, 2016 Absrac: We consider large-scale opimizaion problems

More information

Introduction D P. r = constant discount rate, g = Gordon Model (1962): constant dividend growth rate.

Introduction D P. r = constant discount rate, g = Gordon Model (1962): constant dividend growth rate. Inroducion Gordon Model (1962): D P = r g r = consan discoun rae, g = consan dividend growh rae. If raional expecaions of fuure discoun raes and dividend growh vary over ime, so should he D/P raio. Since

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

Notes for Lecture 17-18

Notes for Lecture 17-18 U.C. Berkeley CS278: Compuaional Complexiy Handou N7-8 Professor Luca Trevisan April 3-8, 2008 Noes for Lecure 7-8 In hese wo lecures we prove he firs half of he PCP Theorem, he Amplificaion Lemma, up

More information

Presentation Overview

Presentation Overview Acion Refinemen in Reinforcemen Learning by Probabiliy Smoohing By Thomas G. Dieerich & Didac Busques Speaer: Kai Xu Presenaion Overview Bacground The Probabiliy Smoohing Mehod Experimenal Sudy of Acion

More information

Module 2 F c i k c s la l w a s o s f dif di fusi s o i n

Module 2 F c i k c s la l w a s o s f dif di fusi s o i n Module Fick s laws of diffusion Fick s laws of diffusion and hin film soluion Adolf Fick (1855) proposed: d J α d d d J (mole/m s) flu (m /s) diffusion coefficien and (mole/m 3 ) concenraion of ions, aoms

More information

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II Roland Siegwar Margaria Chli Paul Furgale Marco Huer Marin Rufli Davide Scaramuzza ETH Maser Course: 151-0854-00L Auonomous Mobile Robos Localizaion II ACT and SEE For all do, (predicion updae / ACT),

More information

T L. t=1. Proof of Lemma 1. Using the marginal cost accounting in Equation(4) and standard arguments. t )+Π RB. t )+K 1(Q RB

T L. t=1. Proof of Lemma 1. Using the marginal cost accounting in Equation(4) and standard arguments. t )+Π RB. t )+K 1(Q RB Elecronic Companion EC.1. Proofs of Technical Lemmas and Theorems LEMMA 1. Le C(RB) be he oal cos incurred by he RB policy. Then we have, T L E[C(RB)] 3 E[Z RB ]. (EC.1) Proof of Lemma 1. Using he marginal

More information

Christos Papadimitriou & Luca Trevisan November 22, 2016

Christos Papadimitriou & Luca Trevisan November 22, 2016 U.C. Bereley CS170: Algorihms Handou LN-11-22 Chrisos Papadimiriou & Luca Trevisan November 22, 2016 Sreaming algorihms In his lecure and he nex one we sudy memory-efficien algorihms ha process a sream

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

Stability and Bifurcation in a Neural Network Model with Two Delays

Stability and Bifurcation in a Neural Network Model with Two Delays Inernaional Mahemaical Forum, Vol. 6, 11, no. 35, 175-1731 Sabiliy and Bifurcaion in a Neural Nework Model wih Two Delays GuangPing Hu and XiaoLing Li School of Mahemaics and Physics, Nanjing Universiy

More information

Lecture 3: Exponential Smoothing

Lecture 3: Exponential Smoothing NATCOR: Forecasing & Predicive Analyics Lecure 3: Exponenial Smoohing John Boylan Lancaser Cenre for Forecasing Deparmen of Managemen Science Mehods and Models Forecasing Mehod A (numerical) procedure

More information

On Multicomponent System Reliability with Microshocks - Microdamages Type of Components Interaction

On Multicomponent System Reliability with Microshocks - Microdamages Type of Components Interaction On Mulicomponen Sysem Reliabiliy wih Microshocks - Microdamages Type of Componens Ineracion Jerzy K. Filus, and Lidia Z. Filus Absrac Consider a wo componen parallel sysem. The defined new sochasic dependences

More information

Deep Multi-Task Learning with Shared Memory

Deep Multi-Task Learning with Shared Memory Deep Muli-Task Learning wih Shared Memory Pengfei Liu Xipeng Qiu Xuanjing Huang Shanghai Key Laboraory of Inelligen Informaion Processing, Fudan Universiy School of Compuer Science, Fudan Universiy 825

More information

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H. ACE 56 Fall 005 Lecure 5: he Simple Linear Regression Model: Sampling Properies of he Leas Squares Esimaors by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Inference in he Simple

More information

CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14

CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14 CSE/NB 58 Lecure 14: From Supervised o Reinforcemen Learning Chaper 9 1 Recall from las ime: Sigmoid Neworks Oupu v T g w u g wiui w Inpu nodes u = u 1 u u 3 T i Sigmoid oupu funcion: 1 g a 1 a e 1 ga

More information

Written HW 9 Sol. CS 188 Fall Introduction to Artificial Intelligence

Written HW 9 Sol. CS 188 Fall Introduction to Artificial Intelligence CS 188 Fall 2018 Inroducion o Arificial Inelligence Wrien HW 9 Sol. Self-assessmen due: Tuesday 11/13/2018 a 11:59pm (submi via Gradescope) For he self assessmen, fill in he self assessmen boxes in your

More information

Sequential Learning of Classifiers for Structured Prediction Problems

Sequential Learning of Classifiers for Structured Prediction Problems Sequenial Learning of Classifiers for Srucured Predicion Problems Dan Roh Dep. of Compuer Science Univ. of Illinois a U-C danr@illinois.edu Kevin Small Dep. of Compuer Science Univ. of Illinois a U-C ksmall@illinois.edu

More information

Experiments on logistic regression

Experiments on logistic regression Experimens on logisic regression Ning Bao March, 8 Absrac In his repor, several experimens have been conduced on a spam daa se wih Logisic Regression based on Gradien Descen approach. Firs, he overfiing

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

Pattern Classification (VI) 杜俊

Pattern Classification (VI) 杜俊 Paern lassificaion VI 杜俊 jundu@usc.edu.cn Ouline Bayesian Decision Theory How o make he oimal decision? Maximum a oserior MAP decision rule Generaive Models Join disribuion of observaion and label sequences

More information

m = 41 members n = 27 (nonfounders), f = 14 (founders) 8 markers from chromosome 19

m = 41 members n = 27 (nonfounders), f = 14 (founders) 8 markers from chromosome 19 Sequenial Imporance Sampling (SIS) AKA Paricle Filering, Sequenial Impuaion (Kong, Liu, Wong, 994) For many problems, sampling direcly from he arge disribuion is difficul or impossible. One reason possible

More information

Matlab and Python programming: how to get started

Matlab and Python programming: how to get started Malab and Pyhon programming: how o ge sared Equipping readers he skills o wrie programs o explore complex sysems and discover ineresing paerns from big daa is one of he main goals of his book. In his chaper,

More information

Seminar 4: Hotelling 2

Seminar 4: Hotelling 2 Seminar 4: Hoelling 2 November 3, 211 1 Exercise Par 1 Iso-elasic demand A non renewable resource of a known sock S can be exraced a zero cos. Demand for he resource is of he form: D(p ) = p ε ε > A a

More information

Tom Heskes and Onno Zoeter. Presented by Mark Buller

Tom Heskes and Onno Zoeter. Presented by Mark Buller Tom Heskes and Onno Zoeer Presened by Mark Buller Dynamic Bayesian Neworks Direced graphical models of sochasic processes Represen hidden and observed variables wih differen dependencies Generalize Hidden

More information

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8)

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8) I. Definiions and Problems A. Perfec Mulicollineariy Econ7 Applied Economerics Topic 7: Mulicollineariy (Sudenmund, Chaper 8) Definiion: Perfec mulicollineariy exiss in a following K-variable regression

More information

arxiv: v1 [cs.lg] 18 Jul 2018

arxiv: v1 [cs.lg] 18 Jul 2018 General Value Funcion Neworks Mahew Schlegel Universiy of Albera mkschleg@ualbera.ca Adam Whie Universiy of Albera amw8@ualbera.ca Andrew Paerson Indiana Universiy andnpa@indiana.edu arxiv:1807.06763v1

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

ST2352. Stochastic Processes constructed via Conditional Simulation. 09/02/2014 ST2352 Week 4 1

ST2352. Stochastic Processes constructed via Conditional Simulation. 09/02/2014 ST2352 Week 4 1 ST35 Sochasic Processes consruced via Condiional Simulaion 09/0/014 ST35 Week 4 1 Sochasic Processes consruced via Condiional Simulaion Markov Processes Simulaing Random Tex Google Sugges n grams Random

More information

Class Meeting # 10: Introduction to the Wave Equation

Class Meeting # 10: Introduction to the Wave Equation MATH 8.5 COURSE NOTES - CLASS MEETING # 0 8.5 Inroducion o PDEs, Fall 0 Professor: Jared Speck Class Meeing # 0: Inroducion o he Wave Equaion. Wha is he wave equaion? The sandard wave equaion for a funcion

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

HW6: MRI Imaging Pulse Sequences (7 Problems for 100 pts)

HW6: MRI Imaging Pulse Sequences (7 Problems for 100 pts) HW6: MRI Imaging Pulse Sequences (7 Problems for 100 ps) GOAL The overall goal of HW6 is o beer undersand pulse sequences for MRI image reconsrucion. OBJECTIVES 1) Design a spin echo pulse sequence o image

More information

Layer Trajectory LSTM

Layer Trajectory LSTM Layer Trajecory LSTM Jinyu Li, Changliang Liu, Yifan Gong Microsof AI and Research {jinyli, chanliu, ygong}@microsof.com Absrac I is popular o sack LSTM layers o ge beer modeling power, especially when

More information

1 Differential Equation Investigations using Customizable

1 Differential Equation Investigations using Customizable Differenial Equaion Invesigaions using Cusomizable Mahles Rober Decker The Universiy of Harford Absrac. The auhor has developed some plaform independen, freely available, ineracive programs (mahles) for

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information