Part-of-Speech Driven Cross-Lingual Pronoun Prediction with Feed-Forward Neural Networks


Jimmy Callin, Christian Hardmeier, Jörg Tiedemann
Department of Linguistics and Philology
Uppsala University, Sweden

Proceedings of the Second Workshop on Discourse in Machine Translation (DiscoMT), pages 59-64, Lisbon, Portugal, 17 September 2015. © 2015 Association for Computational Linguistics.

Abstract

For some language pairs, pronoun translation is a discourse-driven task which requires information that lies beyond its local context. This motivates the task of predicting the correct pronoun given a source sentence and a target translation, where the translated pronouns have been replaced with placeholders. For cross-lingual pronoun prediction, we suggest a neural network-based model using preceding nouns and determiners as features for suggesting antecedent candidates. Our model scores on par with similar models while having a simpler architecture.

1 Introduction

Most modern statistical machine translation (SMT) systems use context for translation; the meaning of a word is more often than not ambiguous, and can only be decoded through its usage. That said, context use in modern SMT still mostly assumes that sentences are independent of one another, and dependencies between sentences are simply ignored. While today's popular SMT systems could use features from previous sentences in the source text, translated sentences within a document have up to this point rarely been included.

Hardmeier and Federico (2010) argue that SMT research has become mature enough to stop assuming sentence independence, and start to incorporate features beyond the sentence boundary. Languages with gender-marked pronouns introduce certain difficulties, since the choice of pronoun is determined by the gender of its antecedent. Picking the wrong third-person pronoun might seem like a relatively minor error, especially if present in an otherwise comprehensible translation, but could potentially produce misunderstandings. Take the following English sentences:

The monkey ate the banana because it was hungry.
The monkey ate the banana because it was ripe.
The monkey ate the banana because it was tea-time.

"It" in each of these three cases references something different: either the monkey, the banana, or the abstract notion of time. If we were to translate these sentences to German, we would have to consciously make decisions whether it should be in masculine (er, referring to the monkey), feminine (sie, referring to the banana), or neuter (es, referring to the time) (Mitkov et al., 1995). While these examples use a local dependency, the antecedent of it could just as easily have been one or several sentences away, which would have put the necessary translation features out of reach for sentence-based SMT decoders.

2 Related work

Most of the work in anaphora resolution for machine translation has been done in the paradigm of rule-based MT, while the topic has gained little interest within SMT (Hardmeier and Federico, 2010; Mitkov, 1999). One of the first examples of using discourse analysis for pronoun translation in SMT was done by Le Nagard and Koehn (2010), who use coreference resolution to predict the antecedents in the source language as features in a standard SMT system. While they saw score improvements in pronoun prediction, they claim the bad performance of the coreference resolution seriously impacted the results negatively. They performed this as a post-processing step, which seems to be primarily for practical reasons since most popular SMT frameworks such as Moses (Koehn et al., 2007) do not provide previous target translations for use as features.

Guillou (2012) tried a similar approach for English-Czech translation with little improvement even after factoring out major sources of error. Guillou singled out one possible reason for this, which is how a reasonable translation alternative of a pronoun's antecedent could affect the predicted pronoun, including the possibility of simply canceling out pronouns. E.g., "the u.s., claiming some success in its trade" could be paraphrased as "the u.s., claiming some success in trade diplomacy" without any loss in translation quality, while still affecting the score negatively. This demonstrates there is necessary linguistic information in the target translation that is not available in the source.

Hardmeier and Federico (2010) extended the phrase-based Moses decoder with a word dependency model based on existing coreference resolution systems, by parsing the output of the decoder and catching its previous translations. Unfortunately they only produced minor improvements for English-German.

In light of this, there have been attempts at considering pronoun translation a classification task separate from traditional machine translation. This could potentially lead to further insights into the nature of anaphora resolution. In this fashion a pronoun translation module could be treated as just another part of translation by discourse-oriented machine translation systems, or as a post-processing step similarly to Guillou (2012). Hardmeier et al. (2013b) introduced this task and presented a feed-forward neural network model using features from an external anaphora resolution system, BART (Broscheit et al., 2010), to infer the pronoun's antecedent candidates and use the aligned words in the target translation as input. This model was later integrated into their document-level decoder Docent (Hardmeier et al., 2013a; Hardmeier, 2014, chapter 9).

3 Task setup

The goal of cross-lingual pronoun prediction is to accurately predict the correct missing pronoun in translated text. The pronouns in focus are it and they, where the word-aligned phrases in the translation have been replaced by placeholders. The word alignment is included, and was automatically produced by GIZA++ (Och, 2003). We are also aware of document boundaries within the corpus.
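As an illustration of the task setup described above, the following minimal sketch shows how a classification instance could be derived from a word-aligned sentence pair. It is not the shared task's actual preprocessing: the placeholder token `REPLACE`, the function name `mask_pronouns`, the toy sentence pair, and the decision to skip unaligned pronouns are all assumptions made for the example.

```python
PLACEHOLDER = "REPLACE"  # hypothetical placeholder token, not taken from the task data


def mask_pronouns(source, target, alignment, pronouns=("it", "they")):
    """Replace target words aligned to source pronouns with a placeholder.

    source, target: lists of tokens for one sentence pair.
    alignment: iterable of (source_index, target_index) word-alignment pairs.
    Returns the masked target sentence and one instance per masked pronoun.
    """
    alignment = list(alignment)
    masked = list(target)
    instances = []
    for s_idx, word in enumerate(source):
        if word.lower() not in pronouns:
            continue
        t_idxs = sorted(t for s, t in alignment if s == s_idx)
        if not t_idxs:
            continue  # assumption: unaligned pronouns yield no instance
        instances.append({
            "source_index": s_idx,
            "target_indices": t_idxs,
            # the words hidden behind the placeholder form the label to predict
            "gold": " ".join(target[t] for t in t_idxs),
        })
        for t in t_idxs:
            masked[t] = PLACEHOLDER
    return masked, instances


# Toy sentence pair with a one-to-one alignment (invented for illustration).
source = ["they", "are", "hungry"]
target = ["ils", "ont", "faim"]
alignment = [(0, 0), (1, 1), (2, 2)]
masked, instances = mask_pronouns(source, target, alignment)
print(masked)
print(instances[0]["gold"])
```

The classifier then sees the masked target sentence together with the source sentence and has to recover the hidden label.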
The corpus is a set of three different English-French parallel corpora gathered from three separate domains: transcribed TED talks, Europarl (Koehn, 2005) with transcribed proceedings from the European parliament, and a set of news texts. Test data is a collection of transcribed TED talks, in total documents containing 09 sentences with a total of 05 classification problems, with a similar development set. Further details of the task setup, including final performance results, are available in Hardmeier et al. (2015).

4 Method

Inspired by the neural network architecture set up in Hardmeier et al. (2013b), we similarly propose a feed-forward neural network with a layer of word embeddings as well as an additional hidden layer for learning abstract feature representations. The final architecture, as shown in Figure 1, uses both source context and translation context around the missing pronoun, by encoding a number of word embeddings n words to the left and m words to the right (hereby referred to as having a context window size of n+m).

The main difference in our model lies in avoiding using an external anaphora resolution system to collect antecedent features. Rather, to simplify the model we simply look at the four closest previous nouns and determiners in English, and use the corresponding aligned French nouns and articles in the model, as illustrated in Figure 2. Wherever the alignments map to more than one word, only the leftmost word in the phrase is used. We encode these nouns and articles as embeddings in the first input layer. This way, the order of each word is embedded, which should approximate the distance from the missing pronoun.

Additionally, we allow ourselves to look at the French context of the missing pronoun. While the automatically translated context might be too unreliable, French usage should be a better indicator for some of the classes, e.g. ce, which is highly dependent on preceding est. See Figure 3 for an example of context in source and translation as features.

Similarly to the original model in Hardmeier et al. (2013b), the neural network is trained using stochastic gradient descent with mini-batches and L regularization. Cross-entropy is used as a cost function, with a softmax output layer.
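The two kinds of input features described above, a fixed context window and the aligned translations of the closest preceding nouns and determiners, can be sketched as follows. This is a simplified illustration rather than the released implementation: the function names, the Penn-style tag sets, the use of `<S>` as padding on both sides, and the padding of unaligned candidates are assumptions.

```python
PAD = "<S>"  # assumption: one padding symbol for every missing position

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # hypothetical Penn-style tag sets
DETERMINER_TAGS = {"DT"}


def context_window(tokens, position, n, m):
    """Return n tokens left and m tokens right of `position`, padded to n+m."""
    left = tokens[max(0, position - n):position]
    right = tokens[position + 1:position + 1 + m]
    return [PAD] * (n - len(left)) + left + right + [PAD] * (m - len(right))


def antecedent_candidates(source_tags, target, alignment, pronoun_index, k, tagset):
    """Target words aligned to the k closest preceding source words in tagset.

    Candidates are ordered from closest to farthest. When a source word aligns
    to several target words, only the leftmost one is kept.
    """
    alignment = list(alignment)
    candidates = []
    for s_idx in range(pronoun_index - 1, -1, -1):
        if source_tags[s_idx] not in tagset:
            continue
        aligned = [t for s, t in alignment if s == s_idx]
        candidates.append(target[min(aligned)] if aligned else PAD)
        if len(candidates) == k:
            break
    return candidates + [PAD] * (k - len(candidates))


# Preceding sentence pair with hypothetical tags and a monotone alignment.
source = "We have this banner in our offices in Palo Alto".split()
tags = ["PRP", "VBP", "DT", "NN", "IN", "PRP$", "NNS", "IN", "NNP", "NNP"]
target = "Nous avons cette bannière dans nos bureaux à Palo Alto".split()
alignment = [(i, i) for i in range(len(source))]

nouns = antecedent_candidates(tags, target, alignment, len(source), 3, NOUN_TAGS)
dets = antecedent_candidates(tags, target, alignment, len(source), 3, DETERMINER_TAGS)
window = context_window(["REPLACE", "exprime", "notre", "manière", "d'", "aborder"], 0, 3, 4)
print(nouns, dets, window)
```

Because every feature list has a fixed length, each slot can be mapped to its own embedding position in the input layer, which is how the order of the candidates carries distance information.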
Furthermore, the dimensionality of the embeddings is increased from 0 to 50, since we saw minor improvements of the scores on the development set with the increase. To reduce training time and speed up convergence, we use tanh as activation function between the hidden layers (LeCun et al., 2012), in contrast to the sigmoid function used in Hardmeier's model. To avoid overfitting, early stopping is introduced where the training stops if no improvements have been found within a certain number of iterations. This usually results in a training time of 0 epochs, when run on TED data.

Figure 1: Neural network architecture. Blue embeddings (E) signify source context, red target context, and yellow the preceding POS tags. The shown number of features is not equivalent to what is used in the final model.

The model uses a layer-wise uniform random weight initialization as proposed by Glorot and Bengio (2010), where they show that neural network models using tanh as activation function generally perform better with a uniformly distributed random initialization within the interval $\left[-\sqrt{\frac{6}{fan_{in} + fan_{out}}},\ \sqrt{\frac{6}{fan_{in} + fan_{out}}}\right]$, where $fan_{in}$ and $fan_{out}$ are number of inputs and number of hidden units respectively.

Since the model uses a fixed context window size for English and French, as well as a fixed number of preceding nouns and articles, we need to find out optimal parameter settings. We observe that a parameter setting of 4+4 context window for English and French, with preceding nouns and articles each perform well. Figure 4 showcases how window size and number of preceding POS tags affect the performance outcome on the development set. We also look into asymmetric window sizes, but notice no improvements (Figure 5).

Figure 2: An English POS tagger is used to find nouns and articles in preceding utterances, while the word alignments determine which French words are to be used as features. Example sentence pair: "We have this banner in our offices in Palo Alto" / "Nous avons cette bannière dans nos bureaux à Palo Alto".

Feature ablation as presented in Table 1 shows that while all feature classes are required for retrieving top score, POS features are generally the feature class that contributes the least to improved results. It is curious to notice that elle even performs better without the POS features, while elles receives a sufficient bump with them. Furthermore, the results indicate that target features are the most informative of the tested feature classes.
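The initialization, activation, and output choices described in this section can be summarized in a small forward-pass sketch. It uses only the Python standard library instead of Theano and is not the released code. The layer sizes, the function names, and the use of the labels from the result tables as output classes are assumptions, and the embedding lookup is replaced by a ready-made input vector.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Weights drawn uniformly from [-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def affine(x, weights, bias):
    """Compute x * W + b for a single input vector."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One tanh hidden layer followed by a softmax output layer."""
    hidden = [math.tanh(v) for v in affine(x, w_hidden, b_hidden)]
    return softmax(affine(hidden, w_out, b_out))


def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold class."""
    return -math.log(probs[gold_index])


LABELS = ["ce", "cela", "elle", "elles", "il", "ils", "OTHER"]
N_INPUT, N_HIDDEN = 8, 5  # toy sizes, far smaller than any realistic setting

rng = random.Random(0)
w_hidden = glorot_uniform(N_INPUT, N_HIDDEN, rng)
w_out = glorot_uniform(N_HIDDEN, len(LABELS), rng)
b_hidden = [0.0] * N_HIDDEN
b_out = [0.0] * len(LABELS)

x = [rng.uniform(-1.0, 1.0) for _ in range(N_INPUT)]  # stands in for concatenated embeddings
probs = forward(x, w_hidden, b_hidden, w_out, b_out)
loss = cross_entropy(probs, LABELS.index("ils"))
print(probs, loss)
```

In training, the loss would be minimized with mini-batch stochastic gradient descent plus a regularization term; that part is omitted here to keep the sketch short.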
The neural network is implemented in Theano (Bergstra et al., 2010), and is publicly available on GitHub (project name: whatelles).

Figure 3: Example of context used in the classification model, color coded according to their position in the neural network as illustrated in Figure 1. Source context: "<S> <S> <S> it expresses our view of how we"; target context: "<S> <S> <S> exprime notre manière d' aborder".

Figure 4: Parameter variation of window size and number of preceding POS tags (y-axis: macro F-score; series: window size, POS tags). Window size is varied in a symmetrical fashion of n+n. When varying window size, preceding POS tags are used. When varying number of POS tags, a window size of 4+4 is used.

5 Results

The results from the shared task are presented in Table 2 and Table 3. The best performing classes are ce, ils, and other, all reaching F-scores over 80 percent. The less commonly occurring classes elle and elles perform significantly worse, especially recall-wise. The overall macro F-score ends up being 55.%.

6 Discussion

Results indicate that the model performs on par with previously suggested models (Hardmeier et al., 2013b), while having a simpler architecture. Classes highly dependent on local context, such as ce, perform especially well, which is likely due to est being a good indicator of its presence. This is supported by the large performance gains from 40 to 4 in Figure 5, since est usually follows ce. Singular and plural classes rarely get confused, due to them being predicated on the English pronoun which marks it or they. The classes of feminine gender do not perform as well, especially recall-wise, but this was to be expected since the only information from which to infer its antecedent is ordered distance from the pronoun in focus.

Figure 5: Parameter variation of window size asymmetry (y-axis: macro F-score), where each label corresponds to n+n, where n is the context size in each direction.

It is apparent that the model has a bias towards making majority class predictions, especially given the low number of wrong predictions on the elle and elles classes relative to il and ils. The high recall of ils is explained by this phenomenon as well. An additional hypothesis is that there is simply too little data to realistically create usable embeddings, except for a few reoccurring circumstances.

A somewhat interesting example of what POS tags might cause is:

... which is the history of who invented games ... and they would be so immersed in playing the dice games
... l' histoire de qui a inventé le jeu et pourquoi ... seraient si concentrés sur leur jeu de dés ...

This is one of the few instances where ils has been misclassified as elles. Since this classification only happens when using at least three preceding POS tags, it is likely there is something happening with the antecedent candidates. The third determiner is the (history), and points to histoire, which is a noun of feminine gender. It is likely the classifier has learned this connection and has put too much weight into it.

The extra number of features as well as the increase in embedding dimensionality makes the training and prediction slightly slower, but since the training still is done in less than an hour, and testing does not take longer than a few seconds, it is still good enough for general usage. Furthermore, the implementation is made in such a way that further performance increases are to be expected if you run it on a CUDA-compatible GPU with minor changes.

Table 1: F-score for each label in a feature ablation test, where the specified feature classes were removed in training and testing on the development set. The None column has no removed features. Micro score is the overall classification score, while macro is the average over each class. Columns: POS, Source, Target, None. Rows: ce, cela, elle, elles, il, ils, OTHER, Macro, Micro. (Cell values are not preserved.)

Table 2: Precision, recall, and F-score for all classes. Micro score is the overall classification score, while macro is the average over each class. The latter scoring method is used for increasing the importance of classes with fewer instances. Columns: Precision, Recall, F. Rows: ce, cela, elle, elles, il, ils, other, Macro, Micro. (Cell values are not preserved.)

While three separate training data collections were available, we only found interesting results when using data from the same domain as the test data, i.e. transcribed TED talks. To overcome the skewed class distribution, attempts were made at oversampling the less frequent classes from Europarl, but unfortunately this only led to performance loss on the development set. The model does not seem to generalize well from other types of training data such as Europarl or news text, despite Europarl being transcribed speech as well. This is an obvious shortcoming of the model.

Table 3: Confusion matrix of class predictions. Row signifies actual class according to gold standard, while column represents predicted class according to the classifier. Rows and columns: ce, cela, elle, elles, il, ils, other, sum. (Cell values are not preserved.)

We tried several alterations in parameter settings for context window and POS tags, and found no significant improvements beyond the final parameter settings when run on the development set, as seen in Figure 4. Figure 5 makes it clear that a symmetric window size is beneficial, while we are not as sure of why this is the case. Right context seems to be more important than left context, which could be due to the fact that pronouns in their role as subjects largely appear early in sentences, making left context nothing but sentence start markers.
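To make the distinction between the micro and macro scores in the table captions concrete, the sketch below derives both from a confusion matrix whose rows are gold classes and whose columns are predicted classes. The counts are invented for illustration and are not the paper's results. It is also an assumption that the F-score is the balanced F1 measure and that the micro score equals overall accuracy, which holds when every instance has exactly one gold and one predicted label.

```python
def per_class_scores(confusion, labels):
    """Return {label: (precision, recall, f1)} from a square confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        true_positives = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column sum
        actual = sum(confusion[i])                    # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of per-class F1, so rare classes count as much as frequent ones."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def micro_score(confusion):
    """Share of all instances that were classified correctly."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total if total else 0.0


# Invented two-class example: rows are gold labels, columns are predictions.
labels = ["il", "elle"]
confusion = [
    [8, 2],  # gold il: 8 predicted il, 2 predicted elle
    [4, 6],  # gold elle: 4 predicted il, 6 predicted elle
]
scores = per_class_scores(confusion, labels)
print(scores)
print(macro_f1(scores), micro_score(confusion))
```

Because the macro score averages per-class values without weighting, a class with poor recall lowers it regardless of how rare that class is, whereas the micro score is dominated by the frequent classes.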
In future work, it would be interesting to look into how much source context actually contributes to the classification, given a target context. Preliminary results of the feature ablation test in Table 1 indicate that we indeed capture information for at least some of the classes with the use of source features, while it is not quite clear why this is the case. While the English context is nice to have, since you cannot be entirely certain of the translation quality in the target language, intuitively all necessary linguistic information for inferring the correct pronoun should be available in the target translation. After all, the gender of a pronoun is not dependent on whatever source language you translate from, as long as you have found its antecedent. If the source text still were found useful, all English word embeddings could be pretrained on a large number of translation examples and through this process learn the most probable cross-linguistic gender. In the same manner, gender-aware French word embeddings would hypothetically increase the score as well.

7 Conclusion

In this work, we develop a cross-lingual pronoun prediction classifier based on a feed-forward neural network. The model is heavily inspired by Hardmeier et al. (2013b), while trying to simplify the architecture by using preceding nouns and determiners for coreference resolution rather than using features from an anaphora extractor such as BART, as in the original paper. We find that the model indeed performs on par with similar models, while being easier to train. There are some expected drops in performance for the less common classes heavily dependent on finding their antecedent. We discuss probable causes for this, as well as possible solutions using pretrained embeddings on larger amounts of data.

References

[Bergstra et al. 2010] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).

[Broscheit et al. 2010] Samuel Broscheit, Massimo Poesio, Simone Paolo Ponzetto, Kepa Joseba Rodriguez, Lorenza Romano, Olga Uryupina, Yannick Versley, and Roberto Zanoli. 2010. BART: A multilingual anaphora resolution system. In Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics.

[Glorot and Bengio 2010] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics.

[Guillou 2012] Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL. Association for Computational Linguistics.

[Hardmeier and Federico 2010] Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In Proceedings of the 7th International Workshop on Spoken Language Translation.

[Hardmeier et al. 2013a] Christian Hardmeier, Sara Stymne, Jörg Tiedemann, and Joakim Nivre. 2013a. Docent: A document-level decoder for phrase-based statistical machine translation. In ACL 2013 (51st Annual Meeting of the Association for Computational Linguistics). Association for Computational Linguistics.

[Hardmeier et al. 2013b] Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. 2013b. Latent anaphora resolution for cross-lingual pronoun prediction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

[Hardmeier et al. 2015] Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, and Mauro Cettolo. 2015. Pronoun-focused MT and cross-lingual pronoun prediction: Findings of the 2015 DiscoMT shared task on pronoun translation. In Proceedings of the Second Workshop on Discourse in Machine Translation, Lisbon, Portugal.

[Hardmeier 2014] Christian Hardmeier. 2014. Discourse in Statistical Machine Translation. PhD thesis, Uppsala University, Department of Linguistics and Philology.

[Koehn et al. 2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics.

[Koehn 2005] Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5.

[Le Nagard and Koehn 2010] Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, WMT '10. Association for Computational Linguistics.

[LeCun et al. 2012] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. 2012. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer.

[Mitkov et al. 1995] Ruslan Mitkov, Sung-Kwon Choi, and Randall Sharp. 1995. Anaphora resolution in machine translation. In Proceedings of the Sixth International Conference on Theoretical and Methodological Issues in Machine Translation.
[Mitkov 1999] Ruslan Mitkov. 1999. Introduction: Special issue on anaphora resolution in machine translation and multilingual NLP. Machine Translation, 14.

[Och 2003] Franz Josef Och. 2003. GIZA++ software. Internal report, RWTH Aachen University.