Preprocessing on bilingual data for Statistical Machine Translation



Table of contents

1 Introduction
  1.1 Machine translation
  1.2 SMT, alignment and preprocessing
  1.3 Overview of further chapters
2 Statistical machine translation
  2.1 Basic theory
  2.2 Language modeling
  2.3 Translation modeling
  2.4 Alignment
  2.5 The Expectation Maximization algorithm
  2.6 GIZA++
  2.7 Parameter estimation
3 Preprocessing
  3.1 Tokenization and sentence alignment
  3.2 Stemming
  3.3 Named Entity Recognition using Conditional Random Fields
  3.4 NER for SMT
  3.5 NER algorithms for bilingual data
  3.6 Uppercase and lowercase
4 Experiments and evaluation
  4.1 The Europarl parallel corpus
  4.2 Alignment Error Rate
  4.3 Precision, recall and F1-score
  4.4 Basic experiments
  4.5 Experiments on bilingual data
5 Experimental results
  5.1 Tokenization
  5.2 Named Entity Recognition
  5.3 Stemming
6 Review and conclusions
  6.1 Preprocessing effectiveness
  6.2 Using bilingual data for preprocessing
  6.3 Future research
References
Appendix A: Dutch list of non-breaking prefixes

1 Introduction

1.1 Machine translation

Machine Translation (MT) is the translation of text from one human language to another by a computer. Computers, like all machines, are excellent at taking over repetitive and mundane tasks from humans. As translating long texts from one language to another qualifies as such a task, Machine Translation is a potentially very economic way of translation. Unfortunately natural languages are not very suitable for processing by a machine. They are ambiguous, illogical and constantly evolving, qualities that are difficult to handle with a machine. This makes the problem of Natural Language Processing, and by extension MT, a difficult one to solve.

A theoretical method that can analyze a text in a natural language and decipher its semantic content can store this semantic content in a language-independent representation. From this representation, another text with the same semantic content can be generated in any language for which there exists a generation mechanism. Such an MT architecture would provide high quality translations, and be modular; a new language could be added to the pool of inter-translatable languages simply by developing an analysis and generation method for that language. Unfortunately this method does not exist. Some existing MT attempts to approach it to a degree, but as long as semantic analysis remains an unsolved problem in the field of Natural Language Processing there can be no true language-independent representation. Figure 1 shows the Machine Translation Pyramid, which is a schematic representation of the degree of analysis performed on the input text. The MT method described in these paragraphs is at the top of the pyramid.

Figure 1. The Machine Translation Pyramid, which is an indication of the level of syntactic and semantic analysis performed by various MT methods.

Existing MT can be categorized into three fields: Rule-based Machine Translation, Example-based Machine Translation and Statistical Machine Translation.

Rule-based Machine Translation

Rule-based MT is a method that focuses on analyzing the source language by syntactic rules. It typically creates an intermediary, symbolic representation from that analysis that represents the content of the text, and then builds a translation from the intermediary representation. In the Machine Translation Pyramid (figure 1), this method is highest up of all the existing MT varieties. It can do syntactic analysis, but since semantic analysis is still an unsolved problem, there is still a need for language-specific translation steps.

Rule-based MT is one of the most popular methods for practical use. Well known rule-based systems include Systran and METEO [1]. Systran was fairly successful, being utilized for a time by both the United States Air Force and the European Union Commission. The system was never abandoned, as today it is used in AltaVista's Babelfish and Google's language tools. METEO is a system developed for the purpose of translating weather forecasts from French to English, and was used by Environment Canada. It continued to serve its purpose until it was retired.

The main advantage to this method of translation is that it works fast. A translation can be produced within seconds, which makes it an attractive method for the casual user. However, the translations produced by Rule-based MT tend to be of poor quality, as Rule-based MT deals poorly with ambiguity.

Example-based Machine Translation

Example-based Machine Translation operates on the philosophy that translation can be done by analogy. An Example-based Machine Translation system breaks down the source text into phrases, and translates these phrases analogous to the example translations it was trained with. New sentences are created by substituting parts of a learned sentence with parts from other learned sentences. This basic principle is explained in a paper by Nagao [2]. In the Machine Translation Pyramid (figure 1), this method is lower than Rule-based MT, because there is little in the way of analysis of the text. There have been few commercially used example-based translation systems, but the techniques involved are still being researched. A recent proposal for an Example-based MT system was submitted by Sasaki and Murata [3].

The advantage of Example-based MT is that it can produce very high-quality translations, as long as it is applied to very domain-specific texts such as product manuals. However, once the texts become more diverse, the translation quality drops quickly.

Statistical Machine Translation

Statistical Machine Translation (SMT) is a type of MT similar to Example-based MT in the sense that it translates an input text according to what it has learned from training data. Unlike Example-based MT, SMT aims to be able to translate phrases it has not specifically seen before.

With advances in statistical modeling the translation quality of SMT systems has risen above that of alternative methods. A paper by Alshawi and Douglas shows this difference in performance [4]. However, the drawback of this method of MT is that it requires massive amounts of processing time and training material to produce a translation. This makes it unsuitable for time-critical applications. Furthermore, SMT is not very effective for language pairs that have little training data available.

In the Machine Translation Pyramid (figure 1), SMT is all the way at the bottom. This reflects the fact that SMT does not syntactically or semantically analyze the input text at all. It simply uses statistics obtained during model training to find a sequence of words it deems the best translation. Two well-known SMT systems include Moses and Pharaoh. In addition there exists a variety of other systems that focus on components of an SMT system, such as language model trainers and decoders.

1.2 SMT, alignment and preprocessing

This thesis will investigate preprocessing methods for SMT, in an attempt to find ways to increase the performance of this method of MT. Before we can go into the details of preprocessing we must introduce the workings of SMT.

SMT translates based on information it has trained from example translation data. This example translation data takes the form of a parallel corpus. Such a corpus consists of two texts, each of which is the translation of the other. In this work, this corpus is the Europarl corpus [5], which is freely available for the purposes of SMT research. By statistically analyzing such parallel corpora, one can estimate the parameters for whatever statistical models one chooses to employ (see Brown et al [6] and Och and Ney [7]). The trained statistical models are then used by the system to calculate the sentence that has the highest probability of being the translation of the input sentence. The part of the system that does this is called the decoder. Decoding is a difficult problem that is the subject of much research, but it falls outside the scope of this thesis. It is important to note, however, that the performance of the decoder is a function of the quality of the statistical models. Preprocessing on the training data can help improve model quality and by extension the performance of an SMT system.

To understand how preprocessing has an effect on the performance of an SMT system, one must understand the concept of alignment. In the remainder of this text there will be mention of two types of alignment: the sentence-level alignment and the word-level alignment.

The sentence-level alignment refers to the way the sentences in the corpus are sequenced. If a given sentence in one half of the parallel corpus is in the same sequential position as a sentence in the other half of the corpus, then those sentences are said to be a sentence pair. If the sentences in a sentence pair are translations of each other, those sentences are said to be aligned. In order to train an SMT system, one requires a parallel corpus the sentences of which are properly aligned. In the remainder of this thesis, we will assume the training data has a correct sentence-level alignment.

Following Och and Ney [8], the word-level alignment is defined as a subset of the Cartesian product of the word positions. This can be visualized by printing a sentence pair and drawing lines between the words. Such a visualization is given in figures 2 and 3. Both visualizations will be used in the remainder of this thesis.

Figure 2. An example of a graphical representation of an alignment on a sentence pair. Words connected by a line are considered to be translations of each other.

Figure 3. The same alignment as shown in figure 2, represented as a subset of the Cartesian product of the word positions.

As is explained in more detail in chapter 2, a word-level alignment is essential for training the statistical models. As training data does not contain a word-level alignment from the get-go, such an alignment must be created by the system. Note that there is not necessarily one specific good alignment for any given sentence pair. When asked to create an alignment, human aligners may well come up with different alignments, and it may well be that one alignment is as good as another, depending on one's point of view. A human will create an alignment based on meaning, while a machine cannot do this. Clearly, this means that it is not easy to define what a good alignment is for a machine.
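To make the notion of a word-level alignment as a subset of the Cartesian product of word positions concrete, the following minimal Python sketch represents an alignment as a set of position pairs; the sentence pair and the links are invented purely for illustration.

```python
# Minimal sketch: a word-level alignment as a subset of the Cartesian
# product of word positions (cf. figures 2 and 3). Sentences and links
# are invented for illustration only.
e = "the session is resumed".split()          # source sentence (e)
f = "de vergadering is hervat".split()        # foreign sentence (f)

# Each pair (i, j) states that e[i] and f[j] are translations of each other.
alignment = {(0, 0), (1, 1), (2, 2), (3, 3)}

for i, j in sorted(alignment):
    print(f"{e[i]:>10}  <->  {f[j]}")
```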

In practice, the algorithm simply tries to come up with an alignment that has a low perplexity. Perplexity is a measure of how complex an alignment is. To give a simple example where we only consider word mapping, alignments in which words are mapped to multiple other words will have a higher perplexity than alignments that have few word mappings. The example in figures 2 and 3 has exactly one word associated with each word in the source language, and therefore has a low perplexity. Perplexity will be formally introduced in chapter 2.

When creating a word-level alignment, a lot depends on the quality of the corpus itself. Things such as spelling errors, missing words or sentences, garbage and mistranslations may negatively influence the accuracy of the alignment, which in turn may negatively affect the translation models that are trained from the alignment. To minimize these negative influences, the corpus can be adjusted prior to the alignment step. Spelling errors can be corrected or simply tokenized away. Incomplete sentence pairs can be removed, as can garbage such as punctuation or formatting codes. It is tasks such as these that are performed by preprocessing. In addition to eliminating elements that reduce the corpus quality, preprocessing can analyze the corpus, often on a semantic level, thereby stimulating a certain tendency in the creation of the word-level alignment. For example, number tagging can ensure that numbers will be aligned to other numbers. For detailed descriptions of preprocessing steps that are relevant in this thesis, refer to chapter 3.

Previous research into preprocessing steps includes work by Habash and Sadat [9], who investigated the effect of preprocessing steps on SMT performance for the Arabic language. Research into preprocessing for automatic evaluation of MT has also been done by Leusch et al [10].

The goals of this work are twofold. Firstly, it investigates the impact of preprocessing on the performance of SMT training, if the preprocessing is applied to both halves of a parallel corpus. The objective is to judge whether such preprocessing steps are a useful addition to a typical training process. The preprocessing methods examined in this thesis are Stemming, Tokenization and Named Entity Recognition. Secondly, it investigates whether the efficiency of preprocessing in the context of a bilingual corpus can be improved by making use of the bilingual corpus and the assumption that the two halves are accurate translations of each other. This research goal will focus on the Named Entity Recognition preprocessing method.

1.3 Overview of further chapters

Chapter 2 will give an outline of the SMT theory underlying the experiments. It describes the basic SMT theory, which is later used to explain how preprocessing steps can influence the performance of an SMT system. Chapter 3 describes the theoretical foundations of the experiments performed. It introduces the techniques involved in the preprocessing steps and explains their working. Chapter 4 is a description of the experimental setup. It lists the experiments that were performed as well as a predicted result, along with a motivation. Chapter 5 will show the results of the experiments and provide an explanation as to what they mean and how they reflect the impact of the preprocessing steps. Finally, chapter 6 will conclude this thesis, review the results and give recommendations on future research.

2 Statistical machine translation

To understand why changing certain properties of a corpus can have an influence on the accuracy of an SMT system, it is important to understand how an SMT system works. An SMT system can roughly be considered as a training process and a decoding process. Because preprocessing has its effect during the training process, this chapter will focus on that and forgo a detailed explanation of the decoding process.

2.1 Basic theory

MT is about finding a sentence e that is the translation of a given sentence f. The identifiers f and e originally stood for French and English because those were the languages used in various articles written on the subject (Brown et al [6], Knight [11]). This thesis deals with Dutch and English, but will adhere to the convention.

SMT considers every sentence e to be a potential translation of sentence f. Consider that translating a sentence from one language to another is not deterministic. While a typical sentence can usually be interpreted in only one way when it comes to its meaning, the translation may be phrased in many different ways. In other words, a sentence can have multiple translations. For this reason, SMT does not, in principle, outright discard any sentence in the foreign language. Any sentence is a candidate. The trick is to determine which candidate has the highest probability of being a good translation. For every pair of sentences (e, f) we define a probability P(e|f) that e is the translation of f. We choose the sentence that is the most probable translation of f by taking the sentence e for which P(e|f) is greatest. This is written as:

\arg\max_{e} P(e \mid f)   (1)

SMT is essentially an implementation of the noisy channel model, in which the target language sentence is distorted by the channel into the source language sentence. The target language sentence is recovered by reasoning about how it came to be by the distortion of the source language sentence. As the first step, we apply Bayes' Theorem to the formula given above. Because P(f) does not influence the argmax calculation, it can be disregarded. The formula then becomes:

\arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\, P(f \mid e)   (2)

The highest probability that e is the translation of f has been expressed in terms of the probability of e a priori and the probability of f given e.
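To make the decision rule of formula 2 concrete, the following minimal Python sketch picks the candidate e that maximizes P(e) * P(f|e); the candidate list and both probability functions are toy stand-ins, not part of any actual SMT system.

```python
# Minimal sketch of the noisy-channel decision rule (formula 2):
# pick the candidate e that maximizes P(e) * P(f|e).
# Candidates and probabilities below are invented for illustration.

def language_model(e):
    # P(e): toy scores; a real system would use a smoothed n-gram model.
    scores = {"the session is resumed": 0.02,
              "resumed is session the": 0.0001}
    return scores.get(e, 1e-9)

def translation_model(f, e):
    # P(f|e): toy scores; a real system would use trained translation models.
    scores = {("de vergadering is hervat", "the session is resumed"): 0.1,
              ("de vergadering is hervat", "resumed is session the"): 0.1}
    return scores.get((f, e), 1e-9)

def decode(f, candidates):
    return max(candidates, key=lambda e: language_model(e) * translation_model(f, e))

print(decode("de vergadering is hervat",
             ["the session is resumed", "resumed is session the"]))
```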

At first glance this does not appear to be beneficial. However, the introduction of the factor P(e) lets us find translations that are well-formed. To understand this, remember that P(f|e) is never zero; after all, every e is a potential translation of f. It is not zero even if e is complete gibberish. In effect, this means that some of the probability mass is given to translations that are ill-formed sentences; a sizable portion of the probability mass, in fact. The probability P(e) compensates for this. It is called the language model probability. The language model probability can be thought of as the probability that e would occur. As gibberish is less likely to occur than coherent, well-formed sentences, P(e) is higher for the latter than for the former.

The probability P(f|e) is called the translation model probability. The translation model probability is the probability that the sentence e has f as its translation. Evidently the product P(e) P(f|e) will be greatest if both P(e) and P(f|e) are high; in other words, if f is a translation of e and if e is a good sentence.

It is especially P(f|e) that is of interest for this thesis. As stated in the introduction, preprocessing affects the word-level alignment that the system creates on a corpus, and the word-level alignment is used to estimate models that are used to calculate P(f|e). P(e) gives statistical information about a single language, and as such is not determined from the alignment. Often, language models are trained separately from translation models, on different data. Figure 4 is a graphical representation of the above. This figure shows a basic SMT system, including preprocessing for the translation model training.

Figure 4. A more detailed schematic of an SMT system's architecture. Note that the language model is trained from separate (monolingual) training data.

As will be clear, a good SMT system requires a good language model as well as a good translation model. The remainder of this chapter describes how one may obtain such models.

2.2 Language modeling

The language model P(e) is largely determined by the training corpus that was used to train the language model. The more similar the sentence e is to the training data, the higher its P(e) score will be. An important part of this probability is the well-formedness of the sentence. A sentence that is grammatically correct is a well-formed sentence, whereas a sentence that is a mere collection of words that bear no relation to each other is ill-formed.

While well-formedness is important, it is not all that matters for the language model probability. The words used in the sentence also have their impact. If a sentence uses many uncommon words, it may be given a lower P(e) score than a sentence that only uses more common words. For example, a sentence that uses the words mausoleum, nanotechnology and comatose together may be given a lower P(e) score than a sentence that uses the words cooking, house and evening together. Of course, if the training corpus was largely on the subject of people being held comatose in a mausoleum by means of nanotechnology, the opposite might be true, as the former sentence would be using common words given that training corpus.

The language model can be trained by simple counting. Training requires a training corpus, preferably as large a corpus as possible, which contains sentences in the language for which the language model is being trained. From this corpus a collection of n-grams is constructed. An n-gram is a fragment of a sentence that consists of n consecutive words. For example, the sentence "Resumption of the session" contains five 2-grams: "<s> Resumption", "Resumption of", "of the", "the session" and "session </s>", where <s> and </s> indicate the absence of a word at the start and the end of the sentence, respectively. The n-grams can then be assigned probabilities as follows:

P(X_n \mid X_0 \dots X_{n-1}) = \frac{\#(X_0 \dots X_n)}{\#(X_0 \dots X_{n-1})}   (3)

where X is a word in the sentence, X_0...X_n is an n-gram and # is the number of occurrences in the corpus. Any sentence that can be constructed from the n-grams that the system has learned can be assigned a P(e) that is a function of the probabilities of its component n-grams.
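As an illustration of formula 3, the following minimal Python sketch estimates bigram probabilities by counting over a tiny invented corpus; unseen bigrams receive probability zero here, which is precisely the problem that smoothing, discussed next, addresses.

```python
from collections import Counter

# Minimal sketch of n-gram (here: bigram) probability estimation by counting,
# following formula (3). The tiny corpus is invented for illustration.
corpus = [
    "resumption of the session",
    "approval of the minutes",
]

bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(words[:-1])
    bigram_counts.update(zip(words[:-1], words[1:]))

def p_bigram(w_prev, w):
    # P(w | w_prev) = #(w_prev w) / #(w_prev); zero for unseen bigrams,
    # which is exactly what smoothing is meant to avoid.
    if unigram_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(p_bigram("of", "the"))   # 1.0 in this toy corpus
```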

However, this isn't sufficient. A sentence that cannot be built out of learned n-grams will be given a probability of zero, which means the system will not be able to generate those sentences. As training data is finite and the number of possible sentences is not, it will always be possible to construct a sentence that has one or more n-grams that do not occur in the training data, no matter how large the training corpus is. Therefore, we employ a technique called smoothing, which assigns a nonzero probability to every possible n-gram given the words in the training corpus, even those that don't actually occur. This allows the system to generate sentences that contain n-grams that weren't in the training corpus, as long as those n-grams contain known words. There are many possible approaches to smoothing, the most simple being the addition of a very small value to every n-gram that did not appear in the corpus.

There are other methods of building a language model, though the smoothed n-gram method is prevalent. These methods are outside the scope of this thesis. However, it is worth pointing out that Eck et al [12] investigated a method that expands on the n-gram method by adapting the language model to be more domain specific, thereby achieving better results in that domain.

2.3 Translation modeling

The purpose of the translation model is to indicate for a given sentence pair (e, f) the probability that f is the translation of e. It assigns a probability to each potential translation of the input sentence, and if the model is any good, better translations will have higher probabilities. Training models that will yield good probabilities is not easy, and in fact a great deal of the research done in the field of SMT is related to translation modeling.

The approach used by the SMT translation models is called string rewriting. It is described in detail by Brown et al [6]. String rewriting essentially replaces the words in a sentence with their translations, then reorders them. While string rewriting cannot explicitly map syntactic relationships between words from the source sentence to the target sentence, it is possible to approximate such a mapping statistically. The upside to this method is that it is very easy in principle, and it can be learned from available data. This means that as long as appropriate training data is available this method applies to any language pair. In string rewriting, there are four parameters that are calculated by the translation model.

The first parameter is the number of translated words that are associated with every source word. This is called the fertility of that word. For example, a word with a fertility of 3 will have 3 words associated with it as a translation of that word. The fertility for a word is not directly dependent on the other words in the sentence or their fertilities, but as the sum of all fertilities must be equal to the number of words in the target sentence, fertilities indirectly influence each other by competing for words when estimating the fertility probabilities during model training. The part of the translation model that decides on the fertility is called the fertility model. This model assigns to each word e_i a fertility \varphi_i with probability

n(\varphi_i \mid e_i)   (4)

Secondly, the translation model decides which translation words are generated for each word in the source sentence. This is called the translation probability, not to be confused with the translation model probability. More formally, for each word e_i the generation model chooses k foreign words \tau_{ik} with probability

t(\tau_{ik} \mid e_i), \quad 1 \leq k \leq \varphi_i   (5)

Thirdly, the translation model decides the order in which these translated words are to be placed. This part of the translation model is called the distortion model. The distortion model chooses for each generated word \tau_{ik} a position \pi_{ik} with probability

d(\pi_{ik} \mid i, l, m)   (6)

where l is the number of words in the source sentence and m is the sum of all fertilities.

Finally, the translation model causes words to be inserted spuriously. To understand this, consider that sometimes, words that appear in a translation may not be directly generated from a word in the original sentence. For example, a grammatical helper word that exists in one language may have no equivalent in the other language, and will therefore not be generated by any of the words in that language. For this reason, all sentences are assumed to have a NULL word at the start of the sentence. This NULL word can have translations like any other word, which allows words without a counterpart in the original sentence to be generated. This is called spurious insertion. Every time a word is generated normally in the target sentence, there is a probability that a word is generated spuriously. This probability is denoted with

p_1   (7)

The probability p_0 is the probability that spurious generation does not occur, given by

p_0 = 1 - p_1   (8)

In the following section we will see how these parameters can be turned into the translation model probability P(f|e) that we are looking for, by means of a word-level alignment.

2.4 Alignment

An actual translation model is trained from a word-level alignment. The parameters described in the previous section can be estimated from a word-level alignment.

n(\varphi \mid e) for a certain e and \varphi can be obtained simply by checking the word-level alignment on the entire corpus, counting all the occurrences where e is aligned to exactly \varphi words in the foreign language, then dividing this count by the total number of occurrences of e:

n(\varphi \mid e) = \frac{\#(e \text{ aligned to } \varphi \text{ words})}{\#(e)}   (9)

t(\tau \mid e) for a certain \tau and e can be obtained by counting how many words are generated by all occurrences of e in the alignment and then dividing the count of \tau by the total count, where x means any word:

t(\tau \mid e) = \frac{\#(\tau \mid e)}{\#(x \mid e)}   (10)

d(\pi \mid i, l, m) for a certain \pi, i, l and m can be obtained by counting the occurrences of (\pi \mid i, l, m) and dividing it by the count of all occurrences of (j \mid i, l, m), with j = 1 \dots m:

d(\pi \mid i, l, m) = \frac{\#(\pi \mid i, l, m)}{\sum_{j=1}^{m} \#(j \mid i, l, m)}   (11)
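The counting described above can be sketched in a few lines of Python; the aligned toy data below is invented for illustration, and only the fertility and translation estimates of formulas 9 and 10 are shown.

```python
from collections import Counter, defaultdict

# Minimal sketch of estimating fertility n(phi|e) and translation t(tau|e)
# probabilities by counting over a given word-level alignment (formulas 9-10).
# The aligned data is invented for illustration.
aligned_corpus = [
    # (source words e, foreign words f, alignment links (i, j))
    (["the", "session"], ["de", "vergadering"], {(0, 0), (1, 1)}),
    (["the", "house"],   ["het", "huis"],       {(0, 0), (1, 1)}),
]

fertility_counts = defaultdict(Counter)     # e -> Counter over phi
translation_counts = defaultdict(Counter)   # e -> Counter over tau
e_counts = Counter()

for e_words, f_words, links in aligned_corpus:
    for i, e_word in enumerate(e_words):
        linked = [j for (i2, j) in links if i2 == i]
        fertility_counts[e_word][len(linked)] += 1
        for j in linked:
            translation_counts[e_word][f_words[j]] += 1
        e_counts[e_word] += 1

def n(phi, e):
    return fertility_counts[e][phi] / e_counts[e]

def t(tau, e):
    return translation_counts[e][tau] / sum(translation_counts[e].values())

print(n(1, "the"), t("de", "the"))   # 1.0 and 0.5 on this toy data
```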

p_1 can be obtained by looking at the foreign corpus. This corpus consists of N words. We reason that M of these N words were generated spuriously, and that the other N - M words were generated from English words. M can be obtained from the word-level alignment by counting the occurrences of the translation parameter (x | NULL). This leads to the value for p_1:

p_1 = \frac{M}{N - M}   (12)

The above shows that it is vitally important for the word-level alignment to be as accurate as possible. The ideal scenario is that every word in a sentence is aligned to a word that is a translation of that word, or if there is no translation of that word available in the translation sentence, that it not be aligned to another word at all.

In practice, prefabricated word-level alignments do not exist. It falls to the SMT system training process to estimate one from the sentence-aligned corpus. Because the trainer has no knowledge of the languages involved at all, it must determine which alignment is the best one based on patterns that exist in the corpus. For this purpose we employ the Expectation Maximization (EM) algorithm (Al-Onaizan et al [13]).

2.5 The Expectation Maximization algorithm

EM is an iterative process that attempts to find the most probable word-level alignment on all sentence pairs in the parallel corpus. It attempts to find patterns in the corpus by statistically analyzing the component sentence pairs, and considers alignments that conform to these patterns to be better than alignments that don't. This is why preprocessing on the corpus has an effect on the overall translation model quality. By modifying the corpus we modify certain patterns, with the intent to direct the EM algorithm to produce a better alignment.

In creating a word-level alignment on a sentence-aligned corpus, we consider that each sentence pair has a number of alignments, not just a single one. Some of these alignments we may consider better than others. To reflect this, we introduce alignment weights. An alignment with a higher weight is considered better than an alignment with a lower weight. The sum of the alignment weights for all alignments on a sentence pair is equal to 1. These weights will help us estimate the translation model parameters by collecting fractional counts over all alignments. The basic method is the same as described at the beginning of this section, but we do it for all alignments. Furthermore, we multiply the counts by the weight of the alignment that we count the parameter from, and then add the fractional counts for a parameter together to get the final count. In this manner, we can estimate parameters even if we have more than a single alignment on a sentence pair.

The question arises where these alignment weights come from. Let us express these weights in terms of alignment probabilities. The probability of an alignment on a sentence pair (e, f) is the probability that the alignment would occur given that sentence pair. We write this probability as

P(a \mid e, f)   (13)

where a is the alignment. We can use the definition of conditional probability to rewrite this probability as

P(a \mid e, f) = \frac{P(a, f, e)}{P(e, f)}   (14)

Because e is statistically independent of both f and a, we can write

P(a \mid e, f) = \frac{P(a, f \mid e)\, P(e)}{P(f \mid e)\, P(e)}   (15)

After dividing out P(e) we end up with

P(a \mid e, f) = \frac{P(a, f \mid e)}{P(f \mid e)}   (16)

It is easy to see that taking the sum over a of all probabilities P(a, f | e) is the same as P(f | e):

P(f \mid e) = \sum_{a} P(a, f \mid e)   (17)

In other words:

P(a \mid e, f) = \frac{P(a, f \mid e)}{\sum_{a} P(a, f \mid e)}   (18)

Finally, P(a, f | e) is calculated as follows:

P(a, f \mid e) = \binom{m - \varphi_0}{\varphi_0}\, p_0^{\,m - 2\varphi_0}\, p_1^{\,\varphi_0} \prod_{i=1}^{l} n(\varphi_i \mid e_i)\, \varphi_i! \prod_{j=1}^{m} t(f_j \mid e_{a_j})\, d(j \mid a_j, l, m)   (19)

where
  e        is the source sentence
  f        is the foreign sentence
  a        is the alignment
  e_i      is the source word in position i
  f_j      is the foreign word in position j
  l        is the number of words in the source sentence
  m        is the number of words in the foreign sentence
  a_j      is the position in the source language that connects to position j in the foreign language in alignment a
  e_{a_j}  is the word in the source sentence in position a_j
  \varphi_i is the fertility for the source word in position i given alignment a
  p_1      is the probability that spurious insertion occurs
  p_0      is the probability that spurious insertion does not occur

Note that, in deducing the formula for P(a, f | e), we introduced a formula for calculating P(f | e) (formula 17). Remember that this is the translation model probability that we ultimately aim to establish by training the translation models.

In summary, the alignment probability P(a | e, f) can be expressed in all the translation model parameters. As we already asserted, these translation model parameters can be calculated given the alignment probability. If we have one, we can compute the other. Needless to say we start out with neither, which presents a problem. This is often referred to as the chicken-and-egg problem. We will need a method for bootstrapping the training, and Expectation Maximization is exactly that.

EM begins with a set of uniform parameters. Every word in the corpus will be given the same fertility, the same translation probabilities and the same distortion probabilities. With this set of parameters, alignment probabilities can be computed for every sentence pair in the corpus, as described above. From these alignments we can collect fractional counts, and with the fractional counts we can compute a new set of parameter estimates. This new set of parameters is going to be better than the one we started with, because the process takes into account the correlation data in the parallel corpus. For example, if a certain word always shows up with a certain other word in the other language, the translation parameter for those two words will get a higher count. As a result, the EM process will give a higher probability to alignments that connect those words with each other.
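To make the fractional-count idea concrete, the following minimal Python sketch runs a few EM passes restricted to the translation probabilities t(f|e) only, the simplification the next section attributes to IBM Model 1; the toy corpus and the fixed number of iterations are illustrative choices, not the actual training setup.

```python
from collections import defaultdict

# Minimal sketch of EM re-estimation using only translation probabilities
# t(f|e). Starting from uniform parameters, fractional counts are collected
# and turned into new estimates on each pass. Corpus is invented.
corpus = [
    ("the house".split(), "het huis".split()),
    ("the session".split(), "de vergadering".split()),
    ("house".split(), "huis".split()),
]

e_vocab = {e for es, _ in corpus for e in es}
f_vocab = {f for _, fs in corpus for f in fs}

# Start with uniform parameters.
t = {(f, e): 1.0 / len(f_vocab) for e in e_vocab for f in f_vocab}

for iteration in range(10):
    counts = defaultdict(float)   # fractional counts for (f, e)
    totals = defaultdict(float)   # fractional counts for e
    for es, fs in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)       # expectation step
            for e in es:
                frac = t[(f, e)] / norm
                counts[(f, e)] += frac
                totals[e] += frac
    for (f, e) in t:                                # maximization step
        if totals[e] > 0:
            t[(f, e)] = counts[(f, e)] / totals[e]

print(round(t[("huis", "house")], 3))   # converges towards 1.0
```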

EM searches for an optimization of numerical data. As EM iterates it will produce alignments it considers better. In this context, better means a lower perplexity. In the introduction, perplexity was described as a measure of complexity. With the theory described in this section, we are ready for a more formal definition:

\text{perplexity} = 2^{-\frac{\log_2 P(f \mid e)}{N}}   (20)

Remember that P(f|e) can be expressed in terms of P(a, f|e) (formula 17). N is the number of words in the corpus. The higher P(f|e) is, the lower the perplexity will be. P(f|e) is higher if the parameters that make up P(a, f|e) have higher values. Finally, the parameters will have higher values when their fractional counts over the entire corpus are high. Because simple relationships between words will show up more often than complex ones, alignments with such simple relationships will yield higher parameter values.

What EM does is find the lowest perplexity it can. Each iteration lowers perplexity. However, because perplexity is a measure over a product of the parameters, the EM algorithm is only guaranteed to find a local optimum, rather than the global optimum. The optimum it finds is partly a function of where it starts searching, or to put it in other words, what parameter values it starts with.

There are several EM algorithms imaginable. For example, there could be an EM algorithm that simplifies the translation model by ignoring fertility probabilities, probabilities for spurious insertion and distortion probabilities. This EM algorithm will only optimize perplexity in terms of the translation probability parameter. As there is only one factor to optimize, this EM algorithm will be guaranteed to find the global optimum for its perplexity. This EM algorithm exists, and it is the EM algorithm used in IBM Model 1.

2.6 GIZA++

In practice the alignment is generated by a program called GIZA++. GIZA++ is an extension of GIZA, which is an implementation of several IBM translation models. In addition to the IBM models, GIZA++ also implements Hidden Markov Models (HMMs). By request this thesis acknowledges Franz Josef Och and Hermann Ney for GIZA++. The theory of their implementation is described in [8].

GIZA++ produces a word-level alignment on a sentence-aligned parallel corpus. GIZA++ will produce a one-to-many alignment, in which words in the target sentence may only be aligned to a single word in the source sentence. This is illustrated in figure 5.

Figure 5. Two one-to-many alignments, one for English-Dutch and one for Dutch-English. Note that these alignments are not optimal. Some errors exist, such as the alignment of "naar" to "be".

To achieve a many-to-many alignment from GIZA++ it is necessary to produce two one-to-many alignments, one for each translation direction, and combine them into a single many-to-many alignment. This process is referred to as symmetrization. There are two methods of symmetrization used in these experiments: Union and Intersection symmetrization.

Union symmetrization assumes that any alignment the two one-to-many alignments do not agree on should be included in the many-to-many alignment. Formally:

A_{MTM} = A_{OTM1} \cup A_{OTM2}   (21)

Intersection symmetrization assumes that any alignment the two one-to-many alignments do not agree on should be discarded. Words that no longer have any alignment after symmetrization are aligned to NULL. Formally:

A_{MTM} = A_{OTM1} \cap A_{OTM2}   (22)
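Since both one-to-many alignments can be represented as sets of position pairs, the symmetrization of formulas 21 and 22 reduces to set union and intersection, as in the following minimal Python sketch with invented links; both alignments are assumed to be expressed in the same source-to-foreign coordinates.

```python
# Minimal sketch of union and intersection symmetrization (formulas 21-22).
# Each link (i, j) connects source position i to foreign position j; the
# two one-to-many alignments below are invented for illustration.
e_to_f = {(0, 0), (1, 1), (2, 2), (2, 3)}   # alignment trained e -> f
f_to_e = {(0, 0), (1, 1), (2, 2)}           # alignment trained f -> e

union = e_to_f | f_to_e          # keep every link either direction proposes
intersection = e_to_f & f_to_e   # keep only links both directions agree on

print(sorted(union))          # [(0, 0), (1, 1), (2, 2), (2, 3)]
print(sorted(intersection))   # [(0, 0), (1, 1), (2, 2)]
```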

The Union and Intersection many-to-many alignments for the two one-to-many alignments given in figure 5 are shown in figure 6.

Figure 6. Two many-to-many alignments created from the one-to-many alignments in figure 5. The upper figure is the Union alignment and the lower figure is the Intersection alignment.

2.7 Parameter estimation

When optimum perplexity has been achieved, the alignment with the highest probability is called the Viterbi alignment. The Viterbi alignment found by Model 1 when starting off with uniform parameter values may be a very bad alignment. For example, all words could be connected to the same translation word. Model 1 has no way of knowing that this is not a probable alignment, because it ignores all the parameters that show this improbability, such as the fertility parameter. However, the Model 1 Viterbi alignment can be used as the starting position for more complex EM algorithms. More complex algorithms have more parameters that weigh into the perplexity, and they are not guaranteed to find a global optimum. From the most probable alignment given a local optimum we can get a new set of parameters. This new set of parameters can then be fed back to Model 1, which may find a new Viterbi alignment as a result of its new starting parameters.

This last part is an important aspect of translation model training. By using the parameters from a training iteration of one model, we can start a new training iteration, with that same model or with a different one, which will hopefully yield improved parameter values. This process is called parameter estimation. A simple, schematic representation of this process is given in figure 7.

Figure 7. A schematic representation of the parameter estimation process. The models can each be trained a number of times, taking the results from the previous iteration as the starting point for the new iteration. When a model estimates parameters that were not estimated by a previous model, it starts the first training iteration with uniform values for those parameters.

There is one practical problem with starting the parameter estimation process. Recall that P(a | e, f) can be expressed in terms of P(a, f | e) (see formula 18). In the Model 1 EM algorithm, the denominator of that formula, \sum_{a} P(a, f \mid e), can be written as

\sum_{a} \prod_{j=1}^{m} t(f_j \mid e_{a_j})   (23)

As is implied by this formula, the EM algorithm needs to enumerate over every alignment in the corpus. For a sentence pair with N words in language e and M words in language f, the number of alignments is equal to

(N + 1)^{M}   (24)

To illustrate, in a single sentence pair with 20 words in each sentence, the number of alignments is 21^20, which is on the order of 10^26. For a corpus with 120,000 sentence pairs the number of alignments is astronomical, and enumerating all of them is impractical. Fortunately, we can optimize the enumeration process.

As formula 23 sums over a product of independent elements, we can exchange the sum and the product and take the product over the sum of the individual terms:

\prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)   (25)

With this formula, the number of terms to be evaluated to get the fractional counts for all alignments is equal to

M (N + 1)   (26)

This formula has a quadratic order of magnitude, whereas formula 24 has an exponential order. To illustrate, consider again the single sentence pair with 20 words in each sentence. The number of terms to evaluate is now only 420.

The above means that we can enumerate all the alignments for IBM Model 1 within reasonable time, and therefore find the Viterbi alignment. The same is true for IBM Model 2, which is like Model 1, but handles distortion probabilities as well as translation probabilities. Unfortunately, this manner of simplification cannot be performed for complex models like IBM Model 3, and so we cannot find their Viterbi alignment in reasonable time. However, there is a technique called hill climbing that can be used to find the (local) optimum for such models.

Hill climbing takes for every sentence pair a single alignment to start with. A good place to start would be the Model 2 Viterbi alignment. The model then makes a small change to the alignment, for example by moving a connection from one position to another position close by. Then the model computes the perplexity for the new alignment. This can be done fairly quickly with formula 19. If the new alignment is worse, it is discarded. If it is better, it replaces the old alignment. The model repeats this process until no better alignment can be found by making a small change. The alignment the model ends up with is considered the Viterbi alignment for this model, even though there is no guarantee that the alignment is, in fact, the best alignment given the current parameter values.

From this Viterbi alignment and a small set of alignments that are close to it we can collect fractional counts and estimate a new set of parameters. This new set of parameters can be used as the starting point for a new training iteration or, if no more training is deemed necessary, to calculate the final P(f|e).
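The hill climbing procedure described above can be sketched as follows; the scoring function below is only a toy stand-in for P(a, f|e) of formula 19, and the starting alignment is deliberately poor, purely for illustration.

```python
# Minimal sketch of hill climbing over alignments. score() stands in for
# P(a, f | e); here it merely rewards aligning position j to source
# position j, which is enough to show the climbing behaviour.
def score(alignment):
    return sum(1.0 if i == j + 1 else 0.1 for j, i in enumerate(alignment))

def neighbors(alignment, l):
    # Small changes: move one connection a_j to another source position.
    for j, old in enumerate(alignment):
        for new in range(l + 1):          # 0 is the NULL position
            if new != old:
                yield alignment[:j] + (new,) + alignment[j + 1:]

def hill_climb(start, l):
    current = start
    while True:
        best = max(neighbors(current, l), key=score, default=current)
        if score(best) <= score(current):
            return current                # no small change improves the score
        current = best

# a_j for j = 1..m, starting from a deliberately poor alignment.
print(hill_climb((3, 3, 3), l=3))         # climbs towards (1, 2, 3)
```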

There are more models than IBM Models 1, 2, and 3, such as the models presented by Och and Ney [7]. However, for the purpose of preprocessing it is not necessary to enumerate and explain each of these models.

In summary, this chapter shows that Statistical Machine Translation is about calculating the highest probability that a given sentence is the translation of another sentence. This is done by analyzing monolingual data to obtain language models and bilingual data to obtain translation models. Translation model training is by far the most complex task, because it requires a word-level alignment, which has to be estimated from a sentence-level alignment by analyzing patterns. The EM algorithm is used for this analysis. By modifying the patterns we influence the EM algorithm, and by extension the word-level alignment and the translation model.

3 Preprocessing

Preprocessing is literally to process something before it is processed by something else. In computer science a preprocessor is a program that processes its input data to produce output that is used as input to another program. In the specific context of these experiments the output data of the preprocessors serves as the input data for GIZA++. This chapter gives an overview of the preprocessing techniques used in the experiments.

3.1 Tokenization and sentence alignment

Following Webster and Kit [14], tokenization is defined as a type of preprocessing that decomposes parts of a given text into more basic units. An example of tokenization on English is decomposing the contraction "it's" into "it is". The tokenization employed in these experiments is limited to the removal of punctuation and words that do not bear any semantic significance, such as corpus markup.

Tokenization on punctuation is a trivial task in itself. However the scripts that are included with the Europarl corpus take a slightly more involved approach, making use of a list of non-breaking prefixes. These non-breaking prefixes indicate words that do not mark the end of a sentence when encountered with a period. The list is used not only in tokenization but also for sentence alignment, and later it will be used during Named Entity Recognition as well. Europarl includes a list of non-breaking prefixes for English, but not for Dutch. Although a Dutch list is not required to remove punctuation on the corpus, the list can also be used during the sentence alignment of the text. If no suitable list is found for a language, the sentence alignment script falls back to English. This can result in a bad sentence alignment, which is effectively useless as a base to train statistical translation models. A list of Dutch non-breaking prefixes is shown in Appendix A.
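A minimal sketch of such prefix-aware sentence splitting, assuming an invented prefix list rather than the actual Europarl or Appendix A lists, could look as follows.

```python
# Minimal sketch of period-based sentence splitting that honours a list of
# non-breaking prefixes, in the spirit of the Europarl scripts. The prefix
# list and the sample text are illustrative only.
NON_BREAKING = {"dhr", "mevr", "nr", "bijv"}   # e.g. Dutch abbreviations

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith("."):
            word = token[:-1].lower()
            if word not in NON_BREAKING:       # a real sentence boundary
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dhr. Jansen opent de vergadering. De notulen zijn goedgekeurd."))
# ['Dhr. Jansen opent de vergadering.', 'De notulen zijn goedgekeurd.']
```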

3.2 Stemming

Stemming is the process of reducing a word to its stem. In linguistics a stem is the part of a word that is common to all its inflected variants. This is also called the morphological root. The morphological root is the primary lexical unit of a word, which carries the most significant aspects of semantic content and cannot be reduced into smaller constituents. Any word in natural languages such as Dutch and English can be considered to be composed of a stem, optionally inflected by an affix (be it a prefix, a suffix or a circumfix).

In the context of Natural Language Processing, a word's stem is taken to be the first part of that word, stripping off any trailing letters that might constitute inflections, conjugations or other modifiers to the word. It is important to note that the stem thus generated is not necessarily the same as the morphological root of that word. In stemming for NLP, it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Example 8 illustrates the result of stemming on a small selection of words.

consist      -> consist
consisted    -> consist
consistency  -> consist
consistent   -> consist
consistently -> consist
consisting   -> consist
consists     -> consist
knock        -> knock
knocked      -> knock
knocker      -> knocker
knockers     -> knocker

Example 8. Examples of stemming applied to a few English words and their variations.

The stemming algorithm used in most NLP related tasks is the Porter stemmer, or a stemmer derived from the Porter stemmer. The stemmer used in this experiment is no exception; the stemmer used is the Porter2 stemmer, which is a slightly improved version of the Porter stemmer. The Porter stemming algorithm is described by Porter et al [15].

Because stemming directly changes the corpus, it requires a postprocessing step to restore the words in the corpus to their original forms once the alignment has been computed, as estimating the translation model parameters from a stemmed corpus would not result in a very good SMT system.
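As an illustration, the following minimal sketch applies a Porter2 (Snowball) stemmer, assuming the NLTK package is available; any Porter-family stemmer would serve the same purpose.

```python
# Minimal sketch of stemming with the Porter2 (Snowball) algorithm, assuming
# the NLTK package is installed.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["consisted", "consistency", "knocked", "knockers"]:
    print(word, "->", stemmer.stem(word))
# e.g. consisted -> consist, knockers -> knocker
```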

3.3 Named Entity Recognition using Conditional Random Fields

Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate and classify Named Entities (NEs), which are expressions that refer to the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. NER has been a topic of research for years at conferences and in workshops, most notably the Message Understanding Conference (MUC), the Conference on Natural Language Learning (CoNLL) and the Multilingual Entity Tasks (MET). One of the methods evaluated at these conferences and workshops is NER by means of Conditional Random Fields (CRFs) (Lafferty et al, [16]).

CRFs are conditional probabilistic models, not unlike HMMs, for labeling or segmenting sequential data, such as a plaintext corpus. One of the problems with labeling sequential data is that the data often cannot be interpreted as independent units. For example, an English sentence is bound by grammatical rules that impose long-range relationships between the words in the sentence. This makes enumerating all observation sequences intractable, which in turn means that a joint probability distribution over the observation and label sequences, as would be the case in an HMM, cannot be calculated in reasonable time. On the other hand, making unwarranted independence assumptions about the observation sequences is not desirable either. A solution to this problem is to define a conditional probability given a particular observation sequence rather than a joint probability distribution over two random variables.

Let X be a random variable that ranges over observable sequences of words, and let Y be a random variable that ranges over the corresponding sequences of labels, in this case named entity tags. A CRF defines a conditional probability P(Y | x) given a particular observed sequence of words x. The model attempts to find the maximal probability P(y | x) for a particular label sequence y.

Let G = (V, E) be an undirected graph such that there is a node v in V corresponding to each of the random variables representing an element Y_v of Y. If each random variable Y_v obeys the Markov property with respect to G, then (Y, X) is a CRF. The Markov property entails that the model is memoryless, meaning each next state in the model depends solely on its previous state. While in theory G may have any structure, in practice it always takes the form of a simple first-order chain. Figure 9 illustrates this.

Figure 9. A simple first-order chain CRF architecture. The states at the top are the label sequences generated by the model, while the state X at the bottom represents the observed data sequences. The states Y_n are only dependent on their neighboring states, thereby satisfying the Markov property.

Lafferty et al [16] define the probability of a particular label sequence y given observation sequence x to be a normalized product of potential functions, each of the form

\exp\Big(\sum_{j} \lambda_j\, t_j(y_{i-1}, y_i, x, i) + \sum_{k} \mu_k\, s_k(y_i, x, i)\Big)   (27)

where t_j(y_{i-1}, y_i, x, i) is a transition feature function of the entire observation sequence and the labels at positions i and i-1 in the label sequence, and s_k(y_i, x, i) is a state feature function of the label at position i and the observation sequence. \lambda_j and \mu_k are parameters that are obtained from the training data.

These feature functions take on the values of one of a number of real-valued features. Such features are conditions on the observation sequence that may or may not be satisfied. For example, a feature b(x, i) could be

b(x, i) = 1 if the word at position i starts with a capital letter, 0 otherwise   (28)

A state feature function for this feature could be

s_k(y_i, x, i) = b(x, i) if y_i = IN, 0 otherwise   (29)

Similarly, a transition feature function t_j(y_{i-1}, y_i, x, i) is defined on two states and a feature. With such features and feature functions, we can write the probability of a label sequence y given an observed sequence x as

P(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big(\sum_{j} \lambda_j F_j(y, x)\Big)   (30)

where \frac{1}{Z(x)} is a normalization factor and F_j(y, x) is the sum over all feature functions, both state and transition.
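A minimal sketch of the capitalization feature of formula 28 and its state feature function of formula 29 is given below; the label name IN follows the text, and the example sentence is invented for illustration.

```python
# Minimal sketch of the capitalization feature and a state feature function
# from formulas 28-29. The example sentence is invented for illustration.
def b(x, i):
    # 1 if the word at position i starts with a capital letter, 0 otherwise.
    return 1 if x[i][:1].isupper() else 0

def s_k(y_i, x, i):
    # The feature only fires when the label at position i is "IN".
    return b(x, i) if y_i == "IN" else 0

x = ["mr", "Kumar", "Ponnambalam", "visited", "Parliament"]
print([b(x, i) for i in range(len(x))])        # [0, 1, 1, 0, 1]
print(s_k("IN", x, 1), s_k("OUT", x, 1))       # 1 0
```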

Given this probability we can calculate the maximal probability by maximizing the logarithm of the likelihood, given by:

\Lambda(\lambda) = \sum_{k} \Big( \log \frac{1}{Z(x^{(k)})} + \sum_{j} \lambda_j F_j(y^{(k)}, x^{(k)}) \Big)   (31)

Because this function is defined on the probabilities of local label sequences, we can ensure we get the general label sequence with the highest probability by maximizing this function. It is concave, which ensures convergence to a global maximum.

An NER system that implements CRFs is the Stanford Named Entity Recognizer (Rose Finkel et al [17]), which scored the highest overall F-scores on the CoNLL 2002 shared task and on the CMU Seminar Announcement dataset. This system will be used in the CRF NER experiments described in this thesis.

3.4 NER for SMT

An NE is special from the perspective of SMT in the sense that each NE can only have exactly one translation, no more and no less. Though some entities may have more than one name, they are typically indicated with only one of the possible names throughout the use of the language. For example, in English the Belgian town of Bruges is always referred to by that name, though in Dutch the town is invariably referred to by its Dutch name, Brugge. We can say that in this specific example, "Bruges" in English is the correct translation of "Brugge" in Dutch, and that any other translation is wrong.

Linguistically speaking NER is aimed towards finding and classifying exactly those words that are indeed NEs, and no others. In tasks such as information retrieval it is important to achieve a high precision and recall under those constraints, because of the semantic significance of the words. In the context of training an SMT system, however, there is a different consideration that comes into play. NER is not a means for information extraction but for exerting influence over the EM algorithm. An NER algorithm that scores very badly, but can be shown to improve the quality of the word-level alignment, is still an effective algorithm, even if it is uninteresting for the NER task itself.

In an ideal world a word tagging algorithm would be able to recognize which words are translations of each other in all cases. If such an algorithm existed there would effectively no longer be a need for the EM algorithm. Clearly such an algorithm is not realistic, but by applying NER an attempt is made to perform a small portion of this task.

3.5 NER algorithms for bilingual data

Sentence pair based lexical similarity

This algorithm tags words as NEs by checking for a lexically similar word in the translation sentence. For each word it calculates the maximum lexical similarity score for that word given all the words in the translation. Lexical similarity is a measure of how closely two words resemble each other. The words "Parlement" and "Parliament", for example, are lexically quite similar because they have many letters in common. In the experiments, lexical similarity is defined as

\text{similarity} = 1 - \frac{D}{\max(N_1, N_2)}   (32)

where D is the Levenshtein distance (Navarro, [18]) between the two words and N_1 and N_2 are the lengths of the words. The Levenshtein distance is a measure of how different the words are in terms of how many elementary operations need to be performed on one word to obtain the other word. An elementary operation is either the substitution of any letter with a different letter, the addition of a letter at any position in the word or the removal of any letter in the word.

In equation 32, the Levenshtein distance is normalized by the length of the longer of the two words. The reason for this is that the Levenshtein distance only indicates the absolute difference between two sequences. For example, the words "European" and "Europa" have a Levenshtein distance of 2, because it takes 2 elementary operations to obtain one from the other. However, the same can be said for the words "of" and "in", which are clearly not very lexically similar. By normalizing for word length we obtain a score that gives an indication of how many letters the two words have in common relative to how many are different. An example of this algorithm is shown in example 10.

English: One of the people assassinated very recently in Sri Lanka was Mr Kumar Ponnambalam, who had visited the European Parliament just a few months ago.
Dutch: Een van de mensen die zeer recent in Sri Lanka is vermoord, is de heer Kumar Ponnambalam, die een paar maanden geleden nog een bezoek bracht aan het Europees Parlement.

recently - recent
Sri - Sri
Lanka - Lanka
Kumar - Kumar
Ponnambalam - Ponnambalam
European - Europees
Parliament - Parlement

Example 10. A sentence pair and its NEs, recognized by lexical similarity, given a similarity threshold of 0.4.

While all NEs in this example have been successfully recognized, note how "recently" and "recent" were also recognized as NEs because they are sufficiently similar. This shows that judging NEs by lexical similarity is prone to false positives.
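The similarity measure of formula 32 can be sketched as follows, using a plain dynamic-programming Levenshtein distance; the example words are illustrative only.

```python
# Minimal sketch of the lexical-similarity measure from formula 32, with a
# standard dynamic-programming Levenshtein distance.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(w1, w2):
    return 1 - levenshtein(w1, w2) / max(len(w1), len(w2))

print(round(similarity("European", "Europees"), 2))   # 0.75: fairly similar
print(round(similarity("of", "in"), 2))               # 0.0
```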

Corpus based lexical similarity

Corpus based lexical similarity is like sentence based lexical similarity, but it attempts to avoid some of the problems inherent to the sentence based approach. To combat false positives, the algorithm imposes two conditions on the word pairs.

Firstly, a matching word pair must occur at least a certain number of times in the corpus. Singletons are statistically insignificant, and are even likely to decrease the alignment accuracy.

Secondly, the algorithm calculates an occurrence score for each matching word pair. Even if a matching word pair is found, this pair cannot be considered an NE if the component words appear unpaired too often in the text. The NER algorithm will not consider a matching word pair to be an NE pair if its occurrence score is too low. Formally, for every unique word w and for all sentence pairs (e_i, f_j) in the corpus the occurrence score for that word is calculated as:

\text{score}(w) = \frac{2 \sum_{(e_i, f_j)} \min(w_{e_i}, w_{f_j})}{\sum_i w_{e_i} + \sum_j w_{f_j}}   (33)

where w_{e_i} is the number of occurrences of w in sentence e_i and w_{f_j} is the number of occurrences of its matching word in sentence f_j.

Example 11 shows the recognized NEs in the sentence pair from example 10.
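Under the reading of formula 33 given above, a minimal sketch of the occurrence score for a single matched word pair could look as follows; the toy corpus is invented for illustration.

```python
# Minimal sketch of the occurrence score (formula 33) for one matched word
# pair, counting paired versus unpaired occurrences over a toy corpus.
corpus = [
    ("the European Parliament met", "het Europees Parlement kwam bijeen"),
    ("European rules apply", "er gelden regels"),   # "European" unpaired here
]

def occurrence_score(w_e, w_f):
    paired, total_e, total_f = 0, 0, 0
    for e_sent, f_sent in corpus:
        c_e = e_sent.split().count(w_e)
        c_f = f_sent.split().count(w_f)
        paired += min(c_e, c_f)
        total_e += c_e
        total_f += c_f
    return 2 * paired / (total_e + total_f) if (total_e + total_f) else 0.0

print(occurrence_score("European", "Europees"))   # 2*1 / (2+1) = about 0.67
```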


More information

Brief Introduction to Statistical Mechanics

Brief Introduction to Statistical Mechanics Brif Introduction to Statistical Mchanics. Purpos: Ths nots ar intndd to provid a vry quick introduction to Statistical Mchanics. Th fild is of cours far mor vast than could b containd in ths fw pags.

More information

5.80 Small-Molecule Spectroscopy and Dynamics

5.80 Small-Molecule Spectroscopy and Dynamics MIT OpnCoursWar http://ocw.mit.du 5.80 Small-Molcul Spctroscopy and Dynamics Fall 008 For information about citing ths matrials or our Trms of Us, visit: http://ocw.mit.du/trms. Lctur # 3 Supplmnt Contnts

More information

3-2-1 ANN Architecture

3-2-1 ANN Architecture ARTIFICIAL NEURAL NETWORKS (ANNs) Profssor Tom Fomby Dpartmnt of Economics Soutrn Mtodist Univrsity Marc 008 Artificial Nural Ntworks (raftr ANNs) can b usd for itr prdiction or classification problms.

More information

Introduction to Arithmetic Geometry Fall 2013 Lecture #20 11/14/2013

Introduction to Arithmetic Geometry Fall 2013 Lecture #20 11/14/2013 18.782 Introduction to Arithmtic Gomtry Fall 2013 Lctur #20 11/14/2013 20.1 Dgr thorm for morphisms of curvs Lt us rstat th thorm givn at th nd of th last lctur, which w will now prov. Thorm 20.1. Lt φ:

More information

SECTION where P (cos θ, sin θ) and Q(cos θ, sin θ) are polynomials in cos θ and sin θ, provided Q is never equal to zero.

SECTION where P (cos θ, sin θ) and Q(cos θ, sin θ) are polynomials in cos θ and sin θ, provided Q is never equal to zero. SETION 6. 57 6. Evaluation of Dfinit Intgrals Exampl 6.6 W hav usd dfinit intgrals to valuat contour intgrals. It may com as a surpris to larn that contour intgrals and rsidus can b usd to valuat crtain

More information

MCB137: Physical Biology of the Cell Spring 2017 Homework 6: Ligand binding and the MWC model of allostery (Due 3/23/17)

MCB137: Physical Biology of the Cell Spring 2017 Homework 6: Ligand binding and the MWC model of allostery (Due 3/23/17) MCB37: Physical Biology of th Cll Spring 207 Homwork 6: Ligand binding and th MWC modl of allostry (Du 3/23/7) Hrnan G. Garcia March 2, 207 Simpl rprssion In class, w drivd a mathmatical modl of how simpl

More information

Propositional Logic. Combinatorial Problem Solving (CPS) Albert Oliveras Enric Rodríguez-Carbonell. May 17, 2018

Propositional Logic. Combinatorial Problem Solving (CPS) Albert Oliveras Enric Rodríguez-Carbonell. May 17, 2018 Propositional Logic Combinatorial Problm Solving (CPS) Albrt Olivras Enric Rodríguz-Carbonll May 17, 2018 Ovrviw of th sssion Dfinition of Propositional Logic Gnral Concpts in Logic Rduction to SAT CNFs

More information

General Notes About 2007 AP Physics Scoring Guidelines

General Notes About 2007 AP Physics Scoring Guidelines AP PHYSICS C: ELECTRICITY AND MAGNETISM 2007 SCORING GUIDELINES Gnral Nots About 2007 AP Physics Scoring Guidlins 1. Th solutions contain th most common mthod of solving th fr-rspons qustions and th allocation

More information

4. Money cannot be neutral in the short-run the neutrality of money is exclusively a medium run phenomenon.

4. Money cannot be neutral in the short-run the neutrality of money is exclusively a medium run phenomenon. PART I TRUE/FALSE/UNCERTAIN (5 points ach) 1. Lik xpansionary montary policy, xpansionary fiscal policy rturns output in th mdium run to its natural lvl, and incrass prics. Thrfor, fiscal policy is also

More information

Abstract Interpretation: concrete and abstract semantics

Abstract Interpretation: concrete and abstract semantics Abstract Intrprtation: concrt and abstract smantics Concrt smantics W considr a vry tiny languag that manags arithmtic oprations on intgrs valus. Th (concrt) smantics of th languags cab b dfind by th funzcion

More information

MATH 319, WEEK 15: The Fundamental Matrix, Non-Homogeneous Systems of Differential Equations

MATH 319, WEEK 15: The Fundamental Matrix, Non-Homogeneous Systems of Differential Equations MATH 39, WEEK 5: Th Fundamntal Matrix, Non-Homognous Systms of Diffrntial Equations Fundamntal Matrics Considr th problm of dtrmining th particular solution for an nsmbl of initial conditions For instanc,

More information

Quasi-Classical States of the Simple Harmonic Oscillator

Quasi-Classical States of the Simple Harmonic Oscillator Quasi-Classical Stats of th Simpl Harmonic Oscillator (Draft Vrsion) Introduction: Why Look for Eignstats of th Annihilation Oprator? Excpt for th ground stat, th corrspondnc btwn th quantum nrgy ignstats

More information

Hydrogen Atom and One Electron Ions

Hydrogen Atom and One Electron Ions Hydrogn Atom and On Elctron Ions Th Schrödingr quation for this two-body problm starts out th sam as th gnral two-body Schrödingr quation. First w sparat out th motion of th cntr of mass. Th intrnal potntial

More information

Ch. 24 Molecular Reaction Dynamics 1. Collision Theory

Ch. 24 Molecular Reaction Dynamics 1. Collision Theory Ch. 4 Molcular Raction Dynamics 1. Collision Thory Lctur 16. Diffusion-Controlld Raction 3. Th Matrial Balanc Equation 4. Transition Stat Thory: Th Eyring Equation 5. Transition Stat Thory: Thrmodynamic

More information

Week 3: Connected Subgraphs

Week 3: Connected Subgraphs Wk 3: Connctd Subgraphs Sptmbr 19, 2016 1 Connctd Graphs Path, Distanc: A path from a vrtx x to a vrtx y in a graph G is rfrrd to an xy-path. Lt X, Y V (G). An (X, Y )-path is an xy-path with x X and y

More information

Construction of asymmetric orthogonal arrays of strength three via a replacement method

Construction of asymmetric orthogonal arrays of strength three via a replacement method isid/ms/26/2 Fbruary, 26 http://www.isid.ac.in/ statmath/indx.php?modul=prprint Construction of asymmtric orthogonal arrays of strngth thr via a rplacmnt mthod Tian-fang Zhang, Qiaoling Dng and Alok Dy

More information

Dealing with quantitative data and problem solving life is a story problem! Attacking Quantitative Problems

Dealing with quantitative data and problem solving life is a story problem! Attacking Quantitative Problems Daling with quantitati data and problm soling lif is a story problm! A larg portion of scinc inols quantitati data that has both alu and units. Units can sa your butt! Nd handl on mtric prfixs Dimnsional

More information

Part 7: Capacitance And Capacitors

Part 7: Capacitance And Capacitors Part 7: apacitanc And apacitors 7. Elctric harg And Elctric Filds onsidr a pair of flat, conducting plats, arrangd paralll to ach othr (as in figur 7.) and sparatd by an insulator, which may simply b air.

More information

CPSC 665 : An Algorithmist s Toolkit Lecture 4 : 21 Jan Linear Programming

CPSC 665 : An Algorithmist s Toolkit Lecture 4 : 21 Jan Linear Programming CPSC 665 : An Algorithmist s Toolkit Lctur 4 : 21 Jan 2015 Lcturr: Sushant Sachdva Linar Programming Scrib: Rasmus Kyng 1. Introduction An optimization problm rquirs us to find th minimum or maximum) of

More information

Background: We have discussed the PIB, HO, and the energy of the RR model. In this chapter, the H-atom, and atomic orbitals.

Background: We have discussed the PIB, HO, and the energy of the RR model. In this chapter, the H-atom, and atomic orbitals. Chaptr 7 Th Hydrogn Atom Background: W hav discussd th PIB HO and th nrgy of th RR modl. In this chaptr th H-atom and atomic orbitals. * A singl particl moving undr a cntral forc adoptd from Scott Kirby

More information

Chapter 6 Folding. Folding

Chapter 6 Folding. Folding Chaptr 6 Folding Wintr 1 Mokhtar Abolaz Folding Th folding transformation is usd to systmatically dtrmin th control circuits in DSP architctur whr multipl algorithm oprations ar tim-multiplxd to a singl

More information

NEW APPLICATIONS OF THE ABEL-LIOUVILLE FORMULA

NEW APPLICATIONS OF THE ABEL-LIOUVILLE FORMULA NE APPLICATIONS OF THE ABEL-LIOUVILLE FORMULA Mirca I CÎRNU Ph Dp o Mathmatics III Faculty o Applid Scincs Univrsity Polithnica o Bucharst Cirnumirca @yahoocom Abstract In a rcnt papr [] 5 th indinit intgrals

More information

EEO 401 Digital Signal Processing Prof. Mark Fowler

EEO 401 Digital Signal Processing Prof. Mark Fowler EEO 401 Digital Signal Procssing Prof. Mark Fowlr ot St #18 Introduction to DFT (via th DTFT) Rading Assignmnt: Sct. 7.1 of Proakis & Manolakis 1/24 Discrt Fourir Transform (DFT) W v sn that th DTFT is

More information

LINEAR DELAY DIFFERENTIAL EQUATION WITH A POSITIVE AND A NEGATIVE TERM

LINEAR DELAY DIFFERENTIAL EQUATION WITH A POSITIVE AND A NEGATIVE TERM Elctronic Journal of Diffrntial Equations, Vol. 2003(2003), No. 92, pp. 1 6. ISSN: 1072-6691. URL: http://jd.math.swt.du or http://jd.math.unt.du ftp jd.math.swt.du (login: ftp) LINEAR DELAY DIFFERENTIAL

More information

Properties of Quarks ( ) Isospin. π = 1, 1

Properties of Quarks ( ) Isospin. π = 1, 1 Proprtis of Quarks Isospin So far, w hav discussd thr familis of lptons but principally concntratd on on doublt of quarks, th u and d. W will now introduc othr typs of quarks, along with th nw quantum

More information

2.3 Matrix Formulation

2.3 Matrix Formulation 23 Matrix Formulation 43 A mor complicatd xampl ariss for a nonlinar systm of diffrntial quations Considr th following xampl Exampl 23 x y + x( x 2 y 2 y x + y( x 2 y 2 (233 Transforming to polar coordinats,

More information

Homework #3. 1 x. dx. It therefore follows that a sum of the

Homework #3. 1 x. dx. It therefore follows that a sum of the Danil Cannon CS 62 / Luan March 5, 2009 Homwork # 1. Th natural logarithm is dfind by ln n = n 1 dx. It thrfor follows that a sum of th 1 x sam addnd ovr th sam intrval should b both asymptotically uppr-

More information

Function Spaces. a x 3. (Letting x = 1 =)) a(0) + b + c (1) = 0. Row reducing the matrix. b 1. e 4 3. e 9. >: (x = 1 =)) a(0) + b + c (1) = 0

Function Spaces. a x 3. (Letting x = 1 =)) a(0) + b + c (1) = 0. Row reducing the matrix. b 1. e 4 3. e 9. >: (x = 1 =)) a(0) + b + c (1) = 0 unction Spacs Prrquisit: Sction 4.7, Coordinatization n this sction, w apply th tchniqus of Chaptr 4 to vctor spacs whos lmnts ar functions. Th vctor spacs P n and P ar familiar xampls of such spacs. Othr

More information

CHAPTER 1. Introductory Concepts Elements of Vector Analysis Newton s Laws Units The basis of Newtonian Mechanics D Alembert s Principle

CHAPTER 1. Introductory Concepts Elements of Vector Analysis Newton s Laws Units The basis of Newtonian Mechanics D Alembert s Principle CHPTER 1 Introductory Concpts Elmnts of Vctor nalysis Nwton s Laws Units Th basis of Nwtonian Mchanics D lmbrt s Principl 1 Scinc of Mchanics: It is concrnd with th motion of matrial bodis. odis hav diffrnt

More information

Data Assimilation 1. Alan O Neill National Centre for Earth Observation UK

Data Assimilation 1. Alan O Neill National Centre for Earth Observation UK Data Assimilation 1 Alan O Nill National Cntr for Earth Obsrvation UK Plan Motivation & basic idas Univariat (scalar) data assimilation Multivariat (vctor) data assimilation 3d-Variational Mthod (& optimal

More information

Title: Vibrational structure of electronic transition

Title: Vibrational structure of electronic transition Titl: Vibrational structur of lctronic transition Pag- Th band spctrum sn in th Ultra-Violt (UV) and visibl (VIS) rgions of th lctromagntic spctrum can not intrprtd as vibrational and rotational spctrum

More information

The Equitable Dominating Graph

The Equitable Dominating Graph Intrnational Journal of Enginring Rsarch and Tchnology. ISSN 0974-3154 Volum 8, Numbr 1 (015), pp. 35-4 Intrnational Rsarch Publication Hous http://www.irphous.com Th Equitabl Dominating Graph P.N. Vinay

More information

3 Finite Element Parametric Geometry

3 Finite Element Parametric Geometry 3 Finit Elmnt Paramtric Gomtry 3. Introduction Th intgral of a matrix is th matrix containing th intgral of ach and vry on of its original componnts. Practical finit lmnt analysis rquirs intgrating matrics,

More information

A. Limits and Horizontal Asymptotes ( ) f x f x. f x. x "±# ( ).

A. Limits and Horizontal Asymptotes ( ) f x f x. f x. x ±# ( ). A. Limits and Horizontal Asymptots What you ar finding: You can b askd to find lim x "a H.A.) problm is asking you find lim x "# and lim x "$#. or lim x "±#. Typically, a horizontal asymptot algbraically,

More information

The Matrix Exponential

The Matrix Exponential Th Matrix Exponntial (with xrciss) by D. Klain Vrsion 207.0.05 Corrctions and commnts ar wlcom. Th Matrix Exponntial For ach n n complx matrix A, dfin th xponntial of A to b th matrix A A k I + A + k!

More information

Elements of Statistical Thermodynamics

Elements of Statistical Thermodynamics 24 Elmnts of Statistical Thrmodynamics Statistical thrmodynamics is a branch of knowldg that has its own postulats and tchniqus. W do not attmpt to giv hr vn an introduction to th fild. In this chaptr,

More information

Differentiation of Exponential Functions

Differentiation of Exponential Functions Calculus Modul C Diffrntiation of Eponntial Functions Copyright This publication Th Northrn Albrta Institut of Tchnology 007. All Rights Rsrvd. LAST REVISED March, 009 Introduction to Diffrntiation of

More information

Answer Homework 5 PHA5127 Fall 1999 Jeff Stark

Answer Homework 5 PHA5127 Fall 1999 Jeff Stark Answr omwork 5 PA527 Fall 999 Jff Stark A patint is bing tratd with Drug X in a clinical stting. Upon admiion, an IV bolus dos of 000mg was givn which yildd an initial concntration of 5.56 µg/ml. A fw

More information

Note If the candidate believes that e x = 0 solves to x = 0 or gives an extra solution of x = 0, then withhold the final accuracy mark.

Note If the candidate believes that e x = 0 solves to x = 0 or gives an extra solution of x = 0, then withhold the final accuracy mark. . (a) Eithr y = or ( 0, ) (b) Whn =, y = ( 0 + ) = 0 = 0 ( + ) = 0 ( )( ) = 0 Eithr = (for possibly abov) or = A 3. Not If th candidat blivs that = 0 solvs to = 0 or givs an tra solution of = 0, thn withhold

More information

Mor Tutorial at www.dumblittldoctor.com Work th problms without a calculator, but us a calculator to chck rsults. And try diffrntiating your answrs in part III as a usful chck. I. Applications of Intgration

More information

ANALYSIS IN THE FREQUENCY DOMAIN

ANALYSIS IN THE FREQUENCY DOMAIN ANALYSIS IN THE FREQUENCY DOMAIN SPECTRAL DENSITY Dfinition Th spctral dnsit of a S.S.P. t also calld th spctrum of t is dfind as: + { γ }. jτ γ τ F τ τ In othr words, of th covarianc function. is dfind

More information

Principles of Humidity Dalton s law

Principles of Humidity Dalton s law Principls of Humidity Dalton s law Air is a mixtur of diffrnt gass. Th main gas componnts ar: Gas componnt volum [%] wight [%] Nitrogn N 2 78,03 75,47 Oxygn O 2 20,99 23,20 Argon Ar 0,93 1,28 Carbon dioxid

More information

MEASURING HEAT FLUX FROM A COMPONENT ON A PCB

MEASURING HEAT FLUX FROM A COMPONENT ON A PCB MEASURING HEAT FLUX FROM A COMPONENT ON A PCB INTRODUCTION Elctronic circuit boards consist of componnts which gnrats substantial amounts of hat during thir opration. A clar knowldg of th lvl of hat dissipation

More information

GEOMETRICAL PHENOMENA IN THE PHYSICS OF SUBATOMIC PARTICLES. Eduard N. Klenov* Rostov-on-Don, Russia

GEOMETRICAL PHENOMENA IN THE PHYSICS OF SUBATOMIC PARTICLES. Eduard N. Klenov* Rostov-on-Don, Russia GEOMETRICAL PHENOMENA IN THE PHYSICS OF SUBATOMIC PARTICLES Eduard N. Klnov* Rostov-on-Don, Russia Th articl considrs phnomnal gomtry figurs bing th carrirs of valu spctra for th pairs of th rmaining additiv

More information

COHORT MBA. Exponential function. MATH review (part2) by Lucian Mitroiu. The LOG and EXP functions. Properties: e e. lim.

COHORT MBA. Exponential function. MATH review (part2) by Lucian Mitroiu. The LOG and EXP functions. Properties: e e. lim. MTH rviw part b Lucian Mitroiu Th LOG and EXP functions Th ponntial function p : R, dfind as Proprtis: lim > lim p Eponntial function Y 8 6 - -8-6 - - X Th natural logarithm function ln in US- log: function

More information

On the Hamiltonian of a Multi-Electron Atom

On the Hamiltonian of a Multi-Electron Atom On th Hamiltonian of a Multi-Elctron Atom Austn Gronr Drxl Univrsity Philadlphia, PA Octobr 29, 2010 1 Introduction In this papr, w will xhibit th procss of achiving th Hamiltonian for an lctron gas. Making

More information

The Matrix Exponential

The Matrix Exponential Th Matrix Exponntial (with xrciss) by Dan Klain Vrsion 28928 Corrctions and commnts ar wlcom Th Matrix Exponntial For ach n n complx matrix A, dfin th xponntial of A to b th matrix () A A k I + A + k!

More information

Basic Polyhedral theory

Basic Polyhedral theory Basic Polyhdral thory Th st P = { A b} is calld a polyhdron. Lmma 1. Eithr th systm A = b, b 0, 0 has a solution or thr is a vctorπ such that π A 0, πb < 0 Thr cass, if solution in top row dos not ist

More information

Problem Set #2 Due: Friday April 20, 2018 at 5 PM.

Problem Set #2 Due: Friday April 20, 2018 at 5 PM. 1 EE102B Spring 2018 Signal Procssing and Linar Systms II Goldsmith Problm St #2 Du: Friday April 20, 2018 at 5 PM. 1. Non-idal sampling and rcovry of idal sampls by discrt-tim filtring 30 pts) Considr

More information

Lecture 37 (Schrödinger Equation) Physics Spring 2018 Douglas Fields

Lecture 37 (Schrödinger Equation) Physics Spring 2018 Douglas Fields Lctur 37 (Schrödingr Equation) Physics 6-01 Spring 018 Douglas Filds Rducd Mass OK, so th Bohr modl of th atom givs nrgy lvls: E n 1 k m n 4 But, this has on problm it was dvlopd assuming th acclration

More information

MCE503: Modeling and Simulation of Mechatronic Systems Discussion on Bond Graph Sign Conventions for Electrical Systems

MCE503: Modeling and Simulation of Mechatronic Systems Discussion on Bond Graph Sign Conventions for Electrical Systems MCE503: Modling and Simulation o Mchatronic Systms Discussion on Bond Graph Sign Convntions or Elctrical Systms Hanz ichtr, PhD Clvland Stat Univrsity, Dpt o Mchanical Enginring 1 Basic Assumption In a

More information

EECE 301 Signals & Systems Prof. Mark Fowler

EECE 301 Signals & Systems Prof. Mark Fowler EECE 301 Signals & Systms Prof. Mark Fowlr ot St #21 D-T Signals: Rlation btwn DFT, DTFT, & CTFT 1/16 W can us th DFT to implmnt numrical FT procssing This nabls us to numrically analyz a signal to find

More information

Chapter 13 Aggregate Supply

Chapter 13 Aggregate Supply Chaptr 13 Aggrgat Supply 0 1 Larning Objctivs thr modls of aggrgat supply in which output dpnds positivly on th pric lvl in th short run th short-run tradoff btwn inflation and unmploymnt known as th Phillips

More information

CO-ORDINATION OF FAST NUMERICAL RELAYS AND CURRENT TRANSFORMERS OVERDIMENSIONING FACTORS AND INFLUENCING PARAMETERS

CO-ORDINATION OF FAST NUMERICAL RELAYS AND CURRENT TRANSFORMERS OVERDIMENSIONING FACTORS AND INFLUENCING PARAMETERS CO-ORDINATION OF FAST NUMERICAL RELAYS AND CURRENT TRANSFORMERS OVERDIMENSIONING FACTORS AND INFLUENCING PARAMETERS Stig Holst ABB Automation Products Swdn Bapuji S Palki ABB Utilitis India This papr rports

More information

u r du = ur+1 r + 1 du = ln u + C u sin u du = cos u + C cos u du = sin u + C sec u tan u du = sec u + C e u du = e u + C

u r du = ur+1 r + 1 du = ln u + C u sin u du = cos u + C cos u du = sin u + C sec u tan u du = sec u + C e u du = e u + C Tchniqus of Intgration c Donald Kridr and Dwight Lahr In this sction w ar going to introduc th first approachs to valuating an indfinit intgral whos intgrand dos not hav an immdiat antidrivativ. W bgin

More information

Problem Set 6 Solutions

Problem Set 6 Solutions 6.04/18.06J Mathmatics for Computr Scinc March 15, 005 Srini Dvadas and Eric Lhman Problm St 6 Solutions Du: Monday, March 8 at 9 PM in Room 3-044 Problm 1. Sammy th Shark is a financial srvic providr

More information

4037 ADDITIONAL MATHEMATICS

4037 ADDITIONAL MATHEMATICS CAMBRIDGE INTERNATIONAL EXAMINATIONS GCE Ordinary Lvl MARK SCHEME for th Octobr/Novmbr 0 sris 40 ADDITIONAL MATHEMATICS 40/ Papr, maimum raw mark 80 This mark schm is publishd as an aid to tachrs and candidats,

More information

Forces. Quantum ElectroDynamics. α = = We have now:

Forces. Quantum ElectroDynamics. α = = We have now: W hav now: Forcs Considrd th gnral proprtis of forcs mdiatd by xchang (Yukawa potntial); Examind consrvation laws which ar obyd by (som) forcs. W will nxt look at thr forcs in mor dtail: Elctromagntic

More information

The graph of y = x (or y = ) consists of two branches, As x 0, y + ; as x 0, y +. x = 0 is the

The graph of y = x (or y = ) consists of two branches, As x 0, y + ; as x 0, y +. x = 0 is the Copyright itutcom 005 Fr download & print from wwwitutcom Do not rproduc by othr mans Functions and graphs Powr functions Th graph of n y, for n Q (st of rational numbrs) y is a straight lin through th

More information

Exam 1. It is important that you clearly show your work and mark the final answer clearly, closed book, closed notes, no calculator.

Exam 1. It is important that you clearly show your work and mark the final answer clearly, closed book, closed notes, no calculator. Exam N a m : _ S O L U T I O N P U I D : I n s t r u c t i o n s : It is important that you clarly show your work and mark th final answr clarly, closd book, closd nots, no calculator. T i m : h o u r

More information

PHA 5127 Answers Homework 2 Fall 2001

PHA 5127 Answers Homework 2 Fall 2001 PH 5127 nswrs Homwork 2 Fall 2001 OK, bfor you rad th answrs, many of you spnt a lot of tim on this homwork. Plas, nxt tim if you hav qustions plas com talk/ask us. Thr is no nd to suffr (wll a littl suffring

More information

Middle East Technical University Department of Mechanical Engineering ME 413 Introduction to Finite Element Analysis

Middle East Technical University Department of Mechanical Engineering ME 413 Introduction to Finite Element Analysis Middl East Tchnical Univrsity Dpartmnt of Mchanical Enginring ME 43 Introduction to Finit Elmnt Analysis Chaptr 3 Computr Implmntation of D FEM Ths nots ar prpard by Dr. Cünyt Srt http://www.m.mtu.du.tr/popl/cunyt

More information

Coupled Pendulums. Two normal modes.

Coupled Pendulums. Two normal modes. Tim Dpndnt Two Stat Problm Coupld Pndulums Wak spring Two normal mods. No friction. No air rsistanc. Prfct Spring Start Swinging Som tim latr - swings with full amplitud. stationary M +n L M +m Elctron

More information

On spanning trees and cycles of multicolored point sets with few intersections

On spanning trees and cycles of multicolored point sets with few intersections On spanning trs and cycls of multicolord point sts with fw intrsctions M. Kano, C. Mrino, and J. Urrutia April, 00 Abstract Lt P 1,..., P k b a collction of disjoint point sts in R in gnral position. W

More information

Bifurcation Theory. , a stationary point, depends on the value of α. At certain values

Bifurcation Theory. , a stationary point, depends on the value of α. At certain values Dnamic Macroconomic Thor Prof. Thomas Lux Bifurcation Thor Bifurcation: qualitativ chang in th natur of th solution occurs if a paramtr passs through a critical point bifurcation or branch valu. Local

More information

First order differential equation Linear equation; Method of integrating factors

First order differential equation Linear equation; Method of integrating factors First orr iffrntial quation Linar quation; Mtho of intgrating factors Exampl 1: Rwrit th lft han si as th rivativ of th prouct of y an som function by prouct rul irctly. Solving th first orr iffrntial

More information

Slide 1. Slide 2. Slide 3 DIGITAL SIGNAL PROCESSING CLASSIFICATION OF SIGNALS

Slide 1. Slide 2. Slide 3 DIGITAL SIGNAL PROCESSING CLASSIFICATION OF SIGNALS Slid DIGITAL SIGAL PROCESSIG UIT I DISCRETE TIME SIGALS AD SYSTEM Slid Rviw of discrt-tim signals & systms Signal:- A signal is dfind as any physical quantity that varis with tim, spac or any othr indpndnt

More information

Extraction of Doping Density Distributions from C-V Curves

Extraction of Doping Density Distributions from C-V Curves Extraction of Doping Dnsity Distributions from C-V Curvs Hartmut F.-W. Sadrozinski SCIPP, Univ. California Santa Cruz, Santa Cruz, CA 9564 USA 1. Connction btwn C, N, V Start with Poisson quation d V =

More information

Thus, because if either [G : H] or [H : K] is infinite, then [G : K] is infinite, then [G : K] = [G : H][H : K] for all infinite cases.

Thus, because if either [G : H] or [H : K] is infinite, then [G : K] is infinite, then [G : K] = [G : H][H : K] for all infinite cases. Homwork 5 M 373K Solutions Mark Lindbrg and Travis Schdlr 1. Prov that th ring Z/mZ (for m 0) is a fild if and only if m is prim. ( ) Proof by Contrapositiv: Hr, thr ar thr cass for m not prim. m 0: Whn

More information

Searching Linked Lists. Perfect Skip List. Building a Skip List. Skip List Analysis (1) Assume the list is sorted, but is stored in a linked list.

Searching Linked Lists. Perfect Skip List. Building a Skip List. Skip List Analysis (1) Assume the list is sorted, but is stored in a linked list. 3 3 4 8 6 3 3 4 8 6 3 3 4 8 6 () (d) 3 Sarching Linkd Lists Sarching Linkd Lists Sarching Linkd Lists ssum th list is sortd, but is stord in a linkd list. an w us binary sarch? omparisons? Work? What if

More information

Calculus concepts derivatives

Calculus concepts derivatives All rasonabl fforts hav bn mad to mak sur th nots ar accurat. Th author cannot b hld rsponsibl for any damags arising from th us of ths nots in any fashion. Calculus concpts drivativs Concpts involving

More information