This excerpt is from Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schütze, The MIT Press.
This excerpt from Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schütze, The MIT Press, is provided in screen-viewable form for personal use only by members of MIT CogNet. Unauthorized use or dissemination of this information is expressly forbidden. If you have any questions about this material, please contact cognetadmin@cognet.mit.edu.
6 Statistical Inference: n-gram Models over Sparse Data

Statistical NLP aims to do statistical inference for the field of natural language. Statistical inference in general consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about this distribution. For example, we might look at lots of instances of prepositional phrase attachments in a corpus, and use them to try to predict prepositional phrase attachments for English in general.

The discussion in this chapter divides the problem into three areas (although they tend to overlap considerably): dividing the training data into equivalence classes, finding a good statistical estimator for each equivalence class, and combining multiple estimators.

As a running example of statistical estimation, we will examine the classic task of language modeling, where the problem is to predict the next word given the previous words. This task is fundamental to speech or optical character recognition, and is also used for spelling correction, handwriting recognition, and statistical machine translation. This sort of task is often referred to as a Shannon game following the presentation of the task of guessing the next letter in a text in (Shannon 1951). This problem has been well-studied, and indeed many estimation methods were first developed for this task. In general, though, the methods we develop are not specific to this task, and can be directly used for other tasks like word sense disambiguation or probabilistic parsing. The word prediction task just provides a clear, easily understood problem for which the techniques can be developed.
6.1 Bins: Forming Equivalence Classes

6.1.1 Reliability vs. discrimination

Normally, in order to do inference about one feature, we wish to find other features of the model that predict it. Here, we are assuming that past behavior is a good guide to what will happen in the future (that is, that the model is roughly stationary). This gives us a classification task: we try to predict the target feature on the basis of various classificatory features. When doing this, we effectively divide the data into equivalence classes that share values for certain of the classificatory features, and use this equivalence classing to help predict the value of the target feature on new pieces of data. This means that we are tacitly making independence assumptions: the data either does not depend on other features, or the dependence is sufficiently minor that we hope that we can neglect it without doing too much harm. The more classificatory features (of some relevance) that we identify, the more finely the conditions that determine the unknown probability distribution of the target feature can potentially be teased apart. In other words, dividing the data into many bins gives us greater discrimination. Going against this is the problem that if we use a lot of bins then a particular bin may contain no or a very small number of training instances, and then we will not be able to do statistically reliable estimation of the target feature for that bin. Finding equivalence classes that are a good compromise between these two criteria is our first goal.

6.1.2 n-gram models

The task of predicting the next word can be stated as attempting to estimate the probability function P:

(6.1)  P(w_n | w_1, ..., w_{n-1})

In such a stochastic problem, we use a classification of the previous words, the history, to predict the next word. On the basis of having looked at a lot of text, we know which words tend to follow other words.
For this task, we cannot possibly consider each textual history separately: most of the time we will be listening to a sentence that we have never heard before, and so there is no previous identical textual history on which to base our predictions; and even if we had heard the beginning of the sentence before, it might end differently this time. And so we
need a method of grouping histories that are similar in some way so as to give reasonable predictions as to which words we can expect to come next. One possible way to group them is by making a Markov assumption that only the prior local context (the last few words) affects the next word. If we construct a model where all histories that have the same last n - 1 words are placed in the same equivalence class, then we have an (n - 1)th order Markov model or an n-gram word model (the last word of the n-gram being given by the word we are predicting).

Before continuing with model-building, let us pause for a brief interlude on naming. The cases of n-gram models that people usually use are for n = 2, 3, 4, and these alternatives are usually referred to as a bigram, a trigram, and a four-gram model, respectively. Revealing this will surely be enough to cause any Classicists who are reading this book to stop, and to leave the field to uneducated engineering sorts: gram is a Greek root and so should be put together with Greek number prefixes. Shannon actually did use the term digram, but with the declining levels of education in recent decades, this usage has not survived. As non-prescriptive linguists, however, we think that the curious mixture of English, Greek, and Latin that our colleagues actually use is quite fun. So we will not try to stamp it out.¹

Now in principle, we would like the n of our n-gram models to be fairly large, because there are sequences of words like:

(6.2)  Sue swallowed the large green ___.

where swallowed is presumably still quite strongly influencing which word will come next: pill or perhaps frog are likely continuations, but tree, car or mountain are presumably unlikely, even though they are in general fairly natural continuations after the large green. However, there is the problem that if we divide the data into too many bins, then there are a lot of parameters to estimate.
For instance, if we conservatively assume that a speaker is staying within a vocabulary of 20,000 words, then we get the estimates for numbers of parameters² shown in table 6.1.

1. Rather than four-gram, some people do make an attempt at appearing educated by saying quadgram, but this is not really correct use of a Latin number prefix (which would give quadrigram, cf. quadrilateral), let alone correct use of a Greek number prefix, which would give us a tetragram model.
2. Given a certain model space (here word n-gram models), the parameters are the numbers that we have to specify to determine a particular model within that model space.
Model                          Parameters
1st order (bigram model):      20,000 x 19,999 = 400 million
2nd order (trigram model):     20,000^2 x 19,999 = 8 trillion
3rd order (four-gram model):   20,000^3 x 19,999 = 1.6 x 10^17

Table 6.1  Growth in number of parameters for n-gram models.

So we quickly see that producing a five-gram model, of the sort that we thought would be useful above, may well not be practical, even if we have what we think is a very large corpus. For this reason, n-gram systems currently usually use bigrams or trigrams (and often make do with a smaller vocabulary).

One way of reducing the number of parameters is to reduce the value of n, but it is important to realize that n-grams are not the only way of forming equivalence classes of the history. Among other operations of equivalencing, we could consider stemming (removing the inflectional endings from words) or grouping words into semantic classes (by use of a pre-existing thesaurus, or by some induced clustering). This is effectively reducing the vocabulary size over which we form n-grams. But we do not need to use n-grams at all. There are myriad other ways of forming equivalence classes of the history; it is just that they are all a bit more complicated than n-grams. The above example suggests that knowledge of the predicate in a clause is useful, so we can imagine a model that predicts the next word based on the previous word and the previous predicate (no matter how far back it is). But this model is harder to implement, because we first need a fairly accurate method of identifying the main predicate of a clause. Therefore we will just use n-gram models in this chapter, but other techniques are covered in chapters 12 and 14.

For anyone from a linguistics background, the idea that we would choose to use a model of language structure which predicts the next word simply by examining the previous two words, with no reference to the structure of the sentence, seems almost preposterous.
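The parameter counts in table 6.1 can be reproduced in a few lines. A minimal sketch (the function name is ours; V = 20,000 is the conservative vocabulary size assumed above):

```python
# Number of parameters of an n-gram model over a vocabulary of V words:
# V**(n-1) histories (bins), each needing V - 1 free probabilities
# (the last probability is fixed by the sum-to-one constraint).
def ngram_parameters(n, V=20_000):
    return V ** (n - 1) * (V - 1)

for n, name in [(2, "bigram"), (3, "trigram"), (4, "four-gram")]:
    print(f"{name} model: {ngram_parameters(n):,} parameters")
```

For n = 5 this gives roughly 3 x 10^21 parameters, which is why the five-gram model contemplated above is impractical.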
2. (continued) Since we are assuming nothing in particular about the probability distribution, the number of parameters to be estimated is the number of bins times one less than the number of values of the target feature (one is subtracted because the probability of the last target value is automatically given by the stochastic constraint that probabilities should sum to one).

But, actually, the
lexical co-occurrence, semantic, and basic syntactic relationships that appear in this very local context are a good predictor of the next word, and such systems work surprisingly well. Indeed, it is difficult to beat a trigram model on the purely linear task of predicting the next word.

6.1.3 Building n-gram models

In the final part of some sections of this chapter, we will actually build some models and show the results. The reader should be able to recreate our results by using the tools and data on the accompanying website. The text that we will use is Jane Austen's novels, and is available from the website. This corpus has two advantages: (i) it is freely available through the work of Project Gutenberg, and (ii) it is not too large. The small size of the corpus is, of course, in many ways also a disadvantage. Because of the huge number of parameters of n-gram models, as discussed above, n-gram models work best when trained on enormous amounts of data. However, such training requires a lot of CPU time and disk space, so a small corpus is much more appropriate for a textbook example. Even so, you will want to make sure that you start off with about 40 MB of free disk space before attempting to recreate our examples.

As usual, the first step is to preprocess the corpus. The Project Gutenberg Austen texts are very clean plain ASCII files. But nevertheless, there are the usual problems of punctuation marks attaching to words and so on (see chapter 4) that mean that we must do more than simply split on whitespace. We decided that we could make do with some very simple search-and-replace patterns that removed all punctuation, leaving whitespace-separated words (see the website for details). We decided to use Emma, Mansfield Park, Northanger Abbey, Pride and Prejudice, and Sense and Sensibility as our corpus for building models, reserving Persuasion for testing, as discussed below. This gave us a (small) training corpus of N = 617,091 words of text, containing a vocabulary V of 14,585 word types.
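The preprocessing and counting step can be sketched as follows. The exact search-and-replace patterns are on the book's website; the regular expression below is our own illustrative stand-in, not the book's:

```python
import re

def tokenize(text):
    """Remove punctuation and split on whitespace, as described above.
    The regex is an assumed stand-in for the book's replacement patterns."""
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    return text.split()

corpus = "Emma Woodhouse, handsome, clever, and rich, was happy."
tokens = tokenize(corpus)
N = len(tokens)           # corpus size in word tokens
V = len(set(tokens))      # vocabulary size in word types
```

Run over the five training novels, this kind of count is what yields the N = 617,091 tokens and V = 14,585 types quoted above.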
By simply removing all punctuation as we did, our file is literally a long sequence of words. This isn't actually what people do most of the time. It is commonly felt that there are not very strong dependencies between sentences, while sentences tend to begin in characteristic ways. So people mark the sentences in the text, most commonly by surrounding them with the SGML tags <s> and </s>. The probability calculations at the
start of a sentence are then dependent not on the last words of the preceding sentence but upon a beginning-of-sentence context. We should additionally note that we didn't remove case distinctions, so capitalized words remain in the data, imperfectly indicating where new sentences begin.

6.2 Statistical Estimators

Given a certain number of pieces of training data that fall into a certain bin, the second goal is then finding out how to derive a good probability estimate for the target feature based on these data. For our running example of n-grams, we will be interested in P(w_1...w_n) and the prediction task P(w_n | w_1...w_{n-1}). Since:

(6.3)  P(w_n | w_1...w_{n-1}) = P(w_1...w_n) / P(w_1...w_{n-1})

estimating good conditional probability distributions can be reduced to having good solutions to simply estimating the unknown probability distribution of n-grams.³

Let us assume that the training text consists of N words. If we append n - 1 dummy start symbols to the beginning of the text, we can then also say that the corpus consists of N n-grams, with a uniform amount of conditioning available for the next word in all cases. Let B be the number of bins (equivalence classes). This will be V^{n-1}, where V is the vocabulary size, for the task of working out the next word, and V^n for the task of estimating the probability of different n-grams. Let C(w_1...w_n) be the frequency of a certain n-gram in the training text, and let us say that there are N_r n-grams that appeared r times in the training text (i.e., N_r = |{w_1...w_n : C(w_1...w_n) = r}|). These frequencies of frequencies are very commonly used in the estimation methods which we cover below. This notation is summarized in table 6.2.

3. However, when smoothing, one has a choice of whether to smooth the n-gram probability estimates, or to smooth the conditional probability distributions directly.
For many methods, these do not give equivalent results, since in the latter case one is separately smoothing a large number of conditional probability distributions (which normally need to be themselves grouped into classes in some way).
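The counts C(w_1...w_n) and the frequencies of frequencies N_r are easy to compute directly. A minimal sketch on a toy token list (function names are our own):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """C(w_1...w_n): frequency of each n-gram in the training text."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def freq_of_freqs(counts):
    """N_r: the number of distinct n-grams seen exactly r times."""
    return Counter(counts.values())

tokens = "to be or not to be".split()
C = ngram_counts(tokens, 2)   # bigram counts: ('to','be') occurs twice
Nr = freq_of_freqs(C)         # three bigrams seen once, one seen twice
```

These two dictionaries are all that most of the estimators in this section need as input.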
N              Number of training instances
B              Number of bins training instances are divided into
w_1n           An n-gram w_1...w_n in the training text
C(w_1...w_n)   Frequency of n-gram w_1...w_n in training text
r              Frequency of an n-gram
f(.)           Frequency estimate of a model
N_r            Number of bins that have r training instances in them
T_r            Total count of n-grams of frequency r in further data
h              History of preceding words

Table 6.2  Notation for the statistical estimation chapter.

6.2.1 Maximum Likelihood Estimation (MLE)

MLE estimates from relative frequencies

Regardless of how we form equivalence classes, we will end up with bins that contain a certain number of training instances. Let us assume a trigram model where we are using the two preceding words of context to predict the next word, and let us focus in on the bin for the case where the two preceding words were comes across. In a certain corpus, the authors found 10 training instances of the words comes across, and of those, 8 times they were followed by as, once by more, and once by a. The question at this point is what probability estimates we should use for estimating the next word.

The obvious first answer (at least from a frequentist point of view) is to suggest using the relative frequency as a probability estimate:

P(as) = 0.8
P(more) = 0.1
P(a) = 0.1
P(x) = 0.0  for x not among the above 3 words

This estimate is called the maximum likelihood estimate (MLE):

(6.4)  P_MLE(w_1...w_n) = C(w_1...w_n) / N

(6.5)  P_MLE(w_n | w_1...w_{n-1}) = C(w_1...w_n) / C(w_1...w_{n-1})
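Equations (6.4) and (6.5) are straightforward to replay in code on the comes across example. A sketch, using only the counts quoted above:

```python
from collections import Counter

# Counts for the bin where the two preceding words were "comes across":
# 10 training instances, 8 followed by "as", 1 by "more", 1 by "a".
trigram_counts = Counter({("comes", "across", "as"): 8,
                          ("comes", "across", "more"): 1,
                          ("comes", "across", "a"): 1})
bigram_counts = Counter({("comes", "across"): 10})

def p_mle(next_word, history):
    """P_MLE(w_n | w_1...w_{n-1}) = C(w_1...w_n) / C(w_1...w_{n-1})  (eq. 6.5)."""
    return trigram_counts[history + (next_word,)] / bigram_counts[history]
```

Because Counter returns 0 for missing keys, any word not seen after comes across (such as the or some) gets probability 0.0, which is exactly the weakness discussed next.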
If one fixes the observed data, and then considers the space of all possible parameter assignments within a certain distribution (here a trigram model) given the data, then statisticians refer to this as a likelihood function. The maximum likelihood estimate is so called because it is the choice of parameter values which gives the highest probability to the training corpus.⁴ The estimate that does that is the one shown above. It does not waste any probability mass on events that are not in the training corpus, but rather it makes the probability of observed events as high as it can, subject to the normal stochastic constraints.

But the MLE is in general unsuitable for statistical inference in NLP. The problem is the sparseness of our data (even if we are using a large corpus). While a few words are common, the vast majority of words are very uncommon, and longer n-grams involving them are thus much rarer again. The MLE assigns a zero probability to unseen events, and since the probability of a long string is generally computed by multiplying the probabilities of subparts, these zeroes will propagate and give us bad (zero probability) estimates for the probability of sentences when we just happened not to see certain n-grams in the training text.⁵ With respect to the example above, the MLE is not capturing the fact that there are other words which can follow comes across, for example the and some.

As an example of data sparseness, after training on 1.5 million words from the IBM Laser Patent Text corpus, Bahl et al. (1983) report that 23% of the trigram tokens found in further test data drawn from the same corpus were previously unseen. This corpus is small by modern standards, and so one might hope that by collecting much more data the problem of data sparseness would simply go away. While this may initially seem hopeful (if we collect a hundred instances of comes across, we will probably find instances with it followed by the and some), in practice it is never a general solution to the problem.
While there are a limited number of frequent events in language, there is a seemingly never-ending tail to the probability distribution of rarer and rarer events, and we can never collect enough data to get to the end of the tail.⁶ For instance, comes across could be followed by any number, and we will never see every number.

In general, we need to devise better estimators that allow for the possibility that we will see events that we didn't see in the training text. All such methods effectively work by somewhat decreasing the probability of previously seen events, so that there is a little bit of probability mass left over for previously unseen events. Thus these methods are frequently referred to as discounting methods. The process of discounting is often referred to as smoothing, presumably because a distribution without zeroes is smoother than one with zeroes. We will examine a number of smoothing methods in the following sections.

Using MLE estimates for n-gram models of Austen

Based on our Austen corpus, we made n-gram models for different values of n. It is quite straightforward to write one's own program to do this, by totalling up the frequencies of n-grams and (n - 1)-grams, and then dividing to get MLE probability estimates, but there is also software to do it on the website. In practical systems, it is usual not to actually calculate n-grams for all words. Rather, the n-grams are calculated as usual only for the most common k words, and all other words are regarded as Out-Of-Vocabulary (OOV) items and mapped to a single token such as <UNK>. Commonly, this will be done for all words that have been encountered only once in the training corpus (hapax legomena). A useful variant in some domains is to notice the obvious semantic and distributional similarity of rare numbers and to have two out-of-vocabulary tokens, one for numbers and one for everything else.

4. This is given that the occurrence of a certain n-gram is assumed to be a random variable with a binomial distribution (i.e., each n-gram is independent of the next). This is a quite untrue (though usable) assumption: firstly, each n-gram overlaps with, and hence partly determines, the next, and secondly, content words tend to clump (if you use a word once in a paper, you are likely to use it again), as we discuss later in the book.
5. Another way to state this is to observe that if our probability model assigns zero probability to any event that turns out to actually occur, then both the cross-entropy and the KL divergence with respect to (data from) the real probability distribution is infinite. In other words, we have done a maximally bad job at producing a probability function that is close to the one we are trying to model.
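The OOV mapping described above can be sketched as follows. The <UNK> token is from the text; the <NUM> token for the two-token variant and the count threshold are our own illustrative choices:

```python
from collections import Counter

def map_oov(tokens, min_count=2):
    """Replace hapax legomena (words seen fewer than min_count times)
    with <UNK>, giving rare numbers their own token per the variant above."""
    counts = Counter(tokens)
    out = []
    for w in tokens:
        if counts[w] >= min_count:
            out.append(w)
        elif w.isdigit():
            out.append("<NUM>")   # assumed token name for rare numbers
        else:
            out.append("<UNK>")
    return out

tokens = "the cat sat on the 42 mat".split()
mapped = map_oov(tokens)   # every once-seen word collapses to <UNK>/<NUM>
```

After this pass, n-gram counts are collected over the mapped tokens, so all rare words share a single row of the model's parameter tables.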
Because of the Zipfian distribution of words, cutting out low-frequency items will greatly reduce the parameter space (and the memory requirements of the system being built), while not appreciably affecting the model quality (hapax legomena often constitute half of the types, but only a fraction of the tokens).

We used the conditional probabilities calculated from our training corpus to work out the probabilities of each following word for part of a

6. Cf. Zipf's law: the observation that the relationship between a word's frequency and the rank order of its frequency is roughly a reciprocal curve, as discussed in chapter 1.
[Table 6.3 layout: for each of the six prediction positions in the clause In person she was inferior to both sisters, the words predicted by 1-gram, 2-gram, 3-gram, and 4-gram MLE models, listed by rank with their probabilities; the 1-gram column ranks the globally frequent words the, to, and, of, ... at every position, while most trigram and four-gram histories are marked Unseen, and inferior receives probability 0 everywhere except under the 1-gram model.]

Table 6.3  Probabilities of each successive word for a clause from Persuasion. The probability distribution for the following word is calculated by Maximum Likelihood Estimate n-gram models for various values of n. The predicted likelihood rank of different words is shown in the first column. The actual next word is shown at the top of the table in italics, and in the table in bold.
sentence from our test corpus, Persuasion. We will cover the issue of test corpora in more detail later, but it is vital for assessing a model that we try it on different data; otherwise it isn't a fair test of how well the model allows us to predict the patterns of language. Extracts from these probability distributions, including the actual next word shown in bold, are shown in table 6.3.

The unigram distribution ignores context entirely, and simply uses the overall frequency of different words. But this is not entirely useless, since, as in this clause, most words in most sentences are common words. The bigram model uses the preceding word to help predict the next word. In general, this helps enormously, and gives us a much better model. In some cases the estimated probability of the word that actually comes next has gone up by about an order of magnitude (was, to, sisters). However, note that the bigram model is not guaranteed to increase the probability estimate. The estimate for she has actually gone down, because she is in general very common in Austen novels (being mainly books about women), but somewhat unexpected after the noun person, although quite possible when an adverbial phrase is being used, such as In person here. The failure to predict inferior after was shows problems of data sparseness already starting to crop up.

When the trigram model works, it can work brilliantly. For example, it gives us a probability estimate of 0.5 for was following person she. But in general it is not usable. Either the preceding bigram was never seen before, and then there is no probability distribution for the following word, or a few words have been seen following that bigram, but the data is so sparse that the resulting estimates are highly unreliable. For example, the bigram to both was seen 9 times in the training text, twice followed by to, and once each followed by 7 other words, a few of which are shown in the table. This is not the kind of density of data on which one can sensibly build a probabilistic model. The four-gram model is entirely useless.
In general, four-gram models do not become usable until one is training on several tens of millions of words of data.

Examining the table suggests an obvious strategy: use higher order n-gram models when one has seen enough data for them to be of some use, but back off to lower order n-gram models when there isn't enough data. This is a widely used strategy, which we will discuss below in the section on combining estimates, but it isn't by itself a complete solution to the problem of n-gram estimates. For instance, we saw quite a lot of words following was in the training data (9409 tokens of 1481 types) but inferior was not one of them. Similarly, although we had seen quite
a lot of words in our training text overall, there are many words that did not appear, including perfectly ordinary words like decides or wart. So regardless of how we combine estimates, we still definitely need a way to give a non-zero probability estimate to words or n-grams that we happened not to see in our training text, and so we will work on that problem first.

6.2.2 Laplace's law, Lidstone's law and the Jeffreys-Perks law

Laplace's law

The manifest failure of maximum likelihood estimation forces us to examine better estimators. The oldest solution is to employ Laplace's law (1814; 1995). According to this law,

(6.6)  P_Lap(w_1...w_n) = (C(w_1...w_n) + 1) / (N + B)

This process is often informally referred to as adding one, and has the effect of giving a little bit of the probability space to unseen events. But rather than simply being an unprincipled move, this is actually the Bayesian estimator that one derives if one assumes a uniform prior on events (i.e., that every n-gram was equally likely).

However, note that the estimates which Laplace's law gives are dependent on the size of the vocabulary. For sparse sets of data over large vocabularies, such as n-grams, Laplace's law actually gives far too much of the probability space to unseen events. Consider some data discussed by Church and Gale (1991a) in the context of their discussion of various estimators for bigrams. Their corpus of 44 million words of Associated Press (AP) newswire yielded a vocabulary of 400,653 words (maintaining case distinctions, splitting on hyphens, etc.). Note that this vocabulary size means that there is a space of 400,653² = 1.6 x 10¹¹ possible bigrams, and so a priori barely any of them will actually occur in the corpus. It also means that, in the calculation of P_Lap, B is far larger than N, and Laplace's method is completely unsatisfactory in such circumstances.

Church and Gale used half the corpus (22 million words) as a training text.
Table 6.4 shows the expected frequency estimates of various methods that they discuss, and Laplace's law estimates that we have calculated. Probability estimates can be derived by dividing the frequency estimates by the number of n-grams, N = 22 million. For Laplace's law, the probability estimate for an n-gram seen r times is
r = f_MLE   f_empirical   f_Lap   f_del   f_GT   N_r   T_r

[Table 6.4 rows: r = 0 through 9; numeric entries omitted.]

Table 6.4  Estimated frequencies for the AP data from Church and Gale (1991a). The first five columns show the estimated frequency calculated for a bigram that actually appeared r times in the training data according to different estimators: f_MLE is the maximum likelihood estimate, f_empirical uses validation on the test set, f_Lap is the add one method, f_del is deleted interpolation (two-way cross validation, using the training data), and f_GT is the Good-Turing estimate. The last two columns give the frequencies of frequencies and how often bigrams of a certain frequency occurred in further text.

(r + 1)/(N + B), so the frequency estimate becomes f_Lap = (r + 1)N/(N + B). These estimated frequencies are often easier for humans to interpret than probabilities, as one can more easily see the effect of the discounting. Although each previously unseen bigram has been given a very low probability, because there are so many of them, 46.5% of the probability space has actually been given to unseen bigrams.⁷ This is far too much, and it is done at the cost of enormously reducing the probability estimates of more frequent events. How do we know it is far too much? The second column of the table shows an empirically determined estimate (which we discuss below) of how often unseen n-grams actually appeared in further text, and we see that the individual frequency of occurrence of previously unseen n-grams is much lower than Laplace's law predicts, while the frequency of occurrence of previously seen n-grams is much higher than predicted.⁸ In particular, the empirical model finds that only 9.2% of the bigrams in further text were previously unseen.

7. This is calculated as N₀ · P_Lap(·) = 74,671,100,000 x 0.000137/22,000,000 ≈ 0.465.
8. It is a bit hard dealing with the astronomical numbers in the table. A smaller example which illustrates the same point appears in exercise 6.2.
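The 46.5% figure quoted above can be checked directly from equation (6.6) and the numbers in the Church and Gale discussion. A sketch (function and variable names are ours):

```python
# Laplace's law (eq. 6.6): P_Lap = (C(w_1...w_n) + 1) / (N + B)
def p_lap(count, N, B):
    return (count + 1) / (N + B)

# Church and Gale's AP bigram figures, as quoted above:
N = 22_000_000           # bigram tokens in the 22-million-word training half
B = 400_653 ** 2         # possible bigrams over the 400,653-word vocabulary
N0 = 74_671_100_000      # previously unseen bigram types (table 6.4)

# Total probability mass that add-one smoothing assigns to unseen bigrams:
unseen_mass = N0 * p_lap(0, N, B)   # roughly 0.465, i.e. 46.5%
```

Because B dwarfs N here, even a count-1 bigram is discounted from frequency 1 to about N/(N + B) ≈ 0.000274 times 2, which is the "enormous reduction" the text describes.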
Lidstone's law and the Jeffreys-Perks law

Because of this overestimation, a commonly adopted solution to the problem of multinomial estimation within statistical practice is Lidstone's law of succession, where we add not one, but some (normally smaller) positive value λ:

(6.7)  P_Lid(w_1...w_n) = (C(w_1...w_n) + λ) / (N + Bλ)

This method was developed by the actuaries Hardy and Lidstone, and Johnson showed that it can be viewed as a linear interpolation (see below) between the MLE estimate and a uniform prior. This may be seen by setting μ = N/(N + Bλ):

(6.8)  P_Lid(w_1...w_n) = μ · C(w_1...w_n)/N + (1 - μ) · 1/B

The most widely used value for λ is 1/2. This choice can be theoretically justified as being the expectation of the same quantity which is maximized by MLE, and so it has its own names: the Jeffreys-Perks law, or Expected Likelihood Estimation (ELE) (Box and Tiao 1973: 34-36).

In practice, this often helps. For example, we could avoid the objection above that too much of the probability space was being given to unseen events by choosing a small λ. But there are two remaining objections: (i) we need a good way to guess an appropriate value for λ in advance, and (ii) discounting using Lidstone's law always gives probability estimates linear in the MLE frequency, and this is not a good match to the empirical distribution at low frequencies.

Applying these methods to Austen

Despite the problems inherent in these methods, we will nevertheless try applying them, in particular ELE, to our Austen corpus. Recall that up until now the only probability estimate we have been able to derive for the test corpus clause she was inferior to both sisters was the unigram estimate, which (multiplying through the bold probabilities in the top part of table 6.3) gives an estimate for the probability of the clause. For the other models, the probability estimate was either zero or undefined, because of the sparseness of the data.

Let us now calculate a probability estimate for this clause using a bigram model and ELE.
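Lidstone's law (6.7), specialized to a conditional bigram distribution with λ = 1/2 (i.e., ELE), can be sketched as follows, using the Austen counts discussed in this section (function name is ours):

```python
def p_lid(count, history_count, V, lam=0.5):
    """Lidstone's law (eq. 6.7) for a conditional bigram distribution:
    (C + lam) / (N + B*lam), where N is the count of the conditioning
    word and B = V, the vocabulary size. lam = 0.5 gives ELE."""
    return (count + lam) / (history_count + V * lam)

# Austen figures: "was" appears 9409 times, "was not" 608 times,
# and the training vocabulary has 14,585 word types.
p_not = p_lid(608, 9409, 14_585)      # ELE estimate of P(not | was)
p_inferior = p_lid(0, 9409, 14_585)   # unseen "was inferior" now gets > 0
```

Note how p_not comes out near 0.036, well below the MLE value 608/9409 ≈ 0.065: the seen events pay for the mass handed to the unseen ones.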
Following the word was, which appeared 9409 times, not appeared 608 times in the training corpus, which overall contained 14,585 word types. So our new estimate for P(not|was) is (608 + 0.5)/(9409 + 14,585 x 0.5) ≈ 0.036. The estimate for P(not|was) has thus been discounted (by almost half!) relative to the MLE estimate of 608/9409 ≈ 0.065. If we do similar calculations for the other words, then we get the results shown in the last column of table 6.5. The ordering of most likely words is naturally unchanged, but the probability estimates of words that did appear in the training text are discounted, while non-occurring words, in particular the actual next word, inferior, are given a non-zero probability of occurrence.

[Table 6.5 layout: rank, word, MLE estimate, ELE estimate, for the words not (rank 1), a, the, to, ..., and inferior (rank = 1482); numeric entries omitted.]

Table 6.5  Expected Likelihood Estimation estimates for the word following was.

Continuing in this way to also estimate the other bigram probabilities, we find that this language model gives a probability estimate for the clause. Unfortunately, this probability estimate is actually lower than the MLE estimate based on unigram counts, reflecting how greatly all the MLE probability estimates for seen n-grams are discounted in the construction of the ELE model. This result substantiates the slogan used in the titles of (Gale and Church 1990a,b): poor estimates of context are worse than none.

Note, however, that this does not mean that the model that we have constructed is entirely useless. Although the probability estimates it gives are extremely low, one can nevertheless use them to rank alternatives. For example, the model does correctly tell us that she was inferior to both sisters is a much more likely clause in English than inferior to was both she sisters, whereas the unigram estimate gives them both the same probability.

6.2.3 Held out estimation

How do we know that giving 46.5% of the probability space to unseen events is too much? One way that we can test this is empirically. We
can take further text (assumed to be from the same source) and see how often bigrams that appeared r times in the training text tend to turn up in the further text. The realization of this idea is the held out estimator of Jelinek and Mercer (1985).

The held out estimator

For each n-gram w_1...w_n, let:

C_1(w_1...w_n) = frequency of w_1...w_n in training data
C_2(w_1...w_n) = frequency of w_1...w_n in held out data

and recall that N_r is the number of bigrams with frequency r (in the training text). Now let:

(6.9)  T_r = Σ_{w_1...w_n : C_1(w_1...w_n) = r} C_2(w_1...w_n)

That is, T_r is the total number of times that all n-grams that appeared r times in the training text appeared in the held out data. Then the average frequency of those n-grams is T_r/N_r, and so an estimate for the probability of one of these n-grams is:

(6.10)  P_ho(w_1...w_n) = T_r / (N_r · N)   where C(w_1...w_n) = r

Pots of data for developing and testing models

A cardinal sin in Statistical NLP is to test on your training data. But why is that? The idea of testing is to assess how well a particular model works. That can only be done if it is a fair test on data that has not been seen before. In general, models induced from a sample of data have a tendency to be overtrained, that is, to expect future events to be like the events on which the model was trained, rather than allowing sufficiently for other possibilities. (For instance, stock market models sometimes suffer from this failing.) So it is essential to test on different data. A particular case of this is for the calculation of cross entropy (section 2.2.6). To calculate cross entropy, we take a large sample of text and calculate the per-word entropy of that text according to our model. This gives us a measure of the quality of our model, and an upper bound for the entropy of the language that the text was drawn from in general. But all that is only true if the test data is independent of the training data, and large enough
to be indicative of the complexity of the language at hand. If we test on the training data, the cross entropy can easily be lower than the real entropy of the text. In the most blatant case we could build a model that has memorized the training text and always predicts the next word with probability 1. Even if we don't do that, we will find that MLE is an excellent language model if you are testing on training data, which is not the right result. So when starting to work with some data, one should always separate it immediately into a training portion and a testing portion. The test data is normally only a small percentage (5-10%) of the total data, but has to be sufficient for the results to be reliable. You should always eyeball the training data: you want to use your human pattern-finding abilities to get hints on how to proceed. You shouldn't eyeball the test data; that is cheating, even if less directly than getting your program to memorize it. Commonly, however, one wants to divide both the training and test data into two again, for different reasons. For many Statistical NLP methods, such as held out estimation of n-grams, one gathers counts from one lot of training data, and then one smooths these counts or estimates certain other parameters of the assumed model based on what turns up in further held out or validation data. The held out data needs to be independent of both the primary training data and the test data. Normally the stage using the held out data involves the estimation of many fewer parameters than are estimated from counts over the primary training data, and so it is appropriate for the held out data to be much smaller than the primary training data (commonly about 10% of the size). Nevertheless, it is important that there is sufficient data for any additional parameters of the model to be accurately estimated, or significant performance losses can occur (as Chen and Goodman (1996: 317) show).
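The division of a corpus into these pots can be sketched as follows. This is a minimal illustration, not code from the book; the function name is hypothetical and the default fractions are merely chosen within the ranges the text mentions (test data 5-10% of the total, held out data about 10% of the training portion), using contiguous chunks.

```python
def split_corpus(tokens, test_frac=0.10, heldout_frac=0.10, dev_frac=0.5):
    """Carve a corpus into four pots using contiguous chunks:
    primary training data, held out (validation) data, a development
    test set, and a final test set."""
    n_test = int(len(tokens) * test_frac)
    rest, test = tokens[:-n_test], tokens[-n_test:]
    n_ho = int(len(rest) * heldout_frac)
    train, heldout = rest[:-n_ho], rest[-n_ho:]
    n_dev = int(len(test) * dev_frac)          # split test into dev and final
    dev_test, final_test = test[:n_dev], test[n_dev:]
    return train, heldout, dev_test, final_test
```

As discussed below, one could equally well draw the test material randomly from throughout the data; the contiguous version shown here corresponds to the second school of thought.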
A typical pattern in Statistical NLP research is to write an algorithm, train it, and test it, note some things that it does wrong, revise it and then to repeat the process (often many times!). But, if one does that a lot, not only does one tend to end up seeing aspects of the test set, but just repeatedly trying out different variant algorithms and looking at their performance can be viewed as subtly probing the contents of the test set. This means that testing a succession of variant models can again lead to overtraining. So the right approach is to have two test sets: a development test set on which successive variant methods are trialed and a final test set which is used to produce the final results that are published about the performance of the algorithm. One should expect performance on
the final test set to be slightly lower than on the development test set (though sometimes one can be lucky). The discussion so far leaves open exactly how to choose which parts of the data are to be used as testing data. Actually here opinion divides into two schools. One school favors selecting bits (sentences or even n-grams) randomly from throughout the data for the test set and using the rest of the material for training. The advantage of this method is that the testing data is as similar as possible (with respect to genre, register, writer, and vocabulary) to the training data. That is, one is training from as accurate a sample as possible of the type of language in the test data. The other possibility is to set aside large contiguous chunks as test data. The advantage of this is the opposite: in practice, one will end up using any NLP system on data that varies a little from the training data, as language use changes a little in topic and structure with the passage of time. Therefore, some people think it best to simulate that a little by choosing test data that perhaps isn't quite stationary with respect to the training data. At any rate, if using held out estimation of parameters, it is best to choose the same strategy for setting aside data for held out data as for test data, as this makes the held out data a better simulation of the test data. This choice is one of the many reasons why system results can be hard to compare: all else being equal, one should expect slightly worse performance results if using the second approach. While covering testing, let us mention one other issue. In early work, it was common to just run the system on the test data and present a single performance figure (for perplexity, percent correct or whatever). But this isn't a very good way of testing, as it gives no idea of the variance in the performance of the system. A much better way is to divide the test data into, say 20, smaller samples, and work out a test result on each of them.
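The divide-and-score procedure just described can be sketched as follows. This is an illustrative fragment, not code from the book; the function name is hypothetical, and the scoring function is left abstract since it depends on the evaluation measure (perplexity, percent correct, or whatever).

```python
from statistics import mean, variance

def evaluate_in_samples(test_items, score_fn, k=20):
    """Divide the test data into k smaller samples taken from
    different regions, score each sample separately, and report the
    mean performance together with the sample variance of the
    per-sample scores."""
    size = len(test_items) // k
    samples = [test_items[i * size:(i + 1) * size] for i in range(k)]
    scores = [score_fn(sample) for sample in samples]
    return mean(scores), variance(scores)
```

The returned variance is exactly the quantity needed for the significance test discussed next.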
From those results, one can work out a mean performance figure, as before, but one can also calculate the variance that shows how much performance tends to vary. If using this method together with continuous chunks of training data, it is probably best to take the smaller testing samples from different regions of the data, since the testing lore tends to be full of stories about certain sections of data sets being easy, and so it is better to have used a range of test data from different sections of the corpus. If we proceed this way, then one system can score higher on average than another purely by accident, especially when within-system variance is high. So just comparing average scores is not enough for meaningful
system comparison. Instead, we need to apply a statistical test that takes into account both mean and variance. Only if the statistical test rejects the possibility of an accidental difference can we say with confidence that one system is better than the other.⁹ An example of using the t test (which we introduced in section 5.3.1) for comparing the performance of two systems is shown in table 6.6 (adapted from (Snedecor and Cochran 1989: 92)). Note that we use a pooled estimate of the sample variance s² here, under the assumption that the variance of the two systems is the same (which seems a reasonable assumption here: 609 and 526 are close enough).

                           System 1                   System 2
scores                     71, 61, 55, 60, 68, 49,    42, 55, 75, 45, 54, 51,
                           42, 72, 76, 55, 64         55, 36, 58, 55, 67
total                      673                        593
n                          11                         11
mean x̄_i                   61.2                       53.9
s_i² = Σ_j (x_ij − x̄_i)²                              1,228.8
df                         20
t = (x̄_1 − x̄_2)/√(2s²/n) = 1.56

Table 6.6  Using the t test for comparing the performance of two systems. Since we calculate the mean for each data set, the denominator in the calculation of variance and the number of degrees of freedom is (11 − 1) + (11 − 1) = 20. The data do not provide clear support for the superiority of system 1. Despite the clear difference in mean scores, the sample variance is too high to draw any definitive conclusions.

Looking up the t distribution in the appendix, we find that, for rejecting the hypothesis that system 1 is better than system 2 at a probability level of α = 0.05, the critical value is t = 1.725 (using a one-tailed test with 20 degrees of freedom). Since we have t = 1.56 < 1.725, the data fail the significance test. Although the averages are fairly distinct, we cannot conclude superiority of system 1 here because of the large variance of scores.

9. Systematic discussion of testing methodology for comparing statistical and machine learning algorithms can be found in (Dietterich 1998). A good case study, for the example of word sense disambiguation, is (Mooney 1996).
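The pooled-variance t test of table 6.6 can be reproduced with a short calculation. This is a sketch, not code from the book; note that the pairing of the flattened score rows with the two systems is partly a reconstruction, so the t value it yields, about 1.60, differs slightly from the 1.56 reported in the text.

```python
from math import sqrt

def pooled_t(x, y):
    """Two-sample t statistic with a pooled variance estimate:
    t = (mean1 - mean2) / sqrt(2 s^2 / n), for equal group sizes n,
    under the assumption that both systems have the same variance."""
    n = len(x)
    m1, m2 = sum(x) / n, sum(y) / n
    ss1 = sum((v - m1) ** 2 for v in x)
    ss2 = sum((v - m2) ** 2 for v in y)
    s2 = (ss1 + ss2) / (2 * (n - 1))       # pooled s^2, df = 2(n - 1)
    return (m1 - m2) / sqrt(2 * s2 / n)

system1 = [71, 61, 55, 60, 68, 49, 42, 72, 76, 55, 64]
system2 = [42, 55, 75, 45, 54, 51, 55, 36, 58, 55, 67]
t = pooled_t(system1, system2)
# t stays below the one-tailed critical value of 1.725 for
# alpha = 0.05 with 20 degrees of freedom, so the difference in
# means is not significant at that level.
```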
Using held out estimation on the test data

So long as the frequency of an n-gram C(w_1…w_n) is the only thing that we are using to predict its future frequency in text, then we can use held out estimation performed on the test set to provide the correct answer of what the discounted estimates of probabilities should be in order to maximize the probability of the test set data. Doing this empirically measures how often n-grams that were seen r times in the training data actually do occur in the test text. The empirical estimates f_empirical in table 6.4 were found by randomly dividing the 44 million bigrams in the whole AP corpus into equal-sized training and test sets, counting frequencies in the 22 million word training set and then doing held out estimation using the test set. Whereas other estimates are calculated only from the 22 million words of training data, this estimate can be regarded as an empirically determined gold standard, achieved by allowing access to the test data.

Cross-validation (deleted estimation)

The f_empirical estimates discussed immediately above were constructed by looking at what actually happened in the test data. But the idea of held out estimation is that we can achieve the same effect by dividing the training data into two parts. We build initial estimates by doing counts on one part, and then we use the other pool of held out data to refine those estimates. The only cost of this approach is that our initial training data is now less, and so our probability estimates will be less reliable. Rather than using some of the training data only for frequency counts and some only for smoothing probability estimates, more efficient schemes are possible where each part of the training data is used both as initial training data and as held out data. In general, such methods in statistics go under the name cross-validation. Jelinek and Mercer (1985) use a form of two-way cross-validation that they call deleted estimation.
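The held out estimator of equations (6.9) and (6.10), on which deleted estimation builds, can be sketched as follows. This is an illustration, not code from the book. Two simplifying assumptions are made: N is taken to be the number of held out tokens (which makes the estimates sum to one over the frequency classes), and for r = 0 only unseen n-grams that happen to occur in the held out data are counted, whereas N_0 should really range over all possible unseen n-grams.

```python
from collections import Counter

def held_out_probs(train_ngrams, heldout_ngrams):
    """Held out estimation (Jelinek and Mercer 1985): for each
    training frequency r, N_r is the number of n-gram types seen r
    times in training, and T_r is the total number of times those
    types occur in the held out data; P_ho = T_r / (N_r * N)."""
    c1, c2 = Counter(train_ngrams), Counter(heldout_ngrams)
    n = len(heldout_ngrams)
    n_r, t_r = Counter(), Counter()
    for ngram in set(c1) | set(c2):   # simplification: observed types only
        r = c1[ngram]
        n_r[r] += 1
        t_r[r] += c2[ngram]
    return {r: t_r[r] / (n_r[r] * n) for r in n_r}
```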
Suppose we let N_r^a be the number of n-grams occurring r times in the a-th part of the training data, and T_r^{ab} be the total occurrences of those bigrams from part a in the b-th part. Now depending on which part is viewed as the basic training data, standard held out estimates would be either:

P_ho(w_1…w_n) = T_r^{01}/(N_r^0 N)   or   T_r^{10}/(N_r^1 N),   where C(w_1…w_n) = r
The more efficient deleted interpolation estimate does counts and smoothing on both halves and then averages the two:

(6.11)  P_del(w_1…w_n) = (T_r^{01} + T_r^{10}) / (N(N_r^0 + N_r^1)),   where C(w_1…w_n) = r

On large training corpora, doing deleted estimation on the training data works better than doing held-out estimation using just the training data, and indeed table 6.4 shows that it produces results that are quite close to the empirical gold standard.¹⁰ It is nevertheless still some way off for low frequency events. It overestimates the expected frequency of unseen objects, while underestimating the expected frequency of objects that were seen once in the training data. By dividing the text into two parts like this, one estimates the probability of an object by how many times it was seen in a sample of size N/2, assuming that the probability of a token seen r times in a sample of size N/2 is double that of a token seen r times in a sample of size N. However, it is generally true that as the size of the training corpus increases, the percentage of unseen n-grams that one encounters in held out data, and hence one's probability estimate for unseen n-grams, decreases (while never becoming negligible). It is for this reason that collecting counts on a smaller training corpus has the effect of overestimating the probability of unseen n-grams. There are other ways of doing cross-validation. In particular Ney et al. (1997) explore a method that they call Leaving-One-Out where the primary training corpus is of size N − 1 tokens, while 1 token is used as held out data for a sort of simulated testing. This process is repeated N times so that each piece of data is left out in turn. The advantage of this training regime is that it explores the effect of how the model changes if any particular piece of data had not been observed, and Ney et al.
show strong connections between the resulting formulas and the widely-used Good-Turing method to which we turn next.¹¹

10. Remember that, although the empirical gold standard was derived by held out estimation, it was held out estimation based on looking at the test data! Chen and Goodman (1998) find in their study that for smaller training corpora, held out estimation outperforms deleted estimation.
11. However, Chen and Goodman (1996: 314) suggest that leaving one word out at a time is problematic, and that using larger deleted chunks in deleted interpolation is to be preferred.
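The deleted estimation formula (6.11) can be sketched as follows. This is illustrative code, not from the book; by symmetry the resulting estimates, weighted by the sizes of the frequency classes N_r^0 + N_r^1, sum to one when N is taken to be the size of the combined training data.

```python
from collections import Counter

def deleted_estimates(part0, part1):
    """Deleted estimation, equation (6.11): do held out estimation in
    both directions over the two halves of the training data and
    average, P_del = (T_r^01 + T_r^10) / (N (N_r^0 + N_r^1))."""
    c0, c1 = Counter(part0), Counter(part1)
    n = len(part0) + len(part1)        # N: size of the combined data
    n_r, t_r = Counter(), Counter()
    for ngram in set(c0) | set(c1):
        n_r[c0[ngram]] += 1            # frequency class in part 0 ...
        t_r[c0[ngram]] += c1[ngram]    # ... scored against part 1,
        n_r[c1[ngram]] += 1            # and symmetrically the class in
        t_r[c1[ngram]] += c0[ngram]    # part 1 scored against part 0
    return {r: t_r[r] / (n * n_r[r]) for r in n_r}

part0 = [("a", "b"), ("a", "b"), ("b", "c")]
part1 = [("a", "b"), ("c", "d")]
p_del = deleted_estimates(part0, part1)
```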
Good-Turing estimation

The Good-Turing estimator

Good (1953) attributes to Turing a method for determining frequency or probability estimates of items, on the assumption that their distribution is binomial. This method is suitable for large numbers of observations of data drawn from a large vocabulary, and works well for n-grams, despite the fact that words and n-grams do not have a binomial distribution. The probability estimate in Good-Turing estimation is of the form:

(6.12)  P_GT = r*/N

where r* can be thought of as an adjusted frequency. The theorem underlying Good-Turing methods gives that for previously observed items:

(6.13)  r* = (r + 1) E(N_{r+1})/E(N_r)

where E denotes the expectation of a random variable (see (Church and Gale 1991a; Gale and Sampson 1995) for discussion of the derivation of this formula). The total probability mass reserved for unseen objects is then E(N_1)/N (see exercise 6.5). Using our empirical estimates, we can hope to substitute the observed N_r for E(N_r). However, we cannot do this uniformly, since these empirical estimates will be very unreliable for high values of r. In particular, the most frequent n-gram would be estimated to have probability zero, since the number of n-grams with frequency one greater than it is zero! In practice, one of two solutions is employed. One is to use Good-Turing reestimation only for frequencies r < k for some constant k (e.g., 10). Low frequency words are numerous, so substitution of the observed frequency of frequencies for the expectation is quite accurate, while the MLE estimates of high frequency words will also be quite accurate and so one doesn't need to discount them. The other is to fit some function S through the observed values of (r, N_r) and to use the smoothed values S(r) for the expectation (this leads to a family of possibilities depending on exactly which method of curve fitting is employed; Good (1953) discusses several smoothing methods).
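The first of these two fixes, applying Good-Turing reestimation only below a cutoff r < k, can be sketched as follows. This is illustrative code, not from the book, and the counts fed into it are invented.

```python
from collections import Counter

def good_turing_counts(type_counts, k=10):
    """Adjusted counts r* = (r + 1) N_{r+1} / N_r, substituting the
    observed frequencies of frequencies for the expectations in
    equation (6.13), but only for r < k; above the cutoff (and where
    N_{r+1} is zero) the raw MLE count is kept."""
    n_r = Counter(type_counts)                 # frequencies of frequencies
    adjusted = {}
    for r in n_r:
        if 0 < r < k and n_r[r + 1] > 0:
            adjusted[r] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[r] = float(r)             # keep the MLE count
    return adjusted

# invented data: 10 types seen once, 5 seen twice, 3 seen three times, ...
type_counts = [1] * 10 + [2] * 5 + [3] * 3 + [4] * 2 + [5]
adjusted = good_turing_counts(type_counts)
unseen_mass = type_counts.count(1) / sum(type_counts)   # N_1 / N
```

The last line computes the total probability mass reserved for unseen objects, N_1/N.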
The probability mass N_1/N given to unseen items can either be divided among them uniformly, or by some more sophisticated method (see under Combining Estimators, below). So using this method with a uniform estimate for unseen events, we have:

Good-Turing Estimator: If C(w_1…w_n) = r > 0,

        P_GT(w_1…w_n) = r*/N,   where r* = (r + 1)S(r + 1)/S(r)
If C(w_1…w_n) = 0,

(6.14)  P_GT(w_1…w_n) = (1 − Σ_{r=1}^∞ N_r (r*/N)) / N_0  ≈  N_1/(N_0 N)

Gale and Sampson (1995) present a simple and effective approach, Simple Good-Turing, which effectively combines these two approaches. As a smoothing curve they simply use a power curve N_r = a r^b (with b < −1 to give the appropriate hyperbolic relationship), and estimate a and b by simple linear regression on the logarithmic form of this equation, log N_r = log a + b log r (linear regression is covered in all introductory statistics books). However, they suggest that such a simple curve is probably only appropriate for high values of r. For low values of r, they use the measured N_r directly. Working up through frequencies, these direct estimates are used until for one of them there isn't a significant difference between r* values calculated directly or via the smoothing function, and then smoothed estimates are used for all higher frequencies.¹² Simple Good-Turing can give exceedingly good estimators, as can be seen by comparing the Good-Turing column f_GT in table 6.4 with the empirical gold standard. Under any of these approaches, it is necessary to renormalize all the estimates to ensure that a proper probability distribution results. This can be done either by adjusting the amount of probability mass given to unseen items (as in equation (6.14)), or, perhaps better, by keeping the estimate of the probability mass for unseen items as N_1/N and renormalizing all the estimates for previously seen items (as Gale and Sampson (1995) propose).

Frequencies of frequencies in Austen

To do Good-Turing, the first step is to calculate the frequencies of different frequencies (also known as count-counts). Table 6.7 shows extracts from the resulting list of frequencies of frequencies for bigrams and trigrams. (The numbers are reminiscent of the Zipfian distributions of
12. An estimate of r* is deemed significantly different if the difference exceeds 1.65 times the standard deviation of the Good-Turing estimate, which is given by:

        √( (r + 1)² · (N_{r+1}/N_r²) · (1 + N_{r+1}/N_r) )
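The smoothing step of Simple Good-Turing, fitting log N_r = log a + b log r by least squares, can be sketched as follows. This is an illustration only: the full Gale and Sampson procedure also switches between raw and smoothed estimates using the significance rule of footnote 12 and renormalizes at the end, and the count-count values below are invented.

```python
from math import exp, log

def fit_power_curve(n_r):
    """Least-squares fit of log N_r = log a + b log r, returning the
    smoothing function S(r) = a * r**b used in place of the raw N_r."""
    rs = sorted(n_r)
    xs = [log(r) for r in rs]
    ys = [log(n_r[r]) for r in rs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    log_a = my - b * mx
    return lambda r: exp(log_a) * r ** b

n_r = {1: 120, 2: 40, 3: 24, 4: 15, 5: 10}   # invented count-counts
s = fit_power_curve(n_r)

def r_star(r):
    """Smoothed adjusted count r* = (r + 1) S(r + 1) / S(r)."""
    return (r + 1) * s(r + 1) / s(r)
```

With a fitted slope b < −1, the adjusted count r* for once-seen items comes out below 1, i.e. singletons are discounted, as expected.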
CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute
More informationStatistics II Final Exam 26/6/18
Statstcs II Fnal Exam 26/6/18 Academc Year 2017/18 Solutons Exam duraton: 2 h 30 mn 1. (3 ponts) A town hall s conductng a study to determne the amount of leftover food produced by the restaurants n the
More informationCSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography
CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve
More informationCHAPTER IV RESEARCH FINDING AND DISCUSSIONS
CHAPTER IV RESEARCH FINDING AND DISCUSSIONS A. Descrpton of Research Fndng. The Implementaton of Learnng Havng ganed the whole needed data, the researcher then dd analyss whch refers to the statstcal data
More informationx yi In chapter 14, we want to perform inference (i.e. calculate confidence intervals and perform tests of significance) in this setting.
The Practce of Statstcs, nd ed. Chapter 14 Inference for Regresson Introducton In chapter 3 we used a least-squares regresson lne (LSRL) to represent a lnear relatonshp etween two quanttatve explanator
More informationLearning from Data 1 Naive Bayes
Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why
More informationAS-Level Maths: Statistics 1 for Edexcel
1 of 6 AS-Level Maths: Statstcs 1 for Edecel S1. Calculatng means and standard devatons Ths con ndcates the slde contans actvtes created n Flash. These actvtes are not edtable. For more detaled nstructons,
More informationTHE SUMMATION NOTATION Ʃ
Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the
More informationOnline Appendix to: Axiomatization and measurement of Quasi-hyperbolic Discounting
Onlne Appendx to: Axomatzaton and measurement of Quas-hyperbolc Dscountng José Lus Montel Olea Tomasz Strzaleck 1 Sample Selecton As dscussed before our ntal sample conssts of two groups of subjects. Group
More informationModule 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur
Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:
More informationNotes on Frequency Estimation in Data Streams
Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to
More informationTracking with Kalman Filter
Trackng wth Kalman Flter Scott T. Acton Vrgna Image and Vdeo Analyss (VIVA), Charles L. Brown Department of Electrcal and Computer Engneerng Department of Bomedcal Engneerng Unversty of Vrgna, Charlottesvlle,
More informationCopyright 2017 by Taylor Enterprises, Inc., All Rights Reserved. Adjusted Control Limits for U Charts. Dr. Wayne A. Taylor
Taylor Enterprses, Inc. Adjusted Control Lmts for U Charts Copyrght 207 by Taylor Enterprses, Inc., All Rghts Reserved. Adjusted Control Lmts for U Charts Dr. Wayne A. Taylor Abstract: U charts are used
More informationprinceton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora
prnceton unv. F 13 cos 521: Advanced Algorthm Desgn Lecture 3: Large devatons bounds and applcatons Lecturer: Sanjeev Arora Scrbe: Today s topc s devaton bounds: what s the probablty that a random varable
More information18. SIMPLE LINEAR REGRESSION III
8. SIMPLE LINEAR REGRESSION III US Domestc Beers: Calores vs. % Alcohol Ftted Values and Resduals To each observed x, there corresponds a y-value on the ftted lne, y ˆ ˆ = α + x. The are called ftted values.
More informationBOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu
BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS M. Krshna Reddy, B. Naveen Kumar and Y. Ramu Department of Statstcs, Osmana Unversty, Hyderabad -500 007, Inda. nanbyrozu@gmal.com, ramu0@gmal.com
More informationCHAPTER 8. Exercise Solutions
CHAPTER 8 Exercse Solutons 77 Chapter 8, Exercse Solutons, Prncples of Econometrcs, 3e 78 EXERCISE 8. When = N N N ( x x) ( x x) ( x x) = = = N = = = N N N ( x ) ( ) ( ) ( x x ) x x x x x = = = = Chapter
More informationGlobal Sensitivity. Tuesday 20 th February, 2018
Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values
More informationLogistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI
Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton
More informationWeek 5: Neural Networks
Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple
More information8.6 The Complex Number System
8.6 The Complex Number System Earler n the chapter, we mentoned that we cannot have a negatve under a square root, snce the square of any postve or negatve number s always postve. In ths secton we want
More informationLecture 4 Hypothesis Testing
Lecture 4 Hypothess Testng We may wsh to test pror hypotheses about the coeffcents we estmate. We can use the estmates to test whether the data rejects our hypothess. An example mght be that we wsh to
More informationGeneralized Linear Methods
Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set
More informationDurban Watson for Testing the Lack-of-Fit of Polynomial Regression Models without Replications
Durban Watson for Testng the Lack-of-Ft of Polynomal Regresson Models wthout Replcatons Ruba A. Alyaf, Maha A. Omar, Abdullah A. Al-Shha ralyaf@ksu.edu.sa, maomar@ksu.edu.sa, aalshha@ksu.edu.sa Department
More information2016 Wiley. Study Session 2: Ethical and Professional Standards Application
6 Wley Study Sesson : Ethcal and Professonal Standards Applcaton LESSON : CORRECTION ANALYSIS Readng 9: Correlaton and Regresson LOS 9a: Calculate and nterpret a sample covarance and a sample correlaton
More informationCase A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k.
THE CELLULAR METHOD In ths lecture, we ntroduce the cellular method as an approach to ncdence geometry theorems lke the Szemeréd-Trotter theorem. The method was ntroduced n the paper Combnatoral complexty
More informationNumerical Heat and Mass Transfer
Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and
More informationTHE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens
THE CHINESE REMAINDER THEOREM KEITH CONRAD We should thank the Chnese for ther wonderful remander theorem. Glenn Stevens 1. Introducton The Chnese remander theorem says we can unquely solve any par of
More informationGaussian Mixture Models
Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous
More informationDETERMINATION OF UNCERTAINTY ASSOCIATED WITH QUANTIZATION ERRORS USING THE BAYESIAN APPROACH
Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata TC XVII IMEKO World Congress Metrology n the 3rd Mllennum June 7, 3,
More informationMidterm Examination. Regression and Forecasting Models
IOMS Department Regresson and Forecastng Models Professor Wllam Greene Phone: 22.998.0876 Offce: KMC 7-90 Home page: people.stern.nyu.edu/wgreene Emal: wgreene@stern.nyu.edu Course web page: people.stern.nyu.edu/wgreene/regresson/outlne.htm
More informationHashing. Alexandra Stefan
Hashng Alexandra Stefan 1 Hash tables Tables Drect access table (or key-ndex table): key => ndex Hash table: key => hash value => ndex Man components Hash functon Collson resoluton Dfferent keys mapped
More informationExercises. 18 Algorithms
18 Algorthms Exercses 0.1. In each of the followng stuatons, ndcate whether f = O(g), or f = Ω(g), or both (n whch case f = Θ(g)). f(n) g(n) (a) n 100 n 200 (b) n 1/2 n 2/3 (c) 100n + log n n + (log n)
More informationBasic Business Statistics, 10/e
Chapter 13 13-1 Basc Busness Statstcs 11 th Edton Chapter 13 Smple Lnear Regresson Basc Busness Statstcs, 11e 009 Prentce-Hall, Inc. Chap 13-1 Learnng Objectves In ths chapter, you learn: How to use regresson
More informationECONOMICS 351*-A Mid-Term Exam -- Fall Term 2000 Page 1 of 13 pages. QUEEN'S UNIVERSITY AT KINGSTON Department of Economics
ECOOMICS 35*-A Md-Term Exam -- Fall Term 000 Page of 3 pages QUEE'S UIVERSITY AT KIGSTO Department of Economcs ECOOMICS 35* - Secton A Introductory Econometrcs Fall Term 000 MID-TERM EAM ASWERS MG Abbott
More information