This excerpt is from Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schütze, The MIT Press.
This excerpt from Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schütze, The MIT Press, is provided in screen-viewable form for personal use only by members of MIT CogNet. Unauthorized use or dissemination of this information is expressly forbidden. If you have any questions about this material, please contact cognetadmin@cognet.mit.edu.
6 Statistical Inference: n-gram Models over Sparse Data

Statistical NLP aims to do statistical inference for the field of natural language. Statistical inference in general consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about this distribution. For example, we might look at lots of instances of prepositional phrase attachments in a corpus, and use them to try to predict prepositional phrase attachments for English in general.

The discussion in this chapter divides the problem into three areas (although they tend to overlap considerably): dividing the training data into equivalence classes, finding a good statistical estimator for each equivalence class, and combining multiple estimators.

As a running example of statistical estimation, we will examine the classic task of language modeling, where the problem is to predict the next word given the previous words. This task is fundamental to speech or optical character recognition, and is also used for spelling correction, handwriting recognition, and statistical machine translation. This sort of task is often referred to as a Shannon game following the presentation of the task of guessing the next letter in a text in (Shannon 1951). This problem has been well-studied, and indeed many estimation methods were first developed for this task. In general, though, the methods we develop are not specific to this task, and can be directly used for other tasks like word sense disambiguation or probabilistic parsing. The word prediction task just provides a clear, easily understood problem for which the techniques can be developed.
6.1 Bins: Forming Equivalence Classes

6.1.1 Reliability vs. discrimination

Normally, in order to do inference about one feature, we wish to find other features of the model that predict it. Here, we are assuming that past behavior is a good guide to what will happen in the future (that is, that the model is roughly stationary). This gives us a classification task: we try to predict the target feature on the basis of various classificatory features. When doing this, we effectively divide the data into equivalence classes that share values for certain of the classificatory features, and use this equivalence classing to help predict the value of the target feature on new pieces of data. This means that we are tacitly making independence assumptions: the data either does not depend on other features, or the dependence is sufficiently minor that we hope that we can neglect it without doing too much harm. The more classificatory features (of some relevance) that we identify, the more finely the conditions that determine the unknown probability distribution of the target feature can potentially be teased apart. In other words, dividing the data into many bins gives us greater discrimination. Going against this is the problem that if we use a lot of bins then a particular bin may contain no or a very small number of training instances, and then we will not be able to do statistically reliable estimation of the target feature for that bin. Finding equivalence classes that are a good compromise between these two criteria is our first goal.

6.1.2 n-gram models

The task of predicting the next word can be stated as attempting to estimate the probability function P:

(6.1)  P(w_n | w_1, ..., w_{n-1})

In such a stochastic problem, we use a classification of the previous words, the history, to predict the next word. On the basis of having looked at a lot of text, we know which words tend to follow other words.
For this task, we cannot possibly consider each textual history separately: most of the time we will be listening to a sentence that we have never heard before, and so there is no previous identical textual history on which to base our predictions; and even if we had heard the beginning of the sentence before, it might end differently this time. And so we
need a method of grouping histories that are similar in some way so as to give reasonable predictions as to which words we can expect to come next. One possible way to group them is by making a Markov assumption that only the prior local context (the last few words) affects the next word. If we construct a model where all histories that have the same last n - 1 words are placed in the same equivalence class, then we have an (n - 1)th order Markov model or an n-gram word model (the last word of the n-gram being given by the word we are predicting).

Before continuing with model-building, let us pause for a brief interlude on naming. The cases of n-gram models that people usually use are for n = 2, 3, 4, and these alternatives are usually referred to as a bigram, a trigram, and a four-gram model, respectively. Revealing this will surely be enough to cause any Classicists who are reading this book to stop, and to leave the field to uneducated engineering sorts: gram is a Greek root and so should be put together with Greek number prefixes. Shannon actually did use the term digram, but with the declining levels of education in recent decades, this usage has not survived. As non-prescriptive linguists, however, we think that the curious mixture of English, Greek, and Latin that our colleagues actually use is quite fun. So we will not try to stamp it out.¹

Now in principle, we would like the n of our n-gram models to be fairly large, because there are sequences of words like:

(6.2)  Sue swallowed the large green ___.

where swallowed is presumably still quite strongly influencing which word will come next: pill or perhaps frog are likely continuations, but tree, car or mountain are presumably unlikely, even though they are in general fairly natural continuations after the large green. However, there is the problem that if we divide the data into too many bins, then there are a lot of parameters to estimate.
For instance, if we conservatively assume that a speaker is staying within a vocabulary of 20,000 words, then we get the estimates for numbers of parameters² shown in table 6.1.

1. Rather than four-gram, some people do make an attempt at appearing educated by saying quadgram, but this is not really correct use of a Latin number prefix (which would give quadrigram, cf. quadrilateral), let alone correct use of a Greek number prefix, which would give us a tetragram model.
2. Given a certain model space (here word n-gram models), the parameters are the numbers that we have to specify to determine a particular model within that model space.
Model                          Parameters
1st order (bigram model):      20,000 x 19,999 = 400 million
2nd order (trigram model):     20,000^2 x 19,999 = 8 trillion
3rd order (four-gram model):   20,000^3 x 19,999 = 1.6 x 10^17

Table 6.1  Growth in number of parameters for n-gram models.

So we quickly see that producing a five-gram model, of the sort that we thought would be useful above, may well not be practical, even if we have what we think is a very large corpus. For this reason, n-gram systems currently usually use bigrams or trigrams (and often make do with a smaller vocabulary).

One way of reducing the number of parameters is to reduce the value of n, but it is important to realize that n-grams are not the only way of forming equivalence classes of the history. Among other operations of equivalencing, we could consider stemming (removing the inflectional endings from words) or grouping words into semantic classes (by use of a pre-existing thesaurus, or by some induced clustering). This is effectively reducing the vocabulary size over which we form n-grams. But we do not need to use n-grams at all. There are myriad other ways of forming equivalence classes of the history; it is just that they are all a bit more complicated than n-grams. The above example suggests that knowledge of the predicate in a clause is useful, so we can imagine a model that predicts the next word based on the previous word and the previous predicate (no matter how far back it is). But this model is harder to implement, because we first need a fairly accurate method of identifying the main predicate of a clause. Therefore we will just use n-gram models in this chapter, but other techniques are covered in chapters 12 and 14.

For anyone from a linguistics background, the idea that we would choose to use a model of language structure which predicts the next word simply by examining the previous two words, with no reference to the structure of the sentence, seems almost preposterous.
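The parameter counts in table 6.1 can be reproduced in a few lines. A minimal sketch (the function name is ours; V = 20,000 is the conservative vocabulary size assumed above):

```python
# Number of parameters of an n-gram model over a vocabulary of V words:
# V**(n-1) histories (bins), each needing V - 1 free probabilities
# (the last probability is fixed by the sum-to-one constraint).
def ngram_parameters(n, V=20_000):
    return V ** (n - 1) * (V - 1)

for n, name in [(2, "bigram"), (3, "trigram"), (4, "four-gram")]:
    print(f"{name} model: {ngram_parameters(n):,} parameters")
```

For n = 5 this gives roughly 3 x 10^21 parameters, which is why the five-gram model contemplated above is impractical.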
2. (continued) Since we are assuming nothing in particular about the probability distribution, the number of parameters to be estimated is the number of bins times one less than the number of values of the target feature (one is subtracted because the probability of the last target value is automatically given by the stochastic constraint that probabilities should sum to one).

But, actually, the
lexical co-occurrence, semantic, and basic syntactic relationships that appear in this very local context are a good predictor of the next word, and such systems work surprisingly well. Indeed, it is difficult to beat a trigram model on the purely linear task of predicting the next word.

6.1.3 Building n-gram models

In the final part of some sections of this chapter, we will actually build some models and show the results. The reader should be able to recreate our results by using the tools and data on the accompanying website. The text that we will use is Jane Austen's novels, and is available from the website. This corpus has two advantages: (i) it is freely available through the work of Project Gutenberg, and (ii) it is not too large. The small size of the corpus is, of course, in many ways also a disadvantage. Because of the huge number of parameters of n-gram models, as discussed above, n-gram models work best when trained on enormous amounts of data. However, such training requires a lot of CPU time and disk space, so a small corpus is much more appropriate for a textbook example. Even so, you will want to make sure that you start off with about 40 MB of free disk space before attempting to recreate our examples.

As usual, the first step is to preprocess the corpus. The Project Gutenberg Austen texts are very clean plain ASCII files. But nevertheless, there are the usual problems of punctuation marks attaching to words and so on (see chapter 4) that mean that we must do more than simply split on whitespace. We decided that we could make do with some very simple search-and-replace patterns that removed all punctuation, leaving whitespace-separated words (see the website for details). We decided to use Emma, Mansfield Park, Northanger Abbey, Pride and Prejudice, and Sense and Sensibility as our corpus for building models, reserving Persuasion for testing, as discussed below. This gave us a (small) training corpus of N = 617,091 words of text, containing a vocabulary V of 14,585 word types.
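The preprocessing and counting step can be sketched as follows. The exact search-and-replace patterns are on the book's website; the regular expression below is our own illustrative stand-in, not the book's:

```python
import re

def tokenize(text):
    """Remove punctuation and split on whitespace, as described above.
    The regex is an assumed stand-in for the book's replacement patterns."""
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    return text.split()

corpus = "Emma Woodhouse, handsome, clever, and rich, was happy."
tokens = tokenize(corpus)
N = len(tokens)           # corpus size in word tokens
V = len(set(tokens))      # vocabulary size in word types
```

Run over the five training novels, this kind of count is what yields the N = 617,091 tokens and V = 14,585 types quoted above.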
By simply removing all punctuation as we did, our file is literally a long sequence of words. This isn't actually what people do most of the time. It is commonly felt that there are not very strong dependencies between sentences, while sentences tend to begin in characteristic ways. So people mark the sentences in the text, most commonly by surrounding them with the SGML tags <s> and </s>. The probability calculations at the
start of a sentence are then dependent not on the last words of the preceding sentence but upon a beginning-of-sentence context. We should additionally note that we didn't remove case distinctions, so capitalized words remain in the data, imperfectly indicating where new sentences begin.

6.2 Statistical Estimators

Given a certain number of pieces of training data that fall into a certain bin, the second goal is then finding out how to derive a good probability estimate for the target feature based on these data. For our running example of n-grams, we will be interested in P(w_1...w_n) and the prediction task P(w_n | w_1...w_{n-1}). Since:

(6.3)  P(w_n | w_1...w_{n-1}) = P(w_1...w_n) / P(w_1...w_{n-1})

estimating good conditional probability distributions can be reduced to having good solutions to simply estimating the unknown probability distribution of n-grams.³

Let us assume that the training text consists of N words. If we append n - 1 dummy start symbols to the beginning of the text, we can then also say that the corpus consists of N n-grams, with a uniform amount of conditioning available for the next word in all cases. Let B be the number of bins (equivalence classes). This will be V^{n-1}, where V is the vocabulary size, for the task of working out the next word, and V^n for the task of estimating the probability of different n-grams. Let C(w_1...w_n) be the frequency of a certain n-gram in the training text, and let us say that there are N_r n-grams that appeared r times in the training text (i.e., N_r = |{w_1...w_n : C(w_1...w_n) = r}|). These frequencies of frequencies are very commonly used in the estimation methods which we cover below. This notation is summarized in table 6.2.

3. However, when smoothing, one has a choice of whether to smooth the n-gram probability estimates, or to smooth the conditional probability distributions directly.
For many methods, these do not give equivalent results, since in the latter case one is separately smoothing a large number of conditional probability distributions (which normally need to be themselves grouped into classes in some way).
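The counts C(w_1...w_n) and the frequencies of frequencies N_r are easy to compute directly. A minimal sketch on a toy token list (function names are our own):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """C(w_1...w_n): frequency of each n-gram in the training text."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def freq_of_freqs(counts):
    """N_r: the number of distinct n-grams seen exactly r times."""
    return Counter(counts.values())

tokens = "to be or not to be".split()
C = ngram_counts(tokens, 2)   # bigram counts: ('to','be') occurs twice
Nr = freq_of_freqs(C)         # three bigrams seen once, one seen twice
```

These two dictionaries are all that most of the estimators in this section need as input.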
N              Number of training instances
B              Number of bins training instances are divided into
w_1n           An n-gram w_1...w_n in the training text
C(w_1...w_n)   Frequency of n-gram w_1...w_n in training text
r              Frequency of an n-gram
f(.)           Frequency estimate of a model
N_r            Number of bins that have r training instances in them
T_r            Total count of n-grams of frequency r in further data
h              History of preceding words

Table 6.2  Notation for the statistical estimation chapter.

6.2.1 Maximum Likelihood Estimation (MLE)

MLE estimates from relative frequencies

Regardless of how we form equivalence classes, we will end up with bins that contain a certain number of training instances. Let us assume a trigram model where we are using the two preceding words of context to predict the next word, and let us focus in on the bin for the case where the two preceding words were comes across. In a certain corpus, the authors found 10 training instances of the words comes across, and of those, 8 times they were followed by as, once by more, and once by a. The question at this point is what probability estimates we should use for estimating the next word.

The obvious first answer (at least from a frequentist point of view) is to suggest using the relative frequency as a probability estimate:

P(as) = 0.8
P(more) = 0.1
P(a) = 0.1
P(x) = 0.0  for x not among the above 3 words

This estimate is called the maximum likelihood estimate (MLE):

(6.4)  P_MLE(w_1...w_n) = C(w_1...w_n) / N

(6.5)  P_MLE(w_n | w_1...w_{n-1}) = C(w_1...w_n) / C(w_1...w_{n-1})
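Equations (6.4) and (6.5) are straightforward to replay in code on the comes across example. A sketch, using only the counts quoted above:

```python
from collections import Counter

# Counts for the bin where the two preceding words were "comes across":
# 10 training instances, 8 followed by "as", 1 by "more", 1 by "a".
trigram_counts = Counter({("comes", "across", "as"): 8,
                          ("comes", "across", "more"): 1,
                          ("comes", "across", "a"): 1})
bigram_counts = Counter({("comes", "across"): 10})

def p_mle(next_word, history):
    """P_MLE(w_n | w_1...w_{n-1}) = C(w_1...w_n) / C(w_1...w_{n-1})  (eq. 6.5)."""
    return trigram_counts[history + (next_word,)] / bigram_counts[history]
```

Because Counter returns 0 for missing keys, any word not seen after comes across (such as the or some) gets probability 0.0, which is exactly the weakness discussed next.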
If one fixes the observed data, and then considers the space of all possible parameter assignments within a certain distribution (here a trigram model) given the data, then statisticians refer to this as a likelihood function. The maximum likelihood estimate is so called because it is the choice of parameter values which gives the highest probability to the training corpus.⁴ The estimate that does that is the one shown above. It does not waste any probability mass on events that are not in the training corpus, but rather it makes the probability of observed events as high as it can, subject to the normal stochastic constraints.

But the MLE is in general unsuitable for statistical inference in NLP. The problem is the sparseness of our data (even if we are using a large corpus). While a few words are common, the vast majority of words are very uncommon, and longer n-grams involving them are thus much rarer again. The MLE assigns a zero probability to unseen events, and since the probability of a long string is generally computed by multiplying the probabilities of subparts, these zeroes will propagate and give us bad (zero probability) estimates for the probability of sentences when we just happened not to see certain n-grams in the training text.⁵ With respect to the example above, the MLE is not capturing the fact that there are other words which can follow comes across, for example the and some.

As an example of data sparseness, after training on 1.5 million words from the IBM Laser Patent Text corpus, Bahl et al. (1983) report that 23% of the trigram tokens found in further test data drawn from the same corpus were previously unseen. This corpus is small by modern standards, and so one might hope that by collecting much more data the problem of data sparseness would simply go away. While this may initially seem hopeful (if we collect a hundred instances of comes across, we will probably find instances with it followed by the and some), in practice it is never a general solution to the problem.
While there are a limited number of frequent events in language, there is a seemingly never-ending tail to the probability distribution of rarer and rarer events, and we can never collect enough data to get to the end of the tail.⁶ For instance, comes across could be followed by any number, and we will never see every number.

In general, we need to devise better estimators that allow for the possibility that we will see events that we didn't see in the training text. All such methods effectively work by somewhat decreasing the probability of previously seen events, so that there is a little bit of probability mass left over for previously unseen events. Thus these methods are frequently referred to as discounting methods. The process of discounting is often referred to as smoothing, presumably because a distribution without zeroes is smoother than one with zeroes. We will examine a number of smoothing methods in the following sections.

Using MLE estimates for n-gram models of Austen

Based on our Austen corpus, we made n-gram models for different values of n. It is quite straightforward to write one's own program to do this, by totalling up the frequencies of n-grams and (n - 1)-grams, and then dividing to get MLE probability estimates, but there is also software to do it on the website. In practical systems, it is usual not to actually calculate n-grams for all words. Rather, the n-grams are calculated as usual only for the most common k words, and all other words are regarded as Out-Of-Vocabulary (OOV) items and mapped to a single token such as <UNK>. Commonly, this will be done for all words that have been encountered only once in the training corpus (hapax legomena). A useful variant in some domains is to notice the obvious semantic and distributional similarity of rare numbers and to have two out-of-vocabulary tokens, one for numbers and one for everything else.

4. This is given that the occurrence of a certain n-gram is assumed to be a random variable with a binomial distribution (i.e., each n-gram is independent of the next). This is a quite untrue (though usable) assumption: firstly, each n-gram overlaps with, and hence partly determines, the next, and secondly, content words tend to clump (if you use a word once in a paper, you are likely to use it again), as we discuss later in the book.
5. Another way to state this is to observe that if our probability model assigns zero probability to any event that turns out to actually occur, then both the cross-entropy and the KL divergence with respect to (data from) the real probability distribution is infinite. In other words, we have done a maximally bad job at producing a probability function that is close to the one we are trying to model.
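The OOV mapping described above can be sketched as follows. The <UNK> token is from the text; the <NUM> token for the two-token variant and the count threshold are our own illustrative choices:

```python
from collections import Counter

def map_oov(tokens, min_count=2):
    """Replace hapax legomena (words seen fewer than min_count times)
    with <UNK>, giving rare numbers their own token per the variant above."""
    counts = Counter(tokens)
    out = []
    for w in tokens:
        if counts[w] >= min_count:
            out.append(w)
        elif w.isdigit():
            out.append("<NUM>")   # assumed token name for rare numbers
        else:
            out.append("<UNK>")
    return out

tokens = "the cat sat on the 42 mat".split()
mapped = map_oov(tokens)   # every once-seen word collapses to <UNK>/<NUM>
```

After this pass, n-gram counts are collected over the mapped tokens, so all rare words share a single row of the model's parameter tables.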
Because of the Zipfian distribution of words, cutting out low-frequency items will greatly reduce the parameter space (and the memory requirements of the system being built), while not appreciably affecting the model quality (hapax legomena often constitute half of the types, but only a fraction of the tokens).

We used the conditional probabilities calculated from our training corpus to work out the probabilities of each following word for part of a

6. Cf. Zipf's law: the observation that the relationship between a word's frequency and the rank order of its frequency is roughly a reciprocal curve, as discussed in chapter 1.
[Table 6.3 layout: for each of the six prediction positions in the clause In person she was inferior to both sisters, the words predicted by 1-gram, 2-gram, 3-gram, and 4-gram MLE models, listed by rank with their probabilities; the 1-gram column ranks the globally frequent words the, to, and, of, ... at every position, while most trigram and four-gram histories are marked Unseen, and inferior receives probability 0 everywhere except under the 1-gram model.]

Table 6.3  Probabilities of each successive word for a clause from Persuasion. The probability distribution for the following word is calculated by Maximum Likelihood Estimate n-gram models for various values of n. The predicted likelihood rank of different words is shown in the first column. The actual next word is shown at the top of the table in italics, and in the table in bold.
sentence from our test corpus, Persuasion. We will cover the issue of test corpora in more detail later, but it is vital for assessing a model that we try it on different data; otherwise it isn't a fair test of how well the model allows us to predict the patterns of language. Extracts from these probability distributions, including the actual next word shown in bold, are shown in table 6.3.

The unigram distribution ignores context entirely, and simply uses the overall frequency of different words. But this is not entirely useless, since, as in this clause, most words in most sentences are common words. The bigram model uses the preceding word to help predict the next word. In general, this helps enormously, and gives us a much better model. In some cases the estimated probability of the word that actually comes next has gone up by about an order of magnitude (was, to, sisters). However, note that the bigram model is not guaranteed to increase the probability estimate. The estimate for she has actually gone down, because she is in general very common in Austen novels (being mainly books about women), but somewhat unexpected after the noun person, although quite possible when an adverbial phrase is being used, such as In person here. The failure to predict inferior after was shows problems of data sparseness already starting to crop up.

When the trigram model works, it can work brilliantly. For example, it gives us a probability estimate of 0.5 for was following person she. But in general it is not usable. Either the preceding bigram was never seen before, and then there is no probability distribution for the following word, or a few words have been seen following that bigram, but the data is so sparse that the resulting estimates are highly unreliable. For example, the bigram to both was seen 9 times in the training text, twice followed by to, and once each followed by 7 other words, a few of which are shown in the table. This is not the kind of density of data on which one can sensibly build a probabilistic model. The four-gram model is entirely useless.
In general, four-gram models do not become usable until one is training on several tens of millions of words of data.

Examining the table suggests an obvious strategy: use higher order n-gram models when one has seen enough data for them to be of some use, but back off to lower order n-gram models when there isn't enough data. This is a widely used strategy, which we will discuss below in the section on combining estimates, but it isn't by itself a complete solution to the problem of n-gram estimates. For instance, we saw quite a lot of words following was in the training data (9409 tokens of 1481 types) but inferior was not one of them. Similarly, although we had seen quite
a lot of words in our training text overall, there are many words that did not appear, including perfectly ordinary words like decides or wart. So regardless of how we combine estimates, we still definitely need a way to give a non-zero probability estimate to words or n-grams that we happened not to see in our training text, and so we will work on that problem first.

6.2.2 Laplace's law, Lidstone's law and the Jeffreys-Perks law

Laplace's law

The manifest failure of maximum likelihood estimation forces us to examine better estimators. The oldest solution is to employ Laplace's law (1814; 1995). According to this law,

(6.6)  P_Lap(w_1...w_n) = (C(w_1...w_n) + 1) / (N + B)

This process is often informally referred to as adding one, and has the effect of giving a little bit of the probability space to unseen events. But rather than simply being an unprincipled move, this is actually the Bayesian estimator that one derives if one assumes a uniform prior on events (i.e., that every n-gram was equally likely).

However, note that the estimates which Laplace's law gives are dependent on the size of the vocabulary. For sparse sets of data over large vocabularies, such as n-grams, Laplace's law actually gives far too much of the probability space to unseen events. Consider some data discussed by Church and Gale (1991a) in the context of their discussion of various estimators for bigrams. Their corpus of 44 million words of Associated Press (AP) newswire yielded a vocabulary of 400,653 words (maintaining case distinctions, splitting on hyphens, etc.). Note that this vocabulary size means that there is a space of 400,653² = 1.6 x 10¹¹ possible bigrams, and so a priori barely any of them will actually occur in the corpus. It also means that, in the calculation of P_Lap, B is far larger than N, and Laplace's method is completely unsatisfactory in such circumstances.

Church and Gale used half the corpus (22 million words) as a training text.
Table 6.4 shows the expected frequency estimates of various methods that they discuss, and Laplace's law estimates that we have calculated. Probability estimates can be derived by dividing the frequency estimates by the number of n-grams, N = 22 million. For Laplace's law, the probability estimate for an n-gram seen r times is
r = f_MLE   f_empirical   f_Lap   f_del   f_GT   N_r   T_r

[Table 6.4 rows: r = 0 through 9; numeric entries omitted.]

Table 6.4  Estimated frequencies for the AP data from Church and Gale (1991a). The first five columns show the estimated frequency calculated for a bigram that actually appeared r times in the training data according to different estimators: f_MLE is the maximum likelihood estimate, f_empirical uses validation on the test set, f_Lap is the add one method, f_del is deleted interpolation (two-way cross validation, using the training data), and f_GT is the Good-Turing estimate. The last two columns give the frequencies of frequencies and how often bigrams of a certain frequency occurred in further text.

(r + 1)/(N + B), so the frequency estimate becomes f_Lap = (r + 1)N/(N + B). These estimated frequencies are often easier for humans to interpret than probabilities, as one can more easily see the effect of the discounting. Although each previously unseen bigram has been given a very low probability, because there are so many of them, 46.5% of the probability space has actually been given to unseen bigrams.⁷ This is far too much, and it is done at the cost of enormously reducing the probability estimates of more frequent events. How do we know it is far too much? The second column of the table shows an empirically determined estimate (which we discuss below) of how often unseen n-grams actually appeared in further text, and we see that the individual frequency of occurrence of previously unseen n-grams is much lower than Laplace's law predicts, while the frequency of occurrence of previously seen n-grams is much higher than predicted.⁸ In particular, the empirical model finds that only 9.2% of the bigrams in further text were previously unseen.

7. This is calculated as N₀ · P_Lap(·) = 74,671,100,000 x 0.000137/22,000,000 ≈ 0.465.
8. It is a bit hard dealing with the astronomical numbers in the table. A smaller example which illustrates the same point appears in exercise 6.2.
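The 46.5% figure quoted above can be checked directly from equation (6.6) and the numbers in the Church and Gale discussion. A sketch (function and variable names are ours):

```python
# Laplace's law (eq. 6.6): P_Lap = (C(w_1...w_n) + 1) / (N + B)
def p_lap(count, N, B):
    return (count + 1) / (N + B)

# Church and Gale's AP bigram figures, as quoted above:
N = 22_000_000           # bigram tokens in the 22-million-word training half
B = 400_653 ** 2         # possible bigrams over the 400,653-word vocabulary
N0 = 74_671_100_000      # previously unseen bigram types (table 6.4)

# Total probability mass that add-one smoothing assigns to unseen bigrams:
unseen_mass = N0 * p_lap(0, N, B)   # roughly 0.465, i.e. 46.5%
```

Because B dwarfs N here, even a count-1 bigram is discounted from frequency 1 to about N/(N + B) ≈ 0.000274 times 2, which is the "enormous reduction" the text describes.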
Lidstone's law and the Jeffreys-Perks law

Because of this overestimation, a commonly adopted solution to the problem of multinomial estimation within statistical practice is Lidstone's law of succession, where we add not one, but some (normally smaller) positive value λ:

(6.7)  P_Lid(w_1...w_n) = (C(w_1...w_n) + λ) / (N + Bλ)

This method was developed by the actuaries Hardy and Lidstone, and Johnson showed that it can be viewed as a linear interpolation (see below) between the MLE estimate and a uniform prior. This may be seen by setting μ = N/(N + Bλ):

(6.8)  P_Lid(w_1...w_n) = μ · C(w_1...w_n)/N + (1 - μ) · 1/B

The most widely used value for λ is 1/2. This choice can be theoretically justified as being the expectation of the same quantity which is maximized by MLE, and so it has its own names: the Jeffreys-Perks law, or Expected Likelihood Estimation (ELE) (Box and Tiao 1973: 34-36).

In practice, this often helps. For example, we could avoid the objection above that too much of the probability space was being given to unseen events by choosing a small λ. But there are two remaining objections: (i) we need a good way to guess an appropriate value for λ in advance, and (ii) discounting using Lidstone's law always gives probability estimates linear in the MLE frequency, and this is not a good match to the empirical distribution at low frequencies.

Applying these methods to Austen

Despite the problems inherent in these methods, we will nevertheless try applying them, in particular ELE, to our Austen corpus. Recall that up until now the only probability estimate we have been able to derive for the test corpus clause she was inferior to both sisters was the unigram estimate, which (multiplying through the bold probabilities in the top part of table 6.3) gives an estimate for the probability of the clause. For the other models, the probability estimate was either zero or undefined, because of the sparseness of the data.

Let us now calculate a probability estimate for this clause using a bigram model and ELE.
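Lidstone's law (6.7), specialized to a conditional bigram distribution with λ = 1/2 (i.e., ELE), can be sketched as follows, using the Austen counts discussed in this section (function name is ours):

```python
def p_lid(count, history_count, V, lam=0.5):
    """Lidstone's law (eq. 6.7) for a conditional bigram distribution:
    (C + lam) / (N + B*lam), where N is the count of the conditioning
    word and B = V, the vocabulary size. lam = 0.5 gives ELE."""
    return (count + lam) / (history_count + V * lam)

# Austen figures: "was" appears 9409 times, "was not" 608 times,
# and the training vocabulary has 14,585 word types.
p_not = p_lid(608, 9409, 14_585)      # ELE estimate of P(not | was)
p_inferior = p_lid(0, 9409, 14_585)   # unseen "was inferior" now gets > 0
```

Note how p_not comes out near 0.036, well below the MLE value 608/9409 ≈ 0.065: the seen events pay for the mass handed to the unseen ones.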
Following the word was, which appeared 9409 times, not appeared 608 times in the training corpus, which overall contained 14,585 word types. So our new estimate for P(not|was) is (608 + 0.5)/(9409 + 14,585 x 0.5) ≈ 0.036. The estimate for P(not|was) has thus been discounted (by almost half!) relative to the MLE estimate of 608/9409 ≈ 0.065. If we do similar calculations for the other words, then we get the results shown in the last column of table 6.5. The ordering of most likely words is naturally unchanged, but the probability estimates of words that did appear in the training text are discounted, while non-occurring words, in particular the actual next word, inferior, are given a non-zero probability of occurrence.

[Table 6.5 layout: rank, word, MLE estimate, ELE estimate, for the words not (rank 1), a, the, to, ..., and inferior (rank = 1482); numeric entries omitted.]

Table 6.5  Expected Likelihood Estimation estimates for the word following was.

Continuing in this way to also estimate the other bigram probabilities, we find that this language model gives a probability estimate for the clause. Unfortunately, this probability estimate is actually lower than the MLE estimate based on unigram counts, reflecting how greatly all the MLE probability estimates for seen n-grams are discounted in the construction of the ELE model. This result substantiates the slogan used in the titles of (Gale and Church 1990a,b): poor estimates of context are worse than none.

Note, however, that this does not mean that the model that we have constructed is entirely useless. Although the probability estimates it gives are extremely low, one can nevertheless use them to rank alternatives. For example, the model does correctly tell us that she was inferior to both sisters is a much more likely clause in English than inferior to was both she sisters, whereas the unigram estimate gives them both the same probability.

6.2.3 Held out estimation

How do we know that giving 46.5% of the probability space to unseen events is too much? One way that we can test this is empirically. We
can take further text (assumed to be from the same source) and see how often bigrams that appeared r times in the training text tend to turn up in the further text. The realization of this idea is the held out estimator of Jelinek and Mercer (1985).

The held out estimator

For each n-gram w_1...w_n, let:

C_1(w_1...w_n) = frequency of w_1...w_n in training data
C_2(w_1...w_n) = frequency of w_1...w_n in held out data

and recall that N_r is the number of bigrams with frequency r (in the training text). Now let:

(6.9)  T_r = Σ_{w_1...w_n : C_1(w_1...w_n) = r} C_2(w_1...w_n)

That is, T_r is the total number of times that all n-grams that appeared r times in the training text appeared in the held out data. Then the average frequency of those n-grams is T_r/N_r, and so an estimate for the probability of one of these n-grams is:

(6.10)  P_ho(w_1...w_n) = T_r / (N_r · N)   where C(w_1...w_n) = r

Pots of data for developing and testing models

A cardinal sin in Statistical NLP is to test on your training data. But why is that? The idea of testing is to assess how well a particular model works. That can only be done if it is a fair test on data that has not been seen before. In general, models induced from a sample of data have a tendency to be overtrained, that is, to expect future events to be like the events on which the model was trained, rather than allowing sufficiently for other possibilities. (For instance, stock market models sometimes suffer from this failing.) So it is essential to test on different data. A particular case of this is for the calculation of cross entropy (section 2.2.6). To calculate cross entropy, we take a large sample of text and calculate the per-word entropy of that text according to our model. This gives us a measure of the quality of our model, and an upper bound for the entropy of the language that the text was drawn from in general. But all that is only true if the test data is independent of the training data, and large enough
to be indicative of the complexity of the language at hand. If we test on the training data, the cross entropy can easily be lower than the real entropy of the text. In the most blatant case we could build a model that has memorized the training text and always predicts the next word with probability 1. Even if we don't do that, we will find that MLE is an excellent language model if you are testing on training data, which is not the right result. So when starting to work with some data, one should always separate it immediately into a training portion and a testing portion. The test data is normally only a small percentage (5-10%) of the total data, but has to be sufficient for the results to be reliable. You should always eyeball the training data: you want to use your human pattern-finding abilities to get hints on how to proceed. You shouldn't eyeball the test data; that is cheating, even if less directly than getting your program to memorize it. Commonly, however, one wants to divide both the training and test data into two again, for different reasons. For many Statistical NLP methods, such as held out estimation of n-grams, one gathers counts from one lot of training data, and then one smooths these counts or estimates certain other parameters of the assumed model based on what turns up in further held out or validation data. The held out data needs to be independent of both the primary training data and the test data. Normally the stage using the held out data involves the estimation of many fewer parameters than are estimated from counts over the primary training data, and so it is appropriate for the held out data to be much smaller than the primary training data (commonly about 10% of the size). Nevertheless, it is important that there is sufficient data for any additional parameters of the model to be accurately estimated, or significant performance losses can occur (as Chen and Goodman (1996: 317) show).
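The division of a corpus into these pots can be sketched as follows. This is a minimal illustration, not code from the book; the function name is hypothetical and the default fractions are merely chosen within the ranges the text mentions (test data 5-10% of the total, held out data about 10% of the training portion), using contiguous chunks.

```python
def split_corpus(tokens, test_frac=0.10, heldout_frac=0.10, dev_frac=0.5):
    """Carve a corpus into four pots using contiguous chunks:
    primary training data, held out (validation) data, a development
    test set, and a final test set."""
    n_test = int(len(tokens) * test_frac)
    rest, test = tokens[:-n_test], tokens[-n_test:]
    n_ho = int(len(rest) * heldout_frac)
    train, heldout = rest[:-n_ho], rest[-n_ho:]
    n_dev = int(len(test) * dev_frac)          # split test into dev and final
    dev_test, final_test = test[:n_dev], test[n_dev:]
    return train, heldout, dev_test, final_test
```

As discussed below, one could equally well draw the test material randomly from throughout the data; the contiguous version shown here corresponds to the second school of thought.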
A typical pattern in Statistical NLP research is to write an algorithm, train it, and test it, note some things that it does wrong, revise it and then to repeat the process (often many times!). But, if one does that a lot, not only does one tend to end up seeing aspects of the test set, but just repeatedly trying out different variant algorithms and looking at their performance can be viewed as subtly probing the contents of the test set. This means that testing a succession of variant models can again lead to overtraining. So the right approach is to have two test sets: a development test set on which successive variant methods are trialed and a final test set which is used to produce the final results that are published about the performance of the algorithm. One should expect performance on
the final test set to be slightly lower than on the development test set (though sometimes one can be lucky). The discussion so far leaves open exactly how to choose which parts of the data are to be used as testing data. Actually here opinion divides into two schools. One school favors selecting bits (sentences or even n-grams) randomly from throughout the data for the test set and using the rest of the material for training. The advantage of this method is that the testing data is as similar as possible (with respect to genre, register, writer, and vocabulary) to the training data. That is, one is training from as accurate a sample as possible of the type of language in the test data. The other possibility is to set aside large contiguous chunks as test data. The advantage of this is the opposite: in practice, one will end up using any NLP system on data that varies a little from the training data, as language use changes a little in topic and structure with the passage of time. Therefore, some people think it best to simulate that a little by choosing test data that perhaps isn't quite stationary with respect to the training data. At any rate, if using held out estimation of parameters, it is best to choose the same strategy for setting aside data for held out data as for test data, as this makes the held out data a better simulation of the test data. This choice is one of the many reasons why system results can be hard to compare: all else being equal, one should expect slightly worse performance results if using the second approach. While covering testing, let us mention one other issue. In early work, it was common to just run the system on the test data and present a single performance figure (for perplexity, percent correct or whatever). But this isn't a very good way of testing, as it gives no idea of the variance in the performance of the system. A much better way is to divide the test data into, say 20, smaller samples, and work out a test result on each of them.
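The divide-and-score procedure just described can be sketched as follows. This is an illustrative fragment, not code from the book; the function name is hypothetical, and the scoring function is left abstract since it depends on the evaluation measure (perplexity, percent correct, or whatever).

```python
from statistics import mean, variance

def evaluate_in_samples(test_items, score_fn, k=20):
    """Divide the test data into k smaller samples taken from
    different regions, score each sample separately, and report the
    mean performance together with the sample variance of the
    per-sample scores."""
    size = len(test_items) // k
    samples = [test_items[i * size:(i + 1) * size] for i in range(k)]
    scores = [score_fn(sample) for sample in samples]
    return mean(scores), variance(scores)
```

The returned variance is exactly the quantity needed for the significance test discussed next.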
From those results, one can work out a mean performance figure, as before, but one can also calculate the variance that shows how much performance tends to vary. If using this method together with continuous chunks of training data, it is probably best to take the smaller testing samples from different regions of the data, since the testing lore tends to be full of stories about certain sections of data sets being easy, and so it is better to have used a range of test data from different sections of the corpus. If we proceed this way, then one system can score higher on average than another purely by accident, especially when within-system variance is high. So just comparing average scores is not enough for meaningful
system comparison. Instead, we need to apply a statistical test that takes into account both mean and variance. Only if the statistical test rejects the possibility of an accidental difference can we say with confidence that one system is better than the other.⁹ An example of using the t test (which we introduced in section 5.3.1) for comparing the performance of two systems is shown in table 6.6 (adapted from (Snedecor and Cochran 1989: 92)). Note that we use a pooled estimate of the sample variance s² here, under the assumption that the variance of the two systems is the same (which seems a reasonable assumption here: 609 and 526 are close enough).

                           System 1                   System 2
scores                     71, 61, 55, 60, 68, 49,    42, 55, 75, 45, 54, 51,
                           42, 72, 76, 55, 64         55, 36, 58, 55, 67
total                      673                        593
n                          11                         11
mean x̄_i                   61.2                       53.9
s_i² = Σ_j (x_ij − x̄_i)²                              1,228.8
df                         20
t = (x̄_1 − x̄_2)/√(2s²/n) = 1.56

Table 6.6  Using the t test for comparing the performance of two systems. Since we calculate the mean for each data set, the denominator in the calculation of variance and the number of degrees of freedom is (11 − 1) + (11 − 1) = 20. The data do not provide clear support for the superiority of system 1. Despite the clear difference in mean scores, the sample variance is too high to draw any definitive conclusions.

Looking up the t distribution in the appendix, we find that, for rejecting the hypothesis that system 1 is better than system 2 at a probability level of α = 0.05, the critical value is t = 1.725 (using a one-tailed test with 20 degrees of freedom). Since we have t = 1.56 < 1.725, the data fail the significance test. Although the averages are fairly distinct, we cannot conclude superiority of system 1 here because of the large variance of scores.

9. Systematic discussion of testing methodology for comparing statistical and machine learning algorithms can be found in (Dietterich 1998). A good case study, for the example of word sense disambiguation, is (Mooney 1996).
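The pooled-variance t test of table 6.6 can be reproduced with a short calculation. This is a sketch, not code from the book; note that the pairing of the flattened score rows with the two systems is partly a reconstruction, so the t value it yields, about 1.60, differs slightly from the 1.56 reported in the text.

```python
from math import sqrt

def pooled_t(x, y):
    """Two-sample t statistic with a pooled variance estimate:
    t = (mean1 - mean2) / sqrt(2 s^2 / n), for equal group sizes n,
    under the assumption that both systems have the same variance."""
    n = len(x)
    m1, m2 = sum(x) / n, sum(y) / n
    ss1 = sum((v - m1) ** 2 for v in x)
    ss2 = sum((v - m2) ** 2 for v in y)
    s2 = (ss1 + ss2) / (2 * (n - 1))       # pooled s^2, df = 2(n - 1)
    return (m1 - m2) / sqrt(2 * s2 / n)

system1 = [71, 61, 55, 60, 68, 49, 42, 72, 76, 55, 64]
system2 = [42, 55, 75, 45, 54, 51, 55, 36, 58, 55, 67]
t = pooled_t(system1, system2)
# t stays below the one-tailed critical value of 1.725 for
# alpha = 0.05 with 20 degrees of freedom, so the difference in
# means is not significant at that level.
```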
Using held out estimation on the test data

So long as the frequency of an n-gram C(w_1…w_n) is the only thing that we are using to predict its future frequency in text, then we can use held out estimation performed on the test set to provide the correct answer of what the discounted estimates of probabilities should be in order to maximize the probability of the test set data. Doing this empirically measures how often n-grams that were seen r times in the training data actually do occur in the test text. The empirical estimates f_empirical in table 6.4 were found by randomly dividing the 44 million bigrams in the whole AP corpus into equal-sized training and test sets, counting frequencies in the 22 million word training set and then doing held out estimation using the test set. Whereas other estimates are calculated only from the 22 million words of training data, this estimate can be regarded as an empirically determined gold standard, achieved by allowing access to the test data.

Cross-validation (deleted estimation)

The f_empirical estimates discussed immediately above were constructed by looking at what actually happened in the test data. But the idea of held out estimation is that we can achieve the same effect by dividing the training data into two parts. We build initial estimates by doing counts on one part, and then we use the other pool of held out data to refine those estimates. The only cost of this approach is that our initial training data is now less, and so our probability estimates will be less reliable. Rather than using some of the training data only for frequency counts and some only for smoothing probability estimates, more efficient schemes are possible where each part of the training data is used both as initial training data and as held out data. In general, such methods in statistics go under the name cross-validation. Jelinek and Mercer (1985) use a form of two-way cross-validation that they call deleted estimation.
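The held out estimator of equations (6.9) and (6.10), on which deleted estimation builds, can be sketched as follows. This is an illustration, not code from the book. Two simplifying assumptions are made: N is taken to be the number of held out tokens (which makes the estimates sum to one over the frequency classes), and for r = 0 only unseen n-grams that happen to occur in the held out data are counted, whereas N_0 should really range over all possible unseen n-grams.

```python
from collections import Counter

def held_out_probs(train_ngrams, heldout_ngrams):
    """Held out estimation (Jelinek and Mercer 1985): for each
    training frequency r, N_r is the number of n-gram types seen r
    times in training, and T_r is the total number of times those
    types occur in the held out data; P_ho = T_r / (N_r * N)."""
    c1, c2 = Counter(train_ngrams), Counter(heldout_ngrams)
    n = len(heldout_ngrams)
    n_r, t_r = Counter(), Counter()
    for ngram in set(c1) | set(c2):   # simplification: observed types only
        r = c1[ngram]
        n_r[r] += 1
        t_r[r] += c2[ngram]
    return {r: t_r[r] / (n_r[r] * n) for r in n_r}
```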
Suppose we let N_r^a be the number of n-grams occurring r times in the a-th part of the training data, and T_r^{ab} be the total occurrences of those bigrams from part a in the b-th part. Now depending on which part is viewed as the basic training data, standard held out estimates would be either:

P_ho(w_1…w_n) = T_r^{01}/(N_r^0 N)   or   T_r^{10}/(N_r^1 N),   where C(w_1…w_n) = r
The more efficient deleted interpolation estimate does counts and smoothing on both halves and then averages the two:

(6.11)  P_del(w_1…w_n) = (T_r^{01} + T_r^{10}) / (N(N_r^0 + N_r^1)),   where C(w_1…w_n) = r

On large training corpora, doing deleted estimation on the training data works better than doing held-out estimation using just the training data, and indeed table 6.4 shows that it produces results that are quite close to the empirical gold standard.¹⁰ It is nevertheless still some way off for low frequency events. It overestimates the expected frequency of unseen objects, while underestimating the expected frequency of objects that were seen once in the training data. By dividing the text into two parts like this, one estimates the probability of an object by how many times it was seen in a sample of size N/2, assuming that the probability of a token seen r times in a sample of size N/2 is double that of a token seen r times in a sample of size N. However, it is generally true that as the size of the training corpus increases, the percentage of unseen n-grams that one encounters in held out data, and hence one's probability estimate for unseen n-grams, decreases (while never becoming negligible). It is for this reason that collecting counts on a smaller training corpus has the effect of overestimating the probability of unseen n-grams. There are other ways of doing cross-validation. In particular Ney et al. (1997) explore a method that they call Leaving-One-Out where the primary training corpus is of size N − 1 tokens, while 1 token is used as held out data for a sort of simulated testing. This process is repeated N times so that each piece of data is left out in turn. The advantage of this training regime is that it explores the effect of how the model changes if any particular piece of data had not been observed, and Ney et al.
show strong connections between the resulting formulas and the widely-used Good-Turing method to which we turn next.¹¹

10. Remember that, although the empirical gold standard was derived by held out estimation, it was held out estimation based on looking at the test data! Chen and Goodman (1998) find in their study that for smaller training corpora, held out estimation outperforms deleted estimation.
11. However, Chen and Goodman (1996: 314) suggest that leaving one word out at a time is problematic, and that using larger deleted chunks in deleted interpolation is to be preferred.
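The deleted estimation formula (6.11) can be sketched as follows. This is illustrative code, not from the book; by symmetry the resulting estimates, weighted by the sizes of the frequency classes N_r^0 + N_r^1, sum to one when N is taken to be the size of the combined training data.

```python
from collections import Counter

def deleted_estimates(part0, part1):
    """Deleted estimation, equation (6.11): do held out estimation in
    both directions over the two halves of the training data and
    average, P_del = (T_r^01 + T_r^10) / (N (N_r^0 + N_r^1))."""
    c0, c1 = Counter(part0), Counter(part1)
    n = len(part0) + len(part1)        # N: size of the combined data
    n_r, t_r = Counter(), Counter()
    for ngram in set(c0) | set(c1):
        n_r[c0[ngram]] += 1            # frequency class in part 0 ...
        t_r[c0[ngram]] += c1[ngram]    # ... scored against part 1,
        n_r[c1[ngram]] += 1            # and symmetrically the class in
        t_r[c1[ngram]] += c0[ngram]    # part 1 scored against part 0
    return {r: t_r[r] / (n * n_r[r]) for r in n_r}

part0 = [("a", "b"), ("a", "b"), ("b", "c")]
part1 = [("a", "b"), ("c", "d")]
p_del = deleted_estimates(part0, part1)
```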
Good-Turing estimation

The Good-Turing estimator

Good (1953) attributes to Turing a method for determining frequency or probability estimates of items, on the assumption that their distribution is binomial. This method is suitable for large numbers of observations of data drawn from a large vocabulary, and works well for n-grams, despite the fact that words and n-grams do not have a binomial distribution. The probability estimate in Good-Turing estimation is of the form:

(6.12)  P_GT = r*/N

where r* can be thought of as an adjusted frequency. The theorem underlying Good-Turing methods gives that for previously observed items:

(6.13)  r* = (r + 1) E(N_{r+1})/E(N_r)

where E denotes the expectation of a random variable (see (Church and Gale 1991a; Gale and Sampson 1995) for discussion of the derivation of this formula). The total probability mass reserved for unseen objects is then E(N_1)/N (see exercise 6.5). Using our empirical estimates, we can hope to substitute the observed N_r for E(N_r). However, we cannot do this uniformly, since these empirical estimates will be very unreliable for high values of r. In particular, the most frequent n-gram would be estimated to have probability zero, since the number of n-grams with frequency one greater than it is zero! In practice, one of two solutions is employed. One is to use Good-Turing reestimation only for frequencies r < k for some constant k (e.g., 10). Low frequency words are numerous, so substitution of the observed frequency of frequencies for the expectation is quite accurate, while the MLE estimates of high frequency words will also be quite accurate and so one doesn't need to discount them. The other is to fit some function S through the observed values of (r, N_r) and to use the smoothed values S(r) for the expectation (this leads to a family of possibilities depending on exactly which method of curve fitting is employed; Good (1953) discusses several smoothing methods).
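The first of these two fixes, applying Good-Turing reestimation only below a cutoff r < k, can be sketched as follows. This is illustrative code, not from the book, and the counts fed into it are invented.

```python
from collections import Counter

def good_turing_counts(type_counts, k=10):
    """Adjusted counts r* = (r + 1) N_{r+1} / N_r, substituting the
    observed frequencies of frequencies for the expectations in
    equation (6.13), but only for r < k; above the cutoff (and where
    N_{r+1} is zero) the raw MLE count is kept."""
    n_r = Counter(type_counts)                 # frequencies of frequencies
    adjusted = {}
    for r in n_r:
        if 0 < r < k and n_r[r + 1] > 0:
            adjusted[r] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[r] = float(r)             # keep the MLE count
    return adjusted

# invented data: 10 types seen once, 5 seen twice, 3 seen three times, ...
type_counts = [1] * 10 + [2] * 5 + [3] * 3 + [4] * 2 + [5]
adjusted = good_turing_counts(type_counts)
unseen_mass = type_counts.count(1) / sum(type_counts)   # N_1 / N
```

The last line computes the total probability mass reserved for unseen objects, N_1/N.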
The probability mass N_1/N given to unseen items can either be divided among them uniformly, or by some more sophisticated method (see under Combining Estimators, below). So using this method with a uniform estimate for unseen events, we have:

Good-Turing Estimator: If C(w_1…w_n) = r > 0,

        P_GT(w_1…w_n) = r*/N,   where r* = (r + 1)S(r + 1)/S(r)
If C(w_1…w_n) = 0,

(6.14)  P_GT(w_1…w_n) = (1 − Σ_{r=1}^∞ N_r (r*/N)) / N_0  ≈  N_1/(N_0 N)

Gale and Sampson (1995) present a simple and effective approach, Simple Good-Turing, which effectively combines these two approaches. As a smoothing curve they simply use a power curve N_r = a r^b (with b < −1 to give the appropriate hyperbolic relationship), and estimate a and b by simple linear regression on the logarithmic form of this equation, log N_r = log a + b log r (linear regression is covered in all introductory statistics books). However, they suggest that such a simple curve is probably only appropriate for high values of r. For low values of r, they use the measured N_r directly. Working up through frequencies, these direct estimates are used until for one of them there isn't a significant difference between r* values calculated directly or via the smoothing function, and then smoothed estimates are used for all higher frequencies.¹² Simple Good-Turing can give exceedingly good estimators, as can be seen by comparing the Good-Turing column f_GT in table 6.4 with the empirical gold standard. Under any of these approaches, it is necessary to renormalize all the estimates to ensure that a proper probability distribution results. This can be done either by adjusting the amount of probability mass given to unseen items (as in equation (6.14)), or, perhaps better, by keeping the estimate of the probability mass for unseen items as N_1/N and renormalizing all the estimates for previously seen items (as Gale and Sampson (1995) propose).

Frequencies of frequencies in Austen

To do Good-Turing, the first step is to calculate the frequencies of different frequencies (also known as count-counts). Table 6.7 shows extracts from the resulting list of frequencies of frequencies for bigrams and trigrams. (The numbers are reminiscent of the Zipfian distributions of
12. An estimate of r* is deemed significantly different if the difference exceeds 1.65 times the standard deviation of the Good-Turing estimate, which is given by:

        √( (r + 1)² · (N_{r+1}/N_r²) · (1 + N_{r+1}/N_r) )
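The smoothing step of Simple Good-Turing, fitting log N_r = log a + b log r by least squares, can be sketched as follows. This is an illustration only: the full Gale and Sampson procedure also switches between raw and smoothed estimates using the significance rule of footnote 12 and renormalizes at the end, and the count-count values below are invented.

```python
from math import exp, log

def fit_power_curve(n_r):
    """Least-squares fit of log N_r = log a + b log r, returning the
    smoothing function S(r) = a * r**b used in place of the raw N_r."""
    rs = sorted(n_r)
    xs = [log(r) for r in rs]
    ys = [log(n_r[r]) for r in rs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    log_a = my - b * mx
    return lambda r: exp(log_a) * r ** b

n_r = {1: 120, 2: 40, 3: 24, 4: 15, 5: 10}   # invented count-counts
s = fit_power_curve(n_r)

def r_star(r):
    """Smoothed adjusted count r* = (r + 1) S(r + 1) / S(r)."""
    return (r + 1) * s(r + 1) / s(r)
```

With a fitted slope b < −1, the adjusted count r* for once-seen items comes out below 1, i.e. singletons are discounted, as expected.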
CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute
More informationStatistics II Final Exam 26/6/18
Statstcs II Fnal Exam 26/6/18 Academc Year 2017/18 Solutons Exam duraton: 2 h 30 mn 1. (3 ponts) A town hall s conductng a study to determne the amount of leftover food produced by the restaurants n the
More informationCSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography
CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve
More informationCHAPTER IV RESEARCH FINDING AND DISCUSSIONS
CHAPTER IV RESEARCH FINDING AND DISCUSSIONS A. Descrpton of Research Fndng. The Implementaton of Learnng Havng ganed the whole needed data, the researcher then dd analyss whch refers to the statstcal data
More informationx yi In chapter 14, we want to perform inference (i.e. calculate confidence intervals and perform tests of significance) in this setting.
The Practce of Statstcs, nd ed. Chapter 14 Inference for Regresson Introducton In chapter 3 we used a least-squares regresson lne (LSRL) to represent a lnear relatonshp etween two quanttatve explanator
More informationLearning from Data 1 Naive Bayes
Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why
More informationAS-Level Maths: Statistics 1 for Edexcel
1 of 6 AS-Level Maths: Statstcs 1 for Edecel S1. Calculatng means and standard devatons Ths con ndcates the slde contans actvtes created n Flash. These actvtes are not edtable. For more detaled nstructons,
More informationTHE SUMMATION NOTATION Ʃ
Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the
More informationOnline Appendix to: Axiomatization and measurement of Quasi-hyperbolic Discounting
Onlne Appendx to: Axomatzaton and measurement of Quas-hyperbolc Dscountng José Lus Montel Olea Tomasz Strzaleck 1 Sample Selecton As dscussed before our ntal sample conssts of two groups of subjects. Group
More informationModule 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur
Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:
More informationNotes on Frequency Estimation in Data Streams
Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to
More informationTracking with Kalman Filter
Trackng wth Kalman Flter Scott T. Acton Vrgna Image and Vdeo Analyss (VIVA), Charles L. Brown Department of Electrcal and Computer Engneerng Department of Bomedcal Engneerng Unversty of Vrgna, Charlottesvlle,
More informationCopyright 2017 by Taylor Enterprises, Inc., All Rights Reserved. Adjusted Control Limits for U Charts. Dr. Wayne A. Taylor
Taylor Enterprses, Inc. Adjusted Control Lmts for U Charts Copyrght 207 by Taylor Enterprses, Inc., All Rghts Reserved. Adjusted Control Lmts for U Charts Dr. Wayne A. Taylor Abstract: U charts are used
More informationprinceton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora
prnceton unv. F 13 cos 521: Advanced Algorthm Desgn Lecture 3: Large devatons bounds and applcatons Lecturer: Sanjeev Arora Scrbe: Today s topc s devaton bounds: what s the probablty that a random varable
More information18. SIMPLE LINEAR REGRESSION III
8. SIMPLE LINEAR REGRESSION III US Domestc Beers: Calores vs. % Alcohol Ftted Values and Resduals To each observed x, there corresponds a y-value on the ftted lne, y ˆ ˆ = α + x. The are called ftted values.
More informationBOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu
BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS M. Krshna Reddy, B. Naveen Kumar and Y. Ramu Department of Statstcs, Osmana Unversty, Hyderabad -500 007, Inda. nanbyrozu@gmal.com, ramu0@gmal.com
More informationCHAPTER 8. Exercise Solutions
CHAPTER 8 Exercse Solutons 77 Chapter 8, Exercse Solutons, Prncples of Econometrcs, 3e 78 EXERCISE 8. When = N N N ( x x) ( x x) ( x x) = = = N = = = N N N ( x ) ( ) ( ) ( x x ) x x x x x = = = = Chapter
More informationGlobal Sensitivity. Tuesday 20 th February, 2018
Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values
More informationLogistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI
Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton
More informationWeek 5: Neural Networks
Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple
More information8.6 The Complex Number System
8.6 The Complex Number System Earler n the chapter, we mentoned that we cannot have a negatve under a square root, snce the square of any postve or negatve number s always postve. In ths secton we want
More informationLecture 4 Hypothesis Testing
Lecture 4 Hypothess Testng We may wsh to test pror hypotheses about the coeffcents we estmate. We can use the estmates to test whether the data rejects our hypothess. An example mght be that we wsh to
More informationGeneralized Linear Methods
Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set
More informationDurban Watson for Testing the Lack-of-Fit of Polynomial Regression Models without Replications
Durban Watson for Testng the Lack-of-Ft of Polynomal Regresson Models wthout Replcatons Ruba A. Alyaf, Maha A. Omar, Abdullah A. Al-Shha ralyaf@ksu.edu.sa, maomar@ksu.edu.sa, aalshha@ksu.edu.sa Department
More information2016 Wiley. Study Session 2: Ethical and Professional Standards Application
6 Wley Study Sesson : Ethcal and Professonal Standards Applcaton LESSON : CORRECTION ANALYSIS Readng 9: Correlaton and Regresson LOS 9a: Calculate and nterpret a sample covarance and a sample correlaton
More informationCase A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k.
THE CELLULAR METHOD In ths lecture, we ntroduce the cellular method as an approach to ncdence geometry theorems lke the Szemeréd-Trotter theorem. The method was ntroduced n the paper Combnatoral complexty
More informationNumerical Heat and Mass Transfer
Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and
More informationTHE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens
THE CHINESE REMAINDER THEOREM KEITH CONRAD We should thank the Chnese for ther wonderful remander theorem. Glenn Stevens 1. Introducton The Chnese remander theorem says we can unquely solve any par of
More informationGaussian Mixture Models
Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous
More informationDETERMINATION OF UNCERTAINTY ASSOCIATED WITH QUANTIZATION ERRORS USING THE BAYESIAN APPROACH
Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata TC XVII IMEKO World Congress Metrology n the 3rd Mllennum June 7, 3,
More informationMidterm Examination. Regression and Forecasting Models
IOMS Department Regresson and Forecastng Models Professor Wllam Greene Phone: 22.998.0876 Offce: KMC 7-90 Home page: people.stern.nyu.edu/wgreene Emal: wgreene@stern.nyu.edu Course web page: people.stern.nyu.edu/wgreene/regresson/outlne.htm
More informationHashing. Alexandra Stefan
Hashng Alexandra Stefan 1 Hash tables Tables Drect access table (or key-ndex table): key => ndex Hash table: key => hash value => ndex Man components Hash functon Collson resoluton Dfferent keys mapped
More informationExercises. 18 Algorithms
18 Algorthms Exercses 0.1. In each of the followng stuatons, ndcate whether f = O(g), or f = Ω(g), or both (n whch case f = Θ(g)). f(n) g(n) (a) n 100 n 200 (b) n 1/2 n 2/3 (c) 100n + log n n + (log n)
More informationBasic Business Statistics, 10/e
Chapter 13 13-1 Basc Busness Statstcs 11 th Edton Chapter 13 Smple Lnear Regresson Basc Busness Statstcs, 11e 009 Prentce-Hall, Inc. Chap 13-1 Learnng Objectves In ths chapter, you learn: How to use regresson
More informationECONOMICS 351*-A Mid-Term Exam -- Fall Term 2000 Page 1 of 13 pages. QUEEN'S UNIVERSITY AT KINGSTON Department of Economics
ECOOMICS 35*-A Md-Term Exam -- Fall Term 000 Page of 3 pages QUEE'S UIVERSITY AT KIGSTO Department of Economcs ECOOMICS 35* - Secton A Introductory Econometrcs Fall Term 000 MID-TERM EAM ASWERS MG Abbott
More information