
This excerpt from Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schütze, The MIT Press, is provided in screen-viewable form for personal use only by members of MIT CogNet. Unauthorized use or dissemination of this information is expressly forbidden. If you have any questions about this material, please contact cognetadmin@cognet.mit.edu.

6 Statistical Inference: n-gram Models over Sparse Data

Statistical NLP aims to do statistical inference for the field of natural language. Statistical inference in general consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about this distribution. For example, we might look at lots of instances of prepositional phrase attachments in a corpus, and use them to try to predict prepositional phrase attachments for English in general.

The discussion in this chapter divides the problem into three areas (although they tend to overlap considerably): dividing the training data into equivalence classes, finding a good statistical estimator for each equivalence class, and combining multiple estimators.

As a running example of statistical estimation, we will examine the classic task of language modeling, where the problem is to predict the next word given the previous words. This task is fundamental to speech or optical character recognition, and is also used for spelling correction, handwriting recognition, and statistical machine translation. This sort of task is often referred to as a Shannon game following the presentation of the task of guessing the next letter in a text in (Shannon 1951). This problem has been well-studied, and indeed many estimation methods were first developed for this task. In general, though, the methods we develop are not specific to this task, and can be directly used for other tasks like word sense disambiguation or probabilistic parsing. The word prediction task just provides a clear, easily-understood problem for which the techniques can be developed.

6.1 Bins: Forming Equivalence Classes

Reliability vs. discrimination

Normally, in order to do inference about one feature, we wish to find other features of the model that predict it. Here, we are assuming that past behavior is a good guide to what will happen in the future (that is, that the model is roughly stationary). This gives us a classification task: we try to predict the target feature on the basis of various classificatory features. When doing this, we effectively divide the data into equivalence classes that share values for certain of the classificatory features, and use this equivalence classing to help predict the value of the target feature on new pieces of data. This means that we are tacitly making independence assumptions: the data either does not depend on other features, or the dependence is sufficiently minor that we hope that we can neglect it without doing too much harm. The more classificatory features (of some relevance) that we identify, the more finely conditions that determine the unknown probability distribution of the target feature can potentially be teased apart. In other words, dividing the data into many bins gives us greater discrimination. Going against this is the problem that if we use a lot of bins then a particular bin may contain no or a very small number of training instances, and then we will not be able to do statistically reliable estimation of the target feature for that bin. Finding equivalence classes that are a good compromise between these two criteria is our first goal.

n-gram models

The task of predicting the next word can be stated as attempting to estimate the probability function P:

(6.1)  P(w_n | w_1, ..., w_{n-1})

In such a stochastic problem, we use a classification of the previous words, the history, to predict the next word. On the basis of having looked at a lot of text, we know which words tend to follow other words. For this task, we cannot possibly consider each textual history separately: most of the time we will be listening to a sentence that we have never heard before, and so there is no previous identical textual history on which to base our predictions, and even if we had heard the beginning of the sentence before, it might end differently this time. And so we

need a method of grouping histories that are similar in some way so as to give reasonable predictions as to which words we can expect to come next. One possible way to group them is by making a Markov assumption that only the prior local context, the last few words, affects the next word. If we construct a model where all histories that have the same last n-1 words are placed in the same equivalence class, then we have an (n-1)th order Markov model or an n-gram word model (the last word of the n-gram being given by the word we are predicting).

Before continuing with model-building, let us pause for a brief interlude on naming. The cases of n-gram models that people usually use are for n = 2, 3, 4, and these alternatives are usually referred to as a bigram, a trigram, and a four-gram model, respectively. Revealing this will surely be enough to cause any Classicists who are reading this book to stop, and to leave the field to uneducated engineering sorts: gram is a Greek root and so should be put together with Greek number prefixes. Shannon actually did use the term digram, but with the declining levels of education in recent decades, this usage has not survived. As non-prescriptive linguists, however, we think that the curious mixture of English, Greek, and Latin that our colleagues actually use is quite fun. So we will not try to stamp it out.

Now in principle, we would like the n of our n-gram models to be fairly large, because there are sequences of words like:

(6.2)  Sue swallowed the large green ___.

where swallowed is presumably still quite strongly influencing which word will come next; pill or perhaps frog are likely continuations, but tree, car or mountain are presumably unlikely, even though they are in general fairly natural continuations after the large green. However, there is the problem that if we divide the data into too many bins, then there are a lot of parameters to estimate. For instance, if we conservatively assume that a speaker is staying within a vocabulary of 20,000 words, then we get the estimates for numbers of parameters shown in table 6.1.

1. Rather than four-gram, some people do make an attempt at appearing educated by saying quadgram, but this is not really correct use of a Latin number prefix (which would give quadrigram, cf. quadrilateral), let alone correct use of a Greek number prefix, which would give us a tetragram model.

2. Given a certain model space (here word n-gram models), the parameters are the numbers that we have to specify to determine a particular model within that model space. Since we are assuming nothing in particular about the probability distribution, the number of parameters to be estimated is the number of bins times one less than the number of values of the target feature (one is subtracted because the probability of the last target value is automatically given by the stochastic constraint that probabilities should sum to one).

Model                             Parameters
1st order (bigram model):    20,000 x 19,999   = 400 million
2nd order (trigram model):   20,000^2 x 19,999 = 8 trillion
3rd order (four-gram model): 20,000^3 x 19,999 = approximately 1.6 x 10^17

Table 6.1: Growth in number of parameters for n-gram models.

So we quickly see that producing a five-gram model, of the sort that we thought would be useful above, may well not be practical, even if we have what we think is a very large corpus. For this reason, n-gram systems currently usually use bigrams or trigrams (and often make do with a smaller vocabulary).

One way of reducing the number of parameters is to reduce the value of n, but it is important to realize that n-grams are not the only way of forming equivalence classes of the history. Among other operations of equivalencing, we could consider stemming (removing the inflectional endings from words) or grouping words into semantic classes (by use of a pre-existing thesaurus, or by some induced clustering). This is effectively reducing the vocabulary size over which we form n-grams. But we do not need to use n-grams at all. There are myriad other ways of forming equivalence classes of the history; it is just that they are all a bit more complicated than n-grams. The above example suggests that knowledge of the predicate in a clause is useful, so we can imagine a model that predicts the next word based on the previous word and the previous predicate (no matter how far back it is). But this model is harder to implement, because we first need a fairly accurate method of identifying the main predicate of a clause. Therefore we will just use n-gram models in this chapter, but other techniques are covered in chapters 12 and 14.
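The counts in table 6.1 follow directly from the fact that an n-gram model has V^(n-1) histories, each with V-1 free probabilities (footnote 2). A few illustrative lines of Python (not the book's companion software) make the growth concrete:

    # Number of free parameters of an n-gram model over a vocabulary of size V:
    # V**(n-1) histories, each with (V-1) free probabilities (the last one is
    # fixed by the constraint that the distribution sums to one).
    V = 20_000

    for n, name in [(2, "bigram"), (3, "trigram"), (4, "four-gram")]:
        params = V ** (n - 1) * (V - 1)
        print(f"{name:10s} (order {n - 1}): {params:.3g} parameters")

    # bigram     (order 1): 4e+08 parameters    (~400 million)
    # trigram    (order 2): 8e+12 parameters    (~8 trillion)
    # four-gram  (order 3): 1.6e+17 parameters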

For anyone from a linguistics background, the idea that we would choose to use a model of language structure which predicts the next word simply by examining the previous two words, with no reference to the structure of the sentence, seems almost preposterous. But, actually, the lexical co-occurrence, semantic, and basic syntactic relationships that appear in this very local context are a good predictor of the next word, and such systems work surprisingly well. Indeed, it is difficult to beat a trigram model on the purely linear task of predicting the next word.

Building n-gram models

In the final part of some sections of this chapter, we will actually build some models and show the results. The reader should be able to recreate our results by using the tools and data on the accompanying website. The text that we will use is Jane Austen's novels, and is available from the website. This corpus has two advantages: (i) it is freely available through the work of Project Gutenberg, and (ii) it is not too large. The small size of the corpus is, of course, in many ways also a disadvantage. Because of the huge number of parameters of n-gram models, as discussed above, n-gram models work best when trained on enormous amounts of data. However, such training requires a lot of CPU time and diskspace, so a small corpus is much more appropriate for a textbook example. Even so, you will want to make sure that you start off with about 40Mb of free diskspace before attempting to recreate our examples.

As usual, the first step is to preprocess the corpus. The Project Gutenberg Austen texts are very clean plain ASCII files. But nevertheless, there are the usual problems of punctuation marks attaching to words and so on (see chapter 4) that mean that we must do more than simply split on whitespace. We decided that we could make do with some very simple search-and-replace patterns that removed all punctuation, leaving whitespace-separated words (see the website for details). We decided to use Emma, Mansfield Park, Northanger Abbey, Pride and Prejudice, and Sense and Sensibility as our corpus for building models, reserving Persuasion for testing, as discussed below. This gave us a (small) training corpus of N = 617,091 words of text, containing a vocabulary V of 14,585 word types.

By simply removing all punctuation as we did, our file is literally a long sequence of words. This isn't actually what people do most of the time. It is commonly felt that there are not very strong dependencies between sentences, while sentences tend to begin in characteristic ways. So people mark the sentences in the text, most commonly by surrounding them with the SGML tags <s> and </s>. The probability calculations at the

start of a sentence are then dependent not on the last words of the preceding sentence but upon a beginning-of-sentence context. We should additionally note that we didn't remove case distinctions, so capitalized words remain in the data, imperfectly indicating where new sentences begin.

6.2 Statistical Estimators

Given a certain number of pieces of training data that fall into a certain bin, the second goal is then finding out how to derive a good probability estimate for the target feature based on these data. For our running example of n-grams, we will be interested in P(w_1 ... w_n) and the prediction task P(w_n | w_1 ... w_{n-1}). Since:

(6.3)  P(w_n | w_1 ... w_{n-1}) = P(w_1 ... w_n) / P(w_1 ... w_{n-1})

estimating good conditional probability distributions can be reduced to having good solutions to simply estimating the unknown probability distribution of n-grams.

Let us assume that the training text consists of N words. If we append n-1 dummy start symbols to the beginning of the text, we can then also say that the corpus consists of N n-grams, with a uniform amount of conditioning available for the next word in all cases. Let B be the number of bins (equivalence classes). This will be V^{n-1}, where V is the vocabulary size, for the task of working out the next word, and V^n for the task of estimating the probability of different n-grams. Let C(w_1 ... w_n) be the frequency of a certain n-gram in the training text, and let us say that there are N_r n-grams that appeared r times in the training text (i.e., N_r = |{w_1 ... w_n : C(w_1 ... w_n) = r}|). These frequencies of frequencies are very commonly used in the estimation methods which we cover below. This notation is summarized in table 6.2.

3. However, when smoothing, one has a choice of whether to smooth the n-gram probability estimates, or to smooth the conditional probability distributions directly. For many methods, these do not give equivalent results, since in the latter case one is separately smoothing a large number of conditional probability distributions (which normally need to be themselves grouped into classes in some way).
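For readers following along without the companion website's tools, a minimal sketch of the counting step might look like the following (the tokenizer is a crude stand-in, not the book's actual search-and-replace patterns, and the file name is hypothetical):

    from collections import Counter

    def tokenize(text):
        """Crude preprocessing in the spirit described above: strip punctuation
        and split on whitespace, keeping case distinctions."""
        for ch in '.,;:!?"()_-':
            text = text.replace(ch, " ")
        return text.split()

    def ngram_counts(tokens, n):
        """Counts C(w_1...w_n), padding with n-1 dummy start symbols so that a
        corpus of N words yields exactly N n-grams."""
        padded = ["<s>"] * (n - 1) + tokens
        return Counter(tuple(padded[i:i + n]) for i in range(len(tokens)))

    def freq_of_freqs(counts):
        """N_r: how many distinct n-grams occurred exactly r times."""
        return Counter(counts.values())

    # Usage, assuming a plain-text training file such as the Austen novels:
    # tokens = tokenize(open("austen_train.txt").read())
    # bigrams = ngram_counts(tokens, 2)
    # N_r = freq_of_freqs(bigrams)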

N              Number of training instances
B              Number of bins training instances are divided into
w_{1n}         An n-gram w_1 ... w_n in the training text
C(w_1 ... w_n) Frequency of n-gram w_1 ... w_n in training text
r              Frequency of an n-gram
f(.)           Frequency estimate of a model
N_r            Number of bins that have r training instances in them
T_r            Total count of n-grams of frequency r in further data
h              History of preceding words

Table 6.2: Notation for the statistical estimation chapter.

Maximum Likelihood Estimation (MLE)

MLE estimates from relative frequencies

Regardless of how we form equivalence classes, we will end up with bins that contain a certain number of training instances. Let us assume a trigram model where we are using the two preceding words of context to predict the next word, and let us focus in on the bin for the case where the two preceding words were comes across. In a certain corpus, the authors found 10 training instances of the words comes across, and of those, 8 times they were followed by as, once by more and once by a. The question at this point is what probability estimates we should use for estimating the next word. The obvious first answer (at least from a frequentist point of view) is to suggest using the relative frequency as a probability estimate:

P(as) = 0.8
P(more) = 0.1
P(a) = 0.1
P(x) = 0.0 for x not among the above 3 words

This estimate is called the maximum likelihood estimate (MLE):

(6.4)  P_MLE(w_1 ... w_n) = C(w_1 ... w_n) / N

(6.5)  P_MLE(w_n | w_1 ... w_{n-1}) = C(w_1 ... w_n) / C(w_1 ... w_{n-1})
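A minimal sketch of the conditional MLE of equation (6.5), using the comes across counts from the text (function and variable names here are illustrative, not the book's):

    from collections import Counter

    def p_mle(counts_n, counts_nminus1, history, word):
        """Maximum likelihood estimate P_MLE(word | history), equation (6.5):
        C(history, word) / C(history)."""
        denom = counts_nminus1[history]
        if denom == 0:
            return 0.0      # no training instances in this bin
        return counts_n[history + (word,)] / denom

    # The 'comes across' bin: 10 instances, 8 x 'as', 1 x 'more', 1 x 'a'.
    trigrams = Counter({("comes", "across", "as"): 8,
                        ("comes", "across", "more"): 1,
                        ("comes", "across", "a"): 1})
    bigrams = Counter({("comes", "across"): 10})

    print(p_mle(trigrams, bigrams, ("comes", "across"), "as"))   # 0.8
    print(p_mle(trigrams, bigrams, ("comes", "across"), "the"))  # 0.0, the zero-probability problem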

If one fixes the observed data, and then considers the space of all possible parameter assignments within a certain distribution (here a trigram model) given the data, then statisticians refer to this as a likelihood function. The maximum likelihood estimate is so called because it is the choice of parameter values which gives the highest probability to the training corpus. The estimate that does that is the one shown above. It does not waste any probability mass on events that are not in the training corpus, but rather it makes the probability of observed events as high as it can subject to the normal stochastic constraints.

But the MLE is in general unsuitable for statistical inference in NLP. The problem is the sparseness of our data (even if we are using a large corpus). While a few words are common, the vast majority of words are very uncommon, and longer n-grams involving them are thus much rarer again. The MLE assigns a zero probability to unseen events, and since the probability of a long string is generally computed by multiplying the probabilities of subparts, these zeroes will propagate and give us bad (zero probability) estimates for the probability of sentences when we just happened not to see certain n-grams in the training text. With respect to the example above, the MLE is not capturing the fact that there are other words which can follow comes across, for example the and some.

As an example of data sparseness, after training on 1.5 million words from the IBM Laser Patent Text corpus, Bahl et al. (1983) report that 23% of the trigram tokens found in further test data drawn from the same corpus were previously unseen. This corpus is small by modern standards, and so one might hope that by collecting much more data the problem of data sparseness would simply go away. While this may initially seem hopeful (if we collect a hundred instances of comes across, we will probably find instances with it followed by the and some), in practice it is never a general solution to the problem. While there are a limited number of frequent events in language, there is a seemingly never-end-

4. This is given that the occurrence of a certain n-gram is assumed to be a random variable with a binomial distribution (i.e., each n-gram is independent of the next). This is a quite untrue (though usable) assumption: firstly, each n-gram overlaps with and hence partly determines the next, and secondly, content words tend to clump (if you use a word once in a paper, you are likely to use it again), as we discuss in section

5. Another way to state this is to observe that if our probability model assigns zero probability to any event that turns out to actually occur, then both the cross-entropy and the KL divergence with respect to (data from) the real probability distribution is infinite. In other words we have done a maximally bad job at producing a probability function that is close to the one we are trying to model.

ing tail to the probability distribution of rarer and rarer events, and we can never collect enough data to get to the end of the tail. For instance, comes across could be followed by any number, and we will never see every number. In general, we need to devise better estimators that allow for the possibility that we will see events that we didn't see in the training text. All such methods effectively work by somewhat decreasing the probability of previously seen events, so that there is a little bit of probability mass left over for previously unseen events. Thus these methods are frequently referred to as discounting methods. The process of discounting is often referred to as smoothing, presumably because a distribution without zeroes is smoother than one with zeroes. We will examine a number of smoothing methods in the following sections.

Using MLE estimates for n-gram models of Austen

Based on our Austen corpus, we made n-gram models for different values of n. It is quite straightforward to write one's own program to do this, by totalling up the frequencies of n-grams and (n-1)-grams, and then dividing to get MLE probability estimates, but there is also software to do it on the website. In practical systems, it is usual to not actually calculate n-grams for all words. Rather, the n-grams are calculated as usual only for the most common k words, and all other words are regarded as Out-Of-Vocabulary (OOV) items and mapped to a single token such as <UNK>. Commonly, this will be done for all words that have been encountered only once in the training corpus (hapax legomena). A useful variant in some domains is to notice the obvious semantic and distributional similarity of rare numbers and to have two out-of-vocabulary tokens, one for numbers and one for everything else. Because of the Zipfian distribution of words, cutting out low frequency items will greatly reduce the parameter space (and the memory requirements of the system being built), while not appreciably affecting the model quality (hapax legomena often constitute half of the types, but only a fraction of the tokens).

We used the conditional probabilities calculated from our training corpus to work out the probabilities of each following word for part of a

6. Cf. Zipf's law, the observation that the relationship between a word's frequency and the rank order of its frequency is roughly a reciprocal curve, as discussed in section

[Table 6.3 appeared here; its numeric entries are not preserved in this transcription. For each word of the clause In person she was inferior to both sisters, it listed the highest-ranked candidate next words and their probabilities under 1-gram, 2-gram, 3-gram, and 4-gram models.]

Table 6.3: Probabilities of each successive word for a clause from Persuasion. The probability distribution for the following word is calculated by maximum likelihood estimate n-gram models for various values of n. The predicted likelihood rank of different words is shown in the first column. The actual next word is shown at the top of the table in italics, and in the table in bold.

sentence from our test corpus Persuasion. We will cover the issue of test corpora in more detail later, but it is vital for assessing a model that we try it on different data; otherwise it isn't a fair test of how well the model allows us to predict the patterns of language. Extracts from these probability distributions, including the actual next word shown in bold, are shown in table 6.3.

The unigram distribution ignores context entirely, and simply uses the overall frequency of different words. But this is not entirely useless, since, as in this clause, most words in most sentences are common words. The bigram model uses the preceding word to help predict the next word. In general, this helps enormously, and gives us a much better model. In some cases the estimated probability of the word that actually comes next has gone up by about an order of magnitude (was, to, sisters). However, note that the bigram model is not guaranteed to increase the probability estimate. The estimate for she has actually gone down, because she is in general very common in Austen novels (being mainly books about women), but somewhat unexpected after the noun person, although quite possible when an adverbial phrase is being used, such as In person here. The failure to predict inferior after was shows problems of data sparseness already starting to crop up.

When the trigram model works, it can work brilliantly. For example, it gives us a probability estimate of 0.5 for was following person she. But in general it is not usable. Either the preceding bigram was never seen before, and then there is no probability distribution for the following word, or a few words have been seen following that bigram, but the data is so sparse that the resulting estimates are highly unreliable. For example, the bigram to both was seen 9 times in the training text, twice followed by to, and once each followed by 7 other words, a few of which are shown in the table. This is not the kind of density of data on which one can sensibly build a probabilistic model. The four-gram model is entirely useless. In general, four-gram models do not become usable until one is training on several tens of millions of words of data.

Examining the table suggests an obvious strategy: use higher order n-gram models when one has seen enough data for them to be of some use, but back off to lower order n-gram models when there isn't enough data. This is a widely used strategy, which we will discuss below in the section on combining estimates, but it isn't by itself a complete solution to the problem of n-gram estimates. For instance, we saw quite a lot of words following was in the training data (9409 tokens of 1481 types) but inferior was not one of them. Similarly, although we had seen quite

a lot of words in our training text overall, there are many words that did not appear, including perfectly ordinary words like decides or wart. So regardless of how we combine estimates, we still definitely need a way to give a non-zero probability estimate to words or n-grams that we happened not to see in our training text, and so we will work on that problem first.

Laplace's law, Lidstone's law and the Jeffreys-Perks law

Laplace's law

The manifest failure of maximum likelihood estimation forces us to examine better estimators. The oldest solution is to employ Laplace's law (1814; 1995). According to this law,

(6.6)  P_Lap(w_1 ... w_n) = (C(w_1 ... w_n) + 1) / (N + B)

This process is often informally referred to as adding one, and has the effect of giving a little bit of the probability space to unseen events. But rather than simply being an unprincipled move, this is actually the Bayesian estimator that one derives if one assumes a uniform prior on events (i.e., that every n-gram was equally likely).

However, note that the estimates which Laplace's law gives are dependent on the size of the vocabulary. For sparse sets of data over large vocabularies, such as n-grams, Laplace's law actually gives far too much of the probability space to unseen events. Consider some data discussed by Church and Gale (1991a) in the context of their discussion of various estimators for bigrams. Their corpus of 44 million words of Associated Press (AP) newswire yielded a vocabulary of 400,653 words (maintaining case distinctions, splitting on hyphens, etc.). Note that this vocabulary size means that there is a space of 400,653^2, or about 1.6 x 10^11, possible bigrams, and so a priori barely any of them will actually occur in the corpus. It also means that in the calculation of P_Lap, B is far larger than N, and Laplace's method is completely unsatisfactory in such circumstances.

Church and Gale used half the corpus (22 million words) as a training text. Table 6.4 shows the expected frequency estimates of various methods that they discuss, and Laplace's law estimates that we have calculated. Probability estimates can be derived by dividing the frequency estimates by the number of n-grams, N = 22 million. For Laplace's law, the probability estimate for an n-gram seen r times is

[Table 6.4 appeared here; its numeric entries are not preserved in this transcription. Its columns were: r = f_MLE, f_empirical, f_Lap, f_del, f_GT, N_r, T_r.]

Table 6.4: Estimated frequencies for the AP data from Church and Gale (1991a). The first five columns show the estimated frequency calculated for a bigram that actually appeared r times in the training data according to different estimators: r is the maximum likelihood estimate, f_empirical uses validation on the test set, f_Lap is the add-one method, f_del is deleted interpolation (two-way cross-validation, using the training data), and f_GT is the Good-Turing estimate. The last two columns give the frequencies of frequencies and how often bigrams of a certain frequency occurred in further text.

(r+1)/(N+B), so the frequency estimate becomes f_Lap = (r+1)N/(N+B). These estimated frequencies are often easier for humans to interpret than probabilities, as one can more easily see the effect of the discounting. Although each previously unseen bigram has been given a very low probability, because there are so many of them, 46.5% of the probability space has actually been given to unseen bigrams. This is far too much, and it is done at the cost of enormously reducing the probability estimates of more frequent events. How do we know it is far too much? The second column of the table shows an empirically determined estimate (which we discuss below) of how often unseen n-grams actually appeared in further text, and we see that the individual frequency of occurrence of previously unseen n-grams is much lower than Laplace's law predicts, while the frequency of occurrence of previously seen n-grams is much higher than predicted. In particular, the empirical model finds that only 9.2% of the bigrams in further text were previously unseen.

7. This is calculated as N_0 x P_Lap(.) = 74,671,100,000 x 0.000137/22,000,000 = 0.465.

8. It is a bit hard dealing with the astronomical numbers in the table. A smaller example which illustrates the same point appears in exercise 6.2.
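The arithmetic behind the 46.5% figure in footnote 7 is easy to reproduce. The sketch below assumes an add-one frequency estimate of 0.000137 for an unseen bigram (the value implied by the figures quoted there):

    N = 22_000_000             # bigram tokens in the training half of the AP corpus
    N0 = 74_671_100_000        # bigrams with training frequency r = 0 (footnote 7)
    f_lap_unseen = 0.000137    # add-one frequency estimate for an unseen bigram

    # The probability estimate for one unseen bigram is its frequency estimate
    # divided by N; multiplying by the number of unseen bigrams gives the total
    # probability mass they receive under add-one smoothing.
    unseen_mass = N0 * f_lap_unseen / N
    print(f"{unseen_mass:.3f}")   # 0.465, i.e. 46.5% of the probability space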

Lidstone's law and the Jeffreys-Perks law

Because of this overestimation, a commonly adopted solution to the problem of multinomial estimation within statistical practice is Lidstone's law of succession, where we add not one, but some (normally smaller) positive value λ:

(6.7)  P_Lid(w_1 ... w_n) = (C(w_1 ... w_n) + λ) / (N + Bλ)

This method was developed by the actuaries Hardy and Lidstone, and Johnson showed that it can be viewed as a linear interpolation (see below) between the MLE estimate and a uniform prior. This may be seen by setting µ = N/(N + Bλ):

(6.8)  P_Lid(w_1 ... w_n) = µ C(w_1 ... w_n)/N + (1 - µ) 1/B

The most widely used value for λ is 1/2. This choice can be theoretically justified as being the expectation of the same quantity which is maximized by MLE, and so it has its own names, the Jeffreys-Perks law, or Expected Likelihood Estimation (ELE) (Box and Tiao 1973: 34-36).

In practice, this often helps. For example, we could avoid the objection above that too much of the probability space was being given to unseen events by choosing a small λ. But there are two remaining objections: (i) we need a good way to guess an appropriate value for λ in advance, and (ii) discounting using Lidstone's law always gives probability estimates linear in the MLE frequency, and this is not a good match to the empirical distribution at low frequencies.

Applying these methods to Austen

Despite the problems inherent in these methods, we will nevertheless try applying them, in particular ELE, to our Austen corpus. Recall that up until now the only probability estimate we have been able to derive for the test corpus clause she was inferior to both sisters was the unigram estimate, which (multiplying through the bold probabilities in the top part of table 6.3) gives an estimate for the probability of the clause. For the other models, the probability estimate was either zero or undefined, because of the sparseness of the data. Let us now calculate a probability estimate for this clause using a bigram model and ELE.

[Table 6.5 appeared here; its numeric entries are not preserved in this transcription. It ranked the words following was (1: not, 2: a, 3: the, 4: to, ..., =1482: inferior) with their MLE and ELE estimates.]

Table 6.5: Expected Likelihood Estimation estimates for the word following was.

Following the word was, which appeared 9409 times, not appeared 608 times in the training corpus, which overall contained 14,585 word types. So our new estimate for P(not | was) is (608 + 0.5)/(9409 + 14,585 x 0.5), or about 0.036. The estimate for P(not | was) has thus been discounted (by almost half!). If we do similar calculations for the other words, then we get the results shown in the last column of table 6.5. The ordering of most likely words is naturally unchanged, but the probability estimates of words that did appear in the training text are discounted, while non-occurring words, in particular the actual next word, inferior, are given a non-zero probability of occurrence. Continuing in this way to also estimate the other bigram probabilities, we find that this language model gives an overall probability estimate for the clause. Unfortunately, this probability estimate is actually lower than the MLE estimate based on unigram counts, reflecting how greatly all the MLE probability estimates for seen n-grams are discounted in the construction of the ELE model. This result substantiates the slogan used in the titles of (Gale and Church 1990a,b): poor estimates of context are worse than none.

Note, however, that this does not mean that the model that we have constructed is entirely useless. Although the probability estimates it gives are extremely low, one can nevertheless use them to rank alternatives. For example, the model does correctly tell us that she was inferior to both sisters is a much more likely clause in English than inferior to was both she sisters, whereas the unigram estimate gives them both the same probability.
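A small sketch of the calculation just performed, smoothing the conditional distribution for a history directly with B equal to the vocabulary size, as in the P(not | was) computation above (function and variable names are illustrative):

    from collections import Counter

    def p_lidstone(bigram_counts, unigram_counts, history, word, V, lam=0.5):
        """Lidstone estimate of P(word | history) for a bigram model:
        (C(history, word) + lambda) / (C(history) + V * lambda).
        lam = 1 gives Laplace's add-one law; lam = 0.5 gives ELE (Jeffreys-Perks)."""
        return (bigram_counts[(history, word)] + lam) / (unigram_counts[history] + V * lam)

    # The Austen numbers from the text: 'was' occurred 9409 times,
    # 'was not' 608 times, and the vocabulary has 14,585 word types.
    bigram = Counter({("was", "not"): 608})
    unigram = Counter({"was": 9409})
    V = 14_585

    print(p_lidstone(bigram, unigram, "was", "not", V))       # ~0.036 (ELE)
    print(bigram[("was", "not")] / unigram["was"])            # ~0.065 (MLE, for comparison)
    print(p_lidstone(bigram, unigram, "was", "inferior", V))  # small but non-zero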

Held out estimation

How do we know that giving 46.5% of the probability space to unseen events is too much? One way that we can test this is empirically. We can take further text (assumed to be from the same source) and see how often bigrams that appeared r times in the training text tend to turn up in the further text. The realization of this idea is the held out estimator of Jelinek and Mercer (1985).

The held out estimator

For each n-gram, w_1 ... w_n, let:

C_1(w_1 ... w_n) = frequency of w_1 ... w_n in training data
C_2(w_1 ... w_n) = frequency of w_1 ... w_n in held out data

and recall that N_r is the number of bigrams with frequency r (in the training text). Now let:

(6.9)  T_r = Σ_{w_1 ... w_n : C_1(w_1 ... w_n) = r} C_2(w_1 ... w_n)

That is, T_r is the total number of times that all n-grams that appeared r times in the training text appeared in the held out data. Then the average frequency of those n-grams is T_r/N_r, and so an estimate for the probability of one of these n-grams is:

(6.10)  P_ho(w_1 ... w_n) = T_r/(N_r N)   where C(w_1 ... w_n) = r

Pots of data for developing and testing models

A cardinal sin in Statistical NLP is to test on your training data. But why is that? The idea of testing is to assess how well a particular model works. That can only be done if it is a fair test on data that has not been seen before. In general, models induced from a sample of data have a tendency to be overtrained, that is, to expect future events to be like the events on which the model was trained, rather than allowing sufficiently for other possibilities. (For instance, stock market models sometimes suffer from this failing.) So it is essential to test on different data. A particular case of this is for the calculation of cross entropy (section 2.2.6). To calculate cross entropy, we take a large sample of text and calculate the per-word entropy of that text according to our model. This gives us a measure of the quality of our model, and an upper bound for the entropy of the language that the text was drawn from in general. But all that is only true if the test data is independent of the training data, and large enough

to be indicative of the complexity of the language at hand. If we test on the training data, the cross entropy can easily be lower than the real entropy of the text. In the most blatant case we could build a model that has memorized the training text and always predicts the next word with probability 1. Even if we don't do that, we will find that MLE is an excellent language model if you are testing on training data, which is not the right result.

So when starting to work with some data, one should always separate it immediately into a training portion and a testing portion. The test data is normally only a small percentage (5-10%) of the total data, but has to be sufficient for the results to be reliable. You should always eyeball the training data; you want to use your human pattern-finding abilities to get hints on how to proceed. You shouldn't eyeball the test data; that's cheating, even if less directly than getting your program to memorize it.

Commonly, however, one wants to divide both the training and test data into two again, for different reasons. For many Statistical NLP methods, such as held out estimation of n-grams, one gathers counts from one lot of training data, and then one smooths these counts or estimates certain other parameters of the assumed model based on what turns up in further held out or validation data. The held out data needs to be independent of both the primary training data and the test data. Normally the stage using the held out data involves the estimation of many fewer parameters than are estimated from counts over the primary training data, and so it is appropriate for the held out data to be much smaller than the primary training data (commonly about 10% of the size). Nevertheless, it is important that there is sufficient data for any additional parameters of the model to be accurately estimated, or significant performance losses can occur (as Chen and Goodman (1996: 317) show).

A typical pattern in Statistical NLP research is to write an algorithm, train it, and test it, note some things that it does wrong, revise it and then to repeat the process (often many times!). But, if one does that a lot, not only does one tend to end up seeing aspects of the test set, but just repeatedly trying out different variant algorithms and looking at their performance can be viewed as subtly probing the contents of the test set. This means that testing a succession of variant models can again lead to overtraining. So the right approach is to have two test sets: a development test set on which successive variant methods are trialed and a final test set which is used to produce the final results that are published about the performance of the algorithm. One should expect performance on

the final test set to be slightly lower than on the development test set (though sometimes one can be lucky).

The discussion so far leaves open exactly how to choose which parts of the data are to be used as testing data. Actually here opinion divides into two schools. One school favors selecting bits (sentences or even n-grams) randomly from throughout the data for the test set and using the rest of the material for training. The advantage of this method is that the testing data is as similar as possible (with respect to genre, register, writer, and vocabulary) to the training data. That is, one is training from as accurate a sample as possible of the type of language in the test data. The other possibility is to set aside large contiguous chunks as test data. The advantage of this is the opposite: in practice, one will end up using any NLP system on data that varies a little from the training data, as language use changes a little in topic and structure with the passage of time. Therefore, some people think it best to simulate that a little by choosing test data that perhaps isn't quite stationary with respect to the training data. At any rate, if using held out estimation of parameters, it is best to choose the same strategy for setting aside data for held out data as for test data, as this makes the held out data a better simulation of the test data. This choice is one of the many reasons why system results can be hard to compare: all else being equal, one should expect slightly worse performance results if using the second approach.

While covering testing, let us mention one other issue. In early work, it was common to just run the system on the test data and present a single performance figure (for perplexity, percent correct or whatever). But this isn't a very good way of testing, as it gives no idea of the variance in the performance of the system. A much better way is to divide the test data into, say, 20 smaller samples, and work out a test result on each of them. From those results, one can work out a mean performance figure, as before, but one can also calculate the variance that shows how much performance tends to vary. If using this method together with continuous chunks of training data, it is probably best to take the smaller testing samples from different regions of the data, since the testing lore tends to be full of stories about certain sections of data sets being easy, and so it is better to have used a range of test data from different sections of the corpus.

If we proceed this way, then one system can score higher on average than another purely by accident, especially when within-system variance is high. So just comparing average scores is not enough for meaningful

system comparison. Instead, we need to apply a statistical test that takes into account both mean and variance. Only if the statistical test rejects the possibility of an accidental difference can we say with confidence that one system is better than the other.

[Table 6.6 appeared here; its numeric entries are not fully preserved in this transcription. For System 1 and System 2 it listed the individual scores, the total, n, the mean, the sum of squares Σ(x_j - x̄)^2, the degrees of freedom, the pooled s^2, and t = (x̄_1 - x̄_2)/sqrt(2s^2/n).]

Table 6.6: Using the t test for comparing the performance of two systems. Since we calculate the mean for each data set, the denominator in the calculation of variance and the number of degrees of freedom is (11 - 1) + (11 - 1) = 20. The data do not provide clear support for the superiority of system 1. Despite the clear difference in mean scores, the sample variance is too high to draw any definitive conclusions.

An example of using the t test (which we introduced in section 5.3.1) for comparing the performance of two systems is shown in table 6.6 (adapted from (Snedecor and Cochran 1989: 92)). Note that we use a pooled estimate of the sample variance s^2 here, under the assumption that the variance of the two systems is the same (which seems a reasonable assumption here: 609 and 526 are close enough). Looking up the t distribution in the appendix, we find that, for rejecting the hypothesis that system 1 is better than system 2 at a probability level of α = 0.05, the critical value is t = 1.725 (using a one-tailed test with 20 degrees of freedom). Since we have t = 1.56 < 1.725, the data fail the significance test. Although the averages are fairly distinct, we cannot conclude superiority of system 1 here because of the large variance of scores.

9. Systematic discussion of testing methodology for comparing statistical and machine learning algorithms can be found in (Dietterich 1998). A good case study, for the example of word sense disambiguation, is (Mooney 1996).
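The t statistic in table 6.6 is straightforward to compute directly from the formula t = (x̄_1 - x̄_2)/sqrt(2s^2/n) with a pooled variance estimate. The sketch below uses made-up scores, since the table's own values are not preserved here:

    import math

    def pooled_t(xs, ys):
        """Two-sample t statistic with a pooled variance estimate, as in table 6.6,
        assuming equal-sized samples and equal variances."""
        n = len(xs)
        assert len(ys) == n
        mx, my = sum(xs) / n, sum(ys) / n
        ss_x = sum((x - mx) ** 2 for x in xs)
        ss_y = sum((y - my) ** 2 for y in ys)
        s2 = (ss_x + ss_y) / (2 * (n - 1))   # pooled sample variance, df = 2(n-1)
        return (mx - my) / math.sqrt(2 * s2 / n)

    # Hypothetical accuracy scores for two systems on 11 test samples each
    # (made-up numbers, not the ones in table 6.6):
    sys1 = [71, 61, 55, 60, 68, 49, 58, 72, 76, 55, 64]
    sys2 = [42, 55, 75, 45, 54, 51, 55, 36, 58, 55, 67]
    t = pooled_t(sys1, sys2)
    print(round(t, 2))   # compare against the one-tailed critical value 1.725 for df = 20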

Using held out estimation on the test data

So long as the frequency of an n-gram C(w_1 ... w_n) is the only thing that we are using to predict its future frequency in text, then we can use held out estimation performed on the test set to provide the correct answer of what the discounted estimates of probabilities should be in order to maximize the probability of the test set data. Doing this empirically measures how often n-grams that were seen r times in the training data actually do occur in the test text. The empirical estimates f_empirical in table 6.4 were found by randomly dividing the 44 million bigrams in the whole AP corpus into equal-sized training and test sets, counting frequencies in the 22 million word training set and then doing held out estimation using the test set. Whereas other estimates are calculated only from the 22 million words of training data, this estimate can be regarded as an empirically determined gold standard, achieved by allowing access to the test data.

Cross-validation (deleted estimation)

The f_empirical estimates discussed immediately above were constructed by looking at what actually happened in the test data. But the idea of held out estimation is that we can achieve the same effect by dividing the training data into two parts. We build initial estimates by doing counts on one part, and then we use the other pool of held out data to refine those estimates. The only cost of this approach is that our initial training data is now less, and so our probability estimates will be less reliable. Rather than using some of the training data only for frequency counts and some only for smoothing probability estimates, more efficient schemes are possible where each part of the training data is used both as initial training data and as held out data. In general, such methods in statistics go under the name cross-validation.

Jelinek and Mercer (1985) use a form of two-way cross-validation that they call deleted estimation. Suppose we let N_r^a be the number of n-grams occurring r times in the a-th part of the training data, and T_r^{ab} be the total occurrences of those bigrams from part a in the b-th part. Now depending on which part is viewed as the basic training data, standard held out estimates would be either:

P_ho(w_1 ... w_n) = T_r^{01}/(N_r^0 N)   or   T_r^{10}/(N_r^1 N)   where C(w_1 ... w_n) = r

The more efficient deleted interpolation estimate does counts and smoothing on both halves and then averages the two:

(6.11)  P_del(w_1 ... w_n) = (T_r^{01} + T_r^{10}) / (N(N_r^0 + N_r^1))   where C(w_1 ... w_n) = r

On large training corpora, doing deleted estimation on the training data works better than doing held-out estimation using just the training data, and indeed table 6.4 shows that it produces results that are quite close to the empirical gold standard. It is nevertheless still some way off for low frequency events. It overestimates the expected frequency of unseen objects, while underestimating the expected frequency of objects that were seen once in the training data. By dividing the text into two parts like this, one estimates the probability of an object by how many times it was seen in a sample of size N/2, assuming that the probability of a token seen r times in a sample of size N/2 is double that of a token seen r times in a sample of size N. However, it is generally true that as the size of the training corpus increases, the percentage of unseen n-grams that one encounters in held out data, and hence one's probability estimate for unseen n-grams, decreases (while never becoming negligible). It is for this reason that collecting counts on a smaller training corpus has the effect of overestimating the probability of unseen n-grams.

There are other ways of doing cross-validation. In particular, Ney et al. (1997) explore a method that they call Leaving-One-Out, where the primary training corpus is of size N-1 tokens, while 1 token is used as held out data for a sort of simulated testing. This process is repeated N times so that each piece of data is left out in turn. The advantage of this training regime is that it explores the effect of how the model changes if any particular piece of data had not been observed, and Ney et al. show strong connections between the resulting formulas and the widely-used Good-Turing method to which we turn next.

10. Remember that, although the empirical gold standard was derived by held out estimation, it was held out estimation based on looking at the test data! Chen and Goodman (1998) find in their study that for smaller training corpora, held out estimation outperforms deleted estimation.

11. However, Chen and Goodman (1996: 314) suggest that leaving one word out at a time is problematic, and that using larger deleted chunks in deleted interpolation is to be preferred.
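Equations (6.10) and (6.11) translate almost line for line into code. The sketch below is illustrative (function names are ours, and the r = 0 bins, which require knowing the total number of possible n-grams B, are not handled):

    from collections import Counter

    def held_out_probs(train_counts, heldout_counts, N):
        """Held out estimator, equation (6.10): for each training frequency r,
        P_ho = T_r / (N_r * N), where T_r totals the held out occurrences of
        n-grams seen r times in training, and N_r is the number of such n-grams."""
        N_r = Counter(train_counts.values())
        T_r = Counter()
        for ngram, r in train_counts.items():
            T_r[r] += heldout_counts[ngram]
        return {r: T_r[r] / (N_r[r] * N) for r in N_r}

    def deleted_estimation_probs(counts0, counts1, N):
        """Deleted estimation, equation (6.11): combine the two held out estimates
        obtained by treating each half of the training data as held out data for
        the other: P_del = (T_r^01 + T_r^10) / (N * (N_r^0 + N_r^1))."""
        N_r0, N_r1 = Counter(counts0.values()), Counter(counts1.values())
        T_r01, T_r10 = Counter(), Counter()
        for ngram, r in counts0.items():
            T_r01[r] += counts1[ngram]
        for ngram, r in counts1.items():
            T_r10[r] += counts0[ngram]
        return {r: (T_r01[r] + T_r10[r]) / (N * (N_r0[r] + N_r1[r]))
                for r in set(N_r0) | set(N_r1)}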

Good-Turing estimation

The Good-Turing estimator

Good (1953) attributes to Turing a method for determining frequency or probability estimates of items, on the assumption that their distribution is binomial. This method is suitable for large numbers of observations of data drawn from a large vocabulary, and works well for n-grams, despite the fact that words and n-grams do not have a binomial distribution. The probability estimate in Good-Turing estimation is of the form P_GT = r*/N where r* can be thought of as an adjusted frequency. The theorem underlying Good-Turing methods gives that for previously observed items:

(6.12)  r* = (r + 1) E(N_{r+1}) / E(N_r)

where E denotes the expectation of a random variable (see (Church and Gale 1991a; Gale and Sampson 1995) for discussion of the derivation of this formula). The total probability mass reserved for unseen objects is then E(N_1)/N (see exercise 6.5).

Using our empirical estimates, we can hope to substitute the observed N_r for E(N_r). However, we cannot do this uniformly, since these empirical estimates will be very unreliable for high values of r. In particular, the most frequent n-gram would be estimated to have probability zero, since the number of n-grams with frequency one greater than it is zero! In practice, one of two solutions is employed. One is to use Good-Turing reestimation only for frequencies r < k for some constant k (e.g., 10). Low frequency words are numerous, so substitution of the observed frequency of frequencies for the expectation is quite accurate, while the MLE estimates of high frequency words will also be quite accurate and so one doesn't need to discount them. The other is to fit some function S through the observed values of (r, N_r) and to use the smoothed values S(r) for the expectation (this leads to a family of possibilities depending on exactly which method of curve fitting is employed; Good (1953) discusses several smoothing methods). The probability mass N_1/N given to unseen items can either be divided among them uniformly, or by some more sophisticated method (see under Combining Estimators, below). So using this method with a uniform estimate for unseen events, we have:

Good-Turing Estimator: If C(w_1 ... w_n) = r > 0,

(6.13)  P_GT(w_1 ... w_n) = r*/N   where r* = (r + 1) S(r + 1) / S(r)

If C(w_1 ... w_n) = 0,

(6.14)  P_GT(w_1 ... w_n) = (1 - Σ_{r=1} N_r r*/N) / N_0 ≈ N_1/(N_0 N)

Gale and Sampson (1995) present a simple and effective approach, Simple Good-Turing, which effectively combines these two approaches. As a smoothing curve they simply use a power curve N_r = a r^b (with b < -1 to give the appropriate hyperbolic relationship), and estimate a and b by simple linear regression on the logarithmic form of this equation, log N_r = a + b log r (linear regression is covered in section , or in all introductory statistics books). However, they suggest that such a simple curve is probably only appropriate for high values of r. For low values of r, they use the measured N_r directly. Working up through frequencies, these direct estimates are used until for one of them there isn't a significant difference between r* values calculated directly or via the smoothing function, and then smoothed estimates are used for all higher frequencies. Simple Good-Turing can give exceedingly good estimators, as can be seen by comparing the Good-Turing column f_GT in table 6.4 with the empirical gold standard.

Under any of these approaches, it is necessary to renormalize all the estimates to ensure that a proper probability distribution results. This can be done either by adjusting the amount of probability mass given to unseen items (as in equation (6.14)), or, perhaps better, by keeping the estimate of the probability mass for unseen items as N_1/N and renormalizing all the estimates for previously seen items (as Gale and Sampson (1995) propose).

Frequencies of frequencies in Austen

To do Good-Turing, the first step is to calculate the frequencies of different frequencies (also known as count-counts). Table 6.7 shows extracts from the resulting list of frequencies of frequencies for bigrams and trigrams. (The numbers are reminiscent of the Zipfian distributions of

12. An estimate of r* is deemed significantly different if the difference exceeds 1.65 times the standard deviation of the Good-Turing estimate, which is given by:

sqrt( (r + 1)^2 (N_{r+1} / N_r^2) (1 + N_{r+1} / N_r) )
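A sketch of the basic Good-Turing reestimation described above, using raw counts of counts for low frequencies and MLE counts above a cutoff k. This is simpler than Gale and Sampson's Simple Good-Turing (no smoothing curve, no significance-based switching), and as noted above the resulting estimates for seen n-grams would still need renormalizing:

    from collections import Counter

    def good_turing(counts, k=10):
        """Basic Good-Turing reestimation: adjusted counts r* = (r+1) N_{r+1} / N_r
        are used for frequencies r < k, raw counts for r >= k, and the leftover
        mass N_1/N is reserved for unseen n-grams (equation (6.14), approximately)."""
        N = sum(counts.values())
        N_r = Counter(counts.values())
        probs = {}
        for ngram, r in counts.items():
            if r < k and N_r[r + 1] > 0:
                r_star = (r + 1) * N_r[r + 1] / N_r[r]
            else:
                r_star = r                  # high frequencies: keep the MLE count
            probs[ngram] = r_star / N       # not yet renormalized
        p_unseen_total = N_r[1] / N         # total mass reserved for unseen n-grams
        return probs, p_unseen_total

With the bigram counts computed earlier, good_turing(bigrams) returns discounted probabilities for the seen bigrams together with the fraction of probability mass, N_1/N, to be shared among unseen ones.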


More information

Evaluation for sets of classes

Evaluation for sets of classes Evaluaton for Tet Categorzaton Classfcaton accuracy: usual n ML, the proporton of correct decsons, Not approprate f the populaton rate of the class s low Precson, Recall and F 1 Better measures 21 Evaluaton

More information

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1 Random varables Measure of central tendences and varablty (means and varances) Jont densty functons and ndependence Measures of assocaton (covarance and correlaton) Interestng result Condtonal dstrbutons

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

Section 8.3 Polar Form of Complex Numbers

Section 8.3 Polar Form of Complex Numbers 80 Chapter 8 Secton 8 Polar Form of Complex Numbers From prevous classes, you may have encountered magnary numbers the square roots of negatve numbers and, more generally, complex numbers whch are the

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

Chapter Newton s Method

Chapter Newton s Method Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve

More information

Maxent Models & Deep Learning

Maxent Models & Deep Learning Maxent Models & Deep Learnng 1. Last bts of maxent (sequence) models 1.MEMMs vs. CRFs 2.Smoothng/regularzaton n maxent models 2. Deep Learnng 1. What s t? Why s t good? (Part 1) 2. From logstc regresson

More information

Chapter 5 Multilevel Models

Chapter 5 Multilevel Models Chapter 5 Multlevel Models 5.1 Cross-sectonal multlevel models 5.1.1 Two-level models 5.1.2 Multple level models 5.1.3 Multple level modelng n other felds 5.2 Longtudnal multlevel models 5.2.1 Two-level

More information

Lecture 4. Instructor: Haipeng Luo

Lecture 4. Instructor: Haipeng Luo Lecture 4 Instructor: Hapeng Luo In the followng lectures, we focus on the expert problem and study more adaptve algorthms. Although Hedge s proven to be worst-case optmal, one may wonder how well t would

More information

Retrieval Models: Language models

Retrieval Models: Language models CS-590I Informaton Retreval Retreval Models: Language models Luo S Department of Computer Scence Purdue Unversty Introducton to language model Ungram language model Document language model estmaton Maxmum

More information

Lecture 2: Prelude to the big shrink

Lecture 2: Prelude to the big shrink Lecture 2: Prelude to the bg shrnk Last tme A slght detour wth vsualzaton tools (hey, t was the frst day... why not start out wth somethng pretty to look at?) Then, we consdered a smple 120a-style regresson

More information

Lecture 7: Boltzmann distribution & Thermodynamics of mixing

Lecture 7: Boltzmann distribution & Thermodynamics of mixing Prof. Tbbtt Lecture 7 etworks & Gels Lecture 7: Boltzmann dstrbuton & Thermodynamcs of mxng 1 Suggested readng Prof. Mark W. Tbbtt ETH Zürch 13 März 018 Molecular Drvng Forces Dll and Bromberg: Chapters

More information

Statistics Chapter 4

Statistics Chapter 4 Statstcs Chapter 4 "There are three knds of les: les, damned les, and statstcs." Benjamn Dsrael, 1895 (Brtsh statesman) Gaussan Dstrbuton, 4-1 If a measurement s repeated many tmes a statstcal treatment

More information

1 GSW Iterative Techniques for y = Ax

1 GSW Iterative Techniques for y = Ax 1 for y = A I m gong to cheat here. here are a lot of teratve technques that can be used to solve the general case of a set of smultaneous equatons (wrtten n the matr form as y = A), but ths chapter sn

More information

Chapter 6. Supplemental Text Material

Chapter 6. Supplemental Text Material Chapter 6. Supplemental Text Materal S6-. actor Effect Estmates are Least Squares Estmates We have gven heurstc or ntutve explanatons of how the estmates of the factor effects are obtaned n the textboo.

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

Department of Quantitative Methods & Information Systems. Time Series and Their Components QMIS 320. Chapter 6

Department of Quantitative Methods & Information Systems. Time Series and Their Components QMIS 320. Chapter 6 Department of Quanttatve Methods & Informaton Systems Tme Seres and Ther Components QMIS 30 Chapter 6 Fall 00 Dr. Mohammad Zanal These sldes were modfed from ther orgnal source for educatonal purpose only.

More information

Lecture 3 Stat102, Spring 2007

Lecture 3 Stat102, Spring 2007 Lecture 3 Stat0, Sprng 007 Chapter 3. 3.: Introducton to regresson analyss Lnear regresson as a descrptve technque The least-squares equatons Chapter 3.3 Samplng dstrbuton of b 0, b. Contnued n net lecture

More information

28. SIMPLE LINEAR REGRESSION III

28. SIMPLE LINEAR REGRESSION III 8. SIMPLE LINEAR REGRESSION III Ftted Values and Resduals US Domestc Beers: Calores vs. % Alcohol To each observed x, there corresponds a y-value on the ftted lne, y ˆ = βˆ + βˆ x. The are called ftted

More information

Economics 130. Lecture 4 Simple Linear Regression Continued

Economics 130. Lecture 4 Simple Linear Regression Continued Economcs 130 Lecture 4 Contnued Readngs for Week 4 Text, Chapter and 3. We contnue wth addressng our second ssue + add n how we evaluate these relatonshps: Where do we get data to do ths analyss? How do

More information

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests Smulated of the Cramér-von Mses Goodness-of-Ft Tests Steele, M., Chaselng, J. and 3 Hurst, C. School of Mathematcal and Physcal Scences, James Cook Unversty, Australan School of Envronmental Studes, Grffth

More information

Lecture 6 More on Complete Randomized Block Design (RBD)

Lecture 6 More on Complete Randomized Block Design (RBD) Lecture 6 More on Complete Randomzed Block Desgn (RBD) Multple test Multple test The multple comparsons or multple testng problem occurs when one consders a set of statstcal nferences smultaneously. For

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

STAT 3008 Applied Regression Analysis

STAT 3008 Applied Regression Analysis STAT 3008 Appled Regresson Analyss Tutoral : Smple Lnear Regresson LAI Chun He Department of Statstcs, The Chnese Unversty of Hong Kong 1 Model Assumpton To quantfy the relatonshp between two factors,

More information

Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed

More information

Lecture 12: Classification

Lecture 12: Classification Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna

More information

Joint Statistical Meetings - Biopharmaceutical Section

Joint Statistical Meetings - Biopharmaceutical Section Iteratve Ch-Square Test for Equvalence of Multple Treatment Groups Te-Hua Ng*, U.S. Food and Drug Admnstraton 1401 Rockvlle Pke, #200S, HFM-217, Rockvlle, MD 20852-1448 Key Words: Equvalence Testng; Actve

More information

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law: CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

4.3 Poisson Regression

4.3 Poisson Regression of teratvely reweghted least squares regressons (the IRLS algorthm). We do wthout gvng further detals, but nstead focus on the practcal applcaton. > glm(survval~log(weght)+age, famly="bnomal", data=baby)

More information

Uncertainty as the Overlap of Alternate Conditional Distributions

Uncertainty as the Overlap of Alternate Conditional Distributions Uncertanty as the Overlap of Alternate Condtonal Dstrbutons Olena Babak and Clayton V. Deutsch Centre for Computatonal Geostatstcs Department of Cvl & Envronmental Engneerng Unversty of Alberta An mportant

More information

CSC 411 / CSC D11 / CSC C11

CSC 411 / CSC D11 / CSC C11 18 Boostng s a general strategy for learnng classfers by combnng smpler ones. The dea of boostng s to take a weak classfer that s, any classfer that wll do at least slghtly better than chance and use t

More information

UCLA STAT 13 Introduction to Statistical Methods for the Life and Health Sciences. Chapter 11 Analysis of Variance - ANOVA. Instructor: Ivo Dinov,

UCLA STAT 13 Introduction to Statistical Methods for the Life and Health Sciences. Chapter 11 Analysis of Variance - ANOVA. Instructor: Ivo Dinov, UCLA STAT 3 ntroducton to Statstcal Methods for the Lfe and Health Scences nstructor: vo Dnov, Asst. Prof. of Statstcs and Neurology Chapter Analyss of Varance - ANOVA Teachng Assstants: Fred Phoa, Anwer

More information

Lecture Notes on Linear Regression

Lecture Notes on Linear Regression Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

Statistics II Final Exam 26/6/18

Statistics II Final Exam 26/6/18 Statstcs II Fnal Exam 26/6/18 Academc Year 2017/18 Solutons Exam duraton: 2 h 30 mn 1. (3 ponts) A town hall s conductng a study to determne the amount of leftover food produced by the restaurants n the

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

CHAPTER IV RESEARCH FINDING AND DISCUSSIONS

CHAPTER IV RESEARCH FINDING AND DISCUSSIONS CHAPTER IV RESEARCH FINDING AND DISCUSSIONS A. Descrpton of Research Fndng. The Implementaton of Learnng Havng ganed the whole needed data, the researcher then dd analyss whch refers to the statstcal data

More information

x yi In chapter 14, we want to perform inference (i.e. calculate confidence intervals and perform tests of significance) in this setting.

x yi In chapter 14, we want to perform inference (i.e. calculate confidence intervals and perform tests of significance) in this setting. The Practce of Statstcs, nd ed. Chapter 14 Inference for Regresson Introducton In chapter 3 we used a least-squares regresson lne (LSRL) to represent a lnear relatonshp etween two quanttatve explanator

More information

Learning from Data 1 Naive Bayes

Learning from Data 1 Naive Bayes Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why

More information

AS-Level Maths: Statistics 1 for Edexcel

AS-Level Maths: Statistics 1 for Edexcel 1 of 6 AS-Level Maths: Statstcs 1 for Edecel S1. Calculatng means and standard devatons Ths con ndcates the slde contans actvtes created n Flash. These actvtes are not edtable. For more detaled nstructons,

More information

THE SUMMATION NOTATION Ʃ

THE SUMMATION NOTATION Ʃ Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the

More information

Online Appendix to: Axiomatization and measurement of Quasi-hyperbolic Discounting

Online Appendix to: Axiomatization and measurement of Quasi-hyperbolic Discounting Onlne Appendx to: Axomatzaton and measurement of Quas-hyperbolc Dscountng José Lus Montel Olea Tomasz Strzaleck 1 Sample Selecton As dscussed before our ntal sample conssts of two groups of subjects. Group

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

Tracking with Kalman Filter

Tracking with Kalman Filter Trackng wth Kalman Flter Scott T. Acton Vrgna Image and Vdeo Analyss (VIVA), Charles L. Brown Department of Electrcal and Computer Engneerng Department of Bomedcal Engneerng Unversty of Vrgna, Charlottesvlle,

More information

Copyright 2017 by Taylor Enterprises, Inc., All Rights Reserved. Adjusted Control Limits for U Charts. Dr. Wayne A. Taylor

Copyright 2017 by Taylor Enterprises, Inc., All Rights Reserved. Adjusted Control Limits for U Charts. Dr. Wayne A. Taylor Taylor Enterprses, Inc. Adjusted Control Lmts for U Charts Copyrght 207 by Taylor Enterprses, Inc., All Rghts Reserved. Adjusted Control Lmts for U Charts Dr. Wayne A. Taylor Abstract: U charts are used

More information

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora prnceton unv. F 13 cos 521: Advanced Algorthm Desgn Lecture 3: Large devatons bounds and applcatons Lecturer: Sanjeev Arora Scrbe: Today s topc s devaton bounds: what s the probablty that a random varable

More information

18. SIMPLE LINEAR REGRESSION III

18. SIMPLE LINEAR REGRESSION III 8. SIMPLE LINEAR REGRESSION III US Domestc Beers: Calores vs. % Alcohol Ftted Values and Resduals To each observed x, there corresponds a y-value on the ftted lne, y ˆ ˆ = α + x. The are called ftted values.

More information

BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu

BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS M. Krshna Reddy, B. Naveen Kumar and Y. Ramu Department of Statstcs, Osmana Unversty, Hyderabad -500 007, Inda. nanbyrozu@gmal.com, ramu0@gmal.com

More information

CHAPTER 8. Exercise Solutions

CHAPTER 8. Exercise Solutions CHAPTER 8 Exercse Solutons 77 Chapter 8, Exercse Solutons, Prncples of Econometrcs, 3e 78 EXERCISE 8. When = N N N ( x x) ( x x) ( x x) = = = N = = = N N N ( x ) ( ) ( ) ( x x ) x x x x x = = = = Chapter

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

8.6 The Complex Number System

8.6 The Complex Number System 8.6 The Complex Number System Earler n the chapter, we mentoned that we cannot have a negatve under a square root, snce the square of any postve or negatve number s always postve. In ths secton we want

More information

Lecture 4 Hypothesis Testing

Lecture 4 Hypothesis Testing Lecture 4 Hypothess Testng We may wsh to test pror hypotheses about the coeffcents we estmate. We can use the estmates to test whether the data rejects our hypothess. An example mght be that we wsh to

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

Durban Watson for Testing the Lack-of-Fit of Polynomial Regression Models without Replications

Durban Watson for Testing the Lack-of-Fit of Polynomial Regression Models without Replications Durban Watson for Testng the Lack-of-Ft of Polynomal Regresson Models wthout Replcatons Ruba A. Alyaf, Maha A. Omar, Abdullah A. Al-Shha ralyaf@ksu.edu.sa, maomar@ksu.edu.sa, aalshha@ksu.edu.sa Department

More information

2016 Wiley. Study Session 2: Ethical and Professional Standards Application

2016 Wiley. Study Session 2: Ethical and Professional Standards Application 6 Wley Study Sesson : Ethcal and Professonal Standards Applcaton LESSON : CORRECTION ANALYSIS Readng 9: Correlaton and Regresson LOS 9a: Calculate and nterpret a sample covarance and a sample correlaton

More information

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k.

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k. THE CELLULAR METHOD In ths lecture, we ntroduce the cellular method as an approach to ncdence geometry theorems lke the Szemeréd-Trotter theorem. The method was ntroduced n the paper Combnatoral complexty

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

THE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens

THE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens THE CHINESE REMAINDER THEOREM KEITH CONRAD We should thank the Chnese for ther wonderful remander theorem. Glenn Stevens 1. Introducton The Chnese remander theorem says we can unquely solve any par of

More information

Gaussian Mixture Models

Gaussian Mixture Models Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous

More information

DETERMINATION OF UNCERTAINTY ASSOCIATED WITH QUANTIZATION ERRORS USING THE BAYESIAN APPROACH

DETERMINATION OF UNCERTAINTY ASSOCIATED WITH QUANTIZATION ERRORS USING THE BAYESIAN APPROACH Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata TC XVII IMEKO World Congress Metrology n the 3rd Mllennum June 7, 3,

More information

Midterm Examination. Regression and Forecasting Models

Midterm Examination. Regression and Forecasting Models IOMS Department Regresson and Forecastng Models Professor Wllam Greene Phone: 22.998.0876 Offce: KMC 7-90 Home page: people.stern.nyu.edu/wgreene Emal: wgreene@stern.nyu.edu Course web page: people.stern.nyu.edu/wgreene/regresson/outlne.htm

More information

Hashing. Alexandra Stefan

Hashing. Alexandra Stefan Hashng Alexandra Stefan 1 Hash tables Tables Drect access table (or key-ndex table): key => ndex Hash table: key => hash value => ndex Man components Hash functon Collson resoluton Dfferent keys mapped

More information

Exercises. 18 Algorithms

Exercises. 18 Algorithms 18 Algorthms Exercses 0.1. In each of the followng stuatons, ndcate whether f = O(g), or f = Ω(g), or both (n whch case f = Θ(g)). f(n) g(n) (a) n 100 n 200 (b) n 1/2 n 2/3 (c) 100n + log n n + (log n)

More information

Basic Business Statistics, 10/e

Basic Business Statistics, 10/e Chapter 13 13-1 Basc Busness Statstcs 11 th Edton Chapter 13 Smple Lnear Regresson Basc Busness Statstcs, 11e 009 Prentce-Hall, Inc. Chap 13-1 Learnng Objectves In ths chapter, you learn: How to use regresson

More information

ECONOMICS 351*-A Mid-Term Exam -- Fall Term 2000 Page 1 of 13 pages. QUEEN'S UNIVERSITY AT KINGSTON Department of Economics

ECONOMICS 351*-A Mid-Term Exam -- Fall Term 2000 Page 1 of 13 pages. QUEEN'S UNIVERSITY AT KINGSTON Department of Economics ECOOMICS 35*-A Md-Term Exam -- Fall Term 000 Page of 3 pages QUEE'S UIVERSITY AT KIGSTO Department of Economcs ECOOMICS 35* - Secton A Introductory Econometrcs Fall Term 000 MID-TERM EAM ASWERS MG Abbott

More information