This excerpt from Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schütze, The MIT Press, is provided in screen-viewable form for personal use only by members of MIT CogNet. Unauthorized use or dissemination of this information is expressly forbidden. If you have any questions about this material, please contact cognetadmin@cognet.mit.edu.

15 Topics in Information Retrieval

Information Retrieval (IR) research is concerned with developing algorithms and models for retrieving information from document repositories. IR might be regarded as a natural subfield of NLP because it deals with a particular application of natural language processing. (Traditional IR research deals with text, although retrieval of speech, images and video are becoming increasingly common.) But in actuality, interactions between the fields have been limited, partly because the special demands of IR were not seen as interesting problems in NLP, partly because statistical methods, the dominant approach in IR, were out of favor in NLP. With the resurgence of quantitative methods in NLP, the connections between the fields have increased. We have selected four examples of recent interaction between the fields: probabilistic models of term distribution in documents, a problem that has received attention in both Statistical NLP and IR; discourse segmentation, an NLP technique that has been used for more effective document retrieval; and the Vector Space Model and Latent Semantic Indexing (LSI), two IR techniques that have been used in Statistical NLP. Latent Semantic Indexing will also serve as an example of dimensionality reduction, an important statistical technique in itself. Our selection is quite subjective and we refer the reader to the IR literature for coverage of other topics (see section 15.6). In the following section, we give some basic background on IR, and then discuss the four topics in turn.

15.1 Some Background on Information Retrieval

The goal of IR research is to develop models and algorithms for retrieving information from document repositories, in particular, textual information. The classical problem in IR is the ad-hoc retrieval problem. In ad-hoc retrieval, the user enters a query describing the desired information. The system then returns a list of documents. There are two main models. Exact match systems return documents that precisely satisfy some structured query expression, of which the best known type is Boolean queries, which are still widely used in commercial information systems. But for large and heterogeneous document collections, the result sets of exact match systems usually are either empty or huge and unwieldy, and so most recent work has concentrated on systems which rank documents according to their estimated relevance to the query. It is within such an approach that probabilistic methods are useful, and so we restrict our attention to such systems henceforth. An example of ad-hoc retrieval is shown in figure 15.1. The query is "glass pyramid Pei Louvre", entered on the internet search engine AltaVista. The user is looking for web pages about I. M. Pei's glass pyramid over the Louvre entrance in Paris. The search engine returns several relevant pages, but also some non-relevant ones, a result that is typical for ad-hoc searches due to the difficulty of the problem. Some of the aspects of ad-hoc retrieval that are addressed in IR research are how users can improve the original formulation of a query interactively, by way of relevance feedback; how results from several text databases can be merged into one result list (database merging); which models are appropriate for partially corrupted data, for example, OCRed documents; and how the special problems that languages other than English pose can be addressed in IR.

Some subfields of information retrieval rely on a training corpus of documents that have been classified as either relevant or non-relevant to a particular query. In text categorization, one attempts to assign documents to two or more pre-defined categories. An example is the subject codes assigned by Reuters to its news stories (Lewis 1992). Codes like CORP-NEWS (corporate news), CRUDE (crude oil) or ACQ (acquisitions) make it easier for subscribers to find stories of interest to them. A financial analyst interested in acquisitions can request a customized newsfeed that only delivers documents tagged with ACQ. Filtering and routing are special cases of text categorization with only two categories: relevant and non-relevant to a particular query (or information need).

[Figure 15.1: Results of the search "glass pyramid Pei Louvre" on an internet search engine. The AltaVista result page reports word counts (glass pyramid: about 200; Pei: 9453; Louvre: 26578) and lists the ten best-matching pages: a mixture of relevant hits (e.g., a Paris photograph captioned "The Arc de Triomphe du Carrousel neatly frames I. M. Pei's glass pyramid") and non-relevant ones about French culture, travel, real estate, and other topics.]

In routing, the desired output is a ranking of documents according to estimated relevance, similar to the ranking shown in figure 15.1 for the ad-hoc problem. The difference between routing and ad-hoc is that training information in the form of relevance labels is available in routing, but not in ad-hoc retrieval. In filtering, an estimation of relevance has to be made for each document, typically in the form of a probability estimate. Filtering is harder than routing because an absolute ("Document d is relevant") rather than a relative assessment of relevance ("Document d1 is more relevant than d2") is required. In many practical applications, an absolute assessment of relevance for each individual document is necessary. For example, when a news group is filtered for stories about a particular company, users do not want to wait for a month and then receive a ranked list of all stories about the company in the past month, with the most relevant shown at the top. Instead, it is desirable to deliver relevant stories as soon as they come in, without knowledge about subsequent postings. As special cases of classification, filtering and routing can be accomplished using any of the classification algorithms described in chapter 16 or elsewhere in this book.

Common design features of IR systems

Most IR systems have as their primary data structure an inverted index. An inverted index is a data structure that lists for each word in the collection all documents that contain it (the postings) and the frequency of occurrence in each document. An inverted index makes it easy to search for hits of a query word. One just goes to the part of the inverted index that corresponds to the query word and retrieves the documents listed there.

A more sophisticated version of the inverted index also contains position information. Instead of just listing the documents that a word occurs in, the positions of all occurrences in the document are also listed. A position of occurrence can be encoded as a byte offset relative to the beginning of the document. An inverted index with position information lets us search for phrases. For example, to search for "car insurance", we simultaneously work through the entries for car and insurance in the inverted index. First, we intersect the two sets so that we only have documents in which both words occur. Then we look at the position information and keep only those hits for which the position information indicates that insurance occurs immediately after car.
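To make the phrase search concrete, here is a minimal sketch in Python (ours, not the book's; all names are illustrative) of a positional inverted index. Positions are word offsets rather than byte offsets, to keep the example short.

    from collections import defaultdict

    def build_index(docs):
        """Map each word to {doc_id: [positions]}: a positional inverted index."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, word in enumerate(text.lower().split()):
                index[word][doc_id].append(pos)
        return index

    def phrase_search(index, first, second):
        """Return doc_ids in which `second` occurs immediately after `first`."""
        hits = []
        # Intersect the two postings lists, then check adjacency of positions.
        for doc_id in index[first].keys() & index[second].keys():
            positions = set(index[first][doc_id])
            if any(p - 1 in positions for p in index[second][doc_id]):
                hits.append(doc_id)
        return sorted(hits)

    docs = {1: "car insurance rates", 2: "rates for car models", 3: "cheap car insurance"}
    index = build_index(docs)
    print(phrase_search(index, "car", "insurance"))  # [1, 3]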

This is much more efficient than having to read in and process all documents of the collection sequentially.

The notion of phrase used here is a fairly primitive one. We can only search for fixed phrases. For example, a search for "car insurance rates" would not find documents talking about "rates for car insurance". This is an area in which future Statistical NLP research can make important contributions to information retrieval. Most recent research on phrases in IR has taken the approach of designing a separate phrase identification module and then indexing documents for identified phrases as well as words. In such a system, a phrase is treated as no different from an ordinary word. The simplest approach to phrase identification, which is anathema to NLP researchers but often performs surprisingly well, is to just select the most frequent bigrams as phrases, for example, those that occur at least 25 times. In cases where phrase identification is a separate module, it is very similar to the problem of discovering collocations. Many of the techniques in chapter 5 for finding collocations can therefore also be applied to identifying good phrases for indexing and searching.

In some IR systems, not all words are represented in the inverted index. A stop list of grammatical or function words lists those words that are deemed unlikely to be useful for searching. Common stop words are the, from and could. These words have important semantic functions in English, but they rarely contribute information if the search criterion is a simple word-by-word match.

a also an and as at be but by can could do for from go have he her here his how if in into it its my of on or our say she that the their there therefore they this these those through to until we what when where which while who with would you your

Table 15.1 A small stop list for English. Stop words are function words that can be ignored in keyword-oriented information retrieval without a significant effect on retrieval accuracy.

A small stop list for English is shown in table 15.1. A stop list has the advantage that it reduces the size of the inverted index. According to Zipf's law (see section 1.4.3), a stop list that covers a few dozen words can reduce the size of the inverted index by half. However, it is impossible to search for phrases that contain stop words once the stop list has been applied; note that some occasionally used phrases like "when and where" consist entirely of words in the stop list in table 15.1. For this reason, many retrieval engines do not make use of a stop list for indexing.

Another common feature of IR systems is stemming, which we briefly discussed earlier. In IR, stemming usually refers to a simplified form of morphological analysis consisting simply of truncating a word. For example, laughing, laugh, laughs and laughed are all stemmed to laugh-. Common stemmers are the Lovins and Porter stemmers, which differ in the actual algorithms used for determining where to truncate words (Lovins 1968; Porter 1980). Two problems with truncation stemmers are that they conflate semantically different words (for example, gallery and gall may both be stemmed to gall-) and that the truncated stems can be unintelligible to users (for example, if gallery is presented as gall-). They are also much harder to make work well for morphology-rich languages.

Evaluation measures

Since the quality of many retrieval systems depends on how well they manage to rank relevant documents before non-relevant ones, IR researchers have developed evaluation measures specifically designed to evaluate rankings. Most of these measures combine precision and recall in a way that takes account of the ranking. As we explained in section 8.1, precision is the percentage of relevant items in the returned set and recall is the percentage of all relevant documents in the collection that is in the returned set. Table 15.2 demonstrates why the ranking of documents is important. All three retrieved sets have the same number of relevant and not relevant documents. A simple measure of precision (50% correct) would not distinguish between them. But ranking 1 is clearly better than ranking 2 for a user who scans a returned list of documents from top to bottom (which is what users do in many practical situations, for example, when web searching).

Evaluation                           Ranking 1    Ranking 2    Ranking 3
                                     d1  +        d10 -        d6  -
                                     d2  +        d9  -        d1  +
                                     d3  +        d8  -        d2  +
                                     d4  +        d7  -        d10 -
                                     d5  +        d6  -        d9  -
                                     d6  -        d1  +        d3  +
                                     d7  -        d2  +        d5  +
                                     d8  -        d3  +        d4  +
                                     d9  -        d4  +        d7  -
                                     d10 -        d5  +        d8  -
precision at 5                       1.0          0.0          0.4
precision at 10                      0.5          0.5          0.5
uninterpolated av. prec.             1.0          0.354        0.573
interpolated av. prec. (11-point)    1.0          0.5          0.644

Table 15.2 An example of the evaluation of rankings. The columns show three different rankings of ten documents, where a + indicates a relevant document and a - indicates a non-relevant document. The rankings are evaluated according to four measures: precision at 5 documents, precision at 10 documents, uninterpolated average precision, and interpolated average precision over 11 points.

One measure used is precision at a particular cutoff, for example 5 or 10 documents (other typical cutoffs are 20 and 100). By looking at precision for several initial segments of the ranked list, one can gain a good impression of how well a method ranks relevant documents before non-relevant documents. Uninterpolated average precision aggregates many precision numbers into one evaluation figure. Precision is computed for each point in the list where we find a relevant document and these precision numbers are then averaged. For example, for ranking 1 precision is 1.0 for d1, d2, d3, d4 and d5, since for each of these documents there are only relevant documents up to that point in the list. The uninterpolated average is therefore also 1.0. For ranking 3, we get the following precision numbers for the relevant documents: 1/2 (d1), 2/3 (d2), 3/6 (d3), 4/7 (d5), 5/8 (d4), which averages to 0.573.

If there are other relevant documents further down the list then these also have to be taken into account in computing uninterpolated average precision. Precision at relevant documents that are not in the returned set is assumed to be zero. This shows that average precision indirectly measures recall, the percentage of relevant documents that were returned in the retrieved set (since omitted documents are entered as zero precision).

Interpolated average precision is more directly based on recall. Precision numbers are computed for various levels of recall, for example for the levels 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100% in the case of an 11-point average (the most widely used measure). At recall level α, precision β is computed at the point of the ranked list where the proportion of retrieved relevant documents reaches α. However, if precision goes up again while we are moving down the list, then we interpolate and take the highest value of precision anywhere beyond the point where recall level α was first reached. For example, for ranking 3 in table 15.2, interpolated precision for recall level 80% is not 4/7, the precision at the point where 80% recall is first reached, as shown in the top diagram of figure 15.2. Instead, it is 5/8 > 4/7, as shown in the bottom diagram of figure 15.2. (We are assuming that the five relevant documents shown are the only relevant documents.) The thinking here is that the user will be willing to look at more documents if the precision goes up. The two graphs in figure 15.2 are so-called precision-recall curves, with interpolated and uninterpolated values for 0%, 20%, 40%, 60%, 80%, and 100% recall for ranking 3 in table 15.2.

There is an obvious trade-off between precision and recall. If the whole collection is retrieved, then recall is 100%, but precision is low. On the other hand, if only a few documents are retrieved, then the most relevant-seeming documents will be returned, resulting in high precision, but recall will be low.

Average precision is one way of computing a measure that captures both precision and recall. Another way is the F measure, which we introduced in section 8.1:

(15.1)  F = \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}}

where P is the precision, R is the recall and α determines the weighting of precision and recall. The F measure can be used for evaluation at fixed cutoffs if both recall and precision are important.
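The following Python sketch (ours, not the book's) implements these measures and reproduces the values given above for ranking 3 of table 15.2:

    def precision_at(ranking, relevant, k):
        """Fraction of the top k documents that are relevant."""
        return sum(1 for d in ranking[:k] if d in relevant) / k

    def uninterpolated_ap(ranking, relevant):
        """Average precision at each relevant document; omitted relevant docs count as 0."""
        precisions, hits = [], 0
        for rank, d in enumerate(ranking, start=1):
            if d in relevant:
                hits += 1
                precisions.append(hits / rank)
        precisions += [0.0] * (len(relevant) - hits)
        return sum(precisions) / len(relevant)

    def interpolated_ap_11pt(ranking, relevant):
        """11-point interpolated average precision."""
        points, hits = [], 0
        for rank, d in enumerate(ranking, start=1):
            if d in relevant:
                hits += 1
                points.append((hits / len(relevant), hits / rank))  # (recall, precision)
        # Interpolated precision at level a = highest precision at any recall >= a.
        levels = [i / 10 for i in range(11)]
        return sum(max((p for r, p in points if r >= a), default=0.0)
                   for a in levels) / len(levels)

    def f_measure(p, r, alpha=0.5):
        """F measure of equation (15.1)."""
        return 1 / (alpha / p + (1 - alpha) / r)

    relevant = {"d1", "d2", "d3", "d4", "d5"}
    ranking3 = ["d6", "d1", "d2", "d10", "d9", "d3", "d5", "d4", "d7", "d8"]
    print(precision_at(ranking3, relevant, 5))                 # 0.4
    print(round(uninterpolated_ap(ranking3, relevant), 3))     # 0.573
    print(round(interpolated_ap_11pt(ranking3, relevant), 3))  # 0.644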

[Figure 15.2: Two examples of precision-recall curves, each plotting precision against recall. The two curves are for ranking 3 in table 15.2: uninterpolated (above) and interpolated (below).]

Any of the measures discussed above can be used to compare the performance of information retrieval systems. One common approach is to run the systems on a corpus and a set of queries and average the performance measure over queries. If the average of system 1 is better than the average of system 2, then that is evidence that system 1 is better than system 2. Unfortunately, there are several problems with this experimental design. The difference in averages could be due to chance. Or it could be due to one query on which system 1 outperforms system 2 by a large margin, with performance on all other queries being about the same. It is therefore advisable to use a statistical test like the t test for system comparison (as shown in section 6.2.3).

The probability ranking principle (PRP)

Ranking documents is intuitively plausible since it gives the user some control over the tradeoff between precision and recall. If recall for the first page of results is low and the desired information is not found, then the user can look at the next page, which in most cases trades higher recall for lower precision. The following principle is a guideline which is one way to make explicit the assumptions that underlie the design of retrieval by ranking. We present it in a form simplified from (van Rijsbergen 1979: 113):

Probability Ranking Principle (PRP). Ranking documents in order of decreasing probability of relevance is optimal.

The basic idea is that we view retrieval as a greedy search that aims to identify the most valuable document at any given time. The document d that is most likely to be valuable is the one with the highest estimated probability of relevance (where we consider all documents that haven't been retrieved yet), that is, with a maximum value for P(R|d). After making many consecutive decisions like this, we arrive at a list of documents that is ranked in order of decreasing probability of relevance.

Many retrieval systems are based on the PRP, so it is important to be clear about the assumptions that are made when it is accepted. One assumption of the PRP is that documents are independent. The clearest counterexamples are duplicates. If we have two duplicates d1 and d2, then the estimated probability of relevance of d2 does not change after we have presented d1 further up in the list. But d2 does not give the user any information that is not already contained in d1.

Clearly, a better design is to show only one of the set of identical documents, but that violates the PRP. Another simplification made by the PRP is to break up a complex information need into a number of queries which are each optimized in isolation. In practice, a document can be highly relevant to the complex information need as a whole even if it is not the optimal one for an intermediate step. An example here is an information need that the user initially expresses using ambiguous words, for example, the query "jaguar" to search for information on the animal (as opposed to the car). The optimal response to this query may be the presentation of documents that make the user aware of the ambiguity and permit disambiguation of the query. In contrast, the PRP would mandate the presentation of documents that are highly relevant to either the car or the animal.

A third important caveat is that the probability of relevance is only estimated. Given the many simplifying assumptions we make in designing probabilistic models for IR, we cannot completely trust the probability estimates. One aspect of this problem is that the variance of the estimate of probability of relevance may be an important piece of evidence in some retrieval contexts. For example, a user may prefer a document that we are certain is probably relevant (low variance of probability estimate) to one whose estimated probability of relevance is higher, but that also has a higher variance of the estimate.

15.2 The Vector Space Model

The vector space model is one of the most widely used models for ad-hoc retrieval, mainly because of its conceptual simplicity and the appeal of the underlying metaphor of using spatial proximity for semantic proximity. Documents and queries are represented in a high-dimensional space, in which each dimension of the space corresponds to a word in the document collection. The most relevant documents for a query are expected to be those represented by the vectors closest to the query, that is, documents that use similar words to the query. Rather than considering the magnitude of the vectors, closeness is often calculated by just looking at angles and choosing documents that enclose the smallest angle with the query vector.

In figure 15.3, we show a vector space with two dimensions, corresponding to the words car and insurance.

[Figure 15.3: A vector space with two dimensions. The two dimensions correspond to the terms car and insurance. One query (q) and three documents (d1, d2, d3) are represented in the space.]

The entities represented in the space are the query q, represented by the vector (0.71, 0.71), and three documents d1, d2, and d3 with the following coordinates: (0.13, 0.99), (0.8, 0.6), and (0.99, 0.13). The coordinates, or term weights, are derived from occurrence counts as we will see below. For example, insurance may have only a passing reference in d1 while there are several occurrences of car, hence the low weight for insurance and the high weight for car. (In the context of information retrieval, the word term is used for both words and phrases. We say term weights rather than word weights because dimensions in the vector space model can correspond to phrases as well as words.)

In the figure, document d2 has the smallest angle with q, so it will be the top-ranked document in response to the query "car insurance". This is because both concepts (car and insurance) are salient in d2 and therefore have high weights. The other two documents also mention both terms, but in each case one of them is not a centrally important term in the document.

Vector similarity

To do retrieval in the vector space model, documents are ranked according to similarity with the query, as measured by the cosine measure or normalized correlation coefficient.

We introduced the cosine as a measure of vector similarity earlier and repeat its definition here:

(15.2)  \cos(\vec{q}, \vec{d}) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}}

where \vec{q} and \vec{d} are n-dimensional vectors in a real-valued space, the space of all terms in the case of the vector space model. We compute how well the occurrence of term i (measured by q_i and d_i) correlates in query and document, and then divide by the Euclidean length of the two vectors to scale for the magnitude of the individual q_i and d_i.

Recall also that cosine and Euclidean distance give rise to the same ranking for normalized vectors:

(15.3)  |\vec{x} - \vec{y}|^2 = \sum_{i=1}^{n} (x_i - y_i)^2 = \sum_{i=1}^{n} x_i^2 - 2\sum_{i=1}^{n} x_i y_i + \sum_{i=1}^{n} y_i^2 = 1 - 2\sum_{i=1}^{n} x_i y_i + 1 = 2\left(1 - \sum_{i=1}^{n} x_i y_i\right)

So for a particular query \vec{q} and any two documents \vec{d}_1 and \vec{d}_2 we have:

(15.4)  \cos(\vec{q}, \vec{d}_1) > \cos(\vec{q}, \vec{d}_2) \iff |\vec{q} - \vec{d}_1| < |\vec{q} - \vec{d}_2|

which implies that the rankings are the same. (We again assume normalized vectors here.) If the vectors are normalized, we can compute the cosine as a simple dot product. Normalization is generally seen as a good thing; otherwise longer vectors (corresponding to longer documents) would have an unfair advantage and get ranked higher than shorter ones. (We leave it as an exercise to show that the vectors in figure 15.3 are normalized, that is, \sum_i d_i^2 = 1.)
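As a quick illustration (ours, not the book's), the following Python snippet ranks the three documents of figure 15.3 against the query by cosine similarity; since the vectors are (approximately) normalized, the cosine is essentially their dot product:

    import math

    def cosine(q, d):
        """Cosine of the angle between q and d (equation 15.2)."""
        dot = sum(qi * di for qi, di in zip(q, d))
        norm_q = math.sqrt(sum(qi * qi for qi in q))
        norm_d = math.sqrt(sum(di * di for di in d))
        return dot / (norm_q * norm_d)

    q = (0.71, 0.71)  # the query "car insurance"
    docs = {"d1": (0.13, 0.99), "d2": (0.8, 0.6), "d3": (0.99, 0.13)}
    for name in sorted(docs, key=lambda n: -cosine(q, docs[n])):
        print(name, round(cosine(q, docs[name]), 2))
    # d2 comes out on top: both car and insurance are salient in it.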

Term weighting

We now turn to the question of how to weight words in the vector space model. One could just use the count of a word in a document as its term weight, but there are more effective methods of term weighting.

The basic information used in term weighting is term frequency, document frequency, and sometimes collection frequency, as defined in table 15.3. Note that df_i <= cf_i and that \sum_j tf_{i,j} = cf_i. It is also important to note that document frequency and collection frequency can only be used if there is a collection. This assumption is not always true, for example if collections are created dynamically by selecting several databases from a large set (as may be the case on one of the large on-line information services) and joining them into a temporary collection.

Quantity               Symbol     Definition
term frequency         tf_{i,j}   number of occurrences of w_i in d_j
document frequency     df_i       number of documents in the collection that w_i occurs in
collection frequency   cf_i       total number of occurrences of w_i in the collection

Table 15.3 Three quantities that are commonly used in term weighting in information retrieval.

The information that is captured by term frequency is how salient a word is within a given document. The higher the term frequency (the more often the word occurs), the more likely it is that the word is a good description of the content of the document. Term frequency is usually dampened by a function like f(tf) = \sqrt{tf} or f(tf) = 1 + \log(tf), tf > 0, because more occurrences of a word indicate higher importance, but not as much importance as the undampened count would suggest. For example, \sqrt{3} or 1 + \log 3 better reflect the importance of a word with three occurrences than the count 3 itself. A document with three occurrences is somewhat more important than a document with one occurrence, but not three times as important.

The second quantity, document frequency, can be interpreted as an indicator of informativeness. A semantically focussed word will often occur several times in a document if it occurs at all. Semantically unfocussed words are spread out homogeneously over all documents. An example from a corpus of New York Times articles is the words insurance and try in table 15.4.

Word        Collection Frequency    Document Frequency
insurance   10440                   3997
try         10422                   8760

Table 15.4 Term and document frequencies of two words in an example corpus.

The two words have about the same collection frequency, the total number of occurrences in the document collection. But insurance occurs in only half as many documents as try. This is because the word try can be used when talking about almost any topic, since one can try to do something in any context. In contrast, insurance refers to a narrowly defined concept that is only relevant to a small set of topics. Another property of semantically focussed words is that, if they come up once in a document, they often occur several times. Insurance occurs about three times per document, averaged over documents it occurs in at least once. This is simply due to the fact that most articles about health insurance, car insurance or similar topics will refer multiple times to the concept of insurance.

One way to combine a word's term frequency tf_{i,j} and document frequency df_i into a single weight is as follows:

(15.5)  \mathrm{weight}(i,j) = \begin{cases} (1 + \log(\mathrm{tf}_{i,j}))\,\log\frac{N}{\mathrm{df}_i} & \text{if } \mathrm{tf}_{i,j} \ge 1 \\ 0 & \text{if } \mathrm{tf}_{i,j} = 0 \end{cases}

where N is the total number of documents. The first clause applies for words occurring in the document, whereas for words that do not appear (tf_{i,j} = 0), we set weight(i,j) = 0. Document frequency is also scaled logarithmically. The formula \log\frac{N}{\mathrm{df}_i} = \log N - \log \mathrm{df}_i gives full weight to words that occur in 1 document (\log N - \log \mathrm{df}_i = \log N - \log 1 = \log N). A word that occurred in all documents would get zero weight (\log N - \log \mathrm{df}_i = \log N - \log N = 0). This form of document frequency weighting is often called inverse document frequency or idf weighting. More generally, the weighting scheme in (15.5) is an example of a larger family of so-called tf.idf weighting schemes. Each such scheme can be characterized by its term occurrence weighting, its document frequency weighting and its normalization. In one description scheme, we assign a letter code to each component of the tf.idf scheme. The scheme in (15.5) can then be described as "ltn" for logarithmic occurrence count weighting (l), logarithmic document frequency weighting (t), and no normalization (n). Other weighting possibilities are listed in table 15.5. For example, "ann" is augmented term occurrence weighting, no document frequency weighting and no normalization. We refer to vector length normalization as cosine normalization because the inner product between two length-normalized vectors (the query-document similarity measure used in the vector space model) is their cosine.
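A direct transcription of the "ltn" scheme of equation (15.5) in Python (our sketch; the counts are illustrative):

    import math

    def ltn_weight(tf, df, n_docs):
        """Eq. (15.5): logarithmic tf (l), logarithmic df (t), no normalization (n)."""
        if tf == 0:
            return 0.0
        return (1 + math.log(tf)) * math.log(n_docs / df)

    # A term occurring 3 times in a document and in 1000 of 79291 documents:
    print(round(ltn_weight(3, 1000, 79291), 2))
    # A term that occurs in every document gets zero weight, however frequent:
    print(ltn_weight(5, 79291, 79291))  # 0.0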

Term occurrence                                              Document frequency       Normalization
n (natural): tf_{t,d}                                        n (natural): df_t        n (no normalization)
l (logarithm): 1 + \log(tf_{t,d})                            t: \log\frac{N}{df_t}    c (cosine): \frac{1}{\sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}}
a (augmented): 0.5 + 0.5\,\frac{tf_{t,d}}{\max_t(tf_{t,d})}

Table 15.5 Components of tf.idf weighting schemes. tf_{t,d} is the frequency of term t in document d, df_t is the number of documents t occurs in, N is the total number of documents, and w_i is the weight of term i.

Different weighting schemes can be applied to queries and documents. In the name "ltc.lnn", the halves refer to document and query weighting, respectively. The family of weighting schemes shown in table 15.5 is sometimes criticized as ad-hoc because it is not directly derived from a mathematical model of term distributions or relevancy. However, these schemes are effective in practice and work robustly in a broad range of applications. For this reason, they are often used in situations where a rough measure of similarity between vectors of counts is needed.

15.3 Term Distribution Models

An alternative to tf.idf weighting is to develop a model for the distribution of a word and to use this model to characterize its importance for retrieval. That is, we wish to estimate P_i(k), the proportion of times that word w_i appears k times in a document. In the simplest case, the distribution model is used for deriving a probabilistically motivated term weighting scheme for the vector space model. But models of term distribution can also be embedded in other information retrieval frameworks.

Apart from its importance for term weighting, a precise characterization of the occurrence patterns of words in text is arguably at least as important a topic in Statistical NLP as Zipf's law. Zipf's law describes word behavior in an entire corpus. In contrast, term distribution models capture regularities of word occurrence in subunits of a corpus (e.g., documents or chapters of a book). In addition to information retrieval, a good understanding of distribution patterns is useful wherever we want to assess the likelihood of a certain number of occurrences of a specific word in a unit of text. For example, it is also important for author identification, where one compares the likelihood that different writers produced a text of unknown authorship.

Most term distribution models try to characterize how informative a word is, which is also the information that inverse document frequency is getting at. One could cast the problem as one of distinguishing content words from non-content (or function) words, but most models have a graded notion of how informative a word is. In this section, we introduce several models that formalize notions of informativeness. Three are based on the Poisson distribution, one motivates inverse document frequency as a weight optimal for Bayesian classification, and the final one, residual inverse document frequency, can be interpreted as a combination of idf and the Poisson distribution.

The Poisson distribution

The standard probabilistic model for the distribution of a certain type of event over units of a fixed size (such as periods of time or volumes of liquid) is the Poisson distribution. Classical examples of Poisson distributions are the number of items that will be returned as defects in a given period of time, the number of typing mistakes on a page, and the number of microbes that occur in a given volume of water. The definition of the Poisson distribution is as follows.

Poisson Distribution.  p(k; \lambda_i) = e^{-\lambda_i}\frac{\lambda_i^k}{k!}  for some \lambda_i > 0

In the most common model of the Poisson distribution in IR, the parameter \lambda_i > 0 is the average number of occurrences of w_i per document, that is, \lambda_i = \frac{\mathrm{cf}_i}{N}, where cf_i is the collection frequency and N is the total number of documents in the collection. Both the mean and the variance of the Poisson distribution are equal to \lambda_i:

E(p) = \mathrm{Var}(p) = \lambda_i

Figure 15.4 shows two examples of the Poisson distribution. In our case, the event we are interested in is the occurrence of a particular word w_i and the fixed unit is the document. We can use the Poisson distribution to estimate an answer to the question: What is the probability that a word occurs a particular number of times in a document? We might say that P_i(k) = p(k; \lambda_i) is the probability of a document having exactly k occurrences of w_i, where \lambda_i is appropriately estimated for each word.
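A small sketch of the Poisson model in Python (ours, with an invented term), including the predicted document frequency N(1 - p(0; \lambda)) that is used below to test the fit:

    import math

    def poisson(k, lam):
        """p(k; lambda) = exp(-lambda) * lambda**k / k!"""
        return math.exp(-lam) * lam ** k / math.factorial(k)

    def predicted_df(cf, n_docs):
        """Document frequency predicted by the Poisson model for a term with
        collection frequency cf: N * (1 - p(0; lambda)), with lambda = cf / N."""
        lam = cf / n_docs
        return n_docs * (1 - poisson(0, lam))

    # A hypothetical term with 10000 occurrences in a 79291-document collection:
    print(round(predicted_df(10000, 79291)))  # about 9395 documents
    # For a bursty content word the actual df would be far lower than this.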

[Figure 15.4: The Poisson distribution. The graph shows p(k; 0.5) (solid line) and p(k; 2.0) (dotted line) for 0 <= k <= 6, with the probability of each count k on the vertical axis.]

The Poisson distribution is a limit of the binomial distribution. For the binomial distribution b(k; n, p), if we let n go to infinity and p go to 0 in such a way that np remains fixed at value \lambda > 0, then b(k; n, p) approaches p(k; \lambda).

Assuming a Poisson distribution for a term is appropriate if the following conditions hold. The probability of one occurrence of the term in a (short) piece of text is proportional to the length of the text. The probability of more than one occurrence of a term in a short piece of text is negligible compared to the probability of one occurrence. Occurrence events in non-overlapping intervals of text are independent. We will discuss problems with these assumptions for modeling the distribution of terms shortly. Let us first look at some examples.

[Table 15.6: Document frequency (df) and collection frequency (cf) for the six words follows, transformed, soviet, students, james, and freshly in the New York Times corpus, together with \lambda_i, the Poisson-predicted document frequency N(1 - p(0; \lambda_i)), and the overestimation factor. Computing N(1 - p(0; \lambda_i)) according to the Poisson distribution is a reasonable estimator of df for non-content words (like follows), but severely overestimates df for content words (like soviet). The parameter \lambda_i of the Poisson distribution is the average number of occurrences of term i per document. The corpus has N = 79291 documents.]

Table 15.6 shows for six terms in the New York Times newswire how well the Poisson distribution predicts document frequency. For each word, we show document frequency df, collection frequency cf, the estimate of \lambda (collection frequency divided by the total number of documents, 79291), the predicted df, and the ratio of predicted df and actual df. Examining document frequency is the easiest way to check whether a term is Poisson distributed. The number of documents predicted to have at least one occurrence of a term can be computed as the complement of the predicted number with no occurrences. Thus, the Poisson predicts that the document frequency is df = N(1 - p(0; \lambda_i)), where N is the number of documents in the corpus. A better way to check the fit of the Poisson is to look at the complete distribution: the number of documents with 0, 1, 2, 3, etc. occurrences. We will do this below.

In table 15.6, we can see that the Poisson estimates are good for non-content words like follows and transformed. We use the term non-content word loosely to refer to words that taken in isolation (which is what most IR systems do) do not give much information about the contents of the document. But the estimates for content words are much too high, by a factor of about 3 (3.48 and 2.91).

This result is not surprising since the Poisson distribution assumes independence between term occurrences. This assumption holds approximately for non-content words, but most content words are much more likely to occur again in a text once they have occurred once, a property that is sometimes called burstiness or term clustering.

However, there are some subtleties in the behavior of words, as we can see for the last two words in the table. The distribution of james is surprisingly close to Poisson, probably because in many cases a person's full name is given at first mention in a newspaper article, but following mentions only use the last name or a pronoun. On the other hand, freshly is surprisingly non-Poisson. Here we get strong dependence because of the genre of recipes in the New York Times, in which freshly frequently occurs several times. So non-Poisson-ness can also be a sign of clustered term occurrences in a particular genre like recipes.

The tendency of content word occurrences to cluster is the main problem with using the Poisson distribution for words. But there is also the opposite effect. We are taught in school to avoid repetitive writing. In many cases, the probability of reusing a word immediately after its first occurrence in a text is lower than in general. A final problem with the Poisson is that documents in many collections differ widely in size. So documents are not a uniform unit of measurement, as the second is for time or the kilogram is for mass. But that is one of the assumptions of the Poisson distribution.

The two-Poisson model

A better fit to the frequency distribution of content words is provided by the two-Poisson model (Bookstein and Swanson 1975), a mixture of two Poissons. The model assumes that there are two classes of documents associated with a term, one class with a low average number of occurrences (the non-privileged class) and one with a high average number of occurrences (the privileged class):

\mathrm{tp}(k; \pi, \lambda_1, \lambda_2) = \pi e^{-\lambda_1}\frac{\lambda_1^k}{k!} + (1 - \pi) e^{-\lambda_2}\frac{\lambda_2^k}{k!}

where \pi is the probability of a document being in the privileged class, (1 - \pi) is the probability of a document being in the non-privileged class, and \lambda_1 and \lambda_2 are the average number of occurrences of word w_i in the privileged and non-privileged classes, respectively.

The two-Poisson model postulates that a content word plays two different roles in documents. In the non-privileged class, its occurrence is accidental and it should therefore not be used as an index term, just as a non-content word. The average number of occurrences of the word in this class is low.
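In Python, the mixture is straightforward (our sketch; the parameters shown are invented for illustration):

    import math

    def poisson(k, lam):
        return math.exp(-lam) * lam ** k / math.factorial(k)

    def two_poisson(k, pi, lam1, lam2):
        """Two-Poisson mixture: privileged class with rate lam1 and probability pi,
        non-privileged class with rate lam2 and probability 1 - pi."""
        return pi * poisson(k, lam1) + (1 - pi) * poisson(k, lam2)

    # E.g., 10% of documents treat the term as a core topic (lam1 = 3.0),
    # while the rest mention it only accidentally (lam2 = 0.05):
    for k in range(5):
        print(k, round(two_poisson(k, 0.1, 3.0, 0.05), 4))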

In the privileged class, the word is a central content word. The average number of occurrences of the word in this class is high and it is a good index term.

Empirical tests of the two-Poisson model have found a spurious dip at frequency 2. The model incorrectly predicts that documents with 2 occurrences of a term are less likely than documents with 3 or 4 occurrences. In reality, the distribution for most terms is monotonically decreasing. If P_i(k) is the proportion of times that word w_i appears k times in a document, then

P_i(0) > P_i(1) > P_i(2) > P_i(3) > P_i(4) > \ldots

As a fix, one can use more than two Poisson distributions. The negative binomial is one such mixture of an infinite number of Poissons (Mosteller and Wallace 1984), but there are many others (Church and Gale 1995). The negative binomial fits term distributions better than one or two Poissons, but it can be hard to work with in practice because it involves the computation of large binomial coefficients.

The K mixture

A simpler distribution that fits empirical word distributions about as well as the negative binomial is Katz's K mixture:

P_i(k) = (1 - \alpha)\delta_{k,0} + \frac{\alpha}{\beta + 1}\left(\frac{\beta}{\beta + 1}\right)^k

where \delta_{k,0} = 1 iff k = 0 and \delta_{k,0} = 0 otherwise, and \alpha and \beta are parameters that can be fit using the observed mean \lambda and the observed inverse document frequency IDF as follows:

\lambda = \frac{\mathrm{cf}}{N} \qquad \mathrm{IDF} = \log_2\frac{N}{\mathrm{df}} \qquad \beta = \lambda \cdot 2^{\mathrm{IDF}} - 1 = \frac{\mathrm{cf} - \mathrm{df}}{\mathrm{df}} \qquad \alpha = \frac{\lambda}{\beta}

The parameter \beta is the number of extra terms per document in which the term occurs (compared to the case where a term has only one occurrence per document). The decay factor \frac{\beta}{\beta+1} = \frac{\mathrm{cf} - \mathrm{df}}{\mathrm{cf}} (extra terms per term occurrence) determines the ratio \frac{P_i(k)}{P_i(k-1)}. For example, if there are 1/10 as many extra terms as term occurrences, then there will be ten times as many documents with 1 occurrence as with 2 occurrences, and ten times as many with 2 occurrences as with 3 occurrences.
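The parameter fitting and the distribution itself take only a few lines of Python (our sketch, with invented counts):

    def k_mixture_params(cf, df, n_docs):
        """Fit alpha and beta from collection frequency, document frequency, and N."""
        lam = cf / n_docs        # observed mean occurrences per document
        beta = (cf - df) / df    # extra terms per document the term occurs in
        alpha = lam / beta
        return alpha, beta

    def k_mixture(k, alpha, beta):
        """P(k) = (1 - alpha)*delta(k,0) + (alpha/(beta + 1)) * (beta/(beta + 1))**k"""
        delta = 1.0 if k == 0 else 0.0
        return (1 - alpha) * delta + (alpha / (beta + 1)) * (beta / (beta + 1)) ** k

    alpha, beta = k_mixture_params(10000, 4000, 79291)  # hypothetical counts
    print(round(k_mixture(0, alpha, beta), 5))  # equals 1 - df/N: perfect fit at k = 0
    for k in range(1, 4):
        print(k, round(k_mixture(k, alpha, beta), 5))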

[Table 15.7: Actual ("act") and estimated ("est") number of documents with k occurrences for the six terms follows, transformed, soviet, students, james, and freshly. For example, there were 1435 documents with 2 occurrences of follows.]

If there are no extra terms (cf = df, so that \beta = 0), then we predict that there are no documents with more than 1 occurrence. The parameter \alpha captures the absolute frequency of the term. Two terms with the same \beta have identical ratios of collection frequency to document frequency, but different values for \alpha if their collection frequencies are different.

Table 15.7 shows the number of documents with k occurrences in the New York Times corpus for the six words that we looked at earlier. We observe that the fit is always perfect for k = 0. It is easy to show that this is a general property of the K mixture (see exercise 15.3). The K mixture is a fairly good approximation of term distribution, especially for non-content words. However, it is apparent from the empirical numbers in table 15.7 that the assumption

\frac{P_i(k)}{P_i(k+1)} = c, \quad k \ge 1

does not hold perfectly for content words. As in the case of the two-Poisson mixture, we are making a distinction between a low base rate of occurrence and another class of documents that have clusters of occurrences. The K mixture assumes \frac{P_i(k)}{P_i(k+1)} = c for k \ge 1, which concedes that k = 0 is a special case due to a low base rate of occurrence for many words.

But the ratio \frac{P_i(k)}{P_i(k+1)} seems to decline for content words even for k \ge 1. For example, for soviet we have:

\frac{P_i(0)}{P_i(1)} > \frac{P_i(1)}{P_i(2)} > \frac{P_i(2)}{P_i(3)} > \frac{P_i(3)}{P_i(4)} > \frac{P_i(4)}{P_i(5)}

In other words, each occurrence of a content word we find in a text decreases the probability of finding an additional term, but the decreases become consecutively smaller. The reason is that occurrences of content words tend to cluster in documents whose core topic is associated with the content word. A large number of occurrences indicates that the content word describes a central concept of the document. Such a central concept is likely to be mentioned more often than a constant decay model would predict.

We have introduced Katz's K mixture here as an example of a term distribution model that is more accurate than the Poisson distribution and the two-Poisson model. The interested reader can find more discussion of the characteristics of content words in text and of several probabilistic models with a better fit to empirical distributions in (Katz 1996).

Inverse document frequency

We motivated inverse document frequency (IDF) heuristically in the term weighting section above, but we can also derive it from a term distribution model. In the derivation we present here, we only use binary occurrence information and do not take into account term frequency.

To derive IDF, we view ad-hoc retrieval as the task of ranking documents according to the odds of relevance:

O(d) = \frac{P(R|d)}{P(\neg R|d)}

where P(R|d) is the probability of relevance of d and P(\neg R|d) is the probability of non-relevance. We then take logs to compute the log odds, and apply Bayes' formula:

\log O(d) = \log \frac{P(R|d)}{P(\neg R|d)}

= \log \frac{P(d|R)P(R)}{P(d)} - \log \frac{P(d|\neg R)P(\neg R)}{P(d)} = \log P(d|R) - \log P(d|\neg R) + \log P(R) - \log P(\neg R)

Let us assume that the query Q is the set of words \{w_i\}, and let the indicator random variables X_i be 1 or 0, corresponding to occurrence and non-occurrence of word w_i in d. If we then make the conditional independence assumption discussed in section 7.2.1, we can write:

\log O(d) = \sum_i \left[ \log P(X_i|R) - \log P(X_i|\neg R) \right] + \log P(R) - \log P(\neg R)

Since we are only interested in ranking, we can create a new ranking function g(d) which drops the constant term \log P(R) - \log P(\neg R). With the abbreviations p_i = P(X_i = 1|R) (word i occurring in a relevant document) and q_i = P(X_i = 1|\neg R) (word i occurring in a nonrelevant document), we can write g(d) as follows. (In the second line, we make use of P(X_i = 1|\cdot) = y = y^1(1-y)^0 = y^{X_i}(1-y)^{1-X_i} and P(X_i = 0|\cdot) = 1-y = y^0(1-y)^1 = y^{X_i}(1-y)^{1-X_i} so that we can write the equation more compactly.)

g(d) = \sum_i \left[ \log P(X_i|R) - \log P(X_i|\neg R) \right]
     = \sum_i \left[ \log\left(p_i^{X_i}(1-p_i)^{1-X_i}\right) - \log\left(q_i^{X_i}(1-q_i)^{1-X_i}\right) \right]
     = \sum_i X_i \log \frac{p_i(1-q_i)}{(1-p_i)q_i} + \sum_i \log \frac{1-p_i}{1-q_i}
     = \sum_i X_i \log \frac{p_i}{1-p_i} + \sum_i X_i \log \frac{1-q_i}{q_i} + \sum_i \log \frac{1-p_i}{1-q_i}

In the last equation above, \sum_i \log \frac{1-p_i}{1-q_i} is another constant term which does not affect the ranking of documents, and so we can drop it as well, giving the final ranking function:

(15.6)  g'(d) = \sum_i X_i \log \frac{p_i}{1-p_i} + \sum_i X_i \log \frac{1-q_i}{q_i}

If we have a set of documents that is categorized according to relevance to the query, we can estimate the p_i and q_i directly. However, in ad-hoc retrieval we do not have such relevance information. That means we have to make some simplifying assumptions in order to be able to rank documents in a meaningful way.

First, we assume that p_i is small and constant for all terms. The first term of g' then becomes \sum_i X_i \log \frac{p_i}{1-p_i} = c \sum_i X_i, a simple count of the number of matches between query and document, weighted by c. The fraction in the second term can be approximated by assuming that most documents are not relevant, so q_i = P(X_i = 1|\neg R) \approx P(w_i) = \frac{\mathrm{df}_i}{N}, which is the maximum likelihood estimate of P(w_i), the probability of occurrence of w_i not conditioned on relevance:

\frac{1-q_i}{q_i} = \frac{1 - \frac{\mathrm{df}_i}{N}}{\frac{\mathrm{df}_i}{N}} = \frac{N - \mathrm{df}_i}{\mathrm{df}_i} \approx \frac{N}{\mathrm{df}_i}

The last approximation, \frac{N - \mathrm{df}_i}{\mathrm{df}_i} \approx \frac{N}{\mathrm{df}_i}, holds for most words since most words are relatively rare. After applying the logarithm, we have now arrived at the IDF weight we introduced earlier. Substituting it back into the formula for g' we get:

(15.7)  g'(d) \approx c \sum_i X_i + \sum_i X_i\,\mathrm{idf}_i

This derivation may not satisfy everyone, since we weight the term according to the opposite of the probability of non-relevance rather than directly according to the probability of relevance. But the probability of relevance is impossible to estimate in ad-hoc retrieval. As in many other cases in Statistical NLP, we take a somewhat circuitous route to get to a desired quantity from others that can be more easily estimated.

Residual inverse document frequency

An alternative to IDF is residual inverse document frequency or RIDF. Residual IDF is defined as the difference between the logs of actual document frequency and document frequency predicted by Poisson:

\mathrm{RIDF} = \mathrm{IDF} - \log_2(1 - p(0; \lambda_i))

where \mathrm{IDF} = \log_2 \frac{N}{\mathrm{df}}, and p is the Poisson distribution with parameter \lambda_i = \frac{\mathrm{cf}_i}{N}, the average number of occurrences of w_i per document. 1 - p(0; \lambda_i) is the Poisson probability of a document with at least one occurrence. So, for example, RIDF for insurance and try in table 15.4 would be 7.3 and 6.2, respectively (with N = 79291; verify this!).
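The computation is easy to check in Python (our sketch), using the counts from table 15.4; it reproduces the two values quoted above:

    import math

    def ridf(cf, df, n_docs):
        """RIDF = IDF - log2(1 - p(0; lambda)), with lambda = cf / N."""
        lam = cf / n_docs
        idf = math.log2(n_docs / df)
        return idf - math.log2(1 - math.exp(-lam))

    print(round(ridf(10440, 3997, 79291), 1))  # insurance -> 7.3
    print(round(ridf(10422, 8760, 79291), 1))  # try -> 6.2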

As we saw above, the Poisson distribution only fits the distribution of non-content words well. Therefore, the deviation from Poisson is a good predictor of the degree to which a word is a content word.

Usage of term distribution models

We can exploit term distribution models in information retrieval by using the parameters of the model fit for a particular term as indicators of relevance. For example, we could use RIDF or the \beta in the K mixture as a replacement for IDF weights (since content words have large \beta and large RIDF, non-content words have smaller \beta and smaller RIDF). Better models of term distribution than IDF have the potential of assessing a term's properties more accurately, leading to a better model of query-document similarity. Although there has been little work on employing term distribution models different from IDF in IR, it is to be hoped that such models will eventually lead to better measures of content similarity.

15.4 Latent Semantic Indexing

             Term 1    Term 2       Term 3    Term 4
Query        user      interface
Document 1   user      interface    HCI       interaction
Document 2                          HCI       interaction

Table 15.8 Example for exploiting co-occurrence in computing content similarity. For the query and the two documents, the terms they contain are listed in their respective rows.

In the previous section, we looked at the occurrence patterns of individual words. A different source of information about terms that can be exploited in information retrieval is co-occurrence: the fact that two or more terms occur in the same documents more often than chance. Consider the example in table 15.8. Document 1 is likely to be relevant to the query since it contains all the terms in the query. But document 2 is also a good candidate for retrieval. Its terms HCI and interaction co-occur with user and interface, which can be evidence for semantic relatedness. Latent Semantic Indexing (LSI) is a technique that projects queries and documents into a space with "latent" semantic dimensions.

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Informaton Search and Management Probablstc Retreval Models Prof. Chrs Clfton 7 September 2018 Materal adapted from course created by Dr. Luo S, now leadng Albaba research group 14 Why probabltes

More information

Probabilistic Information Retrieval CE-324: Modern Information Retrieval Sharif University of Technology

Probabilistic Information Retrieval CE-324: Modern Information Retrieval Sharif University of Technology Probablstc Informaton Retreval CE-324: Modern Informaton Retreval Sharf Unversty of Technology M. Soleyman Fall 2016 Most sldes have been adapted from: Profs. Mannng, Nayak & Raghavan (CS-276, Stanford)

More information

Negative Binomial Regression

Negative Binomial Regression STATGRAPHICS Rev. 9/16/2013 Negatve Bnomal Regresson Summary... 1 Data Input... 3 Statstcal Model... 3 Analyss Summary... 4 Analyss Optons... 7 Plot of Ftted Model... 8 Observed Versus Predcted... 10 Predctons...

More information

Comparison of Regression Lines

Comparison of Regression Lines STATGRAPHICS Rev. 9/13/2013 Comparson of Regresson Lnes Summary... 1 Data Input... 3 Analyss Summary... 4 Plot of Ftted Model... 6 Condtonal Sums of Squares... 6 Analyss Optons... 7 Forecasts... 8 Confdence

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

Difference Equations

Difference Equations Dfference Equatons c Jan Vrbk 1 Bascs Suppose a sequence of numbers, say a 0,a 1,a,a 3,... s defned by a certan general relatonshp between, say, three consecutve values of the sequence, e.g. a + +3a +1

More information

x = , so that calculated

x = , so that calculated Stat 4, secton Sngle Factor ANOVA notes by Tm Plachowsk n chapter 8 we conducted hypothess tests n whch we compared a sngle sample s mean or proporton to some hypotheszed value Chapter 9 expanded ths to

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

THE SUMMATION NOTATION Ʃ

THE SUMMATION NOTATION Ʃ Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the

More information

More metrics on cartesian products

More metrics on cartesian products More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of

More information

Chapter 5 Multilevel Models

Chapter 5 Multilevel Models Chapter 5 Multlevel Models 5.1 Cross-sectonal multlevel models 5.1.1 Two-level models 5.1.2 Multple level models 5.1.3 Multple level modelng n other felds 5.2 Longtudnal multlevel models 5.2.1 Two-level

More information

Kernel Methods and SVMs Extension

Kernel Methods and SVMs Extension Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general

More information

Evaluation for sets of classes

Evaluation for sets of classes Evaluaton for Tet Categorzaton Classfcaton accuracy: usual n ML, the proporton of correct decsons, Not approprate f the populaton rate of the class s low Precson, Recall and F 1 Better measures 21 Evaluaton

More information

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method Appled Mathematcal Scences, Vol. 7, 0, no. 47, 07-0 HIARI Ltd, www.m-hkar.com Comparson of the Populaton Varance Estmators of -Parameter Exponental Dstrbuton Based on Multple Crtera Decson Makng Method

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Lecture 12: Classification

Lecture 12: Classification Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna

More information

/ n ) are compared. The logic is: if the two

/ n ) are compared. The logic is: if the two STAT C141, Sprng 2005 Lecture 13 Two sample tests One sample tests: examples of goodness of ft tests, where we are testng whether our data supports predctons. Two sample tests: called as tests of ndependence

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

CS-433: Simulation and Modeling Modeling and Probability Review

CS-433: Simulation and Modeling Modeling and Probability Review CS-433: Smulaton and Modelng Modelng and Probablty Revew Exercse 1. (Probablty of Smple Events) Exercse 1.1 The owner of a camera shop receves a shpment of fve cameras from a camera manufacturer. Unknown

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.


Introduction to Vapor/Liquid Equilibrium, part 2. Raoult's Law

CE304, Spring 2004, Lecture 4. The simplest model that allows us to do VLE calculations is obtained when we assume that the vapor phase is an ideal gas, and
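The model this snippet begins to state is conventionally written as follows (the standard form of Raoult's law under the ideal-gas and ideal-solution assumptions mentioned; not quoted from the lecture itself):

```latex
% Raoult's law: the partial pressure of species i in the vapor equals its
% liquid-phase mole fraction times its pure-component saturation pressure.
y_i P = x_i P_i^{\mathrm{sat}}, \qquad i = 1, \dots, N
```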


Problem Set #7 solutions. 7.2 (a) Find the coefficient of $z^k$ in $(z^4 + z^5 + z^6 + z^7 + \cdots)^5$ for $k \geq 20$. Using the known series expansion $(1-z)^{-(l+1)} = \sum_n \binom{n+l}{l} z^n$: $(z^4 + z^5 + z^6 + z^7 + \cdots)^5 = z^{20}(1 + z + z^2 + \cdots)^5 = z^{20}(1-z)^{-5}$, so the coefficient of $z^k$ is $\binom{(k-20)+4}{4}$.


SIMPLE LINEAR REGRESSION

Simple Linear Regression and Correlation. Introduction: previously, our attention has been focused on one variable, which we designated by x. Frequently, it is desirable to learn something about the relationship between two


LINEAR REGRESSION ANALYSIS. MODULE IX, Lecture 30: Multicollinearity

Dr. Shalabh, Department of Mathematics and Statistics, Indian Institute of Technology Kanpur. Remedies for multicollinearity: various techniques have


Composite Hypotheses Testing

In many hypothesis testing problems there are many possible distributions that can occur under each of the hypotheses. The output of the source is a set of parameters (points in a parameter


A Robust Method for Calculating the Correlation Coefficient

E.B. Niven and C.V. Deutsch. Relationships between primary and secondary data are frequently quantified using the correlation coefficient; however, the traditional
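The paper's specific robust estimator is not visible in this fragment; as a generic illustration of the underlying issue, one can compare the traditional Pearson coefficient with a rank-based (Spearman) coefficient on data containing a gross outlier:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.9 * x + rng.normal(scale=0.3, size=100)
x[0], y[0] = 10.0, -10.0                 # inject one gross outlier

print(stats.pearsonr(x, y)[0])           # pulled down sharply by the outlier
print(stats.spearmanr(x, y)[0])          # rank-based, barely affected
```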


Generalized Linear Methods

1 Introduction. In ensemble methods the general idea is that by using a combination of several weak learners one can make a better learner. More formally, assume that we have a set


Lecture 7: Boltzmann distribution & Thermodynamics of mixing

Prof. Mark W. Tibbitt, ETH Zürich, 13 March 2018, Networks & Gels. 1 Suggested reading: Molecular Driving Forces, Dill and Bromberg: Chapters


FREQUENCY DISTRIBUTIONS

I. Introduction. 1. The idea of a frequency distribution for sets of observations will be introduced, together with some of the mechanics for constructing distributions of data. Then


VQ widely used in coding speech, image, and video

Scalar quantizers are special cases of vector quantizers (VQ): they are constrained to look at one sample at a time (memoryless). VQ does not have such a constraint, so better rate-distortion (RD) performance is expected. Source coding


Chapter 5: Solution of System of Linear Equations. Module No. 6: Solution of Inconsistent and Ill Conditioned Systems

Numerical Analysis by Dr. Anita Pal, Assistant Professor, Department of Mathematics, National Institute of Technology Durgapur, Durgapur-713209, email: anita.bue@gmail.com


ANSWERS. Problem 1. ... and the moment generating function (mgf) defined by $M_U(t) = E[e^{tU}]$, defined for any real t; use this to show that $E(U)$ and $\mathrm{var}(U)$ ...

Econ 413 Exam 13 H, ANSWERS. The set is divided into 9 sub-problems (A, B, C, ...), which are all recommended to count equally, to make it a bit easier to pass. Answers are given in ... Unfortunately, there is a printing error in the hint of


Bootstrap aggregating (Bagging)

An ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms. It can be used in both regression and classification, reduces variance, and helps to avoid


Manning & Schuetze, FSNLP (c)1999, 2001

16.2 Maximum Entropy Modeling: ... a decision tree that detects spam. Finding the right features is paramount for this task, so design your feature set carefully. Exercise


Chapter 9: Statistical Inference and the Relationship between Two Variables

Key words: the regression model, the sample regression equation, the Pearson correlation coefficient. Learning outcomes: after studying this chapter,


1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations

Physics 171/271, David Kleinfeld, Fall 2005 (revised Winter 2011). We consider a network of many neurons, each of which obeys


Lecture 2: Prelude to the big shrink

Last time: a slight detour with visualization tools (hey, it was the first day... why not start out with something pretty to look at?). Then, we considered a simple 120a-style regression


Econ107 Applied Econometrics. Topic 3: Classical Model (Studenmund, Chapter 4)

I. Classical Assumptions. We have defined OLS and studied some algebraic properties of OLS. In this topic we will study statistical properties


On the correction of the h-index for career length

by L. Egghe, Universiteit Hasselt (UHasselt), Campus Diepenbeek, Agoralaan, B-3590 Diepenbeek, Belgium, and Universiteit Antwerpen (UA), IBW, Stadscampus, Venusstraat


CSci 6974 and ECSE 6966: Math. Tech. for Vision, Graphics and Robotics. Lecture 21, April 17, 2006: Estimating A Plane Homography

Overview: we continue with a discussion of the major issues, using estimation of plane projective


Linear Approximation with Regularization and Moving Least Squares

Igor Grešovnik, May 2007, Revision 4.6 (first revision March 2004). Contents: Linear Fitting; Weighted Least Squares in Function Approximation...


THE CELLULAR METHOD

In this lecture, we introduce the cellular method as an approach to incidence geometry theorems like the Szemerédi-Trotter theorem. The method was introduced in the paper Combinatorial complexity


arXiv:cs.CV/0006040, 28 Jun 2000

Correlation over Decomposed Signals: A Non-Linear Approach to Fast and Effective Sequences Comparison. Luciano da Fontoura Costa, Cybernetic Vision Research Group, IFSC, University of São


Supplementary Notes for Chapter 9: Mixture Thermodynamics

Key points. Nine major topics of Chapter 9 are reviewed below: 1. Notation and operational equations for mixtures. 2. PVTN EOSs for mixtures. 3. General effects


Uncertainty in measurements of power and energy on power networks

E. Manov, N. Kolev, Department of Measurement and Instrumentation, Technical University Sofia, bul. Kliment Ohridski No. 8, Sofia, Bulgaria. Tel./fax:


Statistical Inference. 2.3 Summary Statistics: Measures of Center and Spread

Ismor Fischer. Distribution of a discrete or continuous POPULATION random variable, numerical. True center = ???, true spread = ??? — parameters (population characteristics)


Basically, if you have a dummy dependent variable you will be estimating a probability.

ECON 497: Research and Forecasting, Metropolitan State University. Lecture Notes 13: Dummy Dependent Variable Techniques (Studenmund, Chapter 13).


Notes on Frequency Estimation in Data Streams

In (one of) the data streaming model(s), the data is a sequence of arrivals $a_1, a_2, \ldots, a_m$ of the form $a_j = (i, v)$, where $i$ is the identity of the item and belongs to


Week 3, Chapter 4: Motion in Two Dimensions. Position and Displacement. Instantaneous Velocity. Average Velocity

Lecture quiz: a particle confined to motion along the x axis moves with constant acceleration from x = 2.0 m to x = 8.0 m during a 1-s time interval. The velocity of the particle


Global Sensitivity. Tuesday 20th February, 2018

1) Local Sensitivity. Most sensitivity analyses are based on local estimates of sensitivity, typically by expanding the response in a Taylor series about some specific values


The Order Relation and Trace Inequalities for Hermitian Operators

International Mathematical Forum, Vol. 3, 08, no., 507-57, HIKARI Ltd, www.m-hikari.com, https://doi.org/0988/mf088055. Y. Huang, School of Information Science


Module 9, Lecture 6: Duality in Assignment Problems

In this lecture we attempt to answer a few other important questions posed in an earlier lecture for the assignment problem (AP) and see how some of them can be explained through the concept


9 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations

Physics 171/271, Chapter 9R, David Kleinfeld, Fall 2005. We consider a network of many neurons, each of which obeys a set


Convergence of random processes

DS-GA 1002 Lecture notes 6, Fall 2016. 1 Introduction. In these notes we study convergence of discrete random processes. This allows us to characterize phenomena such as the law of large


E395 Pattern Recognition. Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

Preface. This document is a solution manual for selected exercises from Introduction to Pattern Recognition


Lectures, Week 4: Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Matrix norms. Now we turn to associating a number to each matrix. We could


Chapter 3: Temperature. Heat Engine

In previous chapters of these notes we introduced the Principle of Maximum Entropy as a technique for estimating probability distributions consistent with constraints. In Chapter 9 we discussed the


Figure 2.1.1: (a) a material particle in a body, (b) a place in space, (c) a configuration of the body

Section 2.1: Motion. The Material Body and Motion. Physical materials in the real world are modeled using an abstract mathematical entity called a body. This body consists of an infinite number of material particles. Shown


Department of Statistics, University of Toronto. STA305H1S / 1004 HS: Design and Analysis of Experiments. Term Test, Winter, Solution

February. Last Name: First Name: Student Number: Instructions: Time: hours. Aids: a non-programmable


CHAPTER 5: NUMERICAL EVALUATION OF DYNAMIC RESPONSE

An analytical solution is usually not possible when the excitation varies arbitrarily with time or if the system is nonlinear. Such problems can be solved by numerical time-stepping
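As an illustration of numerical time-stepping, here is a central-difference march for an undamped single-degree-of-freedom system m·u'' + k·u = p(t) (our own sketch; the chapter's exact schemes may differ):

```python
import numpy as np

def central_difference(m, k, p, u0, v0, dt, n):
    """March m*u'' + k*u = p(t) forward with the explicit central-difference
    scheme (stable for dt < 2/omega_n, omega_n = sqrt(k/m))."""
    u = np.zeros(n + 1)
    a0 = (p(0.0) - k * u0) / m                   # initial acceleration
    u[0] = u0
    u_prev = u0 - dt * v0 + 0.5 * dt**2 * a0     # fictitious step u_{-1}
    for i in range(n):
        u_next = 2 * u[i] - u_prev + dt**2 * (p(i * dt) - k * u[i]) / m
        u_prev, u[i + 1] = u[i], u_next
    return u

# Free vibration with omega_n = 2: exact solution is cos(2t).
u = central_difference(m=1.0, k=4.0, p=lambda t: 0.0,
                       u0=1.0, v0=0.0, dt=0.01, n=500)
print(u[-1])   # close to cos(10) at t = 5
```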


Foundations of Arithmetic

Notation. We shall denote the sum and product of numbers in the usual notation as $a_1 + a_2 + a_3 + \cdots + a_n = \sum_{i=1}^{n} a_i$ and $a_1 a_2 a_3 \cdots a_n = \prod_{i=1}^{n} a_i$. The notation $a \mid b$ means a divides b, i.e. $ac = b$ where c is an


BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS

M. Krishna Reddy, B. Naveen Kumar and Y. Ramu, Department of Statistics, Osmania University, Hyderabad-500 007, India. nanbyrozu@gmail.com, ramu0@gmail.com


$x_{i1} = 1$ for all $i$ (the constant).

Chapter 5: The Multiple Regression Model. Consider an economic model where the dependent variable is a function of K explanatory variables: $y = f(x_1, x_2, \ldots, x_K)$. Approximate this by


Count Data Models

See book Chapter 11, 2nd edition (Chapter 10, 1st edition). Count data consist of non-negative integer values. Examples: the number of driver route changes per week, the number of trip departure changes


Contents: 1. Inference on Regression Parameters: a. finding mean, s.d. and covariance amongst estimates. 2. Confidence Intervals and Working-Hotelling Bands. 3. Cochran's Theorem. 4. General Linear Testing. 5. Measures of


Economics 130. Lecture 4: Simple Linear Regression Continued

Readings for Week 4: text, Chapter and 3. We continue with addressing our second issue and add in how we evaluate these relationships: Where do we get data to do this analysis? How do


Hopfield Training Rules

To memorise a single pattern, suppose we set the weights thus: $w_{ij} = \frac{1}{N} p_i p_j$, where $w_{ij}$ is the weight between nodes $i$ and $j$, $N$ is the number of nodes in the network, and $p_i$ is the value required for the $i$-th node. What will the


Lecture 12: Discrete Laplacian

Scribe: Tianye Lu. Our goal is to come up with a discrete version of the Laplacian operator for triangulated surfaces, so that we can use it in practice to solve related problems. We are mostly


EPR Paradox and the Physical Meaning of an Experiment in Quantum Mechanics

Vesselin C. Noninski, vesselinnoninski@verizon.net. Abstract: It is shown that there is one purely deterministic outcome when measurement is made on


Lecture 4. Instructor: Haipeng Luo

In the following lectures, we focus on the expert problem and study more adaptive algorithms. Although Hedge is proven to be worst-case optimal, one may wonder how well it would


Problem Set 9 Solutions

Design and Analysis of Algorithms, May 4, 2015. Massachusetts Institute of Technology, 6.046J/18.410J, Profs. Erik Demaine, Srini Devadas, and Nancy Lynch. This problem


Analysis of Variance and Design of Experiments-I. MODULE VII, Lecture 3: ANALYSIS OF COVARIANCE

Dr. Shalabh, Department of Mathematics and Statistics, Indian Institute of Technology Kanpur. Any scientific experiment is performed


Lecture 10: Support Vector Machines II

22 February 2016. Taylor B. Arnold, Yale Statistics, STAT 365/665. Notes: Problem 3 is posted and due this upcoming Friday. There was an early bug in the fake-test data; fixed


NP-Completeness: Proofs

Proof methods. A method to show a decision problem $\Pi$ NP-complete is as follows. (1) Show $\Pi \in \mathrm{NP}$. (2) Choose an NP-complete problem $\Pi'$. (3) Show $\Pi' \leq \Pi$. A method to show an optimization problem


MDL-Based Unsupervised Attribute Ranking

Zdravko Markov, Computer Science Department, Central Connecticut State University, New Britain, CT 06050, USA. http://www.cs.ccsu.edu/~markov/ markovz@ccsu.edu


Section 8.3: Polar Form of Complex Numbers

From previous classes, you may have encountered imaginary numbers (the square roots of negative numbers) and, more generally, complex numbers, which are the


Workshop: Approximating energies and wave functions. Quantum aspects of physical chemistry

http://quantum.bu.edu/pltl/6/6.pdf. Copyright 2005 Dan Dill (dan@bu.edu), Department


Open Systems: Chemical Potential and Partial Molar Quantities. Chemical Potential

For closed systems, we have derived the following relationships: $dU = T\,dS - p\,dV$; $dH = T\,dS + V\,dp$; $dA = -S\,dT - p\,dV$; $dG = V\,dP - S\,dT$. For open systems,


Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests

Steele, M., Chaseling, J. and Hurst, C. School of Mathematical and Physical Sciences, James Cook University; Australian School of Environmental Studies, Griffith


Chapter 6. Supplemental Text Material

S6-1. Factor Effect Estimates are Least Squares Estimates. We have given heuristic or intuitive explanations of how the estimates of the factor effects are obtained in the textbook.


Time-Varying Systems and Computations, Lecture 6

Klaus Diepold, 14 January 2014. The Kalman Filter. The Kalman estimation filter attempts to estimate the actual state of an unknown discrete dynamical system, given noisy
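A minimal scalar Kalman filter sketch for a constant state observed through noise (our own illustration; the lecture's notation and system model may differ):

```python
import random

def kalman_1d(zs, x0=0.0, p0=1.0, q=1e-3, r=0.1):
    """Scalar Kalman filter for the model x_k = x_{k-1} + w, z_k = x_k + v,
    with process variance q and measurement variance r."""
    x, p, out = x0, p0, []
    for z in zs:
        p = p + q               # predict: state unchanged, uncertainty grows
        k = p / (p + r)         # Kalman gain: weight given to the measurement
        x = x + k * (z - x)     # update the estimate toward the measurement
        p = (1 - k) * p         # updated (reduced) uncertainty
        out.append(x)
    return out

random.seed(0)
zs = [5.0 + random.gauss(0, 0.3) for _ in range(50)]   # noisy readings of 5.0
print(kalman_1d(zs)[-1])                                # close to 5.0
```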


Online Appendix to: Axiomatization and measurement of Quasi-hyperbolic Discounting

José Luis Montiel Olea, Tomasz Strzalecki. 1 Sample Selection. As discussed before, our initial sample consists of two groups of subjects. Group


Lecture 3, Stat102, Spring 2007

Chapter 3. 3.1: Introduction to regression analysis. Linear regression as a descriptive technique. The least-squares equations. Chapter 3.3: Sampling distribution of $b_0$, $b_1$. Continued in next lecture


Report on Image Warping

Xuan Nie, Dec. 20, 2004. This document summarizes the algorithms of our image warping solution for further study, and there is a detailed description of the implementation of these algorithms.


COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

Available online at http://scik.org. J. Math. Comput. Sci. 3 (3), No., 6-3, ISSN: 97-537.


Linear Regression Analysis: Terminology and Notation

ECON 35*, Section: Basic Concepts of Regression Analysis. Consider the generic version of the simple (two-variable) linear regression model. It is represented


Lecture 6: Introduction to Linear Regression

Ani Manichaikul, amancha@jhsph.edu, 24 April 2007. Linear regression: main idea. Linear regression can be used to study an outcome as a linear function of a predictor. Example:


Limited Dependent Variables

What if the left-hand-side variable is not a continuous thing spread from minus infinity to plus infinity? That is, given a model $y = f(x, \beta, \varepsilon)$, where a. $y$ is bounded below at zero, such as wages


U-Pb Geochronology Practical: Background

Basic concepts. Accuracy: a measure of the difference between an experimental measurement and the true value. Precision: a measure of the reproducibility of the experimental result


5 Analysis of Variance (ANOVA). 5.1 Introduction. 5.2 Fixed Effects ANOVA

ANOVA is a way to estimate and test the means of multiple populations. We will start with one-way ANOVA. If the populations included in the study are selected


SIO 224. $m(r) = (\rho(r), k_s(r), \mu(r))$

1. A brief look at resolution analysis. Here is some background for the Masters and Gubbins resolution paper. Global Earth models are usually found iteratively by assuming a starting model and finding small


STAT 511 FINAL EXAM, Spring 2001

Instructions: This is a closed book exam. No notes or books are allowed. You may use a calculator but you are not allowed to store notes or formulas in the calculator. Please write


Theory and Applications of Pattern Recognition. Lecture 4: Bayes Classification Rule

2003, Robi Polikar, Rowan University, Glassboro, NJ. Dept. of Electrical and Computer Engineering, 0909.40.0 / 0909.504.04


Grover's Algorithm + Quantum Zeno Effect + Vaidman Bomb

CS 294-2, 10/12/04, Fall 2004, Lecture 11. Grover's algorithm: recall that Grover's algorithm for searching over a space of size N works as follows: consider the


Learning from Data 1: Naive Bayes

David Barber, dbarber@anc.ed.ac.uk, course page: http://anc.ed.ac.uk/~dbarber/lfd1/lfd1.html. (c) David Barber 2001, 2002. 1 Why


Inductance Calculation for Conductors of Arbitrary Shape

CRYO/02/028, April 5, 2002. L. Bottura. Distribution: Internal. Summary: In this note we describe a method for the numerical calculation of inductances among conductors
