A probabilistic justification for using tf×idf term weighting in information retrieval

Int J Digit Libr (2000) 3. Springer-Verlag

Djoerd Hiemstra
Centre for Telematics and Information Technology, University of Twente, The Netherlands; hiemstra@ctit.utwente.nl

Received: 17 December 1998 / Revised: 31 May 1999

Abstract. This paper presents a new probabilistic model of information retrieval. The most important modeling assumption made is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but is essential in the field of statistical natural language processing. Advances already made in statistical natural language processing will be used in this paper to formulate a probabilistic justification for using tf×idf term weighting. The paper shows that the new probabilistic interpretation of tf×idf term weighting might lead to better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination level ranking. A pilot experiment on the TREC collection shows that the linguistically motivated weighting algorithm outperforms the popular BM25 weighting algorithm.

Key words: Information retrieval theory, Statistical information retrieval, Statistical natural language processing

1 Introduction

There are three basic processes an information retrieval system has to support: the representation of documents, the representation of a user request, and the comparison of these two representations. In text retrieval the documents and the user request are expressed in natural language. Although text retrieval has had by far the most attention in the information retrieval community, so far the success of natural language processing techniques has been limited. Most of the effort in the field of text retrieval has been put into the development of statistical retrieval models like the vector space model (proposed by Salton et al. [16]), the classical probabilistic model (proposed by Robertson and Sparck Jones [12]) and more recently the inference network model (proposed by Croft and Turtle [3]).
The application of natural language processing techniques in combination with these models has a solid but limited impact on the performance of text retrieval (footnote 1) [19]. The research does however provide little insight into the question of how to use natural language processing. Natural language processing modules are usually considered as preprocessing steps, that is, they are not included in the model itself. This paper attempts to formulate a model that captures statistical information retrieval and statistical natural language processing in one unifying framework, an approach that others are also beginning to investigate [9, 11]. It is the model itself that explicitly defines how documents and queries should be analysed. This seems a rather trivial requirement, but we claim that this is not the general idea behind the existing models for information retrieval. The (implicit) assumption made by these retrieval models is that some procedure, either manual or automatic, is used to assign index terms to documents. It is the result of this procedure that can be reflected by the model, not the procedure itself.

Footnote 1. Retrieval performance is usually measured in terms of precision (the fraction of the retrieved documents that is actually relevant) and recall (the fraction of the relevant documents that is actually retrieved).

This paper is organised as follows. In Sec. 2 the basics of the linguistically motivated retrieval model are presented. Section 3 gives a new probabilistic interpretation of tf×idf term weighting by using estimation procedures developed in the field of statistical natural language processing. Section 4 presents a number of experiments, one pilot experiment on the relatively outdated Cranfield collection and two additional experiments on the TREC ad hoc and TREC-CLIR collection. Finally, Sec. 5 presents conclusions and plans for future work. These plans include the development of a model for phrases and the development of a model for cross-language information retrieval. An early version of this paper was presented at the Second European Conference on Digital Libraries (ECDL) [5].

2 The basic retrieval model

This paper defines a linguistically motivated model of full text information retrieval. The most important modeling assumption we make is that documents and queries are defined by an ordered sequence of words or terms (footnote 2). This assumption is usually not made in information retrieval. In the models mentioned in the introduction, documents and queries are modeled as unordered collections of terms or concepts. In the field of statistical natural language processing the word order assumption is essential for many applications, for instance part-of-speech tagging, speech recognition and parsing. By making the ordered sequence of terms assumption we will be able to use advances already made in statistical natural language processing. In this section we will define the framework that will be used in the subsequent sections to give a probabilistic interpretation of tf×idf term weighting.

2.1 An informal description: drawing query terms from a document

Before we describe the new retrieval model mathematically, this section gives an informal description of the underlying ideas. The main goal of an information retrieval system is to find those documents in a document collection that are relevant to a query. A full text retrieval system compares the words in the query with the words in each document to rank the documents. Documents that are likely to be relevant should be ranked at the top and documents that are unlikely to be relevant should be ranked at the bottom of the ranked list. A mathematical model of information retrieval formally defines how the system should perform this ranking, usually based on intuitions or metaphors from some well-understood branch of mathematics.
For example, Salton's vector space model is based on intuitions from geometry: documents and queries are vectors in a high-dimensional space and documents are ranked by the cosine of the angle that separates the document vector and the query vector. The Robertson/Sparck-Jones probabilistic model is based on the intuition that a system can learn from the distribution of terms over relevant and non-relevant documents which documents are probably relevant to a query.

Footnote 2. In the linguistically motivated model terms and words are equivalent; both expressions will be used in this paper. A classical index term that consists of more than one word will be called a phrase.

We will use probability theory in a different way here by using a metaphor that is very similar to the "sampling coloured balls from urns" examples that are often used in introductory statistics courses [10]. Instead of drawing balls at random with replacement from an urn, we will consider the process of drawing words at random with replacement from a document. Suppose someone selects one document in the document collection; draws at random, one at a time, with replacement ten words from this document and hands those ten words (the query terms) over to the system. The system now can make an educated guess as to which document the words came from, by calculating for each document the probability that the ten words were sampled from it and by ranking the documents accordingly. This metaphor for information retrieval was introduced by Ponte and Croft [11]. The intuition behind it is that users have a reasonable idea of which terms are likely to occur in documents of interest and will choose query terms accordingly.

The metaphor is a very powerful one as it can be extended in various ways. Because of the sequential nature of the sampling process, it can be extended to model phrases as done by Miller, Leek and Schwartz [9]. It can be extended to Boolean queries by treating the sampling process as an AND-query and allowing that each draw is specified by a disjunction of more than one term.
For example, the probability of first drawing the term "information" and then drawing either the term "retrieval" or the term "filtering" from a document can be calculated by the model introduced in this paper without any additional modeling assumptions. Furthermore, it can be extended with additional statistical processes to model differences between the vocabulary of the query and the vocabulary of the documents. For instance, for cross-language retrieval, statistical translation can be added to the process of sampling terms from a document: e.g., first an English word is sampled from the document, and then this word is translated to Dutch with some probability that can be estimated from a parallel corpus. Evaluations of Boolean queries and statistical translation are described in [6, 7].

In this paper we will focus on the basics of the new retrieval model by defining it mathematically and by stating how it relates to existing tf×idf term weighting algorithms. We will deal with the mathematical details of the extensions of the model in future publications.

2.2 The sample space

We assume that a collection consists of a finite number of textual documents. The documents are written in a language that consists of a finite number of words or terms.

Definition 1. Let $P$ be a probability function on the joint sample space $\Omega_D \times \Omega_T$. Let $\Omega_D$ be a discrete sample space that contains a finite number of points $d$ such that each $d$ refers to an actual document in the document collection. Let $D$ be a discrete random variable over $\Omega_D$. Let $\Omega_T$ be a discrete sample space that contains a finite number of points $t$ such that each $t$ refers to an actual term that is used to represent the documents. Let $T$ be a discrete random variable over $\Omega_T$.

In other words, the random variable $D$ refers to a document id and the random variable $T$ refers to an index term.

2.3 Modeling documents and queries

Queries will be modeled as compound events. A compound event is an event that consists of two or more single events, as when a die is tossed twice or three cards are drawn one at a time from a deck [10]. The single events that define the compound event are the query terms. In general the probability of a compound event does depend on the order of the single events. For example, a query of length $n$ is modeled by an ordered sequence of $n$ single terms $T_1, T_2, \ldots, T_n$. Given a document id $D$ the probability of the ordered sequence will be defined by $P(T_1, T_2, \ldots, T_n \mid D)$. Most practical models for information retrieval assume independence between index terms. Assuming conditional independence of terms given a document id leads to the following model.

$$P(T_1, T_2, \ldots, T_n \mid D) = \prod_{i=1}^{n} P(T_i \mid D) \qquad (1)$$

Note that the assumption of independence between query terms does not contradict the assumption that terms in queries have a particular order. The independence assumption merely states that every possible order of terms has the same probability. It is made to illustrate that a simple version of the linguistically motivated model is very similar to existing information retrieval models.

2.4 The matching process

Equation (1) can be used directly to rank documents given a query $T_1, T_2, \ldots, T_n$. It might however be interesting to rewrite (1) to a probability measure that explicitly ranks documents given a query: $P(D \mid T_1, T_2, \ldots, T_n)$. This measure can be related to (1) by applying Bayes' rule.

$$P(D \mid T_1, T_2, \ldots, T_n) = P(D)\,\frac{P(T_1, T_2, \ldots, T_n \mid D)}{P(T_1, T_2, \ldots, T_n)} \qquad (2)$$

$$= P(D)\,\frac{\prod_{i=1}^{n} P(T_i \mid D)}{P(T_1, T_2, \ldots, T_n)} \qquad (3)$$

Equation (2) is the direct result of applying Bayes' rule.
Filling in the independence assumption of (1) leads to (3). It seems tempting to make the assumption that terms are also independent if they are not conditioned on a document $D$. This will however lead to an inconsistency of the model (e.g., see Cooper's paper on modeling assumptions for the classical probabilistic retrieval model [2]). Since $\sum_d P(D = d \mid T_1, T_2, \ldots, T_n) = 1$ we can scale the formula using a constant $C$ such that $1/C = \sum_d P(D = d) \prod_{i=1}^{n} P(T_i \mid D = d)$, which is the denominator $P(T_1, T_2, \ldots, T_n)$ of (3).

$$P(D \mid T_1, T_2, \ldots, T_n) = C\,P(D) \prod_{i=1}^{n} P(T_i \mid D) \qquad (4)$$

Equation (4) defines the ranking formula of the linguistically motivated probabilistic retrieval model if we assume term independence.

3 Estimating the probabilities

The process of probability estimation defines how probabilities should be estimated from the frequency of terms in the actual document collection. We will look at the estimation process by drawing a parallel to statistical natural language processing and corpus linguistics.

3.1 Viewing documents as language samples

The general idea is the following. Each document contains a small sample of natural language. For each document the retrieval system should build a little statistical language model $P(T \mid D)$ where $T$ is a single event. Such a language model might indicate that the author of that document used a certain word 5 out of 1000 times; it might indicate that the author used a certain syntactic construction like a phrase 5 out of 1000 times; or ultimately indicate that the author used a certain logical semantic structure 5 out of 1000 times.

One of the main problems in statistical natural language processing and corpus linguistics is the problem of sparse data. If the sample that is used to estimate the parameters of a language model is small, then many possible language events never take place in the actual data. Suppose for example that an author wrote a document about information retrieval without using the words "keyword" and "crocodile". The reason that the author did not mention the word "keyword" is probably different from the reason for not mentioning the word "crocodile".
If we were able to ask an expert in the field of information retrieval to estimate probabilities for the terms "keyword" and "crocodile" he or she might for example indicate that the chance that the term "keyword" occurred is one in a thousand terms and the chance that the term "crocodile" occurred is much lower: one in a million. If we however base the probabilities on the frequency of terms in the actual document then the probability estimates of low-frequency and medium-frequency terms will be unreliable. A full text information retrieval system based on these frequencies cannot distinguish between words that were not used by chance, like the word "keyword", and words that were not used because they are not part of the vocabulary of the subject, like the word "crocodile". Furthermore there is always a small chance that completely off-the-subject words occur, like the word "crocodile" in this paper.

We believe that the sparse data problem is exactly the reason that it is hard for information retrieval systems to obtain high recall values without degrading values for precision. Many solutions to the sparse data problem were proposed in the field of statistical natural language processing (e.g., see [8] for an overview). We will use the combination of estimators by linear interpolation to estimate parameters of the probability measure $P(T \mid D)$.

3.2 Estimating probabilities from sparse data

Perhaps the most straightforward way to estimate probabilities from frequency information is maximum likelihood estimation [10]. A maximum likelihood estimate makes the probability of observed events as high as possible and assigns zero probability to unseen events. This makes the maximum likelihood estimate unsuitable for directly estimating $P(T \mid D)$. One way of removing the zero probabilities is to mix the maximum likelihood model of $P(T \mid D)$ with a model that suffers less from sparseness, like the marginal $P(T)$. It is possible to make a linear combination of both probability estimates so that the result is another probability function. This method is called linear interpolation:

$$P_{li}(T \mid D) = \alpha_1 P_{mle}(T) + \alpha_2 P_{mle}(T \mid D), \qquad 0 < \alpha_1, \alpha_2 < 1 \text{ and } \alpha_1 + \alpha_2 = 1 \qquad (5)$$

The weights $\alpha_1$ and $\alpha_2$ might be set by hand, in which case we would choose them in such a way that $\alpha_1 P_{mle}(T = t)$ is smaller than $\alpha_2 P_{mle}(T = t \mid D)$ for each term $t$. This will give terms that did not appear in the document a much smaller probability than terms that did appear in the document. In general one wants to find the combination of weights that works best, for example by optimising them on a test collection consisting of documents, queries and corresponding relevance judgements.

Table 1 lists the frequencies that are used to estimate the probabilities of the model.
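The zero-probability problem of maximum likelihood estimation can be illustrated with a small sketch. It assumes the maximum likelihood estimate of $P(T \mid D)$ is the relative frequency of a term in the document and combines it with the independence assumption of (1). The toy documents and the function name `mle_query_likelihood` are hypothetical and only serve the illustration.

```python
from collections import Counter

def mle_query_likelihood(query_terms, doc_terms):
    """Eq. (1) with P(T|D) estimated as relative term frequency (assumed MLE)."""
    counts = Counter(doc_terms)
    length = len(doc_terms)
    probability = 1.0
    for term in query_terms:
        # A term that never occurs in the document gets probability zero.
        probability *= counts[term] / length
    return probability

# Hypothetical toy documents, both about information retrieval.
docs = {
    "d1": "information retrieval with keyword search".split(),
    "d2": "information retrieval models rank documents".split(),
}
query = ["keyword", "retrieval"]
scores = {d: mle_query_likelihood(query, terms) for d, terms in docs.items()}
print(scores)
```

Document d2 receives a score of exactly zero because it does not contain "keyword", even though it matches the other query term. One unseen term wipes out all other evidence, which is the behaviour that linear interpolation is meant to remove.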
Two frequencies are particularly important: the term frequency and the document frequency. The term frequency of a term is defined by the number of times a term appears in a document and can be viewed as local or document-specific information. Given a specific document many terms will have a frequency of zero, so the term frequency suffers from sparseness. The document frequency of a term is defined by the number of documents in which a term appears and can be viewed as global information. (Sometimes document frequency is referred to as collection frequency.) The document frequency of a term will never be zero, because by Definition 1, terms that do not appear in any document will not be included in the model. The sparseness problem can be avoided by estimating $P(T \mid D)$ as a linear combination of a probability model based on document frequency and a probability model based on term frequency as in (7):

Table 1. Frequency information

$N$: the number of documents in the collection
$tf(t, d)$: term frequency, the number of times the term $t$ appears in the document $d$
$df(t)$: document frequency, the number of documents in which the term $t$ appears

$$P(D = d) = \frac{1}{N} \qquad (6)$$

$$P(T_i = t_i \mid D = d) = \alpha_1 \frac{df(t_i)}{\sum_t df(t)} + \alpha_2 \frac{tf(t_i, d)}{\sum_t tf(t, d)} \qquad (7)$$

Note that term frequency and document frequency are not derived from the same distribution. Although the term frequency can also be used to compute global information of a term by summing over all possible documents, this information will usually not be the same as the document frequency of a term, more formally: $df(t) \neq \sum_d tf(t, d)$.

Equations (4) and (7) define the ranking algorithm. The formula bears some resemblance to the ranking formula used by Miller, Leek and Schwartz [9]. They showed that the model can be interpreted as a two-state hidden Markov model in which $\alpha_1$ and $\alpha_2$ define the state transitions.

3.3 Relation to tf×idf

The use of term frequency and document frequency to rank documents was extensively studied, especially by Salton et al., for the vector space model [15].
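The ranking algorithm defined by (4), (6) and (7) can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: the helper names, the toy collection and the default value of `alpha1` are assumptions, and the constant $C$ of (4) is dropped because it does not change the ranking.

```python
from collections import Counter

def build_stats(docs):
    """Return term frequencies per document and document frequencies per term."""
    tf = {d: Counter(terms) for d, terms in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each document counts once per distinct term
    return tf, df

def p_term(term, d, tf, df, alpha1):
    """Eq. (7): interpolate the global (df) and local (tf) estimates."""
    alpha2 = 1.0 - alpha1
    global_part = df[term] / sum(df.values())
    local_part = tf[d][term] / sum(tf[d].values())
    return alpha1 * global_part + alpha2 * local_part

def rank(query, docs, alpha1=0.85):
    """Rank by P(D) * prod_i P(T_i | D), i.e. eq. (4) without the constant C.

    Query terms are assumed to occur somewhere in the collection (Definition 1).
    """
    tf, df = build_stats(docs)
    prior = 1.0 / len(docs)  # eq. (6)
    scores = {}
    for d in docs:
        score = prior
        for term in query:
            score *= p_term(term, d, tf, df, alpha1)
        scores[d] = score
    return sorted(scores, key=scores.get, reverse=True), scores

# Hypothetical toy collection.
docs = {
    "d1": "information retrieval with keyword search".split(),
    "d2": "information retrieval models rank documents".split(),
    "d3": "crocodile habitats and rivers".split(),
}
ranking, scores = rank(["keyword", "retrieval"], docs)
print(ranking)
```

In this toy collection every document gets a non-zero score, and a document that matches only one query term (d2) is still ranked above a document that matches none (d3).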
Following considerations of the term discrimination model [17], they argued that terms appearing in documents should be weighted proportional to the term frequency and inversely proportional to the document frequency. Weighting schemes that follow this approach are called tf×idf (term frequency × inverse document frequency) weighting schemes. The combination of tf×idf weights and document length normalisation gave the best retrieval results on several test collections, but they were not able to justify their approach by probability theory (which is not a prerequisite for using it in the vector space model anyway):

"... The term discrimination model has been criticised because it does not exhibit well substantiated theoretical properties. This in contrast with the probabilistic model of information retrieval ..."

The lack of theoretical justification of tf×idf weights did not keep developers of the probabilistic model and the inference network model from using them. Robertson et al. [13] justified the use of term frequency in the probabilistic model by approximating a ranking formula that is based on the combination of the probabilistic model and the 2-Poisson model.

There is however a more plausible probabilistic justification of tf×idf weighting, which follows from the linear interpolation estimator of (7). This can be shown by rewriting. Multiplying the ranking formula defined by (4), (6) and (7) with values that are the same for each document will not affect the final ranking, so for each query term we can multiply the ranking formula by $\sum_t df(t) / df(t_i)$ and by $1/\alpha_1$ as follows:

$$P(D = d \mid T_1 = t_1, \ldots, T_n = t_n) \propto \prod_{i=1}^{n} \left( \alpha_1 \frac{df(t_i)}{\sum_t df(t)} + \alpha_2 \frac{tf(t_i, d)}{\sum_t tf(t, d)} \right) \qquad \text{[by (4), (6) and (7)]}$$

$$\propto \prod_{i=1}^{n} \left( \alpha_1 + \alpha_2 \frac{tf(t_i, d)}{\sum_t tf(t, d)} \cdot \frac{\sum_t df(t)}{df(t_i)} \right) \qquad \left[ \times \prod_{i=1}^{n} \frac{\sum_t df(t)}{df(t_i)} \right]$$

$$\propto \prod_{i=1}^{n} \left( 1 + \frac{tf(t_i, d)}{df(t_i)} \cdot \frac{1}{\sum_t tf(t, d)} \cdot \frac{\alpha_2 \sum_t df(t)}{\alpha_1} \right) \qquad \left[ \times \prod_{i=1}^{n} \frac{1}{\alpha_1} \right]$$

The resulting formula can directly be interpreted as a tf×idf weighting algorithm with document length normalisation, because:

- $\alpha_2 \sum_t df(t) / \alpha_1$ is constant for any document $d$ and term $t$
- $tf(t_i, d) / df(t_i)$ is the tf×idf weight of the term $t_i$ in the document $d$
- $1 / \sum_t tf(t, d)$ is the inverse length of the document $d$

Any monotonic transformation of the document ranking function will produce the same ranking of the documents. Instead of the product of weights we could therefore also rank the documents by the sum of logarithmic weights.

$$\sum_{i=1}^{n} \log\left( 1 + \frac{tf(t_i, d)}{df(t_i)} \cdot \frac{1}{\sum_t tf(t, d)} \cdot \frac{\alpha_2 \sum_t df(t)}{\alpha_1} \right) \qquad \text{[log. transformation]}$$

Robertson and Sparck-Jones [12] call the resulting formula a presence weighting scheme (as opposed to a presence/absence weighting scheme) because the formula assigns a zero weight to terms that are not present in a document. Presence weighting schemes can be implemented using the vector product formula as introduced by Salton et al. [15]. The query weights of the vector product formula can be used to account for multiple occurrences of the same term in the query. The resulting vector product version of the ranking formula is displayed in Table 2.

Table 2. Vector product version of weighting algorithm

$$\mathrm{similarity}(Q, D) = \sum_{k=1}^{l} w_{qk} \cdot w_{dk}$$

$$w_{qk} = tf(t_k, q)$$

$$w_{dk} = \log\left( 1 + \frac{tf(t_k, d)}{df(t_k)} \cdot \frac{\alpha_2 \sum_t df(t)}{\alpha_1 \sum_t tf(t, d)} \right)$$

At first glance the constant $\alpha_2 \sum_t df(t) / \alpha_1$ seems to have little impact on the final ranking. But in fact, different values of $\alpha_1$ and $\alpha_2$ will lead to different document rankings. In Sec. 3.5 we will show some effects of different values of $\alpha_1$ and $\alpha_2$ on the ranking of documents, especially for short queries.

As said above, the weighting algorithm can, by the definition of Salton et al., be interpreted as a tf×idf weight with document length normalisation. However, the $\sum_t tf(t, d)$ (that is, the document length) in the denominator of the document term weight in Table 2 is the result of the requirement that probabilities have to sum up to one and not the result of document length normalisation. Document length normalisation is assumed by (6). In fact we might assume that longer documents are more likely to be relevant by using the prior probability of (8).

$$P(D = d) = \frac{\sum_t tf(t, d)}{\sum_{d'} \sum_t tf(t, d')} \qquad (8)$$

This results in a weighting algorithm that cannot be rewritten into the vector product normal form. It can however be implemented fairly easily by initialising similarities to $\log(\sum_t tf(t, d))$ instead of to zero when processing the query. This version of the weighting algorithm was used in experiments reported in Sec. 4.

3.4 A new informal definition of tf×idf weighting

Equation (7) gives rise to a new informal definition of tf×idf weighting. Giving an informal definition after the formal definition seems a bit useless, but we believe that it will help to understand what exactly makes tf×idf weighting successful. The classical definition of tf×idf weighting can be formulated as follows:

Definition 2. The weight of a term that appears in a document should increase with the term frequency of the term in the document and decrease with the document frequency of the term. Terms that do not appear in a document should all get the same weights (zero weights).

An alternative definition is based on (7). If we assume that $\alpha_1\, df(t_i) / \sum_t df(t)$ is much smaller than $\alpha_2\, tf(t_i, d) / \sum_t tf(t, d)$ then it can be formulated as follows:

Definition 3. The weight of a term that appears in a document should increase with the term frequency of the term in the document.
The weight of a term that does not appear in a document should increase with the document frequency of the term.

An example may clarify the implications of both definitions. Suppose the user formulates the query "information retrieval" and there is no document in the collection in which the terms "information" and "retrieval" both appear. Furthermore, suppose that the term "information" is much more common, i.e., has a higher document frequency, than the term "retrieval". Now the system will rank documents containing k occurrences of the term "retrieval" above documents containing k occurrences of the term "information". The classical explanation would be as follows:

Explanation 1. The query term "retrieval" matches better with documents containing "retrieval" than the query term "information" matches with documents containing "information", because "retrieval" has a higher inverse document frequency than "information".

The alternative explanation would be that:

Explanation 2. The query term "information" matches better with documents not containing "information" than the query term "retrieval" matches documents not containing "retrieval", because "information" has a higher document frequency than "retrieval".

There is no a priori reason to prefer one explanation above the other. However, the idea of Definition 3, that global information is only used to weight terms of which there is no local information, might lead to better understanding of probabilistic term weighting in text retrieval.

3.5 The problem of non-coordination level ranking

There is a well-known problem with statistical information retrieval systems that use tf×idf weighting: sometimes documents containing n query terms are ranked higher than documents containing n + 1 query terms. We will call this problem the problem of non-coordination level ranking, in which the coordination level refers to the number of distinct query terms contained in a document. A coordination level ranking procedure will always rank documents containing n + 1 query terms above documents containing n query terms, even if the top documents have little evidence for the presence of n + 1 query terms and lower-ranked documents have a lot of evidence for the presence of n terms.
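The "information retrieval" example of Sec. 3.4 and the rewriting of Sec. 3.3 can be checked numerically. The sketch below implements the document weight of Table 2 and the logarithm of the product of (7). All frequencies, document lengths and names are hypothetical; equal document lengths are assumed so that only the document frequencies differ.

```python
import math

def doc_weight(tf_kd, df_k, doc_len, sum_df, alpha1):
    """w_dk of Table 2: log(1 + tf/df * alpha2*sum_df / (alpha1*doc_len))."""
    alpha2 = 1.0 - alpha1
    return math.log(1.0 + (tf_kd / df_k) * (alpha2 * sum_df) / (alpha1 * doc_len))

def similarity(query_tf, doc_tf, df, sum_df, alpha1):
    """Vector product of Table 2; absent terms contribute log(1) = 0."""
    doc_len = sum(doc_tf.values())
    return sum(q_tf * doc_weight(doc_tf.get(t, 0), df[t], doc_len, sum_df, alpha1)
               for t, q_tf in query_tf.items())

def log_probability(query_tf, doc_tf, df, sum_df, alpha1):
    """Log of the product of eq. (7) over the query terms (prior omitted)."""
    alpha2 = 1.0 - alpha1
    doc_len = sum(doc_tf.values())
    return sum(q_tf * math.log(alpha1 * df[t] / sum_df
                               + alpha2 * doc_tf.get(t, 0) / doc_len)
               for t, q_tf in query_tf.items())

# Hypothetical collection statistics: "information" is ten times more common.
df = {"information": 50, "retrieval": 5}
sum_df = 1000
alpha1 = 0.85
query_tf = {"information": 1, "retrieval": 1}

# Two documents of equal length, each with k = 3 occurrences of one query term.
doc_a = {"retrieval": 3, "other": 97}
doc_b = {"information": 3, "other": 97}

sim_a = similarity(query_tf, doc_a, df, sum_df, alpha1)
sim_b = similarity(query_tf, doc_b, df, sum_df, alpha1)

# The difference between the two scoring functions is the same for every
# document, so both produce the same ranking.
offset_a = log_probability(query_tf, doc_a, df, sum_df, alpha1) - sim_a
offset_b = log_probability(query_tf, doc_b, df, sum_df, alpha1) - sim_b
print(sim_a, sim_b, offset_a, offset_b)
```

The document containing "retrieval" scores higher than the document containing "information", as both explanations predict. The offset equals $\sum_i \log(\alpha_1\, df(t_i) / \sum_t df(t))$, which depends only on the query.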
According to studies of user preferences and evaluations on test collections the problem of non-coordination level ranking becomes particularly apparent if short queries are used [14]. In a lot of practical situations short queries are the rule rather than the exception, especially in situations where there is little or no user training, as with Web-based search engines. For some research groups, the importance of coordination level is the reason for developing ranking methods that are based on the lexical distance of search terms in documents instead of on document frequency of terms [1, 4]. However, as pointed out by experiments of Wilkinson et al. [21], some tf×idf measures (e.g., the measure proposed by Robertson et al. [13]) are more like coordination level ranking than others (e.g., the measure proposed by Salton et al. [15]). Wilkinson et al. showed that weighting measures that are more like coordination level ranking perform better on the TREC collection, especially if short queries are used.

Following the results of Wilkinson et al., it might be useful to investigate what exactly makes a weighting measure like coordination level ranking. The following example may provide some insight. First of all, suppose we use the following ranking formula, which can be derived from the probability ranking function in a similar way as is shown in Sec. 3.3.

$$P(D = d \mid T_1 = t_1, \ldots, T_n = t_n) \propto \prod_{i=1}^{n} \left( \frac{\alpha_1}{\alpha_2 \sum_t df(t)} + \frac{tf(t_i, d)}{df(t_i)} \cdot \frac{1}{\sum_t tf(t, d)} \right) \qquad (9)$$

Now suppose the user enters a small query of only two terms a and b. As in the previous example a might be the term "information" and b might be the term "retrieval". Furthermore, suppose that the document $d_1$ contains a lot of evidence for term a and no evidence for the term b, and that document $d_2$ contains little evidence of both terms. It can be shown that document $d_1$ will have a lot of evidence for a and none for b if $tf(a, d_1)$ is high, $tf(b, d_1) = 0$ and the length of $d_1$ is short. Document $d_2$ contains little evidence of a and b if $tf(a, d_2) = tf(b, d_2) = 1$ and if $d_2$ is a long document.
Now the following inequality defines the requirement for coordination level ranking, that is, the similarity of document $d_1$ to the query "a b" should be smaller than the similarity of document $d_2$ to the query. The left hand side contains the similarity of the query compared to document $d_1$ and the right hand side contains the similarity of the query to document $d_2$. We will use short notations for the document length and the constant of (9): $l(d) = \sum_t tf(t, d)$ and $c = \alpha_1 / (\alpha_2 \sum_t df(t))$.

$$\left( c + \frac{tf(a, d_1)}{df(a)\,l(d_1)} \right)(c + 0) < \left( c + \frac{1}{df(a)\,l(d_2)} \right)\left( c + \frac{1}{df(b)\,l(d_2)} \right)$$

$$c^2 + \frac{c \cdot tf(a, d_1)}{df(a)\,l(d_1)} < c^2 + \frac{c}{df(a)\,l(d_2)} + \frac{c}{df(b)\,l(d_2)} + \frac{1}{df(a)\,df(b)\,l(d_2)^2}$$

$$\frac{c \cdot tf(a, d_1)}{df(a)\,l(d_1)} - \frac{c}{df(a)\,l(d_2)} - \frac{c}{df(b)\,l(d_2)} < \frac{1}{df(a)\,df(b)\,l(d_2)^2}$$

$$c < \frac{l(d_1)}{l(d_2)\left( tf(a, d_1)\,df(b)\,l(d_2) - df(b)\,l(d_1) - df(a)\,l(d_1) \right)} \qquad (10)$$

Equation (10) shows that we can rewrite the requirement for coordination level ranking as a requirement for the constant c. If c is small enough then the problem of non-coordination level ranking will never occur. By changing the value of c the ranking formula can be adapted to different applications. If we are developing a Web-based search engine we might choose a relatively small value for c, but if we are developing a search engine for evaluation in TREC [20] we might choose a higher value of c. For the Web-based search engine we might define a collection-specific lower bound of c by keeping track of the collection extrema like maximum term frequency and document frequency (maxtf and maxdf) and maximum and minimum document lengths (maxl and minl). If we fill in these extrema and the definition $c = \alpha_1 / (\alpha_2 \sum_t df(t))$ in (10) then the lower bound will be defined as follows on the ratio between $\alpha_1$ and $\alpha_2$.

$$\frac{\alpha_1}{\alpha_2} < \frac{minl \cdot \sum_t df(t)}{maxl \cdot (maxtf \cdot maxdf \cdot maxl - maxdf \cdot minl - minl)} \qquad (11)$$

Equation (11) defines a ranking formula that always produces coordination level ranking for queries of two words. For longer queries the bound will be lower and for queries with unrestricted length only $\alpha_1 = 0$ will guarantee coordination level ranking.

3.6 A plausible explanation of non-coordination level ranking

The arguments in the previous section showed the following. The smaller the value of the constant c, the more the ranking formula will behave like coordination level ranking. It is good to note that most tf×idf measures defined for the existing models of information retrieval include constants for which the arguments introduced above also hold (for instance the +0.5 in the Robertson/Sparck Jones formula [12, 13]). However, the classical definition of tf×idf weighting (Definition 2) does not give a plausible explanation of why and when non-coordination level ranking does happen. Using the new Definition 3 and the fact that c is defined by the ratio $\alpha_1 / \alpha_2$ we can give the following explanation of non-coordination level ranking when tf×idf weights are used.

Explanation 3. Non-coordination level ranking occurs if query terms that do not appear in a document are weighted too high compared to query terms that do appear in a document.
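The bound of (10) can be checked with a small numeric sketch. It scores two hypothetical documents with (9) for a value of c just below and just above the bound, and it confirms that (11) is (10) evaluated at the collection extrema and multiplied by $\sum_t df(t)$. All numbers and function names below are assumptions chosen for illustration.

```python
import math

def eq9_score(c, tfs, dfs, doc_len):
    """Eq. (9): product over query terms of c + tf / (df * document length)."""
    score = 1.0
    for tf, df in zip(tfs, dfs):
        score *= c + tf / (df * doc_len)
    return score

def c_bound(tf_a_d1, df_a, df_b, len_d1, len_d2):
    """Right hand side of eq. (10) for tf(a,d2) = tf(b,d2) = 1, tf(b,d1) = 0."""
    bracket = tf_a_d1 * df_b * len_d2 - df_b * len_d1 - df_a * len_d1
    if bracket <= 0:
        return math.inf  # d2 is ranked above d1 for every positive c
    return len_d1 / (len_d2 * bracket)

def alpha_ratio_bound(min_len, max_len, max_tf, max_df, sum_df):
    """Right hand side of eq. (11), a bound on alpha1 / alpha2."""
    return (min_len * sum_df) / (
        max_len * (max_tf * max_df * max_len - max_df * min_len - min_len))

# Hypothetical example: d1 is short with 10 occurrences of a, d2 is long.
tf_a_d1, df_a, df_b, len_d1, len_d2 = 10, 50, 5, 20, 200
bound = c_bound(tf_a_d1, df_a, df_b, len_d1, len_d2)

def d2_wins(c):
    d1 = eq9_score(c, [tf_a_d1, 0], [df_a, df_b], len_d1)
    d2 = eq9_score(c, [1, 1], [df_a, df_b], len_d2)
    return d1 < d2

below = d2_wins(bound / 2)  # coordination level ranking holds
above = d2_wins(bound * 2)  # d1 overtakes d2

# Eq. (11) from eq. (10): worst case tf = maxtf, df(b) = maxdf, df(a) = 1,
# l(d1) = minl, l(d2) = maxl, then multiply by sum_df.
extremes = c_bound(5, 1, 20, 10, 100) * 1000
ratio = alpha_ratio_bound(10, 100, 5, 20, 1000)
print(bound, below, above, ratio)
```

Under these assumed numbers the document containing both query terms wins only while c stays below the bound, which matches the reading of (10) as a requirement on the constant c.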
According to Definition 3, terms that do not appear in a document are weighted proportional to the document frequency. If we choose a relatively high value for the constant $\alpha_1$ then query terms that do not appear in a document will be weighted too high, possibly causing non-coordination level ranking.

4 Experimental results

This section briefly describes the results of a number of experiments with the linguistically motivated term weighting algorithm. A pilot experiment uses the relatively outdated Cranfield collection as reported earlier in [5]. Additional experiments using the TREC collection indicate that the new weighting algorithm outperforms the popular Cornell version of BM25.

4.1 The Cranfield collection

The Cranfield collection is a small collection (1398 documents) with a relatively large number of queries (225 queries). In the experiment we implemented a linguistically motivated probabilistic retrieval engine and a standard vector engine. Both engines used the same tokenisation and stemming of the words in the documents. As a test collection we used the Cranfield collection, which was also used extensively in early experiments with the vector space model [15]. Table 3 lists the non-interpolated average precision averaged over 225 queries of the Cranfield collection for different values of $\alpha_1$ and $\alpha_2$.

Table 3. Experimental results on the Cranfield collection

weight                                   avg. precision
$\alpha_1 = 0.05$, $\alpha_2 = 0.95$     [value missing]
$\alpha_1 = 0.2$, $\alpha_2 = 0.8$       [value missing]
$\alpha_1 = 0.35$, $\alpha_2 = 0.65$     [value missing]
$\alpha_1 = 0.5$, $\alpha_2 = 0.5$       [value missing]
$\alpha_1 = 0.65$, $\alpha_2 = 0.35$     [value missing]
$\alpha_1 = 0.8$, $\alpha_2 = 0.2$       [value missing]
$\alpha_1 = 0.95$, $\alpha_2 = 0.05$     [value missing]

To evaluate how our weighting scheme performs relative to other tf×idf weighting schemes with document length normalisation we implemented the vector space model with tfc.nfx weighting as proposed by Salton and Buckley [15]. The non-interpolated average precision averaged over 225 queries of this system was [value missing] on the Cranfield collection (footnote 3). The linguistically motivated system performs better for quite a wide range of different values of $\alpha_1$ and $\alpha_2$. The best performance in terms of average precision is approximately at $\alpha_1 = 0.85$.

4.2 Coordination level ranking

Cranfield has the following collection extrema: the smallest document is 18 words long, the longest 354 words.
The maximum erm frequency is 28 and he maximum documen frequency 729. Following he argumens of Sec. 3.5 i is possible o calculae a lower bound on he raio beween and α 2 ha will define coordinaion level ranking given a query of lengh 2. This leads 3 Salon and Buckley [15] repor a 3-poin inerpolaed average precision of Our version of heir sysem reaches a 3-poin inerpolaed average precision of which is probably due o he use of a semmer.

8 138 D. Hiemsra: A probabilisicjusificaion for using f idf erm weighing in informaion rerieval o a lower bound of on he raio beween and α 2 which corresponds roughly o = and α 2 = Alhough correc, he lower bound inroduced by (11) is obviously no very useful for idenifying proper values for and α 2. There are several reasons ha migh explain why he sysem performs opimally for much higher values of : 1. Coordinaion level ranking does no lead o good average precision on he Cranfield collecion. 2. The sysem does produce coordinaion level ranking, bu he bound on he raio beween and α 2 is oo low o be of any use. 3. The sysem does produce coordinaion level ranking, bu he bound is no useful because he collecion does no have very small queries (he average query lengh is abou 9.5words). Addiional experimens have o poin ou which reason or reasons acually explain he experimenal resuls he bes. 4.3 The TREC collecion Since he exisence of he TEx Rerieval Conference (TREC), experimens have resuled in weighing algorihms ha perform much beer han fc.nfx weighing. According o Voorhees and Harman [20], odays mos popular weighing algorihm is he Cornell implemenaion of he Okapi BM25 algorihm [13, 18]. Table 4 displays he version of he BM25 weighing algorihm ha was used for comparison. Table 5 liss he evaluaion resuls in erms of average precision of he new weighing algorihm compared o he resuls of he Cornell/BM25 algorihm. We used wo modern es collecions provided by NIST via TREC. One is he ad hoc collecion consising of aricles from he LA Times, he Financial Times, he Federal Regiser and Foreign Broadcas Informaion Service. 4 The oher collecion uses he English documens and English opics 1-24 of he TREC Cross-Language Informaion Rerieval (CLIR) collecion consising of AP Newswire aricles. On boh collecions we used =0.85. Table 4. 
Cornell version of the BM25 weighting algorithm

$$\mathrm{similarity}(q, D) = \sum_{k=1}^{l} w_{qk} \cdot w_{dk}$$

$$w_{qk} = tf(t_k, q)$$

$$w_{dk} = \frac{tf(t_k, d) \cdot \log\left(\dfrac{N - df(t_k) + 0.5}{df(t_k) + 0.5}\right)}{2 \cdot \left(0.25 + 0.75 \cdot \dfrac{\sum_{t} tf(t, d)}{\sum_{t,d} tf(t, d)/N}\right) + tf(t_k, d)}$$

Both weighting algorithms perform approximately equally well on Cranfield, but on the collections of more realistic size the linguistically motivated weighting algorithm outperforms the Cornell/BM25 weighting algorithm. The performance gain of 17% on the ad hoc collection is spectacular. Experiments using the ad hoc topics show a significant, though less spectacular, improvement [7].

⁴ We left out the Congressional Records because NIST decided not to include this subcollection in TREC-7.

Table 5. Performance on three test collections (average precision)

collection     BM25     new algorithm    % diff.
Cranfield      [...]    [...]            [...]
TREC CLIR      [...]    [...]            [...]
TREC ad-hoc    [...]    [...]            [...]

Maybe more important than the results in terms of average precision is the fact that the best value of the unknown parameter of the linguistically motivated weighting algorithm is stable across different test collections. On all three collections the best performance of the linguistically motivated weighting algorithm lies at approximately $\alpha_1 = 0.85$. For $\alpha_1 = 0.8$ and $\alpha_1 = 0.9$ the average precision on the English CLIR collection is respectively [...] and [...], and on the ad hoc collection respectively [...] and [...].

5 Conclusion and future plans

This paper has presented the linguistically motivated probabilistic model of information retrieval. Using estimation by linear interpolation, which is often used in the field of statistical natural language processing, we have been able to present a probabilistic interpretation of tf idf term weighting. We have shown that this new interpretation leads to better understanding of the behaviour of tf idf ranking. Experiments with the TREC collection show that a retrieval system that uses the new model outperforms the same system using the Cornell/BM25 weighting algorithm. This paper has not presented the linguistically motivated model of information retrieval in its full strength.
Although we claim that the most important modeling assumption of the model is that documents and queries are defined by an ordered sequence of terms, the assumption is not essential for the claims made in this paper. In future papers we will investigate two major information retrieval issues that require natural language processing techniques. The first issue is the use of phrases in information retrieval. The second issue is the problem of cross-language information retrieval.

Acknowledgements. The work reported in this paper is funded in part by the Dutch Telematics Institute project DRUID. The reported TREC evaluations were funded in part by the EU projects Twenty-One (IE 2108) and Pop-Eye (LE 4234). I would like to thank the following people for their support: Franciska de Jong, Paul van der Vet and Wilbert Kallenberg for general advice, and David Hawking of the Australian National University for his advice on coordination level ranking. Special thanks go to Wessel Kraaij of TNO-TPD for running the TREC ad hoc experiments.

References

1. Clarke, C.L.A., Cormack, G.V., Tudhope, E.A.: Relevance ranking for one to three term queries. In: Proc. RIAO 97, 1997
2. Cooper, W.S.: Some inconsistencies and misidentified modeling assumptions in probabilistic information retrieval. ACM Trans. Information Systems 13
3. Croft, W.B., Turtle, H.R.: Text retrieval and inference. In: Jacobs, P. (ed.): Text-based Intelligent Systems. Lawrence Erlbaum, 1992
4. Hawking, D., Thistlewaite, P.: Relevance weighting using distance between term occurrences. Technical Report TR-CS-96-08, The Australian National University, 1996. http://cs.anu.edu.au/techreports
5. Hiemstra, D.: A linguistically motivated probabilistic model of information retrieval. In: Nicolaou, C., Stephanidis, C. (eds.): Proc. 2nd European Conference on Research and Advanced Technology for Digital Libraries (ECDL-2), 1998
6. Hiemstra, D., de Jong, F.M.G.: Cross-language retrieval in Twenty-One: using one, some or all possible translations? In: Proc. 14th Twente Workshop on Language Technology (TWLT-14), 1998
7. Hiemstra, D., Kraaij, W.: Twenty-One at TREC-7: Ad-hoc and cross-language track. In: Proc. 7th Text Retrieval Conference (TREC-7). NIST Special Publications
8. Manning, C., Schütze, H. (eds.): Statistical Natural Language Processing: Theory and Practice (draft). http://www.sultry.arts.su.edu.au/manning/courses/statnlp
9. Miller, D.R.H., Leek, T., Schwartz, R.M.: BBN at TREC-7: using Hidden Markov Models for information retrieval. In: Proc. 7th Text Retrieval Conference (TREC-7). NIST Special Publications
10. Mood, A.M., Graybill, F.A. (eds.): Introduction to the Theory of Statistics. Second edition. McGraw-Hill
11. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: Proc. 21st ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 98)
12. Robertson, S.E., Sparck Jones, K.: Relevance weighting of search terms. J. American Society Information Science 27
13. Robertson, S.E., Walker, S.: Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In: Proc. 17th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 94), 1994
14. Rose, D.E., Stevens, C.: V-twin: A lightweight engine for interactive use. In: Voorhees, E.M., Harman, D.K. (eds.): Proc. 5th Text Retrieval Conference (TREC-5), NIST Special Publication, 1997
15. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5)
16. Salton, G., McGill, M.J. (eds.): Introduction to Modern Information Retrieval. McGraw-Hill
17. Salton, G., Yang, C.S.: On the specification of term values in automatic indexing. J. Documentation 29(4)
18. Singhal, A., Salton, G., Mitra, M., Buckley, C.: Document length normalization. Technical Report, Cornell University. http://cs-tr.cs.cornell.edu/
19. Strzalkowski, T., Sparck Jones, K.: NLP track at TREC-5. In: Voorhees, E.M., Harman, D.K. (eds.): Proc. 5th Text Retrieval Conference (TREC-5), NIST Special Publication, 1997
20. Voorhees, E.M., Harman, D.K.: Overview of the 6th text retrieval conference. In: Proc. 6th Text Retrieval Conference (TREC-6), NIST Special Publication, 1998
21. Wilkinson, R., Zobel, J., Sacks-Davis, R.: Similarity measures for short queries. In: Harman, D.K. (ed.): Proc. 4th Text Retrieval Conference (TREC-4), NIST Special Publication, 1996
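
As an illustration of the baseline used for comparison in Sec. 4.3, the Cornell/BM25 weighting of Table 4 might be sketched in Python as follows. This is a minimal sketch and not the implementation used in the experiments: the function name `bm25_similarity`, its argument layout, and the use of plain term lists are hypothetical, and the length-normalisation constants 0.25 and 0.75 are assumed to be the values commonly quoted for the Cornell version of BM25.

```python
import math
from collections import Counter


def bm25_similarity(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len):
    """Score one document against one query with a BM25-style weighting.

    query_terms, doc_terms: lists of (already tokenised) term strings.
    doc_freq: dict mapping a term to the number of documents containing it.
    num_docs: N, the number of documents in the collection.
    avg_doc_len: average document length, measured in terms.
    """
    query_tf = Counter(query_terms)  # w_qk: term frequency in the query
    doc_tf = Counter(doc_terms)      # term frequency in the document
    doc_len = len(doc_terms)

    # Document length normalisation. The constants 0.25 and 0.75 are an
    # assumption (the usual Cornell settings), not values read from Table 4.
    norm = 2.0 * (0.25 + 0.75 * doc_len / avg_doc_len)

    score = 0.0
    for term, w_qk in query_tf.items():
        tf = doc_tf.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue  # only matching terms contribute to the sum
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        w_dk = tf * idf / (norm + tf)
        score += w_qk * w_dk
    return score


# Hypothetical toy collection statistics, for illustration only.
example_score = bm25_similarity(
    ["retrieval"],
    ["retrieval", "model", "retrieval"],
    {"retrieval": 2, "model": 5},
    num_docs=10,
    avg_doc_len=3.0,
)
```

Note that a term occurring in exactly half of the documents receives a weight of zero under this formula, and a term occurring in more than half receives a negative weight, which is a known property of this form of the inverse document frequency component.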
