On Statistical Analysis and Optimization of Information Retrieval Effectiveness Metrics


Jun Wang and Jianhan Zhu
Department of Computer Science, University College London, UK

ABSTRACT

This paper presents a new way of thinking about IR metric optimization. It is argued that the optimal ranking problem should be factorized into two distinct yet interrelated stages: the relevance prediction stage and the ranking decision stage. During retrieval the relevance of documents is not known a priori, and the joint probability of relevance is used to measure the uncertainty about the relevance of the documents in the collection as a whole. The resulting optimization objective function in the latter stage is, thus, the expected value of the IR metric with respect to this probability measure of relevance. Through statistically analyzing the expected values of IR metrics under such uncertainty, we discover and explain some interesting properties of IR metrics that have not been known before. Our analysis and optimization framework do not assume a particular (relevance) retrieval model or metric, making them applicable to many existing IR models and metrics. The experiments on one of the resulting applications have demonstrated its significance in adapting to various IR metrics.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval

General Terms
Algorithms, Experimentation, Measurement, Performance

1. INTRODUCTION

In information retrieval modelling, the main efforts have been devoted to, for a specific information need (query), automatically scoring individual documents with respect to their relevance states. Representative examples include the Probabilistic Indexing model, which studies how likely a query term is to be assigned to a relevant document [7], and the RSJ model, which derives a scoring function on the basis of the log-ratio of the probability of relevance [], to name just a few.
And yet, given the fact that in many practical situations relevance information is not steadily available, major developments have shifted their focus to estimating text statistics in the documents and queries and then building up the link through these statistics [,, 34]. For example, scoring functions such as TF-IDF, the Vector Space Model, and the Divergence from Randomness (DFR) model [] have been developed [6]. A practical approximation of the RSJ model led to the popular BM25 scoring function []. Another direction in probabilistic modelling was to build a language model of a document and assess its likelihood of generating a given query [34]; a query language model is also covered under the Kullback-Leibler divergence based loss function [5].

Despite these efforts on the retrieval side, in the evaluation phase many IR tasks have evaluation criteria that go beyond simply counting the number of relevant documents in a ranked list. Measuring IR effectiveness by different metrics is critical because, for different retrieval goals, we need to capture different aspects of retrieval performance. In the case where the preference goes strongly towards early-retrieved documents, MRR (Mean Reciprocal Rank) is a good measure [8], whereas if we try to capture a broader summary of retrieval performance, MAP (Mean Average Precision) becomes suitable [3]. Thus, there is a gap between the underlying (ranking) decision process of retrieval models and the final evaluation criterion used to measure success in a task. Ideally, it is desirable to have retrieval systems adapt to the specific IR effectiveness metric.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'10, July 19-23, 2010, Geneva, Switzerland. Copyright 2010 ACM.
In fact, IR researchers have already started to explore this opportunity. One extreme case is learning to rank: it directly constructs a document ranking model from training data, bypassing the step of estimating the relevance states of individual documents [8]. Under this paradigm, some attempts have been made to directly optimize IR metrics such as NDCG (Normalized Discounted Cumulative Gain) and MAP [3, 33]. However, it is known that some evaluation metrics are less informative than others [4]. As argued in [3], some IR metrics thus do not necessarily summarize the (training) data well; if we begin optimizing IR metrics right from the data, the statistics of the data may not be fully explored and utilized. A somewhat opposite direction is to still focus on designing a scoring function for a document, but with the acknowledgement of various retrieval goals and the final rank context. The "less is more" model proposed in [] is one such example. By treating the previously retrieved documents as non-relevant when calculating the relevance of documents for the current rank position, the algorithm is shown to be equivalent to maximizing the Reciprocal Rank measure. In [35], a more general and flexible treatment in this direction is proposed. In that framework, Bayesian decision theory is applied to incorporate various ranking strategies through predefined loss functions. Despite its generality, the resulting IR models, however, lack the ability to directly incorporate IR metrics into the ranking decision.

In this paper, we argue that regarding the retrieval task solely as either optimizing IR metrics or deriving a (relevance) scoring function presents a partial view of the underlying problem; a more unified view is to divide the retrieval process into two distinct stages, namely the relevance prediction and the ranking decision optimization stages, and solve them sequentially. In the first stage, the aim is to estimate the relevance of documents as accurately as possible, and to summarize it by the joint probability of the documents' relevance. Only in the second stage is the rank preference specified, possibly by an IR metric. The ranking decision is a stochastic one due to the uncertainty about relevance. As a result, the optimal ranking action is the one that maximizes the expected value of the IR metric. We shall show that statistical analysis of the expected values of IR metrics gives insight into the properties of the metrics. One of the findings is that AP (Average Precision) encourages documents whose relevance is positively correlated with previously retrieved documents, while RR (Reciprocal Rank) does the opposite. It follows that if a ranking achieves superior results on AP, it must pay with inferiority on RR. Apart from the theoretical contribution, our experiments on TREC data sets demonstrate the significance of our probabilistic framework.

The remainder of the paper is organized as follows. We first establish our optimization scheme, and study the major expected IR metrics and practical issues. We then provide an empirical evaluation, and finally conclude our work.

Figure 1: The two distinct stages in the statistical document ranking process.

2. STATISTICAL RANKING MECHANICS

In this section, we present the framework of optimizing IR metrics in the situation where the relevance of documents is unknown. To keep our discussion simple, we consider binary relevance, while the treatment can be extended to graded relevance similarly. Given an information need, let us assume each document in the corpus is either relevant or non-relevant. We denote the relevance states jointly as a vector r = (r_1, ..., r_k, ..., r_N) in {0,1}^N, where k in {1, ..., N} and N denotes the number of documents; r_k = 1 if document k is relevant, and otherwise r_k = 0.
Our view is the following: first, the IR model should focus on estimating the relevance of documents. The relevance in this stage is the true topical relevance [8], different from the user-perceived relevance that will be qualified in the next stage. In statistical modelling, we assign to every possible relevance state r a number p(r|q), which we interpret as the probability that a user who issues query q will find the documents' relevance states to be r. Given the observations so far (the query, the user's interaction, etc.), the posterior probability p(r|q) represents our (or the IR model's) belief about the relevance states of the documents in the collection as a whole. Note that we use the joint distribution of relevance instead of the marginal distribution p(r_k|q) in order to cover the dependency of relevance among documents. It is argued that only in the second stage does the retrieval model make a ranking decision under the uncertainty specified by the joint probability of relevance. To formulate this, we follow the terminology in natural language processing [6]; a ranking order is represented by a vector a = (a_1, ..., a_i, ..., a_N), where a_i in {1, ..., N}. If document k is in rank position i, then a_i = k. The retrieval task is, thus, to find an optimal rank order a that maximizes a certain retrieval objective. Formally, an IR metric (measure) m(a|r) is defined as a score function of a given r. A good metric should be able to measure the user's gain or utility of a rank order a when the true relevance states of all the documents, r, are known. m(a|r) can also be seen as a measure of the user's perceived relevance in the context of a ranked list. For example, Precision favors a solution that finds as many relevant documents as possible in the list regardless of their order, while Reciprocal Rank (the inverse of the rank of the first relevant document retrieved) makes sure the first relevant document is retrieved as early as possible, regardless of the rank positions of the remaining relevant documents.
Given the fact that different IR effectiveness metrics are useful for capturing different aspects of retrieval quality, it is desirable to optimize a with respect to the specific metric m. Bayesian decision theory suggests that the optimal rank order â is obtained by maximizing the expected IR metric:

  â = argmax_a E_r[m|q] = argmax_a Σ_{r in {0,1}^N} m(a|r) p(r|q),   (1)

where E[·|q] denotes an expectation with respect to the conditional distribution p(·|q), and the subscript r indicates that the average is taken over all possible r. Eq. (1) shows that: first, the true relevance state of the documents, r, is generated from the probability p(r|q) estimated by an IR model. Under the relevance state r, the score of a given rank order a is calculated. E_r[m|q], the expected score of the rank order, is obtained by averaging over all possible relevance states r. Finally, the optimal rank order is chosen by maximizing E_r[m|q]. Although the formulation can be thought of as a special instantiation of the general retrieval decision framework in [5, 35], our underlying idea and development are quite different from their instantiated models. The advantage is that, as illustrated in Figure 1, in our framework the IR metric (utility) relies only on the true relevance and the ranking order, while (relevance) IR models are for estimating the relevance. Decoupling them is essential for directly taking any retrieval metric and plugging it into the optimization procedure. More discussion can be found in Section 4. To solve Eq. (1), we analyze the expected IR metrics E_r[m|q] in Section 2.1 and present a practical implementation and maximization (search) method in Section 2.2.

2.1 Analysis of Expected IR Metrics

2.1.1 Expected Average Precision

Average Precision (AP) is a widely-adopted metric. For each query, it is the average of the precision scores obtained at the rank positions where each relevant document is retrieved; relevant documents that are not retrieved receive a precision score of zero [7]. The metric, in fact, is the area under the Precision-Recall curve, capturing a broad summary of retrieval performance with a single value [4].
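Before the analysis of individual metrics, Eq. (1) can be made concrete with a brute-force sketch (my own illustration, not the authors' implementation; the helper names are hypothetical). Both the sum over relevance states and the search over rankings are exponential, so this is only feasible for toy collections:

```python
import itertools

def average_precision(ranking, rel):
    """AP of a ranked list of document ids, given a binary relevance vector."""
    n_rel = sum(rel)
    if n_rel == 0:
        return 0.0
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if rel[doc]:
            hits += 1
            total += hits / i          # precision at each relevant position
    return total / n_rel

def expected_metric(ranking, joint, metric):
    """E_r[m(a|r)]: average the metric over relevance states r ~ p(r|q)."""
    return sum(p * metric(ranking, rel) for rel, p in joint.items())

def optimal_ranking(n_docs, joint, metric):
    """Eq. (1): argmax over all rankings of the expected metric (toy sizes only)."""
    return max(itertools.permutations(range(n_docs)),
               key=lambda a: expected_metric(a, joint, metric))
```

For two documents whose joint distribution gives document 1 the higher marginal probability of relevance, the optimal expected-AP ranking places document 1 first, as one would expect.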
By definition, the Average Precision measure is as follows:

  m_A(a|r) = (1/N_R) Σ_{i=1}^{M} (r_{a_i}/i) (1 + Σ_{j=1}^{i-1} r_{a_j}),   (2)

where M ≤ N (and Σ_{j=1}^{i-1} r_{a_j} = 0 when i = 1). N_R is the number of relevant documents, and its expected value equals Σ_{i=1}^{N} p(r_{a_i} = 1), the summation of the marginal probabilities of relevance. For simplicity, we write p(r_{a_i} = 1) as p(r_{a_i}) in the remainder of the paper. Because r is hidden during retrieval, m_A(a|r) cannot be calculated exactly. Instead, its expected value under the joint probability of relevance is derived by making use of the properties of expectation. (Throughout this paper the expectation is always conditioned on a given query q and taken with respect to r; for simplicity, we drop the subscript r and the notation q in E[·] from now on.)

  E[m_A] = Σ_{N_R} p(N_R|q) E[m_A | N_R]   (3)

Figure 2: (a) The adaptive weight w_i^A of the expected Average Precision, (b) the adaptive weight w_i^R of the expected Reciprocal Rank, and (c) a comparison of the weights in different expected IR metrics.

  = Σ_{N_R} p(N_R|q) (1/N_R) Σ_{i=1}^{M} (1/i) ( E[r_{a_i}|N_R] + Σ_{j=1}^{i-1} E[r_{a_i} r_{a_j}|N_R] )

  = Σ_{N_R} p(N_R|q) (1/N_R) Σ_{i=1}^{M} (1/i) ( E[r_{a_i}|N_R] + Σ_{j=1}^{i-1} ( Cov(r_{a_i}, r_{a_j}) + E[r_{a_i}] E[r_{a_j}] ) ),

where Cov(r_{a_i}, r_{a_j}) denotes the covariance between the relevance values of the documents at ranks i and j, given that the number of relevant documents is N_R. Eq. (3) shows that the expected AP can be interpreted as follows: for the given query, an IR model first estimates the number of relevant documents in the collection, and then estimates the expected AP for that number of relevant documents. The final expected measure is the average, weighted by p(N_R|q), across all possible numbers of relevant documents. We can obtain more insight into the expected AP by making a simple approximation to the average over N_R. By assuming that the posterior distribution of N_R is sharply peaked around the most probable value (the mode) N̂_R, we can use the mode to approximate the average [5]. This gives:

  E_r[m_A] ≈ (1/N̂_R) Σ_{i=1}^{M} ( w_i^A p(r_{a_i}) + (1/i) Σ_{j=1}^{i-1} Cov(r_{a_i}, r_{a_j}) ),   (4)

where E[r_{a_i}] = Σ_{r_{a_i}} r_{a_i} p(r_{a_i}) = p(r_{a_i}) is the marginal probability of the relevance of the document at rank i. Note that the equation removes the dependency on N̂_R because the conditional expectation and covariance are well approximated by the unconditional ones when p(N̂_R|q) ≈ 1. To simplify the equation, we also define w_i^A = (1/i)(1 + Σ_{j=1}^{i-1} p(r_{a_j})), which is regarded as an adaptive weight of rank i. The first term in this simple approximation indicates that the expected AP is a weighted average of the scores across all rank positions, and as we increase the marginal probability of relevance p(r_{a_i}) in the ranked list, the expected AP increases.
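As a sketch (my own illustration, not the authors' code), the approximation in Eq. (4) can be evaluated directly from estimated marginals and covariances; here the mode N̂_R is approximated by the sum of the marginals, a simplification relative to the text:

```python
def expected_ap(p, cov):
    """Approximate expected AP, Eq. (4).

    p[i]     : marginal probability of relevance at rank i+1 (rank order).
    cov[i][j]: covariance of relevance between ranks i+1 and j+1.
    """
    n_rel = sum(p)  # expected number of relevant docs, standing in for the mode
    total = 0.0
    for i in range(len(p)):
        w = (1.0 + sum(p[:i])) / (i + 1)          # adaptive weight w_i^A
        c = sum(cov[i][j] for j in range(i))      # sum_{j<i} Cov(r_i, r_j)
        total += w * p[i] + c / (i + 1)
    return total / n_rel
```

Raising any covariance with an earlier rank raises the value, which is exactly the positive-correlation behaviour of the second term in Eq. (4).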
Furthermore, the weight ratio is:

  w_{i+1}^A / w_i^A = [i/(i+1)] · (1 + p(r_{a_i}) + Σ_{j=1}^{i-1} p(r_{a_j})) / (1 + Σ_{j=1}^{i-1} p(r_{a_j})).   (5)

The ratio is bounded below by i/(i+1) and is adaptive to the expected relevance received so far (defined as Σ_{j=1}^{i-1} p(r_{a_j})). To gain insight into it, we approximate the weight by setting every p(r_{a_i}) equal to a common value p(r). We plot the weight ratio against the marginal probability p(r) and the rank position in Figure 2(a). It illustrates that when we have more confidence about the relevance of the early-retrieved documents (p(r) approaches one), the weight ratio approaches one. As a result, the metric worries less about the early-retrieved documents, thus putting nearly equal weights on the later-retrieved documents. This is similar to the Precision metric. But once less confident documents (p(r) approaches zero) are retrieved, particularly in the top rank positions, the weight ratio approaches its lower bound. As a consequence, the weight penalizes later-retrieved relevant documents more, and the weight ratio of the expected AP behaves more like that of the expected DCG, which will be discussed later.

The second term in Eq. (4) indicates that a document will contribute more to the expected AP if its relevance is more positively correlated with those of the previously retrieved documents. The consequence is that positively correlated documents are pushed up the ranked list. This is an interesting finding because it shows that the expected AP is in fact nonlinear: it models the dependencies between documents' relevance and incorporates them in deciding the preferred rank order. The rationale for encouraging positively correlated relevant documents is that if a document is relevant, it is likely that its positively correlated documents are also relevant. This theoretically explains why pseudo relevance feedback helps improve MAP: the top-ranked documents are generally likely to be relevant, and finding other documents similar to these top-ranked ones improves the metric [4].

2.1.2 Expected DCG and Precision

Discounted Cumulative Gain (DCG) is another popular measure of ranking effectiveness, especially in web search.
DCG measures the usefulness, or gain, of a document based on its (graded) relevance [4] (for the moment, let us consider r_{a_i} to cover graded relevance too); the gain is accumulated from the top of the result list to the bottom. To penalize late-retrieved relevant documents, the gain of each result is discounted by a function of its rank position. By definition, we have the DCG measure:

  m_D(a|r) = Σ_{i=1}^{M} w_i^D g(r_{a_i}),   (6)

where w_i^D is the discount weight for rank position i, and g(r_{a_i}) is a gain function mapping the relevance value to the retrieval gain. Unlike the expected AP, the expected DCG is linear with respect to rank positions. We thus have:

  E_r[m_D] = Σ_{i=1}^{M} w_i^D E[g(r_{a_i})].   (7)

Since g(r_{a_i}) is infinitely differentiable in the neighborhood

of the mean of r_{a_i}, i.e., r̂_{a_i} = E[r_{a_i}], the mean of g(r_{a_i}) can be represented by a Taylor series:

  E[g(r_{a_i})] = g(r̂_{a_i}) + E[(r_{a_i} − r̂_{a_i})] g'(r̂_{a_i}) + (1/2) E[(r_{a_i} − r̂_{a_i})^2] g''(r̂_{a_i}) + ...
              = g(r̂_{a_i}) + (1/2) Var(r_{a_i}) g''(r̂_{a_i}) + ...
              ≈ g(r̂_{a_i}) + (1/2) Var(r_{a_i}) g''(r̂_{a_i}).   (8)

The expected DCG is thus approximated by:

  E_r[m_D] ≈ Σ_{i=1}^{M} w_i^D ( g(r̂_{a_i}) + (1/2) g''(r̂_{a_i}) Var(r_{a_i}) ),   (9)

where Var(r_{a_i}) denotes the variance of r_{a_i}. Eq. (9) shows that the expected value of DCG is determined by both the mean and the variance of the relevance of the documents at rank positions 1 to M. Whether the variance is added or subtracted depends on the sign of the second derivative of the gain function. In the case of graded relevance, if we consider highly relevant documents more valuable than marginally relevant documents and give them more gain, we can use a gain function like g(r_{a_i}) = 2^{r_{a_i}} − 1; in this case, the variance is added. It follows that when w_1^D > w_2^D > ... > w_M^D, the document with the highest score of g(r̂_{a_i}) + (1/2) g''(r̂_{a_i}) Var(r_{a_i}) is retrieved first, the document with the next highest score is retrieved second, and so on. It is common to define w_i^D = 1/log_2(1+i). Compared to the adaptive weight in the expected AP, it penalizes late-retrieved relevant documents more; Figure 2(c) compares their weight ratios. Precision at M is a special case of DCG, where the discount is a constant and the gain function is linear. Thus, the expected Precision measure is:

  E[m_P] = (1/M) Σ_{i=1}^{M} E[r_{a_i}] = (1/M) Σ_{i=1}^{M} p(r_{a_i}).   (10)

2.1.3 Expected Reciprocal Rank

In cases like web search and question answering tasks, we quite often expect a relevant document to be retrieved as early as possible [, 8]. Expected Search Length and Reciprocal Rank (RR) are strongly biased towards early-retrieved documents. This section analyzes RR, while Expected Search Length can be treated similarly. RR is the inverse of the rank of the first relevant document and is bounded between 0 and 1.
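The derivation that follows makes the expectation of RR precise. As a standalone sketch (my own illustration, not the authors' code), note that if the relevance states at different ranks are treated as independent, the probability that the first relevant document appears exactly at rank i is p_i times the product of the non-relevance probabilities above it, which gives a simple closed form:

```python
def expected_rr(p):
    """Expected Reciprocal Rank assuming independent relevance across ranks:
    E[RR] = sum_i (1/i) * p[i] * prod_{j<i} (1 - p[j])."""
    total, none_above = 0.0, 1.0
    for i, pi in enumerate(p, start=1):
        total += none_above * pi / i
        none_above *= (1.0 - pi)  # prob. that all docs up to rank i are non-relevant
    return total
```

The running product is the strong discount discussed below: once a high-probability document has been placed, everything after it contributes very little.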
It is formally defined as:

  m_R(a|r) = r_{a_1} + (1/2) r_{a_2} (1 − r_{a_1}) + (1/3) r_{a_3} (1 − r_{a_1})(1 − r_{a_2}) + ...
           = Σ_{i=1}^{N} (1/i) r_{a_i} Π_{j=1}^{i-1} (1 − r_{a_j}) = Σ_{i=1}^{N} (1/i) v_i r_{a_i},   (11)

where we define v_i = Π_{j=1}^{i-1} (1 − r_{a_j}), a function of the relevance values of the documents ranked above i (v_1 = 1 when i = 1). Conceptually, the RR measure can be thought of as a weighted average of relevance values at different rank positions, where the weights are adaptive to the earlier-retrieved documents. The expected value of the RR measure is the following:

  E[m_R] = E[ Σ_{i=1}^{M} (1/i) v_i r_{a_i} ] = Σ_{i=1}^{M} (1/i) E[v_i r_{a_i}]
         = Σ_{i=1}^{M} (1/i) ( E[v_i] E[r_{a_i}] + Cov(r_{a_i}, v_i) )
         = Σ_{i=1}^{M} (1/i) ( w_i^R p(r_{a_i}) + Cov(r_{a_i}, v_i) ),   (12)

where, similarly, we consider E[v_i] an adaptive weight and denote it w_i^R. It can be approximated by assuming that the irrelevance of the documents above rank i is independent, i.e., w_i^R = E[v_i] ≈ Π_{j=1}^{i-1} (1 − p(r_{a_j})). Thus w_i^R > w_{i+1}^R. On the one hand, similarly to the expected DCG, the weight w_i^R is a discount factor penalizing late-retrieved relevant documents. As a result, maximizing the measure tends to push documents with a high marginal probability of relevance to the top. However, the penalty is much larger than those in the expected DCG and the expected AP. To see this, let us again approximate the weight by setting p(r_{a_i}) ≈ p(r). The weight ratio is compared with those of the expected AP and the expected DCG in Figure 2(c): the expected RR has the smallest weight ratio, the expected AP the largest, and the expected DCG lies in the middle. On the other hand, the weight is updated in a completely different way compared to the expected AP. Figure 2(b) plots the weight ratio against the marginal probability p(r) and the rank position. Differently from the expected AP, the weight ratio of the expected RR becomes smaller when p(r) is larger, reinforcing the discount further. As a consequence, the metric focuses entirely on the quality of a few early-retrieved documents. For example, the upper bound of w_3^R is 0.25 if we consider p(r_{a_i}) > 0.5 for i in {1, 2, 3}, while for DCG the corresponding weight usually equals 1/log_2 4 = 0.5, and for the expected AP it is even larger. The covariance part in Eq.
(12) shows that, overall, the expected value of RR increases when the relevance of a document is more positively correlated with v_i, the product of the non-relevances (1 − r_{a_j}) of the documents above it. The effect is that negatively correlated documents will have a higher expected RR than positively correlated documents, an effect discounted by a factor 1/i at rank i. This is an entirely opposite preference compared to the expected AP. To see this, suppose we have two documents to rank:

  E[m_R] = E[r_{a_1}] + (1/2) E[r_{a_2} (1 − r_{a_1})]
         = p(r_{a_1}) + (1/2) ( p(r_{a_2}) − E[r_{a_1} r_{a_2}] )
         = p(r_{a_1}) + (1/2) ( p(r_{a_2}) − Cov(r_{a_1}, r_{a_2}) − p(r_{a_1}) p(r_{a_2}) )
         = p(r_{a_1}) + (1/2) w_2^R p(r_{a_2}) − (1/2) Cov(r_{a_1}, r_{a_2}),   (13)

where w_2^R = 1 − p(r_{a_1}). It shows that a negatively correlated document has a higher expected RR, confirming the findings in [, 9] that the RR metric is optimized by diversifying the ranked list of documents.

2.1.4 A General View

Through our analysis, it can be seen that the expected IR metrics roughly have two components. A unified definition is given as follows:

  E[m(a|r)] = Σ_{i=1}^{M} ( W_i p(r_{a_i}) + V_i(r_{a_1}, ..., r_{a_i}) ),   (14)

where W_i is the discount weight at position i, and V_i is a

function defining the correlation between documents. The specific definitions with respect to the different metrics are summarized in Table 1. Notice that for DCG, in the case of binary relevance, g(r_{a_i}) = 2^{r_{a_i}} − 1 can be approximated as a linear function, and the variance part in Eq. (9) vanishes.

Table 1: A unified view of expected IR metrics.

             Definition m(a|r)                                 W_i                                    V_i(r_{a_1}, ..., r_{a_i})
Precision:   (1/M) Σ_i r_{a_i}                                 1/M                                    0
DCG:         Σ_i r_{a_i} / log_2(1+i)                          1/log_2(1+i)                           0
AP:          (1/N_R) Σ_i (r_{a_i}/i)(1 + Σ_{j<i} r_{a_j})      (1/(N̂_R i))(1 + Σ_{j<i} p(r_{a_j}))    (1/(N̂_R i)) Σ_{j<i} Cov(r_{a_i}, r_{a_j})
RR:          Σ_i (r_{a_i}/i) Π_{j<i} (1 − r_{a_j})             (1/i) Π_{j<i} (1 − p(r_{a_j}))         (1/i) Cov(r_{a_i}, Π_{j<i} (1 − r_{a_j}))

The first component is linear with respect to the marginal probability p(r_{a_i}). Strictly speaking, this is not exactly true, as W_i is adaptive to the previously retrieved documents; but since the weight ratio W_{i+1}/W_i is usually smaller than one, the maximum value of the first component is still achieved by ranking in decreasing order of the marginal probability of relevance. This is identical to what the Probability Ranking Principle suggests [9]. We call it the general ranking preference. The second component is what makes the IR metrics differ from each other; we call it the specific ranking preference. A more detailed discussion and comparison is presented in Section 3.1 through a simulation.

2.2 Practical Considerations

Stack Search: Maximizing Eq. (14) is a non-trivial task because it requires searching over all possible ranking combinations. We use a stack search similar to [3], which keeps a list of the best n ranking combinations seen so far as candidates. These candidates are incomplete solutions up to rank i. The search then iteratively expands each of the best partial solutions by adding a document at rank i+1. For each candidate, we select the top n documents that give the maximum increases of the expected IR metric in Eq. (14). We then put all the resulting partial solutions (in this case, n × n of them) onto the stack and trim the resulting list of partial solutions back to the top n candidates. We repeat the loop until the end of the ranked list is reached. The solution is the one having the maximum value among the candidate solutions.
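The stack search just described can be sketched as a small beam search (my own illustration, not the authors' code; `score_gain` is a hypothetical callback returning the increase of the chosen expected metric when a document is appended to a partial ranking):

```python
import heapq

def stack_search(docs, score_gain, n=3):
    """Beam ('stack') search: keep the n best partial rankings, extend each
    with its n best next documents, trim back to the top n, and repeat."""
    beams = [(0.0, ())]  # (expected metric value so far, partial ranking)
    for _ in range(len(docs)):
        candidates = []
        for value, prefix in beams:
            remaining = [d for d in docs if d not in prefix]
            best_next = sorted(remaining,
                               key=lambda d: score_gain(prefix, d),
                               reverse=True)[:n]
            for d in best_next:
                candidates.append((value + score_gain(prefix, d), prefix + (d,)))
        beams = heapq.nlargest(n, candidates)  # trim the stack to the n best
    return max(beams)[1]
```

With n = 1 this reduces to the greedy approach; a DCG-style gain could be, for instance, `lambda prefix, d: p[d] / math.log2(len(prefix) + 2)` for estimated marginals `p`.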
Such a sequential update does not necessarily provide a globally optimal solution, but it provides an excellent trade-off between accuracy and efficiency through the adjustment of n. When n is 1, it reduces to the greedy approach; when we increase n, better solutions may be found at the expense of more computational cost. For details refer to [3].

IR Model Calibration: To calculate the expected IR metrics during retrieval, we need to estimate the joint probability of relevance. An obvious solution is to estimate it directly from (training) data []. Relevance information is, however, not steadily available in many practical situations, making it hard to build a robust relevance model. In this paper, we instead conduct an indirect estimation using existing IR models. It has been observed in many text retrieval experiments that the calculated ranking scores can serve as robust indicators of documents' relevance with respect to queries. Thus, a mapping function can be developed from the ranking scores to the probability of relevance. Similar to [9], the joint probability of relevance p(r|q) is summarized by the marginal probabilities p(r_{a_i}|q) and the covariances Cov[r_{a_i}, r_{a_j}]. Let us first look at p(r_{a_i}|q), and treat it as the utility of the ranking score. We expect the utility, defined as u, to be a non-decreasing function of the ranking score; thus the first derivative u' > 0. It is also expected that u approaches a maximum value as the ranking score increases; thus the second derivative u'' < 0. Our experiment on TREC data (Section 3.2) has confirmed this intuition. Applying an exponential utility function (u' > 0 and u'' < 0) [] gives the mapping function:

  p(r_{a_i}|q) ≈ u(s) = 1 − e^{−bs},   (15)

where u(s), in the range [0, 1), is the utility of the ranking score s, with s ≥ 0, and b denotes a constant. For the empirical study of the mapping, we refer to Section 3.2.

Figure 3: The gain in performance for Average Precision, DCG, and RR, respectively, as the correlation between documents is adjusted from negative to positive values.
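As a minimal sketch of the calibration in Eq. (15) (my own illustration; b = 9 anticipates the value fixed in the experiments below):

```python
import math

def score_to_prob(score, b=9.0):
    """Map a non-negative, query-length-normalised ranking score to a marginal
    probability of relevance via the exponential utility u(s) = 1 - exp(-b*s)."""
    return 1.0 - math.exp(-b * score)
```

The function is increasing (u' > 0) and concave (u'' < 0), matching the two requirements stated above, and its value stays in [0, 1) as a probability should.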
The next question is how to estimate the covariance:

  Cov[r_{a_i}, r_{a_j}] = ρ(r_{a_i}, r_{a_j}) sqrt( Var[r_{a_i}] Var[r_{a_j}] ),   (16)

where Var[r_{a_i}] = (1 − p(r_{a_i})) p(r_{a_i}), since r_{a_i} follows a Bernoulli distribution. The correlation coefficient ρ(r_{a_i}, r_{a_j}) models the dependency of relevance between the documents at ranks i and j. During retrieval, it is reasonable to use the correlation between the documents' scores to estimate the relevance correlation, i.e., ρ(r_{a_i}, r_{a_j}) ≈ ρ(s_{a_i}, s_{a_j}). Strictly speaking, the score correlation is query-dependent. A practical solution is, however, to approximate it by sampling queries and calculating the correlation between the documents' ranking scores from an IR model. In our implementation, we construct each of these queries by randomly sampling query terms from the vocabulary of the data set. For the expected RR, we need to compute the covariance between document a_i and the variable v_i, where v_i = Π_{j=1}^{i-1} (1 − r_{a_j}) is the meta-relevance of the previously retrieved documents, as defined in Section 2.1.3. In our implementation, we aggregate the content of the top i−1 documents into a meta-document, and estimate the correlation between r_{a_i} and v_i as minus the correlation between the meta-document's ranking score and document a_i's ranking score.

3. EXPERIMENTS

3.1 Simulation

In this section, we carried out a simulation as a confirmation of our analysis of the effect of the correlation between different documents' relevance on a range of IR metrics. The relevance states of documents were generated over repeated trials. In each trial, for each rank position, we kept

the marginal probability of relevance p(r_{a_i}|q) unchanged and generated the relevance/non-relevance states of the document. The samples were then randomly perturbed so that the correlation between each pair of variables increases from negative to positive (the x-axis in Figure 3). For each sample in each trial we calculated the value of an IR metric, and then averaged the metric values across all the trials. We used the value of the IR metric at zero correlation as the basis for calculating the gain on the metric when the correlation changes. The results for AP, DCG, and RR are shown in Figure 3. They confirm our derivation for the expected DCG, which is insensitive to correlation. The AP value increases when the correlation increases, whereas RR does the opposite. We tried different settings, such as the number of documents and the marginals, and obtained findings similar to those reported above. Previous empirical studies on TREC data have found that one cannot optimize both the RR and AP metrics at the same time [4, 9]. The analytical forms and the simulation provide direct evidence: the AP metric encourages positively correlated documents whereas the RR metric encourages the opposite.

3.2 IR Model Calibration

Figure 4: The probability that a result from each bin is relevant, plotted against the median score of each bin.

In this section, TREC data is used to gain insight into what the mapping function u looks like. Similar to the experimental setup in [], we measured the utility of ranking scores by the probability that documents with the given ranking scores are judged relevant. Documents were binned based on their ranking scores for analysis; we estimated the probability that a randomly picked document from each bin is judged relevant. More specifically, we ran the Jelinek-Mercer smoothing language model on the 249 topics of the TREC 2004 Robust Track with the smoothing parameter λ set to its typical value [34].
The top 100 documents were returned for each topic, and there were in total 24,660 results returned for these 249 queries; the track's judgments contain 17,412 relevant documents in total. The queries contain different numbers of terms; to make the ranking scores comparable across queries, we normalized the ranking scores of all results of each query by dividing them by the number of terms in the query. We sorted the 24,660 results in descending order of their scores, and divided this ranked list into bins of 1,500 results each, yielding 17 bins: the first 16 bins containing 1,500 results each, and the last bin containing the 660 documents with the lowest scores. We selected the median score in each bin to represent the bin. In Figure 4, the utility of each bin, i.e., the probability that a randomly chosen result from the bin is relevant, is estimated as the number of relevant documents in the bin divided by the bin size. The data points are based on the pairs of the median score of each bin and the probability of relevance, and the data points are connected by smoothed curves.

Table 2: Overview of the six TREC collections.

Name          Description                 Size      # Docs      Topics
TREC8         TREC disks 4&5 minus CR     1.86 GB   528,155     401-450
Robust2004    TREC disks 4&5 minus CR     1.86 GB   528,155     301-450 and 601-700 minus 672
Robust2004    TREC disks 4&5 minus CR     1.86 GB   528,155     50 difficult topics
  Hard
WT10g         TREC Web collection         10 GB     1,692,096   451-550
CSIRO         CSIRO crawl                 4.2 GB    370,715     1-50 minus 8 unjudged topics
.GOV          crawl of the .gov domain    18 GB     1,247,753   topic distillation topics

Figure 4 confirms our intuition that the mapping function is approximately a concave curve (u' > 0 and u'' < 0), and fitting Eq. (15) to the data in Figure 4 gives a fitted value of b. Our experiments showed that the performance of our approach is robust with respect to the choice of b: values in a wide range around this fit result in negligible changes in performance on all the test collections. For the remaining experiments, we fix the parameter b at 9, while bearing in mind that tuning it on training data might offer further performance improvements.
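The binning procedure above can be sketched as follows (my own illustration, not the authors' code; it assumes a hypothetical data layout of parallel lists of normalised scores and binary judgments):

```python
def bin_utilities(scores, judgments, bin_size=1500):
    """Sort results by score (descending), cut into fixed-size bins, and return
    (median score, fraction judged relevant) per bin, as in Figure 4."""
    ranked = sorted(zip(scores, judgments), key=lambda x: -x[0])
    points = []
    for start in range(0, len(ranked), bin_size):
        bin_ = ranked[start:start + bin_size]
        median_score = bin_[len(bin_) // 2][0]              # bin representative
        relevant_fraction = sum(j for _, j in bin_) / len(bin_)  # empirical utility
        points.append((median_score, relevant_fraction))
    return points
```

Fitting Eq. (15) to the returned points (e.g. by least squares over b) then yields the calibration constant.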
3.3 Performance

We continued our empirical study of the proposed probabilistic retrieval framework, focusing on understanding its ability to optimize IR metrics. The Dirichlet and Jelinek-Mercer smoothing language models were chosen as the two baseline IR models, since they are frequently reported to perform well on TREC test collections [34]. For each query, the ranking score of each document, calculated by either of the two IR models, is normalized by dividing it by the number of terms in the query; it is then used as the input for estimating the marginal probabilities and covariances on the basis of the discussion in Section 2.2. The stack search is then applied to find an optimal ranking list that maximizes a given IR metric in Eq. (14). For the stack search, we simply set n = 1, i.e., equivalent to a greedy approach, while leaving this line of research to future work. Standard stemming and stopword removal were carried out for both queries and documents. The smoothing parameters of the language models were tuned for the optimal performance on a metric on each data set. The results are reported on six TREC test collections, described in Table 2. TREC8, Robust 2004, and the Robust 2004 hard topics are three plain-text collections, while the TREC ad hoc task on WT10g data, the TREC 2007 enterprise track document search task on CSIRO data, and the TREC topic distillation task on .GOV data are three Web collections. The results in Table 3 indicate that when we choose a certain IR metric to maximize, in most cases we obtain better performance on that metric than by optimizing other metrics or using the baselines. More specifically, our approach always had the best performance with respect to MAP and MRR when the objective was to maximize the expected AP and RR, respectively. When we aimed to optimize the expected DCG, our approach improved the baseline in terms of NDCG in 8 out of 12 cases. It is worth mentioning that no parameter tuning was needed when optimizing the metrics; without any tuning, our approach consistently outperformed the two baseline models, and eight improvements are statistically significant.
Recall from our earlier analysis that the expected AP and RR have rather opposite rank preferences (utilities): the expected AP favors a document whose relevance is positively correlated with those of the documents ranked above it, whereas the expected RR suggests the opposite. Table 3 demonstrates that optimization of the expected RR always leads to better performance on MRR than optimization

of the expected AP, and vice versa.

Table 3: Performance on MAP, NDCG and MRR when the objective is to optimize AP, DCG, and RR, respectively. We used the Dirichlet and Jelinek-Mercer smoothing language models, whose smoothing parameters were tuned for the optimal performance of a metric on each data set, as the baselines in optimization. The highest performance is highlighted in bold. A Wilcoxon signed-rank test (p < .05) was conducted, and statistically significant improvements over the baselines are marked.
[Table body omitted: for each of the six collections, rows compare each baseline with Maximize AP, Maximize DCG, and Maximize RR on MAP, NDCG, and MRR; the numeric entries are garbled in this transcription.]

The result supports our theoretical finding that RR and AP are two different types of metric, and optimizing either of them cannot lead to optimal performance on the other. Table 3 also shows that optimization of AP can sometimes lead to better performance on NDCG than direct optimization of DCG. A similar finding appeared in the learning-to-rank paradigm, where it was argued that this is because MAP is more informative than DCG [32]. Yet we think that the informativeness explanation, although true in learning to rank, does not necessarily hold in our probabilistic framework, since we do not use IR metrics to summarize the training data. Our belief is supported by the results from the simulation in Section 3
that the expected DCG is invariant to changes of the relevance correlation between documents; as a result, optimizing AP (which promotes documents whose relevance is positively correlated with that of previously ranked documents) should not do any better than directly optimizing DCG for the NDCG metric. We thus believe that the somewhat contradictory finding on the real data sets may be attributed to the estimation of the joint probability of relevance, more specifically the relevance correlation, given that we used textual content to infer relevance. As the cluster hypothesis suggests that relevant documents tend to be similar to each other and form clusters [25], a document is likely to be relevant if it is similar to relevant documents. As a result, the expected AP is biased towards putting documents that are similar to each other in the top rank positions. When the top-ranked documents are relevant, these other documents are also likely to be relevant; their marginal probabilities of relevance might be higher than estimated. As a result, metrics such as NDCG and Precision are improved. Finally, we provide a further account of RR and AP, the two differently behaving metrics. Recall the figure in which the properties of the expected RR and AP were depicted by adjusting the weight functions w_A and w_R using a single parameter p(r). Figure 5 uses the TREC8 test collection to further show the effect of p(r) on the resulting MRR and MAP performance. For comparison, the performance of the baseline Dirichlet smoothing language model, and of the exact optimization of RR, MAP and DCG, is also plotted.

Figure 5: MRR vs. MAP.

It shows that adjusting p(r) to approximate AP is very stable, since the solution stays roughly the same for all eight values of p(r). This can be explained by the fact that the weight ratio between w_A^+ and w_A^- saturates at 1 for all values of p(r) once i increases above 4. By contrast, the RR approximation is more volatile with respect to p(r): as p(r) increases from .1 to .5, the MRR performance increases whereas the MAP performance decreases.
This is due to the fact that as p(r) decreases, the weight ratio of RR becomes similar to those of DCG and AP. Thus p(r) can be used to trade off between performance on MAP and MRR. When p(r) = .3, the performance on MRR even slightly exceeds that of the exact optimization of RR. This suggests that there might still be scope to improve our stack search algorithm by setting n higher than 1.

4. LINKS TO OTHER WORK
To complement the earlier discussion, we continue with related work here. In the learning-to-rank paradigm, optimizing IR metrics is conducted in a discriminative manner, where Support Vector Machines or Neural Networks are commonly used [23, 33]. By contrast, we study the problem in a probabilistic framework where the intention is to combine both the generative and discriminative processes. Our formulation of optimal ranking also fundamentally departs from the idea in [26], where a probability distribution over document permutations (ranks) is defined and the expectation of IR metrics is taken under that distribution. We believe, however, that the expectation of IR metrics should be taken with respect to a distribution of relevance, because the uncertainty comes solely from the fact that we cannot know the relevance of documents with absolute certainty. For the purpose of evaluation, the estimation of IR metrics, particularly MAP, has been investigated in the past.

For example, to reduce the variability of test collections, a normalization technique was introduced [11]; to deal with incomplete judgements, sampling approaches were proposed [3, 31]. Empirically, their error rates were measured [7], and the uncertainty arising from the variability of relevance judgments in TREC was also examined [27]. By contrast, our study is for the purpose of retrieval, and thus IR metric estimation and optimization are explored in a completely different situation, where relevance is not known a priori. The most closely related work can be found in [10, 15, 35]. The study in [10] argued that in some tasks users would be satisfied with a limited number of relevant documents rather than requiring all relevant documents; the authors therefore proposed to maximize the probability of finding a relevant document among the top n. By treating the previously retrieved documents as non-relevant, their algorithm is equivalent to optimizing Reciprocal Rank. A more general solution is proposed in [35] on the basis of the Bayesian rank decision framework in [15]. In these solutions, different rank preferences are expressed by different utility functions and can be incorporated when calculating the score for each document. The two ideas are close in spirit to the Maximal Marginal Relevance (MMR) criterion in [9], and can be called marginal-relevance IR models because they are designed to calculate the additional information a document contributes to a result list. Unfortunately, this framework lacks the capacity to model and optimize different IR metrics. This paper takes a rather different view: although, like [15, 35], we also follow Bayesian decision theory, we argue that the rank utility has nothing to do with the (relevance) model parameters but only with the hidden true topical relevance, and that the relevance states of documents need to be estimated before any user (rank) utility is known. A good IR metric should be able to specify one type of rank utility.
Once we summarize our belief about the true relevance by the joint probability of relevance, the utility, expressed by an evaluation metric, can be estimated under this uncertainty, and the optimal decision is the one that optimizes the expected value. The two distinct retrieval steps do not assume a particular (relevance) retrieval model, making the framework applicable to many existing IR models and IR metrics. Our work is also related to the portfolio theory of document ranking [29]. By analogy with financial problems, it argued that an optimal rank order is one that balances the overall relevance (mean) of the ranked list against its risk level (variance). This paper follows the idea of using the mean and variance to summarize a distribution and to analyze the expected IR metrics; our analytical forms of expected IR metrics, expressed in terms of the mean and variance, reveal some interesting properties that have not been shown before.

5. CONCLUSIONS
In this paper, we have studied the statistical properties of expected IR metrics when the relevance of documents is unknown. An implementation based on our analysis and the two-stage framework has demonstrated its ability to optimize major IR metrics in a probabilistic framework. In the future, it is of great interest to apply it to web search, where click-through data can be viewed as indirect evidence of document relevance. Also, during evaluation, the Cranfield paradigm considers relevance as deterministic values, either binary or graded. It is, however, more general to consider IR evaluation as a stochastic process too. Thus, although our study of expected IR metrics is aimed at retrieval, the analysis and development are also relevant to evaluation if the disagreement between relevance assessors needs to be modelled.

6. REFERENCES
[1] G. Amati and C. J. van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357-389, 2002.
[2] K. Arrow. Aspects of the Theory of Risk-Bearing. Helsinki: Yrjö Jahnsson Foundation, 1965.
[3] J. A. Aslam, V.
Pavlu, and E. Yilmaz. A statistical method for system evaluation using incomplete judgments. In SIGIR, 2006.
[4] J. A. Aslam, E. Yilmaz, and V. Pavlu. The maximum entropy method for analyzing retrieval measures. In SIGIR, 2005.
[5] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[6] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer. The mathematics of statistical machine translation: parameter estimation. Comput. Linguist., 1993.
[7] C. Buckley and E. M. Voorhees. Evaluating evaluation measure stability. In SIGIR, 2000.
[8] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML, 2005.
[9] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR, 1998.
[10] H. Chen and D. R. Karger. Less is more: probabilistic models for retrieving fewer relevant documents. In SIGIR, 2006.
[11] G. V. Cormack and T. R. Lynam. Statistical precision of information retrieval evaluation. In SIGIR, 2006.
[12] W. B. Croft and D. J. Harper. Using probabilistic models of document retrieval without relevance information. Document Retrieval Systems, 1988.
[13] D. Harman. Overview of the second Text REtrieval Conference (TREC-2). In HLT, 1994.
[14] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422-446, 2002.
[15] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In SIGIR, 2001.
[16] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[17] M. E. Maron and J. L. Kuhns. On relevance, probabilistic indexing and information retrieval. J. ACM, 1960.
[18] S. Mizzaro. Relevance: The whole history. Journal of the American Society of Information Science, 1997.
[19] S. E. Robertson. The probability ranking principle in IR. Journal of Documentation, 33(4):294-304, 1977.
[20] S. E. Robertson and K. Spärck Jones. Relevance weighting of search terms.
Journal of the American Society for Information Science, 27(3):129-146, 1976.
[21] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR, 1994.
[22] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In SIGIR, pages 21-29, 1996.
[23] M. Taylor, J. Guiver, S. Robertson, and T. Minka. SoftRank: optimizing non-smooth rank metrics. In WSDM, 2008.
[24] S. Tomlinson. Early precision measures: implications from the downside of blind feedback. In SIGIR, 2006.
[25] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, UK, 1979.
[26] M. N. Volkovs and R. S. Zemel. BoltzRank: learning to maximize expected ranking gain. In ICML, 2009.
[27] E. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing and Management, 1998.
[28] E. M. Voorhees. The TREC-8 question answering track report. In TREC-8, pages 77-82, 1999.
[29] J. Wang and J. Zhu. Portfolio theory of information retrieval. In SIGIR, 2009.
[30] Y. Wang and A. Waibel. Decoding algorithm in statistical machine translation. In EACL, 1997.
[31] E. Yilmaz, E. Kanoulas, and J. A. Aslam. A simple and efficient sampling method for estimating AP and NDCG. In SIGIR, 2008.
[32] E. Yilmaz and S. Robertson. On the choice of effectiveness measures for learning to rank. Information Retrieval, 2009.
[33] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In SIGIR, 2007.
[34] C. Zhai. Statistical language models for information retrieval: a critical review. Found. Trends Inf. Retr., 2(3):137-213, 2008.
[35] C. Zhai and J. D. Lafferty. A risk minimization framework for information retrieval. Inf. Process. Manage., 42(1):31-55, 2006.


More information

1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands

1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands Content. Inference on Regresson Parameters a. Fndng Mean, s.d and covarance amongst estmates.. Confdence Intervals and Workng Hotellng Bands 3. Cochran s Theorem 4. General Lnear Testng 5. Measures of

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 30 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 2 Remedes for multcollnearty Varous technques have

More information

2016 Wiley. Study Session 2: Ethical and Professional Standards Application

2016 Wiley. Study Session 2: Ethical and Professional Standards Application 6 Wley Study Sesson : Ethcal and Professonal Standards Applcaton LESSON : CORRECTION ANALYSIS Readng 9: Correlaton and Regresson LOS 9a: Calculate and nterpret a sample covarance and a sample correlaton

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Analyss of Varance and Desgn of Exerments-I MODULE III LECTURE - 2 EXPERIMENTAL DESIGN MODELS Dr. Shalabh Deartment of Mathematcs and Statstcs Indan Insttute of Technology Kanur 2 We consder the models

More information

MAXIMUM A POSTERIORI TRANSDUCTION

MAXIMUM A POSTERIORI TRANSDUCTION MAXIMUM A POSTERIORI TRANSDUCTION LI-WEI WANG, JU-FU FENG School of Mathematcal Scences, Peng Unversty, Bejng, 0087, Chna Center for Informaton Scences, Peng Unversty, Bejng, 0087, Chna E-MIAL: {wanglw,

More information

THE SUMMATION NOTATION Ʃ

THE SUMMATION NOTATION Ʃ Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Temperature. Chapter Heat Engine

Temperature. Chapter Heat Engine Chapter 3 Temperature In prevous chapters of these notes we ntroduced the Prncple of Maxmum ntropy as a technque for estmatng probablty dstrbutons consstent wth constrants. In Chapter 9 we dscussed the

More information

Basically, if you have a dummy dependent variable you will be estimating a probability.

Basically, if you have a dummy dependent variable you will be estimating a probability. ECON 497: Lecture Notes 13 Page 1 of 1 Metropoltan State Unversty ECON 497: Research and Forecastng Lecture Notes 13 Dummy Dependent Varable Technques Studenmund Chapter 13 Bascally, f you have a dummy

More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

Uncertainty and auto-correlation in. Measurement

Uncertainty and auto-correlation in. Measurement Uncertanty and auto-correlaton n arxv:1707.03276v2 [physcs.data-an] 30 Dec 2017 Measurement Markus Schebl Federal Offce of Metrology and Surveyng (BEV), 1160 Venna, Austra E-mal: markus.schebl@bev.gv.at

More information

Chapter 9: Statistical Inference and the Relationship between Two Variables

Chapter 9: Statistical Inference and the Relationship between Two Variables Chapter 9: Statstcal Inference and the Relatonshp between Two Varables Key Words The Regresson Model The Sample Regresson Equaton The Pearson Correlaton Coeffcent Learnng Outcomes After studyng ths chapter,

More information

CS286r Assign One. Answer Key

CS286r Assign One. Answer Key CS286r Assgn One Answer Key 1 Game theory 1.1 1.1.1 Let off-equlbrum strateges also be that people contnue to play n Nash equlbrum. Devatng from any Nash equlbrum s a weakly domnated strategy. That s,

More information

Computing MLE Bias Empirically

Computing MLE Bias Empirically Computng MLE Bas Emprcally Kar Wa Lm Australan atonal Unversty January 3, 27 Abstract Ths note studes the bas arses from the MLE estmate of the rate parameter and the mean parameter of an exponental dstrbuton.

More information

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,

More information

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS Avalable onlne at http://sck.org J. Math. Comput. Sc. 3 (3), No., 6-3 ISSN: 97-537 COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

More information

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline Outlne Bayesan Networks: Maxmum Lkelhood Estmaton and Tree Structure Learnng Huzhen Yu janey.yu@cs.helsnk.f Dept. Computer Scence, Unv. of Helsnk Probablstc Models, Sprng, 200 Notces: I corrected a number

More information

EM and Structure Learning

EM and Structure Learning EM and Structure Learnng Le Song Machne Learnng II: Advanced Topcs CSE 8803ML, Sprng 2012 Partally observed graphcal models Mxture Models N(μ 1, Σ 1 ) Z X N N(μ 2, Σ 2 ) 2 Gaussan mxture model Consder

More information

3.1 ML and Empirical Distribution

3.1 ML and Empirical Distribution 67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k.

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k. THE CELLULAR METHOD In ths lecture, we ntroduce the cellular method as an approach to ncdence geometry theorems lke the Szemeréd-Trotter theorem. The method was ntroduced n the paper Combnatoral complexty

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal

More information

Estimation: Part 2. Chapter GREG estimation

Estimation: Part 2. Chapter GREG estimation Chapter 9 Estmaton: Part 2 9. GREG estmaton In Chapter 8, we have seen that the regresson estmator s an effcent estmator when there s a lnear relatonshp between y and x. In ths chapter, we generalzed the

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support

More information

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis

Resource Allocation and Decision Analysis (ECON 8010) Spring 2014 Foundations of Regression Analysis Resource Allocaton and Decson Analss (ECON 800) Sprng 04 Foundatons of Regresson Analss Readng: Regresson Analss (ECON 800 Coursepak, Page 3) Defntons and Concepts: Regresson Analss statstcal technques

More information

STAT 511 FINAL EXAM NAME Spring 2001

STAT 511 FINAL EXAM NAME Spring 2001 STAT 5 FINAL EXAM NAME Sprng Instructons: Ths s a closed book exam. No notes or books are allowed. ou may use a calculator but you are not allowed to store notes or formulas n the calculator. Please wrte

More information

Composite Hypotheses testing

Composite Hypotheses testing Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter

More information

Gaussian Mixture Models

Gaussian Mixture Models Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous

More information

Chapter 5 Multilevel Models

Chapter 5 Multilevel Models Chapter 5 Multlevel Models 5.1 Cross-sectonal multlevel models 5.1.1 Two-level models 5.1.2 Multple level models 5.1.3 Multple level modelng n other felds 5.2 Longtudnal multlevel models 5.2.1 Two-level

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016 CS 29-128: Algorthms and Uncertanty Lecture 17 Date: October 26, 2016 Instructor: Nkhl Bansal Scrbe: Mchael Denns 1 Introducton In ths lecture we wll be lookng nto the secretary problem, and an nterestng

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

arxiv: v2 [stat.me] 26 Jun 2012

arxiv: v2 [stat.me] 26 Jun 2012 The Two-Way Lkelhood Rato (G Test and Comparson to Two-Way χ Test Jesse Hoey June 7, 01 arxv:106.4881v [stat.me] 6 Jun 01 1 One-Way Lkelhood Rato or χ test Suppose we have a set of data x and two hypotheses

More information

Finding Dense Subgraphs in G(n, 1/2)

Finding Dense Subgraphs in G(n, 1/2) Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng

More information

Expectation Maximization Mixture Models HMMs

Expectation Maximization Mixture Models HMMs -755 Machne Learnng for Sgnal Processng Mture Models HMMs Class 9. 2 Sep 200 Learnng Dstrbutons for Data Problem: Gven a collecton of eamples from some data, estmate ts dstrbuton Basc deas of Mamum Lelhood

More information

Hidden Markov Models

Hidden Markov Models Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,

More information