III. A New Evaluation Measure

J. Joiner and L. Werner

Abstract

The problems of evaluation and the criteria needed for evaluation measures in the SMART information retrieval system are reviewed and discussed. Performance characteristics of a good evaluation measure are examined. The suggested measure, $\Pr_N(P < R/n)$, is the probability under the hypergeometric distribution that the precision could be strictly less than the precision attained, where R = number relevant in the sample drawn, N = total number in collection, and n = size of sample drawn. The measure is introduced and tested against the various criteria needed for a good evaluation measure. A statistical test of significance is explained.

1. Introduction

Among the principal obstacles to the evaluation of information retrieval methods are the following:

1) Interpolation between points of recall results in errors which are unsatisfactory in one way or another, depending upon the type of interpolation used.

2) A recall-precision curve sometimes gyrates wildly, and the averaging of many curves over queries has questionable reliability.

3) The statement "method A is better than method B" often depends upon the value one is measuring. A unique value measuring both recall and precision would be best.

4) Queries with different numbers of relevant documents do not receive different amounts of credit, although just by random chance it is easier to get a relevant document for a query with many relevant documents than for one with only a few.

5) It has not been determined how to handle the evaluation of feedback methods in relation to the relevant documents retrieved before feedback.

This report will attempt to discuss and propose solutions to these specific problems.

2. Problems of Evaluation

One important aspect of information retrieval is obtaining a value for a given method which is a true measure of the method's effectiveness over many queries. On a recall-precision graph, the points of recall where one measures this effectiveness are 0.1, 0.2, ..., 0.9, 1.0. However, the only points available for a query with n relevant documents are 1/n, 2/n, ..., (n-1)/n, n/n. Obviously, for queries with different numbers of relevant documents, one may expect that none of the query's points will coincide with the points 0.1, 0.2, ..., 0.9, 1.0. But presently, by interpolation of some kind, the precision values are found for each query at these points. Fig. 1 shows one such method. There can be no real justification for any method of interpolation used, for it is impossible to estimate a discrete function at a nonexistent point. Therefore, what is needed to solve this problem is a new base index for the graph that would involve no interpolation: an index that would, over many queries with different numbers of relevant documents, have only common points for all queries.

The averaging of points of recall-precision where interpolation must occur tends further to distort the measure of effectiveness. But even over

[Fig. 1. Recoverable labels: Query Number; Total No. of Relevant Documents; Ranks of Relevant Documents.]

points that coincide with equal recall, different values can be obtained by different methods of averaging. Even at these common points, the values being averaged are somewhat in doubt. No correlation is made between the precision values and the generality number (the number of relevant documents divided by the total number in the collection), which reflects how easy it would be, under random conditions alone, to select relevant documents. A good performance measure should control this randomness factor. Control should be in the sense that when the generality ratio is decreased in a way which preserves the observed performance level, the effect of the generality ratio on a performance measure could be observed. The measure proposed is known to reflect the generality number under equal performance, but a method of splitting a collection into two collections suggested by R. Williamson has not been tested.

3. Criteria for a Good Evaluation Measure

A good performance measure should fulfill the following criteria:

1) Recall values measure the effectiveness of a method by comparing the number of relevant documents retrieved to the total number of relevant documents, while precision measures this performance by comparing the number of relevant retrieved documents to the total number retrieved. These intuitively seem to be the best measures of performance available. Their biggest drawback is that they are two separate values, not one. A good measure should reflect both.

2) The generality number, as stated before, reflects the degree of effect that pure random chance selection will have on the method of retrieving relevant documents for a certain query. With this controlled, queries can be compared on a more common basis.

3) In theory at the least, the measure should appeal to the user and tester, and the values obtained should have a logical range. A range from 0 to 1 best suits a measure of performance and effectiveness. Any system which is effective at all should have values of the measure closer to 1 than to 0.

4. The Probability Measure

A very large urn is filled with N documents. For query q there are K documents that are relevant and N - K that are not. If, at random, n documents are drawn from that urn without replacement, the probability that fewer than m relevant documents are chosen completely at random is

$$P_H(R < m) = \sum_{i=0}^{m-1} P_H(R = i) = \sum_{i=0}^{m-1} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}},$$

where R is the number of relevant documents in the sample drawn. This is equivalent to finding the probability by random chance that the precision is less than m/n, for

$$P_H(R < m) = P_H(R/n < m/n) = P_H(P < m/n).$$

The higher this probability is, the less likely it would be that the precision achieved was obtained by chance. This measure could be evaluated at any point n (equal to the number of documents retrieved) that might be wanted for investigation. As precision increases from m/n to (m+1)/n, this value goes from $P_H(P < m/n)$ to $P_H(P < (m+1)/n)$, which are equal to $P_H(R < m)$ and $P_H(R < m+1)$, where

$$P_H(R < m) = P_H(R = m-1) + P_H(R = m-2) + \cdots + P_H(R = 0)$$

and

$$P_H(R < m+1) = P_H(R = m) + P_H(R = m-1) + \cdots + P_H(R = 0).$$

When $P_H(R < m)$ is subtracted from $P_H(R < m+1)$ the answer is always positive for any attainable m, since one more single hypergeometric probability, $P_H(R = m)$, is added to $P_H(R < m+1)$. Since $P_H(R < m) < P_H(R < m+1)$ is equivalent to stating that $P_H(P < p_1) < P_H(P < p_2)$ with $p_1 = m/n$ and $p_2 = (m+1)/n$, and $p_1 < p_2$, then as precision increases the performance measure increases. This same argument holds for recall because, with K the number of relevant documents in the collection for the query,

$$P_H(R < m) = P_H(R/K < m/K) = P_H(\text{recall} < m/K).$$

As the recall increases from m/K to (m+1)/K, more probability is added to the measure and it therefore increases. The probability itself incorporates the generality number, and it will be shown by example how this generality affects the measure. All three of the criteria which are most needed by a unique performance measure are therefore combined in this value. The theoretical range, 0 to 1, of this measure is also appealing for testing procedures and the analysis of results. Some measures for arbitrarily chosen results are shown in Table 1.

The use of this measure for feedback is the same as without feedback, except that when the ranks of the relevant documents retrieved in the first pass are frozen the measure adjusts for this by use of a new generality number. Suppose for a single query and two methods: number of documents = [...], number of relevant documents = [...].

[Table 1. Performance Results for up to [...] Retrieved Documents. Columns: Number Relevant; Number Drawn; Measure. The table is headed by the ranks of the relevant documents.]
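The probability measure of Section 4 can be computed directly from the hypergeometric distribution. The following is a minimal sketch, not the authors' program: the function names and the argument names `N` (collection size), `K` (relevant documents in the collection), `n` (documents drawn) and `m` (relevant documents retrieved) are assumptions chosen for illustration, and the numbers in the usage lines are made up.

```python
from math import comb


def hypergeom_pmf(i, N, K, n):
    """P(R = i): exactly i relevant documents in a random draw of n
    from a collection of N documents, K of which are relevant."""
    # Outcomes outside the attainable range have probability zero.
    if i < 0 or i > K or n - i < 0 or n - i > N - K:
        return 0.0
    return comb(K, i) * comb(N - K, n - i) / comb(N, n)


def probability_measure(m, N, K, n):
    """P_H(R < m): chance that a random draw of n documents contains
    fewer than m relevant ones (hypothetical helper name)."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(m))


# Illustrative values only: a 10-document collection, 2 documents drawn.
for m in range(4):
    print(m, probability_measure(m, N=10, K=3, n=2))

# Same retrieval result (1 relevant in 2 drawn), lower generality (K = 1).
print(probability_measure(1, N=10, K=1, n=2))
```

The measure grows with each additional relevant document retrieved, and for the same result it is larger when fewer relevant documents exist in the collection, which is the generality effect described in the text.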

Suppose that in the first [...] documents Method I produces 4 relevant and Method II [...] relevant. Then evaluation starting with this information on a feedback pass would evaluate the measure under the following conditions:

Method I conditions: n = [...], number relevant = [...]
Method II conditions: n = [...], number relevant = [...]

Performance would thereafter reflect exactly the same measures as if the conditions for Methods I and II were starting conditions.

5. Tests

One method of comparing two or more methods over the same set of queries in the same document collection would be to average the measure over the number of documents retrieved. This procedure would give one number for each method, and the highest such number could be stated to represent the best method. The difficulty with this method is that there is no way to know the statistical properties of this average, and therefore slight differences in the average of method i vs. method j cannot be proven significant. With a fixed set of queries and a fixed collection there is no randomness involved anyway.

Randomness can be introduced into the problem by claiming that the queries are a sample drawn from a set of queries, and that the test results show that at any point n a population of queries divides into a multinomial distribution where method i has probability $p_i$ of being the most successful. This procedure is discussed in May's thesis. Table 2 shows the suggested partition of queries and methods over n, the number of documents retrieved. Table 2 also shows a fictitious set of results. There is no hope of being correct in a decision if in reality the methods are exactly alike;

Table 2. Sample Calculation

Three methods: M1, M2, M3. Five queries: Q1, Q2, Q3, Q4, Q5.

1) At point n = 1 (1 document retrieved), for each query, credit is given to the method which has the highest value of the measure. In case of a tie at some point, choose one of the tied methods by chance. Example: [...]

2) For each n = i (i documents retrieved), sum over queries for each method. Example: [...]

3) Again sum, over n this time, for a total for each method. Total: [...]

4) Estimate $p_1$ for M1, $p_2$ for M2, and $p_3$ for M3 from the totals. Estimate: [...]

one can only state the "probability" of being correct in choosing method [...] in the example given if the ratio $p_i/p_j$ (= [...]) is actually greater than some value specified by the experimenter. For the example given, assuming there is a multinomial distribution (which is unlikely) and further that the ratio is [...], the probability that the choice of method [...] as best is correct is over .9, using Bechhofer's procedures. It should be stressed that this is not to claim a valid statistical test, but only to give some idea of the possible confidence one could have in choosing the largest $p_i$ as representing the best method.
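The tallying steps of the sample calculation (credit the best method for each query at each cutoff, break ties by chance, sum over queries and cutoffs, then estimate each $p_i$) can be sketched as follows. This is only an illustration under stated assumptions: the nested-list layout `scores[cutoff][query][method]`, the function name, and the numbers in the usage example are hypothetical, and the estimate is taken to be each method's share of the total credits.

```python
import random


def estimate_win_probabilities(scores, rng=None):
    """Tally which method has the highest measure for each (cutoff, query)
    pair and return (totals, estimated p_i).

    scores[c][q][k] is the measure of method k on query q at cutoff c
    (a hypothetical layout chosen for this sketch).
    """
    # A seeded generator keeps the chance tie-breaking reproducible.
    rng = rng or random.Random(0)
    n_methods = len(scores[0][0])
    totals = [0] * n_methods
    for per_query in scores:          # one entry per cutoff n
        for values in per_query:      # one entry per query
            best = max(values)
            tied = [k for k, v in enumerate(values) if v == best]
            totals[rng.choice(tied)] += 1   # ties are broken by chance
    grand_total = sum(totals)
    return totals, [t / grand_total for t in totals]


# Made-up measures: 2 cutoffs, 2 queries, 3 methods.
example = [
    [[0.9, 0.5, 0.1], [0.2, 0.8, 0.3]],
    [[0.7, 0.6, 0.1], [0.9, 0.4, 0.3]],
]
totals, p_hat = estimate_win_probabilities(example)
print(totals, p_hat)
```

As the text stresses, the resulting proportions only suggest which method is best. Whether the largest estimate can be trusted depends on the selection procedure applied to it and on the multinomial assumption actually holding.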

Bibliography

Bechhofer, R. E., Elmaghraby, S., and Morse, N., "A Single-Sample Multiple Decision Procedure for Selecting the Multinomial Event Which Has the Highest Probability", Annals of Mathematical Statistics, Vol. 30, No. 1, March 1959.

Cooper, W. S., "Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems", American Documentation, January 1968.

Goffman, W., and Newill, V. A., "A Methodology for Test and Evaluation of Information Retrieval Systems", Information Storage and Retrieval, Vol. 3, 1966.

Hodges, J. L., and Lehmann, E. L., Basic Concepts of Statistics, Holden-Day, San Francisco, 1964.

Lesk, M. E., "SIG: The Significance Programs for Testing the Evaluation Output", Report No. ISR-12 to the National Science Foundation, Section II, Cornell University, Department of Computer Science, 1967.

May, [...] C., "Evaluation of Search Methods in an Information Retrieval System", unpublished thesis for the Master of Arts degree, June 196[...].

Salton, G., and Lesk, M. E., "Computer Evaluation of Indexing and Text Processing", Report No. ISR-12 to the National Science Foundation, Section III, Cornell University, Department of Computer Science, 1967.

Williamson, R. E., "A Proposal to Ascertain the Relationship between the Generality Ratio and Performance Measure", unpublished paper.