A New Evaluation Measure. J. Joiner and L. Werner.


Abstract

The problems of evaluation and the needed criteria of evaluation measures in the SMART system of information retrieval are reviewed and discussed. Performance characteristics of a good evaluation measure are examined. The suggested measure $\Pr_N(P < R/n)$ (the probability, under the hypergeometric distribution, that the precision could be strictly less than the precision attained, where R = number relevant in the sample drawn, N = total number in collection, and n = size of sample drawn) is introduced and tested against the various criteria needed for a good evaluation measure. A statistical test of significance is explained.

1. Introduction

Among the principal obstacles to the evaluation of information retrieval methods are the following:

1) Interpolation between points of recall results in errors which are unsatisfactory in one way or another, depending upon the type of interpolation used.
2) A recall-precision curve sometimes gyrates wildly, and the averaging of many curves over queries has questionable reliability.
3) The statement "method A is better than method B" often depends upon the value one is measuring. A unique value measuring both recall and precision would be best.
4) Queries with different numbers of relevant documents do not receive different amounts of credit, although just by random

chance it is easier to get a relevant document for a query with [...] relevant than for one with 4-[...] relevant.
5) It has not been determined how to handle the evaluation of feedback methods in relation to the relevant documents retrieved before feedback.

This report will attempt to discuss and propose solutions to these specific problems.

2. Problems of Evaluation

One important aspect of information retrieval is obtaining a value for a given method which is a true measure of the method's effectiveness over many queries. On a recall-precision graph, the points of recall where one measures this effectiveness are 0.1, 0.2, ..., 0.9, 1.0. However, the only points available for a query with n relevant documents are 1/n, 2/n, ..., (n-1)/n, n/n. Obviously, for queries with different numbers of relevant documents, one may expect that few of the query's points, apart from n/n = 1.0, will coincide with the points 0.1, 0.2, ..., 0.9, 1.0. But presently, by interpolation of some kind, the precision values are found for each query at these points. Fig. 1 shows one such method. There can be no real justification for any method of interpolation used, for it is impossible to estimate a discrete function at a nonexistent point. Therefore, what is needed to solve this problem is a new base index for the graph that would involve no interpolation: an index that would, over many queries with different numbers of relevant documents, have only common points for all queries.

The averaging of points of recall-precision where interpolation must occur tends further to distort the measure of effectiveness. But even over

[Fig. 1. One method of interpolating precision at the standard recall points. Legible labels: Ranks of Relevant Documents; Total No. of Relevant Documents; Query Number.]

points that coincide with equal recall, different values can be obtained by different methods of averaging. Even at these common points, the values being averaged are somewhat in doubt. No correlations are made between the precision values and the generality number (G = number of relevant / total number in collection), which reflects how easy it would be, under random conditions alone, to select relevant documents. A good performance measure should control this randomness factor. Control should be in the sense that when the generality ratio is decreased in a way which preserves the observed performance level, the effect of the generality ratio on a performance measure could be observed. The measure proposed is known to reflect the generality number under equal performance, but a method of splitting a collection into two collections suggested by R. Williamson has not been tested.

3. Criteria for a Good Evaluation Measure

A good performance measure should fulfill the following criteria:

1) Recall values measure the effectiveness of a method by comparing the number of relevant documents retrieved to the total number of relevant documents, while precision measures this performance by comparing the number of relevant retrieved documents to the total number retrieved. These intuitively seem to be the best measures of performance available. Their biggest drawback is that they are two unique values, not one. A good measure should reflect both.
2) The generality number, as stated before, reflects the degree of effect that pure random chance selection will have on the method of retrieving relevant documents for a certain query. With this controlled, queries can be compared on a more common basis.
3) In theory at the least, the measure should appeal to the user and tester, and the values obtained should have a logical range. A range from 0 to 1 best suits a measure of performance and effectiveness. Any system which is effective at all should have values of the measure closer to 1 than to 0.

4. The Probability Measure

A very large urn is filled with N documents. For query q, a known number of these documents are relevant and the remainder are not. If n documents are drawn at random from the urn without replacement, the probability that fewer than m relevant documents are chosen completely at random is

$$P_H(R < m) = P_H(R = 0) + P_H(R = 1) + \cdots + P_H(R = m-1),$$

where R is the number of relevant documents in the sample drawn and each term is a single hypergeometric probability. This is equivalent to finding the probability by random chance that the precision P is less than m/n, for

$$P_H(R < m) = P_H(R/n < m/n) = P_H(P < m/n).$$

The higher this probability is, the less likely it is that the precision achieved was obtained by chance. This measure can be evaluated at any point n (the number of documents retrieved) that might be wanted for investigation. As precision increases from m/n to (m+1)/n, this value goes from $P_H(P < m/n)$ to $P_H(P < (m+1)/n)$, which are equal to $P_H(R < m)$ and $P_H(R < m+1)$ respectively, where

$$P_H(R < m) = P_H(R = m-1) + P_H(R = m-2) + \cdots + P_H(R = 0)$$

and

$$P_H(R < m+1) = P_H(R = m) + P_H(R = m-1) + \cdots + P_H(R = 0).$$

When $P_H(R < m)$ is subtracted from $P_H(R < m+1)$ the answer is always positive for any attainable value of m, since one more single hypergeometric probability, $P_H(R = m)$, is added to $P_H(R < m+1)$. Since $P_H(R < m) < P_H(R < m+1)$ is equivalent to stating that $P_H(P < p_1) < P_H(P < p_2)$ with $p_1 = m/n$ and $p_2 = (m+1)/n$, and $p_1 < p_2$, then as precision increases the performance measure increases.

This same argument holds for recall because

$$P_H(r < m) = P_H(r/R < m/R) = P_H(\text{recall} < m/R),$$

where, in this expression only, r denotes the number of relevant documents drawn and R the total number of documents relevant to the query. As the recall increases from m/R to (m+1)/R, more probability is added to the measure and it therefore increases. The probability itself incorporates the generality number, and it will be shown by example how this generality affects the measure. All three of the criteria which are most needed by a unique performance measure are therefore combined in this value. The theoretical range, 0 to 1, of this measure is also appealing for testing procedures and the analysis of results. Some measures for arbitrarily chosen results are shown in Table 1.

The use of this measure for feedback is the same as without feedback, except that when the ranks of the relevant documents retrieved in the first pass are frozen, the measure adjusts for this by use of a new generality number. Suppose for a single query and two methods

number of documents = [...]
number of relevant documents = [...]

[Table 1. Performance Results for up to [...] Retrieved Documents. Header: Ranks of Relevant Documents. Columns: Number Relevant; Number Drawn; Measure. The numeric entries are not legible.]
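Values of the measure such as those tabulated in Table 1 can be computed directly from the hypergeometric distribution. The following is a minimal Python sketch, not the authors' program. The function names, the argument names (`total` for the collection size, `relevant` for the number of documents relevant to the query, `drawn` for the number retrieved) and the numbers in the example are all assumptions chosen for illustration.

```python
from math import comb


def hypergeom_pmf(k, total, relevant, drawn):
    """P_H(R = k): exactly k relevant documents among `drawn` documents
    sampled without replacement from a collection of `total` documents,
    of which `relevant` are relevant to the query."""
    if k < 0 or k > drawn or k > relevant or drawn - k > total - relevant:
        return 0.0
    return comb(relevant, k) * comb(total - relevant, drawn - k) / comb(total, drawn)


def probability_measure(m, total, relevant, drawn):
    """P_H(R < m): probability that purely random drawing yields fewer
    than m relevant documents, i.e. precision below m / drawn."""
    return sum(hypergeom_pmf(k, total, relevant, drawn) for k in range(m))


# Hypothetical collection: 200 documents, 10 relevant, 10 retrieved.
for m in range(1, 6):
    print(m, probability_measure(m, total=200, relevant=10, drawn=10))

# Same observed precision (4 of 10), but a higher generality number:
# chance alone explains more, so the measure is lower.
print(probability_measure(4, total=200, relevant=20, drawn=10))
```

Under these assumptions the measure rises with the number of relevant documents retrieved and falls as the generality number rises, which is the behaviour argued for in Section 4.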

Suppose that in the first [...] documents Method I produces 4 relevant and Method II [...] relevant. Then evaluation starting with this information on a feedback pass would evaluate the measure as

Method I conditions: n = [...], number relevant = [...]
Method II conditions: n = [...], number relevant = [...]

Performance would thereafter reflect exactly the same measures as if the conditions for Methods I and II were starting conditions.

5. Tests

One method of comparing two or more methods over the same set of queries in the same document collection would be to average the measure over the number of documents retrieved. This procedure would give one number for each method, and the highest such number could be stated to represent the best method. The difficulty with this method is that there is no way to know the statistical properties of this average, and therefore slight differences in the average of method i vs. method j cannot be proven significant. With a fixed set of queries and a fixed collection there is no randomness involved anyway. Randomness can be introduced into the problem by claiming that the queries are a sample drawn from a set of queries and that the test results show that at any point n a population of queries divides into a multinomial distribution where method i has probability $p_i$ of being the most successful. This procedure is discussed in May's thesis. Table 2 shows the suggested partition of queries and methods over n, the number of documents retrieved. Table 2 also shows a fictitious set of results. There is no hope of being correct in a decision if in reality the methods are exactly alike;

Table 2. Sample Calculation

Three methods: M1, M2, M3. Five queries: Q1, Q2, Q3, Q4, Q5.

1. At point n = 1 (one document retrieved), a value of 1 is given to the method which has the highest value of the measure. In case of a tie at some point, choose one of the tied methods by chance. Example (rows M1, M2, M3; columns Q1 to Q5): [...]
2. For each n_i = i (i documents retrieved), sum over queries for each method. Example: [...]
3. Again sum, over n_i this time, for a total for each method. Total for M1, M2, M3: [...]

Estimate: p1 for M1 = [...], p2 for M2 = [...], p3 for M3 = [...]

one can only state the "probability" of being correct in choosing method [...] in the example given if the ratio θ of the largest probability to the next largest is actually greater than some value specified by the experimenter. For the example given, assuming there is a multinomial distribution (which is unlikely) and further that this ratio equals [...], the probability that the choice of method [...] as best is correct is over 0.9, using Bechhofer's procedures. It should be stressed that this is not to claim a valid statistical test but only to give some idea of the possible confidence one could have in choosing the largest $p_i$ as representing the best method.
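The tallying procedure of this section (credit the winning method at each query and each retrieval point, break ties by chance, then sum and normalise) can be sketched as follows. This is an illustrative Python sketch only: the function name, the nested-list layout of `scores` and the score values are hypothetical, and it stops at the estimates of $p_i$ without implementing Bechhofer's selection procedure.

```python
import random


def estimate_win_probabilities(scores, rng=None):
    """Estimate p_i, the probability that method i is the most successful.

    scores[method][q][i] is the value of the measure for that method on
    query q after i + 1 documents have been retrieved (assumed layout).
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so tie-breaking is repeatable
    methods = list(scores)
    n_queries = len(scores[methods[0]])
    n_points = len(scores[methods[0]][0])

    wins = {m: 0 for m in methods}
    for q in range(n_queries):
        for i in range(n_points):
            best = max(scores[m][q][i] for m in methods)
            tied = [m for m in methods if scores[m][q][i] == best]
            wins[rng.choice(tied)] += 1  # tie broken by chance

    total = n_queries * n_points
    return {m: wins[m] / total for m in methods}


# Hypothetical measures for two methods, two queries, two retrieval points.
example = {
    "M1": [[0.9, 0.8], [0.7, 0.6]],
    "M2": [[0.5, 0.9], [0.1, 0.2]],
}
print(estimate_win_probabilities(example))
```

The estimates sum to 1 by construction, so they can be read as the observed proportions of a multinomial sample, subject to the caveats about that assumption stated above.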

Bibliography

Bechhofer, R. E., Elmaghraby, S., and Morse, N., "A Single-sample Multiple Decision Procedure for Selecting the Multinomial Event which has the Highest Probability", Annals of Mathematical Statistics, Vol. 30, No. 1, March 1959.

Cooper, W. S., "Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems", American Documentation, January 1968.

Goffman, W., and Newill, V. A., "A Methodology for Test and Evaluation of Information Retrieval Systems", Information Storage and Retrieval, Vol. 3, 1966.

Hodges, J. L., and Lehmann, E. L., Basic Concepts of Statistics, Holden-Day, San Francisco, 1964.

Lesk, M. E., "SIG - The Significance Programs for Testing the Evaluation Output", Report No. ISR-[...] to the National Science Foundation, Section II, Cornell University, Department of Computer Science, 196[...].

May, [...] C., "Evaluation of Search Methods in an Information Retrieval System", unpublished thesis for the Master of Arts degree, June 196[...].

Salton, G., and Lesk, M. E., "Computer Evaluation of Indexing and Text Processing", Report No. ISR-[...] to the National Science Foundation, Section III, Cornell University, Department of Computer Science, 196[...].

Williamson, R. E., "A Proposal to Ascertain the Relationship between the Generality Ratio and Performance Measure", unpublished paper.