Hypothesis Testing for the Risk-Sensitive Evaluation of Retrieval Systems


B. Taner Dinçer
Dept of Statistics & Computer Engineering, Mugla University, Mugla, Turkey

Craig Macdonald and Iadh Ounis
School of Computing Science, University of Glasgow, Glasgow, UK

ABSTRACT
The aim of risk-sensitive evaluation is to measure when a given information retrieval (IR) system does not perform worse than a corresponding baseline system for any topic. This paper argues that risk-sensitive evaluation is akin to the underlying methodology of the Student's t test for matched pairs. Hence, we introduce a risk-reward tradeoff measure T_Risk that generalises the existing U_Risk measure (as used in the TREC 2013 Web track's risk-sensitive task) while being theoretically grounded in statistical hypothesis testing and easily interpretable. In particular, we show that T_Risk is a linear transformation of the t statistic, which is the test statistic used in the Student's t test. This inherent relationship between T_Risk and the t statistic turns risk-sensitive evaluation from a descriptive analysis into a fully-fledged inferential analysis. Specifically, we demonstrate, using past TREC data, that by using the inferential analysis techniques introduced in this paper, we can (1) decide whether an observed level of risk for an IR system is statistically significant, and thereby infer whether the system exhibits a real risk, and (2) determine the topics that individually lead to a significant level of risk. Indeed, we show that the latter permits a state-of-the-art learning to rank algorithm (LambdaMART) to focus on those topics in order to learn effective yet risk-averse ranking systems.

Categories and Subject Descriptors: H.3.3 [Information Storage & Retrieval]: Information Search & Retrieval; G3.3 [Probability and Statistics]: Experimental design

Keywords: Risk-Sensitive Evaluation, Student's t Test

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. SIGIR'14, July 6-11, 2014, Gold Coast, Queensland, Australia. Copyright 2014 ACM /14/07...$

1. INTRODUCTION
Various paradigms for the evaluation of information retrieval (IR) systems rely on many topics to produce reliable estimates of their effectiveness. For instance, in the TREC series of evaluation forums, 50 topics is generally seen as the minimum for producing a reliable test collection [2, 25]. However, in more recent times, the evaluation of systems has increasingly focused upon their robustness: ensuring that a given IR system performs well on difficult topics (as investigated by the TREC Robust track [24]), or at least as well as a baseline system (which is known as risk-sensitive evaluation [26]). Recently, the TREC 2013 Web track introduced a risk-sensitive task, which assessed how systems could perform effectively yet without exhibiting large losses compared to a pre-determined baseline system [10]. In such a risk-sensitive evaluation, the risk associated with an IR system is defined as the risk of performing a given topic less effectively than a given baseline system [8, 9, 26]. In particular, the U_Risk risk-sensitive evaluation measure [26] calculates the absolute difference of an effectiveness measure (e.g. NDCG) between a given retrieval system and the baseline system, in a manner that more strongly emphasises decreases with respect to the baseline (known as risk) than gains (reward).
A parameter α ≥ 0 controls the risk-reward tradeoff towards losses in effectiveness compared to the baseline, where α = 0 weights risk and rewards equally.

In this paper, we argue that in the current practice of risk-sensitive evaluation based on U_Risk, any amount of loss in an IR system's average effectiveness, observed on a particular set of topics, is considered enough in magnitude to infer that the system exhibits a real risk. However, from a statistical viewpoint, such an inferential decision may be said to be valid only if the observed amount of loss cannot be attributed to chance fluctuation. Otherwise, it will be equally likely that the corresponding system may or may not be under a real risk, meaning that it is possible that the system can perform every topic with a score higher than that of the baseline system on another set of topics that could be drawn from the population of topics. On the other hand, it is also possible that the observed amount of loss in a particular system's average effectiveness can be attributed to chance fluctuation, while the corresponding performance losses for some individual topics are statistically significant in magnitude. In other words, significant performance losses for a few topics may not result in a significant total loss on average, given a relatively large set of topics. Hence, we advocate that risk-sensitive evaluation can actually provide the necessary basis for (i) testing the significance of the observed amount of loss in a given IR system's average effectiveness, called inferential risk analysis in this paper, and (ii) testing the significance of the corresponding losses for individual topics, called exploratory risk analysis. Indeed, we show that the U_Risk risk-reward tradeoff measure is actually a linear transformation of the t statistic, as used in the Student's t test.
Therefore, using this statistical interpretation of U_Risk based upon hypothesis testing, this paper proposes a new risk-reward tradeoff measure, T_Risk, which is a linear transformation of the existing U_Risk measure, yet is theoretically grounded upon the Student's t test for testing the significance of the observed amount of loss in a given IR system's average effectiveness. For α = 0, T_Risk is equivalent to the standard t statistic used typically in the Student's t test for testing the null hypothesis of equality in the population mean effectiveness for two IR systems. However, for α > 0, the U_Risk measure emphasises performance losses compared to the baseline effectiveness. This raises challenges in the estimation of the standard error of the calculated U_Risk scores. For this reason, we propose the use of the Jackknife technique (or leave-one-out) [11], which is a re-sampling technique for estimating the bias and the standard error of any estimate. The Jackknife technique serves two purposes: firstly, to allow the empirical verification of the estimation of the standard error of U_Risk as valid; and secondly, for testing the significance of the corresponding performance losses for individual topics.

From a practical perspective, a risk-sensitive evaluation serves two objectives: firstly, as a step further than the classical evaluation of IR systems, which takes into account the stability or variance of retrieval results across queries as well as the average retrieval effectiveness [8, 9]; and secondly, as a technique for jointly optimising the retrieval effectiveness and robustness of retrieval frameworks such as learning to rank [26]. Indeed, compared to the existing U_Risk measure, this paper contributes to both objectives, by exploiting the theory of statistical hypothesis testing for allowing meaningful interpretation of risk-sensitive evaluation scores, and also by allowing a learning to rank technique, namely LambdaMART, to focus on those topics that lead to a significant level of risk, in order to learn effective yet risk-averse ranking systems.
The remainder of this paper is structured as follows: Section 2 provides an overview of risk-sensitive evaluation practices, including U_Risk; Section 3 relates the U_Risk measure to the t statistic, and hence proposes the new T_Risk risk-sensitive evaluation measure, and discusses the estimation of the standard error. Section 4 and Section 5 describe new forms of analysis, inferential and exploratory respectively, that arise from the T_Risk measure, and demonstrate their application upon the TREC 2012 Web track. Next, Section 6 shows how T_Risk can improve the robustness of the LambdaMART state-of-the-art learning to rank technique. Finally, we review some related work and provide concluding remarks in Sections 7 & 8, respectively.

2. RISK-SENSITIVE EVALUATION
Different approaches in IR such as query expansion [1, 5] and learning to rank [17] behave differently across topics, often improving the effectiveness for some of the topics while degrading performance for others. This results in a high variation in effectiveness across the topics. To address such variation, there has been an increasing focus on the effective tackling of difficult topics in particular (e.g. through the TREC Robust track [23]), or more recently, on the risk-sensitive evaluation of systems across many topics [8, 9, 26].

Originally, the aim of risk-sensitive evaluation [9] was to provide new analysis techniques for quantifying and visualising the risk-reward tradeoff of any retrieval strategy that requires a balance between risk and reward. Hence, it facilitates the quest for ranking strategies that are more robust in retrieval effectiveness compared to a baseline retrieval strategy (robust in the sense of the stability or variance of the retrieval results across topics), while achieving good average performance over all topics.

The variance with respect to a given baseline system b over a given set of topics Q with c topics can then be measured as a risk function F_Risk, which takes into account the downside-risk of a new system r (i.e. performing a topic worse than the baseline), and is defined in [26] as follows:

\[ F_{Risk} = \frac{1}{c} \sum_{i=1}^{c} \max[0, (b_i - r_i)], \tag{1} \]

where $r_i$ and $b_i$ are respectively the score of the system r and the score of the baseline system b on topic i, as measured by a retrieval effectiveness measure (e.g. NDCG@20, ERR@20 [6]). Similarly, a reward function F_Reward, which takes into account the upside-risk (i.e. performing a topic better than the baseline), is defined as:

\[ F_{Reward} = \frac{1}{c} \sum_{i=1}^{c} \max[0, (r_i - b_i)]. \tag{2} \]

Thereby, the overall gain in the retrieval effectiveness of r with respect to b can be expressed as:

\[ U_{Gain} = F_{Reward} - F_{Risk}. \tag{3} \]

Next, a single measure, U_Risk [26], which allows the risk-reward tradeoff to be adjusted, was defined:

\[ U_{Risk} = U_{Gain} - \alpha \cdot F_{Risk} = \frac{1}{c} \Big[ \sum_{q \in Q_{+}} \delta_q + (1+\alpha) \sum_{q \in Q_{-}} \delta_q \Big], \tag{4} \]

where $\delta_q = r_q - b_q$. The left summand in the square brackets, which is the sum of the score differences $\delta_q$ for all q where $r_q > b_q$ (i.e., $q \in Q_{+}$), gives the total win (or upside-risk) with respect to the baseline. Orthogonally, the right summand, which is the sum of the score differences $\delta_q$ for all q where $r_q < b_q$ (i.e., $q \in Q_{-}$), gives the total loss (or downside-risk). The risk sensitivity parameter α ≥ 0 controls the tradeoff between reward and risk (or win and loss): α = 0 results in a pure gain model, while for higher α, the penalty for under-performing with respect to the baseline is increased: typically α = 1, 5, 10 [10].

In this paper, we extend the original aforementioned aim of risk-sensitive evaluation with the following contributions:
1. A well-established statistical hypothesis testing theory for risk-sensitive evaluations from which arises a new risk measure T_Risk (Section 3), to turn risk-sensitive evaluation from a descriptive analysis to a fully-fledged inferential analysis (Section 4).
2. A method for exploratory risk analysis that can identify the topics that commit real levels of risk (Section 5).
3. Adaptations of the proposed T_Risk measure that can enhance the robustness of the state-of-the-art LambdaMART learning to rank technique, compared to U_Risk, without degradations in overall effectiveness, where the learned model adaptively adjusts with respect to the risk level committed by individual topics (Section 6).

3. THE NEW T_RISK MEASURE
Without loss of generality, at α = 0, the risk-reward tradeoff measure U_Risk reduces to the U_Gain formula in Eq. (3), which can be expressed as the average gain over the c topics:

\[ U_{Gain} = \frac{1}{c} \sum_{i=1}^{c} \delta_i = \frac{1}{c} \sum_{i=1}^{c} (r_i - b_i). \tag{5} \]

In the context of statistics, U_Gain refers to the sample mean of paired score differences, $\bar{d}$, for two IR systems (the system under evaluation r and the baseline system b):

\[ \bar{d} = \bar{r} - \bar{b} = \frac{1}{c} \sum_{i=1}^{c} (r_i - b_i) = U_{Gain} \tag{6} \]

and in the context of evaluating IR systems, this refers to the difference in average effectiveness between two IR systems, $\bar{r} - \bar{b}$, where $\bar{r}$ and $\bar{b}$ are respectively the average effectiveness of system r and the average effectiveness of the baseline system b over the c topics.

On the other hand, the Student's t statistic for matched pairs, as is commonly applied when testing the significance of results between two systems, can be expressed as:

\[ t = \frac{\bar{d}}{SE(\bar{d})} \; \Big[ = \frac{\bar{r} - \bar{b}}{SE(\bar{d})} \Big]. \tag{7} \]

Within Eq. (7), the standard error of the paired sample mean, $SE(\bar{d})$, can be estimated as follows:

\[ SE(\bar{d}) = \frac{s_d}{\sqrt{c}}, \tag{8} \]

where $s_d = \sqrt{\frac{1}{c} \sum_{i=1}^{c} (\delta_i - \bar{d})^2}$ is the paired sample standard deviation. Hence, we argue that the Student's t statistic of Eq. (7) is actually a linear transformation of U_Gain from Eq. (3), which we call T_Gain:

\[ T_{Gain} = \frac{U_{Gain}}{SE(U_{Gain})} = \frac{\sqrt{c}}{s_d} \, U_{Gain}. \tag{9} \]

This transformation can be referred to as studentisation (c.f., t-scores) [14], which in fact is a type of standardisation (i.e., z-scores). Standardisation is a monotonic linear transformation, which transforms any given set of data to a set with zero mean and unit variance, while preserving the original data distribution in shape.

The t-score of a raw U_Gain measurement, T_Gain, differs from the raw measurement in two important aspects. First, given a set of IR systems, a test collection, and a baseline system, the systems ranking to be obtained on the basis of T_Gain will not necessarily be concordant with the systems ranking to be obtained on the basis of U_Gain, since the t statistic takes into account the inherent variation in the observed paired score differences $r_i - b_i$ across the topics, i.e., SE(U_Gain).
Second, given a particular baseline system, the two T_Gain scores to be obtained on two different test collections for the same IR system are comparable with each other in magnitude, at least in theory [7], while the two U_Gain scores are not, as is typical in the case of two raw effectiveness scores yielded by a standard effectiveness measure, such as mean average precision [28].

Having shown how T_Gain can be defined as a linear transformation of U_Gain, based upon the t statistic, we now examine U_Risk, which allows the risk-reward tradeoff to be controlled by the α parameter. For α ≥ 0, the t statistic based on U_Risk, which we call T_Risk, can be expressed as follows:

\[ T_{Risk} = \frac{U_{Risk}}{SE(U_{Risk})}. \tag{10} \]

Although both the T_Gain formula in Eq. (9) and the T_Risk formula in Eq. (10) stem from the classical t statistic in Eq. (7), the estimation of the standard error of U_Risk, i.e. SE(U_Risk) within T_Risk, is not as straightforward as in the case of SE(U_Gain), because the U_Risk formula reweighs the score differences $\delta_i$ in averaging, proportionally to α, for each topic i where $r_i < b_i$, as opposed to U_Gain. Hence, in the remainder of this section, we propose two methods to estimate SE(U_Risk): a speculative parametric estimator SE_x that is analogous to the paired sample standard deviation $s_d$ (Section 3.1); and a nonparametric Jackknife estimator SE_J, based on the leave-one-out Jackknife technique (Section 3.2). Indeed, later in Section 3.3, we use the Jackknife estimator SE_J to show the validity of the speculative SE_x estimator.

On the other hand, T_Risk has several advantages over U_Risk. Firstly, it can be easily interpreted for an inferential analysis of risk. Indeed, we will later show in Section 4 that in order to test the significance of an observed risk-reward tradeoff score between a particular IR system and a provided baseline system, one can use T_Risk as the test statistic of the Student's t test for matched pairs.
Secondly, T_Risk permits the identification of topics that commit significant risk or not; we call this exploratory risk analysis, which we present later in Section 5. Finally, this exploratory risk analysis leads to new risk-sensitive measures that can be directly integrated into the LambdaMART learning to rank technique, to produce learned models that exhibit less risk than those obtained from U_Risk whilst not degrading effectiveness, as explained in Section 6.

3.1 Parametric Estimator of SE(U_Risk)
Let the random variable $X_i$ denote the risk-reward tradeoff score between system r and baseline b for topic i:

\[ X_i = \begin{cases} \delta_i & \text{if } r_i > b_i \\ (1+\alpha)\,\delta_i & \text{if } r_i < b_i \end{cases} \tag{11} \]

for i = 1, 2, ..., c and a predefined value of α ≥ 0. Then, the standard error of U_Risk, SE(U_Risk), can be approximated by the standard error of the sample mean $\bar{x}$:

\[ SE_x = \frac{s_x}{\sqrt{c}}, \tag{12} \]

where $s_x^2 = \frac{1}{c} \sum_{i=1}^{c} (x_i - \bar{x})^2$. Here, the sample mean $\bar{x}$ corresponds to the U_Risk score considered as the arithmetic mean of the sample of the observed individual topic risk-reward tradeoff scores $x_1, x_2, \ldots, x_c$ at a predefined value of α:

\[ \bar{x} = U_{Risk} = \frac{1}{c} \sum_{i=1}^{c} x_i. \tag{13} \]

This parametric estimator of SE(U_Risk), SE_x, is speculative and hence its validity might be compromised to some extent. Therefore, we empirically verify the validity of SE_x in estimating SE(U_Risk) by comparing it with a nonparametric re-sampling technique, called the Jackknife [21], which we present in Section 3.2. Indeed, by comparing the two estimates of SE(U_Risk) (i.e., the parametric estimate SE_x of Eq. (12) and the nonparametric Jackknife estimate of SE(U_Risk)), one can decide whether an inference to be made on the basis of the T_Risk statistic is valid. If the two estimates agree with each other, such an inference may be said to be valid; otherwise its validity is compromised.
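As a rough illustration of Eqs. (10) to (13), the following Python sketch computes the per-topic tradeoff scores x_i, U_Risk, the parametric estimate SE_x and the resulting T_Risk from two lists of per-topic effectiveness scores. The function names and the example scores are hypothetical and are not taken from the paper. The sketch assumes that s_x is computed with the divisor c (the number of topics) and that the tradeoff scores are not all identical, since SE_x would otherwise be zero.

```python
import math


def tradeoff_scores(run, baseline, alpha=0.0):
    """Per-topic scores x_i (Eq. 11): gains are kept, losses are scaled by (1 + alpha)."""
    scores = []
    for r, b in zip(run, baseline):
        delta = r - b
        scores.append(delta if delta >= 0 else (1.0 + alpha) * delta)
    return scores


def u_risk(run, baseline, alpha=0.0):
    """U_Risk as the arithmetic mean of the per-topic tradeoff scores (Eq. 13)."""
    x = tradeoff_scores(run, baseline, alpha)
    return sum(x) / len(x)


def se_x(run, baseline, alpha=0.0):
    """Parametric estimate s_x / sqrt(c) (Eq. 12); the divisor c in s_x is an assumption."""
    x = tradeoff_scores(run, baseline, alpha)
    c = len(x)
    mean = sum(x) / c
    s_x = math.sqrt(sum((v - mean) ** 2 for v in x) / c)
    return s_x / math.sqrt(c)


def t_risk(run, baseline, alpha=0.0):
    """T_Risk = U_Risk / SE(U_Risk), using SE_x as the standard error (Eq. 10)."""
    return u_risk(run, baseline, alpha) / se_x(run, baseline, alpha)


# Made-up per-topic scores (e.g. ERR@20) for a run and a baseline on four topics.
run = [0.5, 0.2, 0.4, 0.1]
baseline = [0.3, 0.3, 0.3, 0.3]
for alpha in (0, 1, 5, 10):
    print(alpha, u_risk(run, baseline, alpha), t_risk(run, baseline, alpha))
```

With these made-up scores the wins and losses cancel at α = 0, and both U_Risk and T_Risk become increasingly negative as α grows, which is the behaviour the α parameter is meant to produce.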
3.2 Jackknife Estimate of SE(U_Risk)
In this paper, the Jackknife technique is employed to serve two different aims: 1) as a mechanism for the empirical verification of the validity of an inference to be made based on the T_Risk statistic in Eq. (10), and 2) as a mechanism for exploratory risk analysis.

Jackknife, which is also known as the Quenouille-Tukey Jackknife or leave-one-out, was first introduced by Quenouille [18] and then developed by Tukey [21]. Tukey used the Jackknife technique to determine how an estimate is affected by the subsets of observations when discordant values (i.e., outlier data) are present. In the presence of discordant values, it is expected that the Jackknife technique could reduce the bias in the estimate. Although the original objective of Jackknife is to detect outliers, in principle it is a re-sampling technique for estimating the bias and the standard error of any estimate [11]. In Jackknife, the same test is repeated by leaving one subject out each time: this explains why this technique is also referred to as leave-one-out.

Let the random variables $X_1, X_2, \ldots, X_c$ denote a random sample of size c, such that $X_i$ is drawn identically and independently from a distribution F for i = 1, 2, ..., c. Suppose that the goal is to estimate an unknown parameter θ of F. It can be shown that θ can be estimated by a statistic $\hat{\theta}$, which is derived from an observed sample $x_1, x_2, \ldots, x_c$ from F, with a measurable amount of sampling error [15]. An unbiased estimator $\hat{\theta}$ is a statistic whose expected value $E(\hat{\theta})$ is equal to the true value of the population parameter of interest θ, i.e., $E(\hat{\theta}) = \theta$. The amount of bias associated with an estimator is therefore given by:

\[ bias(\hat{\theta}) = E(\hat{\theta} - \theta) = E(\hat{\theta}) - \theta. \tag{14} \]

We denote as $X_{(i)}$ the sub-sample without the datum $X_i$. There are in total c sub-samples of size c − 1, for i = 1, 2, ..., c:

\[ X_{(i)} = X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_c. \]

Next, let the estimate derived from the i-th sub-sample $X_{(i)}$ be denoted as $\hat{\theta}_{(i)}$, and the mean over the c sub-samples be:

\[ \hat{\theta}_{(.)} = \frac{1}{c} \sum_{i=1}^{c} \hat{\theta}_{(i)}. \tag{15} \]

The Jackknife estimate of bias, which is actually a nonparametric estimate of $E(\hat{\theta} - \theta)$, is defined as follows [21]:

\[ bias_J(\hat{\theta}) = (c-1)(\hat{\theta}_{(.)} - \hat{\theta}) = \frac{c-1}{c} \sum_{i=1}^{c} (\hat{\theta}_{(i)} - \hat{\theta}), \]

and, in accordance, the bias-reduced Jackknife estimate of θ is defined as

\[ \tilde{\theta} = \hat{\theta} - bias_J(\hat{\theta}) = c\,\hat{\theta} - (c-1)\,\hat{\theta}_{(.)}. \]

Tukey [21] showed that the Jackknife technique can also be used to estimate the variance of $\hat{\theta}$ by introducing the so-called pseudo-values, $\tilde{\theta}_{(i)} = c\,\hat{\theta} - (c-1)\,\hat{\theta}_{(i)}$, such that

\[ var_J(\hat{\theta}) = \frac{1}{c(c-1)} \sum_{i=1}^{c} \big[ \tilde{\theta}_{(i)} - \tilde{\theta} \big]^2 = \frac{c-1}{c} \sum_{i=1}^{c} \big[ \hat{\theta}_{(i)} - \hat{\theta}_{(.)} \big]^2. \]
This nonparametric Jackknife estimate of variance gives the empirical estimate of the standard error of $\hat{\theta}$:

\[ SE(\hat{\theta}) = \sqrt{var_J(\hat{\theta})}. \tag{16} \]

For the T_Risk statistic in Eq. (10), the standard error of U_Risk, SE(U_Risk), can hence be estimated by substituting U_Risk into Eq. (16) as $\hat{\theta}$:

\[ SE_J = \sqrt{var_J(U_{Risk})}. \tag{17} \]

3.3 Empirical Validation of SE(U_Risk)
The nonparametric estimator SE_J is an alternative to the parametric estimator SE_x (Eq. (12)). In this section, we empirically compare these estimates of SE(U_Risk) with each other, to assess the validity of the result of a hypothesis test to be performed using T_Risk as the test statistic. In general, if the two estimates agree, the test result may be said to be valid, and otherwise its validity will be compromised. As a result, nonparametric methods can help to alleviate doubts about the validity of the analysis performed [14].

In the following, we compare the estimates using the submitted runs to the TREC Web track. In particular, the provided baseline run for the TREC 2013 Web track risk-sensitive task is based on the Indri retrieval platform. However, as the submitted runs and results for the TREC 2013 campaign were not yet publicly available at the time of writing, in the following we perform an empirical study based on runs submitted to the TREC 2012 Web track. Indeed, the 2013 track coordinators have made available a set of Indri runs on the TREC 2012 Web track topics that correspond to the TREC 2013 baseline runs. In our results, we use the 2012 equivalent run to the 2013 pre-determined baseline, the so-called indricasp. We report the U_Risk values obtained using the official TREC 2012 evaluation measure, ERR@20.
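The comparison described in this section can be sketched in a few lines of Python. The sketch below computes the parametric estimate SE_x and the leave-one-out Jackknife estimate SE_J for each run, then summarises their disagreement as an RMSE. The function names and the per-topic scores are hypothetical (they are not TREC data), and the divisor c inside s_x is an assumption.

```python
import math


def mean(values):
    return sum(values) / len(values)


def parametric_se(x):
    """SE_x (Eq. 12), assuming s_x is computed with the divisor c."""
    c = len(x)
    m = mean(x)
    s_x = math.sqrt(sum((v - m) ** 2 for v in x) / c)
    return s_x / math.sqrt(c)


def jackknife_se(x, estimator=mean):
    """SE_J (Eq. 17): leave one topic out, re-estimate, and pool the spread."""
    c = len(x)
    leave_one_out = [estimator(x[:i] + x[i + 1:]) for i in range(c)]
    centre = mean(leave_one_out)
    var_j = (c - 1) / c * sum((t - centre) ** 2 for t in leave_one_out)
    return math.sqrt(var_j)


def rmse(pairs):
    """Root mean square error between paired estimates."""
    return math.sqrt(mean([(a - b) ** 2 for a, b in pairs]))


# Hypothetical per-topic tradeoff scores x_i for two runs.
runs = {"runA": [0.2, -0.2, 0.1, -0.4], "runB": [0.3, 0.1, -0.6, 0.2]}
pairs = [(parametric_se(x), jackknife_se(x)) for x in runs.values()]
print(rmse(pairs))
```

Under these assumptions, and because U_Risk is a plain mean of the x_i, the two estimates differ only by the factor sqrt((c − 1)/c), which is about 0.99 for c = 50 topics. Passing a different `estimator` lets the same Jackknife routine be reused for statistics that have no closed-form standard error.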
Table 1 reports the parametric estimates (SE_x) and the nonparametric Jackknife estimates (SE_J) of the standard errors associated with the average risk-reward tradeoff scores (U_Risk), calculated for each of the TREC 2012 Web track top 8 ad-hoc runs over c = 50 topics, with respect to the indricasp baseline, applying several risk-sensitivity parameter values of α = 0, 1, 5, 10. From the results, it can be observed that the two estimates, SE_x and SE_J, agree with each other for each of the 8 runs. In fact, over all of the 48 runs submitted to the TREC 2012 Web track, we observe a Root Mean Square Error (RMSE) of ... between SE_x and SE_J. Thus, we conclude that it is highly likely that it would be valid to conduct an inferential risk analysis upon those TREC 2012 runs based on the new risk-reward tradeoff measure T_Risk (Eq. (10)), regardless of how SE(U_Risk) is estimated. An example of inferential risk analysis based on T_Risk follows in the next section.

4. INFERENTIAL RISK ANALYSIS
The goal of the classical evaluation of IR systems is to decide whether one IR system is better in retrieval effectiveness than another on the population of topics. This goal can be formulated into a (two-sided) null hypothesis, as given by:

\[ H_0 : \mu_r = \mu_b \quad \text{or} \quad H_0 : \mu_r - \mu_b = 0, \tag{18} \]

against the alternative hypothesis $H_1 : \mu_r \neq \mu_b$, where $\mu_r$ and $\mu_b$ represent respectively the population mean performance of the system r and the population mean performance of the baseline system b. The test statistic for this null hypothesis is the t statistic (Eq. (7)), since the larger values of t are evidence against the null hypothesis $H_0 : \mu_r - \mu_b = 0$. Below, we describe the hypothesis testing of $H_0$ in abstract terms, before explaining how it can be applied to T_Risk (Section 4.1) and illustrating its application upon the TREC 2012 Web track runs (Section 4.2).
In order to decide how much difference between the two sample means $\bar{r}$ and $\bar{b}$ is assumed to be large enough to reject the null hypothesis, we should first determine how much difference can be attributed to chance fluctuation. It can be shown that, under the null hypothesis $H_0$, the sampling (or null) distribution of the test statistic t can be approximated by a Student's t distribution with df = c − 1 degrees of freedom for any population distribution with finite mean µ and variance σ² > 0, because of the central limit theorem [12]. Thus, at a predefined significance level of γ (typically γ = 0.05 for 95% confidence), two standard deviations ($\pm t_{(\gamma/2,df)} \cdot SE(\bar{d})$) determine the maximum difference that can be attributed to chance fluctuation, where in between the critical values $\pm t_{(\gamma/2,df)}$ the area under the Student's t distribution sums up to (1 − γ). If an observed t-score is greater than $t_{(\gamma/2,df)}$, or less than $-t_{(\gamma/2,df)}$, one can reject $H_0$ with 100(1 − γ)% confidence, denoted as the p-value.

Table 1: Calculated risk-reward tradeoff scores, U_Risk, for the TREC 2012 Web track top 8 ad-hoc runs at the risk-sensitivity parameter values of α = 0, 1, 5, 10, along with the parametric estimates SE_x and the nonparametric Jackknife estimates SE_J of the associated standard errors SE(U_Risk). indricasp is the baseline.
Columns: ERR@20, followed by U_Risk, SE_x and SE_J for each of α = 0, α = 1, α = 5 and α = 10.
Rows: uogtra44xi, srhvrs1209, DFalah121A, QUTparaBline, utw2012f1, ICTNET12ADR2, indricasp (baseline; U_Risk, SE_x and SE_J entries marked *), irra12, qutwb.

4.1 Inference Based on T_Gain and T_Risk
The above protocol of hypothesis testing is referred to as the Student's t test for matched pairs, or paired t test for short, in statistics. Hence, in the context of risk-sensitive evaluation, the T_Gain formula in Eq. (9) stands for the test statistic t. In fact, at α = 0, testing the significance of an observed risk-reward tradeoff score between r and b (i.e. an observed U_Gain score) is akin to testing the significance of the observed difference between $\bar{r}$ and $\bar{b}$. To test the significance of an observed U_Gain score, one can therefore compare the corresponding T_Gain score with the two-sided critical values $\pm t_{(\gamma/2,df)}$ at a desired level of significance γ. If $-t_{(\gamma/2,c-1)} \le T_{Gain} \le t_{(\gamma/2,c-1)}$, the observed U_Gain score can be attributed to chance fluctuation, meaning that the observed gain in the performance of the system r with respect to the baseline system b is not statistically significant. In such a case, it is equally likely that the observed U_Gain score may or may not occur on another topic sample drawn from the population. Otherwise, if $T_{Gain} < -t_{(\gamma/2,c-1)}$ or $T_{Gain} > t_{(\gamma/2,c-1)}$, one can however be sure that a U_Gain score at least as extreme as the observed score would occur on 100(1 − γ)% of the topic samples that could be drawn from the population.
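The decision rule above can be sketched as a small Python routine. It computes T_Gain from per-topic scores and compares it against a two-sided critical value supplied by the caller. The function names, return labels and example scores are hypothetical. The critical value is passed in rather than computed, because the Python standard library has no Student's t quantile function; it is roughly 2.01 for df = 49 at γ = 0.05, and 3.182 for df = 3. The divisor c inside s_d is an assumption.

```python
import math


def t_gain(run, baseline):
    """T_Gain (Eq. 9): mean paired difference divided by its standard error."""
    deltas = [r - b for r, b in zip(run, baseline)]
    c = len(deltas)
    mean = sum(deltas) / c
    s_d = math.sqrt(sum((d - mean) ** 2 for d in deltas) / c)  # divisor c is an assumption
    return mean / (s_d / math.sqrt(c))


def decide(t_score, critical=2.0):
    """Two-sided decision against the critical values -critical and +critical."""
    if t_score > critical:
        return "significant gain"
    if t_score < -critical:
        return "significant loss"
    return "attributable to chance"


# Made-up per-topic scores on four topics, hence df = 3 and a critical value of 3.182.
run = [0.50, 0.60, 0.55, 0.65]
baseline = [0.30, 0.30, 0.30, 0.30]
print(decide(t_gain(run, baseline), critical=3.182))
```

Keeping the critical value as a parameter makes the dependence on the number of topics explicit: the default of 2.0 only suits test collections of about 50 topics at γ = 0.05.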
Both T_Gain and T_Risk stem from the t statistic. Indeed, for α = 0, T_Gain = T_Risk, while for α > 0, SE(U_Risk) was shown to be valid in Section 3.3. Hence, we argue that an equivalent inferential analysis can be conducted upon the T_Risk scores that have been calculated based on U_Risk. In the following, we provide an illustration of such inferential analysis upon runs submitted to the TREC 2012 Web track, but the same inferential analysis methodology could be applied for any risk-sensitive evaluation scenario.

4.2 Inferential Analysis of Web Track Runs
Given a particular IR system, a baseline system, and a set of c topics, one can use the paired t test for testing the significance of the calculated average tradeoff score between risk and reward over the c topics, U_Risk, by comparing the corresponding t-score, T_Risk, with the critical values $\pm t_{(\gamma/2,df)}$ at a desired level of significance γ. To illustrate such an analysis, Table 2 reports the U_Risk risk-reward tradeoff scores based on ERR@20, and the corresponding T_Risk scores for the 8 highest performing TREC 2012 ad-hoc runs, given the baseline run indricasp (we omit other submitted runs for brevity; however, the following analysis would be equally applicable to them).

As the TREC 2012 Web track has 50 topics, for a significance level of γ = 0.05, the critical values for T_Risk are $\pm t_{(0.025,49)} \approx \pm 2$. In Table 2, the U_Risk scores to which a two-sided paired t test gives significance are those that have a corresponding T_Risk score less than −2 or greater than +2. For example, at α = 0, the calculated U_Risk scores of the top 4 runs are significant with a p-value less than 0.05. This means that, under the null hypothesis $H_0 : \mu_r = \mu_b$, given another sample of 50 topics from the population, the probability of observing a risk-reward tradeoff score, between any one of these 4 runs and the baseline run indricasp, that is as extreme or more extreme than the one that was observed is less than 0.05, i.e. the associated p-values.
Since T_Risk > 0 for those runs, the declared significance counts in favour of reward against risk. Thus, one can conclude, with 95% confidence, that the expected per-topic effectiveness of each of the top 4 runs is, on average, higher than the expected per-topic effectiveness of the baseline run indricasp on the population of topics. In other words, given a topic from the population, it is highly likely that any one of the top 4 runs will not perform worse for that topic than indricasp. This suggests, as a result, that those top runs do not exhibit a real risk that is generalisable to the population of topics. On the other hand, a run with T_Risk < −2 at α = 0 will be under a real risk, though among the shown top 8 TREC 2012 runs there is no such run. For those runs with −2 ≤ T_Risk ≤ +2, such as utw2012f1 and qutwb, the risk analysis performed here is inconclusive, since the associated U_Risk scores can be attributed to chance fluctuation, i.e. it is equally likely that they may or may not be under a real risk.

Next, we observe from Table 2 that as α increases, the observed tradeoff between risk and reward for each run changes in favour of risk compared to reward, hence the runs exhibiting significant U_Risk scores change. For example, each of the runs with significant U_Risk scores at α = 0 (i.e., the top 4 runs) has a U_Risk score that can be attributed to chance fluctuation at α = 10, while, in contrast, those runs whose U_Risk scores can be attributed to chance fluctuation at α = 0 (i.e., the last 4 runs) have a significant U_Risk score at α = 10. Figure 1 shows the change in the T_Risk scores of the TREC 2012 top 8 ad-hoc runs for several risk-sensitivity α parameter values from 0 to 15. From the figure, we observe that for α > 5 the T_Risk scores for all runs are negative in sign, and for the last 4 runs the calculated U_Risk scores can be considered statistically significant (i.e., |T_Risk| > 2.0 for α > 5).
It is also observed that, even for α = 15, the calculated U_Risk scores of the top 4 TREC runs can still be attributed to chance fluctuation. As a result, the inferential analysis performed so far suggests that, in general, none of the 8 top TREC 2012 ad-hoc runs are under a real risk of performing any given topic from the population worse than the baseline run indricasp, on average. In particular, there can be no significant reduction in risk that could be attained for the top 4 systems, given a baseline system with the average retrieval effectiveness of indricasp. On the other hand, a significant reduction in risk could be attained, on average, for the last 4 systems, particularly for α > 5.

Table 2: U_Risk and T_Risk risk-reward tradeoff scores for the top 8 TREC 2012 ad-hoc runs at α = 0, 1, 5, 10, where the baseline is indricasp. The underlined U_Risk scores are those for which a two-tailed paired t test gives significance with p < 0.05, i.e. exhibit a T_Risk score greater than +2 or less than −2.
Columns: U_Risk, T_Risk and p-value for each of α = 0, α = 1, α = 5 and α = 10.
Rows: uogtra44xi, srhvrs1209, DFalah121A, QUTparaBline, utw2012f1, ICTNET12ADR2, irra12, qutwb.

Figure 1: The change in standardised T_Risk scores for the top TREC 2012 ad-hoc runs for 0 ≤ α ≤ 15. (Axes: α value against T_Risk score, with the upper critical value, the null hypothesis $H_0 : \mu_r = \mu_b$ and the lower critical value marked; one curve per run: uogtra44xi, srhvrs1209, DFalah121A, QUTparaBline, utw2012f1, ICTNET12ADR2, indricasp, irra12, qutwb.)

Lastly, in Table 2, it is notable that high U_Risk scores do not necessarily imply high T_Risk scores, because each system would in general have a different inherent variation in $r_i - b_i$ across topics (i.e. SE(U_Risk)) from that of the other systems. For example, consider the runs uogtra44xi and srhvrs1209. At α = 0, uogtra44xi has a U_Risk score (0.1185) higher than the U_Risk score (0.1102) of srhvrs1209, while srhvrs1209 has a higher T_Risk score than uogtra44xi (i.e. ... vs. ...). This shows that a ranking of retrieval systems obtained based on T_Risk will not necessarily be concordant with the ranking of systems obtained based on U_Risk.

5. EXPLORATORY RISK ANALYSIS
In the previous section, the risk analysis that we performed could hide significant performance losses on individual topics.
Nevertheless, one can perform an exploratory risk analysis to determine those individual topics on which the observed risk-reward tradeoff score between a given IR system and the baseline system (i.e., $x_i$) is statistically significant. In the following, we provide a definition for exploratory risk analysis (Section 5.1), which we later illustrate upon the TREC 2012 Web track runs (Section 5.2).

5.1 Definition

The $T_{Risk}$ measure permits the topic-by-topic analysis of risk-reward tradeoff measurements, which we refer to as exploratory risk analysis. Such an analysis is implicitly suggested by the t statistic itself. The t statistic in Eq. (7) can be rewritten as follows:

$$t = \frac{\bar{d}}{SE(\bar{d})} = \frac{\frac{1}{c}\sum_{i=1}^{c}(r_i - b_i)}{s_d/\sqrt{c}} = \sum_{i=1}^{c}\frac{r_i - b_i}{\sqrt{c}\,s_d}. \tag{19}$$

Here, each component of the sum, $t_i = \frac{r_i - b_i}{\sqrt{c}\,s_d}$, gives the standardised score of the observed difference in effectiveness between the system r and the baseline system b on topic i, for i = 1, 2, ..., c. By analogy, the $T_{Risk}$ measure, which stems from the t statistic, can be rewritten as:

$$T_{Risk} = \frac{U_{Risk}}{SE_{\bar{x}}} = \frac{\frac{1}{c}\sum_{i=1}^{c} x_i}{s_x/\sqrt{c}} = \sum_{i=1}^{c}\frac{x_i}{\sqrt{c}\,s_x}, \tag{20}$$

where each component of the sum, in this case, gives the standardised score of the individual topic risk-reward tradeoff measurements $x_1, x_2, \ldots, x_c$:

$$T_{R_i} = \frac{x_i}{\sqrt{c}\,s_x}. \tag{21}$$

Just as we compare the calculated $T_{Risk}$ score of a given IR system with the two-sided critical values $\pm t_{(\gamma/2,\,df)}$ to decide whether the system exhibits a significant level of risk on average (Section 4), we can decide whether an observed loss (or gain) on a particular topic i is significant by comparing the component $T_{R_i}$ score with the same critical values $\pm t_{(\gamma/2,\,df)}$, at a desired significance level of γ. If $-t_{(\gamma/2,\,df)} \le T_{R_i} \le t_{(\gamma/2,\,df)}$, the observed loss (or gain) can be attributed to chance fluctuation; otherwise it can be considered statistically significant. Indeed, this is one of the typical methods of outlier detection in statistics [14]. Recall that the original objective of Jackknife is to detect outliers [21].
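The decomposition in Eqs. (20) and (21) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names and the sample scores are hypothetical, and the per-topic tradeoff score $x_i$ is assumed to follow the usual $U_{Risk}$ convention (a gain counts as $r_i - b_i$, a loss as $(1+\alpha)(r_i - b_i)$), which is defined outside this section.

```python
import math
from statistics import mean, stdev


def tradeoff_scores(r, b, alpha):
    """Per-topic risk-reward scores x_i (assumed U_Risk convention):
    gains are kept as-is, losses are weighted by (1 + alpha)."""
    return [(ri - bi) if ri > bi else (1 + alpha) * (ri - bi)
            for ri, bi in zip(r, b)]


def t_risk_components(r, b, alpha=0.0):
    """Return (U_Risk, T_Risk, [T_R1, ..., T_Rc]) following Eqs. (20)-(21)."""
    x = tradeoff_scores(r, b, alpha)
    c = len(x)
    s_x = stdev(x)                          # sample standard deviation of x_i
    u_risk = mean(x)                        # U_Risk is the mean of the x_i
    t_risk = u_risk / (s_x / math.sqrt(c))  # U_Risk divided by its standard error
    t_ri = [xi / (math.sqrt(c) * s_x) for xi in x]  # per-topic components
    return u_risk, t_risk, t_ri


# Hypothetical per-topic effectiveness scores for a system (r) and a baseline (b).
r = [0.42, 0.10, 0.55, 0.31, 0.07, 0.60]
b = [0.35, 0.25, 0.50, 0.30, 0.20, 0.40]
u_risk, t_risk, t_ri = t_risk_components(r, b, alpha=1.0)
```

By construction the per-topic components add up to the overall score, so `sum(t_ri)` matches `t_risk` up to floating-point error, and with `alpha=0.0` the value of `t_risk` coincides with the paired t statistic of Eq. (19).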
The $T_{Risk}$ measure can also be expressed in terms of the Jackknife estimate of bias, following Wu [29]:

$$T_{Risk} = \frac{U_{Risk}}{SE_J} = \frac{1}{c}\sum_{i=1}^{c}\frac{(c-1)\,(\hat{\theta}_{(i)} - \hat{\theta})}{SE_J}. \tag{22}$$

Here, each component of the sum:

$$T_{J_i} = \frac{(c-1)\,(\hat{\theta}_{(i)} - \hat{\theta})}{SE_J} = \frac{(c-1)\,(\bar{x}_{(i)} - \bar{x})}{\sqrt{var_J(\bar{x})}}, \tag{23}$$

gives the standardised Jackknife estimate of bias in $U_{Risk}$ due to leaving the topic risk-reward score $x_i$ out of the sample $x_1, x_2, \ldots, x_c$, where $\bar{x} = U_{Risk}$ and $\bar{x}_{(i)}$ is the $U_{Risk}$ score obtained when the i-th topic is left out of the topic set in use, for i = 1, 2, ..., c. In general, both the $T_{R_i}$ statistic in Eq. (21) and the $T_{J_i}$ statistic in Eq. (23) can be used for the purpose of exploratory risk analysis. However, there is a certain difference between them in theory. Using $T_{R_i}$, we can decide whether an observed performance loss on topic i is significant by comparing the topic risk-reward score $x_i$ with the maximum score that can be attributed to chance fluctuation, but as if the single datum $x_i$ is the whole sample. In contrast, using $T_{J_i}$, we can make the same decision by comparing the observed difference between two $U_{Risk}$ scores, $\bar{x}_{(i)} - \bar{x}$, with the maximum difference that can be attributed to chance fluctuation. Since we showed in Section 3.3 that the two estimates of the standard error for each TREC run are in perfect agreement (i.e. $SE_{\bar{x}}$ and $SE_J$ agree), we argue that this theoretical difference has no practical consequences. Hence, in the following, we provide an illustration of exploratory risk analysis on the TREC 2012 Web track runs based on $T_{J_i}$ alone; initial experiments showed no differences between $T_{R_i}$ and $T_{J_i}$.

5.2 Exploratory Analysis of Web Track Runs

Figure 2 shows the standardised Jackknife estimate of bias in the $U_{Risk}$ scores calculated for two TREC runs, namely uogTrA44xi and qutwb, at α = 0, 5, 10, 15 for the 50 TREC 2012 Web track topics, where indriCASP is the baseline. This standardised Jackknife estimate of bias, $T_{J_i}$, is estimated by leaving one TREC 2012 Web track topic out of the set of topics {151, 152, ..., 200} in turn. In the figure, the topics that result in a significant performance loss (gain) for the corresponding systems with respect to indriCASP, at the significance level of γ = 0.05, are those which have a $T_{J_i}$ score less than −2 (greater than +2, respectively). Horizontal lines at −2 and +2 are shown to aid clarity. From Figure 2, at α = 0 it can be observed that uogTrA44xi has more significant wins than qutwb, and fewer significant losses. This shows in detail why the declared significance for uogTrA44xi in Section 4 counts in favour of reward against risk, while the observed tradeoff between risk and reward can be attributed to chance fluctuation for qutwb, with respect to the baseline indriCASP.
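The agreement between the classical and Jackknife standard errors that Section 5.1 relies on can be checked numerically. The sketch below is illustrative only: the function names and the sample of topic scores are hypothetical, and the leave-one-out variance uses the standard Jackknife formula, $var_J = \frac{c-1}{c}\sum_i (\bar{x}_{(i)} - \bar{x}_{(\cdot)})^2$, which is assumed here rather than quoted from the paper.

```python
import math
from statistics import mean, stdev


def jackknife_se_of_mean(x):
    """Jackknife standard error of the mean via leave-one-out means."""
    c = len(x)
    total = sum(x)
    # Mean of the sample with topic i removed, for each topic i in turn.
    loo_means = [(total - xi) / (c - 1) for xi in x]
    centre = mean(loo_means)
    var_j = (c - 1) / c * sum((m - centre) ** 2 for m in loo_means)
    return math.sqrt(var_j), loo_means


def classical_se_of_mean(x):
    """Classical standard error of the mean, s_x / sqrt(c)."""
    return stdev(x) / math.sqrt(len(x))


# Hypothetical per-topic risk-reward scores x_i.
x = [0.07, -0.30, 0.05, 0.01, -0.26, 0.20]
se_j, loo_means = jackknife_se_of_mean(x)
se_x = classical_se_of_mean(x)
```

For the sample mean the two estimates coincide algebraically, so `se_j` and `se_x` differ only by floating-point rounding; the list `loo_means` holds the leave-one-out scores $\bar{x}_{(i)}$ from which $T_{J_i}$ is built.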
In general, both of the runs uogTrA44xi and qutwb exhibit considerable performance losses with respect to indriCASP on the same topics, including 166, 172, 174, 175, and 191, out of which 2 are significant for uogTrA44xi (i.e., 166 and 175) and 4 are significant for qutwb (i.e., 166, 172, 175, and 191), at α = 0. In particular, consider topic 166, on which the magnitude of the $T_{J_i}$ score is nearly the same for both runs. It is notable here that, as α increases, the significance of that topic doubles in relative terms for uogTrA44xi, while for qutwb it remains nearly the same. The situation is similar for topic 175, though the $T_{J_i}$ score of uogTrA44xi at α = 0 is small in magnitude compared to that of qutwb. This is one of the important differences between $T_{Risk}$ and $U_{Risk}$ in assessing the risk associated with IR systems. Given a particular topic i, the same amount of performance loss with respect to a provided baseline effectiveness can lead to different $T_{J_i}$ (and $T_{R_i} = x_i/SE_{\bar{x}}$) scores for different IR systems, depending on the variation in the observed risk-reward tradeoff across the topics (i.e., different $SE_{\bar{x}}$ for different systems), while leading to the same topic risk-reward score, $x_i$, for i = 1, 2, ..., c. As α increases, the topic risk-reward score $x_i$ increases proportionally for both of the runs uogTrA44xi and qutwb. However, the tradeoff counts, on average, significantly in favour of reward against risk for uogTrA44xi, whereas it counts neither in favour of reward nor against risk for qutwb, as shown in Section 4. Thus, the same margin of increase in topic risk-reward tradeoff score $x_i$ in favour of risk should lead to a relatively higher level of risk for uogTrA44xi than for qutwb, as $T_{J_i}$ indeed indicates. Assessing the level of risk that a topic commits for a given IR system relative to the level of risk associated with the system on average is a property unique to the measures $T_{J_i}$ and $T_{R_i}$.
Besides the use of these measures for exploratory risk analysis, this property also enables adaptive risk-sensitive optimisation within a learning to rank technique, as we explain in the next section.

6. ADAPTIVE RISK OPTIMISATION

In this section, we describe how to exploit the new risk-reward tradeoff measure $T_{Risk}$ (Eq. (10)) in learning robust ranking models that maximise average retrieval effectiveness while minimising the risk-reward ratio, in the context of the state-of-the-art LambdaMART learning to rank technique [30]. As discussed below, Wang et al. [26] proposed to integrate $U_{Risk}$ (Eq. (4)) within LambdaMART to achieve risk-sensitive optimisation, by using α to penalise risk during the learning process. However, $U_{Risk}$ considers topics equally regardless of the level of risk they commit. In contrast, we propose to adaptively change the level of risk-sensitivity, so that the total risk-sensitivity is distributed across the topics proportionally to the level of risk each topic commits. In the following, Section 6.1 provides an overview of the LambdaMART objective function, while Section 6.2 describes the integration of $U_{Risk}$ within LambdaMART; Section 6.3 explains our proposed adaptive risk-sensitive optimisation approaches, with the experimental setup and results following in Sections 6.4 and 6.5, respectively.

6.1 LambdaMART

LambdaMART [30] is a state-of-the-art learning to rank technique, which won the 2011 Yahoo! learning to rank challenge. It can be described as a tree-based technique, in that its resulting learned model takes the form of an ensemble of regression trees, which is used to predict the score of each document given the document's feature values. During learning, LambdaMART creates a sequence of gradient boosted regression trees that improve an effectiveness metric.
In general, for our purposes², it is sufficient to state that LambdaMART's objective function is based upon the product of two components: (i) the derivative of a cross-entropy that originates from the RankNet learning to rank technique [3], calculated between the scores of two documents a and b, and (ii) the absolute change ΔM in an evaluation measure M due to the swapping of documents a and b [4]. Therefore the final gradient $\lambda^{new}_a$ of a document a within the objective function is obtained over all pairs of documents that a participates in for query q:

$$\lambda^{new}_a = \sum_{b \neq a} \lambda_{ab}\,|\Delta M_{ab}|$$

where $\lambda_{ab}$ is RankNet's cross-entropy derivative, and $\Delta M_{ab}$ is the change in an evaluation measure M caused by swapping documents a and b. Various IR evaluation measures are suitable for use as M, including NDCG and MAP, as they have been shown to satisfy a consistency property [4]: for a pair of documents a and b where a is ranked higher than b, if the relevance label of a is higher than that of b, then a degrading swap of a and b must result in a decrease in M, and conversely an improving swap must result in an increase in M.

² Further details on LambdaMART can be found in [4, 26].
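The pairwise gradient described in Section 6.1 can be illustrated with a small Python sketch for a single query, taking M to be NDCG. It is not the Jforests code: the function names, the sigmoid parameter `sigma`, the exponential gain $2^{label} - 1$ and the sign convention (a positive value means the document's score should be raised) are all assumptions made for the example.

```python
import math


def gain(label):
    """Exponential NDCG gain for a graded relevance label."""
    return 2 ** label - 1


def discount(rank):
    """Logarithmic NDCG discount for a 1-based rank position."""
    return 1.0 / math.log2(rank + 1)


def lambda_gradients(scores, labels, sigma=1.0):
    """Per-document lambdas: sum over pairs of lambda_ab * |delta NDCG_ab|."""
    n = len(scores)
    order = sorted(range(n), key=lambda d: -scores[d])
    rank = {doc: pos + 1 for pos, doc in enumerate(order)}
    ideal = sorted(labels, reverse=True)
    idcg = sum(gain(lab) * discount(pos + 1) for pos, lab in enumerate(ideal))
    lambdas = [0.0] * n
    if idcg == 0:
        return lambdas  # no relevant documents, nothing to learn from
    for a in range(n):
        for b in range(n):
            if labels[a] <= labels[b]:
                continue  # only pairs where a is more relevant than b
            # |change in NDCG| if a and b exchanged rank positions.
            delta = abs((gain(labels[a]) - gain(labels[b]))
                        * (discount(rank[a]) - discount(rank[b]))) / idcg
            # RankNet-style pairwise term: large when a is scored below b.
            rho = 1.0 / (1.0 + math.exp(sigma * (scores[a] - scores[b])))
            lam = sigma * rho * delta
            lambdas[a] += lam  # push the more relevant document up
            lambdas[b] -= lam  # push the less relevant document down
    return lambdas


# Hypothetical query: the most relevant document (index 0) is ranked last.
scores = [0.2, 1.5, 0.9]
labels = [2, 0, 1]
lams = lambda_gradients(scores, labels)
```

In this toy query the first document receives a positive lambda and the non-relevant document a negative one, and the lambdas sum to zero because each pair contributes equal and opposite amounts.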

Figure 2: Bar graph showing the standardised Jackknife estimate of bias in the $U_{Risk}$, $T_{J_i}$, for uogTrA44xi and qutwb at α = 0, 5, 10, 15, where indriCASP is the baseline. (Panels: uogTrA44xi and qutwb; one row of $T_{J_i}$ bars per α value.)

6.2 Risk-Sensitive Optimisation

Wang et al. [26] demonstrated that a more robust learned model could be obtained from LambdaMART if ΔM is replaced by the difference in $U_{Risk}$ for a given swap of two documents, denoted ΔT. In doing so, their implementation weights the value of ΔM by α + 1 only for the topics with down-side risk, while for the topics with up-side risk it leaves ΔM as is, ΔT = ΔM. ΔT was shown to exhibit the consistency property iff the underlying evaluation measure M is consistent (e.g. as obtained from NDCG).

6.3 Adaptive Risk-Sensitive Optimisation

Compared to $U_{Risk}$, $T_{Risk}$ is grounded in the theory of hypothesis testing and produces values that are easily interpretable, as shown in Section 4. However, as a linear transformation of $U_{Risk}$, the direct application of $T_{Risk}$ as ΔT within LambdaMART to attain risk-sensitive optimisation cannot offer marked improvements on the resulting learned models. On the other hand, the exploratory risk analysis of Section 5 offers a promising direction, as it permits the learning to rank process to adaptively focus on topics depending upon the level of risk that they commit. In this section, we propose two new models of adaptive risk-sensitive optimisation that exploit the standardised topic risk-reward tradeoff scores ($T_{R_i}$, Eq. (21)), but which differ in the individual topics they operate on. In particular, the first model, Semi-Adaptive Risk-sensitive Optimisation (SARO), focuses only on the topics with down-side risk and augments only the corresponding ΔM values. In contrast, the Fully Adaptive Risk-sensitive Optimisation (FARO) model operates on all topics and augments every ΔM value. Hence, compared to $U_{Risk}$ as used in [26], FARO and SARO both alter the importance of riskier topics within the learning process.
In $U_{Risk}$, ΔM is multiplied by α + 1 if the topic commits a down-side risk³. This amounts to a static level of sensitivity for each topic, irrespective of the level of risk that the topic commits.

³ This follows directly from the definition of Eq. (4); however, the consistency proof in [26] defines ΔT for different scenarios.

In contrast, based on the standardised topic risk-reward tradeoff scores, $T_{R_i}$ (Eq. (21)), we propose to adaptively adjust α so that the total level of sensitivity can be distributed across the topics in proportion to the levels of risk that they commit. In order to achieve this, for each topic we must estimate the probability of observing a risk-reward score greater than the actual observed $T_{R_i}$ score. Technically speaking, we need to estimate the cumulative probability $\Pr(Z \ge T_{R_i})$, where $T_{R_i}$ is the observed risk-reward tradeoff score and Z is the corresponding standard normal variable of $T_{R_i}$, for all topics i = 1, 2, ..., c. For large sample sizes (generally agreed to be at least 30), the distribution of the t statistic in Eq. (7) can be approximated by the standard normal probability distribution function, with zero mean and unit variance [15]. Thus, the probability $\Pr(Z \ge T_{R_i})$, which is the probability of a topic risk-reward score greater than $T_{R_i}$, can be estimated by the standard normal cumulative distribution function Φ(·), as follows:

$$\Pr(Z \ge T_{R_i}) \approx 1 - \Phi(T_{R_i}), \tag{24}$$

for i = 1, 2, ..., c. Φ(Z) is a monotonically increasing function of the standard normal random variable Z, where $0 \le \Phi(Z = z) \le 1$ for any real z, and at Z = 0, Φ(Z) = 0.5. Hence, we can replace the original α in ΔT with α′ as follows:

$$\alpha' = [1 - \Phi(T_{R_i})] \cdot \alpha, \tag{25}$$

where $0 \le \alpha' \le \alpha$. As the level of risk $T_{R_i}$ committed by topic i increases, α′ also increases. Substituting α′ into ΔT (as defined by Wang et al. [26]) augments the ΔM values for every topic with a weight proportional to the level of risk that each topic commits. The application of α′ differs between the SARO and FARO models. In particular, SARO only addresses the down-side risk, as in the case of $U_{Risk}$.
Indeed, under the null hypothesis $H_0: \mu_r = \mu_b$, the higher the level of down-side risk (i.e. the larger the size of the difference $r_i - b_i < 0$), the higher the probability of observing a topic risk-reward tradeoff score greater than the observed score ($\Pr(Z \ge T_{R_i})$). Hence, SARO varies α′ from 0 to α, according to the down-side risk of each topic. On the other hand, FARO operates on all topics. Indeed, for the topics with up-side risk, FARO gives lower weights to the topics that more strongly outperform the baseline system (i.e. as the difference $r_i - b_i > 0$ increases). At the extreme, if topic i exhibits maximal improvements over the baseline (i.e. $r_i - b_i = 1$), then $\Phi(T_{R_i}) = 1$, and hence topic i has minimal emphasis on the learner. In other words, the learner focuses on improving the riskier topics. FARO operates on all topics by redefining ΔT as follows:

$$\Delta T = (1 + \alpha') \cdot \Delta M. \tag{26}$$

Moreover, for α = 0, α′ = 0, hence ΔT = ΔM, i.e. the gain-only LambdaMART, as for $U_{Risk}$. Finally, we informally comment on the consistency of SARO and FARO. For both models, we calculate $SE(U_{Risk})$ after the first iteration of boosting within LambdaMART, and not for each considered swap; we found this to be sufficient to obtain accurate estimates of $SE(U_{Risk})$. Next, the consistency of SARO follows from that of $U_{Risk}$, as our replacement of α with α′ satisfies $0 \le \alpha' \le \alpha$. For FARO, ΔT only changes sign with ΔM, again as $0 \le \alpha' \le \alpha$. Hence, as long as M is consistent, both SARO and FARO are also consistent.

6.4 Experimental Setup

We implement the $U_{Risk}$, SARO and FARO models within the Jforests implementation [13] of LambdaMART⁴. Experiments are conducted using the large MSLR-Web10k learning to rank dataset, as used by Wang et al. [26]. This dataset encompasses 9,685 queries with labelled documents obtained from a commercial web search engine. For each ranked document for each query, a range of 136 typical query-independent, query-dependent and query features are provided. We use identical hyper-parameters for LambdaMART to those described by Wang et al. [26], namely: the minimum number of documents in each leaf m = 500, 1000, the number of leaves l = 50, the number of trees in the ensemble nt = 800, and the learning rate r. The best m value is chosen for each of the five folds using the validation topic set, based on the NDCG@10 performance of the original LambdaMART algorithm, and used for all experiments for that fold thereafter.
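Returning to Eqs. (24) to (26) of Section 6.3, the adaptive weighting can be sketched with the Python standard library. This is a simplified illustration rather than the Jforests integration: the function names are hypothetical, and SARO is rendered here as "apply the factor 1 + α′ only when the topic shows a loss against the baseline", which follows the description in the text but ignores the case analysis used in the consistency proof of [26].

```python
from statistics import NormalDist

# Standard normal distribution (zero mean, unit variance), used for Phi.
_STD_NORMAL = NormalDist()


def adaptive_alpha(t_ri, alpha):
    """Eq. (25): alpha' = [1 - Phi(T_Ri)] * alpha, so 0 <= alpha' <= alpha."""
    return (1.0 - _STD_NORMAL.cdf(t_ri)) * alpha


def faro_weight(t_ri, alpha):
    """FARO, Eq. (26): every topic's delta M is scaled by (1 + alpha')."""
    return 1.0 + adaptive_alpha(t_ri, alpha)


def saro_weight(t_ri, alpha, r_i, b_i):
    """SARO (as assumed here): scale delta M only for topics with down-side risk."""
    if r_i < b_i:
        return 1.0 + adaptive_alpha(t_ri, alpha)
    return 1.0


# Hypothetical standardised topic scores: a strong loss, a neutral topic, a strong win.
weights = [faro_weight(t, alpha=5.0) for t in (-3.0, 0.0, 3.0)]
```

The weights decrease monotonically from the strong loss to the strong win, which is the behaviour the text attributes to FARO; with `alpha=0` every weight collapses to 1, recovering the gain-only objective.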
For the calculation of risk measures, like [26], we use the ranking obtained from the BM25.whole.document feature as the baseline system. The performances obtained for LambdaMART upon the MSLR-Web10k dataset in terms of NDCG@1 and NDCG@10 are similar in magnitude to those reported by Wang et al. [26]; however, we note some differences in the risk profile. Such differences are expected given the different implementations: Wang et al. [26] used a private implementation of LambdaMART, while we use and adapt an open source machine learning toolkit for $U_{Risk}$, SARO and FARO. Nevertheless, the reported results allow valid conclusions to be drawn, including identical conclusions to [26] on the impact of using $U_{Risk}$ within LambdaMART.

6.5 Results for SARO and FARO

Table 3 reports the effectiveness and robustness results for FARO and SARO along with $U_{Risk}$, for α = 1, 5, 10, 20⁶. In the table, the gain over the baseline effectiveness is expressed as the risk (Eq. (1)) to reward (Eq. (2)) ratio (i.e., the Risk/Reward rows). Similarly, the number of topics that the risk-sensitive optimisation contributed to reward against risk is expressed as the loss to win ratio (i.e., the Loss/Win rows). Raw numbers of losses and wins associated with each α value for each model are also shown. Finally, the Loss > 20% rows show, for each model, the number of topics on which the relative loss in performance over the BM25 baseline was higher than 20%⁷.

Table 3: Results for SARO, FARO and $U_{Risk}$.
Columns: α = 0, α = 1, α = 5, α = 10, α = 20.
Rows (each reported for $U_{Risk}$, SARO and FARO): NDCG@1, NDCG@10, Risk/Reward, Loss/Win, Loss, Win, Loss > 20%.

⁴ All of our code has been integrated into Jforests.
⁶ α = 0 is equivalent to the normal LambdaMART algorithm.
As expected, since the semi-adaptive risk-sensitive optimisation (SARO) and the risk-sensitive optimisation based on $U_{Risk}$ focus only on those topics with down-side risk, there is a steady decrease in average retrieval effectiveness (i.e., NDCG@1 and NDCG@10) as the value of the risk-sensitivity parameter α increases. Nevertheless, SARO results in a smaller decrease in average retrieval effectiveness than $U_{Risk}$, for all α values. In contrast, the fully adaptive risk-sensitive optimisation (FARO) maintains the average retrieval effectiveness nearly constant across all α values, as well as the values of the quality and robustness measures, namely the risk-reward ratio and the loss-win ratio. For SARO, the observed values of the two quality and robustness metrics (risk-reward ratio and loss-win ratio) are better than for $U_{Risk}$ across the α values. For the metric Loss > 20%, they are comparable between SARO and $U_{Risk}$, given a topic sample as large as 9,685 in size. Next, for FARO, the observed values of the two quality and robustness metrics are comparable with those of the risk-sensitive optimisation based on $U_{Risk}$ across α values, and for the metric Loss > 20% the observed values for FARO are slightly worse than those of both $U_{Risk}$ and SARO. To summarise, the empirical evidence in Table 3 suggests that (i) FARO is best suited for retrieval tasks that are not tolerant to any loss in average effectiveness but also require robustness in effectiveness across the topics, and (ii) SARO suits retrieval tasks that require primarily robustness but are tolerant to some loss in the achievable average effectiveness.

⁷ Similar measures are reported in [26]. With 9,685 topics, all NDCG differences are statistically significant.
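For readers who want to reproduce the kind of robustness rows reported in Table 3, the following Python sketch computes them from per-topic scores. It is a hedged illustration only: the function name and sample data are hypothetical, and risk and reward are assumed to be the mean per-topic loss and mean per-topic gain over the baseline, since Eqs. (1) and (2) are defined outside this section.

```python
def robustness_summary(r, b, loss_threshold=0.20):
    """Summarise wins and losses of a system (r) against a baseline (b)."""
    c = len(r)
    # Assumed definitions: reward = mean gain, risk = mean loss over all topics.
    reward = sum(max(0.0, ri - bi) for ri, bi in zip(r, b)) / c
    risk = sum(max(0.0, bi - ri) for ri, bi in zip(r, b)) / c
    wins = sum(1 for ri, bi in zip(r, b) if ri > bi)
    losses = sum(1 for ri, bi in zip(r, b) if ri < bi)
    # Topics whose loss relative to the baseline score exceeds the threshold.
    big_losses = sum(1 for ri, bi in zip(r, b)
                     if bi > 0 and (bi - ri) / bi > loss_threshold)
    return {
        "risk_reward": risk / reward if reward > 0 else None,
        "loss_win": losses / wins if wins > 0 else None,
        "loss": losses,
        "win": wins,
        "loss_over_threshold": big_losses,
    }


# Hypothetical per-topic NDCG scores for a learned model and the baseline.
r = [0.42, 0.10, 0.55, 0.31, 0.07, 0.60]
b = [0.35, 0.25, 0.50, 0.30, 0.20, 0.40]
summary = robustness_summary(r, b)
```

Lower values of `risk_reward` and `loss_win` indicate a more robust model under these assumed definitions; ratios are returned as `None` when the denominator is zero rather than raising an error.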
