arxiv: v2 [cs.lg] 15 Apr 2015


Authors' addresses: Andrea Esuli, Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi 1, Pisa, Italy. andrea.esuli@isti.cnr.it. Fabrizio Sebastiani, Qatar Computing Research Institute, PO Box 5825, Doha, Qatar. fsebastiani@qf.org.qa. Fabrizio Sebastiani is on leave from Consiglio Nazionale delle Ricerche. The order in which the authors are listed is purely alphabetical; each author has given an equally important contribution to this work.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY USA, fax +1 (212) , or permissions@acm.org. © YYYY ACM /YYYY/-ARTAA $15.00 DOI:

Optimizing Text Quantifiers for Multivariate Loss Functions

ANDREA ESULI, Consiglio Nazionale delle Ricerche
FABRIZIO SEBASTIANI, Qatar Computing Research Institute

We address the problem of quantification, a supervised learning task whose goal is, given a class, to estimate the relative frequency (or prevalence) of the class in a dataset of unlabelled items. Quantification has several applications in data and text mining, such as estimating the prevalence of positive reviews in a set of reviews of a given product, or estimating the prevalence of a given support issue in a dataset of transcripts of phone calls to tech support. So far, quantification has been addressed by learning a general-purpose classifier, counting the unlabelled items which have been assigned the class, and tuning the obtained counts according to some heuristics. In this paper we depart from the tradition of using general-purpose classifiers, and use instead a supervised learning model for structured prediction, capable of generating classifiers directly optimized for the (multivariate and non-linear) function used for evaluating quantification accuracy. The experiments that we have run on 5500 binary high-dimensional datasets (averaging more than 14,000 documents each) show that this method is more accurate, more stable, and more efficient than existing, state-of-the-art quantification methods.

Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Design Methodology - Classifier design and evaluation; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering; Search process; I.2.7 [Artificial Intelligence]: Natural Language Processing - Text analysis

General Terms: Algorithm, Design, Experimentation, Measurements

Additional Key Words and Phrases: Quantification, Prevalence estimation, Prior estimation, Supervised learning, Text classification, Loss functions, Kullback-Leibler divergence

ACM Reference Format:
Andrea Esuli and Fabrizio Sebastiani, YYYY. Optimizing Text Quantifiers for Multivariate Loss Functions. ACM Trans. Knowl. Discov. Data. VV, NN, Article AA (YYYY), 26 pages. DOI:

1. INTRODUCTION

In recent years it has been pointed out that, in a number of applications involving classification, the final goal is not determining which class (or classes) individual unlabelled data items belong to, but determining the prevalence (or "relative frequency") of each class in the unlabelled data. The latter task is known as quantification [Forman 2005; 2006a; 2008; Forman et al. 2006].

Although what we are going to discuss here applies to any type of data, we are mostly interested in text quantification, i.e., quantification when the data items are textual documents. To see the importance of text quantification, let us examine the task of classifying textual answers returned to open-ended questions in questionnaires [Esuli and Sebastiani 2010a; Gamon 2004; Giorgetti and Sebastiani 2003], and let us discuss two important such scenarios.

In the first scenario, a telecommunications company asks its current customers the question "How satisfied are you with our mobile phone services?", and wants to classify the resulting textual answers according to whether they belong to the class MayDefectToCompetition. The company is likely interested in accurately classifying each individual customer, since it may want to call each customer that is assigned the class and offer her improved conditions.

In the second scenario, a market research agency asks respondents the question "What do you think of the recent ad campaign for product X?", and wants to classify the resulting textual answers according to whether they belong to the class LovedTheCampaign. Here, the agency is likely not interested in whether a specific individual belongs to the class LovedTheCampaign, but is likely interested in knowing how many respondents belong to it, i.e., in knowing the prevalence of the class.

In sum, while in the first scenario classification is the goal, in the second scenario the real goal is quantification, i.e., evaluating the results of classification at the aggregate level rather than at the individual level. Other scenarios in which quantification is the goal may be, e.g., predicting election results by estimating the prevalence of blog posts (or tweets) supporting a given candidate or party [Hopkins and King 2010], or planning the amount of human resources to allocate to different types of issues in a customer support center by estimating the prevalence of customer calls related to a given issue [Forman 2005], or supporting epidemiological research by estimating the prevalence of medical reports in which a specific pathology is diagnosed [Baccianella et al. 2013].

The obvious method for dealing with the latter type of scenarios is aggregative quantification, i.e., classifying each unlabelled document and estimating class prevalence by counting the documents that have been attributed the class. However, there are two reasons why this strategy is suboptimal.

The first reason is that a good classifier may not be a good quantifier, and vice versa. To see this, one only needs to look at the definition of $F_1$, the standard evaluation function for binary classification, defined as

$$F_1 = \frac{2TP}{2TP + FP + FN} \tag{1}$$

where $TP$, $FP$ and $FN$ indicate the numbers of true positives, false positives, and false negatives, respectively. According to $F_1$, a binary classifier $\hat{\Phi}_1$ for which $FP = 20$ and $FN = 20$ is worse than a classifier $\hat{\Phi}_2$ for which, on the same test set, $FP = 0$ and $FN = 10$.
However, $\hat{\Phi}_1$ is intuitively a better binary quantifier than $\hat{\Phi}_2$; indeed, $\hat{\Phi}_1$ is a perfect quantifier, since $FP$ and $FN$ are equal and thus compensate each other, so that the distribution of the test items across the class and its complement is estimated perfectly.

A second reason is that standard supervised learning algorithms are based on the assumption that the training set is drawn from the same distribution as the unlabelled data the classifier is supposed to classify. But in real-world settings this assumption is often violated, a phenomenon usually referred to as concept drift [Sammut and Harries 2011]. For instance, in a backlog of newswire stories from year 2001, the prevalence of class Terrorism in August data will likely not be the same as in September data; training on August data and testing on September data might well yield low quantification accuracy. Violations of this assumption may occur for reasons ranging from the bias introduced by experimental design, to the irreproducibility of the testing conditions at training time [Quiñonero-Candela et al. 2009]. Concept drift usually comes in one of three forms [Kelly et al. 1999]: (a) the class priors $p(c_i)$ may change, i.e., those in the test set may differ significantly from those in the training set; (b) the class-conditional distributions $p(x \mid c_i)$ may change; (c) the posterior distribution $p(c_i \mid x)$ may change. It is the first of these three cases that poses a problem for quantification.

The previous arguments indicate that text quantification should not be considered a mere byproduct of text classification, and should be studied as a task of its own. To date, proposed methods explicitly addressed to quantification (see e.g., [Bella et al. 2010; Forman 2005; 2006a; 2008; Forman et al. 2006; Hopkins and King 2010; Xue and Weiss 2009]) employ general-purpose supervised learning methods, i.e., address quantification by elaborating on the results returned by a general-purpose standard classifier.
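The contrast between $F_1$ and quantification accuracy drawn in the example of the two classifiers can be checked numerically. The sketch below is only an illustration: the test set size (1000 items) and the number of true positives in it (100) are hypothetical values chosen here, since the text specifies only the $FP$ and $FN$ counts.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 as defined in Equation 1."""
    return 2 * tp / (2 * tp + fp + fn)


def predicted_prevalence(tp: int, fp: int, n_items: int) -> float:
    """Fraction of test items that the classifier assigns to the class."""
    return (tp + fp) / n_items


# Hypothetical test set (not from the paper): 1000 items, 100 in the class.
N_ITEMS = 1000
N_POSITIVES = 100
TRUE_PREVALENCE = N_POSITIVES / N_ITEMS

# (FP, FN) counts of the two classifiers discussed in the text.
ERRORS = {"phi_1": (20, 20), "phi_2": (0, 10)}

results = {}
for name, (fp, fn) in ERRORS.items():
    tp = N_POSITIVES - fn  # positives that were not missed
    results[name] = {
        "f1": f1(tp, fp, fn),
        "prevalence": predicted_prevalence(tp, fp, N_ITEMS),
    }
    print(name, results[name], "true prevalence:", TRUE_PREVALENCE)
```

Under these assumed counts the second classifier obtains the higher $F_1$, while only the first one returns a prevalence estimate equal to the true prevalence.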
In this paper we take a sharply different, structured prediction approach, based upon the use of classifiers explicitly optimized for the non-linear, multivariate evaluation function that we will use for assessing quantification accuracy. This idea was first proposed, but not implemented, in [Esuli and Sebastiani 2010b].

The rest of the paper is organized as follows. In Section 2, after setting the stage we describe the evaluation function we will adopt (§2.1) and sketch a number of quantification methods previously proposed in the literature (§2.2). In Section 3 we introduce our novel method based on explicitly minimizing, via a structured prediction model, the evaluation measure we have chosen. Section 4 presents experiments in which we test the method we propose on two large batches of binary, high-dimensional, publicly available datasets (the two batches consist of 5148 and 352 datasets, respectively), using all the methods introduced in §2.2 as baselines. Section 5 discusses related work, while Section 6 concludes.

2. PRELIMINARIES

In this paper we will focus on quantification at the binary level. That is, given a domain of documents $D$ and a class $c$, we assume the existence of an unknown target function (or ground truth) $\Phi : D \rightarrow \{-1, +1\}$ that specifies which members of $D$ belong to $c$; as usual, $+1$ and $-1$ represent membership and non-membership in $c$, respectively. The approaches we will focus on are based on aggregative quantification, i.e., they rely on the generation of a classifier $\hat{\Phi} : D \rightarrow \{-1, +1\}$ via supervised learning from a training set $Tr$. We will indicate with $Te$ the test set on which quantification effectiveness is going to be tested.

We define the prevalence (or relative frequency) $\lambda_{Te}(c)$ of class $c$ in a set of documents $Te$ as the fraction of members of $Te$ that belong to $c$, i.e., as

$$\lambda_{Te}(c) = \frac{|\{d_j \in Te \mid \Phi(d_j) = +1\}|}{|Te|} \tag{2}$$

Given a set $Te$ of unlabelled documents and a class $c$, quantification is defined as the task of estimating $\lambda_{Te}(c)$, i.e., of computing an estimate $\hat{\lambda}_{Te}(c)$ such that $\lambda_{Te}(c)$ and $\hat{\lambda}_{Te}(c)$ are as close as possible.¹
What "as close as possible" exactly means will be formalized by an appropriate evaluation measure (see §2.1).

The reasons why we focus on binary quantification are two-fold. First, many quantification problems are binary in nature. For instance, estimating the prevalence of positive and negative reviews in a dataset of reviews of a given product is such a task. Another such task is estimating from blog posts the prevalence of support for either of two candidates in the second round of a two-round ("run-off") election. Second, a multi-class multi-label problem (also known as an n-of-m problem, i.e., a problem where zero, one, or several among $m$ classes can be attributed to the same document) can be reduced to $m$ independent binary problems of type ($c_j$ vs. $\bar{c}_j$), where $C = \{c_1, ..., c_j, ..., c_m\}$ is the set of classes and where $\bar{c}_j$ denotes the complement of $c_j$. Binary quantification methods can thus also be applied to solving quantification in multi-class multi-label contexts.

We instead leave the discussion of quantification in single-label multi-class (i.e., 1-of-m) contexts to future work.

2.1. Evaluation measures for quantification

Different measures have been used in the literature for measuring binary quantification accuracy.

¹ Consistently with most mathematical literature we use the caret symbol (ˆ) to indicate estimation.

The simplest such measure is bias (B), defined as $B(\lambda_{Te}, \hat{\lambda}_{Te}) = \hat{\lambda}_{Te}(c) - \lambda_{Te}(c)$ and used in [Forman 2005; 2006a; Tang et al. 2010]; positive bias indicates a tendency to overestimate the prevalence of $c$, while negative bias indicates a tendency to underestimate it.

Absolute Error (AE; also used in [Esuli and Sebastiani 2010b], where it is called percentage discrepancy, and in [Barranquero et al. 2013; Bella et al. 2010; Forman 2005; 2006a; González-Castro et al. 2013; Sánchez et al. 2008; Tang et al. 2010]), defined as $AE(\lambda_{Te}, \hat{\lambda}_{Te}) = |\hat{\lambda}_{Te}(c) - \lambda_{Te}(c)|$, is an alternative, equally simplistic measure that accounts for the fact that positive and negative bias are (in the absence of specific application-dependent constraints) equally undesirable.

Relative absolute error (RAE), defined as

$$RAE(\lambda_{Te}, \hat{\lambda}_{Te}) = \frac{|\hat{\lambda}_{Te}(c) - \lambda_{Te}(c)|}{\lambda_{Te}(c)} \tag{3}$$

is a refinement of AE meant to account for the fact that the same value of absolute error is a more serious mistake when the true class prevalence is small. For instance, predicting $\hat{\lambda}_{Te}(c) = 0.10$ when $\lambda_{Te}(c) = 0.01$ and predicting $\hat{\lambda}_{Te}(c) = 0.50$ when $\lambda_{Te}(c) = 0.41$ are equivalent errors according to B and AE, but the former is intuitively a more serious error than the latter.

The most convincing among the evaluation measures proposed so far is certainly that of Forman [2005], who uses normalized cross-entropy, better known as Kullback-Leibler Divergence (KLD; see e.g., [Cover and Thomas 1991]). KLD, defined as

$$KLD(\lambda_{Te}, \hat{\lambda}_{Te}) = \sum_{c \in C} \lambda_{Te}(c) \log \frac{\lambda_{Te}(c)}{\hat{\lambda}_{Te}(c)} \tag{4}$$

and also used in [Esuli and Sebastiani 2010b; Forman 2006a; 2008; Tang et al. 2010], is a measure of the error made in estimating a true distribution $\lambda_{Te}$ over a set $C$ of classes by means of a distribution $\hat{\lambda}_{Te}$; this means that KLD is in principle suitable for evaluating quantification, since quantifying exactly means predicting how the test items are distributed across the classes.
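The four measures defined so far can be compared on the example given in the text. The following is a minimal sketch, not the authors' evaluation code; the function names are chosen here for illustration, the logarithm is assumed to be natural (the text does not fix the base), and KLD is computed without any smoothing.

```python
import math


def bias(true_prev: float, est_prev: float) -> float:
    """B: signed difference between estimated and true prevalence."""
    return est_prev - true_prev


def absolute_error(true_prev: float, est_prev: float) -> float:
    """AE: absolute difference between estimated and true prevalence."""
    return abs(est_prev - true_prev)


def relative_absolute_error(true_prev: float, est_prev: float) -> float:
    """RAE: AE divided by the true prevalence (undefined if it is zero)."""
    return abs(est_prev - true_prev) / true_prev


def kld(true_dist: dict, est_dist: dict) -> float:
    """Unsmoothed KLD between two distributions (dicts class -> prevalence).

    Classes with true prevalence 0 contribute 0; the estimated prevalence
    must be non-zero wherever the true prevalence is non-zero.
    """
    return sum(p * math.log(p / est_dist[c])
               for c, p in true_dist.items() if p > 0)


def binary_dist(prev: float) -> dict:
    """Distribution over a class 'c' and its complement 'not_c'."""
    return {"c": prev, "not_c": 1.0 - prev}


# The two cases from the text, as (true prevalence, estimated prevalence).
case_a = (0.01, 0.10)
case_b = (0.41, 0.50)

ae_a, ae_b = absolute_error(*case_a), absolute_error(*case_b)
rae_a, rae_b = relative_absolute_error(*case_a), relative_absolute_error(*case_b)
kld_a = kld(binary_dist(case_a[0]), binary_dist(case_a[1]))
kld_b = kld(binary_dist(case_b[0]), binary_dist(case_b[1]))

print("AE :", ae_a, ae_b)    # the same up to rounding
print("RAE:", rae_a, rae_b)  # much larger for the low-prevalence case
print("KLD:", kld_a, kld_b)
```

In this example B and AE do not distinguish the two cases, whereas both RAE and KLD assign the larger error to the case in which the true prevalence is 0.01.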
KLD ranges between 0 (perfect coincidence of $\lambda_{Te}$ and $\hat{\lambda}_{Te}$) and $+\infty$ (total divergence of $\lambda_{Te}$ and $\hat{\lambda}_{Te}$). In the binary case in which $C = \{c, \bar{c}\}$, KLD becomes

$$KLD(\lambda_{Te}, \hat{\lambda}_{Te}) = \lambda_{Te}(c) \log \frac{\lambda_{Te}(c)}{\hat{\lambda}_{Te}(c)} + \lambda_{Te}(\bar{c}) \log \frac{\lambda_{Te}(\bar{c})}{\hat{\lambda}_{Te}(\bar{c})} \tag{5}$$

Continuity arguments indicate that we should consider $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = +\infty$ (see [Cover and Thomas 1991, p. 18]). Note that, as from Equation 4, KLD is undefined when the predicted distribution $\hat{\lambda}_{Te}$ is zero for at least one class (a problem that also affects RAE). As a result, we smooth the fractions $\lambda_{Te}(c)/\hat{\lambda}_{Te}(c)$ and $\lambda_{Te}(\bar{c})/\hat{\lambda}_{Te}(\bar{c})$ in Equation 4 by adding a small quantity $\epsilon$ to both the numerator and the denominator. The smoothed KLD function is always defined and still returns a value of zero when $\lambda_{Te}$ and $\hat{\lambda}_{Te}$ coincide.

KLD offers several advantages with respect to RAE (and, a fortiori, to B and AE). One advantage is that, as evident from Equation 5, it is symmetric with respect to the complement of a class, i.e., switching the role of $c$ and $\bar{c}$ does not change the result. This means that, e.g., predicting $\hat{\lambda}_{Te}(c) = 0.10$ when $\lambda_{Te}(c) = 0.11$ and predicting $\hat{\lambda}_{Te}(c) = 0.90$ when $\lambda_{Te}(c) = 0.89$ are equivalent errors (which seems intuitive), while RAE considers the former a much more serious error than the latter. This is especially useful in binary quantification tasks in which it is not clear which of the two classes should play the role of the positive class $c$, as in e.g., Employed vs. Unemployed. A second advantage is that KLD is not defined only on the binary (and multi-label multi-class) case, but is also defined on the single-label multi-class case; this allows evaluating different types of quantification tasks with the same measure. Last but not least, one benefit of using KLD is that it is a very well-known measure, having been the subject of intense study within information theory [Csiszár and Shields 2004] and, although from a more applicative angle, within the language modelling approach to information retrieval [Zhai 2008].

2.2. Existing quantification methods

A number of methods have been proposed in the (still brief) literature on quantification; below we list the main ones, which we will use as baselines in the experiments discussed in Section 4.

Classify and Count (CC). An obvious method for quantification consists of generating a classifier from $Tr$, classifying the documents in $Te$, and estimating $\lambda_{Te}$ by simply counting the fraction of documents in $Te$ that are predicted positive, i.e.,

$$\hat{\lambda}^{CC}_{Te}(c) = \frac{|\{d_j \in Te \mid \hat{\Phi}(d_j) = +1\}|}{|Te|} \tag{6}$$

Forman [2008] calls this the classify and count (CC) method.

Probabilistic Classify and Count (PCC). A variant of the above consists in generating a classifier from $Tr$, classifying the documents in $Te$, and estimating $\lambda_{Te}$ as the expected fraction of documents predicted positive, i.e.,

$$\hat{\lambda}^{PCC}_{Te}(c) = \frac{1}{|Te|} \sum_{d_j \in Te} p(c \mid d_j) \tag{7}$$

where $p(c \mid d_j)$ is the probability of membership in $c$ of test document $d_j$ returned by the classifier. If the classifier only returns confidence scores that are not probabilities (as is the case, e.g., when AdaBoost.MH is the learner [Schapire and Singer 2000]), the confidence scores must be converted into probabilities, e.g., by applying a logistic function.
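The CC and PCC estimates of Equations 6 and 7 can be sketched as follows, starting from the confidence scores that a classifier returns on the test documents. Several details are assumptions made here for illustration only: the decision threshold of 0, the use of a plain (uncalibrated) logistic function to turn scores into probabilities, and the toy scores themselves.

```python
import math


def classify_and_count(scores, threshold: float = 0.0) -> float:
    """CC (Equation 6): fraction of documents predicted positive.

    A document is predicted positive when its score exceeds `threshold`
    (an assumed decision rule).
    """
    return sum(1 for s in scores if s > threshold) / len(scores)


def logistic(score: float) -> float:
    """Maps a confidence score to (0, 1); one possible conversion."""
    return 1.0 / (1.0 + math.exp(-score))


def probabilistic_classify_and_count(scores) -> float:
    """PCC (Equation 7): mean of the membership probabilities p(c|d_j)."""
    return sum(logistic(s) for s in scores) / len(scores)


# Toy confidence scores for eight hypothetical test documents.
scores = [2.1, 0.3, -0.2, -1.5, -3.0, -0.1, 0.8, -2.2]

cc_estimate = classify_and_count(scores)
pcc_estimate = probabilistic_classify_and_count(scores)
print("CC :", cc_estimate)
print("PCC:", pcc_estimate)
```

Unlike CC, PCC lets documents whose score is close to the threshold contribute a fractional count, so the two estimates generally differ.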
The PCC method is dismissed as unsuitable in [Forman 2005; 2008], but is shown to perform better than CC in [Bella et al. 2010] (where it is called "Probability Average") and in [Tang et al. 2010].

Adjusted Classify and Count (ACC). Forman [2005; 2008] uses a further method which he calls "Adjusted Count", and which we will call Adjusted Classify and Count (ACC) so as to make its relation with CC more explicit. The underlying idea is that CC would be optimal, were it not for the fact that the classifier may generate different numbers of false positives and false negatives, and that this difference would lead to imperfect quantification. If we knew the true positive rate ($tpr = \frac{TP}{TP + FN}$, a.k.a. recall) and false positive rate ($fpr = \frac{FP}{FP + TN}$, a.k.a. fallout) that the classifier has obtained on $Te$, it is easy to check that perfect quantification would be obtained by adjusting $\hat{\lambda}^{CC}_{Te}(c)$ as follows:

$$\hat{\lambda}^{ACC}_{Te}(c) = \frac{\hat{\lambda}^{CC}_{Te}(c) - fpr_{Te}(c)}{tpr_{Te}(c) - fpr_{Te}(c)} \tag{8}$$

Indeed, $\hat{\lambda}^{CC}_{Te}(c) = tpr_{Te}(c) \cdot \lambda_{Te}(c) + fpr_{Te}(c) \cdot (1 - \lambda_{Te}(c))$, and solving for $\lambda_{Te}(c)$ yields Equation 8. Since we cannot know the true values of $tpr_{Te}(c)$ and $fpr_{Te}(c)$, the ACC method consists of estimating them on $Tr$ via k-fold cross-validation and using the resulting estimates in Equation 8.

However, one problem with ACC is that it is not guaranteed to return a value in [0,1], due to the fact that the estimates of $tpr_{Te}(c)$ and $fpr_{Te}(c)$ may be imperfect. This led Forman [2008] to "clip" the results of the estimation (i.e., equate to 1 every value higher than 1 and to 0 every value lower than 0) in order for the final results to be in [0,1].

Probabilistic Adjusted Classify and Count (PACC). The PACC method (proposed in [Bella et al. 2010], where it is called "Scaled Probability Average") is a probabilistic variant of ACC, i.e., it stands to ACC like PCC stands to CC. Its underlying idea is to replace, in Equation 8, $\hat{\lambda}^{CC}_{Te}(c)$, $tpr_{Te}(c)$ and $fpr_{Te}(c)$ with their expected values, with probability of membership in $c$ replacing binary predictions. Equation 8 is thus transformed into

$$\hat{\lambda}^{PACC}_{Te}(c) = \frac{\hat{\lambda}^{PCC}_{Te}(c) - E[fpr_{Te}(c)]}{E[tpr_{Te}(c)] - E[fpr_{Te}(c)]} \tag{9}$$

where $E[tpr_{Te}(c)]$ and $E[fpr_{Te}(c)]$ (expected $tpr_{Te}(c)$ and expected $fpr_{Te}(c)$, respectively) are defined as

$$E[tpr_{Te}(c)] = \frac{1}{|Te_c|} \sum_{d_j \in Te_c} p(c \mid d_j) \tag{10}$$

$$E[fpr_{Te}(c)] = \frac{1}{|Te_{\bar{c}}|} \sum_{d_j \in Te_{\bar{c}}} p(c \mid d_j) \tag{11}$$

and $Te_c$ (resp., $Te_{\bar{c}}$) indicates the set of documents in $Te$ that belong (resp., do not belong) to class $c$. Again, since we cannot know the true $E[tpr_{Te}(c)]$ and $E[fpr_{Te}(c)]$ (given that we do not know $Te_c$ and $Te_{\bar{c}}$), we estimate them on $Tr$ via k-fold cross-validation and use the resulting estimates in Equation 9.

Threshold@0.50 (T50), Method X (X), and Method Max (MAX). Forman [2008] points out that the ACC method is very sensitive to the decision threshold of the classifier, which may yield unreliable values of $\hat{\lambda}^{ACC}_{Te}(c)$ (or lead to $\hat{\lambda}^{ACC}_{Te}(c)$ being undefined when $tpr_{Te} = fpr_{Te}$). In order to reduce this sensitivity, [Forman 2008] recommends heuristically setting the decision threshold in such a way that $tpr_{Tr}$ (as obtained via k-fold cross-validation) is equal to 0.50 before computing Equation 8. This method is dubbed Threshold@0.50 (T50).
Alternative heuristics that [Forman 2008] discusses are to set the decision threshold in such a way that $fpr_{Tr} = 1 - tpr_{Tr}$ (this is dubbed Method X) or such that $(tpr_{Tr} - fpr_{Tr})$ is maximized (this is dubbed Method Max).

Median Sweep (MS). Alternatively, [Forman 2008] recommends computing $\hat{\lambda}^{ACC}_{Te}(c)$ for every decision threshold that gives rise (in k-fold cross-validation) to different $tpr_{Tr}$ or $fpr_{Tr}$ values, and taking the median of all the resulting estimates of $\hat{\lambda}^{ACC}_{Te}(c)$. This method is dubbed Median Sweep (MS).

Mixture Model (MM). The MM method (proposed in [Forman 2005]) consists of assuming that the distribution $D_{Te}$ of the scores that the classifier assigns to the test examples is a mixture

$$D_{Te} = \lambda_{Te}(c) \cdot D^{c}_{Te} + (1 - \lambda_{Te}(c)) \cdot D^{\bar{c}}_{Te} \tag{12}$$

where $D^{c}_{Te}$ and $D^{\bar{c}}_{Te}$ are the distributions of the scores that the classifier assigns to the positive and the negative test examples, respectively, and where $\lambda_{Te}(c)$ and $\lambda_{Te}(\bar{c})$ are the parameters of this mixture. The MM method consists of estimating $D^{c}_{Te}$ and $D^{\bar{c}}_{Te}$ via k-fold cross-validation, and picking as value of $\lambda_{Te}(c)$ the one that generates the best fit between the observed $D_{Te}$ and the mixture. Two variants of this method, called the Kolmogorov-Smirnov Mixture Model (MM(KS)) and the PP-Area Mixture Model (MM(PP)), are actually defined in [Forman 2005]; they differ in terms of how the goodness of fit between the left- and the right-hand side of Equation 12 is estimated. See [Forman 2005] for more details.
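The adjustment of Equation 8, the clipping step, the three threshold-selection heuristics and the median sweep can be sketched together. This is an illustration under stated assumptions and not Forman's implementation: the function names, the representation of the cross-validation results as (threshold, tpr, fpr) triples, and all numeric values are hypothetical.

```python
from statistics import median


def adjusted_classify_and_count(cc_estimate: float, tpr: float, fpr: float) -> float:
    """ACC (Equation 8), clipped to [0, 1]."""
    if tpr == fpr:
        raise ValueError("ACC is undefined when tpr equals fpr")
    adjusted = (cc_estimate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, adjusted))


def select_threshold(curve, method: str) -> float:
    """Picks a decision threshold from (threshold, tpr, fpr) triples.

    'T50' -> tpr closest to 0.50; 'X' -> fpr closest to 1 - tpr;
    'MAX' -> largest tpr - fpr.
    """
    criteria = {
        "T50": lambda t: abs(t[1] - 0.5),
        "X": lambda t: abs(t[2] - (1.0 - t[1])),
        "MAX": lambda t: -(t[1] - t[2]),
    }
    return min(curve, key=criteria[method])[0]


def median_sweep(per_threshold) -> float:
    """MS: median of the ACC estimates over (cc, tpr, fpr) triples."""
    estimates = [adjusted_classify_and_count(cc, tpr, fpr)
                 for cc, tpr, fpr in per_threshold if tpr != fpr]
    return median(estimates)


def expected_cc(true_prev: float, tpr: float, fpr: float) -> float:
    """CC estimate produced by a classifier with the given tpr and fpr."""
    return true_prev * tpr + (1.0 - true_prev) * fpr


# Hypothetical cross-validation curve: (threshold, tpr, fpr).
curve = [(-1.0, 0.95, 0.40), (0.0, 0.80, 0.10), (1.0, 0.50, 0.02)]

true_prev = 0.2
acc_estimate = adjusted_classify_and_count(expected_cc(true_prev, 0.80, 0.10), 0.80, 0.10)
ms_estimate = median_sweep([(expected_cc(true_prev, tpr, fpr), tpr, fpr)
                            for _, tpr, fpr in curve])
print("ACC with exact tpr/fpr:", acc_estimate)
print("T50 / X / MAX thresholds:",
      [select_threshold(curve, m) for m in ("T50", "X", "MAX")])
print("Median sweep:", ms_estimate)
```

When the tpr and fpr plugged into Equation 8 are the ones the classifier actually obtains on the test set, the adjustment recovers the true prevalence; clipping only comes into play when the cross-validation estimates are off.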

3. OPTIMIZING QUANTIFICATION ACCURACY

A problem with the methods discussed in §2.2 is that most of them are fairly heuristic in nature. For instance, the fact that methods such as ACC (and all the others based on it, such as T50, MS, X, and MAX) require "clipping" is scarcely reassuring. More in general, methods such as T50 or MS have hardly any theoretical foundation, and choosing them over CC only rests on our knowledge that they have performed better in previously reported experiments.

A further problem is that some of these methods rest on assumptions that seem problematic. For instance, one problem with the MM method is that it seems to implicitly rely on the hypothesis that estimating $D^{c}_{Te}$ and $D^{\bar{c}}_{Te}$ via k-fold cross-validation on $Tr$ can be done reliably. However, since the very motivation of doing quantification is that the training set and the test set may have quite different characteristics, this hypothesis seems adventurous. A similar argument casts some doubt on ACC: how reliable are the estimates of $tpr_{Te}$ and $fpr_{Te}$ that can be generated via k-fold cross-validation on $Tr$, given the different characteristics that training set and test set may have in the application contexts where quantification is required?² In sum, the very same arguments that are used to deem the CC method unsuitable for quantification seem to undermine the previously mentioned attempts at improving on CC.

In this paper we propose a new, theoretically well-founded quantification method that radically differs from the ones discussed in §2.2. Note that all of the methods discussed in §2.2 employ general-purpose supervised learning methods, i.e., address quantification by post-processing the results returned by a standard classifier (where the decision threshold has possibly been tuned according to some heuristics).
In particular, all the supervised learning methods adopted in the literature on quantification optimize Hamming distance or variants thereof, and not a quantification-specific evaluation function. When the dataset is imbalanced (typically: when the positives are by far outnumbered by the negatives), as is frequently the case in text classification, this is suboptimal, since a supervised learning method that minimizes Hamming distance will generate classifiers with a tendency to make negative predictions. This means that $FN$ will be much higher than $FP$, to the detriment of quantification accuracy.³

We take a sharply different approach, based upon the use of classifiers explicitly optimized for the evaluation function that we will use for assessing quantification accuracy. Given such a classifier, we will simply use a "classify and count" approach, with no heuristic threshold tuning (à la T50 / X / MAX) and no a posteriori adjustment (à la ACC).

The idea of using learning algorithms capable of directly optimizing the measure (a.k.a. "loss") used for evaluating effectiveness is well-established in supervised learning. However, in our case following this route is non-trivial, because the evaluation measure that we want to use (KLD) is non-linear, i.e., is such that the error on the test set may not be formulated as a linear combination of the error incurred by each test example. An evaluation measure for quantification is inherently non-linear, because how the error on an individual test item impacts on the overall quantification error depends on how the other test items have been classified. For instance, if in the other test items there are more false positives than false negatives, an additional false negative is actually beneficial to overall quantification error, because of the mutual compensation effect between $FP$ and $FN$ mentioned in Section 1.
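The compensation effect between $FP$ and $FN$ can be illustrated with a small numeric sketch. The contingency tables below are hypothetical, and absolute error of the classify-and-count estimate is used here (instead of KLD) only to keep the arithmetic transparent.

```python
def hamming_errors(fp: int, fn: int) -> int:
    """Number of misclassified items; each error counts independently."""
    return fp + fn


def prevalence_error(tp: int, fp: int, fn: int, tn: int) -> float:
    """Absolute error of the classify-and-count prevalence estimate."""
    n = tp + fp + fn + tn
    true_prev = (tp + fn) / n  # depends on the gold labels only
    est_prev = (tp + fp) / n   # depends on the predictions only
    return abs(est_prev - true_prev)


# Hypothetical starting point: more false positives than false negatives.
before = {"tp": 90, "fp": 30, "fn": 10, "tn": 870}
# One more positive item is misclassified: a true positive becomes a false negative.
after = {"tp": 89, "fp": 30, "fn": 11, "tn": 870}

print("Hamming errors   :", hamming_errors(before["fp"], before["fn"]),
      "->", hamming_errors(after["fp"], after["fn"]))
print("Prevalence error :", prevalence_error(**before),
      "->", prevalence_error(**after))
```

The additional false negative increases the number of misclassified items but decreases the quantification error, which is why the latter cannot be written as a sum of per-item errors.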
As a result, a measure of quantification accuracy is inherently non-linear, and should thus be multivariate, i.e., take into consideration all test items at once. As discussed in [Joachims 2005], the assumption that the error on the test set may be formulated as a linear combination of the error incurred by each test example (as indeed happens for many common error measures, e.g., Hamming distance) underlies most existing discriminative learners, which are thus suboptimal for tackling quantification.

² In Appendix A we thoroughly analyse (also by means of concrete experiments) the issue of how (un)reliable the k-fold cross-validation estimates of $tpr_{Te}$ and $fpr_{Te}$ are in practice.
³ To witness, in the experiments we report in Section 4 our 5148 test sets exhibit, when classified by the classifiers generated by the linear SVM used for implementing the CC method, an average $FP/FN$ ratio of 0.109; by contrast, for an optimal quantifier this ratio is always 1.

In order to sidestep this problem, we adopt the SVM for Multivariate Performance Measures (SVM^perf) learning algorithm proposed by Joachims [2005].⁴ SVM^perf is a learning algorithm of the Support Vector Machine family that can generate classifiers optimized for any non-linear, multivariate loss function that can be computed from a contingency table (as KLD is). SVM^perf is a specialization to the problem of binary classification of the structural SVM (SVM^struct) learning algorithm [Joachims et al. 2009a; Joachims et al. 2009b; Tsochantaridis et al. 2004] for structured prediction, i.e., an algorithm designed for predicting multivariate, structured objects (e.g., trees, sequences, sets). SVM^perf is fundamentally different from conventional algorithms for learning classifiers: while these latter learn univariate classifiers (i.e., functions of type $\hat{\Phi} : D \rightarrow \{-1, +1\}$ that classify individual instances one at a time), SVM^perf learns multivariate classifiers (i.e., functions of type $\hat{\Phi} : D^{|S|} \rightarrow \{-1, +1\}^{|S|}$ that classify entire sets $S$ of instances in one shot). By doing so, SVM^perf can optimize properties of entire sets of instances, properties (such as KLD) that cannot be expressed as linear functions of the properties of the individual instances.

As discussed in [Joachims et al. 2009b], SVM^struct can be adapted to a specific task by defining four components:

(1) A joint feature map $\Psi(x, y)$.
This function computes a vector of features (describing the match between the input vectors in $x$ and the relative outputs, true or predicted, in $y$) from all the input-output pairs at the same time. In this way the number of features, and thus the number of parameters of the model, can be kept constant regardless of the size of the sample set. The $\Psi$ function makes it possible to generalise not only on inputs ($x$) but also on outputs ($y$), and thus to produce predictions not seen in the training data. In SVM^perf $\Psi$ is defined⁵ as

$$\Psi(x, y) = \frac{1}{n} \sum_{i=1}^{n} y_i x_i \tag{13}$$

(2) A loss function $\Delta(y, \hat{y})$. SVM^perf works with loss functions $\Delta(TP, FP, FN, TN)$ in which the four values are those from the contingency table resulting from comparing the true labels $y$ with the predicted labels $\hat{y}$. In our work we take the loss function to be KLD, i.e.,⁶

$$\Delta_{KLD}(TP, FP, FN, TN) = KLD(\lambda, \hat{\lambda}) \tag{14}$$

where $\lambda(c) = \frac{TP + FN}{TP + FP + FN + TN}$ and $\hat{\lambda}(c) = \frac{TP + FP}{TP + FP + FN + TN}$.

(3) An algorithm for the efficient computation of a hypothesis

$$\hat{\Phi}(x) = \arg\max_{\hat{y} \in Y} \{w \cdot \Psi(x, \hat{y})\} \tag{15}$$

where $w$ is a vector of parameters. In SVM^perf this simply corresponds to computing

$$\hat{\Phi}(x) = (\mathrm{sign}(w \cdot x_1), ..., \mathrm{sign}(w \cdot x_n)) \tag{16}$$

(4) An algorithm for the efficient computation of the loss-augmented hypothesis

$$\hat{\Phi}_{\Delta}(x) = \arg\max_{\hat{y} \in Y} \{\Delta(y, \hat{y}) + w \cdot \Psi(x, \hat{y})\} \tag{17}$$

which in SVM^perf is computed via an algorithm [Joachims 2005, Algorithm 2] with $O(n^2)$ worst-case complexity.

⁴ In [Joachims 2005] SVM^perf is actually called SVM^multi, but the author has released its implementation under the name SVM^perf. We will use this latter name because it uniquely identifies the algorithm on the Web, while searching for "SVM multi" often returns the SVM^multiclass package, which addresses a different problem.
⁵ For this formulation of $\Psi$, and when error rate is the chosen loss function, Joachims [2005] shows that SVM^perf coincides with the traditional univariate SVM model (called SVM^org in [Joachims 2005]).

We have used the implementation of SVM^perf made available by Joachims,⁷ which we have extended by implementing the module that takes care of the KLD loss function. In the rest of the paper our method will be dubbed SVM(KLD).

4. EXPERIMENTS

We now present the results of experiments aimed at assessing whether the approach we have proposed in Section 3 delivers better quantification accuracy than state-of-the-art quantification methods. To this end, all our experiments use as baselines for our SVM(KLD) method all the methods described in §2.2. For the CC, ACC, T50, X, MAX, MS, MM(KS), MM(PP) methods we have used the original implementation that we have obtained from the author (this guarantees that the baselines perform at their full potential). We have instead implemented PCC and PACC ourselves. At the heart of the implementation of all the baselines is a standard linear SVM with the parameters set at their default values; where quantities (such as, e.g., $fpr_{Te}$ and $tpr_{Te}$; see Equation 8) had to be estimated from the training set, we have used 50-fold cross-validation, as done and recommended in [Forman 2008].
In order to guarantee a fair comparison with the baselines we have used the default values for the parameters also for SVM^perf, which lies at the basis of our SVM(KLD) method.⁸

⁶ In Equation 14 KLD is written as a function of $TP$, $FP$, $FN$, $TN$ for the simple fact that in [Joachims 2005] (where SVM^perf was originally described) the loss function is specified as a function of the four cells of the contingency table. However, it should be clear that $\lambda(c)$ does not depend on the predicted labels: even if in Equation 14 we have written it out as $\lambda(c) = \frac{TP + FN}{TP + FP + FN + TN}$, this latter is equivalent to writing $\lambda(c) = \frac{GP}{GP + GN}$, where $GP$ (the "gold positives") is $TP + FN$ and $GN$ (the "gold negatives") is $FP + TN$. Seen under this light, there is no trace of predicted labels in $\lambda(c) = \frac{GP}{GP + GN}$, and $\lambda(c)$ is just a function of the gold standard and not of the prediction. Analogously, it should be clear that $\hat{\lambda}(c)$ is just a function of the prediction and not of the gold standard.
⁷ SVM^perf is available from perf.html. Our module that extends it to deal with KLD is available at
⁸ An additional reason why we have left the parameters at their default values is that, in a context in which the characteristics of $Tr$ and $Te$ may substantially differ, it is not clear that the parameter values which are found optimal on $Tr$ via k-fold cross-validation will also prove optimal (or at least will perform reasonably) on $Te$.
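The four components listed in Section 3 can be made concrete on a toy problem. The sketch below is not the SVM^perf implementation and does not reproduce Algorithm 2 of [Joachims 2005]: the argmax of Equations 15 and 17 is computed by brute-force enumeration of all $2^n$ labelings, which is feasible only for very small $n$. The weight vector, the data, the smoothing constant and the reading of the smoothing (adding $\epsilon$ to numerator and denominator of each fraction inside the logarithm) are all assumptions made here for illustration.

```python
import itertools
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def joint_feature_map(xs, ys):
    """Psi(x, y) = (1/n) * sum_i y_i * x_i (Equation 13)."""
    n, dim = len(xs), len(xs[0])
    return [sum(y * x[k] for x, y in zip(xs, ys)) / n for k in range(dim)]


def kld_loss(y_true, y_pred, eps):
    """Delta_KLD (Equation 14) with the fractions smoothed by eps."""
    n = len(y_true)
    lam = sum(1 for y in y_true if y == 1) / n      # (TP + FN) / n
    lam_hat = sum(1 for y in y_pred if y == 1) / n  # (TP + FP) / n
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in ((lam, lam_hat), (1 - lam, 1 - lam_hat)))


def all_labelings(n):
    return itertools.product((-1, 1), repeat=n)


def hypothesis(w, xs):
    """Equation 15 by exhaustive search."""
    return max(all_labelings(len(xs)),
               key=lambda ys: dot(w, joint_feature_map(xs, ys)))


def loss_augmented_hypothesis(w, xs, y_true, eps):
    """Equation 17 by exhaustive search (not the O(n^2) algorithm)."""
    return max(all_labelings(len(xs)),
               key=lambda ys: kld_loss(y_true, ys, eps)
               + dot(w, joint_feature_map(xs, ys)))


# Toy data: five 2-dimensional items and an arbitrary weight vector.
w = [1.0, -0.5]
xs = [[2.0, 1.0], [0.5, 2.0], [-1.0, 0.0], [0.2, 0.1], [-0.3, 1.0]]
y_true = (1, -1, -1, 1, -1)
eps = 1.0 / (2 * len(xs))

sign_rule = tuple(1 if dot(w, x) > 0 else -1 for x in xs)  # Equation 16
print("argmax of Eq. 15:", hypothesis(w, xs))
print("sign rule, Eq. 16:", sign_rule)
print("loss-augmented   :", loss_augmented_hypothesis(w, xs, y_true, eps))
```

On this toy problem the exhaustive argmax of Equation 15 coincides with the item-by-item sign rule of Equation 16, whereas the loss-augmented argmax couples the items through the KLD term and therefore has to be searched over whole labelings.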

In order to generate the vectorial representations for our documents, the classic bag-of-words approach has been adopted. In particular, punctuation has been removed, all letters have been converted to lowercase, numbers have been removed, stop words have been removed using the stop list provided in [Lewis 1992, pages ], and stemming has been performed by means of the version of Porter's stemmer available from
All the remaining stemmed words ("terms") that occur at least once in Tr have thus been used as features of our vectorial representations of documents; no feature selection has been performed. Feature weights have been obtained via the "ltc" variant [Salton and Buckley 1988] of the well-known tf-idf class of weighting functions, i.e.,

$$tfidf(t_k, d_i) = tf(t_k, d_i) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)} \tag{18}$$

where $d_i$ is a document, $\#_{Tr}(t_k)$ denotes the number of documents in Tr in which feature $t_k$ occurs at least once, and

$$tf(t_k, d_i) = \begin{cases} 1 + \log \#(t_k, d_i) & \text{if } \#(t_k, d_i) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{19}$$

where $\#(t_k, d_i)$ denotes the number of times $t_k$ occurs in $d_i$. Weights obtained by Equation 18 are normalized through cosine normalization, i.e.,

$$w_{ki} = \frac{tfidf(t_k, d_i)}{\sqrt{\sum_{s=1}^{|T|} tfidf(t_s, d_i)^2}} \tag{20}$$

where $|T|$ denotes the total number of features.

Following [Forman 2008], we set the $\epsilon$ constant for smoothing KLD to the value $\epsilon = \frac{1}{2|Te|}$.

4.1. Datasets

The datasets we use for our experiments have been extracted from two important text classification test collections, REUTERS CORPUS VOLUME 1 version 2 (RCV1-V2) and OHSUMED-S.

RCV1-V2 is a standard, publicly available benchmark for text classification consisting of 804,414 news stories produced by Reuters from 20 Aug 1996 to 19 Aug 1997. RCV1-V2 ranks as one of the largest corpora currently used in text classification research and, as pointed out in [Forman 2006b], suffers from extensive "drift", i.e., from substantial variability between the training set and the test set, which makes it a challenging dataset for quantification.
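The "ltc" weighting of Equations 18 to 20 can be sketched as follows. This is a minimal illustration, assuming documents that are already tokenized, stemmed, and stop-word filtered, and assuming natural logarithms; the function name `ltc_weights` and the toy documents are hypothetical and do not come from the authors' preprocessing module.

```python
import math
from collections import Counter
from typing import Dict, List


def ltc_weights(train_docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one cosine-normalized tf-idf vector (as a dict) per training document."""
    n_docs = len(train_docs)
    # Document frequency: number of training documents containing each term.
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    vectors = []
    for doc in train_docs:
        counts = Counter(doc)
        # Equations 18 and 19: (1 + log tf) * log(|Tr| / df).
        raw = {
            t: (1.0 + math.log(c)) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        }
        # Equation 20: cosine normalization (guarding against an all-zero vector).
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: (v / norm if norm > 0 else 0.0) for t, v in raw.items()})
    return vectors


# Toy documents, for illustration only.
toy_docs = [["oil", "price", "oil"], ["price", "stock"], ["stock", "market"]]
toy_vectors = ltc_weights(toy_docs)
```

A term that occurs in every training document receives an idf factor of log 1 = 0 and therefore a weight of zero, while each non-degenerate document vector has unit Euclidean length after normalization.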
In our experiments we have used the 12,807 news stories of the 1st week (20 to 26 Aug 1996) for training, and the 791,607 news stories of the other 52 weeks for testing^10. We have further partitioned these latter into 52 test sets each consisting of one week's worth of data^11. RCV1-V2 is multi-label, i.e., a document may belong to several classes at the same time. Of the 103 classes of which its "Topic" hierarchy consists, in our experiments we have restricted our attention to the 99 classes with at least one positive training example. Consistently with the evaluation presented in [Lewis et al. 2004], also classes placed at internal nodes in the hierarchically organized classification scheme are considered in the evaluation; as positive examples of these classes we use the union of the positive examples of their subordinate nodes, plus their "own" positive examples.

^10 This is the standard "LYRL2004" split between training and test data, originally defined in [Lewis et al. 2004].

^11 More precisely, since the period covered by RCV1-V2 consists of 365 days, i.e., 52 full weeks + 1 day, the 52nd test set consists of 1 day's worth of data only.

The OHSUMED-S dataset [Esuli and Sebastiani 2013] is a subset of the well-known OHSUMED test collection [Hersh et al. 1994]. OHSUMED-S consists of a set of 15,643 MEDLINE records spanning the years from 1987 to 1991, where each record is classified under one or more of the 97 MeSH index terms that belong to the Heart Disease (HD) subtree of the well-known MeSH tree of index terms^12. Each entry consists of summary information relative to a paper published in one of 270 medical journals; the available fields are title, abstract, author, source, publication type, and MeSH index terms. As the training set we have used, consistently with [Esuli and Sebastiani 2013], the 2,510 documents belonging to year 1987; 9 MeSH index terms out of the 97 in the HD subtree are never assigned to any training document, so the number of classes we actually use is 88. We partition the four remaining years' worth of data into four bins (1988, 1989, 1990, 1991), each containing the documents generated within the corresponding calendar year. The reason why we do not use the entire OHSUMED dataset is that roughly 93% of OHSUMED entries have no class assigned from the HD subtree, which means that the classes in the HD subtree have very low prevalence ($\lambda_{Tr}$ = on average); we thus prefer to use OHSUMED-S, which presents a wider range of prevalence values.

This experimental setting thus generates 52 × 99 = 5148 binary quantification test sets for RCV1-V2 (containing an average of 15,223 documents each), and 4 × 88 = 352 test sets for OHSUMED-S (containing an average of 3,283 documents each).
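The temporal split and the resulting number of test sets can be checked with a short Python sketch. The helper `weekly_bins` is hypothetical, and the end date of 19 Aug 1997 is an assumption inferred from the 365-day span mentioned in footnote 11; the class counts (99 and 88) and the number of OHSUMED-S test years (4) are taken from the text.

```python
from datetime import date, timedelta
from typing import List, Tuple


def weekly_bins(first_day: date, last_day: date) -> List[Tuple[date, date]]:
    """Split an inclusive date range into consecutive 7-day bins (last bin may be shorter)."""
    bins = []
    start = first_day
    while start <= last_day:
        end = min(start + timedelta(days=6), last_day)
        bins.append((start, end))
        start = end + timedelta(days=1)
    return bins


# Assumed period covered by RCV1-V2 (365 days).
bins = weekly_bins(date(1996, 8, 20), date(1997, 8, 19))
training_week, test_weeks = bins[0], bins[1:]

n_rcv1_test_sets = len(test_weeks) * 99   # one binary test set per (week, class) pair
n_ohsumed_test_sets = 4 * 88              # one binary test set per (year, class) pair
n_total_test_sets = n_rcv1_test_sets + n_ohsumed_test_sets
```

Under these assumptions the first bin is the training week (20 to 26 Aug 1996), the remaining 52 bins are the test sets, and the last of them covers a single day.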
This large number of test sets will give us an opportunity to study quantification across different dimensions (e.g., across classes characterized by different prevalence, across classes characterized by different amounts of drift, across time)^13. More detailed figures about our datasets are given in Table I. Note that both RCV1-V2 and OHSUMED-S classes are characterized by severe imbalance, as can be noticed from the two "Avg prevalence of the positive class" rows of Table I, where both values are very far away from the value of 0.5, which represents perfect balance. On a side note, it is well-known that the bag-of-words extraction process outlined a few paragraphs above gives rise to very high-dimensional (albeit sparse) vectors; our case is no exception, and the dimensionality of our vectors is 53,204 (RCV1-V2) and 11,286 (OHSUMED-S), respectively.

Note that the experimental protocol we adopt is different from the one adopted by Forman. In [Forman 2005; 2006a; 2008] he proposes a protocol in which, given a training set Tr and a test set Te, several controlled experiments are run by artificially altering class prevalences (i.e., by randomly removing predefined percentages of the positives or of the negatives) either on Tr or on Te. This protocol is meant to test the robustness of the methods with respect to different distribution drifts (i.e., differences between $\lambda_{Tr}(c)$ and $\lambda_{Te}(c)$ of different magnitude) and different class prevalence values. We prefer to opt for a different protocol, one in which "natural" training and test sets are used, without artificial alterations. The reason is that artificial alterations may generate class prevalence values and/or distribution drifts that are simply not realistic (e.g., a situation in which $\lambda_{Tr}(c) = 0.40$ and $\lambda_{Te}(c) = 0.01$); conversely, focusing on naturally occurring datasets forces us to come to terms with realistic levels of class prevalence and/or distribution drift.
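For comparison, the kind of artificial alteration used in Forman's protocol can be sketched as below: positives or negatives are randomly removed until the sample reaches a target prevalence. This is only an illustration of the general idea, not Forman's actual code; the function name `alter_prevalence`, the rounding rule, and the fixed random seed are assumptions.

```python
import random
from typing import List, Sequence


def alter_prevalence(labels: Sequence[int], target: float, seed: int = 0) -> List[int]:
    """Return indices of a subsample whose positive prevalence approximates `target`.

    Items are removed from one class only: positives if the target is below the
    current prevalence, negatives otherwise. Requires 0 < target < 1.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y != 1]
    current = len(pos) / len(labels)
    if target < current:
        # Keep k positives so that k / (k + |neg|) is close to the target.
        keep = round(target * len(neg) / (1.0 - target))
        pos = rng.sample(pos, min(keep, len(pos)))
    else:
        # Keep k negatives so that |pos| / (|pos| + k) is close to the target.
        keep = round(len(pos) * (1.0 - target) / target)
        neg = rng.sample(neg, min(keep, len(neg)))
    return sorted(pos + neg)


# Toy label vector with prevalence 0.40, altered towards 0.01 and towards 0.80.
toy_labels = [1] * 40 + [0] * 60
low = alter_prevalence(toy_labels, 0.01)
high = alter_prevalence(toy_labels, 0.80)
```

With only 100 items, a target of 0.01 can be approximated only coarsely (a single positive is kept), which hints at how far such alterations can move a sample away from its natural composition.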
We will thus adopt the latter protocol in all the experiments discussed in this paper, compensating for the absence of artificial alterations by also studying (see Section 4.2.3) the behaviour of our methods separately on test sets characterized by different (and natural) levels of distribution drift.

^13 In order to guarantee perfect reproducibility of our results, we make available at quantification/ the feature vectors of the RCV1-V2 and OHSUMED-S documents as extracted from our preprocessing module, and already split respectively into the 53 and 5 sets described above.

Table I. Main characteristics of the datasets used in our experiments.

                                              RCV1-V2    OHSUMED-S
ALL
  Total # of docs                             804,414    15,643
  # of classes (i.e., binary tasks)
  Time unit used for split                    week       year
TRAINING
  # of docs                                   12,807     2,510
  # of features                               53,204     11,286
  Min # of positive docs per class            2          1
  Max # of positive docs per class            5,
  Avg # of positive docs per class
  Min prevalence of the positive class
  Max prevalence of the positive class
  Avg prevalence of the positive class
TEST
  # of docs                                   791,607    13,133
  # of test sets per class                    52         4
  Avg # of test docs per set                  15,212     3,283
  Min # of positive docs per class            0          0
  Max # of positive docs per class            9,775      1,250
  Avg # of positive docs per class
  Min prevalence of the positive class
  Max prevalence of the positive class
  Avg prevalence of the positive class

4.2. Testing quantification accuracy

We have run our experiments by learning quantifiers for each class c on the respective training set and testing the quantifiers separately on each of the test sets, using KLD as the evaluation measure. We have done this for all the 99 classes × 52 weeks in RCV1-V2 and for all the 88 classes × 4 years in OHSUMED-S, and for all the 10 baseline methods discussed in Section 2.2 plus our SVM(KLD) method.

4.2.1. Analysing the results along the class dimension. We first discuss the results according to the class dimension, i.e., by averaging the results for each RCV1-V2 class across the 52 test weeks and for each OHSUMED-S class across the 4 years^14. Since this would leave no less than 99 RCV1-V2 classes to discuss, we further average the results across all the RCV1-V2 classes characterized by a training class prevalence $\lambda_{Tr}(c)$ that falls into a certain interval (same for OHSUMED-S and its 88 classes).
This allows us to separately check the behaviour of our quantification methods on groups of classes that are homogeneous by level of imbalance. We have also run statistical significance tests in order to check whether the improvement obtained by the best performing method over the 2nd best performer on the group is statistically significant^15.

The results are reported in Table II, where four levels of imbalance have been singled out: very low prevalence (VLP, which accounts for all the classes c such that $\lambda_{Tr}(c) < 0.01$; there are 48 RCV1-V2 and 51 OHSUMED-S such classes), low prevalence (LP, $0.01 \leq \lambda_{Tr}(c) < 0.05$; 34 RCV1-V2 and 28 OHSUMED-S classes), high prevalence (HP, $0.05 \leq \lambda_{Tr}(c) < 0.10$; 10 RCV1-V2 and 4 OHSUMED-S classes), and very high prevalence (VHP, $0.10 \leq \lambda_{Tr}(c)$; 7 RCV1-V2 and 5 OHSUMED-S classes).

^14 Wherever in this paper we speak of averaging accuracy results across different test sets, what we mean is actually macroaveraging, i.e., taking the accuracy results on the individual test sets and computing their arithmetic mean. This is sharply different from microaveraging, i.e., merging the test sets and computing a single accuracy figure on the merged set. In a quantification setting microaveraging does not make sense, since false positives from one set and false negatives from another set would compensate each other, thus generating misleadingly high accuracy values.

^15 All the statistical significance tests discussed in this paper are based on a two-tailed paired t-test and the use of a significance level.

Table II. Accuracy of SVM(KLD) and of 10 baseline methods as measured in terms of KLD on the 99 classes of RCV1-V2 (top) and on the 88 classes of OHSUMED-S (bottom) grouped by class prevalence in Tr (Columns 2 to 5); lower values are better; Column 6 indicates average accuracy across all the classes. The best result in each column is indicated with boldface only when there is a statistically significant difference with respect to each of the other tested methods (p < 0.001, two-tailed paired t-test on KLD value across the test sets in the group). The methods are ranked in terms of the value indicated in the "All" column.

RCV1-V2
  Method      VLP      LP         HP    VHP   All
  SVM(KLD)    2.09E    4.92E-04   E     E     1.32E-03
  PACC        2.16E    E          E     E     1.74E-03
  ACC         2.17E    E          E     E     E-03
  MAX         2.16E    E          E     E     E-03
  CC          2.55E    E          E     E     E-03
  X           3.48E    E          E     E     E-03
  PCC         1.04E    E          E     E     E-03
  MM(PP)      1.76E    E          E     E     E-02
  MS          1.98E    E          E     E     E-02
  T50         E        E          E     E     E-02
  MM(KS)      2.00E    E          E     E     E-02

OHSUMED-S
  Method      VLP      LP         HP    VHP   All
  SVM(KLD)    1.21E    E          E     E     E-03
  PACC        2.86E    E          E     E     E-03
  ACC         2.37E    E          E     E     E-03
  CC          2.38E    E          E     E     E-03
  X           1.38E    E          E     E     E-03
  MM(PP)      4.90E    E          E     E     E-03
  MM(KS)      1.37E    E          E     E     E-02
  MS          3.80E    E          E     E     E-02
  T50         E        E          E     E     E-02
  MAX         5.57E    E          E     E     E-02
  PCC         1.20E    E          E     E     E

The first observation that can be made by looking at the RCV1-V2 results in this table is that, when evaluated across all our 5148 test sets (Column 6), SVM(KLD) outperforms all the other baseline methods in a statistically significant way, scoring a KLD value of 1.32E-03 against the 1.74E-03 value (a -24.2% error reduction) obtained by the best-performing baseline (the PACC method).
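The grouping by training prevalence and the macroaveraging described in footnote 14 can be sketched as follows. This is a minimal illustration: the function names and the toy KLD values are hypothetical, and the thresholds are the VLP, LP, HP, and VHP boundaries given in the text.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def prevalence_group(train_prevalence: float) -> str:
    """Map a training class prevalence to one of the four imbalance groups."""
    if train_prevalence < 0.01:
        return "VLP"
    if train_prevalence < 0.05:
        return "LP"
    if train_prevalence < 0.10:
        return "HP"
    return "VHP"


def macro_average_by_group(results: Iterable[Tuple[float, float]]) -> Dict[str, float]:
    """Macroaverage KLD per group; `results` holds (train prevalence, KLD) per test set."""
    grouped = defaultdict(list)
    for train_prevalence, kld in results:
        grouped[prevalence_group(train_prevalence)].append(kld)
        grouped["All"].append(kld)
    # Arithmetic mean of the per-test-set values: no merging of test sets.
    return {group: mean(values) for group, values in grouped.items()}


# Made-up (training prevalence, KLD) pairs, for illustration only.
toy_results = [(0.005, 0.002), (0.005, 0.004), (0.03, 0.001), (0.2, 0.003)]
toy_averages = macro_average_by_group(toy_results)
```

Because every class contributes the same number of test sets within a dataset, averaging over test sets directly gives the same result as averaging per class first and then across classes.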
This is largely a result of a much better balance between false positives and false negatives obtained by the base classifiers: while (as already observed in Footnote 3) the average FP/FN ratio across the 5148 test sets is for CC, it is for SVM(KLD) (and it is 1 for the perfect quantifier). The OHSUMED-S results essentially confirm the insights obtained from the RCV1-V2 results, with SVM(KLD) again the best of the 11 methods and PACC again the 2nd best; the difference between them is now even higher, with an error reduction of -56.8%.

A second observation that the RCV1-V2 results allow us to make is that SVM(KLD) scores well on all four groups of classes identified; while it is not always the best method (e.g., it is outperformed by other methods in the HP and VHP groups), it consistently performs well on all four groups. In particular, SVM(KLD) seems to excel at classes characterized by drastic imbalance, as witnessed by the VLP group, where SVM(KLD) is the best performer, and by the LP group, where SVM(KLD) is the best performer in a statistically significant way. In fact, this group of classes seems largely responsible for the excellent overall performance (Column 6) displayed by SVM(KLD), since on the LP group the margin between it and the other methods is large (4.92E-04 against 1.98E-03 of the 2nd best method), and since the VLP and LP groups altogether account for no less than 82 out of the total 99 classes. This latter fact characterizes most naturally occurring datasets, whose class distribution usually exhibits a power law, with very few highly frequent classes and very many highly infrequent classes. The OHSUMED-S results essentially confirm the RCV1-V2 results, with SVM(KLD) again the best performer on the VLP and LP classes, and still performing well, although not being the best, on HP and VHP.

The stability of SVM(KLD) is also confirmed by Table III, which reports, for the same groups of test sets identified by Table II, the variance in KLD across the members of the group. For example, on RCV1-V2, Column 3 reports the variance in KLD across all the 34 classes such that $0.01 \leq \lambda_{Tr}(c) < 0.05$ and across the 52 test weeks, for a total of 52 × 34 = 1,768 test sets. What we can observe from this table is that, when averaged across all the 52 × 99 = 5148 test sets (Column 6), the variance of SVM(KLD) is lower than the variance of all other methods in a statistically significant way. The variance of SVM(KLD) is fairly low in all four subsets of classes, and particularly so in the subsets of the most imbalanced classes (VLP and LP), in which SVM(KLD) is the best performer in a statistically significant way. The OHSUMED-S results essentially confirm the RCV1-V2 results, with SVM(KLD) the best performer on VLP and LP, and still behaving well on HP and VHP.

Concerning the baselines, our results seem to disconfirm the ones reported in [Forman 2008] according to which the MS method is the best of the lot, and according to which the ACC method can estimate the class distribution well in many situations, but its performance degrades severely when the training class distribution is highly imbalanced.
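The remark that the FP/FN ratio equals 1 for the perfect quantifier follows directly from the definitions of $\lambda(c)$ and $\hat{\lambda}(c)$ that accompany Equation 14. The short derivation below is a worked restatement of those definitions rather than an additional result, with $N = TP + FP + FN + TN$ introduced here as shorthand:

```latex
\begin{align*}
\hat{\lambda}(c) - \lambda(c)
  &= \frac{TP + FP}{N} - \frac{TP + FN}{N} \\
  &= \frac{FP - FN}{N}
\end{align*}
```

The estimated prevalence therefore coincides with the true prevalence exactly when $FP = FN$, i.e., when $FP/FN = 1$ (for $FN > 0$), regardless of how many individual items are misclassified. A ratio above 1 corresponds to overestimating the prevalence, and a ratio below 1 to underestimating it.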
In our experiments, instead, MS is substantially outperformed by several baseline methods; ACC is instead a strong contender, and (contrary to the statement above) especially shines on the subsets of the most imbalanced classes. The results of both Tables II and III clearly show that, on both RCV1-V2 and OHSUMED-S, PACC is the best of the baseline methods presented in Section 2.2.

4.2.2. Analysing the results along the temporal dimension. We now analyse the results along the temporal dimension. In order to do this for the RCV1-V2 dataset (resp., OHSUMED-S dataset), for each of the 52 test weeks (resp., 4 test years) we average the 99 (resp., 88) accuracy results corresponding to the individual classes, and check the temporal accuracy trend resulting from these averages. This trend is displayed in Figure 1, where the results of SVM(KLD) are plotted together with the results of the three best-performing baseline methods. The plots unequivocally show that SVM(KLD) is the best method across the entire temporal spectrum for both RCV1-V2 and OHSUMED-S.

Note that quantification accuracy remains fairly stable across time, i.e., we are not witnessing any substantial decrease in quantification accuracy with time. Intuition might instead suggest that quantification accuracy should decrease with time, due to the combined effects of true concept drift and distribution drift. This may indicate that (at least in the context of broadcast news that RCV1-V2 represents, and in the context of medical scientific articles that OHSUMED-S represents) the chosen timeframe (one year for RCV1-V2, four years for OHSUMED-S) is not long enough to observe a significant amount of such drift.

4.2.3. Analysing the results along the distribution drift dimension. The last angle according to which we analyse the results is distribution drift. That is, we conduct our analysis in terms of how much the prevalence $\lambda_{Te_i}(c)$ in a given test set $Te_i$ drifts away from
