arxiv: v2 [cs.lg] 15 Apr 2015


Authors' addresses: Andrea Esuli, Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi 1, Pisa, Italy. andrea.esuli@isti.cnr.it. Fabrizio Sebastiani, Qatar Computing Research Institute, PO Box 5825, Doha, Qatar. fsebastiani@qf.org.qa. Fabrizio Sebastiani is on leave from Consiglio Nazionale delle Ricerche. The order in which the authors are listed is purely alphabetical; each author has given an equally important contribution to this work.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY USA, fax +1 (212) , or permissions@acm.org. © YYYY ACM /YYYY/-ARTAA $15.00 DOI:

ANDREA ESULI, Consiglio Nazionale delle Ricerche
FABRIZIO SEBASTIANI, Qatar Computing Research Institute

We address the problem of quantification, a supervised learning task whose goal is, given a class, to estimate the relative frequency (or prevalence) of the class in a dataset of unlabelled items. Quantification has several applications in data and text mining, such as estimating the prevalence of positive reviews in a set of reviews of a given product, or estimating the prevalence of a given support issue in a dataset of transcripts of phone calls to tech support. So far, quantification has been addressed by learning a general-purpose classifier, counting the unlabelled items which have been assigned the class, and tuning the obtained counts according to some heuristics. In this paper we depart from the tradition of using general-purpose classifiers, and use instead a supervised learning model for structured prediction, capable of generating classifiers directly optimized for the (multivariate and non-linear) function used for evaluating quantification accuracy. The experiments that we have run on 5500 binary high-dimensional datasets (averaging more than 14,000 documents each) show that this method is more accurate, more stable, and more efficient than existing, state-of-the-art quantification methods.

Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Design Methodology - Classifier design and evaluation; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering; Search process; I.2.7 [Artificial Intelligence]: Natural Language Processing - Text analysis

General Terms: Algorithm, Design, Experimentation, Measurements

Additional Key Words and Phrases: Quantification, Prevalence estimation, Prior estimation, Supervised learning, Text classification, Loss functions, Kullback-Leibler divergence

ACM Reference Format: Andrea Esuli and Fabrizio Sebastiani, YYYY.
Optimizing Text Quantifiers for Multivariate Loss Functions. ACM Trans. Knowl. Discov. Data. VV, NN, Article AA (YYYY), 26 pages. DOI:

1. INTRODUCTION

In recent years it has been pointed out that, in a number of applications involving classification, the final goal is not determining which class (or classes) individual unlabelled data items belong to, but determining the prevalence (or "relative frequency") of each class in the unlabelled data. The latter task is known as quantification [Forman 2005; 2006a; 2008; Forman et al. 2006].

Although what we are going to discuss here applies to any type of data, we are mostly interested in text quantification, i.e., quantification when the data items are textual documents. To see the importance of text quantification, let us examine the task of classifying textual answers returned to open-ended questions in questionnaires [Esuli and Sebastiani 2010a; Gamon 2004; Giorgetti and Sebastiani 2003], and let us discuss two important such scenarios.

In the first scenario, a telecommunications company asks its current customers the question "How satisfied are you with our mobile phone services?", and wants to classify the resulting textual answers according to whether they belong to the class MayDefectToCompetition. The company is likely interested in accurately classifying each individual customer, since it may want to call each customer that is assigned the class and offer her improved conditions.

In the second scenario, a market research agency asks respondents the question "What do you think of the recent ad campaign for product X?", and wants to classify the resulting textual answers according to whether they belong to the class LovedTheCampaign. Here, the agency is likely not interested in whether a specific individual

belongs to the class LovedTheCampaign, but is likely interested in knowing how many respondents belong to it, i.e., in knowing the prevalence of the class. In sum, while in the first scenario classification is the goal, in the second scenario the real goal is quantification, i.e., evaluating the results of classification at the aggregate level rather than at the individual level.

Other scenarios in which quantification is the goal may be, e.g., predicting election results by estimating the prevalence of blog posts (or tweets) supporting a given candidate or party [Hopkins and King 2010], or planning the amount of human resources to allocate to different types of issues in a customer support center by estimating the prevalence of customer calls related to a given issue [Forman 2005], or supporting epidemiological research by estimating the prevalence of medical reports in which a specific pathology is diagnosed [Baccianella et al. 2013].

The obvious method for dealing with the latter type of scenarios is aggregative quantification, i.e., classifying each unlabelled document and estimating class prevalence by counting the documents that have been attributed the class. However, there are two reasons why this strategy is suboptimal.

The first reason is that a good classifier may not be a good quantifier, and vice versa. To see this, one only needs to look at the definition of $F_1$, the standard evaluation function for binary classification, defined as

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN} \qquad (1)$$

where $TP$, $FP$ and $FN$ indicate the numbers of true positives, false positives, and false negatives, respectively. According to $F_1$, a binary classifier $\hat\Phi_1$ for which $FP = 20$ and $FN = 20$ is worse than a classifier $\hat\Phi_2$ for which, on the same test set, $FP = 0$ and $FN = 10$.
However, $\hat\Phi_1$ is intuitively a better binary quantifier than $\hat\Phi_2$; indeed, $\hat\Phi_1$ is a perfect quantifier, since $FP$ and $FN$ are equal and thus compensate each other, so that the distribution of the test items across the class and its complement is estimated perfectly.

A second reason is that standard supervised learning algorithms are based on the assumption that the training set is drawn from the same distribution as the unlabelled data the classifier is supposed to classify. But in real-world settings this assumption is often violated, a phenomenon usually referred to as concept drift [Sammut and Harries 2011]. For instance, in a backlog of newswire stories from year 2001, the prevalence of class Terrorism in August data will likely not be the same as in September data; training on August data and testing on September data might well yield low quantification accuracy. Violations of this assumption may occur for reasons ranging from the bias introduced by experimental design, to the irreproducibility of the testing conditions at training time [Quiñonero-Candela et al. 2009]. Concept drift usually comes in one of three forms [Kelly et al. 1999]: (a) the class priors $p(c_i)$ may change, i.e., the one in the test set may significantly differ from the one in the training set; (b) the class-conditional distributions $p(x \mid c_i)$ may change; (c) the posterior distribution $p(c_i \mid x)$ may change. It is the first of these three cases that poses a problem for quantification.

The previous arguments indicate that text quantification should not be considered a mere byproduct of text classification, and should be studied as a task of its own. To date, proposed methods explicitly addressed to quantification (see e.g., [Bella et al. 2010; Forman 2005; 2006a; 2008; Forman et al. 2006; Hopkins and King 2010; Xue and Weiss 2009]) employ general-purpose supervised learning methods, i.e., address quantification by elaborating on the results returned by a general-purpose standard classifier.
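The $F_1$ argument made earlier in this section can be checked numerically. The sketch below is only an illustration: the paper gives $FP$ and $FN$ for the two classifiers but not the remaining cells of the contingency table, so the test set size (1000 documents, 120 of them positive) and the resulting $TP$ and $TN$ values are assumptions chosen to be consistent with "the same test set".

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as defined in Equation 1."""
    return 2 * tp / (2 * tp + fp + fn)


def true_and_estimated_prevalence(tp: int, fp: int, fn: int, tn: int):
    """Return (true prevalence, classify-and-count estimate) of the positive class."""
    n = tp + fp + fn + tn
    return (tp + fn) / n, (tp + fp) / n


# Hypothetical test set: 120 positives and 880 negatives (assumed, not from the paper).
phi1 = dict(tp=100, fp=20, fn=20, tn=860)  # FP = FN = 20
phi2 = dict(tp=110, fp=0, fn=10, tn=880)   # FP = 0, FN = 10

f1_phi1 = f1_score(phi1["tp"], phi1["fp"], phi1["fn"])
f1_phi2 = f1_score(phi2["tp"], phi2["fp"], phi2["fn"])

true_prev, est_phi1 = true_and_estimated_prevalence(**phi1)
_, est_phi2 = true_and_estimated_prevalence(**phi2)

print(f"F1:         phi1={f1_phi1:.4f}  phi2={f1_phi2:.4f}")
print(f"prevalence: true={true_prev:.2f}  phi1={est_phi1:.2f}  phi2={est_phi2:.2f}")
```

Under these assumed counts the second classifier has the higher $F_1$, while only the first one recovers the true prevalence exactly.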
In this paper we take a sharply different, structured prediction approach, based upon the use of classifiers explicitly optimized for the non-linear, multivariate

evaluation function that we will use for assessing quantification accuracy. This idea was first proposed, but not implemented, in [Esuli and Sebastiani 2010b].

The rest of the paper is organized as follows. In Section 2, after setting the stage we describe the evaluation function we will adopt (§2.1) and sketch a number of quantification methods previously proposed in the literature (§2.2). In Section 3 we introduce our novel method based on explicitly minimizing, via a structured prediction model, the evaluation measure we have chosen. Section 4 presents experiments in which we test the method we propose on two large batches of binary, high-dimensional, publicly available datasets (the two batches consist of 5148 and 352 datasets, respectively), using all the methods introduced in §2.2 as baselines. Section 5 discusses related work, while Section 6 concludes.

2. PRELIMINARIES

In this paper we will focus on quantification at the binary level. That is, given a domain of documents $D$ and a class $c$, we assume the existence of an unknown target function (or ground truth) $\Phi : D \rightarrow \{-1, +1\}$ that specifies which members of $D$ belong to $c$; as usual, $+1$ and $-1$ represent membership and non-membership in $c$, respectively. The approaches we will focus on are based on aggregative quantification, i.e., they rely on the generation of a classifier $\hat\Phi : D \rightarrow \{-1, +1\}$ via supervised learning from a training set $Tr$. We will indicate with $Te$ the test set on which quantification effectiveness is going to be tested.

We define the prevalence (or relative frequency) $\lambda_{Te}(c)$ of class $c$ in a set of documents $Te$ as the fraction of members of $Te$ that belong to $c$, i.e., as

$$\lambda_{Te}(c) = \frac{|\{d_j \in Te \mid \Phi(d_j) = +1\}|}{|Te|} \qquad (2)$$

Given a set $Te$ of unlabelled documents and a class $c$, quantification is defined as the task of estimating $\lambda_{Te}(c)$, i.e., of computing an estimate $\hat\lambda_{Te}(c)$ such that $\lambda_{Te}(c)$ and $\hat\lambda_{Te}(c)$ are as close as possible.¹
What "as close as possible" exactly means will be formalized by an appropriate evaluation measure (see §2.1).

The reasons why we focus on binary quantification are two-fold:

- Many quantification problems are binary in nature. For instance, estimating the prevalence of positive and negative reviews in a dataset of reviews of a given product is such a task. Another such task is estimating from blog posts the prevalence of support for either of two candidates in the second round of a two-round ("run-off") election.
- A multi-class multi-label problem (also known as an n-of-m problem, i.e., a problem where zero, one, or several among $m$ classes can be attributed to the same document) can be reduced to $m$ independent binary problems of type ($c_j$ vs. $\bar{c}_j$), where $C = \{c_1, \ldots, c_j, \ldots, c_m\}$ is the set of classes and where $\bar{c}_j$ denotes the complement of $c_j$. Binary quantification methods can thus also be applied to solving quantification in multi-class multi-label contexts.

We instead leave the discussion of quantification in single-label multi-class (i.e., 1-of-m) contexts to future work.

2.1. Evaluation measures for quantification

Different measures have been used in the literature for measuring binary quantification accuracy.

¹ Consistently with most mathematical literature we use the caret symbol (ˆ) to indicate estimation.

The simplest such measure is bias (B), defined as $B(\lambda_{Te}, \hat\lambda_{Te}) = \hat\lambda_{Te}(c) - \lambda_{Te}(c)$ and used in [Forman 2005; 2006a; Tang et al. 2010]; positive bias indicates a tendency to overestimate the prevalence of $c$, while negative bias indicates a tendency to underestimate it.

Absolute Error (AE; also used in [Esuli and Sebastiani 2010b], where it is called percentage discrepancy, and in [Barranquero et al. 2013; Bella et al. 2010; Forman 2005; 2006a; González-Castro et al. 2013; Sánchez et al. 2008; Tang et al. 2010]), defined as $AE(\lambda_{Te}, \hat\lambda_{Te}) = |\hat\lambda_{Te}(c) - \lambda_{Te}(c)|$, is an alternative, equally simplistic measure that accounts for the fact that positive and negative bias are (in the absence of specific application-dependent constraints) equally undesirable.

Relative absolute error (RAE), defined as

$$RAE(\lambda_{Te}, \hat\lambda_{Te}) = \frac{|\hat\lambda_{Te}(c) - \lambda_{Te}(c)|}{\lambda_{Te}(c)} \qquad (3)$$

is a refinement of AE meant to account for the fact that the same value of absolute error is a more serious mistake when the true class prevalence is small. For instance, predicting $\hat\lambda_{Te}(c) = 0.10$ when $\lambda_{Te}(c) = 0.01$ and predicting $\hat\lambda_{Te}(c) = 0.50$ when $\lambda_{Te}(c) = 0.41$ are equivalent errors according to B and AE, but the former is intuitively a more serious error than the latter.

The most convincing among the evaluation measures proposed so far is certainly Forman's [2005], who uses normalized cross-entropy, better known as Kullback-Leibler Divergence (KLD; see e.g., [Cover and Thomas 1991]). KLD, defined as

$$KLD(\lambda_{Te}, \hat\lambda_{Te}) = \sum_{c \in C} \lambda_{Te}(c) \log \frac{\lambda_{Te}(c)}{\hat\lambda_{Te}(c)} \qquad (4)$$

and also used in [Esuli and Sebastiani 2010b; Forman 2006a; 2008; Tang et al. 2010], is a measure of the error made in estimating a true distribution $\lambda_{Te}$ over a set $C$ of classes by means of a distribution $\hat\lambda_{Te}$; this means that KLD is in principle suitable for evaluating quantification, since quantifying exactly means predicting how the test items are distributed across the classes.
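The measures defined so far can be compared on the two example predictions given above. The following is a minimal sketch rather than the authors' evaluation code; the natural logarithm is an assumption (the text does not fix the base), and no smoothing is applied, so KLD is returned as infinite whenever a class with non-zero true prevalence is given an estimated prevalence of zero.

```python
import math
from typing import Sequence, Tuple


def bias(true_prev: float, est_prev: float) -> float:
    """Bias: positive when the prevalence is overestimated."""
    return est_prev - true_prev


def absolute_error(true_prev: float, est_prev: float) -> float:
    return abs(est_prev - true_prev)


def relative_absolute_error(true_prev: float, est_prev: float) -> float:
    """RAE; undefined (division by zero) when the true prevalence is 0."""
    return abs(est_prev - true_prev) / true_prev


def kld(true_dist: Sequence[float], est_dist: Sequence[float]) -> float:
    """Unsmoothed KLD between a true and an estimated class distribution."""
    total = 0.0
    for p, q in zip(true_dist, est_dist):
        if p == 0.0:
            continue          # a class with zero true prevalence contributes 0
        if q == 0.0:
            return math.inf   # non-zero true prevalence estimated as zero
        total += p * math.log(p / q)
    return total


def binary(prev: float) -> Tuple[float, float]:
    """Distribution over (c, complement of c) from the prevalence of c."""
    return (prev, 1.0 - prev)


# The two example predictions from the text: (true prevalence, estimate).
case_a = (0.01, 0.10)
case_b = (0.41, 0.50)

for name, (true_prev, est_prev) in (("a", case_a), ("b", case_b)):
    print(
        name,
        f"B={bias(true_prev, est_prev):.3f}",
        f"AE={absolute_error(true_prev, est_prev):.3f}",
        f"RAE={relative_absolute_error(true_prev, est_prev):.3f}",
        f"KLD={kld(binary(true_prev), binary(est_prev)):.4f}",
    )
```

B and AE assign the same score to both cases, whereas RAE and KLD both rate the first case as the more serious error.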
KLD ranges between 0 (perfect coincidence of $\lambda_{Te}$ and $\hat\lambda_{Te}$) and $+\infty$ (total divergence of $\lambda_{Te}$ and $\hat\lambda_{Te}$). In the binary case in which $C = \{c, \bar{c}\}$, KLD becomes

$$KLD(\lambda_{Te}, \hat\lambda_{Te}) = \lambda_{Te}(c) \log \frac{\lambda_{Te}(c)}{\hat\lambda_{Te}(c)} + \lambda_{Te}(\bar{c}) \log \frac{\lambda_{Te}(\bar{c})}{\hat\lambda_{Te}(\bar{c})} \qquad (5)$$

Continuity arguments indicate that we should consider $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = +\infty$ (see [Cover and Thomas 1991, p. 18]). Note that, as from Equation 4, KLD is undefined when the predicted distribution $\hat\lambda_{Te}$ is zero for at least one class (RAE is affected by a similar problem, being undefined when the true prevalence $\lambda_{Te}(c)$ is zero). As a result, we smooth the fractions $\lambda_{Te}(c)/\hat\lambda_{Te}(c)$ and $\lambda_{Te}(\bar{c})/\hat\lambda_{Te}(\bar{c})$ in Equation 4 by adding a small quantity $\epsilon$ to both the numerator and the denominator. The smoothed KLD function is always defined and still returns a value of zero when $\lambda_{Te}$ and $\hat\lambda_{Te}$ coincide.

KLD offers several advantages with respect to RAE (and, a fortiori, to B and AE). One advantage is that, as evident from Equation 5, it is symmetric with respect to the complement of a class, i.e., switching the role of $c$ and $\bar{c}$ does not change the result. This means that, e.g., predicting $\hat\lambda_{Te}(c) = 0.10$ when $\lambda_{Te}(c) = 0.11$ and predicting $\hat\lambda_{Te}(c) = 0.90$ when $\lambda_{Te}(c) = 0.89$, are equivalent errors (which seems intuitive), while RAE considers the former a much more serious error than the latter. This is especially useful in binary quantification tasks in which it is not clear which of the two classes should

play the role of the positive class $c$, as in e.g., Employed vs. Unemployed. A second advantage is that KLD is not defined only on the binary (and multi-label multi-class) case, but is also defined on the single-label multi-class case; this allows evaluating different types of quantification tasks with the same measure. Last but not least, one benefit of using KLD is that it is a very well-known measure, having been the subject of intense study within information theory [Csiszár and Shields 2004] and, although from a more applicative angle, within the language modelling approach to information retrieval [Zhai 2008].

2.2. Existing quantification methods

A number of methods have been proposed in the (still brief) literature on quantification; below we list the main ones, which we will use as baselines in the experiments discussed in Section 4.

Classify and Count (CC). An obvious method for quantification consists of generating a classifier from $Tr$, classifying the documents in $Te$, and estimating $\lambda_{Te}$ by simply counting the fraction of documents in $Te$ that are predicted positive, i.e.,

$$\hat\lambda^{CC}_{Te}(c) = \frac{|\{d_j \in Te \mid \hat\Phi(d_j) = +1\}|}{|Te|} \qquad (6)$$

Forman [2008] calls this the classify and count (CC) method.

Probabilistic Classify and Count (PCC). A variant of the above consists in generating a classifier from $Tr$, classifying the documents in $Te$, and computing $\lambda_{Te}$ as the expected fraction of documents predicted positive, i.e.,

$$\hat\lambda^{PCC}_{Te}(c) = \frac{1}{|Te|} \sum_{d_j \in Te} p(c \mid d_j) \qquad (7)$$

where $p(c \mid d_j)$ is the probability of membership in $c$ of test document $d_j$ returned by the classifier. If the classifier only returns confidence scores that are not probabilities (as is the case, e.g., when AdaBoost.MH is the learner [Schapire and Singer 2000]), the confidence scores must be converted into probabilities, e.g., by applying a logistic function.
The PCC method is dismissed as unsuitable in [Forman 2005; 2008], but is shown to perform better than CC in [Bella et al. 2010] (where it is called "Probability Average") and in [Tang et al. 2010].

Adjusted Classify and Count (ACC). Forman [2005; 2008] uses a further method which he calls "Adjusted Count", and which we will call Adjusted Classify and Count (ACC) so as to make its relation with CC more explicit. The underlying idea is that CC would be optimal, were it not for the fact that the classifier may generate different numbers of false positives and false negatives, and that this difference would lead to imperfect quantification. If we knew the true positive rate ($tpr = \frac{TP}{TP + FN}$, a.k.a. recall) and false positive rate ($fpr = \frac{FP}{FP + TN}$, a.k.a. fallout) that the classifier has obtained on $Te$, it is easy to check that perfect quantification would be obtained by adjusting $\hat\lambda^{CC}_{Te}(c)$ as follows:

$$\hat\lambda^{ACC}_{Te}(c) = \frac{\hat\lambda^{CC}_{Te}(c) - fpr_{Te}(c)}{tpr_{Te}(c) - fpr_{Te}(c)} \qquad (8)$$

Since we cannot know the true values of $tpr_{Te}(c)$ and $fpr_{Te}(c)$, the ACC method consists of estimating them on $Tr$ via k-fold cross-validation and using the resulting estimates in Equation 8. However, one problem with ACC is that it is not guaranteed to return a value in [0,1], due to the fact that the estimates of $tpr_{Te}(c)$ and $fpr_{Te}(c)$ may be imperfect. This

led Forman [2008] to clip the results of the estimation (i.e., equate to 1 every value higher than 1 and to 0 every value lower than 0) in order for the final results to be in [0,1].

Probabilistic Adjusted Classify and Count (PACC). The PACC method (proposed in [Bella et al. 2010], where it is called "Scaled Probability Average") is a probabilistic variant of ACC, i.e., it stands to ACC like PCC stands to CC. Its underlying idea is to replace, in Equation 8, $\hat\lambda^{CC}_{Te}(c)$, $tpr_{Te}(c)$ and $fpr_{Te}(c)$ with their expected values, with probability of membership in $c$ replacing binary predictions. Equation 8 is thus transformed into

$$\hat\lambda^{PACC}_{Te}(c) = \frac{\hat\lambda^{PCC}_{Te}(c) - E[fpr_{Te}(c)]}{E[tpr_{Te}(c)] - E[fpr_{Te}(c)]} \qquad (9)$$

where $E[tpr_{Te}(c)]$ and $E[fpr_{Te}(c)]$ (expected $tpr_{Te}(c)$ and expected $fpr_{Te}(c)$, respectively) are defined as

$$E[tpr_{Te}(c)] = \frac{1}{|Te_c|} \sum_{d_j \in Te_c} p(c \mid d_j) \qquad (10)$$

$$E[fpr_{Te}(c)] = \frac{1}{|Te_{\bar{c}}|} \sum_{d_j \in Te_{\bar{c}}} p(c \mid d_j) \qquad (11)$$

and $Te_c$ (resp., $Te_{\bar{c}}$) indicates the set of documents in $Te$ that belong (resp., do not belong) to class $c$. Again, since we cannot know the true $E[tpr_{Te}(c)]$ and $E[fpr_{Te}(c)]$ (given that we do not know $Te_c$ and $Te_{\bar{c}}$), we estimate them on $Tr$ via k-fold cross-validation and use the resulting estimates in Equation 9.

Threshold@0.50 (T50), Method X (X), and Method Max (MAX). Forman [2008] points out that the ACC method is very sensitive to the decision threshold of the classifier, which may yield unreliable values of $\hat\lambda^{ACC}_{Te}(c)$ (or lead to $\hat\lambda^{ACC}_{Te}(c)$ being undefined when $tpr_{Te} = fpr_{Te}$). In order to reduce this sensitivity, [Forman 2008] recommends heuristically setting the decision threshold in such a way that $tpr_{Tr}$ (as obtained via k-fold cross-validation) is equal to 0.50 before computing Equation 8. This method is dubbed Threshold@0.50 (T50).
Alternative heuristics that [Forman 2008] discusses are to set the decision threshold in such a way that $fpr_{Tr} = 1 - tpr_{Tr}$ (this is dubbed Method X) or such that $(tpr_{Tr} - fpr_{Tr})$ is maximized (this is dubbed Method Max).

Median Sweep (MS). Alternatively, [Forman 2008] recommends computing $\hat\lambda^{ACC}_{Te}(c)$ for every decision threshold that gives rise (in k-fold cross-validation) to different $tpr_{Tr}$ or $fpr_{Tr}$ values, and taking the median of all the resulting estimates of $\hat\lambda^{ACC}_{Te}(c)$. This method is dubbed Median Sweep (MS).

Mixture Model (MM). The MM method (proposed in [Forman 2005]) consists of assuming that the distribution $D^{Te}$ of the scores that the classifier assigns to the test examples is a mixture

$$D^{Te} = \lambda_{Te}(c) \cdot D^{Te}_{c} + (1 - \lambda_{Te}(c)) \cdot D^{Te}_{\bar{c}} \qquad (12)$$

where $D^{Te}_{c}$ and $D^{Te}_{\bar{c}}$ are the distributions of the scores that the classifier assigns to the positive and the negative test examples, respectively, and where $\lambda_{Te}(c)$ and $\lambda_{Te}(\bar{c})$ are the parameters of this mixture. The MM method consists of estimating $D^{Te}_{c}$ and $D^{Te}_{\bar{c}}$ via k-fold cross-validation, and picking as value of $\lambda_{Te}(c)$ the one that generates the best fit between the observed $D^{Te}$ and the mixture. Two variants of this method, called the Kolmogorov-Smirnov Mixture Model (MM(KS)) and the PP-Area Mixture Model (MM(PP)), are actually defined in [Forman 2005], which differ in terms of how the goodness of fit between the left- and the right-hand side of Equation 12 is estimated. See [Forman 2005] for more details.
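To make the relation among CC, PCC, ACC and PACC concrete, Equations 6 to 9 can be sketched as below. This is an illustration and not Forman's or Bella et al.'s code: the function names are hypothetical, the rates passed to `adjusted_count` are assumed to have already been estimated (e.g., by cross-validation on the training set), and applying the clipping step to PACC as well as to ACC is an assumption, exposed through the `clip` parameter.

```python
from typing import Sequence


def classify_and_count(predicted_labels: Sequence[int]) -> float:
    """CC (Equation 6): fraction of documents predicted positive (+1)."""
    return sum(1 for y in predicted_labels if y == 1) / len(predicted_labels)


def probabilistic_classify_and_count(posteriors: Sequence[float]) -> float:
    """PCC (Equation 7): mean posterior probability p(c|d_j) over the test set."""
    return sum(posteriors) / len(posteriors)


def adjusted_count(raw_estimate: float, tpr: float, fpr: float,
                   clip: bool = True) -> float:
    """Adjustment shared by ACC (Equation 8) and PACC (Equation 9).

    For ACC, pass the CC estimate with estimated tpr and fpr; for PACC,
    pass the PCC estimate with estimated E[tpr] and E[fpr].
    """
    if tpr == fpr:
        raise ValueError("the adjusted estimate is undefined when tpr == fpr")
    adjusted = (raw_estimate - fpr) / (tpr - fpr)
    if clip:
        # Clipping to [0, 1], as described for ACC in the text.
        adjusted = min(1.0, max(0.0, adjusted))
    return adjusted


# Toy usage: hard predictions and posteriors for a hypothetical 10-document test set.
labels = [1, -1, -1, 1, -1, -1, -1, -1, -1, -1]
posteriors = [0.9, 0.2, 0.1, 0.8, 0.3, 0.1, 0.2, 0.1, 0.2, 0.1]

cc = classify_and_count(labels)
pcc = probabilistic_classify_and_count(posteriors)
acc = adjusted_count(cc, tpr=0.8, fpr=0.1)
pacc = adjusted_count(pcc, tpr=0.7, fpr=0.2)
print(cc, pcc, acc, pacc)
```

If the rates are exact, the adjustment recovers the true prevalence: with true prevalence 0.3, tpr 0.8 and fpr 0.1, CC returns 0.3·0.8 + 0.7·0.1 = 0.31, and (0.31 − 0.1)/(0.8 − 0.1) = 0.3.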

3. OPTIMIZING QUANTIFICATION ACCURACY

A problem with the methods discussed in §2.2 is that most of them are fairly heuristic in nature. For instance, the fact that methods such as ACC (and all the others based on it, such as T50, MS, X, and MAX) require "clipping" is scarcely reassuring. More generally, methods such as T50 or MS have hardly any theoretical foundation, and choosing them over CC only rests on our knowledge that they have performed better in previously reported experiments.

A further problem is that some of these methods rest on assumptions that seem problematic. For instance, one problem with the MM method is that it seems to implicitly rely on the hypothesis that estimating $D^{Te}_{c}$ and $D^{Te}_{\bar{c}}$ via k-fold cross-validation on $Tr$ can be done reliably. However, since the very motivation of doing quantification is that the training set and the test set may have quite different characteristics, this hypothesis seems adventurous. A similar argument casts some doubt on ACC: how reliable are the estimates of $tpr_{Te}$ and $fpr_{Te}$ that can be generated via k-fold cross-validation on $Tr$, given the different characteristics that training set and test set may have in the application contexts where quantification is required?² In sum, the very same arguments that are used to deem the CC method unsuitable for quantification seem to undermine the previously mentioned attempts at improving on CC.

In this paper we propose a new, theoretically well-founded quantification method that radically differs from the ones discussed in §2.2. Note that all of the methods discussed in §2.2 employ general-purpose supervised learning methods, i.e., address quantification by post-processing the results returned by a standard classifier (where the decision threshold has possibly been tuned according to some heuristics).
In particular, all the supervised learning methods adopted in the literature on quantification optimize Hamming distance or variants thereof, and not a quantification-specific evaluation function. When the dataset is imbalanced (typically: when the positives are by far outnumbered by the negatives), as is frequently the case in text classification, this is suboptimal, since a supervised learning method that minimizes Hamming distance will generate classifiers with a tendency to make negative predictions. This means that $FN$ will be much higher than $FP$, to the detriment of quantification accuracy.³

We take a sharply different approach, based upon the use of classifiers explicitly optimized for the evaluation function that we will use for assessing quantification accuracy. Given such a classifier, we will simply use a "classify and count" approach, with no heuristic threshold tuning (à la T50 / X / MAX) and no a posteriori adjustment (à la ACC).

The idea of using learning algorithms capable of directly optimizing the measure (a.k.a. "loss") used for evaluating effectiveness is well-established in supervised learning. However, in our case following this route is non-trivial, because the evaluation measure that we want to use (KLD) is non-linear, i.e., is such that the error on the test set may not be formulated as a linear combination of the error incurred by each test example. An evaluation measure for quantification is inherently non-linear, because how the error on an individual test item impacts on the overall quantification error depends on how the other test items have been classified. For instance, if in the other test items there are more false positives than false negatives, an additional false negative is actually beneficial to overall quantification error, because of the mutual compensation effect between $FP$ and $FN$ mentioned in Section 1.
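The compensation effect just described can be reproduced with a few lines of code. The sketch below is only illustrative: the contingency tables are invented, the natural logarithm is assumed, and the smoothing constant `eps` is an arbitrary small value; the smoothing follows the description in §2.1 (adding the constant to the numerator and the denominator of each fraction).

```python
import math


def smoothed_binary_kld(true_prev: float, est_prev: float, eps: float = 1e-6) -> float:
    """Binary KLD with eps added to numerator and denominator of each fraction."""
    total = 0.0
    for p, q in ((true_prev, est_prev), (1.0 - true_prev, 1.0 - est_prev)):
        total += p * math.log((p + eps) / (q + eps))
    return total


def quantification_error(tp: int, fp: int, fn: int, tn: int) -> float:
    """Smoothed KLD between the true and the classify-and-count prevalence."""
    n = tp + fp + fn + tn
    return smoothed_binary_kld((tp + fn) / n, (tp + fp) / n)


def misclassified(tp: int, fp: int, fn: int, tn: int) -> int:
    """Number of wrongly classified items (the linear, per-item error)."""
    return fp + fn


# Hypothetical test set of 100 items in which FP > FN ...
more_fp = dict(tp=20, fp=10, fn=5, tn=65)
# ... and the same test set after one more positive item is misclassified.
more_fp_extra_fn = dict(tp=19, fp=10, fn=6, tn=65)

# The mirror-image situation, in which FN > FP to begin with.
more_fn = dict(tp=20, fp=5, fn=10, tn=65)
more_fn_extra_fn = dict(tp=19, fp=5, fn=11, tn=65)

for before, after in ((more_fp, more_fp_extra_fn), (more_fn, more_fn_extra_fn)):
    print(
        f"errors {misclassified(**before)} -> {misclassified(**after)}, "
        f"KLD {quantification_error(**before):.5f} -> {quantification_error(**after):.5f}"
    )
```

In both pairs the number of misclassified items grows by one, yet the quantification error decreases in the first pair and increases in the second, which is why it cannot be written as a sum of per-item errors.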
² In Appendix A we thoroughly analyse (also by means of concrete experiments) the issue of how (un)reliable the k-fold cross-validation estimates of $tpr_{Te}$ and $fpr_{Te}$ are in practice.

³ To witness, in the experiments we report in Section 4 our 5148 test sets exhibit, when classified by the classifiers generated by the linear SVM used for implementing the CC method, an average $FP/FN$ ratio of 0.109; by contrast, for an optimal quantifier this ratio is always 1.

As a result, a measure of

quantification accuracy is inherently non-linear, and should thus be multivariate, i.e., take in consideration all test items at once.

As discussed in [Joachims 2005], the assumption that the error on the test set may be formulated as a linear combination of the error incurred by each test example (as indeed happens for many common error measures, e.g., Hamming distance) underlies most existing discriminative learners, which are thus suboptimal for tackling quantification. In order to sidestep this problem, we adopt the SVM for Multivariate Performance Measures (SVM^perf) learning algorithm proposed by Joachims [2005].⁴ SVM^perf is a learning algorithm of the Support Vector Machine family that can generate classifiers optimized for any non-linear, multivariate loss function that can be computed from a contingency table (as KLD is).

SVM^perf is a specialization to the problem of binary classification of the structural SVM (SVM^struct) learning algorithm [Joachims et al. 2009a; Joachims et al. 2009b; Tsochantaridis et al. 2004] for "structured prediction", i.e., an algorithm designed for predicting multivariate, structured objects (e.g., trees, sequences, sets). SVM^perf is fundamentally different from conventional algorithms for learning classifiers: while these latter learn univariate classifiers (i.e., functions of type $\hat\Phi : D \rightarrow \{-1, +1\}$ that classify individual instances one at a time), SVM^perf learns multivariate classifiers (i.e., functions of type $\hat\Phi : D^{|S|} \rightarrow \{-1, +1\}^{|S|}$ that classify entire sets $S$ of instances in one shot). By doing so, SVM^perf can optimize properties of entire sets of instances, properties (such as KLD) that cannot be expressed as linear functions of the properties of the individual instances.

As discussed in [Joachims et al. 2009b], SVM^struct can be adapted to a specific task by defining four components:

(1) A joint feature map $\Psi(x, y)$.
This function computes a vector of features (describing the match between the input vectors in $x$ and the corresponding outputs, true or predicted, in $y$) from all the input-output pairs at the same time. In this way the number of features, and thus the number of parameters of the model, can be kept constant regardless of the size of the sample set. The $\Psi$ function makes it possible to generalise not only on inputs ($x$) but also on outputs ($y$), and thus to produce predictions not seen in the training data. In SVM^perf $\Psi$ is defined⁵ as

$$\Psi(x, y) = \frac{1}{n} \sum_{i=1}^{n} y_i x_i \qquad (13)$$

⁴ In [Joachims 2005] SVM^perf is actually called SVM^multi, but the author has released its implementation under the name SVM^perf. We will use this latter name because it uniquely identifies the algorithm on the Web, while searching for SVM^multi often returns the SVM^multiclass package, which addresses a different problem.

⁵ For this formulation of $\Psi$, and when error rate is the chosen loss function, Joachims [2005] shows that SVM^perf coincides with the traditional univariate SVM model (called SVM^org in [Joachims 2005]).

(2) A loss function $\Delta(y, \hat{y})$. SVM^perf works with loss functions $\Delta(TP, FP, FN, TN)$ in which the four values are those from the contingency table resulting from comparing the true labels $y$ with the predicted labels $\hat{y}$. In our work we take the loss

function to be KLD, i.e.,⁶

$$\Delta_{KLD}(TP, FP, FN, TN) = KLD(\lambda, \hat\lambda) \qquad (14)$$

where $\lambda(c) = \frac{TP + FN}{TP + FP + FN + TN}$ and $\hat\lambda_{Te}(c) = \frac{TP + FP}{TP + FP + FN + TN}$.

(3) An algorithm for the efficient computation of a hypothesis

$$\hat\Phi(x) = \arg\max_{\hat{y} \in Y} \{ w \cdot \Psi(x, \hat{y}) \} \qquad (15)$$

where $w$ is a vector of parameters. In SVM^perf this simply corresponds to computing

$$\hat\Phi(x) = (\mathrm{sign}(w \cdot x_1), \ldots, \mathrm{sign}(w \cdot x_n)) \qquad (16)$$

(4) An algorithm for the efficient computation of the loss-augmented hypothesis

$$\hat\Phi_{\Delta}(x) = \arg\max_{\hat{y} \in Y} \{ \Delta(y, \hat{y}) + w \cdot \Psi(x, \hat{y}) \} \qquad (17)$$

which in SVM^perf is computed via an algorithm [Joachims 2005, Algorithm 2] with $O(n^2)$ worst-case complexity.

We have used the implementation of SVM^perf made available by Joachims,⁷ which we have extended by implementing the module that takes care of the KLD loss function. In the rest of the paper our method will be dubbed SVM(KLD).

4. EXPERIMENTS

We now present the results of experiments aimed at assessing whether the approach we have proposed in Section 3 delivers better quantification accuracy than state-of-the-art quantification methods. In order to do this, we have run all our experiments by using as baselines for our SVM(KLD) method all the methods described in §2.2. For the CC, ACC, T50, X, MAX, MS, MM(KS), MM(PP) methods we have used the original implementation that we have obtained from the author (this guarantees that the baselines perform at their full potential). We have instead implemented PCC and PACC ourselves. At the heart of the implementation of all the baselines is a standard linear SVM with the parameters set at their default values; where quantities (such as e.g., $fpr_{Te}$ and $tpr_{Te}$; see Equation 8) had to be estimated from the training set, we have used 50-fold cross-validation, as done and recommended in [Forman 2008].
In order to guarantee a fair comparison with the baselines we have used the default values for the parameters also for SVM^perf, which lies at the basis of our SVM(KLD) method.⁸

⁶ In Equation 14 KLD is written as a function of $TP$, $FP$, $FN$, $TN$ for the simple fact that in [Joachims 2005] (where SVM^perf was originally described) the loss function is specified as a function of the four cells of the contingency table. However, it should be clear that $\lambda(c)$ does not depend on the predicted labels: even if in Equation 14 we have written it out as $\lambda(c) = \frac{TP + FN}{TP + FP + FN + TN}$, this latter is equivalent to writing $\lambda(c) = \frac{GP}{GP + GN}$, where $GP$ (the "gold positives") is $TP + FN$ and $GN$ (the "gold negatives") is $FP + TN$. Seen under this light, there is no trace of predicted labels in $\lambda(c) = \frac{GP}{GP + GN}$, and $\lambda(c)$ is just a function of the gold standard and not of the prediction. Analogously, it should be clear that $\hat\lambda(c)$ is just a function of the prediction and not of the gold standard.

⁷ SVM^perf is available from perf.html. Our module that extends it to deal with KLD is available at

⁸ An additional reason why we have left the parameters at their default values is that, in a context in which the characteristics of $Tr$ and $Te$ may substantially differ, it is not clear that the parameter values which are found optimal on $Tr$ via k-fold cross-validation will also prove optimal (or at least will perform reasonably) on $Te$.
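As an illustration of the prediction step described in Section 3, the sketch below checks on a toy example that the labelling maximizing $w \cdot \Psi(x, \hat{y})$ (Equation 15, with $\Psi$ as in Equation 13) coincides with the per-item sign rule of Equation 16, and then applies classify and count to it. It is not the SVM^perf implementation: the weight vector and the four two-dimensional inputs are invented, the exhaustive search is feasible only for tiny sets, and mapping a score of exactly zero to $-1$ is an arbitrary tie-breaking assumption.

```python
from itertools import product
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def joint_feature_map(xs: Sequence[Sequence[float]], ys: Sequence[int]) -> List[float]:
    """Psi(x, y) = (1/n) * sum_i y_i * x_i  (Equation 13)."""
    n, dim = len(xs), len(xs[0])
    return [sum(y * x[k] for x, y in zip(xs, ys)) / n for k in range(dim)]


def predict_by_sign(w: Sequence[float], xs: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """Equation 16: label each item by the sign of w . x_i (ties go to -1 here)."""
    return tuple(1 if dot(w, x) > 0 else -1 for x in xs)


def predict_by_exhaustive_search(w: Sequence[float],
                                 xs: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """Equation 15 by brute force: try all 2^n labellings of the whole set."""
    return max(
        product((-1, 1), repeat=len(xs)),
        key=lambda ys: dot(w, joint_feature_map(xs, ys)),
    )


def classify_and_count(labels: Sequence[int]) -> float:
    """Fraction of items labelled +1."""
    return sum(1 for y in labels if y == 1) / len(labels)


# Invented toy model and inputs (w . x_i is 1.0, -1.0, -3.5 and 1.0 respectively).
w = [1.0, -2.0]
xs = [[3.0, 1.0], [1.0, 1.0], [0.5, 2.0], [2.0, 0.5]]

labels = predict_by_sign(w, xs)
print(labels, classify_and_count(labels))
```

The two prediction routines agree because $w \cdot \Psi(x, \hat{y})$ equals the average of $\hat{y}_i (w \cdot x_i)$, which is maximized by choosing each $\hat{y}_i$ with the sign of its own score; it is the loss-augmented problem of Equation 17 that does not decompose in this way.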

In order to generate the vectorial representations for our documents, the classic bag-of-words approach has been adopted. In particular, punctuation has been removed, all letters have been converted to lowercase, numbers have been removed, stop words have been removed using the stop list provided in [Lewis 1992, pages ], and stemming has been performed by means of the version of Porter's stemmer available from . All the remaining stemmed words ("terms") that occur at least once in $Tr$ have thus been used as features of our vectorial representations of documents; no feature selection has been performed. Feature weights have been obtained via the "ltc" variant [Salton and Buckley 1988] of the well-known tf-idf class of weighting functions, i.e.,

$$tfidf(t_k, d_i) = tf(t_k, d_i) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)} \qquad (18)$$

where $d_i$ is a document, $\#_{Tr}(t_k)$ denotes the number of documents in $Tr$ in which feature $t_k$ occurs at least once, and

$$tf(t_k, d_i) = \begin{cases} 1 + \log \#(t_k, d_i) & \text{if } \#(t_k, d_i) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (19)$$

where $\#(t_k, d_i)$ denotes the number of times $t_k$ occurs in $d_i$. Weights obtained by Equation 18 are normalized through cosine normalization, i.e.,

$$w_{ki} = \frac{tfidf(t_k, d_i)}{\sqrt{\sum_{s=1}^{|T|} tfidf(t_s, d_i)^2}} \qquad (20)$$

where $|T|$ denotes the total number of features.

Following [Forman 2008], we set the $\epsilon$ constant for smoothing KLD to the value $\epsilon = \frac{1}{2|Te|}$.

4.1. Datasets

The datasets we use for our experiments have been extracted from two important text classification test collections, Reuters Corpus Volume 1 version 2 (RCV1-V2) and OHSUMED-S.

RCV1-V2 is a standard, publicly available benchmark for text classification consisting of 804,414 news stories produced by Reuters from 20 Aug 1996 to 19 Aug 1997. RCV1-V2 ranks as one of the largest corpora currently used in text classification research and, as pointed out in [Forman 2006b], suffers from extensive drift, i.e., from substantial variability between the training set and the test set, which makes it a challenging dataset for quantification.
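The "ltc" weighting of Equations 18 to 20 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing module: the function name `ltc_vectors` is hypothetical, documents are assumed to be already tokenized, stopped, and stemmed, and natural logarithms are assumed since the base is not stated in the text.

```python
import math
from collections import Counter


def ltc_vectors(train_docs, docs):
    """Weight `docs` with ltc tf-idf, using statistics from `train_docs`.

    Both arguments are lists of token lists. Only terms occurring at least
    once in the training set are used as features. Returns one sparse
    vector (dict: term -> weight) per document.
    """
    n_train = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc))          # document frequency in Tr

    vectors = []
    for doc in docs:
        counts = Counter(t for t in doc if t in doc_freq)
        # Equations 18 and 19: (1 + log #(t, d)) * log(|Tr| / #Tr(t))
        raw = {
            t: (1.0 + math.log(c)) * math.log(n_train / doc_freq[t])
            for t, c in counts.items()
        }
        # Equation 20: cosine normalization
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append(
            {t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()}
        )
    return vectors
```

Note that, under these definitions, a term occurring in every training document receives weight zero, because its idf factor is $\log 1 = 0$.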
In our experiments we have used the 12,807 news stories of the 1st week (20 to 26 Aug 1996) for training, and the 791,607 news stories of the other 52 weeks for testing$^{10}$. We have further partitioned these latter into 52 test sets each consisting of one week's worth of data$^{11}$. RCV1-V2 is multi-label, i.e., a document may belong to several classes at the same time. Of the 103 classes of which its "Topic" hierarchy consists, in our experiments we have restricted our attention to the 99 classes with at least one positive training example. Consistently with the evaluation presented in [Lewis et al. 2004], classes placed at internal nodes in the hierarchically organized classification scheme are also considered in the evaluation; as positive examples of these classes we use the union of the positive examples of their subordinate nodes, plus their own positive examples.

$^{10}$ This is the standard "LYRL2004" split between training and test data, originally defined in [Lewis et al. 2004].

$^{11}$ More precisely, since the period covered by RCV1-V2 consists of 365 days, i.e., 52 full weeks + 1 day, the 52nd test set consists of 1 day's worth of data only.

The OHSUMED-S dataset [Esuli and Sebastiani 2013] is a subset of the well-known OHSUMED test collection [Hersh et al. 1994]. OHSUMED-S consists of a set of 15,643 MEDLINE records spanning the years from 1987 to 1991, where each record is classified under one or more of the 97 MeSH index terms that belong to the Heart Disease (HD) subtree of the well-known MeSH tree of index terms$^{12}$. Each entry consists of summary information relative to a paper published in one of 270 medical journals; the available fields are title, abstract, author, source, publication type, and MeSH index terms. As the training set we have used, consistently with [Esuli and Sebastiani 2013], the 2,510 documents belonging to year 1987; 9 MeSH index terms out of the 97 in the HD subtree are never assigned to any training document, so the number of classes we actually use is 88. We partition the four remaining years' worth of data into four bins (1988, 1989, 1990, 1991), each containing the documents generated within the corresponding calendar year. The reason why we do not use the entire OHSUMED dataset is that roughly 93% of OHSUMED entries have no class assigned from the HD subtree, which means that the classes in the HD subtree have very low prevalence ($\lambda_{Tr}$ = on average); we thus prefer to use OHSUMED-S, which presents a wider range of prevalence values.

This experimental setting thus generates 52 × 99 = 5148 binary quantification test sets for RCV1-V2 (containing an average of 15,223 documents each), and 4 × 88 = 352 test sets for OHSUMED-S (containing an average of 3,283 documents each).
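The test-set counts quoted above follow from simple arithmetic, which the following Python sketch reproduces from the figures given in the text. The function name `count_test_sets` and the dictionary layout are illustrative choices only.

```python
def count_test_sets():
    """Recompute test-set counts and average sizes from the quoted figures."""
    datasets = {
        # classes actually used, test sets per class, total test documents
        "RCV1-V2": {"classes": 99, "sets_per_class": 52, "test_docs": 791_607},
        "OHSUMED-S": {"classes": 88, "sets_per_class": 4, "test_docs": 13_133},
    }
    summary = {}
    for name, d in datasets.items():
        summary[name] = {
            # one binary quantification task per (class, test set) pair
            "binary_test_sets": d["classes"] * d["sets_per_class"],
            # every class shares the same documents, so the average size
            # of a test set is total test docs / number of time slices
            "avg_docs_per_set": round(d["test_docs"] / d["sets_per_class"]),
        }
    summary["total_binary_test_sets"] = sum(
        s["binary_test_sets"] for s in summary.values()
    )
    return summary
```

Under these figures the two collections together yield 5148 + 352 = 5500 binary test sets.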
This large number of test sets will give us an opportunity to study quantification across different dimensions (e.g., across classes characterized by different prevalence, across classes characterized by different amounts of drift, across time)$^{13}$. More detailed figures about our datasets are given in Table I. Note that both RCV1-V2 and OHSUMED-S classes are characterized by severe imbalance, as can be noticed from the two "Avg prevalence of the positive class" rows of Table I, where both values are very far away from the value of 0.5, which represents perfect balance. On a side note, it is well known that the bag-of-words extraction process outlined a few paragraphs above gives rise to very high-dimensional (albeit sparse) vectors; our case is no exception, and the dimensionality of our vectors is 53,204 (RCV1-V2) and 11,286 (OHSUMED-S), respectively.

Note that the experimental protocol we adopt is different from the one adopted by Forman. In [Forman 2005; 2006a; 2008] he proposes a protocol in which, given a training set $Tr$ and a test set $Te$, several controlled experiments are run by artificially altering class prevalences (i.e., by randomly removing predefined percentages of the positives or of the negatives) either on $Tr$ or on $Te$. This protocol is meant to test the robustness of the methods with respect to different distribution drifts (i.e., differences between $\lambda_{Tr}(c)$ and $\lambda_{Te}(c)$ of different magnitude) and different class prevalence values. We prefer to opt for a different protocol, one in which natural training and test sets are used, without artificial alterations. The reason is that artificial alterations may generate class prevalence values and/or distribution drifts that are simply not realistic (e.g., a situation in which $\lambda_{Tr}(c) = .40$ and $\lambda_{Te}(c) = .01$); conversely, focusing on naturally occurring datasets forces us to come to terms with realistic levels of class prevalence and/or distribution drift.
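For contrast, the artificial-alteration protocol described above (randomly removing a predefined percentage of the positives or of the negatives) might look roughly like the following Python sketch. The function names, the fixed seed, and the rounding of the number of removed items are all assumptions made for illustration; this is not Forman's code.

```python
import random


def remove_fraction(labels, target_label, fraction, seed=0):
    """Randomly drop `fraction` of the items labelled `target_label`.

    Returns the indices of the items that are kept, in their original order.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)
    candidates = [i for i, y in enumerate(labels) if y == target_label]
    n_remove = round(fraction * len(candidates))
    removed = set(rng.sample(candidates, n_remove))
    return [i for i in range(len(labels)) if i not in removed]


def prevalence(labels, indices, positive=1):
    """Relative frequency of the positive class among the kept items."""
    if not indices:
        raise ValueError("no items left")
    return sum(1 for i in indices if labels[i] == positive) / len(indices)
```

For example, starting from 40 positives and 60 negatives (prevalence .40), removing half of the positives leaves 20 positives out of 80 items, i.e., a prevalence of .25; removing all but a handful yields the kind of extreme drift that the text argues is unrealistic.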
We will thus adopt the latter protocol in all the experiments discussed in this paper, compensating for the absence of artificial alterations by also studying (see Section 4.2.3) the behaviour of our methods separately on test sets characterized by different (and natural) levels of distribution drift.

$^{13}$ In order to guarantee perfect reproducibility of our results, we make available at quantification/ the feature vectors of the RCV1-V2 and OHSUMED-S documents as extracted from our preprocessing module, and already split respectively into the 53 and 5 sets described above.

Table I. Main characteristics of the datasets used in our experiments.

                                              RCV1-V2     OHSUMED-S
ALL
  Total # of docs                             804,414     15,643
  # of classes (i.e., binary tasks)
  Time unit used for split                    week        year
TRAINING
  # of docs                                   12,807      2,510
  # of features                               53,204      11,286
  Min # of positive docs per class            2           1
  Max # of positive docs per class            5,
  Avg # of positive docs per class
  Min prevalence of the positive class
  Max prevalence of the positive class
  Avg prevalence of the positive class
TEST
  # of docs                                   791,607     13,133
  # of test sets per class                    52          4
  Avg # of test docs per set                  15,212      3,283
  Min # of positive docs per class            0           0
  Max # of positive docs per class            9,775       1,250
  Avg # of positive docs per class
  Min prevalence of the positive class
  Max prevalence of the positive class
  Avg prevalence of the positive class

4.2. Testing quantification accuracy

We have run our experiments by learning quantifiers for each class $c$ on the respective training set and testing the quantifiers separately on each of the test sets, using KLD as the evaluation measure. We have done this for all the 99 classes × 52 weeks in RCV1-V2 and for all the 88 classes × 4 years in OHSUMED-S, and for all the 10 baseline methods discussed in Section 2.2 plus our SVM(KLD) method.

4.2.1. Analysing the results along the class dimension. We first discuss the results according to the class dimension, i.e., by averaging the results for each RCV1-V2 class across the 52 test weeks and for each OHSUMED-S class across the 4 years$^{14}$. Since this would leave no fewer than 99 RCV1-V2 classes to discuss, we further average the results across all the RCV1-V2 classes characterized by a training class prevalence $\lambda_{Tr}(c)$ that falls into a certain interval (same for OHSUMED-S and its 88 classes).
This allows us to separately check the behaviour of our quantification methods on groups of classes that are homogeneous by level of imbalance. We have also run statistical significance tests in order to check whether the improvement obtained by the best performing method over the 2nd best performer on the group is statistically significant$^{15}$. The results are reported in Table II, where four levels of imbalance have been singled out: very low prevalence (VLP, which accounts for all the classes $c$ such that $\lambda_{Tr}(c) < 0.01$; there are 48 RCV1-V2 and 51 OHSUMED-S such classes), low prevalence (LP, $0.01 \leq \lambda_{Tr}(c) < 0.05$; 34 RCV1-V2 and 28 OHSUMED-S classes), high prevalence (HP, $0.05 \leq \lambda_{Tr}(c) < 0.10$; 10 RCV1-V2 and 4 OHSUMED-S classes), and very high prevalence (VHP, $0.10 \leq \lambda_{Tr}(c)$; 7 RCV1-V2 and 5 OHSUMED-S classes).

Table II. Accuracy of SVM(KLD) and of 10 baseline methods as measured in terms of KLD on the 99 classes of RCV1-V2 (top) and on the 88 classes of OHSUMED-S (bottom) grouped by class prevalence in $Tr$ (Columns 2 to 5); lower values are better; Column 6 indicates average accuracy across all the classes. The best result in each column is indicated with boldface only when there is a statistically significant difference with respect to each of the other tested methods (p < 0.001, two-tailed paired t-test on KLD value across the test sets in the group). The methods are ranked in terms of the value indicated in the "All" column.

[Table II body. Columns: Method, VLP, LP, HP, VHP, All. Row order for RCV1-V2: SVM(KLD), PACC, ACC, MAX, CC, X, PCC, MM(PP), MS, T50, MM(KS). Row order for OHSUMED-S: SVM(KLD), PACC, ACC, CC, X, MM(PP), MM(KS), MS, T50, MAX, PCC. The numeric cells are not recoverable from this transcription.]

The first observation that can be made by looking at the RCV1-V2 results in this table is that, when evaluated across all our 5148 test sets (Column 6), SVM(KLD) outperforms all the other baseline methods in a statistically significant way, scoring a KLD value of 1.32E-03 against the 1.74E-03 value (a -24.2% error reduction) obtained by the best-performing baseline (the PACC method).

$^{14}$ Wherever in this paper we speak of averaging accuracy results across different test sets, what we mean is actually macroaveraging, i.e., taking the accuracy results on the individual test sets and computing their arithmetic mean. This is sharply different from microaveraging, i.e., merging the test sets and computing a single accuracy figure on the merged set. Quite obviously, in a quantification setting microaveraging does not make any sense at all, since false positives from one set and false negatives from another set would compensate each other, thus generating misleadingly high accuracy values.

$^{15}$ All the statistical significance tests discussed in this paper are based on a two-tailed paired t-test and the use of a significance level.
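The point made in Footnote 14 about macroaveraging versus microaveraging can be illustrated with a toy Python example. As a simplifying assumption, the sketch measures quantification error as the absolute difference between predicted and true prevalence rather than as KLD; the function names and the two contingency tables are invented for illustration.

```python
def prevalence_error(tp, fp, fn, tn):
    """Absolute error between predicted and true prevalence on one test set."""
    n = tp + fp + fn + tn
    return abs((tp + fp) - (tp + fn)) / n


def macro_average(tables):
    """Mean of the per-test-set errors (the averaging used in the paper)."""
    return sum(prevalence_error(*t) for t in tables) / len(tables)


def micro_average(tables):
    """Error computed once on the merged contingency table."""
    merged = [sum(cells) for cells in zip(*tables)]
    return prevalence_error(*merged)


# Two invented test sets, each as (TP, FP, FN, TN):
# the first overestimates prevalence, the second underestimates it.
week_a = (10, 20, 0, 70)
week_b = (10, 0, 20, 70)
```

Here `macro_average([week_a, week_b])` reports an error of 0.2, while `micro_average([week_a, week_b])` reports 0.0, because the 20 false positives of the first set cancel the 20 false negatives of the second once the sets are merged.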
This is largely a result of a much better balance between false positives and false negatives obtained by the base classifiers: while (as already observed in Footnote 3) the average $FP/FN$ ratio across the 5148 test sets is for CC, it is for SVM(KLD) (and it is 1 for the perfect quantifier). The OHSUMED-S results essentially confirm the insights obtained from the RCV1-V2 results, with SVM(KLD) again the best of the 11 methods and PACC again the 2nd best; the difference between them is now even larger, with an error reduction of -56.8%.

A second observation that the RCV1-V2 results allow us to make is that SVM(KLD) scores well on all the four groups of classes identified; while it is not always the best method (e.g., it is outperformed by other methods in the HP and VHP groups), it consistently performs well on all four groups. In particular, SVM(KLD) seems to excel at classes characterized by drastic imbalance, as witnessed by the VLP group, where SVM(KLD) is the best performer, and by the LP group, where SVM(KLD) is the best performer in a statistically significant way. In fact, this group of classes seems largely responsible for the excellent overall performance (Column 6) displayed by SVM(KLD), since on the LP group the margin between it and the other methods is large (4.92E-04 against 1.98E-03 of the 2nd best method), and since the VLP and LP groups together account for no fewer than 82 out of the total 99 classes. This latter fact characterizes most naturally occurring datasets, whose class distribution usually exhibits a power law, with very few highly frequent classes and very many highly infrequent classes. The OHSUMED-S results essentially confirm the RCV1-V2 results, with SVM(KLD) again the best performer on the VLP and LP classes, and still performing well, although not being the best, on HP and VHP.

The stability of SVM(KLD) is also confirmed by Table III, which reports, for the same groups of test sets identified by Table II, the variance in KLD across the members of the group. For example, on RCV1-V2, Column 3 reports the variance in KLD across all the 34 classes such that $0.01 \leq \lambda_{Tr}(c) < 0.05$ and across the 52 test weeks, for a total of 34 × 52 = 1,768 test sets. What we can observe from this table is that, when averaged across all the 99 × 52 = 5148 test sets (Column 6), the variance of SVM(KLD) is lower than the variance of all other methods in a statistically significant way. The variance of SVM(KLD) is fairly low in all the four subsets of classes, and particularly so in the subsets of the most imbalanced classes (VLP and LP), in which SVM(KLD) is the best performer in a statistically significant way. The OHSUMED-S results essentially confirm the RCV1-V2 results, with SVM(KLD) the best performer on VLP and LP, and still behaving well on HP and VHP.

Concerning the baselines, our results seem to disconfirm the ones reported in [Forman 2008] according to which the MS method is the best of the lot, and according to which the ACC method can estimate the class distribution well in many situations, but its performance degrades severely when the training class distribution is highly imbalanced.
In our experiments, instead, MS is substantially outperformed by several baseline methods; ACC is instead a strong contender, and (contrary to the statement above) especially shines on the subsets of the most imbalanced classes. The results of both Tables II and III clearly show that, on both RCV1-V2 and OHSUMED-S, PACC is the best of the baseline methods presented in Section 2.2.

4.2.2. Analysing the results along the temporal dimension. We now analyse the results along the temporal dimension. In order to do this for the RCV1-V2 dataset (resp., OHSUMED-S dataset), for each of the 52 test weeks (resp., 4 test years) we average the 99 (resp., 88) accuracy results corresponding to the individual classes, and check the temporal accuracy trend resulting from these averages. This trend is displayed in Figure 1, where the results of SVM(KLD) are plotted together with the results of the three best-performing baseline methods. The plots unequivocally show that SVM(KLD) is the best method across the entire temporal spectrum for both RCV1-V2 and OHSUMED-S. Note that quantification accuracy remains fairly stable across time, i.e., we are not witnessing any substantial decrease in quantification accuracy with time. Intuition might instead suggest that quantification accuracy should decrease with time, due to the combined effects of true concept drift and distribution drift. This may indicate that (at least in the context of broadcast news that RCV1-V2 represents, and in the context of medical scientific articles that OHSUMED-S represents) the chosen timeframe (one year for RCV1-V2, four years for OHSUMED-S) is not long enough to observe significant drift of this kind.

4.2.3. Analysing the results along the distribution drift dimension. The last angle according to which we analyse the results is distribution drift. That is, we conduct our analysis in terms of how much the prevalence $\lambda_{Te_i}(c)$ in a given test set $Te_i$ drifts away from


More information

Dimensionality reduction

Dimensionality reduction Dimensionality reduction ML for NLP Lecturer: Kevin Koidl Assist. Lecturer Alfredo Maldonado https://www.cs.tcd.ie/kevin.koidl/cs4062/ kevin.koidl@scss.tcd.ie, maldonaa@tcd.ie 2017 Recapitulating: Evaluating

More information

Machine Learning for natural language processing

Machine Learning for natural language processing Machine Learning for natural language processing Classification: Naive Bayes Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 20 Introduction Classification = supervised method for

More information

Discretizing Continuous Attributes in AdaBoost for Text Categorization

Discretizing Continuous Attributes in AdaBoost for Text Categorization Discretizing Continuous Attributes in AdaBoost for Text Categorization Pio Nardiello 1, Fabrizio Sebastiani 2, and Alessandro Sperduti 3 1 MercurioWeb SNC, Via Appia 85054 Muro Lucano (PZ), Italy pionardiello@mercurioweb.net

More information

Prediction of Citations for Academic Papers

Prediction of Citations for Academic Papers 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification Robin Senge, Juan José del Coz and Eyke Hüllermeier Draft version of a paper to appear in: L. Schmidt-Thieme and

More information

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Randomized Decision Trees

Randomized Decision Trees Randomized Decision Trees compiled by Alvin Wan from Professor Jitendra Malik s lecture Discrete Variables First, let us consider some terminology. We have primarily been dealing with real-valued data,

More information

Query Performance Prediction: Evaluation Contrasted with Effectiveness

Query Performance Prediction: Evaluation Contrasted with Effectiveness Query Performance Prediction: Evaluation Contrasted with Effectiveness Claudia Hauff 1, Leif Azzopardi 2, Djoerd Hiemstra 1, and Franciska de Jong 1 1 University of Twente, Enschede, the Netherlands {c.hauff,

More information

Using HDDT to avoid instances propagation in unbalanced and evolving data streams

Using HDDT to avoid instances propagation in unbalanced and evolving data streams Using HDDT to avoid instances propagation in unbalanced and evolving data streams IJCNN 2014 Andrea Dal Pozzolo, Reid Johnson, Olivier Caelen, Serge Waterschoot, Nitesh V Chawla and Gianluca Bontempi 07/07/2014

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function

More information

Latent Dirichlet Allocation Based Multi-Document Summarization

Latent Dirichlet Allocation Based Multi-Document Summarization Latent Dirichlet Allocation Based Multi-Document Summarization Rachit Arora Department of Computer Science and Engineering Indian Institute of Technology Madras Chennai - 600 036, India. rachitar@cse.iitm.ernet.in

More information

An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis Gabriele Schweikert Max Planck Institutes Spemannstr. 35-39, 72070 Tübingen, Germany Gabriele.Schweikert@tue.mpg.de Bernhard

More information

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang CS 231A Section 1: Linear Algebra & Probability Review Kevin Tang Kevin Tang Section 1-1 9/30/2011 Topics Support Vector Machines Boosting Viola Jones face detector Linear Algebra Review Notation Operations

More information

Boosting: Foundations and Algorithms. Rob Schapire

Boosting: Foundations and Algorithms. Rob Schapire Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and non-spam: From: yoav@ucsd.edu Rob, can you

More information

Large Scale Semi-supervised Linear SVM with Stochastic Gradient Descent

Large Scale Semi-supervised Linear SVM with Stochastic Gradient Descent Journal of Computational Information Systems 9: 15 (2013) 6251 6258 Available at http://www.jofcis.com Large Scale Semi-supervised Linear SVM with Stochastic Gradient Descent Xin ZHOU, Conghui ZHU, Sheng

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

Notes on Discriminant Functions and Optimal Classification

Notes on Discriminant Functions and Optimal Classification Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem

More information

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4]

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4] 1 DECISION TREE LEARNING [read Chapter 3] [recommended exercises 3.1, 3.4] Decision tree representation ID3 learning algorithm Entropy, Information gain Overfitting Decision Tree 2 Representation: Tree-structured

More information

Classification: The rest of the story

Classification: The rest of the story U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Information Hiding and Covert Communication

Information Hiding and Covert Communication Information Hiding and Covert Communication Andrew Ker adk @ comlab.ox.ac.uk Royal Society University Research Fellow Oxford University Computing Laboratory Foundations of Security Analysis and Design

More information

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS RETRIEVAL MODELS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Boolean model Vector space model Probabilistic

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS 188 Fall 2015 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

More information

Intelligent Data Analysis Lecture Notes on Document Mining

Intelligent Data Analysis Lecture Notes on Document Mining Intelligent Data Analysis Lecture Notes on Document Mining Peter Tiňo Representing Textual Documents as Vectors Our next topic will take us to seemingly very different data spaces - those of textual documents.

More information

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

Cover Page. The handle holds various files of this Leiden University dissertation

Cover Page. The handle  holds various files of this Leiden University dissertation Cover Page The handle http://hdl.handle.net/1887/39637 holds various files of this Leiden University dissertation Author: Smit, Laurens Title: Steady-state analysis of large scale systems : the successive

More information

Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers

Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers Giuliano Armano, Francesca Fanni and Alessandro Giuliani Dept. of Electrical and Electronic Engineering, University

More information

Latent Dirichlet Allocation Introduction/Overview

Latent Dirichlet Allocation Introduction/Overview Latent Dirichlet Allocation Introduction/Overview David Meyer 03.10.2016 David Meyer http://www.1-4-5.net/~dmm/ml/lda_intro.pdf 03.10.2016 Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models

More information

Performance evaluation of binary classifiers

Performance evaluation of binary classifiers Performance evaluation of binary classifiers Kevin P. Murphy Last updated October 10, 2007 1 ROC curves We frequently design systems to detect events of interest, such as diseases in patients, faces in

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/xilnmn Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley Natural Language Processing Classification Classification I Dan Klein UC Berkeley Classification Automatically make a decision about inputs Example: document category Example: image of digit digit Example:

More information

Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert

Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert HSE Computer Science Colloquium September 6, 2016 IST Austria (Institute of Science and Technology

More information

Predictive analysis on Multivariate, Time Series datasets using Shapelets

Predictive analysis on Multivariate, Time Series datasets using Shapelets 1 Predictive analysis on Multivariate, Time Series datasets using Shapelets Hemal Thakkar Department of Computer Science, Stanford University hemal@stanford.edu hemal.tt@gmail.com Abstract Multivariate,

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 5: Multi-class and preference learning Juho Rousu 11. October, 2017 Juho Rousu 11. October, 2017 1 / 37 Agenda from now on: This week s theme: going

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

Behavioral Data Mining. Lecture 2

Behavioral Data Mining. Lecture 2 Behavioral Data Mining Lecture 2 Autonomy Corp Bayes Theorem Bayes Theorem P(A B) = probability of A given that B is true. P(A B) = P(B A)P(A) P(B) In practice we are most interested in dealing with events

More information

Classification of Publications Based on Statistical Analysis of Scientific Terms Distributions

Classification of Publications Based on Statistical Analysis of Scientific Terms Distributions AUSTRIAN JOURNAL OF STATISTICS Volume 37 (2008), Number 1, 109 118 Classification of Publications Based on Statistical Analysis of Scientific Terms Distributions Vaidas Balys and Rimantas Rudzkis Institute

More information

Dan Roth 461C, 3401 Walnut

Dan Roth   461C, 3401 Walnut CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn

More information

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation. ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent

More information