Progressive Random k-labelsets for Cost-Sensitive Multi-Label Classification


Yu-Ping Wu and Hsuan-Tien Lin
Department of Computer Science and Information Engineering, National Taiwan University, Taiwan

Abstract

In multi-label classification, an instance is associated with multiple relevant labels, and the goal is to predict these labels simultaneously. Many real-world applications of multi-label classification come with different performance evaluation criteria. It is thus important to design general multi-label classification methods that can flexibly take different criteria into account. Such methods tackle the problem of cost-sensitive multi-label classification (CSMLC). Most existing CSMLC methods either suffer from high computational complexity or focus on only certain specific criteria. In this work, we propose a novel CSMLC method, named progressive random k-labelsets (PRAkEL), to resolve the two issues above. The method is extended from a popular multi-label classification method, random k-labelsets, and hence inherits its efficiency. Furthermore, the proposed method can handle arbitrary example-based evaluation criteria by progressively transforming the CSMLC problem into a series of cost-sensitive multi-class classification problems. Experimental results demonstrate that PRAkEL is competitive with existing methods under the specific criteria they can optimize, and is superior under several popular criteria.

Keywords: machine learning, multi-label classification, loss function, cost-sensitive learning, labelset, ensemble method

1. Introduction

Multi-label classification (MLC) extends traditional multi-class classification by allowing each instance to be associated with a set of relevant labels. For example, in text classification, a document (instance) can belong to several topics (labels). Given a set of instances as well as their relevant labels, the goal of an MLC method is to predict the relevant labels of a new instance. Recently, MLC has attracted much research attention with a wide range of applications, including music tag annotation (Trohidis et al., 2008; Lo et al., 2011), image classification (Boutell et al., 2004), and video classification (Qi et al., 2007).

In contrast to multi-class classification, one important characteristic of MLC is the possible correlations between different labels. Many approaches have been proposed to exploit these correlations. Chaining methods learn a label by treating other labels as features (Read et al., 2011; Dembczynski et al., 2010). Labelset-based methods learn several labels jointly (Tsoumakas et al., 2010; Tsoumakas and Vlahavas, 2007; Lo et al., 2014; Lo, 2013). Other methods transform the space of labels to capture the correlations (Hsu et al., 2009; Tai and Lin, 2012; Hardoon et al., 2004).

A key challenge of MLC is to automatically adapt a method to the evaluation criterion of interest. In real-world applications, different criteria are often required to evaluate the performance of an MLC method. For example, Hamming loss measures the proportion of misclassified labels to the total number of labels; the F1 score originates from information retrieval applications and is the harmonic mean of precision and recall; subset 0/1 loss requires all labels to be correctly predicted. Because of the different natures of these criteria, a method that performs well under one criterion may not be well-suited for other criteria. It is therefore important to design general MLC methods that take the evaluation criterion into account, either in the training or the prediction stage. Since the evaluation criterion, or metric, determines the cost of misclassifying an instance, this type of problem is generally called cost-sensitive multi-label classification (CSMLC) (Lo et al., 2014; Li and Lin, 2014), which is formally defined in Section 2.

We shall explain in Section 3 that most existing MLC methods either aim at optimizing one specific evaluation metric or require extra effort to be adapted to each metric. For example, binary relevance (BR) (Tsoumakas et al., 2010) minimizes Hamming loss by learning each label independently. Label powerset (LP) (Tsoumakas et al., 2010) minimizes subset 0/1 loss by transforming the MLC problem into a multi-class classification problem with a huge number of hyper-classes. The well-known random k-labelsets (RAkEL) method (Tsoumakas and Vlahavas, 2007) instead solves many smaller multi-class classification problems to be computationally efficient, but it is only loosely connected to subset 0/1 loss (Ferng and Lin, 2013).

There are currently a few methods for dealing with general CSMLC problems (Dembczynski et al., 2010; Tsochantaridis et al., 2005; Li and Lin, 2014; Doppa et al., 2014). RAkEL has been extended to cost-sensitive random k-labelsets (CS-RAkEL) (Lo, 2013) and the generalized k-labelsets ensemble (GLE) (Lo et al., 2014) to handle example-dependent weighted Hamming loss, but not general metrics. The probabilistic classifier chain (Dembczynski et al., 2010) requires designing an efficient inference rule with respect to the metric, and covers many, but not all, of the metrics of interest (Li and Lin, 2014). The condensed filter tree (Li and Lin, 2014) is a chaining method that takes any evaluation metric into account during the training stage, but its training time is quadratic in the number of labels. The structured support vector machine (Tsochantaridis et al., 2005) can also handle arbitrary metrics, but it relies on solving a sophisticated optimization problem that depends on the metric and is thus also inefficient. To the best of our knowledge, no existing CSMLC method is both general and efficient.

In this work, we design a general and efficient CSMLC method in Section 4. This novel method, named progressive random k-labelsets (PRAkEL), is extended from RAkEL to inherit its efficiency. In particular, PRAkEL practically enjoys training time that is linear in the number of labels. Moreover, PRAkEL is able to optimize any example-based metric by modifying the training stage of RAkEL. More specifically, RAkEL reduces the original problem to many regular multi-class problems and ignores the original cost information; PRAkEL reduces the CSMLC problem to many cost-sensitive multi-class ones by transferring the cost information to the sub-problems.
The transferring task is non-trivial, however, because each sub-problem involves only a subset of the labels of the original problem. We therefore introduce the notion of reference labels to determine the costs in the sub-problems.

We carefully propose two strategies for defining the reference labels, which lead to different advantages and disadvantages in both theoretical and empirical aspects. We conducted experiments on seven benchmark datasets with various sizes and domains. The experimental results in Section 5 show that PRAkEL is competitive with state-of-the-art MLC methods under the specific metrics associated with those methods. Furthermore, in terms of general metrics, PRAkEL usually outperforms other methods. The results demonstrate that the proposed method is indeed more general, and more suitable for solving real-world problems.

2. Problem Setup

In CSMLC, we denote an instance by a vector $x \in \mathcal{X} = \mathbb{R}^d$ and the relevant labels of $x$ by a set $Y \subseteq \{1, 2, \ldots, K\}$, where $K$ is the total number of labels. Equivalently, this set of labels can be represented by a bit vector $y \in \mathcal{Y} = \{0, 1\}^K$, where the $l$-th component $y[l]$ is 1 if and only if the $l$-th label is relevant, i.e., $l \in Y$. Here, $\mathcal{X}$ and $\mathcal{Y}$ are called the input space and label space, respectively; the pair $(x, y)$ is called an example.

In this work, we consider a particular CSMLC setup that allows each example to carry its own cost information. This example-based setup, which assumes example-dependent costs, is more general than the setup with label-dependent costs, in which all examples share the same cost functions. The more general setup makes it possible to express the importance of different instances easily by embedding the importance in the example-dependent cost, and has been considered in several studies of cost-sensitive learning (Fan et al., 1999; Zadrozny et al., 2003; Sun et al., 2007). Formally, given a training set $\{(x_n, y_n, c_n)\}_{n=1}^N$ consisting of $N$ examples, where $c_n \colon \mathcal{Y} \to \mathbb{R}_{\geq 0}$ is a non-negative cost function and each $(x_n, y_n, c_n)$ is drawn independently from an unknown distribution $\mathcal{D}$, the goal of CSMLC is to learn a classifier $h \colon \mathcal{X} \to \mathcal{Y}$ such that the expected cost $\mathbb{E}_{(x, y, c) \sim \mathcal{D}}[c(h(x))]$ is small.

Note that the example-based setup cannot cover all popular evaluation criteria in multi-label classification. For instance, the micro-F1 and macro-F1 criteria, which are defined on a set of label vectors rather than a single one, cannot be expressed as example-dependent cost functions. Nonetheless, as highlighted by earlier CSMLC works (Li and Lin, 2014), studying the example-based setup can be viewed as an intermediate step toward those more complicated criteria.

Two remarks about this setup are in order. First, for a classifier $h$, since $c(h(x))$ is being minimized, it is natural to assume that $c$ has a minimum of 0 at $y$, the true label vector of $x$. With this assumption, although $y$ does not appear in the learning goal, its information is implicitly stored in the cost function. Second, we can similarly define the problem of cost-sensitive multi-class classification (CSMCC) by replacing the label space $\mathcal{Y}$ with $\{1, 2, \ldots, K\}$, which stands for $K$ different classes. In fact, this setup is widely adopted in many existing works (Tu and Lin, 2010; Zhou and Liu, 2010; Abe et al., 2004). Modern CSMCC works (Zhou and Liu, 2010) allow flexibly taking any cost functions into account based on application needs. While the proposed method shares the same flexibility in its derivation, we consider a more realistic scenario of CSMLC in the experiments. In particular, many CSMLC problems are actually associated with a global, label-dependent cost $L \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, typically called a loss function, where $L(y, \hat{y})$ is the loss when predicting $y$ as $\hat{y}$.
Those problems aim to learn a classifier $h \colon \mathcal{X} \to \mathcal{Y}$ such that $\mathbb{E}[L(y, h(x))]$ is small (Dembczynski et al., 2010; Li and Lin, 2014).

The aim can be easily expressed in our setup by assigning

$$c_n(\hat{y}) = L(y_n, \hat{y}). \qquad (1)$$

We focus on CSMLC with such loss functions to demonstrate the applicability of the proposed method and to make a fair comparison with existing CSMLC methods (Li and Lin, 2014; Dembczynski et al., 2010). Popular loss functions include

Hamming loss:
$$L_H(y, \hat{y}) = \frac{1}{K} \sum_{l=1}^{K} [\![\, \hat{y}[l] \neq y[l] \,]\!];$$

weighted Hamming loss with respect to a weight vector $w \in \mathbb{R}_{\geq 0}^K$:
$$L_{H,w}(y, \hat{y}) = \sum_{l=1}^{K} w[l] \, [\![\, \hat{y}[l] \neq y[l] \,]\!];$$

ranking loss:
$$L_r(y, \hat{y}) = \frac{1}{|R(y)|} \sum_{(k,l)\colon y[k] < y[l]} \left( [\![\, \hat{y}[k] > \hat{y}[l] \,]\!] + \frac{1}{2} [\![\, \hat{y}[k] = \hat{y}[l] \,]\!] \right),$$
where $R(y) = \{(k, l) \mid y[k] < y[l]\}$ is a normalizer;

F1 loss, which is one minus the F1 score:
$$L_F(y, \hat{y}) = 1 - \frac{2\, y^\top \hat{y}}{\|y\|_1 + \|\hat{y}\|_1};$$

and subset 0/1 loss:
$$L_s(y, \hat{y}) = [\![\, \hat{y} \neq y \,]\!].$$

Here $[\![\,\cdot\,]\!]$ denotes the indicator function and $\|\cdot\|_1$ the $\ell_1$ norm. For the loss functions defined above, we follow the convention that when the denominator is zero, the loss is defined as zero.
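For concreteness, the following is a direct NumPy transcription (ours, not part of the paper's code) of the five loss functions above, with label vectors as dense 0/1 arrays and the zero-denominator convention built in.

```python
import numpy as np

def hamming_loss(y, y_hat):
    """L_H: fraction of mispredicted labels."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return float(np.mean(y != y_hat))

def weighted_hamming_loss(y, y_hat, w):
    """L_{H,w}: sum of the weights of the mispredicted labels."""
    y, y_hat, w = np.asarray(y), np.asarray(y_hat), np.asarray(w)
    return float(np.sum(w * (y != y_hat)))

def ranking_loss(y, y_hat):
    """L_r: averaged over pairs (k, l) with y[k] < y[l]; ties count 1/2."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    pairs = [(k, l) for k in range(len(y)) for l in range(len(y)) if y[k] < y[l]]
    if not pairs:                       # convention: empty normalizer -> loss 0
        return 0.0
    total = sum(float(y_hat[k] > y_hat[l]) + 0.5 * float(y_hat[k] == y_hat[l])
                for k, l in pairs)
    return total / len(pairs)

def f1_loss(y, y_hat):
    """L_F = 1 - F1 score."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    denom = float(y.sum() + y_hat.sum())
    if denom == 0.0:                    # convention: zero denominator -> loss 0
        return 0.0
    return 1.0 - 2.0 * float(y @ y_hat) / denom

def subset_01_loss(y, y_hat):
    """L_s: 0 only when every label is correct."""
    return float(not np.array_equal(np.asarray(y), np.asarray(y_hat)))
```

For instance, f1_loss(np.array([1, 1, 0]), np.array([1, 0, 0])) evaluates to 1 - 2/3, i.e. about 0.333.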

To simplify the explanations of the proposed method, we further introduce some terminology. We denote the set of $K$ labels by $\mathcal{L}_K = \{1, \ldots, K\}$. A subset $S$ of $\mathcal{L}_K$ with $|S| = k$ is called a k-labelset. If $S = \{s_1, \ldots, s_k\}$ is a k-labelset with $s_1 < \cdots < s_k$, then we denote $(y[s_1], \ldots, y[s_k]) \in \{0, 1\}^k$ by $y[S]$. When the number of labels $K$ is clear from the context, we also use the notation $S^c$ to represent the $(K-k)$-labelset $\mathcal{L}_K \setminus S = \{1 \leq l \leq K \mid l \notin S\}$. We summarize the main notation used throughout the paper in Table 1.

Notation                                      Description
$N$                                           Number of training examples
$d$                                           Dimension of input space (number of features)
$K$                                           Number of labels
$\mathcal{X} = \mathbb{R}^d$                  Input space (feature space)
$\mathcal{Y} = \{0, 1\}^K$                    Output space (label space)
$x \in \mathcal{X}$                           Instance (feature vector)
$y \in \mathcal{Y}$                           True label vector
$\hat{y} \in \mathcal{Y}$                     Predicted label vector
$\tilde{y} \in \mathcal{Y}$                   Reference label vector (see Section 4.2)
$h \colon \mathcal{X} \to \mathcal{Y}$        Multi-label classifier
$c \colon \mathcal{Y} \to \mathbb{R}_{\geq 0}$  Example-dependent cost function
$L \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$  Label-dependent loss function
$\mathcal{L}_K = \{1, \ldots, K\}$            The set of $K$ labels
$S \subseteq \mathcal{L}_K$ with $|S| = k$    k-labelset
$y[S]$                                        The ordered labels of $y$ within $S$
$M$                                           Number of iterations (labelsets) of the proposed method

Table 1: Main notation used in the paper.

3. Related Work

Multi-label classification methods can be divided into two main categories, namely, algorithm adaptation and problem transformation (Tsoumakas and Katakis, 2007). Algorithm adaptation methods directly extend a specific learning algorithm to tackle MLC problems. Multi-label k-nearest neighbor (ML-kNN) (Zhang and Zhou, 2007) is adapted from the famous k-nearest neighbors algorithm. AdaBoost.MH and AdaBoost.MR (Schapire and Singer, 2000) are two multi-label extensions of the AdaBoost algorithm (Freund and Schapire, 1999). ML-C4.5 (Clare and King, 2001) is an adaptation of the popular C4.5 algorithm. BP-MLL (Zhang and Zhou, 2006) is derived from the back-propagation algorithm of neural networks.

Problem transformation methods transform MLC problems into other types of learning problems and solve them with existing algorithms. Such methods are general and can be coupled with any mature algorithm. Our proposed method in Section 4 belongs to this category. Binary relevance (BR) (Tsoumakas et al., 2010) is arguably the simplest problem transformation method; it transforms the MLC problem into several binary classification problems by learning and predicting each label independently. Classifier chain (CC) (Read et al., 2011) iteratively learns a binary classifier to predict the $l$-th label using $\{(x_n, \hat{y}_n[1], \ldots, \hat{y}_n[l-1])\}$ as the training set, where $\hat{y}_n$ contains the previously predicted labels. Although CC considers label dependencies, the ordering of the labels becomes crucial to its performance. Many approaches have been proposed to address this issue (Read et al., 2011, 2014; Goncalves et al., 2013). In particular, the ensemble of classifier chains (ECC) (Read et al., 2011) learns several CC classifiers, each with a random ordering of labels, and averages the predictions from all the classifiers to classify a new instance.
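To make the chaining reduction concrete, here is a minimal sketch of CC (ours, with illustrative names; scikit-learn's LogisticRegression stands in for any binary base learner). It follows the common variant that feeds the true previous labels during training and the predicted ones during prediction; the description above, which trains on previously predicted labels, fits the same skeleton.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class ChainSketch:
    """Minimal classifier chain: the l-th binary problem sees the
    original features plus the first l-1 label columns."""

    def fit(self, X, Y):                       # X: (N, d), Y: (N, K) in {0, 1}
        self.models = []
        for l in range(Y.shape[1]):
            # assumes each label column contains both classes
            X_aug = np.hstack([X, Y[:, :l]])   # true previous labels at training
            self.models.append(LogisticRegression(max_iter=1000).fit(X_aug, Y[:, l]))
        return self

    def predict(self, X):
        Y_hat = np.zeros((X.shape[0], len(self.models)), dtype=int)
        for l, model in enumerate(self.models):
            X_aug = np.hstack([X, Y_hat[:, :l]])   # predicted labels fed forward
            Y_hat[:, l] = model.predict(X_aug)
        return Y_hat
```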

Instead of learning one binary classifier for each label, the probabilistic classifier chain (PCC) (Dembczynski et al., 2010) learns probabilistic classifiers to estimate $P(y \mid x)$ through the chain rule

$$P(y \mid x) = P(y[1] \mid x) \prod_{l=2}^{K} P(y[l] \mid x, y[1], \ldots, y[l-1])$$

and then applies a Bayes optimal inference rule designed for the evaluation metric to produce the final prediction. In principle, PCC can be adapted to any metric to tackle CSMLC problems by designing a proper inference rule for the metric. However, deriving efficient inference rules for different metrics is practically challenging. Inference rules for Hamming, ranking, F1 and subset 0/1 loss have been designed (Dembczynski et al., 2010, 2011), but the rules for other metrics remain an open question. Similar to ECC, the ensembled probabilistic classifier chain (EPCC) (Dembczynski et al., 2010) resolves the issue of label ordering through random orderings. The Monte Carlo optimization for classifier chains (MCC) (Read et al., 2014) employs the Monte Carlo scheme to find a good label ordering in the training stage of PCC. A recently proposed method, the classifier trellis (CT) (Read et al., 2015), extends MCC to consider a trellis structure of labels rather than a chain to improve efficiency. During the prediction stage of both methods (Read et al., 2014, 2015), the Monte Carlo scheme is applied to generate samples from $P(y \mid x)$. A large number of samples may be required for the Monte Carlo simulation, which results in possible computational challenges during prediction. While those samples can in principle be used to make cost-sensitive predictions, this possibility has not been fully studied in either work. In fact, the original works consider only approximate inference for Hamming loss and subset 0/1 loss.

A group of methods take label dependencies into account by learning multiple labels jointly. Label powerset (LP) (Tsoumakas et al., 2010) transforms each label combination into a unique hyper-class and learns a multi-class classifier. If there are $K$ labels in total, then the number of classes may be as large as $2^K$. Hence, when the number of labels is large, LP suffers from computational issues and an insufficient number of training examples within each class. To overcome this drawback, the random k-labelsets (RAkEL) method (Tsoumakas and Vlahavas, 2007) focuses on one labelset at a time. Recall that a k-labelset is a size-$k$ subset of $\{1, 2, \ldots, K\}$. RAkEL iteratively selects a random k-labelset $S_m$ and learns an LP classifier $h_m$ on the training set restricted to the labels within $S_m$, i.e., $\{(x_n, y_n[S_m])\}$. Each classifier $h_m$ predicts the $k$ labels within $S_m$, and the final prediction for an instance is produced by a majority vote of all the classifiers. Because the number of classes in each LP classifier is decreased, RAkEL is more efficient than LP. In addition, it achieves better performance than LP in terms of Hamming and F1 loss.

Nonetheless, there is a noticeable issue with RAkEL. In each multi-class sub-problem, a one-bit prediction error and a two-bit error are equally penalized. That is, the LP classifiers cannot distinguish between small and big errors. Because these classifiers are learned without considering the evaluation metric, RAkEL is not a cost-sensitive method. Two extensions of RAkEL were proposed to address the above issue, but both consider only example-dependent weighted Hamming loss rather than general metrics.
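Before turning to these extensions, the following sketches the plain RAkEL reduction just described; it is our own illustration (degenerate labelsets whose training examples fall into a single hyper-class are ignored), and the powerset encoding realizes the bijection between $\{0,1\}^k$ and the $2^k$ hyper-classes that Section 4.1 reuses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(Y_sub):
    """Bijection {0,1}^k -> {0, ..., 2^k - 1}: the LP hyper-classes."""
    return Y_sub @ (1 << np.arange(Y_sub.shape[1]))

def decode(classes, k):
    """Inverse bijection, recovering the k bits of each hyper-class."""
    return (classes[:, None] >> np.arange(k)) & 1

def rakel_fit(X, Y, k=3, M=10, seed=0):
    rng = np.random.default_rng(seed)
    K = Y.shape[1]
    ensemble = []
    for _ in range(M):
        S = np.sort(rng.choice(K, size=k, replace=False))  # random k-labelset
        clf = LogisticRegression(max_iter=1000).fit(X, encode(Y[:, S]))
        ensemble.append((S, clf))
    return ensemble

def rakel_predict(ensemble, X, K):
    votes = np.zeros((X.shape[0], K))        # signed per-label votes
    for S, clf in ensemble:
        pred = decode(clf.predict(X), len(S))
        votes[:, S] += 2 * pred - 1          # +1 relevant, -1 irrelevant
    return (votes > 0).astype(int)
```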
The cost-sensitive random k-labelsets (CS-RAkEL) (Lo, 2013) method reduces the CSMLC problem to several multi-class problems with instance weights.

The weight of each instance is defined as the sum of the misclassification costs of the relevant labels. Despite the restriction, one advantage of CS-RAkEL is that it requires only re-weighting of the instances and can hence be coupled with many traditional multi-class classification algorithms. The generalized k-labelsets ensemble (GLE) (Lo et al., 2014) learns a set of LP classifiers and determines a linear combination of them by minimizing the average loss on the training examples. The minimization is formulated as a quadratic optimization problem without any constraints and hence can be solved efficiently. While both CS-RAkEL and GLE are pioneering works on extending RAkEL for CSMLC, they focus on specific applications of tagging. As a consequence, the two methods do not come with much theoretical guarantee, and it is non-trivial to extend them to handle general example-based costs.

For the methods introduced above, BR and CC optimize Hamming loss; CS-RAkEL and GLE deal with weighted Hamming loss; MCC and CT currently minimize Hamming and subset 0/1 loss, with the potential of handling general metrics yet to be studied; PCC is designed to deal with general metrics, but is computationally demanding for arbitrary metrics that come without efficient inference rules. Another method that deals with general metrics is the structured support vector machine (SSVM) (Tsochantaridis et al., 2005). The SSVM optimizes a metric by re-scaling certain variables in the traditional SVM optimization problem based on the metric. However, the complexity of solving the problem depends on the metric and is usually too high for practical applications.

Condensed filter tree (CFT) (Li and Lin, 2014) is a state-of-the-art CSMLC method, extended from the well-known filter tree algorithm (Beygelzimer et al., 2009) to handle multi-label data. Similarly, the divide-and-conquer tree algorithm (Beygelzimer et al., 2009) for multi-class problems can be directly adapted to CSMLC problems, resulting in the top-down tree (TT) method (Li and Lin, 2014). Both CFT and TT can be viewed as cost-sensitive extensions of CC. CFT suffers from its training time, which is quadratic in the number of labels; TT suffers from weaker performance compared with CFT (Li and Lin, 2014).

Multi-label search (MLS) (Doppa et al., 2014) optimizes a metric by adapting the HC-Search framework to multi-label problems. It learns a heuristic function and estimates the evaluation metric in the training stage. Then, during the prediction stage, MLS conducts a heuristic search toward minimizing the estimated cost. Despite its generality, MLS suffers from high computational complexity. To learn the heuristic function during training, it needs to solve a ranking problem consisting of $O(NK)$ examples, where $N$ is the number of training examples and $K$ is the number of labels.

In summary, many existing MLC methods are not applicable to arbitrary example-based metrics of CSMLC (BR, CC, LP, RAkEL). There are some extensions dealing with restricted metrics of CSMLC (CS-RAkEL, GLE). For general metrics, current methods suffer from computational issues (CFT, MLS, SSVM) or performance issues (TT), or require elegant design of inference rules or further study to handle different metrics (PCC, MCC, CT). In the next section, we present a general yet efficient cost-sensitive multi-label method, which is competitive with state-of-the-art CSMLC methods.

4. Proposed Method

Recall that the LP method solves an MLC problem by transforming it into a single multi-class problem. Similarly, a CSMLC problem can be transformed into a cost-sensitive multi-class classification (CSMCC) problem, as illustrated in the CFT work (Li and Lin, 2014). The resulting method, however, suffers from the same computational issue as LP, and hence is not feasible for large problems. CFT addresses the computational issue by considering an efficient multi-class classification model, the filter tree. In this work, we deal with the computational issue differently. We extend the idea of RAkEL and propose a novel labelset-based method, which iteratively transforms the CSMLC problem into a series of CSMCC problems. Different from RAkEL, the critical part of the proposed method is the transfer of the cost information to the sub-problems in the training stage. This is not a trivial task, since each sub-problem involves only a subset of labels and hence the costs in each sub-problem cannot be easily connected to those in the original problem. Therefore, we introduce the notion of reference label vectors to determine the costs in the sub-problems. While the overall idea sounds simple, it advances the study of CSMLC in several aspects:

- Compared with traditional MLC methods such as RAkEL, the proposed method is sensitive to the evaluation metric and hence is able to optimize arbitrary example-based metrics.
- Compared with CS-RAkEL and GLE, the proposed method handles more general metrics and comes with solid theoretical analysis.
- Compared with PCC, MCC and SSVMs, our method alternatively considers label dependencies through labelsets and requires no manual adaptation to each evaluation metric.
- Compared with existing CSMLC methods such as CFT, our method is more efficient in terms of training time complexity while reaching a similar level of performance.

We first provide the framework of the proposed method. Then, we describe it in great detail and present its analysis.

4.1. Framework

Let $T = \{(x_n, y_n, c_n)\}_{n=1}^N$ be the training set and $M$ be the number of iterations. Inspired by RAkEL, in the $m$-th iteration, our method selects a random k-labelset $S_m$ and constructs a CSMCC training set $T_m = \{(x_n, y_n[S_m], c'_n)\}_{n=1}^N$ of $K' = 2^k$ classes, where $c'_n \colon \{0, 1\}^k \to \mathbb{R}$. The main difference between our method and RAkEL is that the multi-class sub-problems defined here contain the costs $c'_n$, and hence our method is able to carry the information of the evaluation metric. The two issues of RAkEL discussed in Section 3 can also be resolved by properly defining these $c'_n$. Although in our problem setup described in Section 2 the label space of a CSMCC problem should be $\mathcal{L}_{K'}$, by considering a bijection between $\mathcal{L}_{K'}$ and $\{0, 1\}^k$, we may treat $y_n[S_m]$ as an element of $\mathcal{L}_{K'}$ and assume $c'_n \colon \mathcal{L}_{K'} \to \mathbb{R}$. Then, any CSMCC algorithm can be employed to learn a multi-class classifier $h_m \colon \mathcal{X} \to \{0, 1\}^k$ for $T_m$.

Similar to RAkEL, the final prediction for a new instance $x$ is produced by a majority vote of all the classifiers $h_m$. More precisely, if we define $\bar{h}_m \colon \mathcal{X} \to \{-1, 0, 1\}^K$ by

$$\bar{h}_m(x)[S_m] = 2\,h_m(x) - 1 \in \{-1, 1\}^k, \qquad \bar{h}_m(x)[S_m^c] = 0 \in \{0\}^{K-k}, \qquad (2)$$

then the final prediction $\hat{y} \in \mathcal{Y}$ can be obtained by setting $\hat{y}[l] = 1$ if and only if $\sum_{m=1}^{M} \bar{h}_m(x)[l] > 0$.

4.2. Cost Transformation

Having described the framework, we now turn our attention to the multi-class cost functions $c'_n$ in the sub-problems, which must be defined in each iteration. At this point, notice that if we define $c'_n(\hat{y}') = [\![\, \hat{y}' \neq y_n[S_m] \,]\!]$, then the proposed method degenerates into RAkEL. Since this $c'_n$ is independent of the original cost function $c_n$, this assignment again shows that RAkEL is not a cost-sensitive method. To establish the connection between the two cost functions, $c'_n$ must carry a certain amount of information about $c_n$. Note that $c'_n$ is defined on $\{0, 1\}^k$ whereas $c_n$ is defined on $\mathcal{Y} = \{0, 1\}^K$. To extend the arguments of $c'_n$ to the domain of $c_n$, we propose considering a reference label vector $\tilde{y}_n \in \mathcal{Y}$ and setting the value of $c'_n$ to be the cost under $c_n$ as if the labels outside $S_m$ were predicted the same as $\tilde{y}_n$. Mathematically,

$$c'_n(\hat{y}') = c_n(\hat{y}' \cup \tilde{y}_n[S_m^c]). \qquad (3)$$

Here, we treat $\hat{y}'$ and $\tilde{y}_n[S_m^c]$ as subsets of $S_m$ and $S_m^c$, respectively, and therefore their union is considered as a subset of $\mathcal{L}_K$, or equivalently a bit vector in $\{0, 1\}^K$.

It then remains to define these $\tilde{y}_n$ in each iteration to complete the transformation. We shall see in the next section that these reference vectors may depend on the classifiers learned in the previous iterations, and hence the multi-class cost functions are obtained progressively. As a consequence, the proposed method is called progressive random k-labelsets (PRAkEL). The training and prediction algorithms of PRAkEL are presented in Algorithms 1 and 2, where the weighting strategy mentioned in line 8 of Algorithm 1 is described in Section 4.4. For now we simply assume $\alpha_m = 1$ for $1 \leq m \leq M$. Another thing to note is that we do not explicitly require selecting a labelset that has not been chosen before. However, in practice we give higher priority to those labels that were selected fewer times in the previous iterations. In particular, we guarantee that all labels are selected at least once if $kM \geq K$.

4.3. Defining Reference Label Vectors

We propose two strategies for defining the reference label vectors. The first, and also the most intuitive, is to let $\tilde{y}_n = y_n$ in every iteration. The proposed method with this assignment is denoted by PRAkEL_t to indicate the usage of the true label vectors. This strategy implicitly assumes that the labels outside the labelset can be perfectly predicted by the other classifiers. In real-world situations, however, this is usually not the case. Therefore, in the second strategy, we define $\tilde{y}_n$ to be the predicted label vector of $x_n$ obtained thus far, so that the optimization in each sub-problem no longer depends on perfect predictions from the previous classifiers.
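To make Equation 3 concrete, the following sketch (ours, not from the paper) assembles the $2^k$-dimensional cost vector of one sub-problem example; the enumeration order of $\{0,1\}^k$ is an arbitrary choice of bijection, and the final subtraction anticipates the cost shift introduced below (it is a no-op for PRAkEL_t, whose minimum is already 0).

```python
import numpy as np
from itertools import product

def transformed_costs(c_n, S, ref, k):
    """Equation 3 for one example: cost of each of the 2^k classes,
    completing the labels outside S with the reference vector ref
    (ref = y_n for PRAkEL_t, ref = H_{m-1,n} for PRAkEL)."""
    costs = np.empty(2 ** k)
    for idx, y_sub in enumerate(product([0, 1], repeat=k)):
        full = np.array(ref)             # copy of the reference vector
        full[S] = y_sub                  # \hat{y}' united with ref[S_m^c]
        costs[idx] = c_n(full)
    return costs - costs.min()           # shift so that min c'_n = 0
```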

Algorithm 1: Training algorithm of PRAkEL
Input: Training set $T = \{(x_n, y_n, c_n)\}_{n=1}^N$, cost-sensitive multi-class training algorithm $A$
Parameter: Size of labelset $k$, number of iterations $M$
Output: Labelsets $S_m$, weights $\alpha_m$ and multi-class classifiers $h_m \colon \mathcal{X} \to \{0, 1\}^k$ for $1 \leq m \leq M$
1: for $m \leftarrow 1$ to $M$ do
2:   Select a random k-labelset $S_m$;
3:   Define the reference label vector $\tilde{y}_n$ for $1 \leq n \leq N$;
4:   Transform each example $(x_n, y_n, c_n)$ into $(x_n, y_n[S_m], c'_n)$ based on Equation 3;
5:   $T' \leftarrow \{(x_n, y_n[S_m], c'_n)\}_{n=1}^N$;
6:   $h_m \leftarrow A(T')$;
7:   Define $\bar{h}_m$ by Equation 2;
8:   Assign weight $\alpha_m$ according to the weighting strategy;
9:   $F_n \leftarrow F_n + \alpha_m \bar{h}_m(x_n)$ for $1 \leq n \leq N$;
10: end for

Algorithm 2: Prediction algorithm of PRAkEL
Input: New instance $x$, labelsets $S_m$, weights $\alpha_m$ and multi-class classifiers $h_m \colon \mathcal{X} \to \{0, 1\}^k$ for $1 \leq m \leq M$
Output: Predicted label vector $\hat{y} \in \mathcal{Y}$
1: Define $\bar{h}_m$ for $1 \leq m \leq M$ by Equation 2;
2: $F \leftarrow \sum_{m=1}^{M} \alpha_m \bar{h}_m(x)$;
3: $\hat{y} \leftarrow 0 \in \{0\}^K$;
4: for $l \leftarrow 1$ to $K$ do
5:   if $F[l] > 0$ then $\hat{y}[l] \leftarrow 1$;
6: end for

Formally, let $F_{m,n} = \sum_{p=1}^{m} \bar{h}_p(x_n)$ for $1 \leq n \leq N$ and define $H_{m,n} \in \mathcal{Y}$ by $H_{m,n}[l] = [\![\, F_{m,n}[l] > 0 \,]\!]$. That is, $H_{m,n}$ is the prediction of $x_n$ by a majority vote of the first $m$ classifiers. We then define $\tilde{y}_n$ in the $m$-th iteration to be $H_{m-1,n}$ for $m \geq 2$, and let $\tilde{y}_n = y_n$ in the first iteration. Since the reference label vectors as well as the multi-class sub-problems are obtained progressively, the proposed method coupled with this strategy is denoted simply by PRAkEL.

Recall that in our problem setup we assume the minimum of each $c_n$ is 0. Therefore, for PRAkEL_t we have $\min_{\hat{y}' \in \{0,1\}^k} c'_n(\hat{y}') = \min_{\hat{y} \in \mathcal{Y}} c_n(\hat{y}[S_m] \cup y_n[S_m^c]) = c_n(y_n) = 0$. In other words, the minimum cost for every example in each sub-problem is 0, which is a consequence of $\tilde{y}_n = y_n$. For PRAkEL, however, this identity may not hold. Since the predicted labels outside $S_m$ cannot be altered in the $m$-th iteration, it is natural to add a constant to each of the functions $c'_n$ such that $\min_{\hat{y}' \in \{0,1\}^k} c'_n(\hat{y}') = 0$.

The transformed cost functions for PRAkEL are therefore all shifted to satisfy this equality by the following formula:

$$c'_n(\hat{y}') = c_n(\hat{y}' \cup \tilde{y}_n[S_m^c]) - \min_{y' \in \{0,1\}^k} c_n(y' \cup \tilde{y}_n[S_m^c]). \qquad (4)$$

Interestingly, after shifting the costs, PRAkEL_t and PRAkEL become equivalent under Hamming loss and ranking loss. To show this, we first present two lemmas.

Lemma 1. Let $L_r$ be the ranking loss function and $y \in \mathcal{Y} = \{0, 1\}^K$. Then there exists a unique $w \in \mathbb{R}_{\geq 0}^K$ such that $L_r(y, \cdot) = L_{H,w}(y, \cdot)$, where $L_{H,w}$ is the weighted Hamming loss function with respect to $w$.

Proof. See Appendix A.

Lemma 2. Let $L_{H,w}$ be the weighted Hamming loss function and $S$ be a k-labelset. For any subsets $y_0$ and $y_1$ of $S$, the difference $L_{H,w}(y, y_0 \cup \tilde{y}[S^c]) - L_{H,w}(y, y_1 \cup \tilde{y}[S^c])$ is independent of $\tilde{y} \in \{0, 1\}^K$.

Proof. See Appendix A.

Theorem 3. Under Hamming loss and ranking loss, PRAkEL_t and PRAkEL are equivalent.

Proof. Let $L$ be the loss function of interest and consider the $m$-th iteration. For any instance $x$ with true label vector $y$ and original cost function $c$, let $b'$ and $c'$ be the cost functions of $x$ in the $m$-th multi-class sub-problem, in the training of PRAkEL_t and PRAkEL, respectively. We show that $b'(y') = c'(y') - \min c'$. Let $\tilde{y}$ be the reference label vector of $x$ for PRAkEL. Since we are considering a single instance, by Lemma 1 we may assume $L$ is a weighted Hamming loss function. Let $S$ be the k-labelset in the current iteration. If $y' \subseteq S$, then by definition,

$$c'(y') - \min c' = c(y' \cup \tilde{y}[S^c]) - \min_{\hat{y}\colon \hat{y}[S^c] = \tilde{y}[S^c]} c(\hat{y}) = L(y, y' \cup \tilde{y}[S^c]) - \min_{\hat{y}' \subseteq S} L(y, \hat{y}' \cup \tilde{y}[S^c]) = \max_{\hat{y}' \subseteq S} \bigl( L(y, y' \cup \tilde{y}[S^c]) - L(y, \hat{y}' \cup \tilde{y}[S^c]) \bigr).$$

In addition, by Lemma 2, $L(y, y' \cup \tilde{y}[S^c]) - L(y, \hat{y}' \cup \tilde{y}[S^c])$ is independent of $\tilde{y}[S^c]$ for all $\hat{y}' \subseteq S$. Therefore, we have

$$c'(y') - \min c' = \max_{\hat{y}' \subseteq S} \bigl( L(y, y' \cup y[S^c]) - L(y, \hat{y}' \cup y[S^c]) \bigr) = L(y, y' \cup y[S^c]) - L(y, y[S] \cup y[S^c]) = L(y, y' \cup y[S^c]) = c(y' \cup y[S^c]) = b'(y').$$
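As a quick numerical sanity check of Lemma 2 (and of the mechanism behind Theorem 3), the following script (ours; the sizes and seed are arbitrary) prints the shifted sub-problem cost vector of one instance under a random weighted Hamming loss, first with the true label vector as the reference and then with an arbitrary one; the two outputs coincide.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
K, S = 6, [1, 4]                           # six labels, a 2-labelset
w = rng.random(K)                          # weights of L_{H,w}
y = rng.integers(0, 2, K)                  # true label vector

def L_Hw(y_pred):
    """Weighted Hamming loss against the fixed true vector y."""
    return float(np.sum(w * (y != y_pred)))

def complete(y_sub, ref):
    """y' united with ref[S^c]: fill labels outside S from ref."""
    full = ref.copy()
    full[S] = y_sub
    return full

for ref in (y.copy(), rng.integers(0, 2, K)):   # true vs. arbitrary reference
    costs = np.array([L_Hw(complete(np.array(s), ref))
                      for s in product([0, 1], repeat=len(S))])
    print(np.round(costs - costs.min(), 6))     # same vector for every ref
```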

Moreover, for these two loss functions it is easy to derive an upper bound on the training cost. Consider a training example $(x, y, c)$. Let $e_m$ be the training cost of $x$ in the $m$-th CSMCC sub-problem. We hope to bound the overall multi-label training cost of $x$ in terms of these $e_m$. By Lemma 1, again, it suffices to consider weighted Hamming loss. Recall that $K$ is the number of labels, $k$ is the size of the labelsets, and $M$ is the number of iterations. For simplicity, assume $kM$ is a multiple of $K$. In addition, we assume that each label appears in exactly $r = kM/K$ labelsets; that is, the labelsets are selected uniformly. Let $\bar{h}_m \in \{-1, 0, 1\}^K$ be the prediction of $x$ in the $m$-th iteration as defined in Section 4.1 and $\hat{y} \in \mathcal{Y}$ be the final prediction, which is obtained by averaging these $\bar{h}_m$. Now, focus on the $l$-th label. If $\hat{y}[l] \neq y[l]$, then at least half of those $m$ with $l \in S_m$ must have predicted $\bar{h}_m[l]$ incorrectly, each incurring a cost of $w[l]$ in its sub-problem. Hence, the part of the overall training cost contributed by the $l$-th label cannot exceed $2/r$ times the total multi-class cost incurred on that label across the sub-problems. As a result, by the label-wise decomposability of weighted Hamming loss, the training cost is no more than $\sum_{m=1}^{M} 2e_m/r = (2K/k)\bar{e}$, where $\bar{e} = \frac{1}{M}\sum_{m=1}^{M} e_m$. By the above arguments, we have the following theorem.

Theorem 4. Let $E_m$ be the multi-class training cost of the training set in the $m$-th iteration. Then, under Hamming loss and ranking loss, the overall training cost of the CSMLC problem for both PRAkEL_t and PRAkEL is no more than $(2K/k)\bar{E}$, where $\bar{E}$ is the mean of the $E_m$.

Proof. Since the statement holds for each example, the proof is straightforward.

Despite the equivalence between PRAkEL_t and PRAkEL under Hamming and ranking loss, they are not the same for arbitrary cost functions. In the experiment section, we demonstrate that PRAkEL is more effective under F1 loss. For now, we present an explanation by restricting ourselves to the case where the labelsets are disjoint. In this case, $K/k = M$, and the upper bound in Theorem 4 can be improved to $(K/k)\bar{E} = M\bar{E}$ because the final prediction of each label is determined by a single LP classifier. Under this restriction, we have a similar result for PRAkEL. Before stating the next theorem, we have to make a normality assumption about the cost functions. For a label vector $y$ and its corresponding cost function $c$, we assume that if $\hat{y}' \in \mathcal{Y}$ is one bit closer to $y$ than $\hat{y}'' \in \mathcal{Y}$, then $c(\hat{y}') \leq c(\hat{y}'')$. That is, a more correct prediction does not result in a larger cost. In fact, this simple assumption has been implicitly made by many MLC methods such as BR, CC and RAkEL.

Theorem 5. Assume the labelsets are disjoint. Then, for any cost function satisfying the above assumption, the overall training cost for PRAkEL is no more than $M\bar{E}$.

Proof. We may assume there is only one training example $(x, y, c)$, where the subscript $n$ is dropped for simplicity.

Recall that the reference label vector of $x$ in the $m$-th iteration, denoted by $\tilde{y}^{(m)}$, is defined to be $H_{m-1}$ for $m \geq 2$. Then, for $m \geq 2$,

$$c(H_m) = c(H_m[S_m] \cup H_m[S_m^c]) = c(h_m(x) \cup H_{m-1}[S_m^c]) = E_m + c(y[S_m] \cup H_{m-1}[S_m^c]) \leq E_m + c(H_{m-1}),$$

where the third equality is by the definition of $E_m$ (using the shifted cost of Equation 4, whose minimum is attained at $y[S_m]$ under the assumption), and the inequality follows from the assumption we just made. Hence, by induction, the overall training cost is $c(H_M) \leq c(\tilde{y}^{(1)}) + \sum_{m=1}^{M} E_m = c(y) + M\bar{E} = M\bar{E}$. Note that this bound cannot be improved, since all inequalities in the proof become equalities under Hamming loss.

Nonetheless, there is no analogous result for PRAkEL_t, as shown in the following theorem.

Theorem 6. Assume $k < K$. For PRAkEL_t, there is no constant $B > 0$ such that the bound $B\bar{E}$ on the overall training cost holds for arbitrary cost functions.

Proof. Again, assume the labelsets are disjoint and there is only one instance $x$. Consider the special case where the true label vector of $x$ is $y = (1, \ldots, 1) \in \mathcal{Y}$, and assume each classifier predicts every label in its labelset as irrelevant, i.e., $h_m(x)[l] = 0$ for all $l \in S_m$ and all $m$. In this case, $\hat{y} = (0, \ldots, 0) \in \mathcal{Y}$, and therefore its F1 loss is $L_F(y, \hat{y}) = 1$. In addition, if we define $\hat{y}_m = \hat{y}[S_m] \cup y[S_m^c]$, then

$$E_m = L_F(y, \hat{y}_m) \qquad (5)$$
$$= 1 - \frac{2 \sum_l y[l]\, \hat{y}_m[l]}{\sum_l y[l] + \sum_l \hat{y}_m[l]} \qquad (6)$$
$$= \frac{k}{k + 2(K - k)}. \qquad (7)$$

Hence, we have $L_F(y, \hat{y}) = 1 = ((2K - k)/k)\bar{E}$. Note that if the factor 2 in Equation (7) were replaced by a larger constant, the bound would need to be larger. Moreover, we can freely define a loss function $L$ similar to $L_F$ by replacing the constant 2 in Equation (6) with an arbitrary positive one. Letting that constant tend to infinity completes the proof.

Theorems 5 and 6 suggest defining the reference label vectors to be the predicted label vectors instead of the true ones. Empirical results in the experiment section also support this finding. In fact, a previous study on multi-target regression has already revealed the problem of treating true targets as additional input variables (Spyromitros-Xioufis et al., 2016). Moreover, its authors showed that in-sample estimates of the target variables are still problematic, and proposed an approach based on out-of-sample estimates to tackle the issue. Although we do not consider this kind of estimate in this paper, we believe that a similar approach for PRAkEL could be explored in future work.

One disadvantage of employing the predicted labels is that the sub-problems need to be learned iteratively, while the training of the LP classifiers in RAkEL can be parallelized. The two cost-sensitive extensions of RAkEL, CS-RAkEL and GLE, as well as PRAkEL_t, do not have this drawback. There is thus a tradeoff between performance and efficiency.

4.4. Weighting of Base Classifiers

In general, some sub-problems of PRAkEL are easier to solve, while others are more difficult. Thus, the performance of each LP classifier within PRAkEL can differ, and a plain majority vote of these classifiers may be sub-optimal. Inspired by GLE (Lo et al., 2014), we can further assign different weights to the LP classifiers to represent their importance. To achieve this, a linear combination of the classifiers is learned by minimizing the training cost. Formally, given a new instance $x$, its prediction $\hat{y} \in \mathcal{Y}$ is produced by setting $\hat{y}[l] = 1$ if and only if $\sum_{m=1}^{M} \alpha_m \bar{h}_m(x)[l] > 0$, where the $\alpha_m > 0$ are called the weights of the base classifiers. Accordingly, the assignment $F_{m,n} = \sum_{p=1}^{m} \bar{h}_p(x_n)$ in the previous section should be changed to $F_{m,n} = \sum_{p=1}^{m} \alpha_p \bar{h}_p(x_n)$.

One approach for determining these weights is to solve an optimization problem after all the $h_m$ are learned, just as GLE does. However, this overall optimization ignores the iterative nature of PRAkEL, where the value of $F_{m,n}$ in the $m$-th iteration depends on $\alpha_p$ for $1 \leq p < m$. We therefore iteratively determine $\alpha_m$ by greedily minimizing the training cost. More precisely, we let $\alpha_1 = 1$ for simplicity, and for $m \geq 2$, regarding $H_{m,n}$ as a function of $\alpha_m$, we solve the following single-variable optimization problem and define $\alpha_m$ to be an optimal solution:

$$\min_{\alpha \in \mathbb{R}} \; \frac{1}{N} \sum_{n=1}^{N} c_n(H_{m,n}(\alpha)). \qquad (8)$$

It is not easy to solve this type of problem in general. Nevertheless, since the objective function is piecewise constant, the optimization problem (8) can be solved by considering only finitely many $\alpha$, and the remaining task is to obtain these candidate $\alpha$. It suffices to find the discontinuities of the objective function, and therefore the zeros of each component of the function $F_{m,n}(\alpha)$ for all $n$, denoted by a set $E_{m,n} \subseteq \mathbb{R}$. Since $F_{m,n}(\alpha) = F_{m-1,n} + \alpha \bar{h}_m(x_n)$, we have

$$E_{m,n} \subseteq \{\alpha \mid F_{m,n}(\alpha)[l] = 0 \text{ for some } l \in S_m\} = \{-F_{m-1,n}[l] / \bar{h}_m(x_n)[l] \mid l \in S_m\},$$

implying $|E_{m,n}| \leq |S_m| = k$. If $(\bigcup_n E_{m,n}) \cap \mathbb{R}_{>0} = \{a_1, \ldots, a_P\}$ with $0 < a_1 < \cdots < a_P$, then clearly $P \leq Nk$, and the set of candidate $\alpha$ can be chosen as $\{(a_i + a_{i+1})/2 \mid 1 \leq i < P\} \cup \{a_1/2, a_P + 1\}$. This weighting strategy is called greedy weighting (GW).

Certainly, one can simplify the process of solving (8) by minimizing over a fixed finite set $E$, the candidate set of $\alpha$, to ease the computational burden and decrease the possibility of overfitting. For example, let $E = \{i/P \mid 1 \leq i \leq P\} \cup \{\epsilon\}$ for some $P \in \mathbb{N}$, where $0 < \epsilon < 1/(PM)$ is a small number for tie breaking. This weighting strategy is called simple weighting (SW).
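A sketch of how the GW candidates and the greedy choice could be computed; the array-based conventions and the `mean_cost` callback are our own assumptions rather than the paper's implementation.

```python
import numpy as np

def greedy_weight(F_prev, h_bar, mean_cost):
    """Solve Equation 8 for one iteration. F_prev: (N, K) accumulated
    weighted votes F_{m-1,n}; h_bar: (N, K) votes of the new classifier
    in {-1, 0, +1} (zero outside S_m); mean_cost(H) returns the average
    training cost of a 0/1 prediction matrix H."""
    nz = h_bar != 0
    zeros = -F_prev[nz] / h_bar[nz]          # discontinuities of the objective
    a = np.unique(zeros[zeros > 0])
    if a.size == 0:                          # objective constant on (0, inf)
        return 1.0
    candidates = np.concatenate([[a[0] / 2], (a[:-1] + a[1:]) / 2, [a[-1] + 1]])
    scores = [mean_cost((F_prev + alpha * h_bar > 0).astype(int))
              for alpha in candidates]
    return float(candidates[int(np.argmin(scores))])
```

SW would simply replace `candidates` with a fixed grid such as {i/P} plus a small epsilon, trading a finer search for lower computation and less overfitting.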

4.5. Analysis of Time Complexity

First, we analyze the training time complexity of PRAkEL without the weighting of the base classifiers. The trivial steps of Algorithm 1 that form the sub-problems take time at most $O(N)$ multiplied by the time needed to calculate the reference label $\tilde{y}_n$ and the cost $c'_n$. The more time-consuming step of PRAkEL, similar to RAkEL, depends on the time spent by the CSMCC base classifier, denoted by $T_0(N, d, K')$ for $N$ examples, $d$ features, and $K'$ classes. The empirical results of PRAkEL in the next section demonstrate that it suffices to let each label appear in a fixed number of labelsets on average. That is, only $M = O(K/k)$ iterations are needed, and hence the practical training time of PRAkEL is $T_0(N, d, 2^k) \cdot O(K/k)$, which is linear in $K$. In contrast, as discussed in Section 3, the training time of CFT (Li and Lin, 2014) is $O(NK^2)$ multiplied by the time needed to calculate the cost $c_n$, plus $O(K)$ calls to the base classifier. The complexity analysis reveals the asymptotic efficiency of PRAkEL over CFT.

When the weighting is considered, in each iteration GW (which is generally more time-consuming than SW) needs $O(k)$ time to determine the zeros of each $F_{m,n}$, and evaluating the goodness of all candidate $\alpha$ can be done within $O(Nk)$, multiplied by the time needed to calculate $c_n$. That is, running PRAkEL-GW for $M = O(K/k)$ iterations requires an additional $O(NK)$ multiplied by the time needed to calculate the cost $c_n$. The additional time of PRAkEL-GW is still asymptotically more efficient than the training time of CFT.

5. Experiment

5.1. Experimental Setup

The experiments were conducted on seven benchmark datasets (Tsoumakas et al., 2011). These datasets were chosen because of their diversity of domains and popularity in the multi-label research community. Their basic statistics are provided in Table 2, where $N$ is the number of examples, $d$ is the dimension of the input space, and $K$ is the number of labels.

Dataset    Domain
CAL500     music
emotions   music
enron      text
medical    text
scene      image
tmc2007    text
yeast      biology

Table 2: Statistics of the datasets. (The numeric columns N, d, K, cardinality and density were not recovered; only the domains are shown.)

For statistical significance, all results reported in Section 5.2 were averaged over 30 independent runs. For each run, we randomly sampled 75% of the dataset for training and used the remaining data for testing. One third of the training set was reserved for validation. We compared four variants of the proposed method, namely PRAkEL_t, PRAkEL, PRAkEL-GW and PRAkEL-SW, with three types of methods: (a) labelset-related methods, including RAkEL (Tsoumakas and Vlahavas, 2007) and CS-RAkEL (Lo, 2013); (b) state-of-the-art CSMLC methods, including EPCC (Dembczynski et al., 2010, 2011, 2012) and CFT (Li and Lin, 2014); (c) a state-of-the-art cost-insensitive MLC method, ML-kNN (Zhang and Zhou, 2007). All hyper-parameters of the compared methods and the base classifiers were selected by grid search on the validation set. For our method and the labelset-related methods, the parameter $k$ was selected from $\{2, \ldots, 9\}$, and for each $k$, the maximum $M$ was fixed to $10K/k$. The ensemble size of EPCC was selected from $\{1, \ldots, 7\}$ for efficiency, and on datasets with more than 20 labels, the Monte Carlo sampling technique was employed with a sample size of 200 (Dembczynski et al., 2012).

For CFT, the number of internal iterations was selected from $\{2, \ldots, 8\}$, as suggested by the original authors; because of its efficiency issues, we restricted its maximum number of iterations to 4 on datasets with a large number of labels. For the base classifier of EPCC, we employed the logistic regression implemented in LIBLINEAR (Fan et al., 2008). For the methods requiring a regular binary or multi-class classifier, we used linear one-versus-all support vector machines (SVMs) implemented in LIBLINEAR. Our method was coupled with linear RED-OSSVR (Tu and Lin, 2010), which can be shown to be equivalent to one-versus-all SVMs for the cost functions $c_n(\hat{y}) = [\![\, \hat{y} \neq y_n \,]\!]$. The regularization parameter of the linear SVMs and RED-OSSVR was also selected by grid search on the validation set. The cost functions we considered in the experiments are all derived from loss functions, as explained in Section 2.

5.2. Results and Discussion

Table 3: Performance of each method (PRAkEL_t, PRAkEL, PRAkEL-GW, PRAkEL-SW, EPCC, CFT, RAkEL, ML-kNN) on the seven datasets in terms of Hamming loss (mean ± standard error). (Numerical entries were not recovered.)

Table 4: Performance of each method in terms of ranking loss (mean ± standard error). (Numerical entries were not recovered.)

Table 5: Performance of each method in terms of F1 loss (mean ± standard error). (Numerical entries were not recovered.)

Tables 3, 4 and 5 present the results of the four variants of our method, EPCC, CFT, RAkEL and ML-kNN in terms of Hamming, ranking and F1 loss. The best results for each dataset are marked in bold.

Comparison of Variants of PRAkEL

In this subsection, we draw a comparison between the four variants of the proposed method, namely PRAkEL_t, PRAkEL, PRAkEL-GW and PRAkEL-SW. We first compare PRAkEL_t and PRAkEL to understand the difference between using the true and the predicted label vectors as the reference label vectors. Recall that PRAkEL_t and PRAkEL are theoretically equivalent under Hamming and ranking loss; it is therefore not a coincidence that the results of the first two variants in Tables 3 and 4 are exactly the same. Table 5 shows that on all the datasets, PRAkEL has lower costs than PRAkEL_t in terms of F1 loss. We also present in Table 6 the results of the Student's t-test at a significance level of 0.05 on two pairs of variants. The comparison of PRAkEL and PRAkEL_t under F1 loss reveals that PRAkEL is significantly superior on five datasets. This demonstrates the benefits of exploiting previous predictions, and is also consistent with the theoretical results in Theorems 5 and 6. Thus, for the remaining experiments, the results of PRAkEL_t are not presented.

Loss function     PRAkEL vs. PRAkEL_t     PRAkEL-GW vs. PRAkEL     PRAkEL-SW vs. PRAkEL
Hamming loss      0/7/0                   0/5/2                    1/6/0
Ranking loss      0/7/0                   3/3/1                    3/4/0
F1 loss           5/2/0                   1/5/1                    2/5/0
Total             5/16/0                  4/13/4                   6/15/0

Table 6: Variants of PRAkEL versus other variants by the Student's t-test at a significance level of 0.05 (superior/comparable/inferior).

Figure 1: Training costs of PRAkEL, PRAkEL-GW and PRAkEL-SW in terms of F1 loss, with standard errors. (Plot not recovered.)

Table 7: Training costs of PRAkEL, PRAkEL-GW and PRAkEL-SW in terms of F1 loss (mean ± standard error). (Numerical entries were not recovered.)

Next, we compare the three weighting strategies, i.e., uniform, greedy and simple weighting. From Table 6, PRAkEL is overall competitive with PRAkEL-GW, although under ranking loss the performance of PRAkEL-GW is slightly better. In addition, from the last comparison we see that PRAkEL-SW is never outperformed by PRAkEL under these three loss functions. For Hamming loss, there is no significant difference between the performance of PRAkEL and PRAkEL-SW. For ranking loss and F1 loss, however, PRAkEL-SW performs slightly better than PRAkEL. Since the last two variants greedily minimize the training costs in every iteration, it is expected that their training costs are much lower than PRAkEL's. Table 7 and Figure 1, which show the training costs in terms of F1 loss, verify this deduction. We observe similar behavior under the other loss functions. The reason is that for PRAkEL-GW the weights of the classifiers are determined by an unconstrained optimization problem, while for PRAkEL-SW the weights are restricted to the candidate set. From a holistic point of view, the candidate set acts as a regularizer, which prevents PRAkEL-SW from overfitting.


More information

VBM683 Machine Learning

VBM683 Machine Learning VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

Introduction to Logistic Regression

Introduction to Logistic Regression Introduction to Logistic Regression Guy Lebanon Binary Classification Binary classification is the most basic task in machine learning, and yet the most frequent. Binary classifiers often serve as the

More information

Tufts COMP 135: Introduction to Machine Learning

Tufts COMP 135: Introduction to Machine Learning Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Logistic Regression Many slides attributable to: Prof. Mike Hughes Erik Sudderth (UCI) Finale Doshi-Velez (Harvard)

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

A Blended Metric for Multi-label Optimisation and Evaluation

A Blended Metric for Multi-label Optimisation and Evaluation A Blended Metric for Multi-label Optimisation and Evaluation Laurence A. F. Park 1 and Jesse Read 1 School of Computing, Engineering and Mathematics, Western Sydney University, Australia. lapark@scem.westernsydney.edu.au

More information

arxiv: v1 [cs.lg] 19 Jul 2017

arxiv: v1 [cs.lg] 19 Jul 2017 arxiv:1707.06142v1 [cs.lg] 19 Jul 2017 Luca Mossina luca.mossina@isae-supaero.fr Emmanuel Rachelson emmanuel.rachelson@isae-supaero.fr Département d Ingénierie des Systèmes Complexes (DISC) ISAE-SUPAERO

More information

Midterm Exam Solutions, Spring 2007

Midterm Exam Solutions, Spring 2007 1-71 Midterm Exam Solutions, Spring 7 1. Personal info: Name: Andrew account: E-mail address:. There should be 16 numbered pages in this exam (including this cover sheet). 3. You can use any material you

More information

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 13 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Multi-Class Classification Mehryar Mohri - Introduction to Machine Learning page 2 Motivation

More information

Consistency of Nearest Neighbor Methods

Consistency of Nearest Neighbor Methods E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study

More information

Classification of Ordinal Data Using Neural Networks

Classification of Ordinal Data Using Neural Networks Classification of Ordinal Data Using Neural Networks Joaquim Pinto da Costa and Jaime S. Cardoso 2 Faculdade Ciências Universidade Porto, Porto, Portugal jpcosta@fc.up.pt 2 Faculdade Engenharia Universidade

More information

Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss

Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss Krzysztof Dembczyński 1,3, Willem Waegeman 2, Weiwei Cheng 1,andEykeHüllermeier 1 1 Department

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/xilnmn Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

A first model of learning

A first model of learning A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) We observe the data where each Suppose we are given an ensemble of possible hypotheses / classifiers

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

Data Mining und Maschinelles Lernen

Data Mining und Maschinelles Lernen Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting

More information

CSCI567 Machine Learning (Fall 2018)

CSCI567 Machine Learning (Fall 2018) CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo U of Southern California Sep 12, 2018 September 12, 2018 1 / 49 Administration GitHub repos are setup (ask TA Chi Zhang for any issues) HW 1 is due

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Midterm Exam, Spring 2005

Midterm Exam, Spring 2005 10-701 Midterm Exam, Spring 2005 1. Write your name and your email address below. Name: Email address: 2. There should be 15 numbered pages in this exam (including this cover sheet). 3. Write your name

More information

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017 The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Multi-label Active Learning with Auxiliary Learner

Multi-label Active Learning with Auxiliary Learner Multi-label Active Learning with Auxiliary Learner Chen-Wei Hung and Hsuan-Tien Lin National Taiwan University November 15, 2011 C.-W. Hung & H.-T. Lin (NTU) Multi-label AL w/ Auxiliary Learner 11/15/2011

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Lecture 3: Multiclass Classification

Lecture 3: Multiclass Classification Lecture 3: Multiclass Classification Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only

More information

A Brief Introduction to Adaboost

A Brief Introduction to Adaboost A Brief Introduction to Adaboost Hongbo Deng 6 Feb, 2007 Some of the slides are borrowed from Derek Hoiem & Jan ˇSochman. 1 Outline Background Adaboost Algorithm Theory/Interpretations 2 What s So Good

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Multi-dimensional classification via a metric approach

Multi-dimensional classification via a metric approach Multi-dimensional classification via a metric approach Zhongchen Ma a, Songcan Chen a, a College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing 211106,

More information

Voting (Ensemble Methods)

Voting (Ensemble Methods) 1 2 Voting (Ensemble Methods) Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data Output class: (Weighted) vote of each classifier Classifiers

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 5: Multi-class and preference learning Juho Rousu 11. October, 2017 Juho Rousu 11. October, 2017 1 / 37 Agenda from now on: This week s theme: going

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018 From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

An Analytical Comparison between Bayes Point Machines and Support Vector Machines

An Analytical Comparison between Bayes Point Machines and Support Vector Machines An Analytical Comparison between Bayes Point Machines and Support Vector Machines Ashish Kapoor Massachusetts Institute of Technology Cambridge, MA 02139 kapoor@mit.edu Abstract This paper analyzes the

More information

ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth

ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth N-fold cross validation Instead of a single test-training split: train test Split data into N equal-sized parts Train and test

More information

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Pengtao Xie, Machine Learning Department, Carnegie Mellon University 1. Background Latent Variable Models (LVMs) are

More information

Week 5: Logistic Regression & Neural Networks

Week 5: Logistic Regression & Neural Networks Week 5: Logistic Regression & Neural Networks Instructor: Sergey Levine 1 Summary: Logistic Regression In the previous lecture, we covered logistic regression. To recap, logistic regression models and

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

An Empirical Study of Building Compact Ensembles

An Empirical Study of Building Compact Ensembles An Empirical Study of Building Compact Ensembles Huan Liu, Amit Mandvikar, and Jigar Mody Computer Science & Engineering Arizona State University Tempe, AZ 85281 {huan.liu,amitm,jigar.mody}@asu.edu Abstract.

More information

Part of the slides are adapted from Ziko Kolter

Part of the slides are adapted from Ziko Kolter Part of the slides are adapted from Ziko Kolter OUTLINE 1 Supervised learning: classification........................................................ 2 2 Non-linear regression/classification, overfitting,

More information

A REVIEW ARTICLE ON NAIVE BAYES CLASSIFIER WITH VARIOUS SMOOTHING TECHNIQUES

A REVIEW ARTICLE ON NAIVE BAYES CLASSIFIER WITH VARIOUS SMOOTHING TECHNIQUES Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 10, October 2014,

More information

Change point method: an exact line search method for SVMs

Change point method: an exact line search method for SVMs Erasmus University Rotterdam Bachelor Thesis Econometrics & Operations Research Change point method: an exact line search method for SVMs Author: Yegor Troyan Student number: 386332 Supervisor: Dr. P.J.F.

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 4: Vector Data: Decision Tree Instructor: Yizhou Sun yzsun@cs.ucla.edu October 10, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification Clustering

More information

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE March 28, 2012 The exam is closed book. You are allowed a double sided one page cheat sheet. Answer the questions in the spaces provided on

More information

Bagging. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL 8.7

Bagging. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL 8.7 Bagging Ryan Tibshirani Data Mining: 36-462/36-662 April 23 2013 Optional reading: ISL 8.2, ESL 8.7 1 Reminder: classification trees Our task is to predict the class label y {1,... K} given a feature vector

More information

Lecture 6: September 22

Lecture 6: September 22 CS294 Markov Chain Monte Carlo: Foundations & Applications Fall 2009 Lecture 6: September 22 Lecturer: Prof. Alistair Sinclair Scribes: Alistair Sinclair Disclaimer: These notes have not been subjected

More information