Optimization of Classifier Chains via Conditional Likelihood Maximization


Lu Sun, Mineichi Kudo
Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan
Corresponding author addresses: sunlu@main.ist.hokudai.ac.jp (Lu Sun), mine@main.ist.hokudai.ac.jp (Mineichi Kudo)

Abstract

Multi-label classification associates an unseen instance with multiple relevant labels. In recent years, a variety of methods have been proposed to handle multi-label problems. Classifier chains is one of the most popular multi-label methods because of its efficiency and simplicity. In this paper, we consider optimizing classifier chains from the viewpoint of conditional likelihood maximization. In the proposed unified framework, classifier chains can be optimized in either or both of two aspects: label correlation modeling and multi-label feature selection. We show that previous classifier chains algorithms are special cases of the unified framework, and that previous information theoretic multi-label feature selection algorithms are also obtained from it under different assumptions on the feature and label spaces. Based on these analyses, we propose a novel multi-label method, k-dependence classifier chains with label-specific features, and demonstrate its effectiveness.

Keywords: multi-label classification, classifier chains, conditional likelihood maximization, k-dependence Bayesian network, multi-label feature selection.

1. Introduction

Unlike traditional single-label classification, where each instance is associated with only one label, Multi-Label Classification (MLC) refers to problems that assign multiple labels to a single test instance. MLC can be seen in a wide range of real-world applications such as text categorization, semantic image classification, bioinformatics analysis and video annotation. In fact, MLC is ubiquitous in real-world problems.

For example, a news article is possibly relevant to multiple topics, like science, technology, economics and politics, and a single image is probably associated with a set of semantic concepts, like sky, sea, field and building. To tackle such multi-label problems, various MLC methods have been proposed. The existing MLC methods fall into two broad categories: problem transformation and algorithm adaptation [1]. As a convenient and straightforward way to perform MLC, the problem transformation strategy transforms an MLC problem into one or a set of single-label classification problems, and learns one or a family of classifiers for modeling the single-label memberships. Most popular baseline MLC methods, such as Binary Relevance (BR) [2], Calibrated Label Ranking (CLR) [3], and Label Powerset (LP) [4], belong to this strategy. The algorithm adaptation strategy adapts conventional machine learning algorithms to the multi-label setting. Various MLC methods adopting one of the above two strategies have been developed and have succeeded in dealing with multi-label problems.

Classifier Chains (CC) [5] is one of the most promising MLC methods following the problem transformation strategy. Originating from BR, which simply ignores label correlations, CC constructs a chain structure on the labels and determines the presence/absence of the current label conditioned on the previously determined labels. CC succeeds in modeling label correlations and achieves higher classification accuracy at a computational expense similar to that of BR. Although CC-based methods have achieved much success in various applications [5, 6, 7], further improvement in classification accuracy is still required. Here we seek to improve CC in two aspects: label correlation modeling and multi-label feature selection. The intuition behind this idea is that not all of the previously determined labels are necessary for the decision on the current label (the necessity of limiting label correlations), and that irrelevant and redundant features are usually harmful to the performance of CC (the necessity of feature selection).

In this paper, we propose a unified framework comprising both label correlation modeling and multi-label feature selection via conditional likelihood maximization for MLC. The contributions of this work are three-fold. First, we propose a general framework taking label correlation modeling and multi-label feature selection into account via conditional likelihood maximization. Second, the k-dependence classifier chains method is proposed based on greedy iterative optimization of a sub-problem of likelihood maximization.

Third, a general information theoretic feature selection method is proposed for MLC, where three terms on relevancy, redundancy and label correlations are considered for feature subset selection.

The rest of this paper is organized as follows. Section 2 discusses related work, mainly focusing on CC-based methods and information theoretic feature selection. Section 3 illustrates the unified framework for MLC via conditional likelihood maximization, and derives two sub-problems: model selection and multi-label feature selection. Sections 4 and 5 present the solutions for model selection and multi-label feature selection, respectively. Section 6 summarizes several theoretical findings made during the development of the proposed method. Section 7 discusses implementation issues. Section 8 introduces the experiments and reports the results. Finally, Section 9 concludes this paper and discusses further research.

2. Related works

Previous efforts on MLC have addressed various viewpoints, such as label correlation modeling [4, 5], loss function analysis [6, 8, 9], large-scale learning [10, 11] and dimension reduction [12, 13, 14]. In this paper we concentrate mainly on two aspects: label correlation modeling and dimensionality reduction. It has been shown in a number of studies [3, 4, 5] that modeling label correlations is crucial for accurate classification. On the other hand, various dimension reduction algorithms, including feature selection [15, 16] and feature extraction [12, 17], have been employed in MLC in order to simplify the learning phase and overcome the curse of dimensionality.

In order to capture label correlations, Classifier Chains (CC) based methods [5, 6, 7] have been proposed with tractable computational complexity. CC-based methods originate from Binary Relevance (BR) [2], which simply decomposes a multi-label problem into a set of binary classification problems, totally ignoring label correlations. In this sense, BR is actually a Hamming loss risk minimizer [8]. In CC [5], label correlations are expressed in an ordered chain of labels. In the learning phase, according to a predefined chain order, it builds a set of binary classifiers such that each classifier predicts the correct value of a target label by referring to the correct values of all the preceding labels in addition to the features.

In the prediction phase, it predicts in turn the value of the target label using the previously estimated values of its parent labels as extra features. However, the performance of CC is sensitive to the chain order, and it suffers from error propagation in the prediction phase. Several efforts have been made to overcome the limitations of CC. Bayesian Classifier Chains (BCC) [7] introduces a directed tree as the probabilistic structure over labels. The directed tree is established by randomly choosing a label as its root and by assigning directions to the remaining edges. It shares the same model with CC, but restricts the number of parent labels to at most one, which limits its ability to express label correlations. Probabilistic Classifier Chains (PCC) [6] aims to solve the error propagation problem, providing better estimates than CC at the expense of higher processing time. Although PCC shares its learning model with CC, it chooses the best prediction by searching for the Maximum A Posteriori (MAP) assignment in an exhaustive manner. The exponential cost of PCC in prediction limits its application. To make prediction tractable for PCC, PCC-beam [18] applies beam search to find an approximate MAP assignment of labels for a test instance. MCC [19] utilizes a Monte Carlo scheme to find a suboptimal chain order and to perform efficient inference of the MAP assignment in the learning and prediction phases, respectively. In [20], dynamic programming is used to search for the globally optimal chain order of CC; in addition, to speed up the search procedure, a greedy approach is proposed to find a locally optimal CC. In a recent work [21], the Classifier Trellis (CT) method is proposed for scalable MLC by extending the one-dimensional chain of CC to a two-dimensional trellis structure. CT encodes label correlations in the trellis structure, where each label depends only on its adjacent labels. In this way, CT limits the number of parent labels, and thus becomes scalable to MLC problems with a large number of labels.

In terms of Feature Space Dimensionality Reduction (FS-DR), a variety of traditional supervised dimension reduction approaches have been specifically extended to the setting of MLC. In [22], a supervised Multi-label Latent Semantic Indexing (MLSI) approach is developed to map the input features into a subspace while preserving the label information. By maximizing the feature-label dependence under the Hilbert-Schmidt independence criterion, Multi-label Dimensionality reduction via Dependence Maximization (MDDM) [12] derives a closed-form solution to efficiently find the projection into the feature subspace.

In addition, several traditional dimension reduction techniques, such as Canonical Correlation Analysis (CCA) and Linear Discriminant Analysis (LDA), have been extended to handle the MLC problem [23, 24]. In the field of feature selection, an information theoretic approach for Label Powerset (LP) has been developed in [25]. The method introduces a nearest neighbor estimator for computing mutual information, and applies pruned LP to control the problem size. The multivariate mutual information criterion is used in [26] to select useful features; due to its computational inefficiency, an approximate solution is proposed to estimate the multivariate mutual information. In [15], the authors extend the feature selection framework in [27] to handle two MLC decomposition methods, BR and LP, which achieves significant improvement compared with several multi-label feature selection methods. In fact, all the methods mentioned above can be regarded as global FS-DR methods, since they attempt to find a single feature subspace shared by all the labels. However, it is more reasonable to think that each label holds a specific supporting feature subset. To overcome this limitation, local FS-DR methods [17, 16] have been proposed to find label-specific features. In [17], Label-specIfic FeaTures (LIFT) are extracted by conducting cluster analysis on the positive and negative instances of each label. The Learning Label-Specific Features (LLSF) method is proposed in [16]; LLSF selects label-specific features by optimizing a least squares problem with constraints on label correlations and feature sparsity.

3. The unified framework for MLC via likelihood maximization

3.1. Multi-label classification

In the scenario of MLC, an observation (x, y) consists of a d-dimensional feature vector x and a q-dimensional target label vector y, drawn from the underlying random variables $X = (X_1, \ldots, X_d) \in \mathbb{R}^d$ and $Y = (Y_1, \ldots, Y_q) \in \{0,1\}^q$, respectively. For an observation of labels $y = (y_1, \ldots, y_q)$, $y_j = 1$ if label j is relevant to the instance, and $y_j = 0$ otherwise, $j = 1, \ldots, q$. The task of MLC is to find an optimal classifier $h: \mathbb{R}^d \to \{0,1\}^q$, which assigns a label vector $\hat{y} = h(x)$ to each instance x such that h minimizes a loss function between $\hat{y}$ and y. For a loss function $L(Y, h(X))$, the optimal classifier $h^*$ is

$$h^* = \arg\min_h E_{xy}\, L(Y, h(X)). \quad (1)$$

Specifically, given the subset 0-1 loss $L_s(y, \hat{y}) = 1_{[y \neq \hat{y}]}$, where $1_{[\cdot]}$ denotes the indicator function, Eq. (1) can be rewritten in a pointwise way,

$$\hat{y} = h^*(x) = \arg\max_y p(y|x). \quad (2)$$

Here we use p(Y|X) to represent the conditional probability distribution of the label variables Y given the feature variables X. According to (2), the following two capabilities are necessary for a multi-label classifier h optimized for subset 0-1 loss. In the training phase, it should capture the underlying probability distribution p(Y|X); in the testing phase, given a test instance $\hat{x}$, the Maximum A Posteriori (MAP) assignment $\hat{y}$ should be obtained by solving the optimization problem $\max_y p(y|\hat{x})$. In this study, we focus on training a multi-label classifier with the first capability, i.e., successfully modeling p(Y|X), in order to learn an approximately optimal classifier in subset 0-1 loss.

3.2. A conditional likelihood view on MLC

According to (2), the objective of MLC is actually to approximate the underlying conditional probability p(Y|X). In this paper, we assume that p(Y|X) can be approximated by a Bayesian network B of relatively simple structure with conditional probability p_B(Y|X), which optimally captures the label correlations. In addition, we further assume that p_B(Y|X) can be modeled by a subset of X, i.e., the relevant features $X_\theta \subseteq X$. A d-dimensional binary vector θ is adopted, with 1 indicating a selected feature and 0 an unselected one. Hence, $X_\theta$ and $X_{\bar\theta}$ denote the selected and unselected feature subsets, respectively, and thus $X = \{X_\theta, X_{\bar\theta}\}$ and $p_B(Y|X) = p_B(Y|X_\theta)$. Last, a predictive model $f(Y|X_\theta, \tau)$ is built to model $p_B(Y|X_\theta)$, where τ denotes the parameters of f used to predict y. In this way, we find the optimal Bayesian network B, identify the relevant feature subset $X_\theta$, and then build the predictive model f.

Given a sample of N observations $D = \{(x^i, y^i)\,|\, i = 1, \ldots, N\}$ drawn from an underlying i.i.d. process $p: X \to Y$, the conditional likelihood L of the observations D given the parameters {B, θ, τ} becomes

$$L(B, \theta, \tau \,|\, D) = \prod_{i=1}^{N} f(y^i \,|\, x^i_\theta, \tau). \quad (3)$$

Therefore, our objective becomes

$$\{B^*, \theta^*, \tau^*\} = \arg\max_{B, \theta, \tau} L(B, \theta, \tau \,|\, D) = \arg\max_{B, \theta, \tau} \prod_{i=1}^{N} f(y^i \,|\, x^i_\theta, \tau). \quad (4)$$
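Before decomposing (4), it may help to see numerically why (2) targets the joint MAP assignment rather than per-label marginal modes. The following is a minimal Python sketch on a made-up distribution p(y|x) over q = 2 labels; all numbers are hypothetical and only serve to illustrate the point, not taken from the paper.

```python
import itertools

# Hypothetical conditional distribution p(y | x) for one fixed instance x
# over q = 2 labels (made-up numbers).
p = {(0, 0): 0.00, (0, 1): 0.35, (1, 0): 0.40, (1, 1): 0.25}
labels = list(itertools.product([0, 1], repeat=2))

# Risk minimizer under subset 0-1 loss, Eq. (2): the joint MAP assignment.
y_map = max(labels, key=lambda y: p[y])          # -> (1, 0)

# Thresholding each label's marginal at 0.5 (a BR-style prediction).
m1 = sum(p[y] for y in labels if y[0] == 1)      # p(y1 = 1 | x) = 0.65
m2 = sum(p[y] for y in labels if y[1] == 1)      # p(y2 = 1 | x) = 0.60
y_marginal = (int(m1 > 0.5), int(m2 > 0.5))      # -> (1, 1)

# Expected subset 0-1 loss of each prediction: 1 - p(prediction | x).
print(y_map, 1 - p[y_map])                       # approx. 0.60
print(y_marginal, 1 - p[y_marginal])             # approx. 0.75
```

The marginal-mode prediction (1, 1) incurs a higher expected subset 0-1 loss than the joint MAP (1, 0), which is why the framework focuses on modeling p(Y|X) as a whole.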

For convenience, we use the average log-likelihood l instead of L:

$$l = \frac{1}{N} \sum_{i=1}^{N} \log f(y^i \,|\, x^i_\theta, \tau). \quad (5)$$

By taking $p_B(Y|X_\theta)$ into consideration, (5) is rewritten as

$$l = \frac{1}{N} \sum_{i=1}^{N} \log \left[ \frac{f(y^i|x^i_\theta, \tau)}{p_B(y^i|x^i_\theta)} \cdot \frac{p_B(y^i|x^i_\theta)}{p_B(y^i|x^i)} \cdot \frac{p_B(y^i|x^i)}{p(y^i|x^i)} \cdot p(y^i|x^i) \right]. \quad (6)$$

By the law of large numbers, l approaches in probability the expected version

$$E_{xy} \left\{ \log \left[ \frac{f(Y|X_\theta, \tau)}{p_B(Y|X_\theta)} \cdot \frac{p_B(Y|X_\theta)}{p_B(Y|X)} \cdot \frac{p_B(Y|X)}{p(Y|X)} \cdot p(Y|X) \right] \right\}. \quad (7)$$

We negate the above formula so as to minimize

$$-l \approx E_{xy} \left\{ \log \frac{p_B(Y|X_\theta)}{f(Y|X_\theta, \tau)} \right\} + E_{xy} \left\{ \log \frac{p_B(Y|X)}{p_B(Y|X_\theta)} \right\} + E_{xy} \left\{ \log \frac{p(Y|X)}{p_B(Y|X)} \right\} - E_{xy} \left\{ \log p(Y|X) \right\}. \quad (8)$$

Since $X = \{X_\theta, X_{\bar\theta}\}$, the second term can be developed as follows:

$$E_{xy} \left\{ \log \frac{p_B(Y|X)}{p_B(Y|X_\theta)} \right\} = E_{xy} \left\{ \log \frac{p_B(Y|X_\theta, X_{\bar\theta})}{p_B(Y|X_\theta)} \right\} = E_{xy} \left\{ \log \frac{p_B(X_{\bar\theta}, Y|X_\theta)}{p_B(X_{\bar\theta}|X_\theta)\, p_B(Y|X_\theta)} \right\} = I_{p_B}(X_{\bar\theta}; Y \,|\, X_\theta), \quad (9)$$

where I(X; Y|Z) denotes the mutual information between X and Y conditioned on Z,

$$I(X; Y|Z) = \sum_{x, y, z} p(x, y, z) \log \frac{p(x, y|z)}{p(x|z)\, p(y|z)} = E_{xyz} \left\{ \log \frac{p(X, Y|Z)}{p(X|Z)\, p(Y|Z)} \right\}. \quad (10)$$

Using the K-L divergence [28],

$$D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = E_x \left\{ \log \frac{p(x)}{q(x)} \right\}, \quad (11)$$

(8) is finally rewritten as

$$-l \approx D_{KL}(p_B \,\|\, f) + I_{p_B}(X_{\bar\theta}; Y \,|\, X_\theta) + D_{KL}(p \,\|\, p_B) + H(Y|X). \quad (12)$$

From (12), our objective function can be decomposed into four different terms.

1. $D_{KL}(p_B \| f)$: the K-L divergence between the conditional probability $p_B(Y|X_\theta)$ and the predictive model $f(Y|X_\theta, \tau)$, which measures how well f approximates p_B given the selected feature subset $X_\theta$. The parameter τ is optimized by the predefined predictive model, given the optimized parameters B and θ:

$$\tau^* = \arg\min_\tau D_{KL}(p_B(Y|X_\theta) \,\|\, f(Y|X_\theta, \tau)). \quad (13)$$

Thus (13) is the parameter selection problem. Distinct predictive models produce different τ*; it depends on our choice of the baseline model f.

2. $I_{p_B}(X_{\bar\theta}; Y|X_\theta)$: the mutual information between the unselected features $X_{\bar\theta}$ and the labels Y, conditioned on the selected features $X_\theta$. This term depends on both the approximate Bayesian network B and the selected features $X_\theta$. Given the optimized B, the optimal θ can be obtained as

$$\theta^* = \arg\min_\theta I_{p_B}(X_{\bar\theta}; Y \,|\, X_\theta). \quad (14)$$

In fact (14) is the multi-label feature selection problem.

3. $D_{KL}(p \| p_B)$: the K-L divergence between the underlying probability p(Y|X) and the approximate probability $p_B(Y|X)$ modeled by a Bayesian network, which measures how well p_B approximates p. This term depends only on the Bayesian network B, hence we have

$$B^* = \arg\min_B D_{KL}(p(Y|X) \,\|\, p_B(Y|X)). \quad (15)$$

Note that (15) is actually the model selection problem, which aims to find the optimal Bayesian network to capture label correlations.

4. H(Y|X): the conditional entropy of the labels Y given the features X. This term represents the uncertainty in the labels when all features are known, which is a bound on the Bayes error [29]. Since it is independent of all parameters, we can remove this term from our optimization problem.

Based on the above discussion, we take the following strategy.

Optimization Strategy. We address the problem of multi-label classification with feature selection in three stages: first learning the label space structure, then selecting a useful feature subset, and last building the predictive model.

[Figure 1: The framework of MLC via conditional likelihood maximization. The overall objective (4) decomposes via (12) into three sub-problems: model selection (19), feature selection (35) and parameter selection (13).]

Fig. 1 shows the framework of MLC via conditional likelihood maximization following the Optimization Strategy. In this strategy, instead of directly addressing the optimization problem (4), we solve the sub-problems (15), (14) and (13) independently. We discuss how the sub-problems (15) and (14) can be solved in Sections 4 and 5, respectively. In addition, we will see that some popular MLC methods and multi-label feature selection algorithms are embedded in this framework as special cases of the sub-problems under appropriate assumptions on the label or feature space.

4. Model selection

Under the Optimization Strategy, to model the underlying probability distribution p(Y|X), the optimal Bayesian network of a special type is obtained by optimizing

$$B^* = \arg\max_B L(B|D) = \arg\min_B D_{KL}(p(Y|X) \,\|\, p_B(Y|X)). \quad (16)$$

Here we limit B to the family of k-Dependence Bayesian (kDB) networks:

$$p_B(Y|X) = \prod_{j=1}^{q} p(Y_j \,|\, \mathrm{Pa}_j, X), \quad (17)$$

where Pa_j represents the parents of label Y_j, with $|\mathrm{Pa}_j| = \min\{j-1, k\}$ and $k \in [0, q-1]$.

Note that kDB limits the number of parents to at most k, in contrast with the chain rule in the canonical order of $Y_1, Y_2, \ldots, Y_q$:

$$p(Y|X) = \prod_{j=1}^{q} p(Y_j \,|\, Y_1, Y_2, \ldots, Y_{j-1}, X). \quad (18)$$

The optimization problem (16) has been addressed in our previous work [30] as a theorem.

Theorem 1. To approximate a conditional probability p(Y|X) within a certain family of Bayesian networks, the optimal Bayesian network in K-L divergence is obtained if the sum of the conditional mutual information between each variable of Y and its parent variables, given the observation X, is maximized.

Proof. The optimization problem (15) can be developed as follows:

$$B^* = \arg\min_B E_{xy} \left\{ \log \frac{p(Y|X)}{p_B(Y|X)} \right\} = \arg\max_B E_{xy} \{\log p_B(Y|X)\} + H(Y|X).$$

H(Y|X) can be omitted due to its independence of B, thus we have

$$B^* = \arg\max_B E_{xy} \{\log p_B(Y|X)\} = \arg\max_B \sum_{j=1}^{q} E_{xy} \{\log p(Y_j \,|\, \mathrm{Pa}_j, X)\} = \arg\max_B \sum_{j=1}^{q} E_{xy} \left\{ \log \frac{p(Y_j, \mathrm{Pa}_j|X)}{p(Y_j|X)\, p(\mathrm{Pa}_j|X)} \, p(Y_j|X) \right\} = \arg\max_B \sum_{j=1}^{q} I(Y_j; \mathrm{Pa}_j|X) - \sum_{j=1}^{q} H(Y_j|X).$$

Since $\sum_{j=1}^{q} H(Y_j|X)$ is independent of B, we reach our conclusion:

$$B^* = \arg\min_B D_{KL}(p(Y|X) \,\|\, p_B(Y|X)) = \arg\max_B \sum_{j=1}^{q} I(Y_j; \mathrm{Pa}_j|X). \quad (19)$$

The objective of (16) is to encode label correlations in a Bayesian network. In [8], a formal definition of two types of label correlations, namely conditional and marginal dependence, is given.

Conditional dependence captures the dependence of labels given a specific instance, while marginal dependence can be considered as the expected dependence averaged over all instances [8]. According to this definition, the optimal Bayesian network in (19) obtained by Theorem 1 actually models conditional label correlations. For more information on label dependence in MLC, see [8]. From Theorem 1, the following corollary is derived.

Corollary 1. Given a specific order of labels, the probability p_B modeled by a k-dependence Bayesian network optimally approximates p(Y|X) in terms of K-L divergence if the parents of each label Y_j satisfy $|\mathrm{Pa}_j| = j-1$, $j = 1, \ldots, q$.

Proof. Assume that the labels are ordered as $Y_1, Y_2, \ldots, Y_q$, so that the possible parents of label Y_j are restricted to its previous labels $S_j = \{Y_1, Y_2, \ldots, Y_{j-1}\}$, $j = 1, \ldots, q$. Hence, the optimization problem (19) can be solved independently for each label, and is equivalent to optimally selecting the parent labels from the previous labels:

$$B^* = \arg\max_B I(Y_j; \mathrm{Pa}_j|X), \quad j = 1, \ldots, q. \quad (20)$$

Let $\overline{\mathrm{Pa}}_j$ be $S_j \setminus \mathrm{Pa}_j$. Based on the chain rule of mutual information, we have

$$I(Y_j; S_j|X) = I(Y_j; \mathrm{Pa}_j|X) + I(Y_j; \overline{\mathrm{Pa}}_j \,|\, \mathrm{Pa}_j, X). \quad (21)$$

Since conditional mutual information is always non-negative, i.e., $I(Y_j; \overline{\mathrm{Pa}}_j|\mathrm{Pa}_j, X) \geq 0$, we have $I(Y_j; S_j|X) \geq I(Y_j; \mathrm{Pa}_j|X)$. Hence, (20) is optimized by treating all previous labels as the parents of Y_j, i.e., $\mathrm{Pa}_j = S_j$. In this way, a fully-connected Bayesian network is chosen for B*, thus $|\mathrm{Pa}_j| = j-1$, $j = 1, \ldots, q$.

4.1. Review of CC-based methods

For the Classifier Chains (CC) and Probabilistic Classifier Chains (PCC) methods, given a predefined chain order, the classifier for each label is trained by taking the previous labels as extra features. In this sense, a fully-connected Bayesian network is constructed by CC and PCC according to a particular chain order. The difference between CC and PCC is that, in the testing phase, CC finds its prediction by the greedy search

$$\hat{y}_j = \arg\max_{y_j} p(y_j \,|\, \hat{\mathrm{pa}}_j, \hat{x}), \quad j = 1, \ldots, q. \quad (22)$$

In contrast, PCC aims to find the MAP assignment by searching the $2^q$ paths according to (2), resulting in an exponential complexity in q in the testing phase [6]. Thus PCC is actually a risk minimizer in terms of subset 0-1 loss [31], while CC can be considered as a deterministic approximation to PCC.
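To make the chain training and the greedy prediction in (22) concrete, here is a minimal Python sketch. It uses scikit-learn's LogisticRegression as the base binary learner purely for illustration (the experiments in Section 8 use a linear SVM instead), and the data and chain order below are random placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_cc(X, Y, order):
    """Train one binary classifier per label, each using the true values of the
    preceding labels (in the given chain order) as extra input features."""
    models = []
    for pos, j in enumerate(order):
        parents = list(order[:pos])                       # all preceding labels
        Xj = np.hstack([X, Y[:, parents]]) if parents else X
        models.append(LogisticRegression(max_iter=1000).fit(Xj, Y[:, j]))
    return models

def predict_cc(models, X, order, q):
    """Greedy prediction, Eq. (22): each label is predicted from the features
    plus the previously *predicted* labels along the chain."""
    Y_hat = np.zeros((X.shape[0], q), dtype=int)
    for pos, j in enumerate(order):
        parents = list(order[:pos])
        Xj = np.hstack([X, Y_hat[:, parents]]) if parents else X
        Y_hat[:, j] = models[pos].predict(Xj)
    return Y_hat

# Usage with random placeholder data (100 instances, 10 features, 4 labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Y = (rng.random(size=(100, 4)) < 0.3).astype(int)
order = [0, 1, 2, 3]
models = train_cc(X, Y, order)
print(predict_cc(models, X, order, q=4).shape)
```

Replacing the greedy pass by a search over all $2^q$ label vectors for the highest estimated joint probability would give PCC-style MAP prediction instead of CC.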

[Figure 2: Graphical models of CC-based methods in the order Y1 → Y2 → Y3 → Y4: (a) BR, (b) CC and PCC, (c) BCC, (d) kCC (k = 2).]

The predictions of PCC ($\hat{y}^{pcc}$) and CC ($\hat{y}^{cc}$) are equal if the conditional probability of the MAP assignment ($\hat{y}^{map}$) is more than 0.5 [6], i.e., $\hat{y}^{cc} = \hat{y}^{pcc}$ if $p(\hat{y}^{map}|\hat{x}) > 0.5$. As stated in Sec. 3.1, this study focuses on modeling p(Y|X) rather than finding the MAP assignment. The following corollary is introduced for four CC-based methods, CC, kCC, BCC and BR, which conduct the simple greedy search in the testing phase. Based on Corollary 1, the following corollary on these CC-based methods is derived.

Corollary 2. Given the same chain order of q labels, in the ideal case, classifier chains asymptotically outperforms k-dependence classifier chains in terms of subset 0-1 loss when k < q-1. Similarly, k1-dependence classifier chains asymptotically outperforms k2-dependence classifier chains in subset 0-1 loss for 0 ≤ k2 < k1 ≤ q-1.

Proof. Based on Corollary 1, given a chain order, classifier chains optimally models p(Y|X) by a fully-connected Bayesian network. Therefore, according to (2), CC is in fact the best approximation of the risk minimizer in subset 0-1 loss among the four CC-based methods. Moreover, for k-dependence classifier chains, the K-L divergence between p_B(Y|X) and p(Y|X) always decreases as the value of k increases. According to (2) again, we reach our conclusion.

Here we note that CC, BCC and BR can all be viewed as special cases of kCC by setting the value of k to q-1, 1 and 0, respectively.

CC approximates the underlying distribution p(Y|X) better than Bayesian Classifier Chains (BCC) and Binary Relevance (BR); therefore, by (2), CC outperforms BCC and BR in subset 0-1 loss in the asymptotic case. According to Corollary 2, given the same chain order, as an asymptotic analysis we have the following order in performance:

$$\mathrm{PCC} \succeq \mathrm{CC} \succeq k\mathrm{CC} \succeq \mathrm{BCC} \succeq \mathrm{BR}, \quad 1 < k < q-1. \quad (23)$$

Here $A \succeq B$ means that A asymptotically outperforms B in subset 0-1 loss. It is worth noting that PCC is included in (23) since it is the exact optimizer of subset 0-1 loss. Fig. 2 shows the probabilistic graphical models of the CC-based methods over four labels in the chain order $Y_1 \to Y_2 \to Y_3 \to Y_4$.

4.2. k-dependence Bayesian network for CC

As discussed above, using all preceding labels, $\mathrm{Pa}_j = S_j$, is optimal for approximating p(Y|X) in a given chain order. However, this is not always true in real-world problems with a finite number of instances. It is often possible to build a better model by eliminating irrelevant and redundant parent labels. Successful elimination of such useless parents simplifies the learning model and gives better performance. In this study, therefore, we propose to use the k-dependence Bayesian network [32], where at most k preceding labels are adopted as the parents of a label. As shown in the last section, given a chain order, the optimization problem (19) can be solved independently for each label. Hence, it suffices to solve the problem

$$B^* = \arg\max_B L(B|D) = \arg\max_B I(Y_j; \mathrm{Pa}_j|X), \quad j = 1, \ldots, q, \quad (24)$$

where $\mathrm{Pa}_j \subseteq S_j$ and $|\mathrm{Pa}_j| \leq k$. However, the calculation of conditional mutual information is very costly for high-dimensional feature vectors. Thus, we find an approximation for $I(Y_j; \mathrm{Pa}_j|X)$, relying on some appropriate assumptions.

Theorem 2. $\arg\max_B I(Y_j; \mathrm{Pa}_j) = \arg\max_B I(Y_j; \mathrm{Pa}_j|X)$, if the label Y_j is independent of the feature vector X conditioned on the parent labels Pa_j.

Proof. According to the fact that $I(A;B|C) - I(A;B) = I(A;C|B) - I(A;C)$, we have

$$I(Y_j; \mathrm{Pa}_j|X) = I(Y_j; \mathrm{Pa}_j) + I(Y_j; X|\mathrm{Pa}_j) - I(Y_j; X). \quad (25)$$

If the conditional independence assumption holds, the second term of (25) vanishes. In addition, the third term is independent of the Bayesian network. Thus we have the consistency.

To simplify the computation, a forward parent label selection algorithm is developed here for the optimization problem $B^* = \arg\max_B I(Y_j; \mathrm{Pa}_j)$. Since

$$I(Y_j; \mathrm{Pa}_j \cup Y_l) = I(Y_j; \mathrm{Pa}_j) + I(Y_j; Y_l \,|\, \mathrm{Pa}_j), \quad (26)$$

starting from $\mathrm{Pa}_j = \emptyset$, we can update Pa_j by adding the $Y_l \notin \mathrm{Pa}_j$ such that

$$Y_l^* = \arg\max_{Y_l \notin \mathrm{Pa}_j} I(Y_l; Y_j \,|\, \mathrm{Pa}_j), \quad (27)$$

until $|\mathrm{Pa}_j|$ reaches a predefined number. $I(Y_l; Y_j|\mathrm{Pa}_j)$ is further developed as

$$I(Y_l; Y_j|\mathrm{Pa}_j) = I(Y_l; Y_j) + I(Y_j; \mathrm{Pa}_j|Y_l) - I(Y_j; \mathrm{Pa}_j) = I(Y_l; Y_j) + H(\mathrm{Pa}_j|Y_l) - H(\mathrm{Pa}_j|Y_j, Y_l) - I(Y_j; \mathrm{Pa}_j). \quad (28)$$

Parent Independence Assumption. For the target label Y_j, a preceding label $Y_k \in \mathrm{Pa}_j$ and $Y_l \notin \mathrm{Pa}_j$, we assume:

$$p(\mathrm{Pa}_j|Y_l) = \prod_{Y_k \in \mathrm{Pa}_j} p(Y_k|Y_l), \quad (29)$$

$$p(\mathrm{Pa}_j|Y_j, Y_l) = \prod_{Y_k \in \mathrm{Pa}_j} p(Y_k|Y_j, Y_l). \quad (30)$$

These assumptions require that the parent labels are conditionally independent given an unselected parent Y_l (29), or given Y_l together with the target label Y_j (30). Under this assumption, we have the following:

$$I(Y_l; Y_j|\mathrm{Pa}_j) = I(Y_l; Y_j) + \sum_{Y_k \in \mathrm{Pa}_j} H(Y_k|Y_l) - \sum_{Y_k \in \mathrm{Pa}_j} H(Y_k|Y_j, Y_l) - I(Y_j; \mathrm{Pa}_j) = I(Y_l; Y_j) + \sum_{Y_k \in \mathrm{Pa}_j} I(Y_k; Y_j|Y_l) - I(Y_j; \mathrm{Pa}_j). \quad (31)$$

Substituting (31) into (27) and removing the last term, which is independent of Y_l, we have

$$Y_l^* = \arg\max_{Y_l \notin \mathrm{Pa}_j} I(Y_l; Y_j) + \sum_{Y_k \in \mathrm{Pa}_j} I(Y_j; Y_k|Y_l). \quad (32)$$

In (32), the first term captures the relevance between the unselected parent label Y_l and the target label Y_j, while the second term models the mutual information between each selected label Y_k and the target label Y_j, conditioned on the unselected label Y_l. Since $I(Y_j; Y_k|Y_l) = I(Y_j; Y_k) + I(Y_l; Y_k|Y_j) - I(Y_l; Y_k)$, the second term actually captures both the conditional redundancy $I(Y_l; Y_k|Y_j)$ and the redundancy $I(Y_l; Y_k)$. This means that the parent label with the best trade-off between relevancy and redundancy is selected. It is worth noting that in [27] the authors reach a similar conclusion for information theoretic feature selection; the difference is that here our objective is to perform parent label selection to find the optimal k-dependence Bayesian network given a chain order.
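A small sketch of the forward parent-label selection loop of (27), scored with the criterion (32), may make the procedure clearer. It uses plug-in mutual information estimates from scikit-learn (mutual_info_score) on binary label columns; the helper names and the random data are ours, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mi(a, b, c):
    """Plug-in estimate of I(A; B | C) for discrete arrays, computed as the
    C-weighted average of per-stratum mutual information."""
    total = 0.0
    for val in np.unique(c):
        mask = (c == val)
        total += mask.mean() * mutual_info_score(a[mask], b[mask])
    return total

def select_parents(Y, j, candidates, k):
    """Greedy forward parent-label selection for label j, Eqs. (27) and (32):
    score each candidate Y_l by I(Y_l; Y_j) + sum_{Y_p in Pa_j} I(Y_j; Y_p | Y_l)."""
    pa, cand = [], list(candidates)
    while cand and len(pa) < k:
        def score(l):
            return (mutual_info_score(Y[:, l], Y[:, j]) +
                    sum(cond_mi(Y[:, j], Y[:, p], Y[:, l]) for p in pa))
        best = max(cand, key=score)
        pa.append(best)
        cand.remove(best)
    return pa

# Example: pick at most k = 2 parents for label 3 among its preceding labels.
rng = np.random.default_rng(1)
Y = (rng.random((200, 4)) < 0.4).astype(int)
print(select_parents(Y, j=3, candidates=[0, 1, 2], k=2))
```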

We can generalize (32) as

$$Y_l^* = \arg\max_{Y_l \notin \mathrm{Pa}_j} I(Y_l; Y_j) + \alpha \sum_{Y_k \in \mathrm{Pa}_j} I(Y_j; Y_k|Y_l), \quad j = 1, \ldots, q, \quad (33)$$

where $\alpha \geq 0$ is a parameter controlling the weights assigned to the two terms. In practice, we prefer to set $\alpha = 1/|\mathrm{Pa}_j|$ to prevent the second term from sweeping away the first as the number of parent labels grows. Based on (33), we propose the k-dependence Classifier Chains (kCC) method.

5. Multi-label feature selection

Based on the Optimization Strategy, the following problem can be formulated on top of the built Bayesian network B:

$$\arg\max_\theta L(\theta|D) = \arg\min_\theta I_{p_B}(X_{\bar\theta}; Y \,|\, X_\theta). \quad (34)$$

Note that there is no monotonicity in the conditional mutual information, so θ = (0, 0, ..., 0) or θ = (1, 1, ..., 1) is not always the solution. According to the chain rule of mutual information and $X = \{X_\theta, X_{\bar\theta}\}$, we have

$$I_{p_B}(X; Y) = I_{p_B}(X_\theta; Y) + I_{p_B}(X_{\bar\theta}; Y \,|\, X_\theta). \quad (35)$$

From the above formula, we see that minimization of $I_{p_B}(X_{\bar\theta}; Y|X_\theta)$ is equivalent to maximization of $I_{p_B}(X_\theta; Y)$; thus (34) is transformed into

$$\arg\max_\theta L(\theta|D) = \arg\max_\theta I_{p_B}(X_\theta; Y). \quad (36)$$

Note that a similar optimization problem has been proposed, by intuition, in the field of information theoretic feature selection for a single label variable Y in a variety of papers [27, 33, 34]. Here we theoretically derive the optimization problem from conditional likelihood maximization for the multi-label case, and take label correlations into account through the learned Bayesian network. However, directly solving (36) is non-trivial due to the difficulty of calculating mutual information between the high-dimensional $X_\theta$ and Y. Hence, similarly to Section 4.2, greedy sequential optimization is applied to (36). Since

$$I_{p_B}(X_\theta \cup X_m; Y) = I_{p_B}(X_\theta; Y) + I_{p_B}(X_m; Y \,|\, X_\theta), \quad (37)$$

starting from $X_\theta = \emptyset$, we can update $X_\theta$ by adding the $X_m \in X_{\bar\theta}$ such that

$$X_m^* = \arg\max_{X_m \in X_{\bar\theta}} I_{p_B}(X_m; Y \,|\, X_\theta), \quad (38)$$

until $|X_\theta|$ reaches a predefined number. By taking the Bayesian network B into consideration, we have

$$I_{p_B}(X_m; Y|X_\theta) = \sum_{j=1}^{q} I(X_m; Y_j \,|\, \mathrm{Pa}_j, X_\theta) = \sum_{j=1}^{q} \left( I(X_m; Y_j) + I(\mathrm{Pa}_j, X_\theta; Y_j|X_m) - I(\mathrm{Pa}_j, X_\theta; Y_j) \right). \quad (39)$$

Parent and Feature Independence Assumption. For each feature $X_n \in X_\theta$, given an unselected feature $X_m \in X_{\bar\theta}$ and the target labels Y, we assume the following:

$$p(\mathrm{Pa}_j, X_\theta|Y_j) = \prod_{Y_k \in \mathrm{Pa}_j} p(Y_k|Y_j) \prod_{X_n \in X_\theta} p(X_n|Y_j),$$

$$p(\mathrm{Pa}_j, X_\theta|Y_j, X_m) = \prod_{Y_k \in \mathrm{Pa}_j} p(Y_k|Y_j, X_m) \prod_{X_n \in X_\theta} p(X_n|Y_j, X_m).$$

This requires that the parents Pa_j and the selected features $X_\theta$ are conditionally independent given Y_j, with or without the unselected feature X_m. Based on the above assumption and (39), (38) becomes

$$X_m^* = \arg\max_{X_m \in X_{\bar\theta}} \sum_{j=1}^{q} \left( I(X_m; Y_j) + \sum_{Y_k \in \mathrm{Pa}_j} I(Y_j; Y_k|X_m) + \sum_{X_n \in X_\theta} I(Y_j; X_n|X_m) \right). \quad (40)$$

The first term captures the Relevance between the unselected feature X_m and the target label Y_j, and the second term captures the Label Correlations modeled in the k-dependence Bayesian network. For the third term of (40), we have

$$I(Y_j; X_n|X_m) = I(Y_j; X_n) + I(X_m; X_n|Y_j) - I(X_m; X_n). \quad (41)$$

Since $I(Y_j; X_n)$ is independent of X_m, $I(Y_j; X_n|X_m)$ is effectively represented by the last two terms. We therefore call $I(Y_j; X_n|X_m)$ the Redundancy Difference, since it reflects the difference between the Conditional Redundancy $I(X_m; X_n|Y_j)$ and the Redundancy $I(X_m; X_n)$. Hence, according to (40), a feature is selected because it provides the best trade-off among relevance, label correlations and redundancy. We can further generalize (40) as

$$X_m^* = \arg\max_{X_m \in X_{\bar\theta}} \sum_{j=1}^{q} \left( I(X_m; Y_j) + \beta \sum_{Y_k \in \mathrm{Pa}_j} I(Y_j; Y_k|X_m) + \gamma \sum_{X_n \in X_\theta} I(Y_j; X_n|X_m) \right), \quad (42)$$

where $\beta, \gamma \geq 0$ are two parameters controlling the weights on the label correlation and redundancy difference terms, respectively. In practice, we typically set $\beta = 1/|\mathrm{Pa}_j|$ and $\gamma = 1/|X_\theta|$ to normalize the terms, preventing either term from being swept away as the number of parents or selected features grows. This means we hold relatively weak assumptions on conditional label independence (the second term) and feature independence (the third term).

In the derivation of (42), we assume the relevant features are identical for all the labels. However, it is more general and reasonable to assume that the selected feature subset is label-specific. For the target label j, its relevant features are denoted by $X_{\theta_j}$, thus $X = \{X_{\theta_j}, X_{\bar\theta_j}\}$, $j = 1, \ldots, q$. Based on a similar derivation and assumption to (42), we obtain the following feature selection criterion for label j, $j = 1, \ldots, q$:

$$X_m^* = \arg\max_{X_m \in X_{\bar\theta_j}} I(X_m; Y_j) + \beta \sum_{Y_k \in \mathrm{Pa}_j} I(Y_j; Y_k|X_m) + \gamma \sum_{X_n \in X_{\theta_j}} I(Y_j; X_n|X_m). \quad (43)$$

In this case, we set $\beta = 1/|\mathrm{Pa}_j|$ and $\gamma = 1/|X_{\theta_j}|$. Note that the criterion (43) is optimized independently for each label, since $X_m \in X_{\bar\theta_j}$ and $X_{\theta_j}$ is label-specific. In this paper, we employ (42) and (43) for global and local Multi-Label Feature Selection (MLFS), respectively.
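As an illustration of the label-specific criterion (43), the following Python sketch greedily scores candidate features by the relevance, label-correlation and redundancy-difference terms, with β = 1/|Pa_j| and γ = 1/|X_θj| as stated above. It assumes the features have already been discretized (see Section 7.1); all helper names and data are placeholders rather than the authors' code.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mi(a, b, c):
    """I(A; B | C) for discrete arrays, as the C-weighted average of per-stratum MI."""
    return sum((c == v).mean() * mutual_info_score(a[c == v], b[c == v])
               for v in np.unique(c))

def mlfs_local(Xd, Y, j, parents, n_feat):
    """Greedy label-specific feature selection for label j following (43):
    relevance I(X_m; Y_j), the label-correlation term over Pa_j, and the
    redundancy-difference term over already selected features.  Xd holds
    discretized features; parents is the (assumed) parent set from the kDB network."""
    selected, remaining = [], list(range(Xd.shape[1]))
    while remaining and len(selected) < n_feat:
        beta = 1.0 / max(len(parents), 1)     # beta = 1 / |Pa_j|
        gamma = 1.0 / max(len(selected), 1)   # gamma = 1 / |X_theta_j|
        def score(m):
            rel = mutual_info_score(Xd[:, m], Y[:, j])
            corr = sum(cond_mi(Y[:, j], Y[:, p], Xd[:, m]) for p in parents)
            red = sum(cond_mi(Y[:, j], Xd[:, n], Xd[:, m]) for n in selected)
            return rel + beta * corr + gamma * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Example: 8 discretized features, 3 labels; select 3 features for label 2,
# assuming its parents in the learned network are labels 0 and 1.
rng = np.random.default_rng(2)
Xd = rng.integers(-1, 2, size=(200, 8))       # features discretized to {-1, 0, 1}
Y = (rng.random((200, 3)) < 0.4).astype(int)
print(mlfs_local(Xd, Y, j=2, parents=[0, 1], n_feat=3))
```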

6. Summary of theoretical findings

In Table 1, we summarize the techniques above by comparing model selection and feature selection algorithms. By combining kCC (33) and local MLFS (43), we propose the optimized Classifier Chains (oCC) method, whose procedure is outlined in Algorithm 1.

Table 1: The generality of the proposed framework for MLC.

Model selection:
  $|\mathrm{Pa}_j| = 0$       Binary Relevance (BR)      Boutell et al. (2004) [2]
  $|\mathrm{Pa}_j| \leq 1$    Bayesian CC (BCC)          Zaragoza et al. (2011) [7]
  $|\mathrm{Pa}_j| = j-1$     Classifier Chains (CC)     Read et al. (2011) [5]
  $|\mathrm{Pa}_j| \leq k$    kCC (Proposed)             Sun and Kudo (2017)

Feature selection, q = 1:
  $\beta = \gamma = 0$                  MI Maximization (MIM)   Lewis (1992) [35]
  $\beta = 0$, $\gamma = 1/|X_\theta|$  Joint MI (JMI)          Yang and Moody (1999) [34]
  $\beta = 0$, $\gamma = 1/|X_\theta|$  Max-Rel Min-Red         Peng et al. (2005) [33]

Feature selection, q > 1:
  $\beta = \gamma = 0$                                   Multi-Label MIM   Sechidis et al. (2012) [15]
  $\beta = 0$, $\gamma = 1/|X_\theta|$                   Multi-Label JMI   Sechidis et al. (2012) [15]
  $\beta = 1/|\mathrm{Pa}_j|$, $\gamma = 1/|X_\theta|$   MLFS (Proposed)   Sun and Kudo (2017)

(Max-Rel Min-Red ignores the conditional redundancy term in (42).)

7. Implementation issues

7.1. Mutual information estimation

In the optimization of both model and feature selection, the calculation of mutual information is used extensively. For discrete/categorical feature variables X (Y is originally binary), the calculation of mutual information is simple and straightforward. Given a sample of N i.i.d. observations $\{x^i, y^i\}_{i=1}^{N}$, based on the law of large numbers we have the following approximation:

$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \approx \frac{1}{N} \sum_{i=1}^{N} \log \frac{\hat{p}(x^i, y^i)}{\hat{p}(x^i)\, \hat{p}(y^i)}, \quad (44)$$

where $\hat{p}$ denotes the empirical probability distribution. In contrast, when the feature variable X is continuous, it becomes quite difficult to compute the mutual information I(X; Y), since it is typically impossible to obtain $\hat{p}$. One solution is to use kernel density estimation, in which case $\hat{p}$ is approximated by

$$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - x^i), \quad (45)$$

where $K_h(\cdot)$ is a non-negative kernel function with bandwidth h (h > 0).

Typically, the Gaussian kernel can be used as $K_h(\cdot)$. However, it is computationally expensive to compute (45), and it is usually difficult to select a good value of the bandwidth h. Another solution for computing I(X; Y) with continuous X is to apply data discretization as preprocessing [27]. In this study, we adopt this method and discretize a continuous variable X based on its mean $\mu_X$ and standard deviation $\sigma_X$. For example, we divide a continuous value of the feature variable X into one of three categories {-1, 0, 1} according to $\mu_X \pm \sigma_X$. The experimental results demonstrate the efficiency of this approach for approximating I(X; Y). Note that data discretization is performed only for calculating mutual information in feature selection; the original continuous/discrete features are used to train the classification models after the selection of useful features.

Algorithm 1: The algorithm of oCC (kCC + MLFS)
Input: $D = \{x^i, y^i\}_{i=1}^{N}$: training set with q labels; chain: label order; $\hat{x}$: test instance; k: maximal size of parents; M: size of selected features; $f_1, \ldots, f_q$: q predictive models.
Output: $\hat{y}$: predicted label set of $\hat{x}$.
Training:
1: $S \leftarrow \emptyset$, $\mathrm{Pa}_j \leftarrow \emptyset$ for $j = 1, \ldots, q$;
2: for $j \in$ chain do
3:   for $l = 1, \ldots, k$ do
4:     select $Y_l$ from S according to (33);
5:     $\mathrm{Pa}_j \leftarrow \mathrm{Pa}_j \cup Y_l$;
6:   for $m = 1, \ldots, M$ do
7:     select $X_m$ from X according to (43);
8:     $X_{\theta_j} \leftarrow X_{\theta_j} \cup X_m$;
9:   build classifier $f_j$ on $D_j$, where $D_j = \{x^i_{\theta_j} \cup \mathrm{pa}^i_j,\; y^i_j\}_{i=1}^{N}$;
10:  $S \leftarrow S \cup Y_j$;
Testing:
11: for $j \in$ chain do
12:   $\hat{y}_j \leftarrow f_j(\hat{x}_{\theta_j} \cup \hat{\mathrm{pa}}_j)$;
13: $\hat{y} \leftarrow (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_q)$.
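The plug-in estimate (44) combined with the mean-and-standard-deviation discretization of Section 7.1 can be sketched in a few lines of Python; the function names below are ours and the data is synthetic.

```python
import numpy as np

def discretize(x):
    """Map a continuous feature to {-1, 0, 1} using mu +/- sigma, as in Sec. 7.1."""
    mu, sigma = x.mean(), x.std()
    return np.where(x < mu - sigma, -1, np.where(x > mu + sigma, 1, 0))

def mutual_information(a, b):
    """Plug-in estimate of I(A; B), Eq. (44), from the empirical joint and
    marginal frequencies of two discrete 1-D arrays (in nats)."""
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))
            if p_ab > 0:
                p_a, p_b = np.mean(a == va), np.mean(b == vb)
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# Example: a continuous feature whose mean shifts with a binary label.
rng = np.random.default_rng(3)
y = (rng.random(500) < 0.5).astype(int)
x = rng.normal(loc=y.astype(float), scale=1.0)
print(mutual_information(discretize(x), y))   # positive value, in nats
```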

7.2. Chain order selection

The chain order is crucial to the performance of CC-based methods [5, 6]. To improve the performance, [18] and [19] propose to select the chain order by using Beam Search and a Monte Carlo algorithm, respectively. In this study, we develop a greedy algorithm for efficient chain order selection according to Theorem 1. Different from Theorem 1, where model selection is conducted to find the optimal k-dependence Bayesian network ($|\mathrm{Pa}_j| \leq k$ for all j) based on the canonical order, here we aim to perform model selection by selecting the optimal chain order π for a fully-connected Bayesian network $B_\pi$ ($|\mathrm{Pa}_j| = j-1$ for all j). Suppose that a permutation $\pi: \{1, 2, \ldots, q\} \to \{1, 2, \ldots, q\}$ determines the optimal chain order of labels, so that the jth label is placed at the π(j)th position of the chain. Thus, following Theorem 1, we have

$$\pi^* = \arg\min_\pi D_{KL}(p(Y|X) \,\|\, p_{B_\pi}(Y|X)) = \arg\max_\pi \sum_{j=1}^{q} I(Y_{\pi(j)}; \mathrm{Pa}_j|X), \quad (46)$$

where $\mathrm{Pa}_j = \{Y_{\pi(k)}\}_{k=1}^{j-1}$. Similar to the derivation in Section 4.2, a greedy forward search algorithm can be developed to solve the optimization problem (46) with an appropriate independence assumption. In the jth iteration, we select the best place π(j) for the jth label according to the following formula:

$$\pi^*(j) = \arg\max_{\pi(j)} \sum_{k=1}^{j-1} \left( \frac{1}{d} \sum_{m=1}^{d} I(Y_{\pi(j)}; Y_{\pi(k)}|X_m) + \frac{1}{k-1} \sum_{l=1}^{k-1} I(Y_{\pi(k)}; Y_{\pi(l)}|Y_{\pi(j)}) \right). \quad (47)$$

Algorithm 2 outlines the procedure of chain order selection. In Sec. 8.5, we show the efficiency of Algorithm 2.

Algorithm 2: The algorithm of chain order selection
Input: $D = \{x^i, y^i\}_{i=1}^{N}$: training data with q labels.
Output: π: selected chain order of the q labels.
1: $S_\pi \leftarrow \emptyset$, $\bar{S}_\pi \leftarrow \{1, 2, \ldots, q\}$;
2: initialize $\pi(1) \leftarrow \arg\max_{\pi(1)} \sum_{j=2}^{q} \sum_{m=1}^{d} I(Y_{\pi(1)}; Y_{\pi(j)}|X_m)$;
3: $S_\pi \leftarrow \pi(1)$, $\bar{S}_\pi \leftarrow \bar{S}_\pi \setminus \pi(1)$;
4: for $j = 2, \ldots, q$ do
5:   select $\pi(j) \in \bar{S}_\pi$ according to (47);
6:   $S_\pi \leftarrow S_\pi \cup \pi(j)$, $\bar{S}_\pi \leftarrow \bar{S}_\pi \setminus \pi(j)$.
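The following is a simplified Python sketch of Algorithm 2 in the spirit of the experiments in Sec. 8.5: it keeps only the first term of (47) and conditions on a single discretized feature (standing in for a discretized first principal component) instead of looping over all d features. The helpers and data are placeholders, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mi(a, b, c):
    """I(A; B | C) for discrete arrays via stratified plug-in estimates."""
    return sum((c == v).mean() * mutual_info_score(a[c == v], b[c == v])
               for v in np.unique(c))

def select_chain_order(Y, z):
    """Greedy chain order selection in the spirit of Algorithm 2, keeping only
    the first term of (47) and conditioning on a single discretized feature z."""
    q = Y.shape[1]
    remaining = list(range(q))
    # Head of the chain: the label most informative about all the others.
    head = max(remaining,
               key=lambda j: sum(cond_mi(Y[:, j], Y[:, l], z)
                                 for l in remaining if l != j))
    order = [head]
    remaining.remove(head)
    # Greedily append the label most related to the labels already placed.
    while remaining:
        nxt = max(remaining,
                  key=lambda j: sum(cond_mi(Y[:, j], Y[:, k], z) for k in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Example with placeholder data: 5 labels, z discretized to three bins.
rng = np.random.default_rng(4)
Y = (rng.random((300, 5)) < 0.4).astype(int)
z = rng.integers(-1, 2, size=300)
print(select_chain_order(Y, z))
```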

8. Experiments

8.1. Datasets and evaluation metrics

We conducted experiments on thirteen benchmark datasets from Mulan [36] and Meka [37] in order to evaluate the performance of the proposed method. The statistics of the datasets are summarized in Table 2, where Cardinality and Density denote the label cardinality and label density, respectively. In addition, Distinct denotes the number of distinct label combinations, and Feature type gives the property of the features.

[Table 2: The statistics of the experimental multi-label datasets (the number of instances N, features d and labels q, label cardinality, label density and the number of distinct label combinations). The thirteen datasets, with feature type and domain, are: emotions (numeric, music), scene (numeric, image), yeast (numeric, biology), birds (numeric, audio), genbase (nominal, biology), medical (nominal, text), enron (nominal, text), languagelog (nominal, text), rcv1s (numeric, text), bibtex (nominal, text), corel16k (nominal, image), corel5k (nominal, image), delicious (nominal, text).]

Given a test dataset $T = \{x^i, y^i\}_{i=1}^{N_T}$, we use four evaluation metrics to report the experimental results. Here $1_{[\cdot]}$ denotes the indicator function.

$$\text{Exact-Match} := \frac{1}{N_T} \sum_{i=1}^{N_T} 1_{[\hat{y}^i = y^i]}, \qquad \text{Hamming-Score} := \frac{1}{N_T} \sum_{i=1}^{N_T} \frac{1}{q} \sum_{l=1}^{q} 1_{[\hat{y}^i_l = y^i_l]},$$

$$\text{Macro-F1} := \frac{1}{q} \sum_{l=1}^{q} \frac{2 \sum_{i=1}^{N_T} \hat{y}^i_l\, y^i_l}{\sum_{i=1}^{N_T} \hat{y}^i_l + \sum_{i=1}^{N_T} y^i_l}, \qquad \text{Micro-F1} := \frac{2 \sum_{l=1}^{q} \sum_{i=1}^{N_T} \hat{y}^i_l\, y^i_l}{\sum_{l=1}^{q} \sum_{i=1}^{N_T} \hat{y}^i_l + \sum_{l=1}^{q} \sum_{i=1}^{N_T} y^i_l}.$$
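For reference, the four metrics defined above can be computed directly from 0/1 prediction and ground-truth matrices; the small Python sketch below follows the definitions, with a small epsilon added only to guard against labels that never occur.

```python
import numpy as np

def evaluate(Y_true, Y_pred):
    """Compute Exact-Match, Hamming-Score, Macro-F1 and Micro-F1 as defined
    above for 0/1 matrices of shape (number of test instances, q)."""
    eps = 1e-12                                    # guard against empty labels
    exact = np.mean(np.all(Y_pred == Y_true, axis=1))
    hamming = np.mean(Y_pred == Y_true)
    tp = np.sum(Y_pred * Y_true, axis=0)           # per-label true positives
    denom = np.sum(Y_pred, axis=0) + np.sum(Y_true, axis=0)
    macro_f1 = np.mean(2 * tp / (denom + eps))
    micro_f1 = 2 * tp.sum() / (denom.sum() + eps)
    return {"Exact-Match": exact, "Hamming-Score": hamming,
            "Macro-F1": macro_f1, "Micro-F1": micro_f1}

# Example on small placeholder predictions.
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
print(evaluate(Y_true, Y_pred))
```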

The above metrics can be cast into two categories: instance-based metrics (Exact-Match and Hamming-Score) and label-based metrics (Macro-F1 and Micro-F1) [38]. Among the metrics, Exact-Match is the most stringent measure, since it does not reward partial matches of label sets; in spite of that, it is a good metric to measure how well label correlations are modeled. Hamming-Score emphasizes the prediction accuracy on label-instance pairs and is able to evaluate the performance on each single label. Macro-F1 and Micro-F1 take partial matches of label sets into account, providing a good supplement to the instance-based metrics. In addition, Macro-F1 is more sensitive to the performance on rare categories (the labels in the minority), while Micro-F1 is affected more by the major categories (the labels in the majority). Hence, the joint use of Macro-F1 and Micro-F1 is a good supplement to the instance-based evaluation metrics for evaluating the performance of MLC methods.

8.2. Configuration

In the experiments we compare the following seven MLC methods:

BR: Binary Relevance (BR) [2]. It transforms a multi-label problem with q labels into q binary classification problems according to the one-vs-all strategy.

ECC: Ensemble of CCs (ECC) [5]. A number of CCs are built by randomly selecting the chain orders, and the final prediction is made by majority votes.

EBCC: Ensemble of BCCs (EBCC) [7]. A number of BCCs are built by randomly selecting their roots, and the prediction is made in the same way as in ECC.

MDDM: Multi-label Dimensionality reduction via Dependence Maximization (MDDM) [12]. It is a global FS-DR method. By maximizing the feature-label dependence, a projection is built to map the features into a low-dimensional subspace.

LLSF: Learning Label-Specific Features (LLSF) [16]. It is a local FS-DR method. Label-specific features are selected by optimizing a least squares problem with constraints on label correlations and feature sparsity.

oCC¹: Optimized Classifier Chains (oCC). The proposed MLC method combining model selection (kCC) and local feature selection (MLFS).

EoCC¹: Ensemble of oCCs (EoCC). The same ensemble strategy as in ECC is applied.

¹ We provide the codes of oCC and EoCC at:

BR was used as the baseline MLC method. As ensemble CC-based methods, ECC and EBCC were introduced due to their superior performance over several MLC decomposition methods, as shown in [5, 7]. MDDM was chosen as a representative of global FS-DR methods, since it outperformed several global FS-DR methods, such as PCA, LPP [39], MLSI [22] and CCA [23], as reported in [12]. As a local FS-DR method, LLSF was chosen due to its performance advantage over another local FS-DR method, LIFT [17], as reported in [16].

In the experiments, 5-fold cross validation was performed to evaluate the classification performance. For a fair comparison, a linear SVM implemented in Liblinear [40] was used as the baseline binary classifier for all the comparing methods. In the parameter settings, we tune the important parameters controlling feature sparsity for the FS-DR methods, i.e., MDDM, LLSF and oCC/EoCC. Specifically, the dimensionality of the feature space in MDDM (thr) and oCC/EoCC (M) is selected from {0.1, 0.3, 0.5, 0.7, 0.9} of the total number d of features, while the value of β in LLSF is tuned in the range {0.001, 0.01, 0.1, 1, 10}. As suggested in [12] and [16], we set the regularization parameter µ = 0.5 in MDDM, and the parameters α = 0.5 and γ = 0.01 in LLSF. To balance the performance of oCC in Exact-Match and Macro/Micro-F1, we set the parameter k = 0.8q. For the ensemble methods, we use an ensemble of 10 CCs/BCCs/oCCs for building ECC/EBCC/EoCC. In order to scale up the ensemble methods, random sampling is applied to select 75% of the instances and 50% of the features for building each single model of the ensemble, as recommended in [5]. All the comparing methods were implemented in Matlab, and the experiments were performed on a computer with an Intel Quad-Core CPU at 3.4 GHz and 4 GB RAM.

8.3. On k-dependence classifier chains

To evaluate the performance of model selection, an experiment was performed with kCC on four datasets, where we increased the number of parents k from 0 to q-1 in steps of 1. For convenience, the values of each metric are normalized by dividing by its maximum. Fig. 3 shows the experimental results in the four metrics averaged by 5-fold cross validation. Consistent with the theoretical analysis, the performance of kCC in Exact-Match improves as the value of k increases, except on the medical dataset.

Moreover, as the value of k approaches 0.8q, the performance becomes stable. In terms of Macro/Micro-F1, the performance shows a tendency similar to Exact-Match, but the changes in values are relatively small. In Hamming-Score, the performance seems insensitive to the change of k. Therefore, it is suggested to set the value of k to 0.8q if the objective is to optimize Exact-Match.

[Figure 3: Performances of kCC on four different datasets (scene, medical, enron, rcv1s) in the four evaluation metrics, with the values of each metric normalized by dividing by its maximum. The number k of parents is increased from 0 to q-1 in steps of 1. The indices of the maximum in Exact-Match over the four datasets are 5, 11, 12 and 88, respectively.]

8.4. On multi-label feature selection

To evaluate the performance of multi-label feature selection, another experiment was performed with five information theoretic feature selection algorithms. The algorithms are generated according to (42) and (43): Multi-Label Mutual Information Maximization (MLMIM) [15] ($\beta = \gamma = 0$ in (42)); Multi-Label Joint Mutual Information (MLJMI) [15] and Multi-Label Max-Relevance Min-Redundancy (MLMRMR) [33] ($\beta = 0$, $\gamma = 1/|X_\theta|$ in (42), where MLMRMR ignores the conditional redundancy term); MLFS_local ($\beta = 1/|\mathrm{Pa}_j|$, $\gamma = 1/|X_{\theta_j}|$ in (43)); and MLFS_global ($\beta = 1/|\mathrm{Pa}_j|$, $\gamma = 1/|X_\theta|$ in (42)). Fig. 4 shows the experimental results on four different datasets in Exact-Match and Macro-F1. We increased the number M of selected features from 10% to 100%, in steps of 10%, of min{d, 200}. The local algorithm, MLFS_local, outperformed the other ones on average, and its performance was more stable with respect to the value of M compared with the other algorithms. Among the global algorithms, MLFS_global worked better than MLMIM, MLMRMR and MLJMI. Note that MLFS_local and MLFS_global achieved a significant advantage with a smaller value of M, probably because of their success in modeling label correlations in (42). Similar patterns can be observed in the other metrics.

[Figure 4: Comparing five information theoretic feature selection algorithms on four datasets (scene, medical, enron, rcv1s1) in Exact-Match (top row) and Macro-F1 (bottom row). The number M of selected features is chosen from 10% to 100%, in steps of 10%, of min{d, 200}.]

In summary, performing local feature selection is more effective than the global way, and it is important to take label correlations into account. Thus, we employ only MLFS_local in the following experiments.

To evaluate the potential of oCC, we further conduct experiments varying its two parameters: the number k of parents and the number M of selected features. The experimental results on the four datasets in Exact-Match and Macro-F1 are shown in Fig. 5. We select the values of k and M according to percentages from 5% to 100% in steps of 5%. In terms of k, we can see that the performance becomes stable when a small percentage of the parents is selected. In terms of M, oCC achieved its best performance when only a fraction of the features are selected. The results verify the existence of useless parents and features.

8.5. On chain order selection

We performed another experiment to evaluate the performance of the chain order selection developed in Algorithm 2. In this experiment, we incorporated Algorithm 2 into oCC (Algorithm 1), and set the two parameters of oCC to k = q and M = d. To simplify the computation of Algorithm 2, we reduced the dimensionality of the features to 1 by Principal Component Analysis (PCA) and ignored the second term of (47). We first compared oCC (with the canonical label order $Y_1 \to Y_2 \to \cdots \to Y_{q-1} \to Y_q$), oCC_cos (with chain order selection) and EoCC_5 (an ensemble of five oCCs with randomly selected orders). As shown in Fig. 6,


More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

RaRE: Social Rank Regulated Large-scale Network Embedding

RaRE: Social Rank Regulated Large-scale Network Embedding RaRE: Social Rank Regulated Large-scale Network Embedding Authors: Yupeng Gu 1, Yizhou Sun 1, Yanen Li 2, Yang Yang 3 04/26/2018 The Web Conference, 2018 1 University of California, Los Angeles 2 Snapchat

More information

CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs

CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs Professor Erik Sudderth Brown University Computer Science October 6, 2016 Some figures and materials courtesy

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Dimension Reduction Using Nonnegative Matrix Tri-Factorization in Multi-label Classification

Dimension Reduction Using Nonnegative Matrix Tri-Factorization in Multi-label Classification 250 Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'15 Dimension Reduction Using Nonnegative Matrix Tri-Factorization in Multi-label Classification Keigo Kimura, Mineichi Kudo and Lu Sun Graduate

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection

Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection Gavin Brown, Adam Pocock, Mingjie Zhao and Mikel Lujan School of Computer Science University of Manchester

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging 10-708: Probabilistic Graphical Models 10-708, Spring 2018 10 : HMM and CRF Lecturer: Kayhan Batmanghelich Scribes: Ben Lengerich, Michael Kleyman 1 Case Study: Supervised Part-of-Speech Tagging We will

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Decision trees COMS 4771

Decision trees COMS 4771 Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

More information

Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss

Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss Krzysztof Dembczyński 1,3, Willem Waegeman 2, Weiwei Cheng 1, and Eyke Hüllermeier 1

More information

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation Lecture 15. Pattern Classification (I): Statistical Formulation Outline Statistical Pattern Recognition Maximum Posterior Probability (MAP) Classifier Maximum Likelihood (ML) Classifier K-Nearest Neighbor

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Generative Clustering, Topic Modeling, & Bayesian Inference

Generative Clustering, Topic Modeling, & Bayesian Inference Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment.

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1 Introduction and Problem Formulation In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1.1 Machine Learning under Covariate

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 218 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1

More information

Introduction to Bayesian Learning

Introduction to Bayesian Learning Course Information Introduction Introduction to Bayesian Learning Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Apprendimento Automatico: Fondamenti - A.A. 2016/2017 Outline

More information

Iterative Laplacian Score for Feature Selection

Iterative Laplacian Score for Feature Selection Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,

More information

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang. Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

More information

Tufts COMP 135: Introduction to Machine Learning

Tufts COMP 135: Introduction to Machine Learning Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Logistic Regression Many slides attributable to: Prof. Mike Hughes Erik Sudderth (UCI) Finale Doshi-Velez (Harvard)

More information

Probabilistic Time Series Classification

Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Online Estimation of Discrete Densities using Classifier Chains

Online Estimation of Discrete Densities using Classifier Chains Online Estimation of Discrete Densities using Classifier Chains Michael Geilke 1 and Eibe Frank 2 and Stefan Kramer 1 1 Johannes Gutenberg-Universtität Mainz, Germany {geilke,kramer}@informatik.uni-mainz.de

More information

Feature Selection based on the Local Lift Dependence Scale

Feature Selection based on the Local Lift Dependence Scale Feature Selection based on the Local Lift Dependence Scale Diego Marcondes Adilson Simonis Junior Barrera arxiv:1711.04181v2 [stat.co] 15 Nov 2017 Abstract This paper uses a classical approach to feature

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Adapted Feature Extraction and Its Applications

Adapted Feature Extraction and Its Applications saito@math.ucdavis.edu 1 Adapted Feature Extraction and Its Applications Naoki Saito Department of Mathematics University of California Davis, CA 95616 email: saito@math.ucdavis.edu URL: http://www.math.ucdavis.edu/

More information

Generative MaxEnt Learning for Multiclass Classification

Generative MaxEnt Learning for Multiclass Classification Generative Maximum Entropy Learning for Multiclass Classification A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram Dept. of Computer Science and Automation Indian Institute of Science,

More information

Parameter learning in CRF s

Parameter learning in CRF s Parameter learning in CRF s June 01, 2009 Structured output learning We ish to learn a discriminant (or compatability) function: F : X Y R (1) here X is the space of inputs and Y is the space of outputs.

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Pengtao Xie, Machine Learning Department, Carnegie Mellon University 1. Background Latent Variable Models (LVMs) are

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Appendices for the article entitled Semi-supervised multi-class classification problems with scarcity of labelled data

Appendices for the article entitled Semi-supervised multi-class classification problems with scarcity of labelled data Appendices for the article entitled Semi-supervised multi-class classification problems with scarcity of labelled data Jonathan Ortigosa-Hernández, Iñaki Inza, and Jose A. Lozano Contents 1 Appendix A.

More information

Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization

Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization Haiping Lu 1 K. N. Plataniotis 1 A. N. Venetsanopoulos 1,2 1 Department of Electrical & Computer Engineering,

More information

Machine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang Example: image classification indoor Indoor outdoor Example: image classification (multiclass)

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Dimension Reduction (PCA, ICA, CCA, FLD,

Dimension Reduction (PCA, ICA, CCA, FLD, Dimension Reduction (PCA, ICA, CCA, FLD, Topic Models) Yi Zhang 10-701, Machine Learning, Spring 2011 April 6 th, 2011 Parts of the PCA slides are from previous 10-701 lectures 1 Outline Dimension reduction

More information

Clustering with k-means and Gaussian mixture distributions

Clustering with k-means and Gaussian mixture distributions Clustering with k-means and Gaussian mixture distributions Machine Learning and Category Representation 2012-2013 Jakob Verbeek, ovember 23, 2012 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.12.13

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/xilnmn Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information