Multi-dimensional classification via a metric approach


Zhongchen Ma, Songcan Chen (corresponding author: s.chen@nuaa.edu.cn)
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China

Abstract

Multi-dimensional classification (MDC) refers to learning an association between individual inputs and multiple discrete output variables, and is thus more general than multi-class classification (MCC) and multi-label classification (MLC). One of the core goals of MDC is to model the output structure in order to improve classification performance. To this end, one effective strategy is to first transform the output space and then learn in the transformed space. However, existing transformation approaches are all rooted in the label power-set (LP) method and thus inherit its drawbacks (e.g. class imbalance and class overfitting). In this study, we first analyze the drawbacks of LP, and then propose a novel transformation method which not only overcomes these drawbacks but also builds a bridge from MDC to MLC. As a result, many off-the-shelf MLC methods can be adapted to the newly-formed problem. Instead of adapting these methods, however, we propose a novel metric learning based method which yields a closed-form solution for the newly-formed problem. Interestingly, our metric learning based method is also naturally applicable to MLC, and is thus of independent interest as well. Extensive experiments justify the effectiveness of our transformation approach and of our metric learning based method.

Keywords: Multi-dimensional classification, problem transformation, distance metric learning, closed-form solution

Preprint submitted to Journal of LaTeX Templates, September 21, 2017

1. Introduction

In supervised learning, binary classification (BC), multi-label classification (MLC) and multi-class classification (MCC) have been extensively studied in past years. As a more general learning task, MDC is relatively less studied up to now, which can partly be attributed to its more complex output space. Figure 1 displays the relationships among the different classification paradigms in terms of m class variables with K possible values each. As shown in the figure, BC has a single class variable whose range is {0, 1} or {−1, +1}, corresponding to m = 1 and K = 2; MCC also has a single class variable, but one that can take a number of class values, corresponding to m = 1 and K > 2; MLC has multiple class variables whose range is again {0, 1} or {−1, +1}, corresponding to m > 1 and K = 2. More generally, MDC allows multiple class variables that can each take a number of class values, corresponding to m > 1 and K > 2.

Figure 1: Relationship between the different classification paradigms, where m is the number of class variables and K is the number of values each of these variables may take.

A wide range of applications corresponds to this task. For example, in computer vision [1], a landscape image may convey several pieces of information such as the month, the season, or the type of subject; in information retrieval [2][3], documents can be classified into different kinds of categories such as mood or topic; in computational advertising [4], a social media post may reveal the user's gender, age, personality, happiness or political polarity.

Like MLC, the core goal of MDC is to achieve effective classification performance by modeling the output structure. In modeling, the simplest assumption is

that the class variables are completely unrelated, so that it suffices to design a separate, independent model for each class. However, such an ideal assumption is hardly applicable to real-world problems in general, as correlation (structure) often exists among class variables; for example, a user's age can have a strong impact on his political polarity, where the young are generally more radical and elders are often more conservative. Even within each output dimension, there exists an explicit within-dimension relationship among its values, namely that only one value of a class variable can be activated. Therefore, one key to effective learning lies in taking sufficient advantage of the explicit and/or implicit relationships both among output dimensions and among the values within each output dimension.

In order to model such output structures, two main strategies have been proposed: (i) explicitly modeling the dependence structure between class variables, e.g., by imposing a chain structure [5][6][7], using a multi-dimensional Bayesian network structure [8][9], or adopting a Markov random field [10]; (ii) implicitly modeling the output structure through transformation approaches [11][12][13]. A major limitation of the former strategy is that it requires a pre-defined output structure (e.g., a chain or a Bayesian network), and thus partly loses the flexibility to characterize structure. In contrast, the transformation approaches of the latter strategy enjoy more flexibility due to their ability to model various structures. What's more, such a transformation method has demonstrated convincing performance in [13]. Therefore, in this paper, we follow the transformation strategy to model the output structure of MDC.

To the best of our knowledge, all existing transformation methods can be classified as label power-set (LP)-based transformation approaches. LP [11] transforms the MDC problem into a corresponding multi-class classification problem by defining a new compound class variable whose range contains exactly all the possible combinations of values of the original class variables. Though it implicitly considers the interaction between different classes, LP suffers from class imbalance and class overfitting problems, where class imbalance refers to the great differences in the total number of instances for different combinations

of the class variables, and class overfitting refers to zero instances for some combinations of the class variables. To address these issues of LP, [13] proposed to first form super-class partitions by modeling the dependence between class variables and then let each super-class partition correspond to a compound class variable defined by LP. Although this super-class partitioning can reduce the original problem to a set of subproblems, these newly-formed subproblems still need to be transformed by LP, and the approach therefore naturally inherits LP's problems.

In this study, we analyze the drawbacks of LP and propose a novel transformation method which not only overcomes these drawbacks but also builds a bridge from MDC to MLC. Specifically, our transformation approach forms a new output space with all binary variables via a careful binarization of the original output space of MDC. Since the newly-formed problem is similar to MLC (e.g. the class variables of both problems are all binary), our transformation approach is named the Multi-Label-like Transformation approach (MLKT), and subsequently many off-the-shelf MLC methods can be adapted to the newly-formed problem. However, instead of adapting these methods, we also propose a novel metric-based method aiming to make the predictions of an instance in the learned metric space close to its true class values while far away from all others. Moreover, our metric-based method yields a closed-form solution, thus its learning is more efficient than that of its competitors. Interestingly, our metric learning method is also naturally applicable to MLC, and is thus of independent interest as well. Finally, extensive experimental results justify that our approach combining the above two procedures achieves better classification performance than state-of-the-art MDC methods, while our metric learning method itself also obtains competitive classification performance with a lower learning complexity compared to its counterparts designed specifically for MLC.

The rest of the paper is structured as follows. We first introduce the required background in the field of multi-dimensional classification in Section 2. Then we introduce MLKT in Section 3. Next, we present the details of the

distance metric learning method in Section 4. We then experimentally evaluate the proposed schemes in Section 5. Finally, we give concluding remarks in Section 6.

2. Background

In this section, we review basic multi-dimensional classifiers. In MDC, we have N labeled instances D = {(x_i, y_i)}_{i=1}^N from which we wish to build a classifier that associates multiple class values with each data instance. A data instance is represented by a vector of d values x = (x^1, ..., x^d) drawn from the input domain X_1 × ··· × X_d, and the classes are represented by a vector of m values y = (y^1, ..., y^m) from the domain Y_1 × ··· × Y_m, where each Y_j = {1, ..., K_j} is the set of possible values for the j-th class variable Y_j. Specifically, we seek to build a classifier f that assigns each instance x a vector y of class values:

    f : X_1 × ··· × X_d → Y_1 × ··· × Y_m,    x = (x^1, ..., x^d) ↦ y = (y^1, ..., y^m)

Binary Relevance (BR) is a straightforward method for MDC. It trains m classifiers f := (f_1, ..., f_m), one for each class variable. Specifically, a standard multi-class classifier f_j learns to associate one of the values y^j ∈ Y_j with each data instance, where f_j : X_1 × ··· × X_d → Y_j. However, BR is unable to capture the dependencies among classes and suffers low accuracy, as illustrated in [5, 14, 15].
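As a concrete point of reference, a minimal BR baseline can be written as follows. This is only a sketch assuming numpy and scikit-learn with a logistic-regression base learner, not the implementation evaluated later in the experiments.

```python
# Binary Relevance for MDC (sketch): one independent multi-class classifier per output
# dimension, so no output structure is captured.
import numpy as np
from sklearn.linear_model import LogisticRegression

class BinaryRelevanceMDC:
    def __init__(self):
        self.models_ = []

    def fit(self, X, Y):
        # Train one classifier f_j : X -> Y_j for each of the m class variables.
        self.models_ = [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
                        for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        # Each dimension is predicted independently of the others.
        return np.column_stack([m.predict(X) for m in self.models_])

# Toy usage with random data (purely illustrative).
X = np.random.randn(100, 5)
Y = np.column_stack([np.random.randint(1, 4, 100), np.random.randint(1, 3, 100)])
print(BinaryRelevanceMDC().fit(X, Y).predict(X[:3]))
```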

MDC has attracted much attention recently, and many multi-dimensional classifiers for modeling its output structure have been proposed. As presented in the introduction, there are two main strategies:

1. Explicit representation of the dependence structure between class variables.

The classifier chains model (CC) [5, 6, 7], classifier trellises (CT) [14] and multi-dimensional Bayesian network classifiers (MBCs) [8][9] are recently proposed methods following this strategy for MDC.

Specifically, the classifier chains model (CC) learns m classifiers, one for each class variable. These classifiers are linked in a random order, such that the j-th classifier uses as input features not only the instance but also the output predictions of the previous j−1 classifiers, namely ŷ^j = f_j(x, ŷ^1, ..., ŷ^{j−1}) for any test instance x. Specifically:

    f_j : X_1 × ··· × X_d × Y_1 × ··· × Y_{j−1} → Y_j

This method has demonstrated high performance in multi-label domains and is directly applicable to MDC. However, a drawback is that the class-variable ordering in the chain has a strong effect on predictive accuracy, and with the greedy structure comes the concern of error propagation along the chain, since an incorrect estimate ŷ^j will negatively affect all subsequent class variables. Naturally, the ensemble strategy (ECC) [7], which trains several CC classifiers with randomly ordered chains, can be used to alleviate these problems. A minimal sketch of the chain idea is given below.

Classifier trellises (CT) capture dependencies among class variables by considering a predefined trellis structure. Each of the vertices of the trellis corresponds to one of the class variables. Fig. 2 shows a simple example of the structure, where the parents of each class variable are the class variables lying on the vertices above and to the left in the trellis. Specifically:

    f_d : X_1 × ··· × X_d × Y_b × Y_c → Y_d

CT can scale to large data sets with reasonable complexity. However, just like CC, the artificially-defined greedy structure may falsely reflect the real dependency among class variables, and thus limits its classification performance in real-world applications unless the predefined structure coincides with the given problem.
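The chain idea referred to above can be sketched as follows, assuming numpy and scikit-learn. This is a simplified illustration rather than the CC implementation used in the experiments.

```python
# Classifier chain for MDC (sketch): the j-th model sees the inputs plus the previous
# class variables, in a fixed (here: given) order.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_chain(X, Y, order=None):
    m = Y.shape[1]
    order = list(range(m)) if order is None else order    # chain order (random in CC)
    models, X_aug = [], X
    for j in order:
        models.append(LogisticRegression(max_iter=1000).fit(X_aug, Y[:, j]))
        X_aug = np.hstack([X_aug, Y[:, [j]]])              # true values during training
    return order, models

def predict_chain(X, order, models):
    X_aug, preds = X, {}
    for j, clf in zip(order, models):
        preds[j] = clf.predict(X_aug)
        # At test time earlier *predictions* are appended; this is where errors propagate.
        X_aug = np.hstack([X_aug, preds[j][:, None]])
    return np.column_stack([preds[j] for j in sorted(preds)])
```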

Figure 2: A simple example of classifier trellises [14].

Multi-dimensional Bayesian network classifiers (MBCs) are a family of probabilistic graphical models which organize the class and feature variables into three different subgraphs: a class subgraph, a feature subgraph, and a bridge (from classes to features) subgraph. Different graphical structures for the class and the feature subgraphs lead to different families of MBCs. We show a simple tree-tree structure of an MBC in Fig. 3. In recent years, various MBCs have been proposed and have become useful tools for modeling the output structure of MDC [8, 9, 16]. However, their problem remains the exponential computational complexity.

Figure 3: A simple tree-tree structure of an MBC [8][9].

2. Implicit incorporation of output structure by transforming the output space.

To the best of our knowledge, all existing transformation methods for MDC can be classified as label power-set (LP)-based transformation approaches. The label power-set (LP) [11] is a typical transformation approach for MLC and can also be directly applied to MDC. It first forcefully assumes that all the class variables are dependent, and then defines a new compound class variable whose range contains all the possible combinations of values of the original class variables. Specifically,

    f : X_1 × ··· × X_d → CartesianProduct(Y_1, ..., Y_m).
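A minimal sketch of the LP transformation, assuming numpy and scikit-learn (not the implementation used in the experiments):

```python
# Label power-set (LP) transformation (sketch): each observed combination of class
# values becomes one compound class, and a single multi-class model is trained on it.
import numpy as np
from sklearn.linear_model import LogisticRegression

def lp_fit(X, Y):
    combos, compound = np.unique(Y, axis=0, return_inverse=True)  # one class per combination
    clf = LogisticRegression(max_iter=1000).fit(X, compound)
    return clf, combos

def lp_predict(clf, combos, X):
    return combos[clf.predict(X)]      # map the compound class back to a value vector
```

Note that only combinations seen in training receive a compound class, and rare combinations get very few instances, which is exactly the class overfitting and class imbalance issue analyzed in Section 3.1.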

As a result, the original problem is turned into a multi-class classification problem for which many off-the-shelf methods are available. In this way, the output structure of MDC is implicitly considered. However, LP easily suffers from the class overfitting and class imbalance problems mentioned in the introduction.

Random k-labelsets (RAkEL) [17] and the super-class classifier (SCC) [13] are LP-based transformation approaches, where RAkEL uses multiple LP classifiers, each trained on a random subset of Y, while SCC's LP classifiers are trained on subsets of class variables with strong dependency. Formally, given a subset S of the class variables of Y, its corresponding LP classifier is learned, namely:

    f_S : X_1 × ··· × X_d → CartesianProduct(Y_S).

Note that RAkEL is specially designed for MLC, but can be directly applied to MDC, while SCC has demonstrated convincing classification performance for MDC. However, both methods need to resort to LP, and therefore naturally suffer from the same problems to some extent.

In summary, the above two strategies have their own advantages and drawbacks. Relatively speaking, however, the latter enjoys more flexibility in modeling output structure. Therefore, in this paper, we follow the latter strategy to model the output structure of MDC. Unfortunately, existing transformation approaches are all based on LP and thus naturally inherit its class overfitting and class imbalance problems. Motivated by this, we give an analysis of the drawbacks of LP and propose a novel transformation approach to overcome them.

3. Transformation for MDC

3.1. Analysis of LP

We now give an analysis of LP and detail the causes of its drawbacks.

LP is a typical transformation approach for MDC. It defines a new compound class variable whose range contains all the possible combinations of values of the original class variables. In this way, the original MDC problem is turned into a multi-class classification problem which is relatively easier for subsequent learning. However, such a transformation has two serious drawbacks.

The first drawback is class overfitting, which is caused by the great reduction of the number of instances per class after the transformation. Specifically, consider a dataset {x_i, y_i}_{i=1}^N of an MDC problem with m class variables, each taking K_i class values. Clearly, the number of instances per class in the i-th output dimension is about (1/K_i)·N in the original problem, and in the worst case it is (1/K_max)·N, where K_max = max_i(K_i). After the transformation, however, the number of instances per class becomes (1/∏_{i=1}^m K_i)·N, which is far less than (1/K_max)·N since ∏_{i=1}^m K_i is typically far greater than K_max. Hence, this reduction makes learning on the formed problem prone to class overfitting.

The second drawback is class imbalance, which is caused by the reduction of the balance degree (the smallest ratio between the numbers of instances of any two classes) after the transformation. More specifically, given a dataset of an imbalanced MDC problem with m class variables Y_1, ..., Y_m, assume the balance degrees of the class variables are respectively p_1, ..., p_m. After the LP transformation, the balance degree changes to p_1···p_m. Since p_i < 1 for all i ∈ {1, ..., m}, we get p_1···p_m < min(p_1, ..., p_m). Thus the balance degree in the formed problem is worse than that in the original problem. In essence, although LP does not transform a totally balanced MDC problem into an imbalanced one, it can indeed transform an almost balanced MDC problem into an imbalanced one.

What's more, from the above analysis we find that the more class variables the LP compound class variable includes, the more serious the class overfitting and class imbalance problems become.
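A small numerical illustration of the two effects just described, with hypothetical values N = 1000 and K = (3, 4, 5) that are not taken from any dataset in the paper:

```python
# Counting argument behind the two LP drawbacks (illustrative numbers only).
import math

N, K = 1000, [3, 4, 5]
per_dim_counts = [N / k for k in K]        # about N / K_i instances per class per dimension
per_lp_count = N / math.prod(K)            # only N / (K_1 * ... * K_m) per LP compound class
print(per_dim_counts, per_lp_count)        # [333.3, 250.0, 200.0] vs 16.7

balance = [0.9, 0.8, 0.7]                  # assumed per-dimension balance degrees p_i
print(math.prod(balance), min(balance))    # 0.504 < 0.7: LP worsens the balance degree
```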

To overcome the drawbacks of LP, we propose a transformation approach that aims to 1) keep the number of instances per class in the transformed problem as similar as possible to that in the original MDC problem, and 2) keep the balance degree in the formed problem as consistent as possible with that in the original MDC problem.

3.2. The procedure of MLKT

In this subsection, we present a novel transformation approach, namely MLKT, which transforms Y to a subspace of {0, 1}^L (where L is the dimensionality of the formed problem). Our approach inherits the following favorable characteristics of LP:

- it keeps the output space size invariant;
- it is easy for subsequent modeling in the transformed space;
- it can reflect the explicit within-dimension relationship.

In addition, it possesses two extra key characteristics:

- it can overcome the class overfitting and class imbalance problems suffered by LP;
- it is decomposable for each class variable of MDC.

Among the above characteristics, we deliberately make the transformation decomposable, aiming at easy implementation of the transformation and distinct learning for each output variable. By doing so, we avoid the unnecessary computational cost that LP incurs when the correlations between some output variables are not strong.

Now let us detail the procedure of MLKT. For each individual class variable of MDC:

- if K_i ≥ 3, then for each y^i ∈ Y_i = {1, ..., K_i}, construct a new K_i-dimensional class vector ẑ^i where ẑ^i_j = 1 if j = y^i, and 0 otherwise;
- if K_i = 2, then for each y^i ∈ Y_i = {1, 2}, construct a 1-dimensional class vector ẑ^i where ẑ^i = 0 if y^i = 1 and ẑ^i = 1 if y^i = 2.
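The construction above is mechanical; a minimal sketch assuming numpy is given below (the toy usage matches Example 1, given next).

```python
# MLKT transformation (sketch): one-hot encode every class variable with K_i >= 3,
# keep a single 0/1 bit for binary class variables, then concatenate the blocks.
import numpy as np

def mlkt_transform(Y, K):
    """Y: (N, m) integer matrix with Y[:, i] in {1, ..., K[i]}; returns the (N, L) binary Z."""
    blocks = []
    for i, Ki in enumerate(K):
        if Ki >= 3:
            block = np.zeros((Y.shape[0], Ki), dtype=int)
            block[np.arange(Y.shape[0]), Y[:, i] - 1] = 1      # one-vs-all coding
        else:
            block = (Y[:, [i]] == 2).astype(int)               # {1, 2} -> {0, 1}
        blocks.append(block)
    return np.hstack(blocks)

# Toy usage: Y1 in {1, 2, 3}, Y2 in {1, 2}.
Y = np.array([[1, 1], [2, 2], [3, 1]])
print(mlkt_transform(Y, K=[3, 2]))
# [[1 0 0 0]
#  [0 1 0 1]
#  [0 0 1 0]]
```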

With the above transformation, the original output class vector y = (y^1, ..., y^m) ∈ Y is converted to a corresponding class vector ẑ = [ẑ^1; ...; ẑ^m], i.e. ẑ is obtained by concatenating the corresponding m vectors in ascending index order. Thus, we form a new output domain Ẑ = {ẑ_y | y ∈ Y}, where ẑ_y denotes the vector obtained from y by the MLKT transformation. Clearly, Ẑ ⊆ {0, 1}^L, where L is the dimensionality of ẑ. Let us give an example to help understand the MLKT transformation.

Example 1. Assume the output space of MDC is Y = Y_1 × Y_2, where Y_1 := {1, 2, 3} and Y_2 := {1, 2}.

1. Transformation of each individual class domain:
   Y_1 := {1, 2, 3} → {(1, 0, 0)^T, (0, 1, 0)^T, (0, 0, 1)^T},
   Y_2 := {1, 2} → {0, 1}.

2. Concatenation:
   Ẑ = {(1, 0, 0, 0)^T, (1, 0, 0, 1)^T, (0, 1, 0, 0)^T, (0, 1, 0, 1)^T, (0, 0, 1, 0)^T, (0, 0, 1, 1)^T}.

Clearly, it is not easy to learn directly in the newly-formed space. However, we observe that the newly-formed space is equivalent to {0, 1}^L with some additionally-imposed constraints, which is relatively easier for subsequent learning. These additionally-imposed constraints can be obtained from the following insight. Let us define an integer set φ^i = {φ^i_1, φ^i_2, ..., φ^i_j, ...} for each i = 1, ..., m, such that ẑ_(φ^i) = ẑ^i, where the elements of φ^i are the indices of ẑ corresponding to the class variable domain Y_i. Based on the transformation characteristics of MLKT, we find that when K_i ≥ 3, the vector ẑ_(φ^i) has one and only one element equal to 1; thus we have

    Σ_{j ∈ φ^i} ẑ_(φ^i_j) = 1,   ∀ẑ ∈ Ẑ.    (1)

In fact, we can use E^T z = t to formulate all these equalities, where E ∈ R^{L×m} is an indicator matrix whose elements are

    E_{ki} = 1 if k ∈ φ^i and K_i ≥ 3, and E_{ki} = 0 otherwise,    (2)

and t is an m-dimensional vector whose elements are

    t_i = 1 if K_i ≥ 3, and t_i = 0 otherwise.    (3)

Next, we prove in Proposition 1 that the output domain Ẑ of the newly-formed problem is in fact equivalent to

    Z = {z ∈ {0, 1}^L | E^T z = t},    (4)

where E and t are defined in Eq. (2) and Eq. (3), respectively.

Proposition 1. The output domain Ẑ formed by MLKT is equivalent to Z = {z ∈ {0, 1}^L | E^T z = t}, where E is the predefined indicator matrix and t is the predefined m-dimensional vector.

Proof. Based on Eq. (1), every ẑ ∈ Ẑ also belongs to Z. Therefore, we just need to prove that the size of Z is consistent with that of Y (which implies consistency with that of Ẑ as well). Assume the original MDC problem has m class variables, with K_1, K_2, ..., K_m possible values respectively. Since the MLKT transformation is decomposable for each class variable, we just need to prove that the size of Z_(φ^i) is also K_i. We give the proof by cases.

Case 1: if K_i = 2, the class variable domain Y_i is transformed to Z_(φ^i) = {0, 1}^1, so its size is also 2.

Case 2: if K_i ≥ 3, the class variable domain Y_i is transformed to Z_(φ^i) = { z_(φ^i) ∈ {0, 1}^{K_i} | ⟨1, z_(φ^i)⟩ = 1 } in terms of our setup in Eq. (1), where 1 is the K_i-dimensional all-ones vector and ⟨·, ·⟩ denotes the inner product of two vectors. Because the equality constraint ensures that the vector z_(φ^i) has one and only one element equal to 1, the size of Z_(φ^i) is also K_i.

To further help understand the MLKT transformation from Y to Z = {z ∈ {0, 1}^L | E^T z = t}, we give the following example.

Example 2. Assume the output space of MDC is Y = Y_1 × Y_2, where Y_1 := {1, 2, 3} and Y_2 := {1, 2}. The MLKT approach follows two steps:

1. Define φ^i and L.
2. Define the indicator matrix E and the vector t.

According to Eq. (4), we get φ^1 = {1, 2, 3}, φ^2 = {4}, L = 4, E_1 = (1, 1, 1, 0)^T, E_2 = (0, 0, 0, 0)^T, E = [E_1, E_2] and t = [1, 0]^T, respectively. Thus, Z can be defined as Z = {z ∈ {0, 1}^4 | E^T z = t}.

Under such a transformation, the number of instances per class in the formed problem appears to be N/2, which is far more than that in the problem formed by LP, i.e., (1/∏_{i=1}^m K_i)·N. Moreover, although on the surface it also appears to be more than that in the original problem, i.e., (1/K_i)·N, this is not actually the case. By the decomposability of MLKT and the consistency of the size of Z_(φ^i) with that of Y_i, if the original MDC problem has N_i instances categorized as y^i, then the formed problem also has N_i instances categorized as z_(φ^i), meaning that the formed problem keeps consistent with the original MDC problem in both the number of instances per class and the balance degree. Naturally, MLKT avoids the class overfitting and class imbalance problems. Moreover, the explicit within-dimension relationship is reflected by the commonly-used one-vs-all coding [18]. In this way, MLKT guarantees all the desired transformation characteristics. From now on, we just need to focus on learning the structure of the problem formed by MLKT in order to reveal the original structure.
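The indicator matrix E and vector t of Eqs. (2)-(4) can be built mechanically from (K_1, ..., K_m). The following is a minimal sketch assuming numpy, with Example 2's setting used to check membership in Z.

```python
# Construct E and t (Eqs. (2)-(3)) and verify E^T z = t for an MLKT-encoded vector.
import numpy as np

def build_E_t(K):
    L = sum(Ki if Ki >= 3 else 1 for Ki in K)
    E = np.zeros((L, len(K)), dtype=int)
    t = np.zeros(len(K), dtype=int)
    pos = 0
    for i, Ki in enumerate(K):
        width = Ki if Ki >= 3 else 1
        if Ki >= 3:                      # only non-binary dimensions impose a sum-to-one constraint
            E[pos:pos + width, i] = 1
            t[i] = 1
        pos += width
    return E, t

E, t = build_E_t([3, 2])                 # Example 2's setting
z = np.array([0, 1, 0, 1])               # the MLKT code of y = (2, 2)
print(np.array_equal(E.T @ z, t))        # True: z lies in Z = {z in {0,1}^L | E^T z = t}
```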

4. Learning for the transformed problem

Notations. First, we give the notation used in the sequel. For a matrix A ∈ R^{p×q}, ∥A∥_F = sqrt(Σ_i Σ_j A_{ij}^2) denotes its Frobenius norm. For a positive definite matrix B ≻ 0, B^{-1} denotes its inverse. And we use ∥x_1 − x_2∥_B^2 = (x_1 − x_2)^T B (x_1 − x_2) to denote the (squared) Mahalanobis distance between vectors x_1 and x_2.

4.1. Model construction and optimization

Given the training instances D = {(x_i, y_i)}_{i=1}^N (y_i ∈ Y), we obtain the corresponding re-labeled instances D' = {(x_i, z_i)}_{i=1}^N (z_i ∈ Z) by the MLKT transformation. The remaining problem is then to build a classifier g that assigns each instance x a vector z of class values:

    x_i = (x_i^1, ..., x_i^d) ↦ z_i = (z_i^1, ..., z_i^L)

Let X ∈ R^{N×d} denote the input matrix and Z ∈ {0, 1}^{N×L} the output matrix whose rows are the transformed class vectors z_i. To solve the transformed problem, a simple linear regression model learns the matrix P through the following formulation:

    argmin_{P ∈ R^{d×L}}  (1/2) ∥Z − XP∥_F^2 + γ ∥P∥_F^2,    (5)

where γ ≥ 0 is a regularization parameter. However, this method usually yields low classification performance because it does not take the correlations in the output space into account [19]. To consider these correlations, [20, 19] proposed to learn a discriminative Mahalanobis distance metric which makes the distance between P^T x_i and z_i smaller than that between P^T x_i and any other output z in the output space. Unfortunately, neither [20] nor [19] is directly applicable to our transformed problem; we instead develop an alternative metric learning method well suited to our scenario, which nicely admits a closed-form solution (our Mahalanobis metric learning method is similar to [20, 19]; we detail the connections in Section 4.3). Its formulation is as follows:

    argmin_{Ω ≻ 0}  Σ_i ∥P^T x_i − z_i∥²_Ω + Σ_i Σ_{z ∈ Z\z_i} (1/|Z\z_i|) ∥P^T x_i − z∥²_{Ω^{-1}},    (6)

where P is the solution of the linear regression model (5) and Ω is a positive definite matrix. In the above, the first term aims to make the distance between P^T x_i and z_i smaller, and the second term makes the distance between P^T x_i and any other output z larger. The main idea of using Ω^{-1} is motivated by [21], where Ω^{-1} is used to measure the distances between dissimilar points. The goal is to increase

the Mahalanobis distance (under Ω) between P^T x_i and any other output z by decreasing ∥P^T x_i − z∥²_{Ω^{-1}} (see Proposition 1 of [21]).

Because the size of Z grows exponentially with the dimension L, we only consider the k nearest neighbors (kNN) of z_i among the training outputs instead of all other outputs in the whole output space. Moreover, a regularization term is used to avoid overfitting. We therefore formulate the distance metric learning method as follows:

    argmin_{Ω ≻ 0}  λ D_sld(Ω, I) + Σ_i ∥P^T x_i − z_i∥²_Ω + Σ_i Σ_{z ∈ kNN(z_i)\z_i} (1/k) ∥P^T x_i − z∥²_{Ω^{-1}},    (7)

where λ ≥ 0, P is fixed to the solution of (5), I is the identity matrix, and D_sld(Ω, I) is the symmetrized LogDet divergence D_sld(Ω, I) := tr(Ω) + tr(Ω^{-1}) − 2L. Further define

    S := Σ_i (P^T x_i − z_i)(P^T x_i − z_i)^T,    (8)

    D := Σ_i Σ_{z ∈ kNN(z_i)\z_i} (1/k) (P^T x_i − z)(P^T x_i − z)^T.    (9)

Using both S and D, the minimization problem (7) can be recast as

    argmin_{Ω ≻ 0}  λ D_sld(Ω, I) + tr(ΩS) + tr(Ω^{-1}D).    (10)

Interestingly, the minimization problem (10) is the same as problem (13) of [21], and is both strictly convex and strictly geodesically convex (Theorem 3 of [21]), thus having a global optimal solution. What's more, it has the closed-form solution

    Ω = (S + λI)^{-1} ♯_{1/2} (D + λI),    (11)

where A ♯_{1/2} B := A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}.
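For concreteness, the quantities S and D and the closed-form metric of Eq. (11) can be computed along the following lines. This is a sketch assuming numpy, scipy and scikit-learn, not the authors' implementation; the general exponent t in the helper also gives the weighted mean used later in Eq. (14).

```python
# Sketch of the core computation: ridge solution of Eq. (5), S and D of Eqs. (8)-(9),
# and the geometric-mean metric of Eq. (11).
import numpy as np
from scipy.linalg import sqrtm, fractional_matrix_power, inv
from sklearn.neighbors import NearestNeighbors

def ridge_P(X, Z, gamma):
    # Closed form of Eq. (5): minimize 0.5*||Z - XP||_F^2 + gamma*||P||_F^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * gamma * np.eye(d), X.T @ Z)

def build_S_D(X, Z, P, k):
    R = X @ P                                   # row i is (P^T x_i)^T
    S = (R - Z).T @ (R - Z)                     # Eq. (8)
    idx = NearestNeighbors(n_neighbors=k + 1).fit(Z).kneighbors(Z, return_distance=False)
    D = np.zeros((Z.shape[1], Z.shape[1]))
    for i in range(Z.shape[0]):
        for j in idx[i, 1:]:                    # k nearest codes other than z_i (assumes
            diff = R[i] - Z[j]                  # the closest neighbour is z_i itself)
            D += np.outer(diff, diff) / k       # Eq. (9)
    return S, D

def geometric_mean(A, B, t=0.5):
    # A #_t B = A^{1/2} (A^{-1/2} B A^{-1/2})^t A^{1/2}
    A_half = sqrtm(A)
    A_half_inv = inv(A_half)
    return np.real(A_half @ fractional_matrix_power(A_half_inv @ B @ A_half_inv, t) @ A_half)

def gmml_metric(S, D, lam, t=0.5):
    L = S.shape[0]
    return geometric_mean(inv(S + lam * np.eye(L)), D + lam * np.eye(L), t)   # Eq. (11)
```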

In fact, this solution is the midpoint of the geodesic joining (S + λI)^{-1} and (D + λI). The geodesic viewpoint is important for trading off between (S + λI)^{-1} and (D + λI). Note that Ω := (S + λI)^{-1} ♯_{1/2} (D + λI) is also the minimizer of the following problem (12), according to [21]:

    argmin_{Ω ≻ 0}  δ_R^2(Ω, (S + λI)^{-1}) + δ_R^2(Ω, (D + λI)),    (12)

where δ_R denotes the Riemannian distance δ_R(U, V) := ∥log(V^{-1/2} U V^{-1/2})∥_F for U, V ≻ 0. Thus, we can obtain a balanced version of problem (10) between S and D:

    argmin_{Ω ≻ 0}  (1 − t) δ_R^2(Ω, (S + λI)^{-1}) + t δ_R^2(Ω, (D + λI)),   t ∈ [0, 1].    (13)

Interestingly, it can be shown (see [22], Ch. 6) that the unique solution to problem (13) is

    Ω = (S + λI)^{-1} ♯_t (D + λI),    (14)

where A ♯_t B := A^{1/2} (A^{-1/2} B A^{-1/2})^t A^{1/2}. The solution connects to the Riemannian geometry of symmetric positive definite (SPD) matrices, and we therefore denote it as gMML. We detail the whole learning procedure in Algorithm 1.

Algorithm 1: MLKT-gMML
Input: the MDC training set D = {(x_i, y_i)}_{i=1}^N (y_i ∈ Y); the preset hyper-parameters k, λ, γ and t.
Output: the regression matrix P and the distance metric Ω.
1: Transform D to D' by the MLKT approach: D' = {(x_i, z_i)}_{i=1}^N (z_i ∈ Z).
2: Set P := argmin_{P ∈ R^{d×L}} (1/2) ∥Z − XP∥_F^2 + γ ∥P∥_F^2.
3: Compute S and D by Eq. (8) and Eq. (9).
4: Set Ω := (S + λI)^{-1} ♯_t (D + λI).
5: Return P and Ω.

Note that P has an impact on learning Ω and, conversely, Ω has an impact on learning P as well. Thus, P can be obtained by optimizing the following problem:

    P := argmin_{P ∈ R^{d×L}}  (1/2) ∥Z − XP∥²_Ω + γ ∥P∥_F^2.    (15)

Its solution boils down to solving the Sylvester equation (X^T X) P + γ P Ω^{-1} = X^T Z. A classical algorithm for solving such an equation is the Bartels-Stewart algorithm [23]. In a nutshell, an iterative algorithm for learning P and Ω, called gMML-I, is detailed in Algorithm 2.

Algorithm 2: gMML-I
Input: the MDC training set D = {(x_i, y_i)}_{i=1}^N (y_i ∈ Y); the number of iterations η; the preset hyper-parameters k, λ, γ and t.
Output: the regression matrix P and the distance metric Ω.
1: Transform D to D' by the MLKT approach: D' = {(x_i, z_i)}_{i=1}^N (z_i ∈ Z).
2: Set Ω_init = I.
3: Repeat:
4:   Set P := argmin_{P ∈ R^{d×L}} (1/2) ∥Z − XP∥²_Ω + γ ∥P∥_F^2.
5:   Compute S and D by Eq. (8) and Eq. (9).
6:   Set Ω := (S + λI)^{-1} ♯_t (D + λI).
7: Until η iterations are reached.
8: Return P and Ω.
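The P-update of step 4 in Algorithm 2 can be carried out with a standard Sylvester-equation solver; below is a minimal sketch assuming numpy and scipy (scipy.linalg.solve_sylvester implements the Bartels-Stewart algorithm mentioned above).

```python
# One P-update of gMML-I (Eq. (15)): the stationarity condition stated in the text,
# (X^T X) P + gamma * P * Omega^{-1} = X^T Z, is a Sylvester equation A P + P B = Q.
import numpy as np
from scipy.linalg import solve_sylvester, inv

def update_P(X, Z, Omega, gamma):
    A = X.T @ X                 # left coefficient
    B = gamma * inv(Omega)      # right coefficient
    Q = X.T @ Z                 # right-hand side
    return solve_sylvester(A, B, Q)
```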

4.2. Prediction for a new instance

Based on the learned model, the output z of a new instance x can be predicted by solving the following optimization problem:

    min_{z ∈ Z}  (1/2) ∥z − P^T x∥²_Ω.    (16)

This is equivalent to a binary quadratic optimization problem with equality constraints, namely

    min_z  (1/2) ∥z − P^T x∥²_Ω   s.t.  E^T z = t,  z ∈ {0, 1}^L.    (17)

The optimization problem (17) is very difficult to solve due to its NP-hardness. Instead, we replace the binary constraints with 0 ≤ z ≤ 1, so that the NP-hard optimization problem is converted into a simple box-constrained quadratic program:

    min_v  (1/2) ∥v − P^T x∥²_Ω   s.t.  E^T v = t,  v ∈ [0, 1]^L.    (18)

Now, for each index set φ^i, the prediction z_(φ^i) for x is made as

    if |φ^i| ≥ 3:  z_(φ^i_j) = 1 if j = argmax_k v_(φ^i_k) (k = 1, ..., K_i), and 0 otherwise;
    if |φ^i| = 1:  z_(φ^i) = round(v_(φ^i)),    (19)

where round(·) rounds the prediction into a 0/1 assignment. In turn, the prediction y^i for x in the original output space is

    y^i = j if |φ^i| ≥ 3,   and   y^i = z_(φ^i) + 1 if |φ^i| = 1.    (20)

Algorithm 3 details the prediction procedure.

Algorithm 3: Predict a new instance x
Input: the learned regression matrix P and distance metric Ω; the new instance x.
Output: the predicted class vector y.
1: Solve z := argmin_{z ∈ Z} ∥z − P^T x∥²_Ω (via the relaxation (18)).
2: Inverse transformation y ← z according to Eq. (19) and Eq. (20).
3: Return y.
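A possible realization of Algorithm 3, i.e. the relaxation (18) followed by the rounding of Eqs. (19)-(20), is sketched below. It assumes numpy and scipy and uses a general-purpose SLSQP solver rather than a dedicated QP solver.

```python
# Relaxed decoding (Eq. (18)) and inverse MLKT mapping (Eqs. (19)-(20)), as a sketch.
import numpy as np
from scipy.optimize import minimize

def decode(x, P, Omega, E, t, K):
    r = P.T @ x                                        # P^T x, the point to decode
    obj  = lambda v: 0.5 * (v - r) @ Omega @ (v - r)   # 0.5 * ||v - P^T x||_Omega^2
    grad = lambda v: Omega @ (v - r)
    cons = {'type': 'eq', 'fun': lambda v: E.T @ v - t}
    res = minimize(obj, np.clip(r, 0, 1), jac=grad,
                   bounds=[(0.0, 1.0)] * r.size, constraints=[cons], method='SLSQP')
    v, y, pos = res.x, [], 0
    for Ki in K:                                       # invert MLKT block by block
        width = Ki if Ki >= 3 else 1
        block = v[pos:pos + width]
        if Ki >= 3:
            y.append(int(np.argmax(block)) + 1)        # Eqs. (19)-(20): index of the active bit
        else:
            y.append(int(round(block[0])) + 1)         # Eqs. (19)-(20): {0, 1} -> {1, 2}
        pos += width
    return np.array(y)
```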

4.3. Connections between existing metric learning methods and ours

Two works are most closely related to our metric learning method, namely maximum margin output coding (MMOC) [20] and large margin metric learning with kNN constraints (LM-kNN) [19]. These two methods likewise use a Mahalanobis distance metric (a symmetric positive semidefinite matrix, Ω ∈ S_+) to model the output structure of MLC, where the Mahalanobis distance metric is used to learn a lower-dimensional space.

MMOC aims to learn a discriminative Mahalanobis metric which makes the distance between P^T x_i and its real class vector z_i as close to 0 as possible and smaller than the distance between P^T x_i and any other output by some margin. Specifically, its formulation is as follows:

    argmin_{Ω ∈ S_+, {ξ_i}_{i=1}^n}  (1/2) trace(Ω) + (C/n) Σ_{i=1}^n ξ_i
    s.t.  φ_{iz_i}^T Ω φ_{iz_i} + Δ(z_i, z) − ξ_i ≤ φ_{iz}^T Ω φ_{iz},   ∀z ∈ {0, 1}^L, ∀i,    (21)

where C is a positive constant, φ_{iz} = P^T x_i − z and φ_{iz_i} = P^T x_i − z_i. MMOC has proved to give good classification accuracy for the MLC task. However, it also carries a heavy burden: it has to handle an exponentially large number of constraints for each instance during training, which leads to computational infeasibility.

Like MMOC, LM-kNN also adopts a Mahalanobis metric learning method for MLC, but it involves only k constraints for each instance. Its distance metric learning attempts to bring instances with similar class vectors closer, so that the class vector of each instance can be predicted from its nearest neighbors. In fact, LM-kNN is much simpler than MMOC and is established by minimizing the following objective:

    argmin_{Ω ∈ S_+, {ξ_i}_{i=1}^n}  (1/2) trace(Ω) + (C/n) Σ_{i=1}^n ξ_i
    s.t.  φ_{iz_i}^T Ω φ_{iz_i} + Δ(z_i, z) − ξ_i ≤ φ_{iz}^T Ω φ_{iz},   ∀z ∈ Nei(i), ∀i,    (22)

where C, φ_{iz} and φ_{iz_i} are defined as in MMOC, and Nei(i) is the set of the k nearest neighboring outputs of input x_i.

For LM-kNN, the prediction for a test instance is obtained from its k nearest neighbors in the learned metric space. Specifically, for the test

input x, we find its k nearest instances {x_1, ..., x_k} in the training set; then a set of scores for each candidate class vector of x is obtained from the distances between x and {x_1, ..., x_k}; lastly, these scores are used to predict its class vector by thresholding.

Clearly, neither MMOC nor LM-kNN can be applied to our transformed problem, because the output space of our transformed problem is not equivalent to that of MLC. Although they could be adapted to our scenario with some effort, this effort is non-trivial because their training and/or prediction would have to be re-designed; moreover, it is not our present focus. As a result, we choose an alternative design for our Mahalanobis distance metric learning, where our method is formally close to MMOC but has a closed-form solution, as described in Section 4.1.

4.4. Complexity analysis

The time complexity of the regularized least squares regression is basically the complexity of the matrix multiplications, O(Nd² + NdL), plus the complexity of the matrix inversion, O(d³). The complexity of computing the geometric mean of two matrices by the Cholesky-Schur method [24] is O(L³). The complexity of solving a Sylvester equation is O(d³ + L³ + Nd² + NdL). The complexity of solving a box-constrained quadratic program is O(L³ + Ld). And the time complexity of kNN is O(kN).

Algorithm 1 involves solving a regularized least squares regression problem and computing the geometric mean of two matrices. Therefore, its total time complexity is O(Nd² + NdL + d³ + L³ + kN). Algorithm 2 involves solving a Sylvester equation and computing the geometric mean of two matrices over η iterations (where η is usually fixed to a preset small integer). Therefore, its time complexity is O(η(d³ + L³ + Nd² + NdL + kN)). Algorithm 3 involves solving a box-constrained quadratic program, so its time complexity is O(L³ + Ld).

Based on [19], the training and prediction time complexities of LM-kNN are respectively O((1/ε)(Nd² + NdL + L³ + d³ + kNdL²)) and O(LN + Ld), where ε is the accuracy reached by its solution. The training and prediction time complexities

of MMOC are respectively O(θ(Nd² + NdL + d³ + NL³ + N⁴)) and O(L³), where θ is its number of iterations. In comparison with the metric learning counterparts used here, our Algorithm 1 and Algorithm 2 have an advantage over MMOC and LM-kNN in terms of training time, since η is much smaller than both 1/ε and θ. The prediction time complexity of our Algorithm 3 is comparable to that of MMOC and higher than that of LM-kNN.

5. Experiments

In this section, we discuss the experiments conducted on two publicly available real-world MDC datasets, ImageCLEF2014 and Bridges. ImageCLEF2014 comes from a real-world challenge in the field of robot vision [25], and the Bridges dataset comes from the UCI collection [26]. Unfortunately, there are not yet many publicly available standardized multi-dimensional datasets, so we augment our collection with the eight most commonly used multi-label datasets, which can be accessed from the Mulan repository. The characteristics of these datasets are shown in Table 1.

Table 1: Datasets used in the evaluation (dataset, number of class variables, number of features, number of instances): birds, emotions, medical, scene, yeast, flags, genbase, CAL500, bridges, ImageCLEF2014.

We consider two commonly used evaluation criteria for MDC, namely Hamming accuracy and Example accuracy.

Table 2: Hamming Accuracy (Part A) of MLKT-RR, MLKT-gMML and MLKT-gMML-I (mean ± std) on the birds, emotions, medical, scene and yeast datasets.

These evaluation criteria are calculated as follows:

1. Hamming accuracy:

    Acc = (1/m) Σ_{j=1}^m Acc_j = (1/m) Σ_{j=1}^m (1/N) Σ_{i=1}^N δ(y_i^j, ŷ_i^j),

where δ(y_i^j, ŷ_i^j) = 1 if ŷ_i^j = y_i^j, and 0 otherwise. Here ŷ_i^j denotes the j-th class value predicted by the classifier for instance i, and y_i^j is its true value.

2. Example accuracy:

    Acc = (1/N) Σ_{i=1}^N δ(y_i, ŷ_i),

where δ(y_i, ŷ_i) = 1 if ŷ_i = y_i, and 0 otherwise.

Before the experiments, some parameters need to be set in advance. The parameter η of the gMML-I algorithm is always set to 3 throughout our experiments (because when η > 3, we observe no further changes in Ω and P). The parameters λ and t associated with Ω are tuned over the ranges {10^0, 10^1, 10^2} and {0.3, 0.5, 0.7}, respectively. The parameter γ for P is tuned over the range {0, 0.1, 0.2}. All the following experimental results are the averages of 10-fold cross-validation experiments, and the best result on each dataset is marked.

5.1. Comparison with our baseline methods

We first verify the classification accuracy of MLKT-gMML-I in comparison with both the ridge regression model (namely Ω = I, denoted MLKT-RR) and the algorithm without the iteration procedure (namely MLKT-gMML). The results are shown in Tables 2, 3, 4 and 5.
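For reference, the two evaluation criteria defined above can be computed as follows (a small sketch assuming numpy; the toy arrays are purely illustrative).

```python
# Hamming accuracy and Example accuracy for MDC predictions: per-dimension 0/1
# agreement averaged over dimensions vs. exact match of the whole class vector.
import numpy as np

def hamming_accuracy(Y_true, Y_pred):
    return float(np.mean(Y_true == Y_pred))

def example_accuracy(Y_true, Y_pred):
    return float(np.mean(np.all(Y_true == Y_pred, axis=1)))

Y_true = np.array([[1, 2], [3, 1], [2, 2]])
Y_pred = np.array([[1, 2], [3, 2], [1, 2]])
print(hamming_accuracy(Y_true, Y_pred), example_accuracy(Y_true, Y_pred))  # 0.667, 0.333
```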

Table 3: Hamming Accuracy (Part B) of MLKT-RR, MLKT-gMML and MLKT-gMML-I (mean ± std) on the flags, genbase, CAL500, ImageCLEF2014 and bridges datasets.

Table 4: Example Accuracy (Part A) of MLKT-RR, MLKT-gMML and MLKT-gMML-I (mean ± std) on the birds, emotions, medical, scene and yeast datasets.

From the results, we can see that MLKT-gMML-I achieves nearly the best accuracy on all these datasets with respect to both evaluation criteria. To verify whether the differences are significant, two non-parametric Friedman tests among these methods are conducted, for Hamming accuracy and for Example accuracy respectively.

For Hamming accuracy, the Friedman test renders an F value exceeding the critical value F_(α, k−1, (b−1)(k−1)) = F_(0.05, 2, 18) = 3.555 (here, F is the percent point function of the F distribution, α is the significance level, b is the number of datasets and k is the number of algorithms under test). Thus, the null hypothesis that all the methods have identical effects is rejected, and a post-hoc test needs to be conducted to further examine their differences. To this end, a commonly-used post-hoc test, the Nemenyi test, is conducted. The result is shown in Figure 4, from which we can see that: 1) MLKT-gMML-I is significantly different from our other two methods; 2) MLKT-gMML achieves a performance comparable to MLKT-RR.

Table 5: Example Accuracy (Part B) of MLKT-RR, MLKT-gMML and MLKT-gMML-I (mean ± std) on the flags, genbase, CAL500, ImageCLEF2014 and bridges datasets.
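The significance analysis used in this and the following subsections can be reproduced along these lines (a sketch with randomly generated placeholder accuracies; scipy provides the chi-square form of the Friedman statistic, from which the F form reported here is derived).

```python
# Friedman test across methods, F form (Iman-Davenport): F = (b-1)*chi2 / (b*(k-1) - chi2),
# compared against the F(k-1, (b-1)(k-1)) critical value.
import numpy as np
from scipy.stats import friedmanchisquare, f

acc = np.random.rand(10, 3)                          # b = 10 datasets, k = 3 methods (placeholder)
chi2, _ = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])
b, k = acc.shape
F_stat = (b - 1) * chi2 / (b * (k - 1) - chi2)
F_crit = f.ppf(0.95, k - 1, (b - 1) * (k - 1))       # e.g. F(0.05, 2, 18) = 3.555
print(F_stat > F_crit)                               # reject the null hypothesis if True
```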

Figure 4: Friedman test of our methods in terms of Hamming accuracy. The horizontal axis gives the mean-rank values and the vertical axis lists the methods under test (MLKT-gMML, MLKT-RR, MLKT-gMML-I). For each method, the marker shows its mean rank and the line segment shows the critical range of the Nemenyi test. Two methods differ significantly if their line segments do not overlap, and do not otherwise.

For Example accuracy, the Friedman test renders an F value exceeding the critical value F_(α, k−1, (b−1)(k−1)) = F_(0.05, 2, 18) = 3.555, meaning that the null hypothesis is also rejected. A post-hoc Nemenyi test is again conducted. Its result is shown in Figure 5 and indicates that: 1) MLKT-gMML-I is significantly different from MLKT-RR; 2) MLKT-gMML again achieves a performance comparable to MLKT-RR.

On the whole, we can conclude that MLKT-gMML-I achieves the best classification performance, while MLKT-gMML achieves a performance comparable to MLKT-RR. Therefore, in the following we concentrate on the comparison between MLKT-gMML-I and the other competitive MDC methods.

5.2. Comparison with several competitive MDC methods

We then compare MLKT-gMML-I with several competitive MDC methods from the literature: Binary Relevance (BR), Classifier Chains (CC), Ensemble of Classifier Chains (ECC), RAkEL and the Super-Class Classifier (SCC). Since the above methods are only designed for modeling output structure, the naive Bayes classifier is used as their base classifier in our experiments. We use an open-source Java framework, namely the MEKA library [27], for the experiments. Regarding the parameterization of these approaches, ECC is

Figure 5: Friedman test of our methods in terms of Example accuracy. The horizontal axis gives the mean-rank values and the vertical axis lists the methods under test (MLKT-RR, MLKT-gMML, MLKT-gMML-I). For each method, the marker shows its mean rank and the line segment shows the critical range of the Nemenyi test. Two methods differ significantly if their line segments do not overlap, and do not otherwise.

Table 6: Hamming Accuracy (Part A) of BR-NB, CC-NB, ECC-NB, RAkEL, SCC and MLKT-gMML-I (mean ± std) on the birds, emotions, medical, scene and yeast datasets.

configured to learn 10 different models for the ensemble; for RAkEL we use the recommended configuration with 2m models over triplets of class combinations; and for SCC we use a nearest-neighbour replacement filter (NNR) to identify all p = 1 infrequent class values and replace them with their n = 2 most frequent nearest neighbours. The Hamming accuracy and Example accuracy of these methods are shown in Tables 6, 7, 8 and 9 respectively.

From the results in these tables, we see that MLKT-gMML-I achieves better performance than the competitive MDC methods (BR, CC, ECC, RAkEL and SCC) on most of the datasets in terms of both evaluation criteria. To verify the performance differences, two non-parametric Friedman tests among these methods are conducted, for Hamming accuracy and for Example accuracy respectively.

Table 7: Hamming Accuracy (Part B) of BR-NB, CC-NB, ECC-NB, RAkEL, SCC and MLKT-gMML-I (mean ± std) on the flags, genbase, CAL500, ImageCLEF2014 and bridges datasets.

Table 8: Example Accuracy (Part A) of BR-NB, CC-NB, ECC-NB, RAkEL, SCC and MLKT-gMML-I (mean ± std) on the birds, emotions, medical, scene and yeast datasets.

Table 9: Example Accuracy (Part B) of BR-NB, CC-NB, ECC-NB, RAkEL, SCC and MLKT-gMML-I (mean ± std) on the flags, genbase, CAL500, ImageCLEF2014 and bridges datasets.

For Hamming accuracy, the Friedman test renders an F value exceeding the critical value F_(α, k−1, (b−1)(k−1)) = F_(0.05, 5, 45) = 2.422. Thus, the null hypothesis that all the methods are identical is rejected, and a post-hoc test needs to be conducted to further examine their differences. To this end, a commonly-used post-hoc test, the Nemenyi test, is conducted. The result is shown in Figure 6, from which we can see that: 1) MLKT-gMML-I is significantly different from two of the methods (BR-NB and CC-NB); 2) there is no significant difference among the remaining methods. Therefore, MLKT-gMML-I achieves slightly better classification performance than the competitive MDC methods.

Figure 6: Friedman test in terms of Hamming accuracy. The horizontal axis gives the mean-rank values and the vertical axis lists the methods under test (BR-NB, CC-NB, ECC-NB, RAkEL, SCC, MLKT-gMML-I). For each method, the marker shows its mean rank and the line segment shows the critical range of the Nemenyi test. Two methods differ significantly if their line segments do not overlap, and do not otherwise.

For Example accuracy, the Friedman test renders an F value exceeding the critical value F_(α, k−1, (b−1)(k−1)) = F_(0.05, 5, 45) = 2.422. Thus, the null hypothesis that all the methods are identical is rejected, and the Nemenyi post-hoc test is conducted as above. The result is shown in Figure 7, from which we can see that: 1) MLKT-gMML-I is significantly different from BR-NB; 2) there is no significant difference among the methods other than BR-NB. Therefore, MLKT-gMML-I achieves an Example accuracy comparable to that of the competitive methods.

On the whole, we can conclude that MLKT-gMML-I achieves comparable

Figure 7: Friedman test in terms of Example accuracy. The horizontal axis gives the mean-rank values and the vertical axis lists the methods under test (BR-NB, CC-NB, ECC-NB, RAkEL, SCC, MLKT-gMML-I). For each method, the marker shows its mean rank and the line segment shows the critical range of the Nemenyi test. Two methods differ significantly if their line segments do not overlap, and do not otherwise.

(or even slightly better) classification performance with respect to the competitive MDC methods.

5.3. Comparison with LM-kNN on the MLC task

Note that gMML-I is closely related to MMOC and LM-kNN, and the latter two are specially designed for MLC; we therefore conduct experiments on the MLC datasets to compare their classification performance. However, since MMOC has to deal with an exponentially large number of constraints for each instance in its training procedure, it is infeasible even for the CAL500 dataset with 68 features and 174 labels [19]. Therefore, we only compare gMML-I with LM-kNN. The results are shown in Figure 8 and Figure 9.

We can see from the figures that gMML-I achieves better performance than LM-kNN on six datasets in terms of Hamming accuracy and on four datasets in terms of Example accuracy (both methods achieve zero Example accuracy on CAL500). So, on the whole, gMML-I achieves better classification on most of the datasets. To verify their difference, Friedman tests of the differences between gMML-I and LM-kNN are conducted; the resulting F values, for Hamming accuracy and for Example accuracy respectively,

Figure 8: Hamming Accuracy (HA) of gMML-I and LM-kNN on the birds, emotions, medical, scene, yeast, flags, genbase and CAL500 datasets.

Figure 9: Example Accuracy (EA) of gMML-I and LM-kNN on the birds, emotions, medical, scene, yeast, flags, genbase and CAL500 datasets.

are both not significant (< F_(α, k−1, (b−1)(k−1)) = F_(0.05, 1, 7) = 5.595). So gMML-I obtains classification accuracy on the MLC task competitive with LM-kNN, while having a lower learning complexity than LM-kNN, as analysed in Section 4.4.

6. Conclusions

In this paper, we proposed a new transformation approach for MDC, namely MLKT, which possesses the following favorable characteristics: i) it keeps the output space size of MDC invariant; ii) it reflects the explicit within-dimension relationships; iii) it is easy for subsequent modeling in the transformed space; iv) it overcomes the class overfitting and class imbalance problems suffered by LP-based transformation approaches; v) it is decomposable for each output dimension of MDC. Moreover, we also presented a novel metric learning based method for the transformed problem, which is of independent interest in its own right and has a closed-form solution. Extensive experimental results justified that our approach combining the above two procedures achieves better classification performance than the competitive MDC methods, while our metric learning based method itself also obtains competitive classification performance with a lower learning complexity compared to its counterparts designed specifically for MLC. As mentioned in the introduction, a natural future direction is to adapt further MLC methods into alternatives well suited to our transformed problem.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China and in part by the Funding of Jiangsu Innovation Program for Graduate Education under Grant KYLX. We would also like to express our appreciation for the valuable comments from the reviewers and editors.


More information

Large-Margin Thresholded Ensembles for Ordinal Regression

Large-Margin Thresholded Ensembles for Ordinal Regression Large-Margin Thresholded Ensembles for Ordinal Regression Hsuan-Tien Lin and Ling Li Learning Systems Group, California Institute of Technology, U.S.A. Conf. on Algorithmic Learning Theory, October 9,

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Machine Learning, Midterm Exam

Machine Learning, Midterm Exam 10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

An Ensemble of Bayesian Networks for Multilabel Classification

An Ensemble of Bayesian Networks for Multilabel Classification Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence An Ensemble of Bayesian Networks for Multilabel Classification Antonucci Alessandro, Giorgio Corani, Denis Mauá,

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Midterm Exam, Spring 2005

Midterm Exam, Spring 2005 10-701 Midterm Exam, Spring 2005 1. Write your name and your email address below. Name: Email address: 2. There should be 15 numbered pages in this exam (including this cover sheet). 3. Write your name

More information

Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes

Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes Mixtures of Gaussians with Sparse Regression Matrices Constantinos Boulis, Jeffrey Bilmes {boulis,bilmes}@ee.washington.edu Dept of EE, University of Washington Seattle WA, 98195-2500 UW Electrical Engineering

More information

CSE 546 Final Exam, Autumn 2013

CSE 546 Final Exam, Autumn 2013 CSE 546 Final Exam, Autumn 0. Personal info: Name: Student ID: E-mail address:. There should be 5 numbered pages in this exam (including this cover sheet).. You can use any material you brought: any book,

More information

ECE 592 Topics in Data Science

ECE 592 Topics in Data Science ECE 592 Topics in Data Science Final Fall 2017 December 11, 2017 Please remember to justify your answers carefully, and to staple your test sheet and answers together before submitting. Name: Student ID:

More information

Final Exam, Fall 2002

Final Exam, Fall 2002 15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Homework 1 Solutions Probability, Maximum Likelihood Estimation (MLE), Bayes Rule, knn

Homework 1 Solutions Probability, Maximum Likelihood Estimation (MLE), Bayes Rule, knn Homework 1 Solutions Probability, Maximum Likelihood Estimation (MLE), Bayes Rule, knn CMU 10-701: Machine Learning (Fall 2016) https://piazza.com/class/is95mzbrvpn63d OUT: September 13th DUE: September

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

Introduction to Logistic Regression

Introduction to Logistic Regression Introduction to Logistic Regression Guy Lebanon Binary Classification Binary classification is the most basic task in machine learning, and yet the most frequent. Binary classifiers often serve as the

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 5: Multi-class and preference learning Juho Rousu 11. October, 2017 Juho Rousu 11. October, 2017 1 / 37 Agenda from now on: This week s theme: going

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Approximating the Covariance Matrix with Low-rank Perturbations

Approximating the Covariance Matrix with Low-rank Perturbations Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Robotics 2 AdaBoost for People and Place Detection

Robotics 2 AdaBoost for People and Place Detection Robotics 2 AdaBoost for People and Place Detection Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Wolfram Burgard v.1.0, Kai Arras, Oct 09, including material by Luciano Spinello and Oscar Martinez Mozos

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification

More information

Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers

Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers Giuliano Armano, Francesca Fanni and Alessandro Giuliani Dept. of Electrical and Electronic Engineering, University

More information

Learning Binary Classifiers for Multi-Class Problem

Learning Binary Classifiers for Multi-Class Problem Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,

More information

3 : Representation of Undirected GM

3 : Representation of Undirected GM 10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:

More information

Unsupervised machine learning

Unsupervised machine learning Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels

More information

Linear Classifiers. Michael Collins. January 18, 2012

Linear Classifiers. Michael Collins. January 18, 2012 Linear Classifiers Michael Collins January 18, 2012 Today s Lecture Binary classification problems Linear classifiers The perceptron algorithm Classification Problems: An Example Goal: build a system that

More information

Iterative Laplacian Score for Feature Selection

Iterative Laplacian Score for Feature Selection Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,

More information

A Simple Algorithm for Multilabel Ranking

A Simple Algorithm for Multilabel Ranking A Simple Algorithm for Multilabel Ranking Krzysztof Dembczyński 1 Wojciech Kot lowski 1 Eyke Hüllermeier 2 1 Intelligent Decision Support Systems Laboratory (IDSS), Poznań University of Technology, Poland

More information

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so. CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic

More information

Adaptive Kernel Principal Component Analysis With Unsupervised Learning of Kernels

Adaptive Kernel Principal Component Analysis With Unsupervised Learning of Kernels Adaptive Kernel Principal Component Analysis With Unsupervised Learning of Kernels Daoqiang Zhang Zhi-Hua Zhou National Laboratory for Novel Software Technology Nanjing University, Nanjing 2193, China

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

Kernel Methods & Support Vector Machines

Kernel Methods & Support Vector Machines Kernel Methods & Support Vector Machines Mahdi pakdaman Naeini PhD Candidate, University of Tehran Senior Researcher, TOSAN Intelligent Data Miners Outline Motivation Introduction to pattern recognition

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

arxiv: v1 [stat.ml] 10 Dec 2015

arxiv: v1 [stat.ml] 10 Dec 2015 Boosted Sparse Non-linear Distance Metric Learning arxiv:1512.03396v1 [stat.ml] 10 Dec 2015 Yuting Ma Tian Zheng yma@stat.columbia.edu tzheng@stat.columbia.edu Department of Statistics Department of Statistics

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Metric Learning. 16 th Feb 2017 Rahul Dey Anurag Chowdhury

Metric Learning. 16 th Feb 2017 Rahul Dey Anurag Chowdhury Metric Learning 16 th Feb 2017 Rahul Dey Anurag Chowdhury 1 Presentation based on Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data."

More information

Machine Learning, Fall 2011: Homework 5

Machine Learning, Fall 2011: Homework 5 0-60 Machine Learning, Fall 0: Homework 5 Machine Learning Department Carnegie Mellon University Due:??? Instructions There are 3 questions on this assignment. Please submit your completed homework to

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Undirected graphical models

Undirected graphical models Undirected graphical models Semantics of probabilistic models over undirected graphs Parameters of undirected models Example applications COMP-652 and ECSE-608, February 16, 2017 1 Undirected graphical

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

Kernels for Multi task Learning

Kernels for Multi task Learning Kernels for Multi task Learning Charles A Micchelli Department of Mathematics and Statistics State University of New York, The University at Albany 1400 Washington Avenue, Albany, NY, 12222, USA Massimiliano

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2015 Announcements TA Monisha s office hour has changed to Thursdays 10-12pm, 462WVH (the same

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT) Metric Embedding of Task-Specific Similarity Greg Shakhnarovich Brown University joint work with Trevor Darrell (MIT) August 9, 2006 Task-specific similarity A toy example: Task-specific similarity A toy

More information

SINGLE-TASK AND MULTITASK SPARSE GAUSSIAN PROCESSES

SINGLE-TASK AND MULTITASK SPARSE GAUSSIAN PROCESSES SINGLE-TASK AND MULTITASK SPARSE GAUSSIAN PROCESSES JIANG ZHU, SHILIANG SUN Department of Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 20024, P. R. China E-MAIL:

More information

Conditional Random Field

Conditional Random Field Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Sparse representation classification and positive L1 minimization

Sparse representation classification and positive L1 minimization Sparse representation classification and positive L1 minimization Cencheng Shen Joint Work with Li Chen, Carey E. Priebe Applied Mathematics and Statistics Johns Hopkins University, August 5, 2014 Cencheng

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information