On the Consistency of Multi-Label Learning
Wei Gao and Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
{gaow,zhouzh}@lamda.nju.edu.cn

Abstract

Multi-label learning has attracted much attention during the past few years. Many multi-label learning approaches have been developed, mostly working with surrogate loss functions, since multi-label loss functions are usually difficult to optimize directly owing to non-convexity and discontinuity. Though these approaches are effective, to the best of our knowledge there is no theoretical result on the convergence of the risk of the learned functions to the Bayes risk. In this paper, focusing on two well-known multi-label loss functions, i.e., ranking loss and hamming loss, we prove a necessary and sufficient condition for the consistency of multi-label learning based on surrogate loss functions. Our results disclose that, surprisingly, no convex surrogate loss is consistent with the ranking loss. Inspired by this finding, we introduce the partial ranking loss, with which some surrogate functions are consistent. For hamming loss, we show that some recent multi-label learning approaches are inconsistent even for deterministic multi-label classification, and give a surrogate loss function which is consistent for the deterministic case. Finally, we discuss the consistency of learning approaches which address multi-label learning by decomposing it into a set of binary classification problems.

1 Introduction

In traditional supervised learning, each instance is associated with a single label. In real-world applications, however, one object is usually relevant to multiple labels simultaneously. For example, a document about national education service may be categorized into several predefined topics, such as government and education; an image containing forests may be annotated with trees and mountains.
For learning with such objects, multi-label learning has attracted much attention during the past few years and many effective approaches have been developed (Schapire and Singer, 2000, Elisseeff and Weston, 2002, Zhou and Zhang, 2007, Zhang and Zhou, 2007, Hsu et al., 2009, Dembczyński et al., 2010, Petterson and Caetano, 2010). The consistency (also called Bayes consistency) of learning algorithms concerns whether the expected risk of a learned function converges to the Bayes risk as the training sample size increases (Lin, 2002, Zhang, 2004b, Steinwart, 2005, Bartlett et al., 2006, Tewari and Bartlett, 2007, Duchi et al., 2010). Nowadays, it is well accepted that a good learner should at least be consistent with large samples. It is noteworthy that, though many efforts have been devoted to multi-label learning, few theoretical aspects have been explored; in particular, to the best of our knowledge, the consistency of multi-label learning remains untouched although it is a very important theoretical issue.

In this paper, focusing on two well-known multi-label loss functions, i.e., ranking loss and hamming loss, we present a theoretical study on the consistency of multi-label learning. We prove a necessary and sufficient condition for the consistency of multi-label learning based on surrogate loss functions. Our analysis discloses that, surprisingly, every convex surrogate loss is inconsistent with the ranking loss. Based on this finding, we introduce the partial ranking loss, which is consistent with some surrogate loss functions; we also show that many current multi-label learning approaches are not even consistent with the partial ranking loss. As for hamming loss, our analysis shows that some recent multi-label learning approaches are inconsistent even for deterministic multi-label classification, and we give a surrogate loss which is consistent with hamming loss for the deterministic case.
Finally, we discuss the consistency of approaches which address multi-label learning by transforming the problem into a set of binary classification problems.
The rest of this paper is organized as follows. Section 2 briefly introduces the research background. Section 3 presents our necessary and sufficient condition for the consistency of multi-label learning. Sections 4 and 5 study the consistency of multi-label learning approaches with regard to the ranking loss and hamming loss, respectively. Section 6 gives the detailed proofs.

2 Background

Let X be an instance space and L = {1, 2, ..., Q} a finite set of possible labels. An instance X ∈ X is associated with a subset of labels Y ⊆ L called the relevant labels, while the complement L \ Y is called the irrelevant labels. For convenience of discussion, we represent the labels as a binary vector Y = (y_1, y_2, ..., y_Q), where y_i = +1 if label i is relevant to X and y_i = −1 otherwise, and denote by Y = {+1, −1}^Q the set of all possible label vectors. Let D denote an unknown underlying probability distribution over X × Y. For an integer m > 0, we write [m] = {1, 2, ..., m}, and for a real number r, ⌊r⌋ denotes the greatest integer no more than r.

The formal description of multi-label learning in the probabilistic setting is as follows: given a training sample S = {(X_1, Y_1), (X_2, Y_2), ..., (X_m, Y_m)} drawn i.i.d. according to the distribution D, the objective is to learn a function h: X → Y which is able to assign a set of labels to unseen instances. In general, it is not easy to learn h directly, and in practice one instead learns a real-valued vector function f = (f_1, f_2, ..., f_K): X → R^K for some integer K > 0, where K = Q and K = 2^Q are common choices. Based on this vector function, a prediction function F: R^K → Y can be attained for assigning the set of relevant labels to an instance. Another popular approach to multi-label learning is to learn a real-valued vector function f = (f_1, f_2, ..., f_Q) such that f_i(X) > f_j(X) if y_i = +1 and y_j = −1 for an instance-label pair (X, Y), and a function F should be learned to determine the number of relevant labels.
Essentially, multi-label learning approaches try to minimize the expected risk of f with regard to some loss L, i.e.,

R(f) = E_{(X,Y)∼D}[L(f(X), Y)]. (1)

Notice that f could be a prediction function or a vector of real-valued functions, depending on the loss. Further denote by R* = inf_f R(f) the minimal risk (i.e., the Bayes risk) over all measurable functions. In this paper, we mainly focus on loss functions which are below-bounded and interval, defined as follows:

Definition 1 A loss function L is said to be below-bounded if L(·, ·) ≥ B holds for some constant B; L is said to be interval if, for some constant γ > 0, either L(f(X), Y) = L(f(X′), Y′) or |L(f(X), Y) − L(f(X′), Y′)| ≥ γ holds for every X, X′ ∈ X and Y, Y′ ∈ Y.

There are many multi-label loss functions (also called evaluation criteria), e.g., ranking loss, hamming loss, one-error, coverage and average precision (Schapire and Singer, 2000, Zhang and Zhou, 2006); accuracy, precision, recall and F1 (Godbole and Sarawagi, 2004, Qi et al., 2007); subset accuracy (Ghamrawi and McCallum, 2005); etc. In this paper, we focus on two well-known losses, i.e., ranking loss and hamming loss, and leave the discussion of other losses to future work. Notice that all the above multi-label losses are non-convex and discontinuous, and it is difficult, or even impossible, to optimize them directly. A feasible method in practice is to consider instead a surrogate loss function which can be optimized by efficient algorithms. Actually, most existing multi-label learning approaches, such as the boosting algorithm AdaBoost.MH (Schapire and Singer, 2000), the neural network algorithm BP-MLL (Zhang and Zhou, 2006), SVM-style algorithms (Elisseeff and Weston, 2002, Taskar et al., 2004, Hariharan et al., 2010), etc., in essence try to optimize some surrogate loss such as the exponential loss or the hinge loss.
There are many definitions of consistency, e.g., Fisher consistency (Lin, 2002), infinite-sample consistency (Zhang, 2004a), classification calibration (Bartlett et al., 2006, Tewari and Bartlett, 2007), edge-consistency (Duchi et al., 2010), etc., and the consistency of learning algorithms based on optimizing a surrogate loss function has been well studied for binary classification (Zhang, 2004b, Steinwart, 2005, Bartlett et al., 2006), multi-class classification (Zhang, 2004a, Tewari and Bartlett, 2007), learning to rank (Cossock and Zhang, 2008, Xia et al., 2008, Duchi et al., 2010), etc. The consistency of multi-label learning, however, remains untouched, and to the best of our knowledge, this paper presents the first theoretical analysis on the consistency of multi-label learning.
3 Multi-Label Consistency

Given an instance X ∈ X, we denote by p(X) the vector of conditional probabilities of Y ∈ Y, i.e., p(X) = (p_Y(X))_{Y∈Y} = (Pr(Y|X))_{Y∈Y}, for some p(X) ∈ Λ, where Λ is the set of all possible conditional probability vectors, i.e.,

Λ = {p ∈ R^{|Y|} : Σ_{Y∈Y} p_Y = 1 and p_Y ≥ 0}.

In the following, for notational simplicity, we will suppress the dependence of p(X) and f(X) on the instance X, writing p and f when it is clear from the context. For an instance X ∈ X, we define the conditional risk of f as

l(p, f) = Σ_{Y∈Y} p_Y L(f, Y) = Σ_{Y∈Y} Pr(Y|X) L(f(X), Y). (2)

It is easy to get the expected risk and the minimal risk, respectively, as

R(f) = E_X[l(p, f)] and R* = E_X[inf_f l(p, f)].

We further define the set of Bayes predictors as

A(p) = {f : l(p, f) = inf_{f′} l(p, f′)}.

Notice that A(p) ≠ ∅ since L is interval and below-bounded. As mentioned above, the loss L in multi-label learning is generally non-convex and discontinuous, and it is difficult, even impossible, to minimize the risk given by Eq.(1) directly. In practice, a surrogate loss function Ψ is usually considered in place of L. We define the Ψ-risk and Bayes Ψ-risk of f, respectively, as

R_Ψ(f) = E_{X,Y}[Ψ(f(X), Y)] and R*_Ψ = inf_f R_Ψ(f).

Similarly, we define the conditional surrogate risk and the conditional Bayes surrogate risk of f, respectively, as

W(p, f) = Σ_{Y∈Y} p_Y Ψ(f, Y) and W*(p) = inf_f W(p, f).

It is obvious that R_Ψ(f) = E_X[W(p, f)] and R*_Ψ = E_X[W*(p)]. We now define multi-label consistency as follows:

Definition 2 Given a below-bounded surrogate loss function Ψ such that Ψ(·, Y) is continuous for every Y ∈ Y, Ψ is said to be multi-label consistent w.r.t. the loss L if it holds, for every p ∈ Λ, that

W*(p) < inf{W(p, f): f ∉ A(p)}.

The following theorem states that multi-label consistency is the necessary and sufficient condition for convergence of the Ψ-risk to the Bayes Ψ-risk to imply R(f) → R*:

Theorem 3 The surrogate loss Ψ is multi-label consistent w.r.t. the loss L if and only if, for any sequence f^(n), R_Ψ(f^(n)) → R*_Ψ implies R(f^(n)) → R*.
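To make the conditional risk of Eq.(2) and the Bayes set A(p) concrete, here is a small numerical illustration that is not part of the paper: the conditional distribution p below is hypothetical, the loss is the hamming loss of the sign prediction, and a Bayes predictor is found by brute force over sign patterns for Q = 2.

```python
import itertools

# All label vectors in {+1,-1}^Q for Q = 2 and a hypothetical
# conditional distribution p_Y = Pr(Y|x) over them.
labels = list(itertools.product([+1, -1], repeat=2))
p = [0.6, 0.2, 0.1, 0.1]  # ordered as `labels` above

def hamming(f, Y):
    """Hamming loss of the sign prediction F(f(x)) = sgn(f) against Y."""
    y_hat = [1 if fi > 0 else -1 for fi in f]
    return sum(a != b for a, b in zip(y_hat, Y)) / len(Y)

def conditional_risk(f):
    """l(p, f) = sum_Y p_Y L(f, Y), i.e. Eq.(2) with L = hamming loss."""
    return sum(pY * hamming(f, Y) for pY, Y in zip(p, labels))

# Brute-force a Bayes predictor over the sign patterns of f.
best = min(itertools.product([+1.0, -1.0], repeat=2), key=conditional_risk)
```

For this p, the minimizer thresholds each marginal probability Pr(y_i = +1|x) at 1/2, in line with the Bayes predictor for hamming loss given by Eq.(8) later in the paper.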
The proof of this theorem is inspired by the techniques of (Zhang, 2004a) and (Tewari and Bartlett, 2007), and we defer it to Section 6.1.

4 Consistency w.r.t. Ranking Loss

The ranking loss concerns the average fraction of label pairs which are ordered reversely for an instance. For a real-valued vector function f = (f_1, f_2, ..., f_Q), the ranking loss is given by

L_rankloss(f, (X, Y)) = a_Y Σ_{(i,j): y_i = −1, y_j = +1} I[f_i(X) ≥ f_j(X)] = a_Y Σ_{y_i < y_j} I[f_i(X) ≥ f_j(X)], (3)

where a_Y is a non-negative penalty and I is the indicator function, i.e., I[π] equals 1 if π holds and 0 otherwise. In multi-label learning, the most common penalty is

a_Y = |{y_i : y_i = −1}|^{−1} · |{y_j : y_j = +1}|^{−1}.
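As a minimal sketch (a hypothetical helper, not from the paper), Eq.(3) with the common penalty a_Y = |{i: y_i = −1}|^{−1} · |{j: y_j = +1}|^{−1} can be computed as follows; note that a tie f_i = f_j is charged the full penalty, exactly like a reversal:

```python
def ranking_loss(f, y):
    """Ranking loss of Eq.(3) with the common penalty
    a_Y = 1 / (#irrelevant * #relevant). f: real scores, y: labels in {+1,-1}."""
    pos = [j for j, yj in enumerate(y) if yj == +1]
    neg = [i for i, yi in enumerate(y) if yi == -1]
    if not pos or not neg:
        return 0.0
    a = 1.0 / (len(neg) * len(pos))
    # A pair (i, j) with y_i < y_j is charged when f_i >= f_j,
    # so ties are penalized exactly like reversals.
    return a * sum(f[i] >= f[j] for i in neg for j in pos)
```

For example, perfectly separated scores give loss 0, while a completely reversed or fully tied ordering gives loss 1.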
In this paper we consider the more general setting, i.e., any non-negative penalty. It is easy to see that the ranking loss is below-bounded and interval, since L_rankloss(f, (X, Y)) ≥ 0 and, for every X, X′ ∈ X and Y, Y′ ∈ Y, either L_rankloss(f, (X, Y)) = L_rankloss(f, (X′, Y′)) or |L_rankloss(f, (X, Y)) − L_rankloss(f, (X′, Y′))| ≥ γ, where γ = min{|i·a_Y − j·a_{Y′}| > 0 : i, j ∈ [(Q − ⌊Q/2⌋)⌊Q/2⌋], Y, Y′ ∈ Y}.

Considering Eq.(3), a natural choice for the surrogate loss is

Ψ(f(X), Y) = a_Y Σ_{(i,j): y_i = −1, y_j = +1} φ(f_j(X) − f_i(X)) = a_Y Σ_{y_i < y_j} φ(f_j(X) − f_i(X)), (4)

where φ is a convex real-valued function, chosen as the hinge loss φ(x) = (1 − x)_+ in (Elisseeff and Weston, 2002) and the exponential loss φ(x) = exp(−x) in (Schapire and Singer, 2000, Dekel et al., 2004, Zhang and Zhou, 2006). Before our discussion, it is necessary to introduce the notations

Δ_{i,j} = Σ_{Y: y_i < y_j} p_Y a_Y and δ(i, j; k, l) = Σ_{Y: y_i < y_j, y_k < y_l} p_Y a_Y,

for a given vector p ∈ Λ and a non-negative vector (a_Y)_{Y∈Y}. It is easy to get the following lemmas:

Lemma 4 For a vector p ∈ Λ and a non-negative vector (a_Y)_{Y∈Y}, the following properties hold:
1. Δ_{i,i} = 0;
2. δ(i, j; k, l) = δ(k, l; i, j);
3. Δ_{i,j} = δ(i, j; i, k) + δ(i, j; k, j) for every k ≠ i, j;
4. Δ_{i,k} + Δ_{k,j} + Δ_{j,i} = Δ_{k,i} + Δ_{i,j} + Δ_{j,k};
5. Δ_{i,k} ≥ Δ_{k,i} if Δ_{i,j} ≥ Δ_{j,i} and Δ_{j,k} ≥ Δ_{k,j}.

Proof: Properties 1 and 2 are immediate from the definitions. For Property 3, we have y_i = −1 and y_j = +1 for every Y ∈ Y satisfying y_i < y_j. For y_k (k ≠ i, j), there are only two choices, y_k = +1 or y_k = −1, and thus Δ_{i,j} = δ(i, j; i, k) + δ(i, j; k, j). For Property 4, we have, from Property 3:

Δ_{i,k} = δ(i, k; j, k) + δ(i, k; i, j),  Δ_{k,i} = δ(k, i; j, i) + δ(k, i; k, j),
Δ_{j,i} = δ(j, i; k, i) + δ(j, i; j, k),  Δ_{i,j} = δ(i, j; k, j) + δ(i, j; i, k),
Δ_{k,j} = δ(k, j; k, i) + δ(k, j; i, j),  Δ_{j,k} = δ(j, k; j, i) + δ(j, k; i, k).

Thus, Property 4 holds by combining these with Property 2. Property 5 follows directly from Property 4.
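The quantities Δ_{i,j} and the identities of Lemma 4 are easy to check numerically. The sketch below (not from the paper; the non-uniform p and the uniform penalty a_Y = 1 are hypothetical choices) verifies Properties 1 and 4 for Q = 3:

```python
import itertools

Q = 3
labels = list(itertools.product([+1, -1], repeat=Q))
# Hypothetical non-uniform p_Y (weights 1..8, normalized) and penalty a_Y = 1.
weights = range(1, len(labels) + 1)
total = sum(weights)
p = {Y: w / total for Y, w in zip(labels, weights)}
a = {Y: 1.0 for Y in labels}

def Delta(i, j):
    """Delta_{i,j} = sum of p_Y * a_Y over Y with y_i < y_j."""
    return sum(p[Y] * a[Y] for Y in labels if Y[i] < Y[j])

i, j, k = 0, 1, 2
# Property 4: Delta_{i,k} + Delta_{k,j} + Delta_{j,i}
#           = Delta_{k,i} + Delta_{i,j} + Delta_{j,k}
lhs = Delta(i, k) + Delta(k, j) + Delta(j, i)
rhs = Delta(k, i) + Delta(i, j) + Delta(j, k)
```

Since Properties 1 and 4 hold for every p and (a_Y), any choice of weights here should pass the same check.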
Lemma 5 For every p ∈ Λ and non-negative vector (a_Y)_{Y∈Y}, the set of Bayes predictors for the ranking loss is given by

A(p) = {f : for all i < j, f_i > f_j if Δ_{i,j} < Δ_{j,i}; f_i ≠ f_j if Δ_{i,j} = Δ_{j,i}; and f_i < f_j otherwise}.

Proof: From the definition of the conditional risk given by Eq.(2), we have

l(p, f) = Σ_{Y∈Y} p_Y L_rankloss(f, (X, Y)) = Σ_{Y∈Y} p_Y a_Y Σ_{y_i < y_j} I[f_i ≥ f_j]
       = Σ_{Y∈Y} Σ_{1≤i,j≤Q} p_Y a_Y I[f_i ≥ f_j] I[y_i < y_j].

By swapping the two sums, we get

l(p, f) = Σ_{1≤i,j≤Q} I[f_i ≥ f_j] Σ_{Y: y_i < y_j} p_Y a_Y = Σ_{1≤i,j≤Q} I[f_i ≥ f_j] Δ_{i,j}
       = Σ_{1≤i<j≤Q} (I[f_i ≥ f_j] Δ_{i,j} + I[f_j ≥ f_i] Δ_{j,i}).

Hence we complete the proof by combining with Property 5 of Lemma 4.

The following theorem discloses that no convex surrogate loss is consistent with the ranking loss; the proof is deferred to Section 6.2.
Theorem 6 For any convex function φ, the surrogate loss

Ψ(f(X), Y) = Σ_{y_i < y_j} a_Y φ(f_j(X) − f_i(X))

is not multi-label consistent w.r.t. ranking loss.

Intuitively, Property 5 of Lemma 4 implies that the {Δ_{i,j}} define an order ⪰ on the label set L = {1, 2, ..., Q} by i ⪰ j if Δ_{i,j} ≤ Δ_{j,i}. Notice that, for i ≠ j, it is possible that Δ_{i,j} = Δ_{j,i}. The set of Bayes predictors of a reasonable loss function should include all functions which are compatible with this order, i.e., all f with f_i ≥ f_j whenever i ⪰ j. In the definition of the ranking loss given by Eq.(3), however, the same penalty is applied to the cases f_i > f_j and f_i = f_j; thus, the set of Bayes predictors with respect to ranking loss does not include some functions which are compatible with the above order, since what the ranking loss enforces on the label set L is f_i < f_j or f_j < f_i whenever Δ_{i,j} = Δ_{j,i}. As an extreme example, when all the Δ_{i,j} are equal for i ≠ j, minimizing the convex surrogate loss Ψ leads to the optimal solution {f : f_1 = f_2 = ... = f_Q}, but f ∉ A(p) (from Lemma 5). So, the same penalty on f_i > f_j and f_i = f_j encumbers multi-label consistency.

To overcome this deficiency, we instead introduce the partial ranking loss

L_p-rankloss(f, (X, Y)) = a_Y Σ_{y_i < y_j} (I[f_i(X) > f_j(X)] + ½ I[f_i(X) = f_j(X)]), (5)

which has been commonly used for ranking problems. The only difference from the ranking loss lies in the penalty on Σ_{y_i < y_j} I[f_i = f_j]: the ranking loss uses a_Y while the partial ranking loss uses a_Y/2. With a proof similar to that of Lemma 5, we can get the set of Bayes predictors with respect to the partial ranking loss:

A(p) = {f : for all i < j, f_i > f_j if Δ_{i,j} < Δ_{j,i}; and f_i < f_j if Δ_{i,j} > Δ_{j,i}}. (6)

Now consider the above extreme example again, i.e., the Δ_{i,j} are equal for all i ≠ j. It is easy to see that, by minimizing the surrogate loss Ψ, the optimal solution {f : f_1 = f_2 = ... = f_Q} lies in A(p), which exhibits multi-label consistency.
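A sketch of Eq.(5) (a hypothetical helper, using the same common penalty as before): the only change relative to the ranking loss is that a tie f_i = f_j now costs a_Y/2 instead of a_Y:

```python
def partial_ranking_loss(f, y):
    """Partial ranking loss of Eq.(5) with the common penalty
    a_Y = 1 / (#irrelevant * #relevant); ties cost a_Y / 2."""
    pos = [j for j, yj in enumerate(y) if yj == +1]
    neg = [i for i, yi in enumerate(y) if yi == -1]
    if not pos or not neg:
        return 0.0
    a = 1.0 / (len(neg) * len(pos))
    loss = 0.0
    for i in neg:
        for j in pos:
            if f[i] > f[j]:        # reversed pair: full penalty
                loss += a
            elif f[i] == f[j]:     # tie: half penalty
                loss += a / 2
    return loss
```

A fully tied scoring thus costs 1/2 rather than 1, which is what lets constant solutions be Bayes-optimal in the degenerate case discussed above.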
We further study more general cases, and the following theorem provides a sufficient condition for multi-label consistency w.r.t. partial ranking loss:

Theorem 7 The surrogate loss Ψ given by Eq.(4) is multi-label consistent w.r.t. partial ranking loss if φ: R → R is a differentiable and non-increasing function with

φ′(0) < 0 and φ(x) + φ(−x) ≡ 2φ(0), (7)

i.e., φ(x) + φ(−x) = 2φ(0) for every x ∈ R.

The proof of Theorem 7 can be found in Section 6.3, and from this theorem we can get:

Corollary 8 The surrogate loss Ψ given by Eq.(4) is multi-label consistent w.r.t. partial ranking loss if φ(x) = −arctan(x) or φ(x) = (1 − e^{2x})/(1 + e^{2x}).

Notice that Theorem 7 cannot be applied directly to φ(x) = −cx^{2k+1} for a constant c > 0 and integer k ≥ 0, since such a setting yields a surrogate loss Ψ that is not below-bounded. This problem, however, can be solved by introducing a regularization term. With a proof similar to that of Theorem 7, we get:

Theorem 9 The following surrogate loss is multi-label consistent w.r.t. partial ranking loss:

Ψ(f(X), Y) = Σ_{y_i < y_j} a_Y φ(f_j(X) − f_i(X)) + τ Υ(f(X)),

where τ > 0, φ(x) = −cx^{2k+1} for some constant c > 0 and integer k ≥ 0, and Υ is symmetric, that is, Υ(..., f_i(X), ..., f_j(X), ...) = Υ(..., f_j(X), ..., f_i(X), ...).

For example, we can easily construct the following convex surrogate loss

Ψ(f(X), Y) = Σ_{y_i < y_j} a_Y (f_i(X) − f_j(X)) + τ Σ_{i=1}^Q f_i^2(X),

which is multi-label consistent w.r.t. partial ranking loss.

It is worth noting that this does not mean that every convex surrogate loss Ψ given by Eq.(4) is consistent w.r.t. partial ranking loss. In fact, the following theorem proves that many non-linear surrogate losses are inconsistent w.r.t. partial ranking loss.
Theorem 10 If φ: R → R is a convex, differentiable, non-linear and non-increasing function, the surrogate loss Ψ given by Eq.(4) is not multi-label consistent w.r.t. partial ranking loss.

The proof is deferred to Section 6.4. The following corollary shows that some state-of-the-art multi-label learning approaches (Schapire and Singer, 2000, Dekel et al., 2004, Zhang and Zhou, 2006) are not even multi-label consistent w.r.t. partial ranking loss.

Corollary 11 If φ(x) = e^{−x} or φ(x) = ln(1 + exp(−x)), the surrogate loss Ψ given by Eq.(4) is not multi-label consistent w.r.t. partial ranking loss.

In summary, the ranking loss has been suggested as a standard since the early work of Schapire and Singer (2000), and many state-of-the-art multi-label learning approaches work with the formulation given by Eq.(4). However, our analysis shows that no convex surrogate loss function is consistent w.r.t. ranking loss. Thus, ranking loss might not be a good loss function and evaluation criterion for multi-label learning. The partial ranking loss is more reasonable than ranking loss, since it enables many, though not all, convex surrogate loss functions to be consistent. In future work it would be interesting to develop new multi-label learning approaches based on minimizing the partial ranking loss, and to design other loss functions with better consistency.

5 Consistency w.r.t. Hamming Loss

The hamming loss concerns how many instance-label pairs are misclassified. For a given vector function f and prediction function F, the hamming loss is given by

L_hamloss(F(f(X)), Y) = (1/Q) Σ_{i=1}^Q I[ŷ_i ≠ y_i],

where Ŷ = F(f(X)) = (ŷ_1, ŷ_2, ..., ŷ_Q). Hamming loss is obviously below-bounded and interval, since 0 ≤ L_hamloss(F(f(X)), Y) ≤ 1 and, for every X, X′ ∈ X and Y, Y′ ∈ Y, either L_hamloss(F(f(X)), Y) = L_hamloss(F(f(X′)), Y′) or |L_hamloss(F(f(X)), Y) − L_hamloss(F(f(X′)), Y′)| ≥ 1/Q.
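A direct transcription of the hamming loss (a hypothetical helper, not from the paper):

```python
def hamming_loss(y_hat, y):
    """Hamming loss: the fraction of the Q labels that are misclassified.
    Both arguments are vectors in {+1, -1}^Q."""
    assert len(y_hat) == len(y)
    return sum(a != b for a, b in zip(y_hat, y)) / len(y)
```

Since every misclassified label changes the loss by exactly 1/Q, the interval constant γ = 1/Q above is immediate.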
We have the conditional risk:

l(p, F(f(X))) = Σ_{Y∈Y} p_Y L_hamloss(Ŷ, Y) = (1/Q) Σ_{i=1}^Q Σ_{Y∈Y} p_Y I[ŷ_i ≠ y_i]
             = (1/Q) Σ_{i=1}^Q (Σ_{Y∈Y, y_i=+1} p_Y I[ŷ_i ≠ +1] + Σ_{Y∈Y, y_i=−1} p_Y I[ŷ_i ≠ −1]),

and the set of Bayes predictors with regard to hamming loss:

A(p) = {f = f(X): Ŷ = F(f) with ŷ_i = sgn(Σ_{Y∈Y, y_i=+1} p_Y − ½)}. (8)

A straightforward multi-label learning approach is to regard each subset of labels as a new class and then try to learn 2^Q functions, i.e., f = (f_Y)_{Y∈Y}. Such a prediction function is given by

F(f(X)) = arg max_{Y∈Y} f_Y(X). (9)

This approach can be viewed as a direct extension of the one-vs-all strategy for multi-class learning. We consider the following formulation:

Ψ(f(X), Y) = max_{Ŷ∈Y} φ(δ(Ŷ, Y) + f_Ŷ(X) − f_Y(X)), (10)

where φ(x) and δ(Ŷ, Y) are set as φ(x) = max(0, x) and δ(Ŷ, Y) = Σ_{i=1}^Q I[y_i ≠ ŷ_i], respectively, by Taskar et al. (2004) and Hariharan et al. (2010). Before further discussion, we divide multi-label classification tasks into two categories, i.e., deterministic and non-deterministic, as follows:

Definition 12 In multi-label classification, if for every instance X ∈ X there exists a label vector Y ∈ Y such that Pr(Y|X) > 0.5, the task is deterministic, and non-deterministic otherwise.

Consistency for the deterministic case is easier to attain than for the non-deterministic case. For example, many formulations of SVMs are inconsistent for non-deterministic multi-class classification (Zhang, 2004a, Tewari and Bartlett, 2007), but consistent for the deterministic case, as indicated by Zhang (2004a). The following lemma shows that the approaches of (Taskar et al., 2004, Hariharan et al., 2010) are not multi-label consistent w.r.t. hamming loss, even for the deterministic case.
Lemma 13 For deterministic multi-label classification, the surrogate loss Ψ given by Eq.(10) with δ(Ŷ, Y) = Σ_{i=1}^Q I[y_i ≠ ŷ_i] is not multi-label consistent w.r.t. hamming loss.

However, if we instead choose δ(Ŷ, Y) = I[Ŷ ≠ Y], the following theorem guarantees that the surrogate loss is multi-label consistent w.r.t. hamming loss, at least for the deterministic case.

Theorem 14 For deterministic multi-label classification, the surrogate loss Ψ given by Eq.(10) with δ(Ŷ, Y) = I[Ŷ ≠ Y] is multi-label consistent w.r.t. hamming loss.

The proofs of Lemma 13 and Theorem 14 can be found in Sections 6.5 and 6.6, respectively.

Alternatively, it is possible to transform a multi-label learning task into Q independent binary classification tasks (Boutell et al., 2004). Now the goal is to learn Q functions, f = (f_1, f_2, ..., f_Q), and the prediction function is given by

F(f(X)) = (sgn[f_1(X)], sgn[f_2(X)], ..., sgn[f_Q(X)]).

A common choice for the surrogate loss is

Ψ(f(X), Y) = Σ_{i=1}^Q φ(y_i f_i(X)), (11)

where φ is a convex function, chosen as the hinge loss φ(t) = (1 − t)_+ in (Elisseeff and Weston, 2002) and the exponential loss φ(t) = exp(−t) in (Schapire and Singer, 2000). We have

W(p, f) = Σ_{Y∈Y} p_Y Ψ(f(X), Y) = Σ_{i=1}^Q Σ_{Y∈Y} p_Y φ(y_i f_i(X))
        = Σ_{i=1}^Q [p_i^+ φ(f_i(X)) + (1 − p_i^+) φ(−f_i(X))],

where p_i^+ = Σ_{Y: y_i=+1} p_Y and 1 − p_i^+ = Σ_{Y: y_i=−1} p_Y. For simplicity, we write W_i(p_i^+, f_i) = p_i^+ φ(f_i) + (1 − p_i^+) φ(−f_i). It follows that minimizing W(p, f) is equivalent to minimizing W_i(p_i^+, f_i) for every 1 ≤ i ≤ Q, i.e.,

W*(p) = inf_f W(p, f) = Σ_{i=1}^Q inf_{f_i} W_i(p_i^+, f_i).

The consistency of binary classification has been well studied (Zhang, 2004b, Bartlett et al., 2006), and based on the work of Bartlett et al. (2006) we can easily get:

Theorem 15 The surrogate loss Ψ given by Eq.(11) is consistent w.r.t. hamming loss for any convex function φ with φ′(0) < 0.

It is evident from this theorem that the surrogate loss Ψ given by Eq.(11) is consistent w.r.t.
hamming loss if φ is any of the following:

Exponential: φ(x) = e^{−x};
Hinge: φ(x) = max(0, 1 − x);
Least squares: φ(x) = (1 − x)^2;
Logistic regression: φ(x) = ln(1 + exp(−x)).

Notice that in this paper we do not consider how to learn the real-valued functions for small data or a large number of labels, where many labels or label subsets lack enough training examples; this is a challenging task for which the exploitation of label correlations is needed. In our analysis, we assume sufficient training data and ignore label correlations. Moreover, we do not consider how to decide the number of relevant labels for ranking loss and partial ranking loss, which is very challenging in practice. These are important future issues.
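The decomposition above reduces the analysis to the binary case. The sketch below (hypothetical helpers, not from the paper) implements Eq.(11) with the exponential loss, and checks by grid search with the logistic loss that the per-label minimizer of W_i(p_i^+, f_i) has the same sign as sgn(p_i^+ − 1/2), in agreement with the Bayes predictor of Eq.(8):

```python
import math

def br_surrogate(f, y, phi=lambda t: math.exp(-t)):
    """Binary-relevance surrogate of Eq.(11): sum_i phi(y_i * f_i).
    phi defaults to the exponential loss; any convex phi can be passed."""
    return sum(phi(yi * fi) for yi, fi in zip(y, f))

def br_predict(f):
    """F(f(X)) = (sgn f_1(X), ..., sgn f_Q(X))."""
    return [1 if fi > 0 else -1 for fi in f]

def Wi(p_plus, f, phi):
    """Conditional surrogate risk of one label:
    W_i(p_i^+, f_i) = p_i^+ phi(f_i) + (1 - p_i^+) phi(-f_i)."""
    return p_plus * phi(f) + (1 - p_plus) * phi(-f)

logistic = lambda x: math.log(1 + math.exp(-x))
p_plus = 0.8  # hypothetical marginal Pr(y_i = +1 | x)

# Grid-search the minimizer of W_i; for a consistent phi its sign
# must agree with sgn(p_plus - 1/2).
grid = [i / 100.0 for i in range(-500, 501)]
f_star = min(grid, key=lambda f: Wi(p_plus, f, logistic))
```

For the logistic loss the minimizer is the log-odds ln(p_i^+/(1 − p_i^+)), here ln 4 ≈ 1.386, which is positive exactly when p_i^+ > 1/2.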
6 Proofs

6.1 Proof of Theorem 3

We first introduce some useful lemmas:

Lemma 16 W*(p) is continuous on Λ.

Proof: From the Heine definition of continuity, we need to show that W*(p^(n)) → W*(p) for any sequence p^(n) → p. Let B_r be a closed ball with radius r in R^K. Since Y is finite, we have Σ_{Y∈Y} p^(n)_Y Ψ(f, Y) → Σ_{Y∈Y} p_Y Ψ(f, Y) uniformly over f ∈ B_r for any sequence p^(n) → p, leading to

inf_{f∈B_r} Σ_{Y∈Y} p^(n)_Y Ψ(f, Y) → inf_{f∈B_r} Σ_{Y∈Y} p_Y Ψ(f, Y).

From the fact that W*(p^(n)) ≤ inf_{f∈B_r} Σ_{Y∈Y} p^(n)_Y Ψ(f, Y), by letting r → ∞ we have

lim sup_n W*(p^(n)) ≤ W*(p). (12)

Denote Y′ = {Y ∈ Y : p_Y > 0} and assume Ψ(·, ·) ≥ −C for some constant C (since Ψ is below-bounded). We have

W*(p^(n)) ≥ inf_f Σ_{Y∈Y′} p^(n)_Y Ψ(f, Y) − C Σ_{Y∈Y\Y′} p^(n)_Y,

yielding

lim inf_n W*(p^(n)) ≥ lim inf_n (inf_f Σ_{Y∈Y′} p^(n)_Y Ψ(f, Y) − C Σ_{Y∈Y\Y′} p^(n)_Y) = W*(p),

which completes the proof by combining with Eq.(12).

Lemma 17 If the surrogate loss function Ψ is multi-label consistent w.r.t. the loss function L, then for any ε > 0 there exists δ > 0 such that, for every p ∈ Λ, l(p, f) ≥ inf_{f′} l(p, f′) + ε implies W(p, f) ≥ W*(p) + δ.

Proof: We proceed by contradiction. Suppose Ψ is multi-label consistent and there exist ε > 0 and a sequence (p^(n), f^(n)) such that l(p^(n), f^(n)) ≥ inf_f l(p^(n), f) + ε and W(p^(n), f^(n)) − W*(p^(n)) → 0. From the compactness of Λ, there exists a convergent subsequence p^(n_k) → p for some p ∈ Λ. From Lemma 16, we have W(p^(n_k), f^(n_k)) → W*(p). Similarly to the proof of Lemma 16, with Y′ = {Y ∈ Y : p_Y > 0}, we have

lim sup_k W(p, f^(n_k)) ≤ lim sup_k (C Σ_{Y∈Y\Y′} p^(n_k)_Y + Σ_{Y∈Y′} p^(n_k)_Y Ψ(f^(n_k), Y)) ≤ lim_k W(p^(n_k), f^(n_k)) = W*(p).

This gives W(p, f^(n_k)) → W*(p) from the definition of W*(p). Since Ψ is multi-label consistent, there exists a subsequence (f^(n_{k_i})) satisfying l(p, f^(n_{k_i})) → inf_f l(p, f), which contradicts the assumption l(p^(n), f^(n)) ≥ inf_f l(p^(n), f) + ε. Thus the lemma holds.

Proof of Theorem 3: (⇒) We first introduce the notation

H(ε) = inf_{p∈Λ, f} {W(p, f) − inf_{f′} W(p, f′) : l(p, f) − inf_{f′} l(p, f′) ≥ ε}.

It is obvious that H(0) = 0, and H(ε) > 0 for ε > 0 from Lemma 17.
By Corollary 26 of (Zhang, 2004a), there exists a concave function η on [0, ∞) such that η(0) = 0, η(ε) → 0 as ε → 0, and

R(f) − R* ≤ η(R_Ψ(f) − R*_Ψ).

Thus, if R_Ψ(f^(n)) → R*_Ψ then R(f^(n)) → R*.

(⇐) We proceed by contradiction. Suppose Ψ is not multi-label consistent; then there exists some p such that W*(p) = inf{W(p, f): f ∉ A(p)}. Let f^(n) ∉ A(p) be a sequence such
that W(p, f^(n)) → W*(p). For simplicity, we consider X = {x}, i.e., a single instance, and set f_n(x) = f^(n). Then R_Ψ(f_n) = W(p, f^(n)) → W*(p) = R*_Ψ, yielding l(p, f^(n)) → inf_f l(p, f), which is contrary to l(p, f^(n)) ≥ inf_f l(p, f) + γ(p), where γ(p) = inf_{f∉A(p)} l(p, f) − inf_f l(p, f) > 0 since f^(n) ∉ A(p) and L is interval. Thus we complete the proof.

6.2 Proof of Theorem 6

We proceed by contradiction. Suppose that the surrogate loss Ψ is multi-label consistent with ranking loss. We have

W(p, f) = Σ_{Y∈Y} p_Y Ψ(f, Y) = Σ_{Y∈Y} p_Y a_Y Σ_{y_i<y_j} φ(f_j − f_i) = Σ_{1≤i<j≤Q} [φ(f_j − f_i) Δ_{i,j} + φ(f_i − f_j) Δ_{j,i}].

Consider a probability vector p = (p_Y)_{Y∈Y} and penalty vector (a_Y)_{Y∈Y} such that p_{Y_1} = p_{Y_2} and a_{Y_1} = a_{Y_2} for every Y_1 ≠ Y_2, Y_1, Y_2 ∈ Y. This yields Δ_{i,j} = Δ_{m,n} for all 1 ≤ i ≠ j ≤ Q and 1 ≤ m ≠ n ≤ Q, and thus we get

W(p, f) = Δ_{1,2} Σ_{1≤i<j≤Q} [φ(f_j − f_i) + φ(f_i − f_j)].

From the convexity of φ, minimizing W(p, f) gives

W*(p) = W(p, f̂) = Δ_{1,2} Σ_{1≤i<j≤Q} 2φ(0) = Q(Q − 1) φ(0) Δ_{1,2},

where f̂ satisfies f̂_1 = f̂_2 = ... = f̂_Q. Notice that f̂ ∉ A(p) from Lemma 5, and we have

W*(p) = inf{W(p, f): f ∉ A(p)},

which completes the proof.

6.3 Proof of Theorem 7

For any p ∈ Λ and non-negative vector (a_Y)_{Y∈Y}, we will prove that f_i > f_j if Δ_{i,j} < Δ_{j,i}, and f_i < f_j if Δ_{i,j} > Δ_{j,i}, for every f satisfying W(p, f) = W*(p). Without loss of generality, it suffices to prove that f_1 > f_2 if Δ_{1,2} < Δ_{2,1}. We proceed by contradiction and assume that there exists f such that f_1 ≤ f_2 and W*(p) = W(p, f).

For the case f_1 < f_2, we introduce another vector f′ with f′_1 = f_2, f′_2 = f_1 and f′_k = f_k for k ≠ 1, 2. From the definition of the conditional surrogate risk, we have

W(p, f) = Σ_{1≤i<j≤Q} [φ(f_j − f_i) Δ_{i,j} + φ(f_i − f_j) Δ_{j,i}],

which yields

W(p, f) − W(p, f′) = (Δ_{1,2} − Δ_{2,1})(φ(f_2 − f_1) − φ(f_1 − f_2))
  + Σ_{i=3}^Q (Δ_{1,i} − Δ_{2,i})(φ(f_i − f_1) − φ(f_i − f_2))
  + Σ_{i=3}^Q (Δ_{i,1} − Δ_{i,2})(φ(f_1 − f_i) − φ(f_2 − f_i)).

From Property 4 of Lemma 4, we have

Δ_{1,i} − Δ_{2,i} − Δ_{i,1} + Δ_{i,2} = Δ_{1,2} − Δ_{2,1}.
(13)

It follows that

W(p, f) − W(p, f′) = (Δ_{1,2} − Δ_{2,1}) (φ(f_2 − f_1) − φ(f_1 − f_2) + Σ_{i=3}^Q (φ(f_i − f_1) − φ(f_i − f_2)))
  + Σ_{i=3}^Q (Δ_{i,1} − Δ_{i,2})(φ(f_1 − f_i) + φ(f_i − f_1) − φ(f_i − f_2) − φ(f_2 − f_i))
= (Δ_{1,2} − Δ_{2,1})(φ(f_2 − f_1) − φ(f_1 − f_2)) + (Δ_{1,2} − Δ_{2,1}) Σ_{i=3}^Q (φ(f_i − f_1) − φ(f_i − f_2)),
where the last equality holds by the condition φ(x) + φ(−x) ≡ 2φ(0). For a non-increasing function φ with φ′(0) < 0, we have φ(z) < φ(−z) for all z > 0, and φ(f_i − f_1) ≤ φ(f_i − f_2). From the condition Δ_{1,2} < Δ_{2,1} we get W(p, f) > W(p, f′), which is contrary to W(p, f) = W*(p).

We now consider the case f_1 = f_2. For the optimal solution, we have the first-order condition ∂W(p, f)/∂f_i = 0 for i = 1, 2:

Σ_{i≠1} φ′(f_1 − f_i) Δ_{i,1} = Σ_{i≠1} φ′(f_i − f_1) Δ_{1,i} and Σ_{i≠2} φ′(f_i − f_2) Δ_{2,i} = Σ_{i≠2} φ′(f_2 − f_i) Δ_{i,2}.

By combining Eq.(13), f_1 = f_2 and φ′(x) = φ′(−x) (which follows from Eq.(7)), we have

(Δ_{2,1} − Δ_{1,2}) (2φ′(0) + Σ_{i≠1,2} φ′(f_1 − f_i)) = 0,

which is impossible since Δ_{1,2} < Δ_{2,1} and 2φ′(0) + Σ_{i≠1,2} φ′(f_1 − f_i) ≤ 2φ′(0) < 0. Thus, we complete the proof.

6.4 Proof of Theorem 10

For a convex function φ we have φ′(x) ≤ φ′(y) for every x ≤ y (Rockafellar, 1997), and the derivative φ′ is continuous on R if φ is differentiable and convex. Since φ is non-increasing, we have φ′(x) ≤ 0 for all x ∈ R, and without loss of generality we assume φ′(x) < 0.

We proceed by contradiction. Assume the surrogate loss Ψ is multi-label consistent with partial ranking loss for some non-linear function φ. Then, from the continuity of φ′, there exists an interval (c, d) with c < d < 0 or 0 < c < d such that φ′(x) < φ′(y) for every x < y with x, y ∈ (c, d). In the following, we focus on the case 0 < c < d; similar considerations apply to the case c < d < 0.

We first fix a ∈ (c, d) and introduce a new function

G(x) = (φ′(x − a) − φ′(a − x))(φ′(a) + φ′(x)) + φ′(x)(φ′(−a) − φ′(a)).

It is easy to see that G(a) = φ′(a)(φ′(−a) − φ′(a)) > 0 and that G is continuous. Thus there exists b > a with b ∈ (c, d) such that

G(b) = (φ′(b − a) − φ′(a − b))(φ′(a) + φ′(b)) + φ′(b)(φ′(−a) − φ′(a)) > 0,

which gives

(φ′(b − a)(φ′(a) + φ′(b)) + φ′(b)φ′(−a)) / (φ′(a − b)(φ′(a) + φ′(b)) + φ′(b)φ′(a)) > 1. (14)

Moreover, from φ′(a) < φ′(b) < 0 and φ′(−b) ≤ φ′(−a) < 0, we have

0 < φ′(−a)/φ′(a) ≤ φ′(−b)/φ′(a) < φ′(−b)/φ′(b).
(15)

We consider the following multi-label classification problem with Q = 3 labels:

Y_1 = (−1, +1, +1), Y_2 = (+1, −1, −1), Y_3 = (+1, +1, −1), Y_4 = (−1, −1, +1).

Let f = (f_1, f_2, f_3) be such that a = f_3 − f_1 and b = f_3 − f_2, and thus f_1 > f_2. For every probability vector p = (p_{Y_1}, p_{Y_2}, p_{Y_3}, p_{Y_4}) ∈ Λ and every penalty vector (a_{Y_1}, a_{Y_2}, a_{Y_3}, a_{Y_4}), we have

W(p, f) = Δ_{1,2} φ(f_2 − f_1) + Δ_{2,1} φ(f_1 − f_2) + Δ_{1,3} φ(f_3 − f_1) + Δ_{3,1} φ(f_1 − f_3) + Δ_{2,3} φ(f_3 − f_2) + Δ_{3,2} φ(f_2 − f_3),

where Δ_{1,2} = P_1, Δ_{2,1} = P_2, Δ_{1,3} = P_1 + P_4, Δ_{3,1} = P_2 + P_3, Δ_{2,3} = P_4 and Δ_{3,2} = P_3, with P_i = p_{Y_i} a_{Y_i} for i = 1, 2, 3, 4. In the following, we construct p̄ and ā such that W*(p̄) = W(p̄, f) and Δ_{1,2} > Δ_{2,1}. The subgradient condition for optimality of W(p, f) gives

∂W(p, f)/∂f_1 = −P_1 φ′(a − b) + P_2 φ′(b − a) − (P_1 + P_4) φ′(a) + (P_2 + P_3) φ′(−a) = 0,
∂W(p, f)/∂f_2 = P_1 φ′(a − b) − P_2 φ′(b − a) − P_4 φ′(b) + P_3 φ′(−b) = 0,
∂W(p, f)/∂f_3 = (P_1 + P_4) φ′(a) − (P_2 + P_3) φ′(−a) + P_4 φ′(b) − P_3 φ′(−b) = 0,
[Figure 1: Lines l_1 (solid) and l_2 (dashed) in the (P_3, P_4) plane, corresponding to Eqs. (16) and (17), respectively.]

This system is equivalent to

P_1 φ′(a − b) − P_2 φ′(b − a) = P_4 φ′(b) − P_3 φ′(−b), (16)
P_1 φ′(a) − P_2 φ′(−a) = −P_4 (φ′(a) + φ′(b)) + P_3 (φ′(−a) + φ′(−b)). (17)

From Lemma 18, there exist p̄ = (p̄_{Y_1}, p̄_{Y_2}, p̄_{Y_3}, p̄_{Y_4}) and (ā_{Y_1}, ā_{Y_2}, ā_{Y_3}, ā_{Y_4}) satisfying Eqs. (16) and (17) with P̄_1 > P̄_2. This yields W(p̄, f) = W*(p̄). Notice that f ∉ A(p̄) from Eq.(6), since Δ_{1,2} = P̄_1 > P̄_2 = Δ_{2,1} and f_1 > f_2; hence

W*(p̄) = inf{W(p̄, f): f ∉ A(p̄)},

which completes the proof.

Lemma 18 There exist P_1 > P_2 > 0, P_3 > 0 and P_4 > 0 satisfying Eqs. (16) and (17) if Eqs. (14) and (15) hold for some 0 < a < b.

Proof: From Eq.(14), we can choose

1 < P_1/P_2 < (φ′(b − a)(φ′(a) + φ′(b)) + φ′(b)φ′(−a)) / (φ′(a − b)(φ′(a) + φ′(b)) + φ′(b)φ′(a)). (18)

For a < b, we have φ′(a − b) ≤ φ′(b − a) < 0, yielding P_1/P_2 > 1 ≥ φ′(b − a)/φ′(a − b), which gives P_1 φ′(a − b) − P_2 φ′(b − a) < 0. Thus, Eq.(16) corresponds to the line l_1 in Figure 1. From Eq.(15), we further obtain

0 < (φ′(−a) + φ′(−b)) / (φ′(a) + φ′(b)) < φ′(−b)/φ′(b).

To guarantee that P_3 > 0 and P_4 > 0 satisfying Eqs. (16) and (17) exist, as shown in Figure 1, we need

(P_1 φ′(a − b) − P_2 φ′(b − a)) / φ′(b) < (P_1 φ′(a) − P_2 φ′(−a)) / (φ′(a) + φ′(b)).

The above holds from Eq.(18). Thus, we complete the proof.

6.5 Proof of Lemma 13

We consider a multi-label classification task with Q = 2 labels, i.e., L = {1, 2}, and let Y_1 = (−1, −1), Y_2 = (−1, +1), Y_3 = (+1, +1) and Y_4 = (+1, −1). Suppose p_{Y_4} = 0 and p_{Y_2} + p_{Y_3} < p_{Y_1} < 2p_{Y_2} + p_{Y_3}. It is evident that p_{Y_1} > 0.5, and thus this multi-label classification task is deterministic. It suffices to consider f = (f_{Y_1}, f_{Y_2}, f_{Y_3}). By combining Eqs. (8) and (10), we get

A(p) = {f : f_{Y_1} > f_{Y_2} and f_{Y_1} > f_{Y_3}}.
We also have

$$W(p, f) = p_{Y_1}\max\{\phi(1 + f_{Y_2} - f_{Y_1}),\ \phi(2 + f_{Y_3} - f_{Y_1})\} + p_{Y_2}\max\{\phi(1 + f_{Y_1} - f_{Y_2}),\ \phi(1 + f_{Y_3} - f_{Y_2})\} + p_{Y_3}\max\{\phi(2 + f_{Y_1} - f_{Y_3}),\ \phi(1 + f_{Y_2} - f_{Y_3})\},$$

and

$$\bar{W}(p) = W(p, f^*) = \inf\nolimits_f W(p, f) = 2p_{Y_1} + 2p_{Y_3},$$

where $f^* = (f^*_{Y_1}, f^*_{Y_2}, f^*_{Y_3})$ is such that $f^*_{Y_1} = f^*_{Y_3}$ and $f^*_{Y_1} = f^*_{Y_2} - 1$. Hence $f^* \notin A(p)$, and $\Psi$ is not multi-label consistent.

6.6 Proof of Theorem 14

We first introduce a useful lemma:

Lemma 19 For the surrogate loss $\Psi$ given by Eq. (10) with $\delta(\hat{Y}, Y) = I[\hat{Y} \neq Y]$, and for every $p \in \Lambda$ and $f$ such that $W(p, f) = \bar{W}(p)$, we have $f_{Y_1} \ge f_{Y_2}$ if $p_{Y_1} > p_{Y_2}$.

Proof: We proceed by contradiction. Assume that there exist some $p \in \Lambda$ and $f$ such that $f_{Y_1} < f_{Y_2}$, $p_{Y_1} > p_{Y_2}$ and $\bar{W}(p) = W(p, f)$. We define a new real-valued vector $f'$ such that $f'_{Y_1} = f_{Y_2}$, $f'_{Y_2} = f_{Y_1}$ and $f'_{Y_i} = f_{Y_i}$ for $i \neq 1, 2$. It follows that

$$W(p, f) - W(p, f') = (p_{Y_1} - p_{Y_2})\Big(\max_{Y \neq Y_1}\phi\big(1 + f_Y(X) - f_{Y_1}(X)\big) - \max_{Y \neq Y_2}\phi\big(1 + f_Y(X) - f_{Y_2}(X)\big)\Big).$$

From the assumption $f_{Y_1} < f_{Y_2}$, we have

$$\max_{Y \neq Y_1}\phi\big(1 + f_Y(X) - f_{Y_1}(X)\big) > \max_{Y \neq Y_2}\phi\big(1 + f_Y(X) - f_{Y_2}(X)\big),$$

leading to $W(p, f) - W(p, f') > 0$, which is contrary to the assumption $\bar{W}(p) = W(p, f)$. Thus, we complete the proof.

Proof of Theorem 14: Without loss of generality, we consider a probability vector $p = (p_Y)_{Y \in \mathcal{Y}}$ satisfying $p_{Y_1} > p_{Y_2} \ge p_{Y_3} \ge \ldots \ge p_{Y_{2^Q}}$ and $p_{Y_1} > 0.5$. It is easy to get

$$A(p) = \{f \colon f_{Y_1} > f_Y \text{ for } Y \neq Y_1\}$$

from Eqs. (8) and (9). On the other hand, for each $f$ satisfying $W(p, f) = \bar{W}(p)$, we have $f_{Y_1} \ge f_{Y_2} \ge \ldots \ge f_{Y_{2^Q}}$ from Lemma 19. Thus, we complete the proof from $p_{Y_1} > \sum_{Y \neq Y_1} p_Y$ and the fact that

$$W(p, f) = p_{Y_1}\phi(1 + f_{Y_2} - f_{Y_1}) + \sum_{Y \neq Y_1} p_Y\,\phi(1 + f_{Y_1} - f_Y).$$

Acknowledgements

The authors want to thank the anonymous reviewers for their helpful comments and suggestions, and Tong Zhang for reading a preliminary version. This research was supported by the National Fundamental Research Program of China (2010CB327903), the National Science Foundation of China ( ) and the Jiangsu Science Foundation (BK ).

References

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 2006.
M. R. Boutell, J.
Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9), 2004.
D. Cossock and T. Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11), 2008.
O. Dekel, C. Manning, and Y. Singer. Log-linear models for label ranking. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
K. Dembczyński, W. W. Cheng, and E. Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.
J. C. Duchi, L. W. Mackey, and M. I. Jordan. On the consistency of ranking algorithms. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.
A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.
N. Ghamrawi and A. McCallum. Collective multi-label classification. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, Bremen, Germany, 2005.
S. Godbole and S. Sarawagi. Discriminative methods for multi-labeled classification. In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 22-30, Sydney, Australia, 2004.
B. Hariharan, L. Zelnik-Manor, S. Vishwanathan, and M. Varma. Large scale max-margin multi-label classification with priors. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.
D. Hsu, S. M. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22. MIT Press, Cambridge, MA, 2009.
Y. Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6(3), 2002.
J. Petterson and T. Caetano. Reverse multi-label learning. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A.
Culotta, editors, Advances in Neural Information Processing Systems 23. MIT Press, Cambridge, MA, 2010.
G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, and H.-J. Zhang. Correlative multi-label video annotation. In Proceedings of the 15th ACM International Conference on Multimedia, pages 17-26, Augsburg, Germany, 2007.
R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2), 2000.
I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1), 2005.
B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8, 2007.
F. Xia, T. Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: Theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008.
M.-L. Zhang and Z.-H. Zhou. Multi-label neural network with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10), 2006.
M.-L. Zhang and Z.-H. Zhou. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7), 2007.
T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 2004a.
T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56-85, 2004b.
Z.-H. Zhou and M.-L. Zhang. Multi-instance multi-label learning with application to scene classification. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More information