On the Consistency of Multi-Label Learning
Wei Gao and Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
{gaow,zhouzh}@lamda.nju.edu.cn

Abstract

Multi-label learning has attracted much attention during the past few years. Many multi-label learning approaches have been developed, mostly working with surrogate loss functions, since multi-label loss functions are usually difficult to optimize directly owing to non-convexity and discontinuity. Though these approaches are effective, to the best of our knowledge there is no theoretical result on the convergence of the risk of the learned functions to the Bayes risk. In this paper, focusing on two well-known multi-label loss functions, i.e., ranking loss and hamming loss, we prove a necessary and sufficient condition for the consistency of multi-label learning based on surrogate loss functions. Our results disclose that, surprisingly, no convex surrogate loss is consistent with the ranking loss. Inspired by this finding, we introduce the partial ranking loss, with which some surrogate functions are consistent. For hamming loss, we show that some recent multi-label learning approaches are inconsistent even for deterministic multi-label classification, and give a surrogate loss function which is consistent for the deterministic case. Finally, we discuss the consistency of learning approaches which address multi-label learning by decomposing it into a set of binary classification problems.

1 Introduction

In traditional supervised learning, each instance is associated with a single label. In real-world applications, however, one object is usually relevant to multiple labels simultaneously. For example, a document about national education service may be categorized into several predefined topics, such as government and education; an image containing forests may be annotated with trees and mountains.
For learning with such objects, multi-label learning has attracted much attention during the past few years and many effective approaches have been developed (Schapire and Singer, 2000, Elisseeff and Weston, 2002, Zhou and Zhang, 2007, Zhang and Zhou, 2007, Hsu et al., 2009, Dembczyński et al., 2010, Petterson and Caetano, 2010). The consistency (also called Bayes consistency) of learning algorithms concerns whether the expected risk of a learned function converges to the Bayes risk as the training sample size increases (Lin, 2002, Zhang, 2004b, Steinwart, 2005, Bartlett et al., 2006, Tewari and Bartlett, 2007, Duchi et al., 2010). Nowadays, it is well accepted that a good learner should at least be consistent with large samples. It is noteworthy that, though many efforts have been devoted to multi-label learning, few theoretical aspects have been explored; in particular, to the best of our knowledge, the consistency of multi-label learning remains untouched although it is a very important theoretical issue.

In this paper, focusing on two well-known multi-label loss functions, i.e., ranking loss and hamming loss, we present a theoretical study on the consistency of multi-label learning. We prove a necessary and sufficient condition for the consistency of multi-label learning based on surrogate loss functions. Our analysis discloses that, surprisingly, every convex surrogate loss is inconsistent with the ranking loss. Based on this finding, we introduce the partial ranking loss, which is consistent with some surrogate loss functions; we also show that many current multi-label learning approaches are not even consistent with the partial ranking loss. As for hamming loss, our analysis shows that some recent multi-label learning approaches are inconsistent even for deterministic multi-label classification, and we give a surrogate loss which is consistent with hamming loss for the deterministic case.
Finally, we discuss the consistency of approaches which address multi-label learning by transforming the problem into a set of binary classification problems.
The rest of this paper is organized as follows. Section 2 briefly introduces the research background. Section 3 presents our necessary and sufficient condition for the consistency of multi-label learning. Sections 4 and 5 study the consistency of multi-label learning approaches with regard to the ranking loss and hamming loss, respectively. Section 6 gives the detailed proofs.

2 Background

Let X be an instance space and L = {1, 2, ..., Q} a finite set of possible labels. An instance X ∈ X is associated with a subset of labels Y ⊆ L called the relevant labels, while the complement L \ Y is called the irrelevant labels. For convenience of discussion, we represent the labels as a binary vector Y = (y_1, y_2, ..., y_Q), where y_i = +1 if label i is relevant to X and y_i = −1 otherwise, and denote by Y = {+1, −1}^Q the set of all possible label vectors. Let D denote an unknown underlying probability distribution over X × Y. For an integer m > 0, we write [m] = {1, 2, ..., m}, and for a real number r, ⌊r⌋ denotes the greatest integer no more than r.

The formal description of multi-label learning in the probabilistic setting is as follows: given a training sample S = {(X_1, Y_1), (X_2, Y_2), ..., (X_m, Y_m)} drawn i.i.d. according to the distribution D, the objective is to learn a function h: X → Y which is able to assign a set of labels to unseen instances. In general, it is not easy to learn h directly, and in practice one instead learns a real-valued vector function f = (f_1, f_2, ..., f_K): X → R^K for some integer K > 0, where K = Q and K = 2^Q are common choices. Based on this vector function, a prediction function F: R^K → Y can be attained for assigning the set of relevant labels to an instance. Another popular approach to multi-label learning is to learn a real-valued vector function f = (f_1, f_2, ..., f_Q) such that f_i(X) > f_j(X) if y_i = +1 and y_j = −1 for an instance-label pair (X, Y), and a function F should be learned to determine the number of relevant labels.
Essentially, multi-label learning approaches try to minimize the expected risk of f with regard to some loss L, i.e.,

R(f) = E_{(X,Y)∼D}[L(f(X), Y)]. (1)

Notice that f could be a prediction function or a vector of real-valued functions, depending on the loss. Further denote by R* = inf_f R(f) the minimal risk (i.e., the Bayes risk) over all measurable functions. In this paper, we mainly focus on loss functions which are below-bounded and interval, defined as follows:

Definition 1 A loss function L is said to be below-bounded if L(·, ·) ≥ B holds for some constant B; L is said to be interval if, for some constant γ > 0, either L(f(X), Y) = L(f(X′), Y′) or |L(f(X), Y) − L(f(X′), Y′)| ≥ γ holds for every X, X′ ∈ X and Y, Y′ ∈ Y.

There are many multi-label loss functions (also called evaluation criteria), e.g., ranking loss, hamming loss, one-error, coverage and average precision (Schapire and Singer, 2000, Zhang and Zhou, 2006); accuracy, precision, recall and F1 (Godbole and Sarawagi, 2004, Qi et al., 2007); subset accuracy (Ghamrawi and McCallum, 2005); etc. In this paper, we focus on two well-known losses, i.e., ranking loss and hamming loss, and leave the discussion of other losses to future work. Notice that all the above multi-label losses are non-convex and discontinuous, and it is difficult, or even impossible, to optimize them directly. A feasible method in practice is to consider instead a surrogate loss function which can be optimized by efficient algorithms. Actually, most existing multi-label learning approaches, such as the boosting algorithm AdaBoost.MH (Schapire and Singer, 2000), the neural network algorithm BP-MLL (Zhang and Zhou, 2006), SVM-style algorithms (Elisseeff and Weston, 2002, Taskar et al., 2004, Hariharan et al., 2010), etc., in essence try to optimize some surrogate loss such as the exponential loss or the hinge loss.
There are many definitions of consistency, e.g., Fisher consistency (Lin, 2002), infinite-sample consistency (Zhang, 2004a), classification calibration (Bartlett et al., 2006, Tewari and Bartlett, 2007), edge-consistency (Duchi et al., 2010), etc., and the consistency of learning algorithms based on optimizing a surrogate loss function has been well studied for binary classification (Zhang, 2004b, Steinwart, 2005, Bartlett et al., 2006), multi-class classification (Zhang, 2004a, Tewari and Bartlett, 2007), learning to rank (Cossock and Zhang, 2008, Xia et al., 2008, Duchi et al., 2010), etc. The consistency of multi-label learning, however, remains untouched, and to the best of our knowledge, this paper presents the first theoretical analysis on the consistency of multi-label learning.
3 Multi-Label Consistency

Given an instance X ∈ X, we denote by p(X) the vector of conditional probabilities of Y ∈ Y, i.e., p(X) = (p_Y(X))_{Y∈Y} = (Pr(Y|X))_{Y∈Y}, for some p(X) ∈ Λ, where Λ is the set of all possible conditional probability vectors, i.e.,

Λ = {p ∈ R^{|Y|} : Σ_{Y∈Y} p_Y = 1 and p_Y ≥ 0}.

In the following, for notational simplicity, we will suppress the dependence of p(X) and f(X) on the instance X, writing p and f when it is clear from the context. For an instance X ∈ X, we define the conditional risk of f as

l(p, f) = Σ_{Y∈Y} p_Y L(f, Y) = Σ_{Y∈Y} Pr(Y|X) L(f(X), Y). (2)

It is easy to get the expected risk and the minimal risk, respectively, as

R(f) = E_X[l(p, f)] and R* = E_X[inf_f l(p, f)].

We further define the set of Bayes predictors as

A(p) = {f : l(p, f) = inf_{f′} l(p, f′)}.

Notice that A(p) ≠ ∅ since L is interval and below-bounded. As mentioned above, the loss L in multi-label learning is generally non-convex and discontinuous, and it is difficult, even impossible, to minimize the risk given by Eq.(1) directly. In practice, a surrogate loss function Ψ is usually considered in place of L. We define the Ψ-risk and Bayes Ψ-risk of f, respectively, as

R_Ψ(f) = E_{X,Y}[Ψ(f(X), Y)] and R*_Ψ = inf_f R_Ψ(f).

Similarly, we define the conditional surrogate risk and the conditional Bayes surrogate risk of f, respectively, as

W(p, f) = Σ_{Y∈Y} p_Y Ψ(f, Y) and W*(p) = inf_f W(p, f).

It is obvious that R_Ψ(f) = E_X[W(p, f)] and R*_Ψ = E_X[W*(p)]. We now define multi-label consistency as follows:

Definition 2 Given a below-bounded surrogate loss function Ψ such that Ψ(·, Y) is continuous for every Y ∈ Y, Ψ is said to be multi-label consistent w.r.t. the loss L if it holds, for every p ∈ Λ, that

W*(p) < inf{W(p, f): f ∉ A(p)}.

The following theorem states that multi-label consistency is the necessary and sufficient condition for convergence of the Ψ-risk to the Bayes Ψ-risk to imply R(f) → R*:

Theorem 3 The surrogate loss Ψ is multi-label consistent w.r.t. the loss L if and only if, for any sequence f^(n), R_Ψ(f^(n)) → R*_Ψ implies R(f^(n)) → R*.
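To make the conditional risk of Eq.(2) and the Bayes set A(p) concrete, here is a small numerical illustration that is not part of the paper: the conditional distribution p below is hypothetical, the loss is the hamming loss of the sign prediction, and a Bayes predictor is found by brute force over sign patterns for Q = 2.

```python
import itertools

# All label vectors in {+1,-1}^Q for Q = 2 and a hypothetical
# conditional distribution p_Y = Pr(Y|x) over them.
labels = list(itertools.product([+1, -1], repeat=2))
p = [0.6, 0.2, 0.1, 0.1]  # ordered as `labels` above

def hamming(f, Y):
    """Hamming loss of the sign prediction F(f(x)) = sgn(f) against Y."""
    y_hat = [1 if fi > 0 else -1 for fi in f]
    return sum(a != b for a, b in zip(y_hat, Y)) / len(Y)

def conditional_risk(f):
    """l(p, f) = sum_Y p_Y L(f, Y), i.e. Eq.(2) with L = hamming loss."""
    return sum(pY * hamming(f, Y) for pY, Y in zip(p, labels))

# Brute-force a Bayes predictor over the sign patterns of f.
best = min(itertools.product([+1.0, -1.0], repeat=2), key=conditional_risk)
```

For this p, the minimizer thresholds each marginal probability Pr(y_i = +1|x) at 1/2, in line with the Bayes predictor for hamming loss given by Eq.(8) later in the paper.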
The proof of this theorem is inspired by the techniques of (Zhang, 2004a) and (Tewari and Bartlett, 2007), and we defer it to Section 6.1.

4 Consistency w.r.t. Ranking Loss

The ranking loss concerns the average fraction of label pairs which are ordered reversely for an instance. For a real-valued vector function f = (f_1, f_2, ..., f_Q), the ranking loss is given by

L_rankloss(f, (X, Y)) = a_Y Σ_{(i,j): y_i = −1, y_j = +1} I[f_i(X) ≥ f_j(X)] = a_Y Σ_{y_i < y_j} I[f_i(X) ≥ f_j(X)], (3)

where a_Y is a non-negative penalty and I is the indicator function, i.e., I[π] equals 1 if π holds and 0 otherwise. In multi-label learning, the most common penalty is

a_Y = |{y_i : y_i = −1}|^{−1} · |{y_j : y_j = +1}|^{−1}.
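As a minimal sketch (a hypothetical helper, not from the paper), Eq.(3) with the common penalty a_Y = |{i: y_i = −1}|^{−1} · |{j: y_j = +1}|^{−1} can be computed as follows; note that a tie f_i = f_j is charged the full penalty, exactly like a reversal:

```python
def ranking_loss(f, y):
    """Ranking loss of Eq.(3) with the common penalty
    a_Y = 1 / (#irrelevant * #relevant). f: real scores, y: labels in {+1,-1}."""
    pos = [j for j, yj in enumerate(y) if yj == +1]
    neg = [i for i, yi in enumerate(y) if yi == -1]
    if not pos or not neg:
        return 0.0
    a = 1.0 / (len(neg) * len(pos))
    # A pair (i, j) with y_i < y_j is charged when f_i >= f_j,
    # so ties are penalized exactly like reversals.
    return a * sum(f[i] >= f[j] for i in neg for j in pos)
```

For example, perfectly separated scores give loss 0, while a completely reversed or fully tied ordering gives loss 1.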
In this paper we consider the more general setting, i.e., any non-negative penalty. It is easy to see that the ranking loss is below-bounded and interval, since L_rankloss(f, (X, Y)) ≥ 0 and, for every X, X′ ∈ X and Y, Y′ ∈ Y, either L_rankloss(f, (X, Y)) = L_rankloss(f, (X′, Y′)) or |L_rankloss(f, (X, Y)) − L_rankloss(f, (X′, Y′))| ≥ γ, where γ = min{|i·a_Y − j·a_{Y′}| > 0 : i, j ∈ [(Q − ⌊Q/2⌋)⌊Q/2⌋], Y, Y′ ∈ Y}.

Considering Eq.(3), a natural choice for the surrogate loss is

Ψ(f(X), Y) = a_Y Σ_{(i,j): y_i = −1, y_j = +1} φ(f_j(X) − f_i(X)) = a_Y Σ_{y_i < y_j} φ(f_j(X) − f_i(X)), (4)

where φ is a convex real-valued function, chosen as the hinge loss φ(x) = (1 − x)_+ in (Elisseeff and Weston, 2002) and the exponential loss φ(x) = exp(−x) in (Schapire and Singer, 2000, Dekel et al., 2004, Zhang and Zhou, 2006). Before our discussion, it is necessary to introduce the notations

Δ_{i,j} = Σ_{Y: y_i < y_j} p_Y a_Y and δ(i, j; k, l) = Σ_{Y: y_i < y_j, y_k < y_l} p_Y a_Y,

for a given vector p ∈ Λ and a non-negative vector (a_Y)_{Y∈Y}. It is easy to get the following lemmas:

Lemma 4 For a vector p ∈ Λ and a non-negative vector (a_Y)_{Y∈Y}, the following properties hold:
1. Δ_{i,i} = 0;
2. δ(i, j; k, l) = δ(k, l; i, j);
3. Δ_{i,j} = δ(i, j; i, k) + δ(i, j; k, j) for every k ≠ i, j;
4. Δ_{i,k} + Δ_{k,j} + Δ_{j,i} = Δ_{k,i} + Δ_{i,j} + Δ_{j,k};
5. Δ_{i,k} ≥ Δ_{k,i} if Δ_{i,j} ≥ Δ_{j,i} and Δ_{j,k} ≥ Δ_{k,j}.

Proof: Properties 1 and 2 are immediate from the definitions. For Property 3, we have y_i = −1 and y_j = +1 for every Y ∈ Y satisfying y_i < y_j. For y_k (k ≠ i, j), there are only two choices, y_k = +1 or y_k = −1, and thus Δ_{i,j} = δ(i, j; i, k) + δ(i, j; k, j). For Property 4, we have, from Property 3:

Δ_{i,k} = δ(i, k; j, k) + δ(i, k; i, j),  Δ_{k,i} = δ(k, i; j, i) + δ(k, i; k, j),
Δ_{j,i} = δ(j, i; k, i) + δ(j, i; j, k),  Δ_{i,j} = δ(i, j; k, j) + δ(i, j; i, k),
Δ_{k,j} = δ(k, j; k, i) + δ(k, j; i, j),  Δ_{j,k} = δ(j, k; j, i) + δ(j, k; i, k).

Thus, Property 4 holds by combining these with Property 2. Property 5 follows directly from Property 4.
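The quantities Δ_{i,j} and the identities of Lemma 4 are easy to check numerically. The sketch below (not from the paper; the non-uniform p and the uniform penalty a_Y = 1 are hypothetical choices) verifies Properties 1 and 4 for Q = 3:

```python
import itertools

Q = 3
labels = list(itertools.product([+1, -1], repeat=Q))
# Hypothetical non-uniform p_Y (weights 1..8, normalized) and penalty a_Y = 1.
weights = range(1, len(labels) + 1)
total = sum(weights)
p = {Y: w / total for Y, w in zip(labels, weights)}
a = {Y: 1.0 for Y in labels}

def Delta(i, j):
    """Delta_{i,j} = sum of p_Y * a_Y over Y with y_i < y_j."""
    return sum(p[Y] * a[Y] for Y in labels if Y[i] < Y[j])

i, j, k = 0, 1, 2
# Property 4: Delta_{i,k} + Delta_{k,j} + Delta_{j,i}
#           = Delta_{k,i} + Delta_{i,j} + Delta_{j,k}
lhs = Delta(i, k) + Delta(k, j) + Delta(j, i)
rhs = Delta(k, i) + Delta(i, j) + Delta(j, k)
```

Since Properties 1 and 4 hold for every p and (a_Y), any choice of weights here should pass the same check.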
Lemma 5 For every p ∈ Λ and non-negative vector (a_Y)_{Y∈Y}, the set of Bayes predictors for the ranking loss is given by

A(p) = {f : for all i < j, f_i > f_j if Δ_{i,j} < Δ_{j,i}; f_i ≠ f_j if Δ_{i,j} = Δ_{j,i}; and f_i < f_j otherwise}.

Proof: From the definition of the conditional risk given by Eq.(2), we have

l(p, f) = Σ_{Y∈Y} p_Y L_rankloss(f, (X, Y)) = Σ_{Y∈Y} p_Y a_Y Σ_{y_i < y_j} I[f_i ≥ f_j]
       = Σ_{Y∈Y} Σ_{1≤i,j≤Q} p_Y a_Y I[f_i ≥ f_j] I[y_i < y_j].

By swapping the two sums, we get

l(p, f) = Σ_{1≤i,j≤Q} I[f_i ≥ f_j] Σ_{Y: y_i < y_j} p_Y a_Y = Σ_{1≤i,j≤Q} I[f_i ≥ f_j] Δ_{i,j}
       = Σ_{1≤i<j≤Q} (I[f_i ≥ f_j] Δ_{i,j} + I[f_j ≥ f_i] Δ_{j,i}).

Hence we complete the proof by combining with Property 5 of Lemma 4.

The following theorem discloses that no convex surrogate loss is consistent with the ranking loss; the proof is deferred to Section 6.2.
Theorem 6 For any convex function φ, the surrogate loss

Ψ(f(X), Y) = Σ_{y_i < y_j} a_Y φ(f_j(X) − f_i(X))

is not multi-label consistent w.r.t. ranking loss.

Intuitively, Property 5 of Lemma 4 implies that the {Δ_{i,j}} define an order ⪰ on the label set L = {1, 2, ..., Q} by i ⪰ j if Δ_{i,j} ≤ Δ_{j,i}. Notice that, for i ≠ j, it is possible that Δ_{i,j} = Δ_{j,i}. The set of Bayes predictors of a reasonable loss function should include all functions which are compatible with this order, i.e., all f with f_i ≥ f_j whenever i ⪰ j. In the definition of the ranking loss given by Eq.(3), however, the same penalty is applied to the cases f_i > f_j and f_i = f_j; thus, the set of Bayes predictors with respect to ranking loss does not include some functions which are compatible with the above order, since what the ranking loss enforces on the label set L is f_i < f_j or f_j < f_i whenever Δ_{i,j} = Δ_{j,i}. As an extreme example, when all the Δ_{i,j} are equal for i ≠ j, minimizing the convex surrogate loss Ψ leads to the optimal solution {f : f_1 = f_2 = ... = f_Q}, but f ∉ A(p) (from Lemma 5). So, the same penalty on f_i > f_j and f_i = f_j encumbers multi-label consistency.

To overcome this deficiency, we instead introduce the partial ranking loss

L_p-rankloss(f, (X, Y)) = a_Y Σ_{y_i < y_j} (I[f_i(X) > f_j(X)] + ½ I[f_i(X) = f_j(X)]), (5)

which has been commonly used for ranking problems. The only difference from the ranking loss lies in the penalty on Σ_{y_i < y_j} I[f_i = f_j]: the ranking loss uses a_Y while the partial ranking loss uses a_Y/2. With a proof similar to that of Lemma 5, we can get the set of Bayes predictors with respect to the partial ranking loss:

A(p) = {f : for all i < j, f_i > f_j if Δ_{i,j} < Δ_{j,i}; and f_i < f_j if Δ_{i,j} > Δ_{j,i}}. (6)

Now consider the above extreme example again, i.e., the Δ_{i,j} are equal for all i ≠ j. It is easy to see that, by minimizing the surrogate loss Ψ, the optimal solution {f : f_1 = f_2 = ... = f_Q} lies in A(p), which exhibits multi-label consistency.
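A sketch of Eq.(5) (a hypothetical helper, using the same common penalty as before): the only change relative to the ranking loss is that a tie f_i = f_j now costs a_Y/2 instead of a_Y:

```python
def partial_ranking_loss(f, y):
    """Partial ranking loss of Eq.(5) with the common penalty
    a_Y = 1 / (#irrelevant * #relevant); ties cost a_Y / 2."""
    pos = [j for j, yj in enumerate(y) if yj == +1]
    neg = [i for i, yi in enumerate(y) if yi == -1]
    if not pos or not neg:
        return 0.0
    a = 1.0 / (len(neg) * len(pos))
    loss = 0.0
    for i in neg:
        for j in pos:
            if f[i] > f[j]:        # reversed pair: full penalty
                loss += a
            elif f[i] == f[j]:     # tie: half penalty
                loss += a / 2
    return loss
```

A fully tied scoring thus costs 1/2 rather than 1, which is what lets constant solutions be Bayes-optimal in the degenerate case discussed above.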
We further study more general cases, and the following theorem provides a sufficient condition for multi-label consistency w.r.t. partial ranking loss:

Theorem 7 The surrogate loss Ψ given by Eq.(4) is multi-label consistent w.r.t. partial ranking loss if φ: R → R is a differentiable and non-increasing function with

φ′(0) < 0 and φ(x) + φ(−x) ≡ 2φ(0), (7)

i.e., φ(x) + φ(−x) = 2φ(0) for every x ∈ R.

The proof of Theorem 7 can be found in Section 6.3, and from this theorem we can get:

Corollary 8 The surrogate loss Ψ given by Eq.(4) is multi-label consistent w.r.t. partial ranking loss if φ(x) = −arctan(x) or φ(x) = (1 − e^{2x})/(1 + e^{2x}).

Notice that Theorem 7 cannot be applied directly to φ(x) = −cx^{2k+1} for a constant c > 0 and integer k ≥ 0, since such a setting yields a surrogate loss Ψ that is not below-bounded. This problem, however, can be solved by introducing a regularization term. With a proof similar to that of Theorem 7, we get:

Theorem 9 The following surrogate loss is multi-label consistent w.r.t. partial ranking loss:

Ψ(f(X), Y) = Σ_{y_i < y_j} a_Y φ(f_j(X) − f_i(X)) + τ Υ(f(X)),

where τ > 0, φ(x) = −cx^{2k+1} for some constant c > 0 and integer k ≥ 0, and Υ is symmetric, that is, Υ(..., f_i(X), ..., f_j(X), ...) = Υ(..., f_j(X), ..., f_i(X), ...).

For example, we can easily construct the following convex surrogate loss

Ψ(f(X), Y) = Σ_{y_i < y_j} a_Y (f_i(X) − f_j(X)) + τ Σ_{i=1}^Q f_i^2(X),

which is multi-label consistent w.r.t. partial ranking loss.

It is worth noting that this does not mean that every convex surrogate loss Ψ given by Eq.(4) is consistent w.r.t. partial ranking loss. In fact, the following theorem proves that many non-linear surrogate losses are inconsistent w.r.t. partial ranking loss.
Theorem 10 If φ: R → R is a convex, differentiable, non-linear and non-increasing function, the surrogate loss Ψ given by Eq.(4) is not multi-label consistent w.r.t. partial ranking loss.

The proof is deferred to Section 6.4. The following corollary shows that some state-of-the-art multi-label learning approaches (Schapire and Singer, 2000, Dekel et al., 2004, Zhang and Zhou, 2006) are not even multi-label consistent w.r.t. partial ranking loss.

Corollary 11 If φ(x) = e^{−x} or φ(x) = ln(1 + exp(−x)), the surrogate loss Ψ given by Eq.(4) is not multi-label consistent w.r.t. partial ranking loss.

In summary, the ranking loss has been suggested as a standard since the early work of Schapire and Singer (2000), and many state-of-the-art multi-label learning approaches work with the formulation given by Eq.(4). However, our analysis shows that no convex surrogate loss function is consistent w.r.t. ranking loss. Thus, ranking loss might not be a good loss function and evaluation criterion for multi-label learning. The partial ranking loss is more reasonable than ranking loss, since it enables many, though not all, convex surrogate loss functions to be consistent. In future work it would be interesting to develop new multi-label learning approaches based on minimizing the partial ranking loss, and to design other loss functions with better consistency.

5 Consistency w.r.t. Hamming Loss

The hamming loss concerns how many instance-label pairs are misclassified. For a given vector function f and prediction function F, the hamming loss is given by

L_hamloss(F(f(X)), Y) = (1/Q) Σ_{i=1}^Q I[ŷ_i ≠ y_i],

where Ŷ = F(f(X)) = (ŷ_1, ŷ_2, ..., ŷ_Q). Hamming loss is obviously below-bounded and interval, since 0 ≤ L_hamloss(F(f(X)), Y) ≤ 1 and, for every X, X′ ∈ X and Y, Y′ ∈ Y, either L_hamloss(F(f(X)), Y) = L_hamloss(F(f(X′)), Y′) or |L_hamloss(F(f(X)), Y) − L_hamloss(F(f(X′)), Y′)| ≥ 1/Q.
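A direct transcription of the hamming loss (a hypothetical helper, not from the paper):

```python
def hamming_loss(y_hat, y):
    """Hamming loss: the fraction of the Q labels that are misclassified.
    Both arguments are vectors in {+1, -1}^Q."""
    assert len(y_hat) == len(y)
    return sum(a != b for a, b in zip(y_hat, y)) / len(y)
```

Since every misclassified label changes the loss by exactly 1/Q, the interval constant γ = 1/Q above is immediate.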
We have the conditional risk:

l(p, F(f(X))) = Σ_{Y∈Y} p_Y L_hamloss(Ŷ, Y) = (1/Q) Σ_{i=1}^Q Σ_{Y∈Y} p_Y I[ŷ_i ≠ y_i]
             = (1/Q) Σ_{i=1}^Q (Σ_{Y∈Y, y_i=+1} p_Y I[ŷ_i ≠ +1] + Σ_{Y∈Y, y_i=−1} p_Y I[ŷ_i ≠ −1]),

and the set of Bayes predictors with regard to hamming loss:

A(p) = {f = f(X): Ŷ = F(f) with ŷ_i = sgn(Σ_{Y∈Y, y_i=+1} p_Y − ½)}. (8)

A straightforward multi-label learning approach is to regard each subset of labels as a new class and then try to learn 2^Q functions, i.e., f = (f_Y)_{Y∈Y}. Such a prediction function is given by

F(f(X)) = arg max_{Y∈Y} f_Y(X). (9)

This approach can be viewed as a direct extension of the one-vs-all strategy for multi-class learning. We consider the following formulation:

Ψ(f(X), Y) = max_{Ŷ∈Y} φ(δ(Ŷ, Y) + f_Ŷ(X) − f_Y(X)), (10)

where φ(x) and δ(Ŷ, Y) are set as φ(x) = max(0, x) and δ(Ŷ, Y) = Σ_{i=1}^Q I[y_i ≠ ŷ_i], respectively, by Taskar et al. (2004) and Hariharan et al. (2010). Before further discussion, we divide multi-label classification tasks into two categories, i.e., deterministic and non-deterministic, as follows:

Definition 12 In multi-label classification, if for every instance X ∈ X there exists a label vector Y ∈ Y such that Pr(Y|X) > 0.5, the task is deterministic, and non-deterministic otherwise.

Consistency for the deterministic case is easier to attain than for the non-deterministic case. For example, many formulations of SVMs are inconsistent for non-deterministic multi-class classification (Zhang, 2004a, Tewari and Bartlett, 2007), but consistent for the deterministic case, as indicated by Zhang (2004a). The following lemma shows that the approaches of (Taskar et al., 2004, Hariharan et al., 2010) are not multi-label consistent w.r.t. hamming loss, even for the deterministic case.
Lemma 13 For deterministic multi-label classification, the surrogate loss Ψ given by Eq.(10) with δ(Ŷ, Y) = Σ_{i=1}^Q I[y_i ≠ ŷ_i] is not multi-label consistent w.r.t. hamming loss.

However, if we instead choose δ(Ŷ, Y) = I[Ŷ ≠ Y], the following theorem guarantees that the surrogate loss is multi-label consistent w.r.t. hamming loss, at least for the deterministic case.

Theorem 14 For deterministic multi-label classification, the surrogate loss Ψ given by Eq.(10) with δ(Ŷ, Y) = I[Ŷ ≠ Y] is multi-label consistent w.r.t. hamming loss.

The proofs of Lemma 13 and Theorem 14 can be found in Sections 6.5 and 6.6, respectively.

Alternatively, it is possible to transform a multi-label learning task into Q independent binary classification tasks (Boutell et al., 2004). Now the goal is to learn Q functions, f = (f_1, f_2, ..., f_Q), and the prediction function is given by

F(f(X)) = (sgn[f_1(X)], sgn[f_2(X)], ..., sgn[f_Q(X)]).

A common choice for the surrogate loss is

Ψ(f(X), Y) = Σ_{i=1}^Q φ(y_i f_i(X)), (11)

where φ is a convex function, chosen as the hinge loss φ(t) = (1 − t)_+ in (Elisseeff and Weston, 2002) and the exponential loss φ(t) = exp(−t) in (Schapire and Singer, 2000). We have

W(p, f) = Σ_{Y∈Y} p_Y Ψ(f(X), Y) = Σ_{i=1}^Q Σ_{Y∈Y} p_Y φ(y_i f_i(X))
        = Σ_{i=1}^Q [p_i^+ φ(f_i(X)) + (1 − p_i^+) φ(−f_i(X))],

where p_i^+ = Σ_{Y: y_i=+1} p_Y and 1 − p_i^+ = Σ_{Y: y_i=−1} p_Y. For simplicity, we write W_i(p_i^+, f_i) = p_i^+ φ(f_i) + (1 − p_i^+) φ(−f_i). It follows that minimizing W(p, f) is equivalent to minimizing W_i(p_i^+, f_i) for every 1 ≤ i ≤ Q, i.e.,

W*(p) = inf_f W(p, f) = Σ_{i=1}^Q inf_{f_i} W_i(p_i^+, f_i).

The consistency of binary classification has been well studied (Zhang, 2004b, Bartlett et al., 2006), and based on the work of Bartlett et al. (2006) we can easily get:

Theorem 15 The surrogate loss Ψ given by Eq.(11) is consistent w.r.t. hamming loss for any convex function φ with φ′(0) < 0.

It is evident from this theorem that the surrogate loss Ψ given by Eq.(11) is consistent w.r.t.
hamming loss if φ is any of the following:

Exponential: φ(x) = e^{−x};
Hinge: φ(x) = max(0, 1 − x);
Least squares: φ(x) = (1 − x)^2;
Logistic regression: φ(x) = ln(1 + exp(−x)).

Notice that in this paper we do not consider how to learn the real-valued functions for small data or a large number of labels, where many labels or label subsets lack enough training examples; this is a challenging task for which the exploitation of label correlations is needed. In our analysis, we assume sufficient training data and ignore label correlations. Moreover, we do not consider how to decide the number of relevant labels for ranking loss and partial ranking loss, which is very challenging in practice. These are important future issues.
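The decomposition above reduces the analysis to the binary case. The sketch below (hypothetical helpers, not from the paper) implements Eq.(11) with the exponential loss, and checks by grid search with the logistic loss that the per-label minimizer of W_i(p_i^+, f_i) has the same sign as sgn(p_i^+ − 1/2), in agreement with the Bayes predictor of Eq.(8):

```python
import math

def br_surrogate(f, y, phi=lambda t: math.exp(-t)):
    """Binary-relevance surrogate of Eq.(11): sum_i phi(y_i * f_i).
    phi defaults to the exponential loss; any convex phi can be passed."""
    return sum(phi(yi * fi) for yi, fi in zip(y, f))

def br_predict(f):
    """F(f(X)) = (sgn f_1(X), ..., sgn f_Q(X))."""
    return [1 if fi > 0 else -1 for fi in f]

def Wi(p_plus, f, phi):
    """Conditional surrogate risk of one label:
    W_i(p_i^+, f_i) = p_i^+ phi(f_i) + (1 - p_i^+) phi(-f_i)."""
    return p_plus * phi(f) + (1 - p_plus) * phi(-f)

logistic = lambda x: math.log(1 + math.exp(-x))
p_plus = 0.8  # hypothetical marginal Pr(y_i = +1 | x)

# Grid-search the minimizer of W_i; for a consistent phi its sign
# must agree with sgn(p_plus - 1/2).
grid = [i / 100.0 for i in range(-500, 501)]
f_star = min(grid, key=lambda f: Wi(p_plus, f, logistic))
```

For the logistic loss the minimizer is the log-odds ln(p_i^+/(1 − p_i^+)), here ln 4 ≈ 1.386, which is positive exactly when p_i^+ > 1/2.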
6 Proofs

6.1 Proof of Theorem 3

We first introduce some useful lemmas:

Lemma 16 W*(p) is continuous on Λ.

Proof: From the Heine definition of continuity, we need to show that W*(p^(n)) → W*(p) for any sequence p^(n) → p. Let B_r be a closed ball with radius r in R^K. Since Y is finite, we have Σ_{Y∈Y} p^(n)_Y Ψ(f, Y) → Σ_{Y∈Y} p_Y Ψ(f, Y) uniformly over f ∈ B_r for any sequence p^(n) → p, leading to

inf_{f∈B_r} Σ_{Y∈Y} p^(n)_Y Ψ(f, Y) → inf_{f∈B_r} Σ_{Y∈Y} p_Y Ψ(f, Y).

From the fact that W*(p^(n)) ≤ inf_{f∈B_r} Σ_{Y∈Y} p^(n)_Y Ψ(f, Y), by letting r → ∞ we have

lim sup_n W*(p^(n)) ≤ W*(p). (12)

Denote Y′ = {Y ∈ Y : p_Y > 0} and assume Ψ(·, ·) ≥ −C for some constant C (since Ψ is below-bounded). We have

W*(p^(n)) ≥ inf_f Σ_{Y∈Y′} p^(n)_Y Ψ(f, Y) − C Σ_{Y∈Y\Y′} p^(n)_Y,

yielding

lim inf_n W*(p^(n)) ≥ lim inf_n (inf_f Σ_{Y∈Y′} p^(n)_Y Ψ(f, Y) − C Σ_{Y∈Y\Y′} p^(n)_Y) = W*(p),

which completes the proof by combining with Eq.(12).

Lemma 17 If the surrogate loss function Ψ is multi-label consistent w.r.t. the loss function L, then for any ε > 0 there exists δ > 0 such that, for every p ∈ Λ, l(p, f) ≥ inf_{f′} l(p, f′) + ε implies W(p, f) ≥ W*(p) + δ.

Proof: We proceed by contradiction. Suppose Ψ is multi-label consistent and there exist ε > 0 and a sequence (p^(n), f^(n)) such that l(p^(n), f^(n)) ≥ inf_f l(p^(n), f) + ε and W(p^(n), f^(n)) − W*(p^(n)) → 0. From the compactness of Λ, there exists a convergent subsequence p^(n_k) → p for some p ∈ Λ. From Lemma 16, we have W(p^(n_k), f^(n_k)) → W*(p). Similarly to the proof of Lemma 16, with Y′ = {Y ∈ Y : p_Y > 0}, we have

lim sup_k W(p, f^(n_k)) ≤ lim sup_k (C Σ_{Y∈Y\Y′} p^(n_k)_Y + Σ_{Y∈Y′} p^(n_k)_Y Ψ(f^(n_k), Y)) ≤ lim_k W(p^(n_k), f^(n_k)) = W*(p).

This gives W(p, f^(n_k)) → W*(p) from the definition of W*(p). Since Ψ is multi-label consistent, there exists a subsequence (f^(n_{k_i})) satisfying l(p, f^(n_{k_i})) → inf_f l(p, f), which contradicts the assumption l(p^(n), f^(n)) ≥ inf_f l(p^(n), f) + ε. Thus the lemma holds.

Proof of Theorem 3: (⇒) We first introduce the notation

H(ε) = inf_{p∈Λ, f} {W(p, f) − inf_{f′} W(p, f′) : l(p, f) − inf_{f′} l(p, f′) ≥ ε}.

It is obvious that H(0) = 0, and H(ε) > 0 for ε > 0 from Lemma 17.
By Corollary 26 of (Zhang, 2004a), there exists a concave function η on [0, ∞) such that η(0) = 0, η(ε) → 0 as ε → 0, and

R(f) − R* ≤ η(R_Ψ(f) − R*_Ψ).

Thus, if R_Ψ(f^(n)) → R*_Ψ then R(f^(n)) → R*.

(⇐) We proceed by contradiction. Suppose Ψ is not multi-label consistent; then there exists some p such that W*(p) = inf{W(p, f): f ∉ A(p)}. Let f^(n) ∉ A(p) be a sequence such
that W(p, f^(n)) → W*(p). For simplicity, we consider X = {x}, i.e., a single instance, and set f_n(x) = f^(n). Then R_Ψ(f_n) = W(p, f^(n)) → W*(p) = R*_Ψ, yielding l(p, f^(n)) → inf_f l(p, f), which is contrary to l(p, f^(n)) ≥ inf_f l(p, f) + γ(p), where γ(p) = inf_{f∉A(p)} l(p, f) − inf_f l(p, f) > 0 since f^(n) ∉ A(p) and L is interval. Thus we complete the proof.

6.2 Proof of Theorem 6

We proceed by contradiction. Suppose that the surrogate loss Ψ is multi-label consistent with ranking loss. We have

W(p, f) = Σ_{Y∈Y} p_Y Ψ(f, Y) = Σ_{Y∈Y} p_Y a_Y Σ_{y_i<y_j} φ(f_j − f_i) = Σ_{1≤i<j≤Q} [φ(f_j − f_i) Δ_{i,j} + φ(f_i − f_j) Δ_{j,i}].

Consider a probability vector p = (p_Y)_{Y∈Y} and penalty vector (a_Y)_{Y∈Y} such that p_{Y_1} = p_{Y_2} and a_{Y_1} = a_{Y_2} for every Y_1 ≠ Y_2, Y_1, Y_2 ∈ Y. This yields Δ_{i,j} = Δ_{m,n} for all 1 ≤ i ≠ j ≤ Q and 1 ≤ m ≠ n ≤ Q, and thus we get

W(p, f) = Δ_{1,2} Σ_{1≤i<j≤Q} [φ(f_j − f_i) + φ(f_i − f_j)].

From the convexity of φ, minimizing W(p, f) gives

W*(p) = W(p, f̂) = Δ_{1,2} Σ_{1≤i<j≤Q} 2φ(0) = Q(Q − 1) φ(0) Δ_{1,2},

where f̂ satisfies f̂_1 = f̂_2 = ... = f̂_Q. Notice that f̂ ∉ A(p) from Lemma 5, and we have

W*(p) = inf{W(p, f): f ∉ A(p)},

which completes the proof.

6.3 Proof of Theorem 7

For any p ∈ Λ and non-negative vector (a_Y)_{Y∈Y}, we will prove that f_i > f_j if Δ_{i,j} < Δ_{j,i}, and f_i < f_j if Δ_{i,j} > Δ_{j,i}, for every f satisfying W(p, f) = W*(p). Without loss of generality, it suffices to prove that f_1 > f_2 if Δ_{1,2} < Δ_{2,1}. We proceed by contradiction and assume that there exists f such that f_1 ≤ f_2 and W*(p) = W(p, f).

For the case f_1 < f_2, we introduce another vector f′ with f′_1 = f_2, f′_2 = f_1 and f′_k = f_k for k ≠ 1, 2. From the definition of the conditional surrogate risk, we have

W(p, f) = Σ_{1≤i<j≤Q} [φ(f_j − f_i) Δ_{i,j} + φ(f_i − f_j) Δ_{j,i}],

which yields

W(p, f) − W(p, f′) = (Δ_{1,2} − Δ_{2,1})(φ(f_2 − f_1) − φ(f_1 − f_2))
  + Σ_{i=3}^Q (Δ_{1,i} − Δ_{2,i})(φ(f_i − f_1) − φ(f_i − f_2))
  + Σ_{i=3}^Q (Δ_{i,1} − Δ_{i,2})(φ(f_1 − f_i) − φ(f_2 − f_i)).

From Property 4 of Lemma 4, we have

Δ_{1,i} − Δ_{2,i} − Δ_{i,1} + Δ_{i,2} = Δ_{1,2} − Δ_{2,1}.
(13)

It follows that

W(p, f) − W(p, f′) = (Δ_{1,2} − Δ_{2,1}) (φ(f_2 − f_1) − φ(f_1 − f_2) + Σ_{i=3}^Q (φ(f_i − f_1) − φ(f_i − f_2)))
  + Σ_{i=3}^Q (Δ_{i,1} − Δ_{i,2})(φ(f_1 − f_i) + φ(f_i − f_1) − φ(f_i − f_2) − φ(f_2 − f_i))
= (Δ_{1,2} − Δ_{2,1})(φ(f_2 − f_1) − φ(f_1 − f_2)) + (Δ_{1,2} − Δ_{2,1}) Σ_{i=3}^Q (φ(f_i − f_1) − φ(f_i − f_2)),
where the last equality holds by the condition φ(x) + φ(−x) ≡ 2φ(0). For a non-increasing function φ with φ′(0) < 0, we have φ(z) < φ(−z) for all z > 0, and φ(f_i − f_1) ≤ φ(f_i − f_2). From the condition Δ_{1,2} < Δ_{2,1} we get W(p, f) > W(p, f′), which is contrary to W(p, f) = W*(p).

We now consider the case f_1 = f_2. For the optimal solution, we have the first-order condition ∂W(p, f)/∂f_i = 0 for i = 1, 2:

Σ_{i≠1} φ′(f_1 − f_i) Δ_{i,1} = Σ_{i≠1} φ′(f_i − f_1) Δ_{1,i} and Σ_{i≠2} φ′(f_i − f_2) Δ_{2,i} = Σ_{i≠2} φ′(f_2 − f_i) Δ_{i,2}.

By combining Eq.(13), f_1 = f_2 and φ′(x) = φ′(−x) (which follows from Eq.(7)), we have

(Δ_{2,1} − Δ_{1,2}) (2φ′(0) + Σ_{i≠1,2} φ′(f_1 − f_i)) = 0,

which is impossible since Δ_{1,2} < Δ_{2,1} and 2φ′(0) + Σ_{i≠1,2} φ′(f_1 − f_i) ≤ 2φ′(0) < 0. Thus, we complete the proof.

6.4 Proof of Theorem 10

For a convex function φ we have φ′(x) ≤ φ′(y) for every x ≤ y (Rockafellar, 1997), and the derivative φ′ is continuous on R if φ is differentiable and convex. Since φ is non-increasing, we have φ′(x) ≤ 0 for all x ∈ R, and without loss of generality we assume φ′(x) < 0.

We proceed by contradiction. Assume the surrogate loss Ψ is multi-label consistent with partial ranking loss for some non-linear function φ. Then, from the continuity of φ′, there exists an interval (c, d) with c < d < 0 or 0 < c < d such that φ′(x) < φ′(y) for every x < y with x, y ∈ (c, d). In the following, we focus on the case 0 < c < d; similar considerations apply to the case c < d < 0.

We first fix a ∈ (c, d) and introduce a new function

G(x) = (φ′(x − a) − φ′(a − x))(φ′(a) + φ′(x)) + φ′(x)(φ′(−a) − φ′(a)).

It is easy to see that G(a) = φ′(a)(φ′(−a) − φ′(a)) > 0 and that G is continuous. Thus there exists b > a with b ∈ (c, d) such that

G(b) = (φ′(b − a) − φ′(a − b))(φ′(a) + φ′(b)) + φ′(b)(φ′(−a) − φ′(a)) > 0,

which gives

(φ′(b − a)(φ′(a) + φ′(b)) + φ′(b)φ′(−a)) / (φ′(a − b)(φ′(a) + φ′(b)) + φ′(b)φ′(a)) > 1. (14)

Moreover, from φ′(a) < φ′(b) < 0 and φ′(−b) ≤ φ′(−a) < 0, we have

0 < φ′(−a)/φ′(a) ≤ φ′(−b)/φ′(a) < φ′(−b)/φ′(b).
(15)

We consider the following multi-label classification problem with Q = 3 labels:

Y_1 = (−1, +1, +1), Y_2 = (+1, −1, −1), Y_3 = (+1, +1, −1), Y_4 = (−1, −1, +1).

Let f = (f_1, f_2, f_3) be such that a = f_3 − f_1 and b = f_3 − f_2, and thus f_1 > f_2. For every probability vector p = (p_{Y_1}, p_{Y_2}, p_{Y_3}, p_{Y_4}) ∈ Λ and every penalty vector (a_{Y_1}, a_{Y_2}, a_{Y_3}, a_{Y_4}), we have

W(p, f) = Δ_{1,2} φ(f_2 − f_1) + Δ_{2,1} φ(f_1 − f_2) + Δ_{1,3} φ(f_3 − f_1) + Δ_{3,1} φ(f_1 − f_3) + Δ_{2,3} φ(f_3 − f_2) + Δ_{3,2} φ(f_2 − f_3),

where Δ_{1,2} = P_1, Δ_{2,1} = P_2, Δ_{1,3} = P_1 + P_4, Δ_{3,1} = P_2 + P_3, Δ_{2,3} = P_4 and Δ_{3,2} = P_3, with P_i = p_{Y_i} a_{Y_i} for i = 1, 2, 3, 4. In the following, we construct p̄ and ā such that W*(p̄) = W(p̄, f) and Δ_{1,2} > Δ_{2,1}. The subgradient condition for optimality of W(p, f) gives

∂W(p, f)/∂f_1 = −P_1 φ′(a − b) + P_2 φ′(b − a) − (P_1 + P_4) φ′(a) + (P_2 + P_3) φ′(−a) = 0,
∂W(p, f)/∂f_2 = P_1 φ′(a − b) − P_2 φ′(b − a) − P_4 φ′(b) + P_3 φ′(−b) = 0,
∂W(p, f)/∂f_3 = (P_1 + P_4) φ′(a) − (P_2 + P_3) φ′(−a) + P_4 φ′(b) − P_3 φ′(−b) = 0,
[Figure 1: Lines l_1 (solid) and l_2 (dashed) in the (P_3, P_4) plane, corresponding to Eqs. (16) and (17), respectively.]

This system is equivalent to

P_1 φ′(a − b) − P_2 φ′(b − a) = P_4 φ′(b) − P_3 φ′(−b), (16)
P_1 φ′(a) − P_2 φ′(−a) = −P_4 (φ′(a) + φ′(b)) + P_3 (φ′(−a) + φ′(−b)). (17)

From Lemma 18, there exist p̄ = (p̄_{Y_1}, p̄_{Y_2}, p̄_{Y_3}, p̄_{Y_4}) and (ā_{Y_1}, ā_{Y_2}, ā_{Y_3}, ā_{Y_4}) satisfying Eqs. (16) and (17) with P̄_1 > P̄_2. This yields W(p̄, f) = W*(p̄). Notice that f ∉ A(p̄) from Eq.(6), since Δ_{1,2} = P̄_1 > P̄_2 = Δ_{2,1} and f_1 > f_2; hence

W*(p̄) = inf{W(p̄, f): f ∉ A(p̄)},

which completes the proof.

Lemma 18 There exist P_1 > P_2 > 0, P_3 > 0 and P_4 > 0 satisfying Eqs. (16) and (17) if Eqs. (14) and (15) hold for some 0 < a < b.

Proof: From Eq.(14), we can choose

1 < P_1/P_2 < (φ′(b − a)(φ′(a) + φ′(b)) + φ′(b)φ′(−a)) / (φ′(a − b)(φ′(a) + φ′(b)) + φ′(b)φ′(a)). (18)

For a < b, we have φ′(a − b) ≤ φ′(b − a) < 0, yielding P_1/P_2 > 1 ≥ φ′(b − a)/φ′(a − b), which gives P_1 φ′(a − b) − P_2 φ′(b − a) < 0. Thus, Eq.(16) corresponds to the line l_1 in Figure 1. From Eq.(15), we further obtain

0 < (φ′(−a) + φ′(−b)) / (φ′(a) + φ′(b)) < φ′(−b)/φ′(b).

To guarantee that P_3 > 0 and P_4 > 0 satisfying Eqs. (16) and (17) exist, as shown in Figure 1, we need

(P_1 φ′(a − b) − P_2 φ′(b − a)) / φ′(b) < (P_1 φ′(a) − P_2 φ′(−a)) / (φ′(a) + φ′(b)).

The above holds from Eq.(18). Thus, we complete the proof.

6.5 Proof of Lemma 13

We consider a multi-label classification task with Q = 2 labels, i.e., L = {1, 2}, and let Y_1 = (−1, −1), Y_2 = (−1, +1), Y_3 = (+1, +1) and Y_4 = (+1, −1). Suppose p_{Y_4} = 0 and p_{Y_2} + p_{Y_3} < p_{Y_1} < 2p_{Y_2} + p_{Y_3}. It is evident that p_{Y_1} > 0.5, and thus this multi-label classification task is deterministic. It suffices to consider f = (f_{Y_1}, f_{Y_2}, f_{Y_3}). By combining Eqs. (8) and (10), we get

A(p) = {f : f_{Y_1} > f_{Y_2} and f_{Y_1} > f_{Y_3}}.
We also have

$$W(p, f) = p_{Y_1}\max\{\phi(1 + f_{Y_2} - f_{Y_1}),\ \phi(2 + f_{Y_3} - f_{Y_1})\} + p_{Y_2}\max\{\phi(1 + f_{Y_1} - f_{Y_2}),\ \phi(1 + f_{Y_3} - f_{Y_2})\} + p_{Y_3}\max\{\phi(2 + f_{Y_1} - f_{Y_3}),\ \phi(1 + f_{Y_2} - f_{Y_3})\},$$

and

$$\bar{W}(p) = W(p, f^*) = \inf\nolimits_f W(p, f) = 2p_{Y_1} + 2p_{Y_3},$$

where $f^* = (f^*_{Y_1}, f^*_{Y_2}, f^*_{Y_3})$ is such that $f^*_{Y_1} = f^*_{Y_3}$ and $f^*_{Y_1} = f^*_{Y_2} - 1$. Hence $f^* \notin A(p)$, and $\Psi$ is not multi-label consistent.

6.6 Proof of Theorem 14

We first introduce a useful lemma:

Lemma 19 For the surrogate loss $\Psi$ given by Eq. (10) with $\delta(\hat{Y}, Y) = I[\hat{Y} \neq Y]$, and for every $p \in \Lambda$ and $f$ such that $W(p, f) = \bar{W}(p)$, we have $f_{Y_1} \ge f_{Y_2}$ if $p_{Y_1} > p_{Y_2}$.

Proof: We proceed by contradiction. Assume that there exist some $p \in \Lambda$ and $f$ such that $f_{Y_1} < f_{Y_2}$, $p_{Y_1} > p_{Y_2}$ and $\bar{W}(p) = W(p, f)$. We define a new real-valued vector $f'$ such that $f'_{Y_1} = f_{Y_2}$, $f'_{Y_2} = f_{Y_1}$ and $f'_{Y_i} = f_{Y_i}$ for $i \neq 1, 2$. It follows that

$$W(p, f) - W(p, f') = (p_{Y_1} - p_{Y_2})\Big(\max_{Y \neq Y_1}\phi\big(1 + f_Y(X) - f_{Y_1}(X)\big) - \max_{Y \neq Y_2}\phi\big(1 + f_Y(X) - f_{Y_2}(X)\big)\Big).$$

From the assumption $f_{Y_1} < f_{Y_2}$, we have

$$\max_{Y \neq Y_1}\phi\big(1 + f_Y(X) - f_{Y_1}(X)\big) > \max_{Y \neq Y_2}\phi\big(1 + f_Y(X) - f_{Y_2}(X)\big),$$

leading to $W(p, f) - W(p, f') > 0$, which is contrary to the assumption $\bar{W}(p) = W(p, f)$. Thus, we complete the proof.

Proof of Theorem 14: Without loss of generality, we consider a probability vector $p = (p_Y)_{Y \in \mathcal{Y}}$ satisfying $p_{Y_1} > p_{Y_2} \ge p_{Y_3} \ge \ldots \ge p_{Y_{2^Q}}$ and $p_{Y_1} > 0.5$. It is easy to get

$$A(p) = \{f \colon f_{Y_1} > f_Y \text{ for } Y \neq Y_1\}$$

from Eqs. (8) and (9). On the other hand, for each $f$ satisfying $W(p, f) = \bar{W}(p)$, we have $f_{Y_1} \ge f_{Y_2} \ge \ldots \ge f_{Y_{2^Q}}$ from Lemma 19. Thus, we complete the proof from $p_{Y_1} > \sum_{Y \neq Y_1} p_Y$ and the fact that

$$W(p, f) = p_{Y_1}\phi(1 + f_{Y_2} - f_{Y_1}) + \sum_{Y \neq Y_1} p_Y\,\phi(1 + f_{Y_1} - f_Y).$$

Acknowledgements

The authors want to thank the anonymous reviewers for their helpful comments and suggestions, and Tong Zhang for reading a preliminary version. This research was supported by the National Fundamental Research Program of China (2010CB327903), the National Science Foundation of China ( ) and the Jiangsu Science Foundation (BK ).

References

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 2006.
M. R. Boutell, J.
Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9), 2004.
D. Cossock and T. Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11), 2008.
O. Dekel, C. Manning, and Y. Singer. Log-linear models for label ranking. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
K. Dembczyński, W. W. Cheng, and E. Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.
J. C. Duchi, L. W. Mackey, and M. I. Jordan. On the consistency of ranking algorithms. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.
A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.
N. Ghamrawi and A. McCallum. Collective multi-label classification. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, Bremen, Germany, 2005.
S. Godbole and S. Sarawagi. Discriminative methods for multi-labeled classification. In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 22-30, Sydney, Australia, 2004.
B. Hariharan, L. Zelnik-Manor, S. Vishwanathan, and M. Varma. Large scale max-margin multi-label classification with priors. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.
D. Hsu, S. M. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22. MIT Press, Cambridge, MA, 2009.
Y. Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6(3), 2002.
J. Petterson and T. Caetano. Reverse multi-label learning. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A.
Culotta, editors, Advances in Neural Information Processing Systems 23. MIT Press, Cambridge, MA, 2010.
G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, and H.-J. Zhang. Correlative multi-label video annotation. In Proceedings of the 15th ACM International Conference on Multimedia, pages 17-26, Augsburg, Germany, 2007.
R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2), 2000.
I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1), 2005.
B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8, 2007.
F. Xia, T. Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: Theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008.
M.-L. Zhang and Z.-H. Zhou. Multi-label neural network with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10), 2006.
M.-L. Zhang and Z.-H. Zhou. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7), 2007.
T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 2004a.
T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56-85, 2004b.
Z.-H. Zhou and M.-L. Zhang. Multi-instance multi-label learning with application to scene classification. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More information