On the Consistency of Multi-Label Learning


On the Consistency of Multi-Label Learning

Wei Gao and Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology
Nanjing University, Nanjing, China
{gaow,zhouzh}@lamda.nju.edu.cn

Abstract

Multi-label learning has attracted much attention during the past few years. Many multi-label learning approaches have been developed, mostly working with surrogate loss functions, since multi-label loss functions are usually difficult to optimize directly owing to non-convexity and discontinuity. Though these approaches are effective, to the best of our knowledge, there is no theoretical result on the convergence of the risk of the learned functions to the Bayes risk. In this paper, focusing on two well-known multi-label loss functions, i.e., ranking loss and hamming loss, we prove a necessary and sufficient condition for the consistency of multi-label learning based on surrogate loss functions. Our results disclose that, surprisingly, no convex surrogate loss is consistent with ranking loss. Inspired by this finding, we introduce the partial ranking loss, with which some surrogate functions are consistent. For hamming loss, we show that some recent multi-label learning approaches are inconsistent even for deterministic multi-label classification, and give a surrogate loss function which is consistent for the deterministic case. Finally, we discuss the consistency of learning approaches which address multi-label learning by decomposing it into a set of binary classification problems.

1 Introduction

In traditional supervised learning, each instance is associated with a single label. In real-world applications, however, one object is usually relevant to multiple labels simultaneously. For example, a document about national education service may be categorized into several predefined topics, such as government and education; an image containing forests may be annotated with trees and mountains. For learning with such objects, multi-label learning has attracted much attention during the past few years and many effective approaches have been developed (Schapire and Singer, 2000, Elisseeff and Weston, 2002, Zhou and Zhang, 2007, Zhang and Zhou, 2007, Hsu et al., 2009, Dembczyński et al., 2010, Petterson and Caetano, 2010).

The consistency (also called Bayes consistency) of learning algorithms concerns whether the expected risk of a learned function converges to the Bayes risk as the training sample size increases (Lin, 2002, Zhang, 2004b, Steinwart, 2005, Bartlett et al., 2006, Tewari and Bartlett, 2007, Duchi et al., 2010). Nowadays, it is well accepted that a good learner should at least be consistent with large samples. It is noteworthy that, though many efforts have been devoted to multi-label learning, few theoretical aspects have been explored; in particular, to the best of our knowledge, the consistency of multi-label learning remains untouched although it is a very important theoretical issue.

In this paper, focusing on two well-known multi-label loss functions, i.e., ranking loss and hamming loss, we present a theoretical study on the consistency of multi-label learning. We prove a necessary and sufficient condition for the consistency of multi-label learning based on surrogate loss functions. Our analysis discloses that, surprisingly, any convex surrogate loss is inconsistent with ranking loss. Based on this finding, we introduce the partial ranking loss, which is consistent with some surrogate loss functions; we also show that many current multi-label learning approaches are not even consistent with the partial ranking loss.
As for hamming loss, our analysis shows that some recent multi-label learning approaches are inconsistent even for deterministic multi-label classification, and we give a surrogate loss which is consistent with hamming loss for the deterministic case. Finally, we discuss the consistency of approaches which address multi-label learning by transforming the problem into a set of binary classification problems.

The rest of this paper is organized as follows. Section 2 briefly introduces the research background. Section 3 presents our necessary and sufficient condition for the consistency of multi-label learning. Sections 4 and 5 study the consistency of multi-label learning approaches with regard to ranking loss and hamming loss, respectively. Section 6 gives the detailed proofs.

2 Background

Let $\mathcal{X}$ be an instance space and $\mathcal{L} = \{1, 2, \dots, Q\}$ a finite set of possible labels. An instance $X \in \mathcal{X}$ is associated with a subset of labels $Y \subseteq \mathcal{L}$, called the relevant labels, while the complement $\mathcal{L} \setminus Y$ is called the irrelevant labels. For convenience of discussion, we represent the labels as a binary vector $Y = (y_1, y_2, \dots, y_Q)$, where $y_i = +1$ if label $i$ is relevant to $X$ and $y_i = -1$ otherwise, and we denote by $\mathcal{Y} = \{+1, -1\}^Q$ the set of all possible label vectors. Let $\mathcal{D}$ denote an unknown underlying probability distribution over $\mathcal{X} \times \mathcal{Y}$. For an integer $m > 0$, we write $[m] = \{1, 2, \dots, m\}$, and for a real number $r$, $\lfloor r \rfloor$ denotes the greatest integer no more than $r$.

The formal description of multi-label learning in the probabilistic setting is as follows: given a training sample $S = \{(X_1, Y_1), (X_2, Y_2), \dots, (X_m, Y_m)\}$ drawn i.i.d. according to the distribution $\mathcal{D}$, the objective is to learn a function $h\colon \mathcal{X} \to \mathcal{Y}$ which is able to assign a set of labels to unseen instances. In general, it is not easy to learn $h$ directly, and in practice one instead learns a real-valued vector function $f = (f_1, f_2, \dots, f_K)\colon \mathcal{X} \to \mathbb{R}^K$ for some integer $K > 0$, where $K = Q$ and $K = 2^Q$ are common choices. Based on this vector function, a prediction function $F\colon \mathbb{R}^K \to \mathcal{Y}$ can be attained for assigning the set of relevant labels to an instance. Another popular approach for multi-label learning is to learn a real-valued vector function $f = (f_1, f_2, \dots, f_Q)$ such that $f_i(X) > f_j(X)$ if $y_i = +1$ and $y_j = -1$ for an instance-label pair $(X, Y)$, together with a function that determines the number of relevant labels.

Essentially, multi-label learning approaches try to minimize the expected risk of $f$ with regard to some loss $L$, i.e.,

    $R(f) = \mathbb{E}_{(X,Y) \sim \mathcal{D}}[L(f(X), Y)]$.    (1)

Notice that $f$ could be a prediction function or a vector of real-valued functions, according to different losses. Further denote by $R^* = \inf_f R(f)$ the minimal risk (i.e., the Bayes risk) over all measurable functions. In this paper, we mainly focus on loss functions which are below-bounded and interval, defined as follows:

Definition 1 A loss function $L$ is said to be below-bounded if $L(\cdot, \cdot) \ge B$ holds for some constant $B$; $L$ is said to be interval if, for some constant $\gamma > 0$, either $L(f(X), Y) = L(f(X'), Y')$ or $|L(f(X), Y) - L(f(X'), Y')| \ge \gamma$ holds for every $X, X' \in \mathcal{X}$ and $Y, Y' \in \mathcal{Y}$.

There are many multi-label loss functions (also called evaluation criteria), e.g., ranking loss, hamming loss, one-error, coverage and average precision (Schapire and Singer, 2000, Zhang and Zhou, 2006); accuracy, precision, recall and F1 (Godbole and Sarawagi, 2004, Qi et al., 2007); subset accuracy (Ghamrawi and McCallum, 2005); etc. In this paper, we focus on two well-known losses, i.e., ranking loss and hamming loss, and leave the discussion on other losses to future work. Notice that all the above multi-label losses are non-convex and discontinuous. It is difficult, or even impossible, to optimize these losses directly. A feasible method in practice is to consider instead a surrogate loss function which can be optimized by efficient algorithms.
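As a small illustration of this setup (not from the paper; the helper name and array layout below are assumptions), the risk in Eq. (1) can be approximated on a finite sample by an empirical average, with labels encoded as vectors in $\{+1, -1\}^Q$:

```python
import numpy as np

def empirical_risk(loss, f, X, Y):
    """Sample-mean estimate of R(f) = E_{(X,Y)~D}[L(f(X), Y)] from Eq. (1).

    X: (m, d) array of instances; Y: (m, Q) array with entries in {+1, -1};
    f: callable mapping one instance to whatever the chosen loss L expects
       (a predicted label vector or a vector of real-valued scores).
    """
    return float(np.mean([loss(f(x), y) for x, y in zip(X, Y)]))
```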
Actually, most existing multi-label learning approaches, such as the boosting algorithm AdaBoost.MH (Schapire and Singer, 2000), the neural network algorithm BP-MLL (Zhang and Zhou, 2006), SVM-style algorithms (Elisseeff and Weston, 2002, Taskar et al., 2004, Hariharan et al., 2010), etc., in essence try to optimize some surrogate loss such as the exponential loss or the hinge loss.

There are many definitions of consistency, e.g., Fisher consistency (Lin, 2002), infinite-sample consistency (Zhang, 2004a), classification calibration (Bartlett et al., 2006, Tewari and Bartlett, 2007), edge-consistency (Duchi et al., 2010), etc., and the consistency of learning algorithms based on optimizing a surrogate loss function has been well studied for binary classification (Zhang, 2004b, Steinwart, 2005, Bartlett et al., 2006), multi-class classification (Zhang, 2004a, Tewari and Bartlett, 2007), learning to rank (Cossock and Zhang, 2008, Xia et al., 2008, Duchi et al., 2010), etc. The consistency of multi-label learning, however, remains untouched, and to the best of our knowledge, this paper presents the first theoretical analysis of the consistency of multi-label learning.

3 Multi-Label Consistency

Given an instance $X \in \mathcal{X}$, we denote by $p(X)$ the vector of conditional probabilities of $Y \in \mathcal{Y}$, i.e., $p(X) = (p_Y(X))_{Y \in \mathcal{Y}} = (\Pr(Y \mid X))_{Y \in \mathcal{Y}}$, for some $p(X) \in \Lambda$, where $\Lambda$ is the set of all possible conditional probability vectors, i.e.,

    $\Lambda = \{ p \in \mathbb{R}^{|\mathcal{Y}|} \colon \sum_{Y \in \mathcal{Y}} p_Y = 1 \text{ and } p_Y \ge 0 \}$.

In the following, for notational simplicity, we will suppress the dependence of $p(X)$ and $f(X)$ on the instance $X$ and write $p$ and $f$, respectively, when it is clear from the context. For an instance $X \in \mathcal{X}$, we define the conditional risk of $f$ as

    $\ell(p, f) = \sum_{Y \in \mathcal{Y}} p_Y L(f, Y) = \sum_{Y \in \mathcal{Y}} \Pr(Y \mid X)\, L(f(X), Y)$.    (2)

It is easy to get the expected risk and the minimal risk, respectively, as

    $R(f) = \mathbb{E}_X[\ell(p, f)]$  and  $R^* = \mathbb{E}_X[\inf_f \ell(p, f)]$.

We further define the set of Bayes predictors as

    $A(p) = \{ f \colon \ell(p, f) = \inf_f \ell(p, f) \}$.

Notice that $A(p) \neq \emptyset$ since $L$ is interval and below-bounded. As mentioned above, the loss $L$ in multi-label learning is generally non-convex and discontinuous, and it is difficult, or even impossible, to minimize the risk given by Eq. (1) directly. In practice, a surrogate loss function $\Psi$ is usually considered in place of $L$. We define the $\Psi$-risk and the Bayes $\Psi$-risk of $f$, respectively, as

    $R_\Psi(f) = \mathbb{E}_{X,Y}[\Psi(f(X), Y)]$  and  $R_\Psi^* = \inf_f R_\Psi(f)$.

Similarly, we define the conditional surrogate risk and the conditional Bayes surrogate risk of $f$, respectively, as

    $W(p, f) = \sum_{Y \in \mathcal{Y}} p_Y \Psi(f, Y)$  and  $W^*(p) = \inf_f W(p, f)$.

It is obvious that $R_\Psi(f) = \mathbb{E}_X[W(p, f)]$ and $R_\Psi^* = \mathbb{E}_X[W^*(p)]$. We now define multi-label consistency as follows:

Definition 2 Given a below-bounded surrogate loss function $\Psi$ such that $\Psi(\cdot, Y)$ is continuous for every $Y \in \mathcal{Y}$, $\Psi$ is said to be multi-label consistent w.r.t. the loss $L$ if it holds, for every $p \in \Lambda$, that

    $W^*(p) < \inf\{ W(p, f) \colon f \notin A(p) \}$.

The following theorem states that multi-label consistency is the necessary and sufficient condition for the convergence of the $\Psi$-risk to the Bayes $\Psi$-risk to imply $R(f) \to R^*$.

Theorem 3 The surrogate loss $\Psi$ is multi-label consistent w.r.t. the loss $L$ if and only if, for any sequence $f^{(n)}$, $R_\Psi(f^{(n)}) \to R_\Psi^*$ implies $R(f^{(n)}) \to R^*$.

The proof of this theorem is inspired by the techniques of (Zhang, 2004a) and (Tewari and Bartlett, 2007), and we defer it to Section 6.1.

4 Consistency w.r.t. Ranking Loss

The ranking loss concerns the average fraction of label pairs that are ordered reversely for an instance. For a real-valued vector function $f = (f_1, f_2, \dots, f_Q)$, the ranking loss is given by

    $L_{\mathrm{rankloss}}(f, (X, Y)) = a_Y \sum_{y_i = -1,\, y_j = +1} I[f_i(X) \ge f_j(X)] = a_Y \sum_{y_i < y_j} I[f_i(X) \ge f_j(X)]$,    (3)

where $a_Y$ is a non-negative penalty and $I$ is the indicator function, i.e., $I[\pi]$ equals 1 if $\pi$ holds and 0 otherwise. In multi-label learning, the most common penalty is $a_Y = |\{y_i \colon y_i = -1\}|^{-1}\, |\{y_j \colon y_j = +1\}|^{-1}$.
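As a concrete reference point (a minimal sketch assuming NumPy; the function name is hypothetical and not from the paper), the ranking loss of Eq. (3) with the common penalty above can be computed as:

```python
import numpy as np

def ranking_loss(scores, y):
    """Ranking loss of Eq. (3) with the common penalty
    a_Y = |{i: y_i = -1}|^{-1} * |{j: y_j = +1}|^{-1}; a tie f_i = f_j counts fully."""
    scores, y = np.asarray(scores, dtype=float), np.asarray(y)
    neg, pos = np.flatnonzero(y == -1), np.flatnonzero(y == +1)
    if len(neg) == 0 or len(pos) == 0:
        return 0.0                      # no comparable pair: the common penalty is undefined
    a_y = 1.0 / (len(neg) * len(pos))
    return float(a_y * sum(scores[i] >= scores[j] for i in neg for j in pos))
```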

In this paper we consider the more general setting of an arbitrary non-negative penalty. It is easy to see that the ranking loss is below-bounded and interval, since $L_{\mathrm{rankloss}}(f, (X, Y)) \ge 0$, and for every $X, X' \in \mathcal{X}$ and $Y, Y' \in \mathcal{Y}$ it holds that either $L_{\mathrm{rankloss}}(f, (X, Y)) = L_{\mathrm{rankloss}}(f, (X', Y'))$ or $|L_{\mathrm{rankloss}}(f, (X, Y)) - L_{\mathrm{rankloss}}(f, (X', Y'))| \ge \gamma$, where $\gamma = \min\{ |i\, a_Y - j\, a_{Y'}| > 0 \colon i, j \in [(Q - \lfloor Q/2 \rfloor)\lfloor Q/2 \rfloor],\ Y, Y' \in \mathcal{Y} \}$.

Considering Eq. (3), a natural choice for the surrogate loss is

    $\Psi(f(X), Y) = a_Y \sum_{y_i = -1,\, y_j = +1} \phi(f_j(X) - f_i(X)) = a_Y \sum_{y_i < y_j} \phi(f_j(X) - f_i(X))$,    (4)

where $\phi$ is a convex real-valued function, chosen as the hinge loss $\phi(x) = (1 - x)_+$ in (Elisseeff and Weston, 2002) and the exponential loss $\phi(x) = \exp(-x)$ in (Schapire and Singer, 2000, Dekel et al., 2004, Zhang and Zhou, 2006).

Before our discussion, it is necessary to introduce the notations

    $\Delta_{i,j} = \sum_{Y \colon y_i < y_j} p_Y a_Y$  and  $\delta(i, j; k, l) = \sum_{Y \colon y_i < y_j,\ y_k < y_l} p_Y a_Y$,

for a given vector $p \in \Lambda$ and a non-negative vector $(a_Y)_{Y \in \mathcal{Y}}$. It is easy to get the following lemmas:

Lemma 4 For a vector $p \in \Lambda$ and a non-negative vector $(a_Y)_{Y \in \mathcal{Y}}$, the following properties hold:
1. $\Delta_{i,i} = 0$;
2. $\delta(i, j; k, l) = \delta(k, l; i, j)$;
3. $\Delta_{i,j} = \delta(i, j; i, k) + \delta(i, j; k, j)$ for every $k \neq i, j$;
4. $\Delta_{i,k} + \Delta_{k,j} + \Delta_{j,i} = \Delta_{k,i} + \Delta_{i,j} + \Delta_{j,k}$;
5. $\Delta_{i,k} \le \Delta_{k,i}$ if $\Delta_{i,j} \le \Delta_{j,i}$ and $\Delta_{j,k} \le \Delta_{k,j}$.

Proof: Properties 1 and 2 are immediate from the definitions. For Property 3, we have $y_i = -1$ and $y_j = +1$ for every $Y \in \mathcal{Y}$ satisfying $y_i < y_j$. For $y_k$ ($k \neq i, j$), there are only two choices, $y_k = +1$ or $y_k = -1$, and thus $\Delta_{i,j} = \delta(i, j; i, k) + \delta(i, j; k, j)$. For Property 4, we have, from Property 3:

    $\Delta_{i,k} = \delta(i, k; j, k) + \delta(i, k; i, j)$,   $\Delta_{k,i} = \delta(k, i; j, i) + \delta(k, i; k, j)$,   $\Delta_{j,i} = \delta(j, i; k, i) + \delta(j, i; j, k)$,
    $\Delta_{i,j} = \delta(i, j; k, j) + \delta(i, j; i, k)$,   $\Delta_{k,j} = \delta(k, j; k, i) + \delta(k, j; i, j)$,   $\Delta_{j,k} = \delta(j, k; j, i) + \delta(j, k; i, k)$.

Thus, Property 4 holds by combining with Property 2. Property 5 follows directly from Property 4.

Lemma 5 For every $p \in \Lambda$ and non-negative vector $(a_Y)_{Y \in \mathcal{Y}}$, the set of Bayes predictors for ranking loss is given by

    $A(p) = \{ f \colon \text{for all } i < j,\ f_i > f_j \text{ if } \Delta_{i,j} < \Delta_{j,i};\ f_i \neq f_j \text{ if } \Delta_{i,j} = \Delta_{j,i};\ \text{and } f_i < f_j \text{ otherwise} \}$.

Proof: From the definition of the conditional risk given by Eq. (2), we have

    $\ell(p, f) = \sum_{Y \in \mathcal{Y}} p_Y L_{\mathrm{rankloss}}(f, (X, Y)) = \sum_{Y \in \mathcal{Y}} p_Y a_Y \sum_{y_i < y_j} I[f_i \ge f_j] = \sum_{Y \in \mathcal{Y}} \sum_{1 \le i, j \le Q} p_Y a_Y\, I[f_i \ge f_j]\, I[y_i < y_j]$.

By swapping the two sums, we get

    $\ell(p, f) = \sum_{1 \le i, j \le Q} I[f_i \ge f_j] \sum_{Y \colon y_i < y_j} p_Y a_Y = \sum_{1 \le i, j \le Q} I[f_i \ge f_j]\, \Delta_{i,j} = \sum_{1 \le i < j \le Q} \big( I[f_i \ge f_j]\, \Delta_{i,j} + I[f_j \ge f_i]\, \Delta_{j,i} \big)$.

Hence we complete the proof by combining with Property 5 in Lemma 4.
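The quantities $\Delta_{i,j}$ are easy to compute explicitly for a small label set, which can help when checking Lemma 5 on concrete distributions. The sketch below (a hypothetical helper assuming NumPy, not from the paper) enumerates $\mathcal{Y} = \{-1, +1\}^Q$ and accumulates $p_Y a_Y$ over the pairs with $y_i < y_j$:

```python
import numpy as np
from itertools import product

def delta_matrix(p, a, Q):
    """Delta[i, j] = sum over Y in {-1,+1}^Q with y_i < y_j of p_Y * a_Y.

    p and a are dicts keyed by label tuples; missing keys default to probability 0
    and penalty 1. By Lemma 5, a Bayes score vector for ranking loss puts
    f_i > f_j whenever Delta[i, j] < Delta[j, i] and must avoid the tie
    f_i = f_j whenever Delta[i, j] == Delta[j, i].
    """
    delta = np.zeros((Q, Q))
    for Y in product((-1, +1), repeat=Q):
        w = p.get(Y, 0.0) * a.get(Y, 1.0)
        if w == 0.0:
            continue
        for i in range(Q):
            for j in range(Q):
                if Y[i] < Y[j]:
                    delta[i, j] += w
    return delta
```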

The following theorem discloses that no convex surrogate loss is consistent with ranking loss; the proof is deferred to Section 6.2.

Theorem 6 For any convex function $\phi$, the surrogate loss

    $\Psi(f(X), Y) = a_Y \sum_{y_i < y_j} \phi(f_j(X) - f_i(X))$

is not multi-label consistent w.r.t. ranking loss.

Intuitively, Property 5 of Lemma 4 implies that $\{\Delta_{i,j}\}$ defines an order $\succeq$ on the label set $\mathcal{L} = \{1, 2, \dots, Q\}$ by $i \succeq j$ if $\Delta_{i,j} \le \Delta_{j,i}$. Notice that, for $i \neq j$, it is possible that $\Delta_{i,j} = \Delta_{j,i}$. The set of Bayes predictors of a reasonable loss function should include all functions which are compatible with this order, i.e., all $f$'s for which $f_i \ge f_j$ if $i \succeq j$. In the definition of ranking loss given by Eq. (3), however, the same penalty is applied to $f_i > f_j$ and $f_i = f_j$; thus, the set of Bayes predictors with respect to ranking loss does not include some functions which are compatible with the above order, since what ranking loss enforces on the label set $\mathcal{L}$ is $f_i < f_j$ or $f_j < f_i$ whenever $\Delta_{i,j} = \Delta_{j,i}$. For an extreme example, when all the $\Delta_{i,j}$'s are equal for $i \neq j$, minimizing the convex surrogate loss function $\Psi$ leads to the optimal solution $\{f \colon f_1 = f_2 = \dots = f_Q\}$, which does not belong to $A(p)$ (from Lemma 5). So, the identical penalty on $f_i > f_j$ and $f_i = f_j$ encumbers multi-label consistency.

To overcome this deficiency, we instead introduce the partial ranking loss

    $L_{\text{p-rankloss}}(f, (X, Y)) = a_Y \sum_{y_i < y_j} \big( I[f_i(X) > f_j(X)] + \tfrac{1}{2}\, I[f_i(X) = f_j(X)] \big)$,    (5)

which has been commonly used for ranking problems. The only difference from ranking loss lies in the use of different penalties for $\sum_{y_i < y_j} I[f_i = f_j]$: the ranking loss uses $a_Y$ while the partial ranking loss uses $a_Y/2$. With a proof similar to that of Lemma 5, we can get the set of Bayes predictors with respect to the partial ranking loss:

    $A(p) = \{ f \colon \text{for all } i < j,\ f_i > f_j \text{ if } \Delta_{i,j} < \Delta_{j,i};\ \text{and } f_i < f_j \text{ if } \Delta_{i,j} > \Delta_{j,i} \}$.    (6)

Now, consider again the extreme example above, i.e., all $\Delta_{i,j}$'s equal for $i \neq j$. It is easy to see that, by minimizing the surrogate loss function $\Psi$, the optimal solution $\{f \colon f_1 = f_2 = \dots = f_Q\}$ is contained in $A(p)$, which exhibits multi-label consistency. We further study more general cases, and the following theorem provides a sufficient condition for multi-label consistency w.r.t. partial ranking loss:

Theorem 7 The surrogate loss $\Psi$ given by Eq. (4) is multi-label consistent w.r.t. partial ranking loss if $\phi\colon \mathbb{R} \to \mathbb{R}$ is a differentiable and non-increasing function such that

    $\phi'(0) < 0$  and  $\phi(x) + \phi(-x) \equiv 2\phi(0)$,    (7)

i.e., $\phi(x) + \phi(-x) = 2\phi(0)$ for every $x \in \mathbb{R}$.

The proof of Theorem 7 can be found in Section 6.3, and from this theorem we can get:

Corollary 8 The surrogate loss $\Psi$ given by Eq. (4) is multi-label consistent w.r.t. partial ranking loss if $\phi(x) = -\arctan(x)$ or $\phi(x) = \dfrac{1 - e^{2x}}{1 + e^{2x}}$.

Notice that Theorem 7 cannot be applied directly to $\phi(x) = -c x^{2k+1}$ for a constant $c > 0$ and integer $k \ge 0$, since such a setting yields a surrogate loss $\Psi$ that is not below-bounded. This problem, however, can be solved by introducing a regularization term. With a proof similar to that of Theorem 7, we get:

Theorem 9 The following surrogate loss is multi-label consistent w.r.t. partial ranking loss:

    $\Psi(f(X), Y) = a_Y \sum_{y_i < y_j} \phi(f_j(X) - f_i(X)) + \tau\, \Upsilon(f(X))$,

where $\tau > 0$, $\phi(x) = -c x^{2k+1}$ for some constant $c > 0$ and integer $k \ge 0$, and $\Upsilon$ is symmetric, that is, $\Upsilon(\dots, f_i(X), \dots, f_j(X), \dots) = \Upsilon(\dots, f_j(X), \dots, f_i(X), \dots)$.

For example, we can easily construct the following convex surrogate loss, which is multi-label consistent w.r.t. partial ranking loss:

    $\Psi(f(X), Y) = a_Y \sum_{y_i < y_j} \big( f_i(X) - f_j(X) \big) + \tau \sum_{i=1}^{Q} f_i^2(X)$.

It is worth noting that this does not mean that every convex surrogate loss $\Psi$ given by Eq. (4) is consistent w.r.t. partial ranking loss.
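To make the comparison concrete, the sketch below (assuming NumPy; the names, the default $a_Y = 1$, and the choice $\tau = 0.1$ are illustrative assumptions, not values from the paper) implements the partial ranking loss of Eq. (5) and a convex surrogate in the spirit of Theorem 9 with $\phi(x) = -x$ (i.e., $c = 1$, $k = 0$) plus the quadratic symmetric regularizer:

```python
import numpy as np

def partial_ranking_loss(scores, y, a_y=1.0):
    """Partial ranking loss of Eq. (5): a reversed pair costs a_Y, a tie costs a_Y / 2."""
    scores, y = np.asarray(scores, dtype=float), np.asarray(y)
    neg, pos = np.flatnonzero(y == -1), np.flatnonzero(y == +1)
    loss = 0.0
    for i in neg:
        for j in pos:
            if scores[i] > scores[j]:
                loss += a_y
            elif scores[i] == scores[j]:
                loss += 0.5 * a_y
    return loss

def theorem9_surrogate(scores, y, a_y=1.0, tau=0.1):
    """Convex surrogate in the spirit of Theorem 9: phi(x) = -x applied to
    f_j - f_i over pairs with y_i < y_j, plus the symmetric regularizer
    tau * sum_i f_i(X)^2 that restores below-boundedness."""
    scores, y = np.asarray(scores, dtype=float), np.asarray(y)
    neg, pos = np.flatnonzero(y == -1), np.flatnonzero(y == +1)
    pairwise = sum(scores[i] - scores[j] for i in neg for j in pos)
    return float(a_y * pairwise + tau * np.sum(scores ** 2))
```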

In fact, the following theorem proves that many non-linear surrogate losses are inconsistent w.r.t. partial ranking loss.

Theorem 10 If $\phi\colon \mathbb{R} \to \mathbb{R}$ is a convex, differentiable, non-linear and non-increasing function, the surrogate loss $\Psi$ given by Eq. (4) is not multi-label consistent w.r.t. partial ranking loss.

The proof is deferred to Section 6.4. The following corollary shows that some state-of-the-art multi-label learning approaches (Schapire and Singer, 2000, Dekel et al., 2004, Zhang and Zhou, 2006) are not even multi-label consistent w.r.t. partial ranking loss.

Corollary 11 If $\phi(x) = e^{-x}$ or $\phi(x) = \ln(1 + \exp(-x))$, the surrogate loss $\Psi$ given by Eq. (4) is not multi-label consistent w.r.t. partial ranking loss.

In summary, ranking loss has been suggested as a standard since the early work of Schapire and Singer (2000), and many state-of-the-art multi-label learning approaches work with the formulation given by Eq. (4). However, our analysis shows that no convex surrogate loss function is consistent w.r.t. ranking loss. Thus, ranking loss might not be a good loss function and evaluation criterion for multi-label learning. The partial ranking loss is more reasonable than ranking loss, since it enables many, though not all, convex surrogate loss functions to be consistent. In future work it would be interesting to develop new multi-label learning approaches based on minimizing the partial ranking loss, and to design other loss functions with better consistency properties.

5 Consistency w.r.t. Hamming Loss

The hamming loss concerns how many instance-label pairs are misclassified. For a given vector function $f$ and prediction function $F$, the hamming loss is given by

    $L_{\mathrm{hamloss}}(F(f(X)), Y) = \dfrac{1}{Q} \sum_{i=1}^{Q} I[\hat{y}_i \neq y_i]$,

where $\hat{Y} = F(f(X)) = (\hat{y}_1, \hat{y}_2, \dots, \hat{y}_Q)$. Hamming loss is obviously below-bounded and interval since $0 \le L_{\mathrm{hamloss}}(F(f(X)), Y) \le 1$, and for every $X, X' \in \mathcal{X}$ and $Y, Y' \in \mathcal{Y}$ it holds that either $L_{\mathrm{hamloss}}(F(f(X)), Y) = L_{\mathrm{hamloss}}(F(f(X')), Y')$ or $|L_{\mathrm{hamloss}}(F(f(X)), Y) - L_{\mathrm{hamloss}}(F(f(X')), Y')| \ge 1/Q$. We have the conditional risk

    $\ell(p, F(f(X))) = \sum_{Y \in \mathcal{Y}} p_Y L_{\mathrm{hamloss}}(\hat{Y}, Y) = \dfrac{1}{Q} \sum_{Y \in \mathcal{Y}} p_Y \sum_{i=1}^{Q} I[\hat{y}_i \neq y_i] = \dfrac{1}{Q} \sum_{i=1}^{Q} \Big( \sum_{Y \colon y_i = +1} p_Y\, I[\hat{y}_i \neq +1] + \sum_{Y \colon y_i = -1} p_Y\, I[\hat{y}_i \neq -1] \Big)$,

and the set of Bayes predictors with regard to hamming loss:

    $A(p) = \Big\{ f = f(X) \colon \hat{Y} = F(f) \text{ with } \hat{y}_i = \mathrm{sgn}\Big( \sum_{Y \colon y_i = +1} p_Y - \dfrac{1}{2} \Big) \Big\}$.    (8)

A straightforward multi-label learning approach is to regard each subset of labels as a new class and then try to learn $2^Q$ functions, i.e., $f = (f_Y)_{Y \in \mathcal{Y}}$. The prediction function is given by

    $F(f(X)) = \arg\max_{Y \in \mathcal{Y}} f_Y(X)$.    (9)

This approach can be viewed as a direct extension of the one-vs-all strategy for multi-class learning. We consider the following formulation:

    $\Psi(f(X), Y) = \max_{\hat{Y} \in \mathcal{Y}} \phi\big( \delta(\hat{Y}, Y) + f_{\hat{Y}}(X) - f_Y(X) \big)$,    (10)

where $\phi(x)$ and $\delta(\hat{Y}, Y)$ are set as $\phi(x) = \max(0, x)$ and $\delta(\hat{Y}, Y) = \sum_{i=1}^{Q} I[y_i \neq \hat{y}_i]$, respectively, by Taskar et al. (2004) and Hariharan et al. (2010). Before further discussion, we divide multi-label classification tasks into two categories, deterministic and non-deterministic, as follows:

Definition 12 In multi-label classification, if for every instance $X \in \mathcal{X}$ there exists a label $Y \in \mathcal{Y}$ such that $\Pr(Y \mid X) > 0.5$, the task is deterministic, and non-deterministic otherwise.

Consistency for the deterministic case is easier than for the non-deterministic case. For example, many formulations of SVMs are inconsistent for non-deterministic multi-class classification (Zhang, 2004a, Tewari and Bartlett, 2007), but consistent for the deterministic case, as indicated by Zhang (2004a).
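Before analysing these two choices of $\delta$, a small sketch of the surrogate in Eq. (10) may help (assuming NumPy; the function and variable names are illustrative, not from the paper). The two margin terms below are exactly the choices contrasted in the remainder of this section:

```python
import numpy as np

def hamming_loss(Y_hat, Y):
    """Hamming loss: fraction of the Q label positions predicted with the wrong sign."""
    return float(np.mean(np.asarray(Y_hat) != np.asarray(Y)))

def structured_surrogate(f_scores, Y, delta):
    """Surrogate of Eq. (10) with phi(x) = max(0, x): a margin-rescaled maximum over
    all candidate label vectors Y_hat; f_scores maps tuples in {-1,+1}^Q to f_Y(X)."""
    f_Y = f_scores[Y]
    return max(max(0.0, delta(Y_hat, Y) + s - f_Y) for Y_hat, s in f_scores.items())

def delta_hamming(Y_hat, Y):    # margin term of Taskar et al. / Hariharan et al.
    return float(sum(a != b for a, b in zip(Y_hat, Y)))

def delta_subset(Y_hat, Y):     # the alternative margin term I[Y_hat != Y]
    return float(Y_hat != Y)
```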

The following lemma shows that the approaches of (Taskar et al., 2004, Hariharan et al., 2010) are not multi-label consistent w.r.t. hamming loss, even for the deterministic case.

Lemma 13 For deterministic multi-label classification, the surrogate loss $\Psi$ given by Eq. (10) with $\delta(\hat{Y}, Y) = \sum_{i=1}^{Q} I[y_i \neq \hat{y}_i]$ is not multi-label consistent w.r.t. hamming loss.

However, if we instead choose $\delta(\hat{Y}, Y) = I[\hat{Y} \neq Y]$, the following theorem guarantees that the surrogate loss is multi-label consistent w.r.t. hamming loss, at least for the deterministic case.

Theorem 14 For deterministic multi-label classification, the surrogate loss $\Psi$ given by Eq. (10) with $\delta(\hat{Y}, Y) = I[Y \neq \hat{Y}]$ is multi-label consistent w.r.t. hamming loss.

The proofs of Lemma 13 and Theorem 14 can be found in Sections 6.5 and 6.6, respectively.

Alternatively, it is possible to transform a multi-label learning task into $Q$ independent binary classification tasks (Boutell et al., 2004). Now the goal is to learn $Q$ functions $f = (f_1, f_2, \dots, f_Q)$, and the prediction function is given by

    $F(f(X)) = (\mathrm{sgn}[f_1(X)], \mathrm{sgn}[f_2(X)], \dots, \mathrm{sgn}[f_Q(X)])$.

A common choice for the surrogate loss is

    $\Psi(f(X), Y) = \sum_{i=1}^{Q} \phi(y_i f_i(X))$,    (11)

where $\phi$ is a convex function, chosen as the hinge loss $\phi(t) = (1 - t)_+$ in (Elisseeff and Weston, 2002) and the exponential loss $\phi(t) = \exp(-t)$ in (Schapire and Singer, 2000). We have

    $W(p, f) = \sum_{Y \in \mathcal{Y}} p_Y \Psi(f(X), Y) = \sum_{i=1}^{Q} \sum_{Y \in \mathcal{Y}} p_Y \phi(y_i f_i(X)) = \sum_{i=1}^{Q} \big( p_i^+ \phi(f_i(X)) + (1 - p_i^+) \phi(-f_i(X)) \big)$,

where $p_i^+ = \sum_{Y \colon y_i = +1} p_Y$ and $1 - p_i^+ = \sum_{Y \colon y_i = -1} p_Y$. For simplicity, we denote $W_i(p_i^+, f_i) = p_i^+ \phi(f_i) + (1 - p_i^+) \phi(-f_i)$. This yields that minimizing $W(p, f)$ is equivalent to minimizing $W_i(p_i^+, f_i)$ for every $1 \le i \le Q$, i.e.,

    $W^*(p) = \inf_f W(p, f) = \sum_{i=1}^{Q} \inf_{f_i} W_i(p_i^+, f_i)$.

The consistency of binary classification has been well studied (Zhang, 2004b, Bartlett et al., 2006), and based on the work of Bartlett et al. (2006) we can easily get:

Theorem 15 The surrogate loss $\Psi$ given by Eq. (11) is consistent w.r.t. hamming loss for any convex function $\phi$ with $\phi'(0) < 0$.

It is evident from this theorem that the surrogate loss $\Psi$ given by Eq. (11) is consistent w.r.t. hamming loss if $\phi$ is any of the following (a small illustrative sketch is given at the end of this section):

Exponential: $\phi(x) = e^{-x}$;
Hinge: $\phi(x) = \max(0, 1 - x)$;
Least squares: $\phi(x) = (1 - x)^2$;
Logistic regression: $\phi(x) = \ln(1 + \exp(-x))$.

Notice that in this paper we do not consider how to learn the real-valued functions for small data or a large number of labels, where many labels or label subsets lack enough training examples; this is a challenging task and the exploitation of label correlations is therefore needed. In our analysis, we assume sufficient training data and ignore label correlations. Moreover, we do not consider how to decide the number of relevant labels for ranking loss and partial ranking loss, although in practice this is very challenging. These are important future issues.
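As promised above, here is a minimal sketch of the binary decomposition of Eq. (11) with the logistic choice of $\phi$ (assuming NumPy; the function names are illustrative, not from the paper):

```python
import numpy as np

def binary_relevance_surrogate(scores, y):
    """Surrogate of Eq. (11) with the logistic loss phi(t) = ln(1 + exp(-t)),
    one of the convex choices with phi'(0) < 0 covered by Theorem 15."""
    scores, y = np.asarray(scores, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.logaddexp(0.0, -y * scores)))

def predict(scores):
    """Prediction function F(f(X)) = (sgn f_1(X), ..., sgn f_Q(X))."""
    return np.where(np.asarray(scores, dtype=float) >= 0, +1, -1)
```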

6 Proofs

6.1 Proof of Theorem 3

We first introduce some useful lemmas:

Lemma 16 $W^*(p)$ is continuous on $\Lambda$.

Proof: From the Heine definition of continuity, we need to show that $W^*(p^{(n)}) \to W^*(p)$ for any sequence $p^{(n)} \to p$. Let $B_r$ be a closed ball with radius $r$ in $\mathbb{R}^K$. Since $\mathcal{Y}$ is finite, we have $\sum_{Y \in \mathcal{Y}} p_Y^{(n)} \Psi(f, Y) \to \sum_{Y \in \mathcal{Y}} p_Y \Psi(f, Y)$ uniformly for every $f \in B_r$ and sequence $p^{(n)} \to p$, leading to

    $\inf_{f \in B_r} \sum_{Y \in \mathcal{Y}} p_Y^{(n)} \Psi(f, Y) \to \inf_{f \in B_r} \sum_{Y \in \mathcal{Y}} p_Y \Psi(f, Y)$.

From the fact that $W^*(p^{(n)}) \le \inf_{f \in B_r} \sum_{Y \in \mathcal{Y}} p_Y^{(n)} \Psi(f, Y)$, by letting $r \to \infty$, we have

    $\limsup_n W^*(p^{(n)}) \le W^*(p)$.    (12)

Denote $\mathcal{Y}' = \{Y \in \mathcal{Y} \colon p_Y > 0\}$ and assume $\Psi(\cdot, \cdot) \ge C$ for some constant $C$ (since $\Psi$ is below-bounded). We have

    $W^*(p^{(n)}) \ge \inf_f \sum_{Y \in \mathcal{Y}'} p_Y^{(n)} \Psi(f, Y) + C \sum_{Y \in \mathcal{Y} \setminus \mathcal{Y}'} p_Y^{(n)}$,

yielding

    $\liminf_n W^*(p^{(n)}) \ge \liminf_n \Big( \inf_f \sum_{Y \in \mathcal{Y}'} p_Y^{(n)} \Psi(f, Y) + C \sum_{Y \in \mathcal{Y} \setminus \mathcal{Y}'} p_Y^{(n)} \Big) = W^*(p)$,

which completes the proof by combining with Eq. (12).

Lemma 17 If the surrogate loss function $\Psi$ is multi-label consistent w.r.t. the loss function $L$, then for any $\epsilon > 0$ there exists $\delta > 0$ such that, for every $p \in \Lambda$, $\ell(p, f) \ge \inf_f \ell(p, f) + \epsilon$ implies $W(p, f) \ge W^*(p) + \delta$.

Proof: We proceed by contradiction. Suppose $\Psi$ is multi-label consistent and there exist $\epsilon > 0$ and a sequence $(p^{(n)}, f^{(n)})$ such that $\ell(p^{(n)}, f^{(n)}) \ge \inf_f \ell(p^{(n)}, f) + \epsilon$ and $W(p^{(n)}, f^{(n)}) - W^*(p^{(n)}) \to 0$. From the compactness of $\Lambda$, there exists a convergent subsequence $p^{(n_k)} \to p$ for some $p \in \Lambda$. From Lemma 16, we have $W^*(p^{(n_k)}) \to W^*(p)$. Similar to the proof of Lemma 16, denoting $\mathcal{Y}' = \{Y \in \mathcal{Y} \colon p_Y > 0\}$, we have

    $\limsup_{n_k} W(p, f^{(n_k)}) \le \limsup_{n_k} \Big( C \sum_{Y \in \mathcal{Y} \setminus \mathcal{Y}'} p_Y^{(n_k)} + \sum_{Y \in \mathcal{Y}'} p_Y^{(n_k)} \Psi(f^{(n_k)}, Y) \Big) \le \lim_{n_k} W(p^{(n_k)}, f^{(n_k)}) = W^*(p)$.

This gives $W(p, f^{(n_k)}) \to W^*(p)$ from the definition of $W^*(p)$. Since $\Psi$ is multi-label consistent, there exists a subsequence $(n_{k_i})$ satisfying $\ell(p, f^{(n_{k_i})}) \to \inf_f \ell(p, f)$, which contradicts the assumption $\ell(p^{(n)}, f^{(n)}) \ge \inf_f \ell(p^{(n)}, f) + \epsilon$, and thus the lemma holds.

Proof of Theorem 3: ($\Rightarrow$) We first introduce the notation

    $H(\epsilon) = \inf_{p \in \Lambda,\, f} \{ W(p, f) - \inf_f W(p, f) \colon \ell(p, f) - \inf_f \ell(p, f) \ge \epsilon \}$.

It is obvious that $H(0) = 0$, and $H(\epsilon) > 0$ for $\epsilon > 0$ from Lemma 17. Corollary 26 of (Zhang, 2004a) guarantees that there exists a concave function $\eta$ on $[0, \infty)$ such that $\eta(0) = 0$, $\eta(\epsilon) \to 0$ as $\epsilon \to 0$, and

    $R(f) - R^* \le \eta\big( R_\Psi(f) - R_\Psi^* \big)$.

Thus, if $R_\Psi(f) \to R_\Psi^*$ then $R(f) \to R^*$.

($\Leftarrow$) We proceed by contradiction. Suppose $\Psi$ is not multi-label consistent; thus there exists some $p$ such that $W^*(p) = \inf\{ W(p, f) \colon f \notin A(p) \}$. Let $f^{(n)} \notin A(p)$ be a sequence such

that $W(p, f^{(n)}) \to W^*(p)$. For simplicity, we consider $\mathcal{X} = \{x\}$, i.e., only one instance, and set $f_n(x) = f^{(n)}$. Then $R_\Psi(f_n) = W(p, f^{(n)}) \to W^*(p) = R_\Psi^*$, yielding $\ell(p, f^{(n)}) \to \inf_f \ell(p, f)$, which is contrary to $\ell(p, f^{(n)}) \ge \inf_f \ell(p, f) + \gamma(p)$, where

    $\gamma(p) = \inf_{f \notin A(p)} \ell(p, f) - \inf_f \ell(p, f) > 0$,

since $f^{(n)} \notin A(p)$ and $L$ is interval. Thus we complete the proof.

6.2 Proof of Theorem 6

We proceed by contradiction. Suppose that the surrogate loss $\Psi$ is multi-label consistent with ranking loss. We have

    $W(p, f) = \sum_{Y \in \mathcal{Y}} p_Y \Psi(f, Y) = \sum_{Y \in \mathcal{Y}} p_Y a_Y \sum_{y_i < y_j} \phi(f_j - f_i) = \sum_{1 \le i < j \le Q} \big( \phi(f_j - f_i)\, \Delta_{i,j} + \phi(f_i - f_j)\, \Delta_{j,i} \big)$.

Consider a probability vector $p = (p_Y)_{Y \in \mathcal{Y}}$ and penalty vector $(a_Y)_{Y \in \mathcal{Y}}$ such that $p_{Y_1} = p_{Y_2}$ and $a_{Y_1} = a_{Y_2}$ for every $Y_1 \neq Y_2$, $Y_1, Y_2 \in \mathcal{Y}$. This yields $\Delta_{i,j} = \Delta_{m,n}$ for all $1 \le i \neq j \le Q$ and $1 \le m \neq n \le Q$, and thus we get

    $W(p, f) = \Delta_{1,2} \sum_{1 \le i < j \le Q} \big( \phi(f_j - f_i) + \phi(f_i - f_j) \big)$.

From the convexity of $\phi$, minimizing $W(p, f)$ gives

    $W^*(p) = W(p, \hat{f}) = \Delta_{1,2} \sum_{1 \le i < j \le Q} 2\phi(0) = Q(Q-1)\,\phi(0)\, \Delta_{1,2}$,

where $\hat{f}$ satisfies $\hat{f}_1 = \hat{f}_2 = \dots = \hat{f}_Q$. Notice that $\hat{f} \notin A(p)$ from Lemma 5, and we have

    $W^*(p) = \inf\{ W(p, f) \colon f \notin A(p) \}$,

which completes the proof.

6.3 Proof of Theorem 7

For any $p \in \Lambda$ and non-negative vector $(a_Y)_{Y \in \mathcal{Y}}$, we will prove that $f_i > f_j$ if $\Delta_{i,j} < \Delta_{j,i}$ and $f_i < f_j$ if $\Delta_{i,j} > \Delta_{j,i}$, for every $f$ satisfying $W^*(p) = W(p, f)$. Without loss of generality, it is sufficient to prove that $f_1 > f_2$ if $\Delta_{1,2} < \Delta_{2,1}$. We proceed by contradiction, and assume that there exists a vector $f$ such that $f_1 \le f_2$ and $W^*(p) = W(p, f)$.

For the case $f_1 < f_2$, we introduce another vector $f'$ with $f'_1 = f_2$, $f'_2 = f_1$ and $f'_k = f_k$ for $k \neq 1, 2$. From the definition of the conditional surrogate risk, we have

    $W(p, f) = \sum_{1 \le i < j \le Q} \big( \phi(f_j - f_i)\, \Delta_{i,j} + \phi(f_i - f_j)\, \Delta_{j,i} \big)$,

which yields

    $W(p, f) - W(p, f') = (\Delta_{1,2} - \Delta_{2,1})\big( \phi(f_2 - f_1) - \phi(f_1 - f_2) \big) + \sum_{i=3}^{Q} (\Delta_{1,i} - \Delta_{2,i})\big( \phi(f_i - f_1) - \phi(f_i - f_2) \big) + \sum_{i=3}^{Q} (\Delta_{i,1} - \Delta_{i,2})\big( \phi(f_1 - f_i) - \phi(f_2 - f_i) \big)$.

From Property 4 of Lemma 4, we have

    $\Delta_{1,i} - \Delta_{2,i} - \Delta_{i,1} + \Delta_{i,2} = \Delta_{1,2} - \Delta_{2,1}$.    (13)

It follows that

    $W(p, f) - W(p, f') = (\Delta_{1,2} - \Delta_{2,1}) \Big( \phi(f_2 - f_1) - \phi(f_1 - f_2) + \sum_{i=3}^{Q} \big( \phi(f_i - f_1) - \phi(f_i - f_2) \big) \Big) + \sum_{i=3}^{Q} (\Delta_{i,1} - \Delta_{i,2})\big( \phi(f_1 - f_i) + \phi(f_i - f_1) - \phi(f_i - f_2) - \phi(f_2 - f_i) \big)$
    $= (\Delta_{1,2} - \Delta_{2,1})\big( \phi(f_2 - f_1) - \phi(f_1 - f_2) \big) + (\Delta_{1,2} - \Delta_{2,1}) \sum_{i=3}^{Q} \big( \phi(f_i - f_1) - \phi(f_i - f_2) \big)$,

where the last equality holds by the condition $\phi(x) + \phi(-x) \equiv 2\phi(0)$. For a non-increasing function $\phi$ with $\phi'(0) < 0$, we have $\phi(z) < \phi(-z)$ for all $z > 0$, and $\phi(f_i - f_1) \ge \phi(f_i - f_2)$. From the condition $\Delta_{1,2} < \Delta_{2,1}$ we get $W(p, f) > W(p, f')$, which is contrary to $W(p, f) = W^*(p)$.

We now consider the case $f_1 = f_2$. For the optimal solution, we have the first-order condition $\partial W(p, f)/\partial f_i = 0$ for $i = 1, 2$:

    $\sum_{i \neq 1} \phi'(f_1 - f_i)\, \Delta_{i,1} = \sum_{i \neq 1} \phi'(f_i - f_1)\, \Delta_{1,i}$  and  $\sum_{i \neq 2} \phi'(f_i - f_2)\, \Delta_{2,i} = \sum_{i \neq 2} \phi'(f_2 - f_i)\, \Delta_{i,2}$.

By combining Eq. (13), $f_1 = f_2$ and $\phi'(x) = \phi'(-x)$ from Eq. (7), we have

    $(\Delta_{2,1} - \Delta_{1,2}) \Big( 2\phi'(0) + \sum_{i \neq 1,2} \phi'(f_1 - f_i) \Big) = 0$,

which is impossible since $\Delta_{1,2} < \Delta_{2,1}$ and $2\phi'(0) + \sum_{i \neq 1,2} \phi'(f_1 - f_i) \le 2\phi'(0) < 0$. Thus, we complete the proof.

6.4 Proof of Theorem 10

For a convex function $\phi$ we have $\phi'(x) \le \phi'(y)$ for every $x \le y$ from (Rockafellar, 1997), and the derivative $\phi'(x)$ is continuous on $\mathbb{R}$ if $\phi$ is differentiable and convex. Since $\phi$ is non-increasing, we have $\phi'(x) \le 0$ for all $x \in \mathbb{R}$, and without loss of generality, we assume $\phi'(x) < 0$.

We proceed by contradiction. Assume the surrogate loss $\Psi$ is multi-label consistent with partial ranking loss for some non-linear function $\phi$. Then, from the continuity of $\phi'(x)$, there exists an interval $(c, d)$ with $c < d < 0$ or $0 < c < d$ such that $\phi'(x) < \phi'(y)$ for every $x < y$, $x, y \in (c, d)$. In the following, we focus on the case $0 < c < d$; similar considerations apply to the case $c < d < 0$.

We first fix $a \in (c, d)$ and introduce a new function

    $G(x) = \big( \phi'(x - a) - \phi'(a - x) \big)\big( \phi'(a) + \phi'(x) \big) + \phi'(x)\big( \phi'(-a) - \phi'(a) \big)$.

It is easy to find that $G(a) = \phi'(a)\big( \phi'(-a) - \phi'(a) \big) > 0$ and that $G(x)$ is continuous. Thus there exists $b > a$ with $b \in (c, d)$ such that

    $G(b) = \big( \phi'(b - a) - \phi'(a - b) \big)\big( \phi'(a) + \phi'(b) \big) + \phi'(b)\big( \phi'(-a) - \phi'(a) \big) > 0$,

which gives

    $\dfrac{\phi'(b - a)\big(\phi'(a) + \phi'(b)\big) + \phi'(b)\,\phi'(-a)}{\phi'(a - b)\big(\phi'(a) + \phi'(b)\big) + \phi'(b)\,\phi'(a)} > 1$.    (14)

Moreover, from $\phi'(a) < \phi'(b) < 0$ and $\phi'(-b) \le \phi'(-a) < 0$, we have

    $0 < \dfrac{\phi'(-a)}{\phi'(a)} \le \dfrac{\phi'(-b)}{\phi'(a)} < \dfrac{\phi'(-b)}{\phi'(b)}$.    (15)

We consider the following multi-label classification problem with $Q = 3$ labels: $Y_1 = (-1, +1, +1)$, $Y_2 = (+1, -1, -1)$, $Y_3 = (+1, +1, -1)$, $Y_4 = (-1, -1, +1)$. Let $f = (f_1, f_2, f_3)$ be such that $a = f_3 - f_1$ and $b = f_3 - f_2$, and thus $f_1 > f_2$. For every probability vector $p = (p_{Y_1}, p_{Y_2}, p_{Y_3}, p_{Y_4}) \in \Lambda$ and every penalty vector $(a_{Y_1}, a_{Y_2}, a_{Y_3}, a_{Y_4})$, we have

    $W(p, f) = \Delta_{1,2}\, \phi(f_2 - f_1) + \Delta_{2,1}\, \phi(f_1 - f_2) + \Delta_{1,3}\, \phi(f_3 - f_1) + \Delta_{3,1}\, \phi(f_1 - f_3) + \Delta_{2,3}\, \phi(f_3 - f_2) + \Delta_{3,2}\, \phi(f_2 - f_3)$,

where $\Delta_{1,2} = P_1$, $\Delta_{2,1} = P_2$, $\Delta_{1,3} = P_1 + P_4$, $\Delta_{3,1} = P_2 + P_3$, $\Delta_{2,3} = P_4$ and $\Delta_{3,2} = P_3$, with $P_i = p_{Y_i} a_{Y_i}$ for $i = 1, 2, 3, 4$. In the following, we will construct some $\bar{p}$ and $\bar{a}$ such that $W^*(\bar{p}) = W(\bar{p}, f)$ and $\Delta_{1,2} > \Delta_{2,1}$. The subgradient condition for optimality of $W(p, f)$ gives

    $\partial W(p, f)/\partial f_1 = -P_1 \phi'(a - b) + P_2 \phi'(b - a) - (P_1 + P_4)\,\phi'(a) + (P_2 + P_3)\,\phi'(-a) = 0$,
    $\partial W(p, f)/\partial f_2 = P_1 \phi'(a - b) - P_2 \phi'(b - a) - P_4 \phi'(b) + P_3 \phi'(-b) = 0$,
    $\partial W(p, f)/\partial f_3 = (P_1 + P_4)\,\phi'(a) - (P_2 + P_3)\,\phi'(-a) + P_4 \phi'(b) - P_3 \phi'(-b) = 0$,

[Figure 1: Lines $l_1$ (solid) and $l_2$ (dashed) in the $(P_3, P_4)$ plane, corresponding to Eqs. (16) and (17), respectively.]

which is equivalent to

    $P_1 \phi'(a - b) - P_2 \phi'(b - a) = P_4 \phi'(b) - P_3 \phi'(-b)$,    (16)
    $-P_1 \phi'(a) + P_2 \phi'(-a) = P_4\big(\phi'(a) + \phi'(b)\big) - P_3\big(\phi'(-a) + \phi'(-b)\big)$.    (17)

From Lemma 18, there exist $\bar{p} = (\bar{p}_{Y_1}, \bar{p}_{Y_2}, \bar{p}_{Y_3}, \bar{p}_{Y_4})$ and $(\bar{a}_{Y_1}, \bar{a}_{Y_2}, \bar{a}_{Y_3}, \bar{a}_{Y_4})$ satisfying Eqs. (16), (17) and $P_1 > P_2$. Hence $W(\bar{p}, f) = W^*(\bar{p})$. Notice that $f \notin A(\bar{p})$ from Eq. (6), since $\Delta_{1,2} = P_1 > P_2 = \Delta_{2,1}$ and $f_1 > f_2$; we thus have

    $W^*(\bar{p}) = \inf\{ W(\bar{p}, f) \colon f \notin A(\bar{p}) \}$,

which completes the proof.

Lemma 18 There exist some $P_1 > P_2 > 0$, $P_3 > 0$ and $P_4 > 0$ satisfying Eqs. (16) and (17) if Eqs. (14) and (15) hold for some $0 < a < b$.

Proof: From Eq. (14), we set

    $1 < \dfrac{P_1}{P_2} < \dfrac{\phi'(b - a)\big(\phi'(a) + \phi'(b)\big) + \phi'(b)\,\phi'(-a)}{\phi'(a - b)\big(\phi'(a) + \phi'(b)\big) + \phi'(b)\,\phi'(a)}$.    (18)

For $a < b$, we have $\phi'(a - b) \le \phi'(b - a) < 0$, yielding $P_1 > P_2 \ge P_2\, \phi'(b - a)/\phi'(a - b)$, which gives $P_1 \phi'(a - b) - P_2 \phi'(b - a) < 0$. Thus, Eq. (16) corresponds to the line $l_1$ in Figure 1. From Eq. (15), we further obtain

    $0 < \dfrac{\phi'(-a) + \phi'(-b)}{\phi'(a) + \phi'(b)} < \dfrac{\phi'(-b)}{\phi'(b)}$.

To guarantee the existence of $P_3 > 0$ and $P_4 > 0$ satisfying Eqs. (16) and (17), as shown in Figure 1, we need

    $\dfrac{P_1 \phi'(a - b) - P_2 \phi'(b - a)}{\phi'(b)} < \dfrac{-P_1 \phi'(a) + P_2 \phi'(-a)}{\phi'(a) + \phi'(b)}$.

This holds from Eq. (18). Thus, we complete the proof.

6.5 Proof of Lemma 13

We consider a multi-label classification task with $Q = 2$ labels, i.e., $\mathcal{L} = \{1, 2\}$, and let $Y_1 = (-1, -1)$, $Y_2 = (-1, +1)$, $Y_3 = (+1, +1)$ and $Y_4 = (+1, -1)$. Suppose $p_{Y_4} = 0$ and $p_{Y_2} + p_{Y_3} < p_{Y_1} < 2p_{Y_2} + p_{Y_3}$. It is evident that $p_{Y_1} > 0.5$ and thus this multi-label classification task is deterministic. It is necessary to consider $f = (f_{Y_1}, f_{Y_2}, f_{Y_3})$. By combining Eqs. (8) and (10), we get

    $A(p) = \{ f \colon f_{Y_1} > f_{Y_2} \text{ and } f_{Y_1} > f_{Y_3} \}$.

We also have

    $W(p, f) = p_{Y_1} \max\{ \phi(1 + f_{Y_2} - f_{Y_1}),\ \phi(2 + f_{Y_3} - f_{Y_1}) \} + p_{Y_2} \max\{ \phi(1 + f_{Y_1} - f_{Y_2}),\ \phi(1 + f_{Y_3} - f_{Y_2}) \} + p_{Y_3} \max\{ \phi(2 + f_{Y_1} - f_{Y_3}),\ \phi(1 + f_{Y_2} - f_{Y_3}) \}$,

and

    $W^*(p) = W(p, f^*) = \inf_f W(p, f) = 2p_{Y_1} + 2p_{Y_3}$,

where $f^* = (f^*_{Y_1}, f^*_{Y_2}, f^*_{Y_3})$ satisfies $f^*_{Y_1} = f^*_{Y_3}$ and $f^*_{Y_1} = f^*_{Y_2} - 1$. Hence $f^* \notin A(p)$ and $\Psi$ is not multi-label consistent.

6.6 Proof of Theorem 14

We first introduce a useful lemma:

Lemma 19 For the surrogate loss $\Psi$ given by Eq. (10) with $\delta(\hat{Y}, Y) = I[\hat{Y} \neq Y]$, and for every $p \in \Lambda$ and $f$ such that $W(p, f) = W^*(p)$, we have $f_{Y_1} \ge f_{Y_2}$ if $p_{Y_1} > p_{Y_2}$.

Proof: We proceed by contradiction. Assume there exist some $p \in \Lambda$ and $f$ such that $f_{Y_1} < f_{Y_2}$, $p_{Y_1} > p_{Y_2}$ and $W^*(p) = W(p, f)$. We define a new real-valued vector $f'$ such that $f'_{Y_1} = f_{Y_2}$, $f'_{Y_2} = f_{Y_1}$ and $f'_{Y_i} = f_{Y_i}$ for $i \neq 1, 2$. It follows that

    $W(p, f) - W(p, f') = (p_{Y_1} - p_{Y_2}) \Big( \max_{Y \neq Y_1} \phi\big(1 + f_Y(X) - f_{Y_1}(X)\big) - \max_{Y \neq Y_2} \phi\big(1 + f_Y(X) - f_{Y_2}(X)\big) \Big)$.

From the assumption $f_{Y_1} < f_{Y_2}$, we have

    $\max_{Y \neq Y_1} \phi\big(1 + f_Y(X) - f_{Y_1}(X)\big) > \max_{Y \neq Y_2} \phi\big(1 + f_Y(X) - f_{Y_2}(X)\big)$,

leading to $W(p, f) - W(p, f') > 0$, which is contrary to the assumption $W^*(p) = W(p, f)$. Thus, we complete the proof.

Proof of Theorem 14: Without loss of generality, we consider a probability vector $p = (p_Y)_{Y \in \mathcal{Y}}$ satisfying $p_{Y_1} > p_{Y_2} \ge p_{Y_3} \ge \dots \ge p_{Y_{2^Q}}$ and $p_{Y_1} > 0.5$. It is easy to get

    $A(p) = \{ f \colon f_{Y_1} > f_Y \text{ for } Y \neq Y_1 \}$

from Eqs. (8) and (9). On the other hand, for each $f$ satisfying $W(p, f) = W^*(p)$, we have $f_{Y_1} \ge f_{Y_2} \ge \dots \ge f_{Y_{2^Q}}$ from Lemma 19. Thus, we complete the proof from $p_{Y_1} > \sum_{Y \neq Y_1} p_Y$ and the fact that

    $W(p, f) = p_{Y_1} \phi\big(1 + f_{Y_2} - f_{Y_1}\big) + \sum_{Y \neq Y_1} p_Y \phi\big(1 + f_{Y_1} - f_Y\big)$.

Acknowledgements

The authors want to thank the anonymous reviewers for their helpful comments and suggestions, and Tong Zhang for reading a preliminary version. This research was supported by the National Fundamental Research Program of China (2010CB327903), the National Science Foundation of China ( ) and the Jiangsu Science Foundation (BK ).

References

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 2006.

M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9), 2004.

D. Cossock and T. Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11), 2008.

O. Dekel, C. Manning, and Y. Singer. Log-linear models for label ranking. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

K. Dembczyński, W. W. Cheng, and E. Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.

J. C. Duchi, L. W. Mackey, and M. I. Jordan. On the consistency of ranking algorithms. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.

A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.

N. Ghamrawi and A. McCallum. Collective multi-label classification. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, Bremen, Germany, 2005.

S. Godbole and S. Sarawagi. Discriminative methods for multi-labeled classification. In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 22-30, Sydney, Australia, 2004.

B. Hariharan, L. Zelnik-Manor, S. Vishwanathan, and M. Varma. Large scale max-margin multi-label classification with priors. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.

D. Hsu, S. M. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22. MIT Press, Cambridge, MA, 2009.

Y. Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6(3), 2002.

J. Petterson and T. Caetano. Reverse multi-label learning. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23. MIT Press, Cambridge, MA, 2010.

G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, and H.-J. Zhang. Correlative multi-label video annotation. In Proceedings of the 15th ACM International Conference on Multimedia, pages 17-26, Augsburg, Germany, 2007.

R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1997.

R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2), 2000.

I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1), 2005.

B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8, 2007.

F. Xia, T. Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: Theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008.

M.-L. Zhang and Z.-H. Zhou. Multi-label neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10), 2006.

M.-L. Zhang and Z.-H. Zhou. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7), 2007.

T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 2004a.

T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56-85, 2004b.

Z.-H. Zhou and M.-L. Zhang. Multi-instance multi-label learning with application to scene classification. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.


More information

Classification of effective GKM graphs with combinatorial type K 4

Classification of effective GKM graphs with combinatorial type K 4 Classiication o eective GKM graphs with combinatorial type K 4 Shintarô Kuroki Department o Applied Mathematics, Faculty o Science, Okayama Uniervsity o Science, 1-1 Ridai-cho Kita-ku, Okayama 700-0005,

More information

Lecture 3: Multiclass Classification

Lecture 3: Multiclass Classification Lecture 3: Multiclass Classification Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in

More information

(C) The rationals and the reals as linearly ordered sets. Contents. 1 The characterizing results

(C) The rationals and the reals as linearly ordered sets. Contents. 1 The characterizing results (C) The rationals and the reals as linearly ordered sets We know that both Q and R are something special. When we think about about either o these we usually view it as a ield, or at least some kind o

More information

4.1 Online Convex Optimization

4.1 Online Convex Optimization CS/CNS/EE 53: Advanced Topics in Machine Learning Topic: Online Convex Optimization and Online SVM Lecturer: Daniel Golovin Scribe: Xiaodi Hou Date: Jan 3, 4. Online Convex Optimization Definition 4..

More information

( x) f = where P and Q are polynomials.

( x) f = where P and Q are polynomials. 9.8 Graphing Rational Functions Lets begin with a deinition. Deinition: Rational Function A rational unction is a unction o the orm ( ) ( ) ( ) P where P and Q are polynomials. Q An eample o a simple rational

More information

Supplementary material for Continuous-action planning for discounted infinite-horizon nonlinear optimal control with Lipschitz values

Supplementary material for Continuous-action planning for discounted infinite-horizon nonlinear optimal control with Lipschitz values Supplementary material or Continuous-action planning or discounted ininite-horizon nonlinear optimal control with Lipschitz values List o main notations x, X, u, U state, state space, action, action space,

More information

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning

More information

Convex Calibration Dimension for Multiclass Loss Matrices

Convex Calibration Dimension for Multiclass Loss Matrices Journal of Machine Learning Research 7 (06) -45 Submitted 8/4; Revised 7/5; Published 3/6 Convex Calibration Dimension for Multiclass Loss Matrices Harish G. Ramaswamy Shivani Agarwal Department of Computer

More information

Statistical Properties of Large Margin Classifiers

Statistical Properties of Large Margin Classifiers Statistical Properties of Large Margin Classifiers Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Jordan, Jon McAuliffe, Ambuj Tewari. slides

More information

Cutting Plane Training of Structural SVM

Cutting Plane Training of Structural SVM Cutting Plane Training of Structural SVM Seth Neel University of Pennsylvania sethneel@wharton.upenn.edu September 28, 2017 Seth Neel (Penn) Short title September 28, 2017 1 / 33 Overview Structural SVMs

More information

The Margin Vector, Admissible Loss and Multi-class Margin-based Classifiers

The Margin Vector, Admissible Loss and Multi-class Margin-based Classifiers The Margin Vector, Admissible Loss and Multi-class Margin-based Classifiers Hui Zou University of Minnesota Ji Zhu University of Michigan Trevor Hastie Stanford University Abstract We propose a new framework

More information

CorrLog: Correlated Logistic Models for Joint Prediction of Multiple Labels

CorrLog: Correlated Logistic Models for Joint Prediction of Multiple Labels CorrLog: Correlated Logistic Models for Joint Prediction of Multiple Labels Wei Bian Bo Xie Dacheng Tao Georgia Tech Center for Music Technology, Georgia Institute of Technology bo.xie@gatech.edu Centre

More information

Feedback linearization control of systems with singularities: a ball-beam revisit

Feedback linearization control of systems with singularities: a ball-beam revisit Chapter 1 Feedback linearization control o systems with singularities: a ball-beam revisit Fu Zhang The Mathworks, Inc zhang@mathworks.com Benito Fernndez-Rodriguez Department o Mechanical Engineering,

More information

Numerical Methods - Lecture 2. Numerical Methods. Lecture 2. Analysis of errors in numerical methods

Numerical Methods - Lecture 2. Numerical Methods. Lecture 2. Analysis of errors in numerical methods Numerical Methods - Lecture 1 Numerical Methods Lecture. Analysis o errors in numerical methods Numerical Methods - Lecture Why represent numbers in loating point ormat? Eample 1. How a number 56.78 can

More information

Global Weak Solution of Planetary Geostrophic Equations with Inviscid Geostrophic Balance

Global Weak Solution of Planetary Geostrophic Equations with Inviscid Geostrophic Balance Global Weak Solution o Planetary Geostrophic Equations with Inviscid Geostrophic Balance Jian-Guo Liu 1, Roger Samelson 2, Cheng Wang 3 Communicated by R. Temam) Abstract. A reormulation o the planetary

More information

Sequential Supervised Learning

Sequential Supervised Learning Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given

More information

NON-AUTONOMOUS INHOMOGENEOUS BOUNDARY CAUCHY PROBLEMS AND RETARDED EQUATIONS. M. Filali and M. Moussi

NON-AUTONOMOUS INHOMOGENEOUS BOUNDARY CAUCHY PROBLEMS AND RETARDED EQUATIONS. M. Filali and M. Moussi Electronic Journal: Southwest Journal o Pure and Applied Mathematics Internet: http://rattler.cameron.edu/swjpam.html ISSN 83-464 Issue 2, December, 23, pp. 26 35. Submitted: December 24, 22. Published:

More information

Scattering of Solitons of Modified KdV Equation with Self-consistent Sources

Scattering of Solitons of Modified KdV Equation with Self-consistent Sources Commun. Theor. Phys. Beijing, China 49 8 pp. 89 84 c Chinese Physical Society Vol. 49, No. 4, April 5, 8 Scattering o Solitons o Modiied KdV Equation with Sel-consistent Sources ZHANG Da-Jun and WU Hua

More information

Simpler Functions for Decompositions

Simpler Functions for Decompositions Simpler Functions or Decompositions Bernd Steinbach Freiberg University o Mining and Technology, Institute o Computer Science, D-09596 Freiberg, Germany Abstract. This paper deals with the synthesis o

More information

Large-Margin Thresholded Ensembles for Ordinal Regression

Large-Margin Thresholded Ensembles for Ordinal Regression Large-Margin Thresholded Ensembles for Ordinal Regression Hsuan-Tien Lin and Ling Li Learning Systems Group, California Institute of Technology, U.S.A. Conf. on Algorithmic Learning Theory, October 9,

More information

Learning Binary Classifiers for Multi-Class Problem

Learning Binary Classifiers for Multi-Class Problem Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,

More information

Reliability Assessment with Correlated Variables using Support Vector Machines

Reliability Assessment with Correlated Variables using Support Vector Machines Reliability Assessment with Correlated Variables using Support Vector Machines Peng Jiang, Anirban Basudhar, and Samy Missoum Aerospace and Mechanical Engineering Department, University o Arizona, Tucson,

More information

ABC-Boost: Adaptive Base Class Boost for Multi-class Classification

ABC-Boost: Adaptive Base Class Boost for Multi-class Classification ABC-Boost: Adaptive Base Class Boost for Multi-class Classification Ping Li Department of Statistical Science, Cornell University, Ithaca, NY 14853 USA pingli@cornell.edu Abstract We propose -boost (adaptive

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Tobias Pohlen Selected Topics in Human Language Technology and Pattern Recognition February 10, 2014 Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6

More information

Divergences, surrogate loss functions and experimental design

Divergences, surrogate loss functions and experimental design Divergences, surrogate loss functions and experimental design XuanLong Nguyen University of California Berkeley, CA 94720 xuanlong@cs.berkeley.edu Martin J. Wainwright University of California Berkeley,

More information

ELEG 3143 Probability & Stochastic Process Ch. 4 Multiple Random Variables

ELEG 3143 Probability & Stochastic Process Ch. 4 Multiple Random Variables Department o Electrical Engineering University o Arkansas ELEG 3143 Probability & Stochastic Process Ch. 4 Multiple Random Variables Dr. Jingxian Wu wuj@uark.edu OUTLINE 2 Two discrete random variables

More information

Journal of Mathematical Analysis and Applications

Journal of Mathematical Analysis and Applications J. Math. Anal. Appl. 377 20 44 449 Contents lists available at ScienceDirect Journal o Mathematical Analysis and Applications www.elsevier.com/locate/jmaa Value sharing results or shits o meromorphic unctions

More information

CS 361 Meeting 28 11/14/18

CS 361 Meeting 28 11/14/18 CS 361 Meeting 28 11/14/18 Announcements 1. Homework 9 due Friday Computation Histories 1. Some very interesting proos o undecidability rely on the technique o constructing a language that describes the

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance

The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance Marina Meilă University of Washington Department of Statistics Box 354322 Seattle, WA

More information

Global Weak Solution of Planetary Geostrophic Equations with Inviscid Geostrophic Balance

Global Weak Solution of Planetary Geostrophic Equations with Inviscid Geostrophic Balance Global Weak Solution o Planetary Geostrophic Equations with Inviscid Geostrophic Balance Jian-Guo Liu 1, Roger Samelson 2, Cheng Wang 3 Communicated by R. Temam Abstract. A reormulation o the planetary

More information

Factors of words under an involution

Factors of words under an involution Journal o Mathematics and Inormatics Vol 1, 013-14, 5-59 ISSN: 349-063 (P), 349-0640 (online) Published on 8 May 014 wwwresearchmathsciorg Journal o Factors o words under an involution C Annal Deva Priya

More information

Submodular-Bregman and the Lovász-Bregman Divergences with Applications

Submodular-Bregman and the Lovász-Bregman Divergences with Applications Submodular-Bregman and the Lovász-Bregman Divergences with Applications Rishabh Iyer Department o Electrical Engineering University o Washington rkiyer@u.washington.edu Je Bilmes Department o Electrical

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi Binary classification In binary classification problems, the target y can take on at only two values. In this set of notes, we show how to model this problem

More information

A New Approach to Estimating the Expected First Hitting Time of Evolutionary Algorithms

A New Approach to Estimating the Expected First Hitting Time of Evolutionary Algorithms A New Approach to Estimating the Expected First Hitting Time of Evolutionary Algorithms Yang Yu and Zhi-Hua Zhou National Laboratory for Novel Software Technology Nanjing University, Nanjing 20093, China

More information

Math 754 Chapter III: Fiber bundles. Classifying spaces. Applications

Math 754 Chapter III: Fiber bundles. Classifying spaces. Applications Math 754 Chapter III: Fiber bundles. Classiying spaces. Applications Laurențiu Maxim Department o Mathematics University o Wisconsin maxim@math.wisc.edu April 18, 2018 Contents 1 Fiber bundles 2 2 Principle

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information