Fisher Consistency of Multicategory Support Vector Machines


Yufeng Liu
Department of Statistics and Operations Research, Carolina Center for Genome Sciences, University of North Carolina, Chapel Hill, NC. yliu@ .unc.edu

Abstract

The Support Vector Machine (SVM) has become one of the most popular machine learning techniques in recent years. The success of the SVM is mostly due to its elegant margin concept and theory in binary classification. Generalization to the multicategory setting, however, is not trivial. There are a number of different multicategory extensions of the SVM in the literature. In this paper, we review several commonly used extensions and the Fisher consistency of these extensions. For inconsistent extensions, we propose two approaches to make them Fisher consistent: one is to add bounded constraints and the other is to truncate unbounded hinge losses.

1 Background on Binary SVM

The Support Vector Machine (SVM) is a well-known large margin classifier and has achieved great success in many applications (Vapnik, 1998; Cristianini and Shawe-Taylor, 2000; Hastie, Tibshirani, and Friedman, 2001). The basic concept behind the binary SVM is to search for a separating hyperplane with maximum separation between the two classes. Suppose a training dataset containing n training pairs {x_i, y_i}_{i=1}^n, i.i.d. realizations from a probability distribution P(x, y), is given. The goal is to search for a linear function f(x) = w'x + b so that sign(f(x)) can be used for prediction of labels for new inputs. The SVM aims to find such an f so that points of class +1 and points of class -1 are best separated. In particular, for the separable case, the SVM's solution maximizes the distance between f(x) = 1 and f(x) = -1 subject to y_i f(x_i) ≥ 1; i = 1,..., n. This distance can be expressed as 2/||w|| and is known as the geometric margin.

When perfect separation between the two classes is not feasible, slack variables ξ_i; i = 1,..., n, can be used to measure the amount of violation of the original constraints. Then the SVM solves the following optimization problem:

min_{w,b} (1/2)||w||^2 + C Σ_{i=1}^n ξ_i    (1)
s.t. y_i f(x_i) ≥ 1 - ξ_i, ξ_i ≥ 0, ∀i,

where C > 0 is a tuning parameter which balances the separation and the amount of violation of the constraints. The optimization formulation in (1) is also known as the primal problem of the SVM. Using Lagrange multipliers, (1) can be converted into an equivalent dual problem as follows:

max_α Σ_{i=1}^n α_i - (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j <x_i, x_j>    (2)
s.t. Σ_{i=1}^n y_i α_i = 0; 0 ≤ α_i ≤ C, ∀i.

Once the solution of problem (2) is obtained, w can be calculated as Σ_{i=1}^n y_i α_i x_i and the intercept b can be computed using the Karush-Kuhn-Tucker (KKT) complementary conditions of optimization theory. If nonlinear learning is needed, one can apply the kernel trick by replacing the inner product <x_i, x_j> by K(x_i, x_j), where the kernel K is a positive definite function. This amounts to applying linear learning in the feature space induced by the kernel K to achieve nonlinear learning in the original input space.

It is now known that the SVM can be fit in the regularization framework (Wahba, 1998) as follows:

min_f J(f) + C Σ_{i=1}^n [1 - y_i f(x_i)]_+,    (3)

where the function [1 - u]_+ = 1 - u if u ≤ 1 and 0 otherwise is known as the hinge loss function (see Figure 1). The term J(f) is a regularization term; in the linear learning setting it becomes (1/2)||w||^2, which is related to the geometric margin.
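As a concrete illustration of formulation (3), the following short Python sketch (our own illustration, not code from the paper; the toy data, step size, and iteration count are arbitrary choices) minimizes the regularized hinge loss of a linear classifier by subgradient descent.

import numpy as np

def train_linear_svm(X, y, C=1.0, n_iter=2000, lr=0.01):
    # Minimize 0.5*||w||^2 + C * sum_i [1 - y_i f(x_i)]_+ over (w, b), as in (3).
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1                      # points with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny usage example on separable toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))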

Figure 1: Plot of the hinge loss H_1(u) = [1 - u]_+.

Denote P_1(x) = P(Y = 1 | X = x). Then a binary classifier with loss V(f(x), y) is Fisher consistent if the minimizer of E[V(f(X), Y) | X = x] has the same sign as P_1(x) - 1/2. Clearly, Fisher consistency requires that the loss function asymptotically yields the Bayes decision boundary. Fisher consistency is also known as classification-calibration (Bartlett et al., 2006). For the SVM, Lin (2002) shows that the minimizer of E[[1 - Y f(X)]_+ | X = x] is sign(P_1(x) - 1/2) and consequently the hinge loss of the SVM is Fisher consistent. Interestingly, the SVM only targets the classification set {x : P_1(x) ≥ 1/2} without estimating P_1(x) itself.

Extending the binary SVM to the multicategory SVM is not trivial. In this paper, we first review four common extensions in Section 2. Fisher consistency of the different extensions is discussed in Section 3. Among the four extensions, three are not always Fisher consistent. In Sections 4 and 5, we propose to use bounded constraints and truncation to derive Fisher consistent analogs of the losses discussed in Section 2. Some discussions are given in Section 6.

2 Multicategory SVM

The standard SVM only solves binary problems. However, one very often encounters multicategory problems in practice. To solve a multicategory problem using the SVM, one can use two possible approaches. The first approach is to solve the multicategory problem via a sequence of binary problems, e.g., one-versus-rest and one-versus-one. The second approach is to generalize the binary SVM into a simultaneous multicategory formulation. In this paper, we focus on the second approach.

Consider a k-class classification problem with k ≥ 2. When k = 2, the methodology discussed here reduces to the binary counterpart in Section 1. Let f = (f_1, f_2,..., f_k) be the decision function vector, where each component represents one class and maps from S to R. To remove redundant solutions, a sum-to-zero constraint Σ_{j=1}^k f_j = 0 is employed. For any new input vector x, its label is estimated via the decision rule ŷ = argmax_{j=1,...,k} f_j(x). Clearly, the argmax rule is equivalent to the sign function used in the binary case in Section 1.

The extension of the SVM from the binary to the multicategory case is nontrivial. Before we discuss the detailed formulation of multicategory hinge losses, we first discuss Fisher consistency for multicategory problems. Consider y ∈ {1,..., k} and let P_j(x) = P(Y = j | X = x). Suppose V(f(x), y) is a multicategory loss function. Then in this context, Fisher consistency requires that argmax_j f_j* = argmax_j P_j, where f*(x) = (f_1*(x),..., f_k*(x)) denotes the minimizer of E[V(f(X), Y) | X = x]. Fisher consistency is a desirable condition for a loss function, although a consistent loss may not always translate into better classification accuracy (Hsu and Lin, 2002; Rifkin and Klautau, 2004). For simplicity, we only focus on standard learning where all types of misclassification are treated equally. The proposed techniques, however, can be extended to more general settings with unequal losses.

Note that a point (x, y) is misclassified by f if y ≠ argmax_j f_j(x). Thus a sensible loss V should try to force f_y to be the maximum among the k functions. Once V is chosen, the multicategory SVM solves the following problem:

min_f Σ_{j=1}^k J(f_j) + C Σ_{i=1}^n V(f(x_i), y_i)    (4)

subject to Σ_{j=1}^k f_j(x) = 0.

The key to extending the SVM from binary to multicategory is the choice of the loss V. There are a number of extensions of the binary hinge loss to the multicategory case proposed in the literature. We consider the following four commonly used extensions; their Fisher consistency or inconsistency will be discussed, and for inconsistent losses, we propose two methods to make them Fisher consistent.
(a) (Naive hinge loss; c.f. Zou et al., 2006) [1 - f_y(x)]_+;

(b) (Lee et al., 2004) Σ_{j≠y} [1 + f_j(x)]_+;

(c) (Vapnik, 1998; Weston and Watkins, 1999; Bredensteiner and Bennett, 1999; Guermeur, 2002) Σ_{j≠y} [1 - (f_y(x) - f_j(x))]_+;

(d) (Crammer and Singer, 2001; Liu and Shen, 2006) [1 - min_{j≠y} (f_y(x) - f_j(x))]_+.

Note that the constant 1 in these losses can be changed to a general positive value. However, the resulting losses will be equivalent to the current ones by re-scaling.
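For concreteness, the short Python sketch below (our own illustration, not code from the paper; the decision vector f and the label y are hypothetical inputs, with labels indexed from 0) evaluates the four losses for a single observation.

import numpy as np

def hinge(u):                      # [1 - u]_+
    return np.maximum(1.0 - u, 0.0)

def loss_a(f, y):                  # naive hinge: [1 - f_y]_+
    return hinge(f[y])

def loss_b(f, y):                  # Lee et al. (2004): sum_{j != y} [1 + f_j]_+
    mask = np.arange(len(f)) != y
    return np.maximum(1.0 + f[mask], 0.0).sum()

def loss_c(f, y):                  # Weston-Watkins type: sum_{j != y} [1 - (f_y - f_j)]_+
    mask = np.arange(len(f)) != y
    return hinge(f[y] - f[mask]).sum()

def loss_d(f, y):                  # Crammer-Singer type: [1 - min_{j != y} (f_y - f_j)]_+
    mask = np.arange(len(f)) != y
    return hinge(np.min(f[y] - f[mask]))

f = np.array([1.0, -0.2, -0.8])    # a sum-to-zero decision vector, k = 3
print(loss_a(f, 0), loss_b(f, 0), loss_c(f, 0), loss_d(f, 0))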

As a remark, we note that the sum-to-zero constraint is essential for Losses (a) and (b), but not so for Losses (c) and (d). It is easy to see that all these losses try to encourage f_y to be the maximum among the k functions, either explicitly or implicitly. In the next section, we explore Fisher consistency of these four multicategory hinge loss functions.

3 Fisher Consistency of Multicategory Hinge Losses

In this section, we discuss Fisher consistency of all four losses. We would like to clarify here that some of the Fisher consistency results are already available in the literature. There are previous studies on Fisher consistency of multicategory SVMs such as Zhang (2004), Lee et al. (2004), Tewari and Bartlett (2005), and Zou et al. (2006). The goal of this paper is to first review Fisher consistency or inconsistency of these losses and then explore ways, motivated by the discussions in this section, to modify inconsistent extensions to be consistent.

Inconsistency of Loss (a):

Lemma 1. The minimizer f* of E[[1 - f_Y(X)]_+ | X = x] subject to Σ_j f_j(x) = 0 satisfies the following: f_j*(x) = -(k - 1) if j = argmin_j P_j(x) and 1 otherwise.

Proof: E[[1 - f_Y(X)]_+] = E[Σ_l [1 - f_l(X)]_+ P_l(X)]. For any fixed X = x, our goal is to minimize Σ_l [1 - f_l(x)]_+ P_l(x). We first show that the minimizer f* satisfies f_j* ≤ 1 for j = 1,..., k. To show this, suppose a solution f has f_j > 1. Then we can construct another solution f' with f_j' = 1 and f_l' = f_l + A for l ≠ j, where A = (f_j - 1)/(k - 1) > 0. Then Σ_l f_l' = 0 and f_l' > f_l for all l ≠ j. Consequently, Σ_l [1 - f_l']_+ P_l < Σ_l [1 - f_l]_+ P_l. This implies that f cannot be the minimizer. Therefore, the minimizer satisfies f_j ≤ 1 for all j. Using this property of f*, we only need to consider f with f_j ≤ 1 for all j. Thus, Σ_{l=1}^k [1 - f_l(x)]_+ P_l(x) = Σ_l (1 - f_l(x)) P_l(x) = 1 - Σ_l f_l(x) P_l(x). Then the problem reduces to

max_f Σ_l P_l(x) f_l(x)    (5)
subject to Σ_l f_l(x) = 0; f_l(x) ≤ 1, ∀l.    (6)

It is easy to see that the solution satisfies f_j*(x) = -(k - 1) if j = argmin_j P_j(x) and 1 otherwise.

From Lemma 1, we can see that Loss (a) is not Fisher consistent since, except for the smallest element, all remaining elements of its minimizer are 1. Consequently, the argmax rule cannot be uniquely determined and thus the loss is not Fisher consistent when k ≥ 3, no matter how the P_j's are distributed.

Consistency of Loss (b):

Lemma 2. The minimizer f* of E[Σ_{j≠Y} [1 + f_j(X)]_+ | X = x] subject to Σ_j f_j(x) = 0 satisfies the following: f_j*(x) = k - 1 if j = argmax_j P_j(x) and -1 otherwise.

Proof: Note that E[Σ_{j≠Y} [1 + f_j(X)]_+] = E[E(Σ_{j≠Y} [1 + f_j(X)]_+ | X = x)]. Thus, it is sufficient to consider the minimizer for a given x, and E(Σ_{j≠Y} [1 + f_j(X)]_+ | X = x) = Σ_l Σ_{j≠l} [1 + f_j(x)]_+ P_l(x). Next, we show that the minimizer f* satisfies f_j* ≥ -1 for j = 1,..., k. To show this, suppose a solution f has f_j < -1. Then we can construct another solution f' with f_j' = -1 and f_l' = f_l - A for l ≠ j, where A = (-1 - f_j)/(k - 1) > 0. Then Σ_l f_l' = 0 and f_l' < f_l for all l ≠ j. Consequently, Σ_l Σ_{j≠l} [1 + f_j']_+ P_l < Σ_l Σ_{j≠l} [1 + f_j]_+ P_l. This implies that f cannot be the minimizer. Therefore, the minimizer satisfies f_j ≥ -1 for all j. Using this property of f*, we only need to consider f with f_j ≥ -1 for all j. Thus, Σ_l Σ_{j≠l} [1 + f_j]_+ P_l = Σ_l P_l Σ_{j≠l} (1 + f_j) = Σ_l P_l (k - 1 + Σ_{j≠l} f_j) = Σ_l P_l (k - 1 - f_l) = k - 1 - Σ_l P_l f_l. Consequently, minimizing Σ_l Σ_{j≠l} [1 + f_j]_+ P_l is equivalent to maximizing Σ_l P_l f_l. Then the problem reduces to

max_f Σ_l P_l(x) f_l(x)    (7)
subject to Σ_l f_l(x) = 0; f_l(x) ≥ -1, ∀l.    (8)

It is easy to see that the solution satisfies f_j*(x) = k - 1 if j = argmax_j P_j(x) and -1 otherwise.

Lemma 2 implies that Loss (b) yields the Bayes classification boundary asymptotically; consequently it is a Fisher consistent loss. A similar result was also established by Lee et al. (2004).
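Lemmas 1 and 2 can be checked numerically. The short Python sketch below (an illustration only; the probability vector and the grid are arbitrary choices) grid-searches the conditional risks of Losses (a) and (b) over sum-to-zero vectors for k = 3 and recovers the minimizers described above.

import itertools
import numpy as np

P = np.array([0.5, 0.3, 0.2])                 # class probabilities at a fixed x
grid = np.linspace(-3, 3, 121)

def argmin_risk(loss, P):
    best, best_f = np.inf, None
    for f1, f2 in itertools.product(grid, grid):
        f = np.array([f1, f2, -f1 - f2])      # enforce the sum-to-zero constraint
        r = sum(P[y] * loss(f, y) for y in range(3))
        if r < best:
            best, best_f = r, f
    return best_f

loss_a = lambda f, y: max(1.0 - f[y], 0.0)
loss_b = lambda f, y: sum(max(1.0 + f[j], 0.0) for j in range(3) if j != y)
print("Loss (a) minimizer ~", argmin_risk(loss_a, P))   # ~ (1, 1, -2): argmax not unique
print("Loss (b) minimizer ~", argmin_risk(loss_b, P))   # ~ (2, -1, -1): argmax picks class 1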
Inconsistency of Loss (c):

Lemma 3. Consider k = 3 with 1/2 > P_1 > P_2 > P_3. Then the minimizer(s) f* of E[Σ_{j≠Y} [1 - (f_Y(X) - f_j(X))]_+ | X = x] is (are) as follows. P_2 = 1/3: any f = (f_1, f_2, f_3) satisfying f_1 ≥ f_2 ≥ f_3 and f_1 - f_3 = 1 is a minimizer. P_2 > 1/3: f* = (f_1, f_2, f_3) satisfying f_1 ≥ f_2 ≥ f_3, f_1 = f_2, and f_2 - f_3 = 1. P_2 < 1/3: f* = (f_1, f_2, f_3) satisfying f_1 ≥ f_2 ≥ f_3, f_2 = f_3, and f_1 - f_2 = 1.

Proof: Note that E[Σ_{j≠Y} [1 - (f_Y(X) - f_j(X))]_+ | X = x] can be rewritten as Σ_l P_l Σ_{j≠l} [1 - (f_l(x) - f_j(x))]_+. By the non-increasing property of [1 - u]_+, the minimizer must satisfy f_1 ≥ ... ≥ f_k if P_1 ≥ ... ≥ P_k. Next we focus on k = 3. Consider f_1 ≥ f_2 ≥ f_3 with f_1 - f_2 = A ≥ 0 and f_2 - f_3 = B ≥ 0. Then the objective function becomes

L(A, B) = P_1 [[1 - A]_+ + [1 - (A + B)]_+]    (9)
        + P_2 [1 + A + [1 - B]_+]    (10)
        + P_3 [1 + A + B + 1 + B].    (11)

Before proving the lemma, we first need to show that the minimizer must satisfy A + B ≤ 1. To this end, we prove it in two steps: (1) A ≤ 1, B ≤ 1; (2) A + B ≤ 1. To prove (1), suppose that the minimizer satisfies A > 1. Using contradiction, we can consider another solution with A' = A - ε ≥ 1, where ε > 0, and B' = B. It is easy to see that L(A - ε, B) < L(A, B). Thus, the minimizer must have A ≤ 1. Similarly, we can show B ≤ 1. To prove (2), we know A ≤ 1 and B ≤ 1 by property (1). Suppose the minimizer has A + B > 1. Using contradiction, consider another solution with A' + B' = A + B - ε ≥ 1, where A' = A - ε and B' = B. Then by the fact that P_1 < P_2 + P_3, we have

L(A', B') - L(A, B) = P_1 ε - P_2 ε - P_3 ε < 0.    (12)

This implies A + B ≤ 1. Using properties (1) and (2), minimizing L(A, B) can be reduced as follows:

min_{A,B} (-2P_1 + P_2 + P_3) A + (-P_1 - P_2 + 2P_3) B    (13)
s.t. A + B ≤ 1, A ≥ 0, B ≥ 0.

If P_2 = 1/3, then (-2P_1 + P_2 + P_3) = (-P_1 - P_2 + 2P_3) < 0. Therefore, the solution of (13) satisfies A + B = 1, implying f_1 - f_3 = 1. If P_2 > 1/3, then 0 > (-2P_1 + P_2 + P_3) > (-P_1 - P_2 + 2P_3) and consequently the solution satisfies A = 0 and B = 1, i.e., f_1 = f_2 and f_2 - f_3 = 1. If P_2 < 1/3, then 0 > (-P_1 - P_2 + 2P_3) > (-2P_1 + P_2 + P_3) and consequently the solution satisfies A = 1 and B = 0, i.e., f_1 - f_2 = 1 and f_2 = f_3. The desired result then follows.

Lemma 3 tells us that Loss (c) may be Fisher inconsistent. In fact, in the case of k = 3 with 1/2 > P_1 > P_2 > P_3, Loss (c) is Fisher consistent only when P_2 < 1/3.

Remark: Lee et al. (2004) considered a special case with P_1 > 1/3. Our Lemma 3 here is more general. Interestingly, for the case of P_2 = 1/3, there are infinitely many minimizers although the loss is convex. For example, if (P_1, P_2, P_3) = (5/12, 1/3, 1/4), both (1/2, 0, -1/2) and (1/3, 1/3, -2/3) are minimizers. In general, one cannot claim uniqueness of the minimizer unless the objective function to be minimized is strictly convex.

Inconsistency of Loss (d): Denote g(f(x), y) = min{f_y(x) - f_j(x); j ≠ y} and H_1(u) = [1 - u]_+. Then Loss (d) can be written as H_1(g(f(x), y)).

Lemma 4. The minimizer f* of E[H_1(g(f(X), Y)) | X = x] has the following properties: (1) if max_j P_j > 1/2, then argmax_j f_j* = argmax_j P_j and g(f*(x), argmax_j f_j*) = 1; (2) if max_j P_j < 1/2, then f* = 0.

Proof: Note that E[H_1(g(f(X), Y))] can be written as E[E(H_1(g(f(X), Y)) | X = x)] = E[Σ_{j=1}^k P_j(X) H_1(g(f(X), j))] = E[Σ_{j=1}^k P_j(X) (1 - g(f(X), j))_+]. For any given X = x, let g_j = g(f(x), j); j = 1,..., k. The problem reduces to minimizing Σ_{j=1}^k P_j (1 - g_j)_+. To prove the lemma, we utilize two properties of the minimizer f*: (1) the minimizer f* satisfies max_j g_j* ≤ 1; (2) let j_0 = argmax_j g_j*; for l ≠ j_0, g_l* equals -max_j g_j* = -g_{j_0}*.

Using property (1), we have Σ_{j=1}^k P_j (1 - g_j*)_+ = 1 - Σ_{j=1}^k P_j g_j*. Therefore, minimizing Σ_{j=1}^k P_j (1 - g_j*)_+ is equivalent to maximizing Σ_{j=1}^k P_j g_j*. Using property (2), the problem reduces to max (2P_{j_0} - 1) g_{j_0} for g_{j_0} ∈ [0, 1]. Clearly, the minimizer satisfies the following conditions: if P_{j_0} > 1/2, g_{j_0} = 1; if P_{j_0} < 1/2, g_{j_0} = 0. The desired result of the lemma then follows.

We are now left to show the two properties of f*. Property (1): the minimizer f* satisfies max_j g_j* ≤ 1. To show this property, we note that (1 - g_{j_0}*)_+ = 0 for g_{j_0}* ≥ 1. However, for l ≠ j_0,

g_l* = min_{j≠l} {f_l* - f_j*} ≤ f_l* - f_{j_0}* ≤ -min_{j≠j_0} {f_{j_0}* - f_j*} = -g_{j_0}* = -max_j g_j*,    (14)

which is less than -1 if max_j g_j* > 1. Therefore, f cannot be a minimizer if max_j g_j* > 1. Property (1) then follows.
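The two regimes of Lemma 3 can also be verified numerically. The sketch below (our own illustration; the two probability vectors are arbitrary examples with P_2 on either side of 1/3) grid-searches L(A, B) from the proof.

import numpy as np

def risk_c(A, B, P):
    # L(A, B) from the proof: conditional risk of Loss (c) given probabilities P
    P1, P2, P3 = P
    return (P1 * (max(1 - A, 0) + max(1 - (A + B), 0))
            + P2 * (1 + A + max(1 - B, 0))
            + P3 * ((1 + A + B) + (1 + B)))

grid = np.linspace(0, 2, 201)
for P in [(0.45, 0.35, 0.20), (0.48, 0.30, 0.22)]:
    vals = [(risk_c(A, B, P), A, B) for A in grid for B in grid]
    _, A, B = min(vals)
    print(P, "-> A =", round(A, 2), ", B =", round(B, 2))
# P_2 > 1/3 gives A = 0, B = 1 (f_1 = f_2, so the argmax is not unique);
# P_2 < 1/3 gives A = 1, B = 0 (class 1 wins, matching argmax_j P_j).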

Property (2): for l ≠ j_0, g_l* equals -max_j g_j* = -g_{j_0}*. Since g_l ≤ -g_{j_0}* for l ≠ j_0, as shown in (14), to maximize Σ_{j=1}^k P_j g_j*, property (2) holds.

Lemma 4 suggests that H_1(g(f(x), y)) is Fisher consistent when max_j P_j > 1/2, i.e., when there is a dominating class. Except for the Bayes decision boundary, this condition always holds for a binary problem. For a problem with k > 2, however, existence of a dominating class may not be guaranteed. If max_j P_j(x) < 1/2 for a given x, then f*(x) = 0 and consequently the argmax of f*(x) cannot be uniquely determined. The Fisher inconsistency of Loss (d) was also noted by Zhang (2004) and Tewari and Bartlett (2005).

4 A Consistent Hinge Loss with Bounded Constraints

Figure 2: Illustration of the binary SVM classifier using the hinge loss (a).

In Section 3, we show that Losses (a), (c), and (d) may not always be Fisher consistent while, in contrast, Loss (b) is always Fisher consistent. An interesting observation is that although Loss (a) is inconsistent while Loss (b) is consistent, the reduced problems (5) and (7) in the derivation of their asymptotic properties are very similar. In fact, the only difference is that Loss (b) has the constraint f_l(x) ≥ -1 for all l while, in contrast, Loss (a) has the constraint f_l(x) ≤ 1 for all l. Our observation is that one can add additional constraints on f to force Loss (a) to be consistent. In fact, if the additional constraints f_j ≥ -1/(k - 1) for all j are imposed on (5), then the minimizer becomes f_j*(x) = 1 if j = argmax_j P_j(x) and -1/(k - 1) otherwise. Consequently, the corresponding loss is Fisher consistent. To be specific, we can modify the scale of Loss (a) and propose the following new loss: [k - 1 - f_y(x)]_+ subject to Σ_j f_j(x) = 0 and f_l(x) ≥ -1 for all l. Clearly, this loss can be reduced to

(e) -f_y(x) subject to Σ_j f_j(x) = 0 and -1 ≤ f_l(x) ≤ k - 1 for all l.

Consistency of Loss (e):

Lemma 5. The minimizer f* of E[-f_Y(X) | X = x], subject to Σ_j f_j(x) = 0 and -1 ≤ f_l(x) ≤ k - 1 for all l, satisfies the following: f_j*(x) = k - 1 if j = argmax_j P_j(x) and -1 otherwise.

Proof: The proof is analogous to that of Lemma 2. It is easy to see that the problem can be reduced to

max_f Σ_l P_l(x) f_l(x)    (15)
s.t. Σ_l f_l(x) = 0; -1 ≤ f_l(x) ≤ k - 1, ∀l.

Thus, the solution satisfies f_j*(x) = k - 1 if j = argmax_j P_j(x) and -1 otherwise.

Figure 3: Illustration of the binary SVM classifier using the bounded hinge loss (e).

It is worthwhile to point out that Losses (b) and (c) can be reduced to Loss (e) using bounded constraints. Specifically, for Loss (b), if -1 ≤ f_l(x) ≤ k - 1 for all l and Σ_l f_l(x) = 0, then Σ_{j≠y} [1 + f_j]_+ = Σ_{j≠y} (1 + f_j) = k - 1 - f_y. Thus, it is equivalent to Loss (e). For Loss (c), we can rewrite it in an equivalent form as Σ_{j≠y} [k - (f_y - f_j)]_+. Then if -1 ≤ f_l(x) ≤ k - 1 for all l and Σ_l f_l(x) = 0, Σ_{j≠y} [k - (f_y - f_j)]_+ = Σ_{j≠y} (k - (f_y - f_j)) = k(k - 1) - k f_y. Thus, it is equivalent to Loss (e) as well.

As a remark, we want to point out that the constraint -1 ≤ f_l(x) ≤ k - 1 for all l can be difficult to implement for all x ∈ S. For simplicity of learning, we suggest relaxing such constraints to hold at the training points only, that is, -1 ≤ f_l(x_i) ≤ k - 1 for i = 1,..., n and l = 1,..., k. Then we have the following multicategory learning method:

min_f Σ_{j=1}^k J(f_j) - C Σ_{i=1}^n f_{y_i}(x_i)    (16)
s.t. Σ_j f_j(x_i) = 0; -1 ≤ f_l(x_i) ≤ k - 1; l = 1,..., k, i = 1,..., n.
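For linear decision functions f_j(x) = w_j'x + b_j and J(f_j) = (1/2)||w_j||^2, problem (16) is a convex quadratic program. The following Python sketch (an illustration written with the cvxpy modeling package, not the paper's implementation; the inputs X, y, k, and C are assumed to be given, with labels indexed from 0) states the problem directly.

import numpy as np
import cvxpy as cp

def bounded_loss_e_svm(X, y, k, C=1.0):
    n, d = X.shape
    Xa = np.hstack([X, np.ones((n, 1))])                 # absorb the intercept b_j
    Y = np.zeros((n, k)); Y[np.arange(n), y] = 1.0       # one-hot labels
    W = cp.Variable((k, d + 1))                          # row j holds (w_j, b_j)
    F = Xa @ W.T                                         # F[i, j] = f_j(x_i)
    objective = 0.5 * cp.sum_squares(W[:, :d]) - C * cp.sum(cp.multiply(Y, F))
    constraints = [cp.sum(F, axis=1) == 0,               # sum-to-zero at each x_i
                   F >= -1, F <= k - 1]                  # relaxed bounded constraints
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return W.value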

To further understand learning using the proposed Loss (e), we discuss the binary case with y ∈ {±1}. Recall that in the separable case, the standard SVM tries to find a hyperplane with maximum separation, i.e., the distance between f(x) = 1 and f(x) = -1 is maximized. As we can see in Figure 2, only a small subset of the training data, the so-called support vectors, determines the solution. In contrast, using the relaxed bounded constraints in Loss (e) amounts to forcing f(x_i) ∈ [-1, 1] for all training points. As a result, in the separable case, this new classifier tries to find a hyperplane so that the total distance of all training points to the classification boundary, Σ_{i=1}^n y_i f(x_i), subject to f(x_i) ∈ [-1, 1]; i = 1,..., n, is maximized, as shown in Figure 3. Thus, all points play a role in determining the final solution. In summary, we can conclude that Loss (e) aims for a different classifier from that of Loss (a) due to the bounded constraints.

Note that in contrast to the notion of support vectors of the SVM, the new classifier in (16) utilizes all training points to determine its solution, as illustrated in Figure 3. In order to make the classifier sparse in training points, one can select a fraction of training points to approximate the classifier obtained using the whole training set without jeopardizing the performance in classification. Using such a sparse classifier may help to reduce the computational cost, especially for large training datasets. The idea of the Import Vector Machine (IVM, Zhu and Hastie, 2005) may prove useful here.

Although Losses (a), (b) and (c) can all be reduced to Loss (e) using bounded constraints, Loss (d) appears to behave differently. Currently, we are not able to make Loss (d) Fisher consistent using bounded constraints due to special properties of the min function. In Section 5, we propose to use truncation to make Loss (d) Fisher consistent.

5 Fisher Consistent Truncated Hinge Loss

In many applications, outliers may exist in the training sample and unbounded losses can be sensitive to such points (Shen et al., 2003; Liu et al., 2005; Liu and Shen, 2006). The hinge loss is unbounded, and truncating unbounded hinge losses may help to improve robustness of the corresponding classifiers. In this section, we explore the idea of truncation on Loss (d). We show that the truncated version of Loss (d) can be Fisher consistent for certain truncating locations.

Figure 4: The left, middle, and right panels display the functions H_1(u), H_s(u), and T_s(u), respectively.

Define H_s(u) = [s - u]_+ and T_s(u) = H_1(u) - H_s(u). Then T_s(g(f(x), y)) with s ≤ 0 becomes a truncated version of Loss (d). Figure 4 shows the functions H_1(u), H_s(u), and T_s(u). We first show in Lemma 6 that for a binary problem, T_s is Fisher consistent for any s ≤ 0. For multicategory problems, truncating H_1(g(f(x), y)) can make it Fisher consistent even in the situation of no dominating class, as shown in Lemma 7.
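In code, the truncation is a one-line modification of the hinge. The small sketch below (our own illustration; the truncation point s is passed as a parameter) mirrors the three panels of Figure 4, and T_s = H_1 - H_s is exactly the difference-of-convex decomposition used later for the d.c. algorithm.

import numpy as np

def H1(u):              # hinge  [1 - u]_+
    return np.maximum(1.0 - u, 0.0)

def Hs(u, s):           # shifted hinge  [s - u]_+
    return np.maximum(s - u, 0.0)

def Ts(u, s):           # truncated hinge  T_s(u) = H_1(u) - H_s(u), flat for u <= s
    return H1(u) - Hs(u, s)

u = np.linspace(-3, 3, 7)
print(Ts(u, s=-1.0))    # equals 1 - s = 2 for u <= -1 and the usual hinge for u >= -1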
The following lemma establishes Fisher consistency of the truncated hinge loss T_s for the binary case.

Lemma 6. The minimizer f* of E[T_s(Y f(X)) | X = x] has the same sign as P_1(x) - 1/2 for any s ≤ 0.

Proof: Notice that E[T_s(Y f(X))] = E[E(T_s(Y f(X)) | X = x)]. We can minimize E[T_s(Y f(X))] by minimizing E(T_s(Y f(X)) | X = x) for every x. For any fixed x, E(T_s(Y f(X)) | X = x) can be written as P_1(x) T_s(f(x)) + (1 - P_1(x)) T_s(-f(x)). Since T_s is a non-increasing function, the minimizer f* must satisfy f*(x) ≥ 0 if P_1(x) > 1/2 and f*(x) ≤ 0 otherwise. Thus, it is sufficient to show that f* = 0 is not a minimizer. Without loss of generality, assume P_1(x) > 1/2; then for f(x) = 0, E(T_s(Y f(X)) | X = x) = T_s(0) = 1. Consider another solution f(x) = 1. Then E(T_s(Y f(X)) | X = x) = (1 - P_1(x)) T_s(-1) ≤ 2(1 - P_1(x)) < 1 for any s ≤ 0. Therefore, f(x) = 1 gives a smaller value of E(T_s(Y f(X)) | X = x) than f(x) = 0, which implies that f* = 0 is not a minimizer. We can then conclude that f*(x) has the same sign as P_1(x) - 1/2.

Truncating Loss (d):

Lemma 7. The minimizer f* of E[T_s(g(f(X), Y)) | X = x] satisfies argmax_j f_j* = argmax_j P_j for any s ∈ [-1/(k - 1), 0].

Proof: Note that E[T_s(g(f(X), Y))] = E[Σ_{j=1}^k T_s(g(f(X), j)) P_j(X)]. For any given x, we need to minimize Σ_{j=1}^k T_s(g_j) P_j, where g_j = g(f(x), j). By definition and the fact that Σ_{j=1}^k f_j = 0, we can conclude that max_j g_j ≥ 0 and at most one of the g_j's is positive. Let j_p satisfy P_{j_p} = max_j P_j. Then, using the non-increasing property of T_s, the minimizer f* satisfies g_{j_p}* ≥ 0. We are now left to show that g_{j_p}* > 0; equivalently, that f = 0 cannot be a minimizer. To this end, assume P_{j_p} > 1/k and consider another solution f^0 with f^0_{j_p} = (k - 1)/k and f^0_j = -1/k for j ≠ j_p. Then g^0_{j_p} = 1 and g^0_j = -1 for j ≠ j_p. Clearly, Σ_{j=1}^k T_s(g^0_j) P_j ≤ (1 + 1/(k - 1))(1 - P_{j_p}) < 1. Therefore, f^0 gives a smaller value of Σ_{j=1}^k T_s(g_j) P_j than f = 0, and consequently g_{j_p}* > 0. The desired result then follows.

Remark: The truncation operation can be applied to many other losses, such as Loss (b). Denote H̃_s(u) = [u - s]_+. Then Loss (b) can be expressed as Σ_{j=1}^k I(y ≠ j) H̃_{-1}(f_j(x)). Define T̃_s(u) = H̃_{-1}(u) - H̃_s(u) for s ≥ 0. Then the truncated version of Loss (b) becomes Σ_{j=1}^k I(y ≠ j) T̃_s(f_j(x)). It can be shown that the truncated Loss (b) is Fisher consistent for any s ≥ 0 (Wu and Liu, 2006a).
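Lemma 7 can also be checked numerically. The sketch below (our own illustration; the probability vector, the grid, and the choice s = -1/(k - 1) are example choices) grid-searches the conditional risk of the truncated Loss (d) for k = 3 with no dominating class.

import itertools
import numpy as np

P = np.array([0.40, 0.35, 0.25])                    # no class probability exceeds 1/2
k, s = 3, -0.5                                      # s = -1/(k - 1)

def loss_d_trunc(f, y):
    g = min(f[y] - f[j] for j in range(k) if j != y)
    return min(max(1.0 - g, 0.0), 1.0 - s)          # T_s(g) = min(H_1(g), 1 - s)

grid = np.linspace(-2, 2, 81)
vals = []
for f1, f2 in itertools.product(grid, grid):
    f = np.array([f1, f2, -f1 - f2])
    vals.append((sum(P[y] * loss_d_trunc(f, y) for y in range(k)), f1, f2))
r, f1, f2 = min(vals)
print("approx. minimizer:", (round(f1, 2), round(f2, 2), round(-f1 - f2, 2)))
# Its argmax is class 1, matching argmax_j P_j, whereas the untruncated Loss (d)
# would give f* = 0 here (Lemma 4, since max_j P_j < 1/2).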

6 Discussion

Fisher consistency is a desirable property for a loss function in classification. It ensures that the corresponding classifier delivers the Bayes classification boundary asymptotically. Consequently, we view Fisher consistency as a necessary property for any good loss to have. In this paper, we focus on the method of the SVM and discuss Fisher consistency of several commonly used multicategory hinge losses. We study four different losses, and three out of the four are Fisher inconsistent. For the Fisher inconsistent losses, we propose two methods to make them Fisher consistent.

Our first approach is to add bounded constraints to force the corresponding hinge loss to be always Fisher consistent. This results in a very interesting new loss (Loss (e)) and a new learning algorithm. In contrast to the notion of support vectors in the standard SVM, the new classifier utilizes all points to determine the classification boundary. Implementation of the new loss involves convex minimization and solves a quadratic programming problem similar to that of the standard SVM. For our future research, we will explore the performance of the new classifier and compare it with the standard SVM both theoretically and numerically. With Fisher consistency of the new loss for both binary and multicategory problems, we believe this new classifier is promising.

Our second approach is to use truncation to make Loss (d) Fisher consistent. Interestingly, truncation not only helps to remedy Fisher inconsistency, it also improves robustness of the resulting classifier, as discussed in Wu and Liu (2006b). One drawback of the truncated losses, however, is that the corresponding optimization is nonconvex. For a truncated hinge loss, one can decompose it into a difference of two convex functions and then apply the d.c. algorithm (Liu et al., 2005).

Acknowledgements

This work is partially supported by Grant DMS from the National Science Foundation and the UNC Junior Faculty Development Award.

References

Bartlett, P., Jordan, M., and McAuliffe, J. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101.

Bredensteiner, E. and Bennett, K. (1999). Multicategory classification by support vector machines. Computational Optimization and Applications, 12.

Crammer, K. and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2.

Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.
Guermeur, Y. (2002). Combining discriminant models with new multi-class SVMs. Pattern Analysis and Applications (PAA), 5.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag, New York.

Hsu, C. W. and Lin, C. J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13, 2.

Lee, Y., Lin, Y., and Wahba, G. (2004). Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99, 465.

Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6.

Liu, Y. and Shen, X. (2006). Multicategory ψ-learning. Journal of the American Statistical Association, 101, 474.

Liu, Y., Shen, X., and Doss, H. (2005). Multicategory ψ-learning and support vector machine: computational tools. Journal of Computational and Graphical Statistics, 14, 1, 219-236.

Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5, 101-141.

Shen, X., Tseng, G. C., Zhang, X., and Wong, W. H. (2003). On ψ-learning. Journal of the American Statistical Association, 98.

Tewari, A. and Bartlett, P. L. (2005). On the consistency of multiclass classification methods. In Proceedings of the 18th Annual Conference on Learning Theory, volume 3559. Springer.

Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.

Wahba, G. (1998). Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In: B. Schölkopf, C. J. C. Burges, and A. J. Smola (eds), Advances in Kernel Methods: Support Vector Learning, MIT Press.

Weston, J. and Watkins, C. (1999). Support vector machines for multi-class pattern recognition. Proceedings of the Seventh European Symposium on Artificial Neural Networks.

Wu, Y. and Liu, Y. (2006a). On multicategory truncated-hinge-loss support vector machines. To appear in Proceedings of the Joint Summer Research Conference on Machine and Statistical Learning: Prediction and Discovery.

Wu, Y. and Liu, Y. (2006b). Robust truncated-hinge-loss support vector machines. Submitted.

Zhang, T. (2004). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 1225-1251.

Zhu, J. and Hastie, T. (2005). Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14, 1.

Zou, H., Zhu, J., and Hastie, T. (2006). The margin vector, admissible loss, and multi-class margin-based classifiers. Technical report, Department of Statistics, University of Minnesota.
