arXiv: v1 [cs.LG] 23 Aug 2018


Multiclass Universum SVM

Sauptik Dhar (1), Vladimir Cherkassky (2), Mohak Shah (1, 3)

(1) LG Silicon Valley Lab, Santa Clara, CA, USA. (2) University of Minnesota, MN, USA. (3) University of Illinois at Chicago, IL, USA. Correspondence to: Sauptik Dhar <sauptik.dhar@gmail.com>.

Abstract

We introduce Universum learning for multiclass problems and propose a novel formulation for multiclass universum SVM (MU-SVM). We also propose an analytic span bound for model selection with 2-4× faster computation times than standard resampling techniques. We empirically demonstrate the efficacy of the proposed MU-SVM formulation on several real-world datasets, achieving > 20% improvement in test accuracies compared to multiclass SVM.

1. Introduction

Many applications of machine learning involve analysis of sparse high-dimensional data, where the number of input features is larger than the number of data samples. Such settings are typically seen in several real-life applications in domains such as healthcare, autonomous driving, prognostics and health management etc. (Cherkassky & Mulier, 2007). Such high-dimensional data sets present new challenges for most learning problems. Novel data-intensive deep architectures are naturally not suited for such scenarios (Goodfellow et al., 2016). Recent studies have shown Universum learning to be particularly effective for such high-dimensional low sample size data settings (Sinz et al., 2008; Chen & Zhang, 2009; Dhar & Cherkassky, 2015; Lu & Tong, 2014; Qi et al., 2014; Shen et al., 2012; Wang et al., 2014; Zhang et al., 2008; Xu et al., 2015; 2016; Zhu, 2016; Chen et al., 2017; Dhar & Cherkassky, 2017). However, most such studies are limited to binary classification problems. On the other hand, many practical applications involve classification of more than two categories. In order to incorporate a priori knowledge (in the form of universum data) for such applications, there is a need to extend universum learning to multiclass problems. In this paper we focus on formulating universum learning for multiclass SVM under balanced settings with equal misclassification costs.

Researchers have proposed several methods to solve a multiclass SVM problem. Typically these methods follow two basic approaches (Hsu & Lin, 2002; Wang & Xue, 2014). The first approach follows an Error Correcting Output Code (ECOC) based setting (Dietterich & Bakiri, 1995), where several binary classifiers are combined to solve the multiclass problem, viz. one-vs-one, one-vs-all, directed acyclic graph SVM (Platt et al., 1999). Previous works such as (Sinz, 2007; Chen & Zhang, 2009), which follow this setting, focus on the binary universum learning paradigm and only provide some hints for their extension to multiclass problems. An alternative to the ECOC based setting is the direct approach, where the entire multiclass problem is solved through a single larger optimization formulation (Vapnik, 1998; Crammer & Singer, 2002; Weston & Watkins, 1998). Recently, (Zhang & LeCun, 2017) adopted such a direct approach for universum learning under a probabilistic framework using a logistic loss function. This paper also adopts such a direct approach, but proposes an alternate universum learning framework that utilizes an SVM-like loss function following (Crammer & Singer, 2002), and introduces the Multiclass Universum SVM (MU-SVM) formulation. The proposed framework allows for: a) an efficient implementation of MU-SVM using existing multiclass SVM solvers (Section 3.2), and b) deriving practical analytic error bounds for model selection (Section 3.3). Further, compared to ECOC based approaches, we provide a unified framework for multiclass learning under universum settings, with similar (or better) performance accuracies (see Appendix B.1).

The main contributions of this paper are as follows:

1. We formalize the notion of universum learning for SVM under multiclass settings, and propose a novel direct formulation called Multiclass Universum SVM (MU-SVM) (in Section 3.1). The proposed MU-SVM formulation has the neat property that it reduces to: i) the standard Crammer & Singer (C&S) multiclass SVM in the absence of universum data, and ii) the binary U-SVM formulation (Weston et al., 2006) for two-class problems (Section 3.1, Proposition 1). This consolidates the propriety of MU-SVM as the apt extension of multiclass SVM to universum settings.
2. The proposed formulation has a desirable structure that renders the MU-SVM formulation solvable through any state-of-the-art multiclass SVM solver (Section 3.2, Proposition 2).
3. We provide a new Span definition for multiclass formulations, and derive a leave-one-out bound for MU-SVM (Section 3.3, Theorem 1). Under additional assumptions, we provide a computationally efficient version of the leave-one-out error bound (Section 3.3, Theorem 2), which presents a practical mechanism for model selection.
4. Empirical results are provided in support of the proposed strategy (Section 4).

Finally, conclusions are presented in Section 5. Note that a shorter version of this work is available in (Dhar et al., 2016). Compared to (Dhar et al., 2016), this paper includes additional proofs and results as highlighted below:

- This paper provides the new Propositions 1, 2 & 3.
- We provide a new leave-one-out bound in Theorem 1 without any assumptions. Under the assumptions in Section 3.3, we provide a stricter leave-one-out error bound, which holds for both Type 1 & 2 support vectors.
- Exhaustive results for all the claims are provided for additional data sets.

2. Multiclass SVM

This section provides a brief description of the multiclass SVM formulation following (Crammer & Singer, 2002). Given i.i.d. training samples $(x_i, y_i)_{i=1}^{n}$, with $x \in \mathbb{R}^d$ and $y \in \{1, \ldots, L\}$, where n = number of training samples, d = dimensionality of the input space and L = total number of classes, the task of a multiclass classifier is to estimate a vector-valued function $f = [f_1, \ldots, f_L]$ for predicting the class labels of future unseen samples (x, y) using the decision rule $\hat{y} = \arg\max_{l=1,\ldots,L} f_l(x)$. The C&S multiclass SVM is a widely used formulation which generalizes the concept of a large margin classifier to multiclass problems.

Figure 1: Loss function for multiclass SVM with $f_k(x) = w_k^\top x$. A sample (x, y = k) lying inside the margin is linearly penalized using the slack variable $\xi$.
This multiclass SVM setting employs a special margin-based loss (similar to the hinge loss),

$$L(y, f(x)) = \big[\max_{l}\,(f_l(x) + 1 - \delta_{yl}) - f_y(x)\big]_+ , \quad [a]_+ = \max(0, a), \quad \delta_{yl} = \begin{cases} 1; & y = l \\ 0; & y \neq l \end{cases}$$

(see Fig. 1). Here, for any sample (x, y = k), having L(y, f(x)) = 0 ensures a margin-distance of "+1" for the correct prediction, i.e. $f_k(x) - f_l(x) \geq 1;\ \forall\, l \neq k$. The multiclass SVM formulation (for linear parameterization) is provided below:

$$\min_{w_1 \ldots w_L,\ \xi}\ \frac{1}{2}\sum_{l=1}^{L} \|w_l\|_2^2 + C \sum_{i=1}^{n} \xi_i \qquad (1)$$
$$\text{s.t.}\quad (w_{y_i} - w_l)^\top x_i \geq e_{il} - \xi_i;\quad e_{il} = 1 - \delta_{il},\quad i = 1 \ldots n,\ l = 1 \ldots L$$

Here $f_l(x) = w_l^\top x$. Note that training samples falling inside the margin border ("+1") are linearly penalized using the slack variables $\xi_i \geq 0$, i = 1…n (as shown in Fig. 1). These slack variables contribute to the empirical risk for the multiclass SVM formulation, $R_{emp}(w) = \sum_{i=1}^{n} \xi_i$. The SVM (footnote 1) formulation attempts to strike a balance between minimization of the empirical risk and the regularization term. This is controlled through the user-defined parameter $C \geq 0$.

Footnote 1: We refer to the C&S formulation in (1) as SVM throughout.

3. Multiclass Universum SVM

3.1. Multiclass U-SVM formulation

The idea of Universum learning was introduced by (Vapnik, 1998; 2006) to incorporate a priori knowledge about admissible data samples. Universum learning was introduced for binary classification, where in addition to labeled training data we are also given a set of unlabeled examples from the Universum. The Universum contains data that belongs to the same application domain as the training data. However, these samples are known not to belong to either class. In fact, this idea can also be extended to multiclass problems. For multiclass problems, in addition to the labeled training data we are also given a set of unlabeled examples from the Universum. These Universum samples are known not to belong to any of the classes in the training data. For example, if the goal of learning is to discriminate between handwritten digits 0, 1, 2, ..., 9, one can introduce additional knowledge in the form of handwritten letters A, B, C, ..., Z. These examples from the Universum contain certain information about handwriting styles, but they cannot be assigned to any of the classes (0 to 9). Also note that Universum samples do not have the same distribution as labeled training samples.

Figure 2: Loss function for universum samples $x^*$ for the k-th class decision boundary $w_k^\top x - \max_{l=1 \ldots L} w_l^\top x = 0$. A sample lying outside the $\Delta$-insensitive zone is penalized linearly using the slack variable $\zeta_k$.

These unlabeled Universum samples are introduced into the learning as contradictions and hence should lie close to the decision boundaries of all the classes 1…L. This argument follows from (Vapnik, 2006; Weston et al., 2006), where the universum samples lying close to the decision boundaries are more likely to falsify the classifier. To ensure this, we incorporate a $\Delta$-insensitive loss function for the universum samples (shown in Fig. 2). This $\Delta$-insensitive loss forces the universum samples to lie close to the decision boundaries ("0" in Fig. 2). Note that this idea of using a $\Delta$-insensitive loss for Universum samples has been previously introduced in (Weston et al., 2006) for binary classification. However, different from (Weston et al., 2006), here the $\Delta$-insensitive loss is introduced for the decision boundaries of all the classes, i.e. $w_k^\top x - \max_{l=1 \ldots L} w_l^\top x = 0;\ \forall\, k = 1 \ldots L$. This reasoning motivates the new multiclass Universum-SVM (MU-SVM) formulation where:

- Standard hinge loss is used for the training samples (shown in Fig. 1). This loss forces the training samples to lie outside the "+1" margin border.
- The universum samples are penalized by a $\Delta$-insensitive loss (see Fig. 2) for the decision functions of all the classes $f = [f_1, \ldots, f_L]$.

This leads to the MU-SVM formulation. Given training samples $T := (x_i, y_i)_{i=1}^{n}$, where $y_i \in \{1, \ldots, L\}$, and additional unlabeled universum samples $U := (x^*_{i'})_{i'=1}^{m}$, solve (footnote 2)

$$\min_{w_1 \ldots w_L,\ \xi,\ \zeta}\ \frac{1}{2}\sum_{l=1}^{L} \|w_l\|_2^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{i'=1}^{m} \sum_{k=1}^{L} \zeta_{i'k} \qquad (2)$$
$$\text{s.t.}\quad \forall\, i = 1 \ldots n,\ i' = 1 \ldots m$$
$$(w_{y_i} - w_l)^\top x_i \geq e_{il} - \xi_i;\quad e_{il} = 1 - \delta_{il},\quad l = 1 \ldots L$$
$$\big|\, w_k^\top x^*_{i'} - \max_{l=1 \ldots L} w_l^\top x^*_{i'} \big| \leq \Delta + \zeta_{i'k};\quad \zeta_{i'k} \geq 0,\quad k = 1 \ldots L,\qquad \delta_{il} = \begin{cases} 1; & y_i = l \\ 0; & y_i \neq l \end{cases}$$

Footnote 2: Throughout this paper, we use index i, j for training samples, i' for universum samples and k, l for the class labels.

Here, for the k-th class decision boundary, the universum samples $(x^*_{i'})_{i'=1}^{m}$ that lie outside the $\Delta$-insensitive zone are linearly penalized using the slack variables $\zeta_{i'k} \geq 0$, i' = 1…m. The user-defined parameters $C, C^* \geq 0$ control the trade-off between the margin size, the error on training samples, and the contradictions (samples lying outside the $\pm\Delta$ zone) on the universum samples. Note that for $C^* = 0$, eq. (2) reduces to the multiclass SVM classifier.

Proposition 1. For binary classification L = 2, (2) reduces to the standard U-SVM formulation in (Weston et al., 2006) with $w = w_1 - w_2$ and b = 0.

3.2. Computational Implementation of MU-SVM

This section describes the computational implementation of the MU-SVM formulation (2). Here, for every universum sample $x^*_{i'}$ we create L artificial samples belonging to all the classes, i.e. $(x^*_{i'}, y^*_{i'1} = 1), \ldots, (x^*_{i'}, y^*_{i'L} = L)$, as below:

$$(x_i, y_i) = \begin{cases} (x_i, y_i) & i = 1 \ldots n \\ (x^*_{i'}, y^*_{i'l}) & i = n+1 \ldots n+mL;\ i' = 1 \ldots m;\ l = 1 \ldots L \end{cases}$$
$$e_{il} = \begin{cases} e_{il} & i = 1 \ldots n;\ l = 1 \ldots L \\ -\Delta\,(1 - \delta_{il}) & i = n+1 \ldots n+mL;\ l = 1 \ldots L \end{cases} \qquad (3)$$
$$C_i = \begin{cases} C & i = 1 \ldots n \\ C^* & i = n+1 \ldots n+mL \end{cases}$$

Proposition 2. Under transformation (3), the MU-SVM formulation in eq. (2) can be exactly solved using

$$\min_{w_1 \ldots w_L,\ \xi}\ \frac{1}{2}\sum_{l=1}^{L} \|w_l\|_2^2 + \sum_{i=1}^{n+mL} C_i\, \xi_i \qquad (4)$$
$$\text{s.t.}\quad (w_{y_i} - w_l)^\top x_i \geq e_{il} - \xi_i;\quad i = 1 \ldots n+mL,\ l = 1 \ldots L$$

The formulation (4) has the same form as (1) except that the former has additional mL constraints for the universum samples. Like most other SVM solvers, the MU-SVM formulation in (4) is also solved in its dual form as shown in Algorithm 1; see (Hsu & Lin, 2002). Hence, the computational complexity is the same as solving a multiclass SVM formulation (in (1)) with n + mL samples.
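The bookkeeping in transformation (3) can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names (`transform`, `slack`, `delta_insensitive`) are hypothetical and not from the paper, class labels are taken to be 1…L, and the vector `f` stands for the decision values $[f_1(x), \ldots, f_L(x)]$ of some already-trained model. The sketch builds the n + mL samples with their $e_{il}$ and $C_i$, and then checks numerically that the slack implied by the constraints of (4) for an artificial universum sample equals the $\Delta$-insensitive loss of (2).

```python
def delta(a, b):
    """Kronecker delta used in e_il = 1 - delta_il."""
    return 1.0 if a == b else 0.0


def transform(X, y, X_univ, L, C, C_star, Delta):
    """Return a list of (x_i, y_i, e_i, C_i) for the n + mL samples of eq. (3)."""
    samples = []
    # Training samples keep their label, e_il = 1 - delta_il and cost C.
    for x, label in zip(X, y):
        e = [1.0 - delta(label, l) for l in range(1, L + 1)]
        samples.append((x, label, e, C))
    # Each universum sample is replicated once per class with
    # e_il = -Delta * (1 - delta_il) and cost C*.
    for x in X_univ:
        for k in range(1, L + 1):
            e = [-Delta * (1.0 - delta(k, l)) for l in range(1, L + 1)]
            samples.append((x, k, e, C_star))
    return samples


def slack(f, label, e):
    """Smallest xi with f_y - f_l >= e_l - xi for every class l (constraints of (4))."""
    return max(e[l] - (f[label - 1] - f[l]) for l in range(len(f)))


def delta_insensitive(f, k, Delta):
    """Delta-insensitive loss of (2) for the k-th class decision boundary."""
    return max(0.0, abs(f[k - 1] - max(f)) - Delta)
```

For decision values f = [0.5, -0.2, 0.1] and Delta = 0.1, both `slack` (applied to the artificial sample with label k) and `delta_insensitive` give 0, 0.6 and 0.3 for k = 1, 2, 3, up to floating-point rounding.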
Most off-the-shelf multiclass SVM solvers can be used for solving the proposed MU-SVM.

Algorithm 1: MU-SVM (dual form)
1. Given training samples $(x_i, y_i)_{i=1}^{n}$ and universum samples $(x^*_{i'})_{i'=1}^{m}$.
2. Transform using (3) and solve (5):

$$\max_{\alpha}\ W(\alpha) = -\frac{1}{2} \sum_{i,j} \sum_{l} \alpha_{il}\, \alpha_{jl}\, K(x_i, x_j) - \sum_{i,l} \alpha_{il}\, e_{il} \qquad (5)$$
$$\text{s.t.}\quad \sum_{l} \alpha_{il} = 0;\quad \alpha_{il} \leq C_i \ \text{if } l = y_i;\quad \alpha_{il} \leq 0 \ \text{if } l \neq y_i;\quad i, j = 1 \ldots n+mL,\ l = 1 \ldots L$$

3. Obtain the class label using the decision rule $\hat{y} = \arg\max_{l} \sum_{i} \alpha_{il}\, K(x_i, x)$.

3.3. Model Selection

As presented in (5), the current MU-SVM algorithm has four tunable parameters: $C$, $C^*$, the kernel parameter, and $\Delta$. So in practice, multiclass SVM may yield better results than MU-SVM, simply because it has an inherently simpler model selection. Successful application of the proposed MU-SVM heavily depends on the optimal tuning of its model parameters. This paper adopts a simplified strategy for model selection previously used in (Cherkassky et al., 2011). This mainly involves two steps:

a. First, perform optimal tuning of the C and kernel parameters for the multiclass SVM classifier. This step equivalently performs model selection for the parameters specific only to the training samples in the MU-SVM formulation (2).
b. Second, tune the parameter $\Delta$ while keeping C and the kernel parameters fixed (as selected in Step a). The parameter $C^*/C = \frac{n}{mL}$ is kept fixed throughout the paper to ensure equal contribution of training and universum samples in the optimization formulation.

This strategy selects an MU-SVM solution (in Step b) close to a given SVM solution (selected in Step a). The model parameters in Steps (a) & (b) are typically selected through resampling techniques such as leave-one-out (l.o.o) or stratified cross-validation approaches (Japkowicz & Shah, 2011). Of these approaches, l.o.o provides an almost unbiased estimate of the test error (Luntz, 1969; Schölkopf & Smola, 2001). However, on the downside it is very computationally intensive. In this paper, we propose a new analytic bound for the leave-one-out error of the MU-SVM formulation. The proposed bound can be used for model selection in Steps (a) & (b) and provides a computational edge over standard resampling techniques. A detailed discussion of this new l.o.o error bound is provided next.

Note that the l.o.o formulation with the t-th training sample dropped is the same as (5) with an additional constraint $\alpha_{tl} = 0;\ \forall\, l$.
Then, the l.o.o error is given as

$$R_{l.o.o} = \frac{1}{n} \sum_{t=1}^{n} \mathbb{1}[\,y_t \neq \hat{y}_t\,], \quad \text{where}\quad \hat{y}_t = \arg\max_{l} \sum_{i} \alpha^{t}_{il}\, K(x_i, x_t)$$

is the predicted class label for the t-th sample and $\alpha^{t} = [\,\underbrace{\alpha^{t}_{11}, \ldots, \alpha^{t}_{1L}}_{\alpha^{t}_{1}}, \ldots, \underbrace{\alpha^{t}_{t1} = 0, \ldots, \alpha^{t}_{tL} = 0}_{\alpha^{t}_{t} = 0}, \ldots\,]$ is the l.o.o solution. In this paper, we follow a strategy very similar to the one used in (Vapnik & Chapelle, 2000), and derive the new l.o.o bound for the MU-SVM formulation in (5). The necessary prerequisites are presented next.

Definition 1. (Support vector categories)
1. A support vector obtained from eq. (5) is called a Type 1 support vector if $0 < \alpha_{iy_i} < C_i$. This is represented as $SV_1 = \{\, i \mid 0 < \alpha_{iy_i} < C_i \,\}$.
2. A support vector obtained from eq. (5) is called a Type 2 support vector if $\alpha_{iy_i} = C_i$. This is represented as $SV_2 = \{\, i \mid \alpha_{iy_i} = C_i \,\}$.

The set of all support vectors is represented as $SV = SV_1 \cup SV_2$. Similarly, the set of support vectors for the l.o.o solution is given as $SV^{t}$. Under Definition 1 we have:

Lemma 1. If in the leave-one-out procedure a Type 1 support vector $x_t$ is classified incorrectly, then we have $S_t \max\!\big(\sqrt{2}\,D,\ \tfrac{1}{\sqrt{C}}\big) \geq 1$, where

$$S_t^2 = \min_{\beta} \sum_{i,j} \Big( \sum_{l} \beta_{il}\, \beta_{jl} \Big) K(x_i, x_j) \qquad (6)$$
$$\text{s.t.}\quad \alpha_{il} - \beta_{il} \leq C_i;\quad \forall\, \{(i \neq t,\ l) \mid 0 < \alpha_{il} < C_i;\ l = y_i\}$$
$$\alpha_{il} - \beta_{il} \leq 0;\quad \forall\, \{(i \neq t,\ l) \mid \alpha_{il} < 0;\ l \neq y_i\}$$
$$\beta_{il} = 0;\quad \forall\, i \notin SV_1 \cup \{t\},\ l = 1 \ldots L$$
$$\beta_{tl} = \alpha_{tl};\quad \forall\, l = 1 \ldots L$$
$$\sum_{l} \beta_{il} = 0$$

$S_t$ := span of the Type 1 support vector $x_t$; D := diameter of the smallest hypersphere containing all training samples.

This leads to the following upper bound on the l.o.o error.

Theorem 1. The leave-one-out error is upper bounded as

$$R_{l.o.o} \leq \frac{1}{n} \big( |\Psi_1| + |\Psi_2| \big) \qquad (7)$$
$$\Psi_1 := \{\, t \in SV_2 \cap T \,\}, \qquad \Psi_2 := \Big\{\, t \in SV_1 \cap T \ \Big|\ S_t \max\!\big(\sqrt{2}\,D,\ \tfrac{1}{\sqrt{C}}\big) \geq 1 \,\Big\}$$

where $|\cdot|$ := cardinality of a set and T := training set.

Following Theorem 1, it is desirable to select a model with a) a lower number of Type 2 training support vectors, and b) a smaller span for the Type 1 training support vectors. Roughly, for a fixed number of Type 2 support vectors, a solution with a smaller span value (for the Type 1 training support vectors) could yield a lower test error.
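As a rough illustration of how the right-hand side of (7) could be evaluated once a dual solution is available, consider the sketch below. Several things here are assumptions rather than the paper's implementation: the function names are hypothetical, the span values $S_t$ are taken as given inputs (computing them requires solving (6)), the scale factor in Lemma 1 is read as $\max(\sqrt{2}D, 1/\sqrt{C})$, and a small tolerance decides whether $\alpha_{iy_i}$ sits at its bound $C_i$.

```python
import math


def sv_type(alpha_true, c_i, tol=1e-8):
    """Return 2 if alpha_{i y_i} = C_i, 1 if 0 < alpha_{i y_i} < C_i, else 0."""
    if alpha_true >= c_i - tol:
        return 2
    if alpha_true > tol:
        return 1
    return 0


def loo_bound_theorem1(alpha_true, costs, spans, D, C):
    """Evaluate (|Psi_1| + |Psi_2|) / n over the training samples only.

    alpha_true[t] is alpha_{t y_t}, costs[t] is C_t and spans[t] is S_t
    (ignored unless sample t is a Type 1 support vector).
    """
    n = len(alpha_true)
    # Assumed reading of the scale factor in Lemma 1 / Theorem 1.
    scale = max(math.sqrt(2.0) * D, 1.0 / math.sqrt(C))
    types = [sv_type(a, c) for a, c in zip(alpha_true, costs)]
    psi1 = [t for t in range(n) if types[t] == 2]
    psi2 = [t for t in range(n) if types[t] == 1 and spans[t] * scale >= 1.0]
    return (len(psi1) + len(psi2)) / n
```

With four training samples, one of Type 2 and two of Type 1 of which only one has a large enough span, the bound evaluates to 2/4 = 0.5.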
The following proposition shows how the universum samples influence these span values in (6).

Proposition 3. If the Type 1 training support vectors, i.e. $t \in SV_1 \cap T$, for the SVM and MU-SVM solutions remain the same, then $S_t^{SVM} \geq S_t^{MU\text{-}SVM};\ \forall\, t \in SV_1 \cap T$.

Loosely speaking, for cases where the type of the training support vectors remains the same, introducing universum samples through the MU-SVM formulation could result in smaller span values and better generalization on future test data compared to the standard SVM solution.

Now, Theorem 1 provides an analytic tool for model selection with small l.o.o error. Here, the right-hand side of (7) serves as a leave-one-out error estimate, and the goal is to select the model parameters which minimize this value. However, the practical utility of (7) is limited due to the significant computational complexity involved in estimating the span of the Type 1 training support vectors, $O((n + mL)^4)$ in the worst case. Next, we provide a more computationally attractive alternative to the above l.o.o bound.

Assumption: For the MU-SVM solution,
(i) The sets of support vectors of the Type 1 and Type 2 categories remain the same during the leave-one-out procedure.
(ii) The dual variables of the Type 1 support vectors have only two active elements, i.e. $\forall\, \alpha_i$ s.t. $\{0 < \alpha_{iy_i} < C_i\}\ \Rightarrow\ \exists\, k \neq y_i$ s.t. $\alpha_{ik} = -\alpha_{iy_i}$.

Lemma 2. Under the above assumptions the following equality holds for both Type 1 & 2 support vectors:

$$S_t^2 = \Big[\, \alpha_t^\top \sum_{i \in SV} \alpha_i\, K(x_i, x_t) \;-\; \alpha_{ty_t}\, g_{y_t k}^\top \sum_{i \in SV^{t}} \alpha^{t}_i\, K(x_i, x_t) \,\Big] \qquad (8)$$

with $S_t^2 = \big\{ \min_{\beta} \sum_{i,j} \big( \sum_{l} \beta_{il}\, \beta_{jl} \big) K(x_i, x_j) \ \big|\ \beta_{tl} = \alpha_{tl};\ \sum_{l} \beta_{il} = 0;\ \forall\, (i, j) \in SV_1 \big\}$ and $g_{y_t k} = [0, \ldots, 1, \ldots, -1, \ldots, 0]$, with 1 at the $y_t$-th and $-1$ at the k-th position, where $k = \arg\max_{q \neq y_t} \sum_{j} \alpha^{t}_{jq}\, K(x_j, x_t)$.

Now $S_t$ can be efficiently computed using Lemma 3.

Lemma 3. Under Assumptions (i) & (ii),

$$S_t^2 = \begin{cases} \alpha_t^\top \big[ (H^{-1})_{tt} \big]^{-1} \alpha_t & t \in SV_1 \cap T \\ \alpha_t^\top \big[ K(x_t, x_t) \otimes I_L - K_t^\top H^{-1} K_t \big]\, \alpha_t & t \in SV_2 \cap T \end{cases}$$

here,

$$H := \begin{bmatrix} K_{SV_1} \otimes I_L & A \\ A^\top & 0 \end{bmatrix};\qquad A := I_{|SV_1|} \otimes (1_L);\qquad 1_L = [\,\underbrace{1\ 1\ \ldots\ 1}_{L\ \text{elements}}\,]^\top$$

$(H^{-1})_{tt}$ := sub-matrix of $H^{-1}$ for indices $i = (t-1)L + 1, \ldots, tL$; $K_{SV_1}$ := kernel matrix of the Type 1 support vectors; $K_t = [\,(k_t^\top \otimes I_L)\ \ 0_{L \times |SV_1|}\,]^\top$, where $k_t$ is the $n_{SV_1} \times 1$ dimensional vector whose i-th element is $K(x_i, x_t),\ \forall\, x_i \in SV_1$; and $\otimes$ is the Kronecker product.

Finally, we have:

Theorem 2. Under Assumptions (i) & (ii) the leave-one-out error is upper bounded as

$$R_{l.o.o} \leq \frac{1}{n}\, |\Psi_3| \qquad (9)$$
$$\Psi_3 = \Big\{\, t \in SV \cap T \ \Big|\ S_t^2 \geq \alpha_t^\top \sum_{i \in SV} \alpha_i\, K(x_i, x_t) \,\Big\}$$

where T := training set and $S_t$ is defined in Lemma 3.

Note that, similar to (Vapnik & Chapelle, 2000), Assumptions (i) & (ii) are not satisfied in most cases. Nevertheless, Theorem 2 provides a good approximation of the l.o.o procedure (see Section 4.2.2). In addition, compared to Theorem 1, it provides the following advantages:

- Eq. (9) is valid for both Type 1 & 2 training support vectors and typically results in a stricter bound.
- Span computation for all support vectors requires inverting the H matrix only once (Lemma 3). This results in an overall cost of $O((n + mL)^3)$ for computing (9) and provides a computational edge over (7), which involves a cost of $O((n + mL)^4)$.

Empirical results for model selection using Theorem 2 are provided in Section 4.2.2.

4. Empirical Results

4.1. Datasets and Experimental settings

Our empirical results use three real-life datasets:

German Traffic Sign Recognition Benchmark (GTSRB) (Stallkamp et al., 2012): The goal here is to identify the traffic signs for the speed-zones "30", "70" and "80". Here, the images are represented by their histogram of gradient (HOG 1) features. The experimental setting is provided in Table 1. For this data we use three kinds of Universum:

Random Averaging (RA): synthetically created by first selecting a random traffic sign from each class ("30", "70" and "80") in the training set and averaging them.

Non-Speed: all other non-speed zone traffic signs.

Sign "priority-road": An exhaustive search over several non-speed zone traffic signs showed this universum to provide the best performance (Appendix B.3).

Table 1: Real-life datasets.

DATASET    TRAIN/TEST SIZE                     DIMENSION
GTSRB      300 / 1500 (100 / 500 per class)    1568 (HOG features)
ABCDETC    600 / 400 (150 / 100 per class)     (100 × 100 pixel)
ISOLET     500 / 500 (100 / 100 per class)     617

Handwritten characters (ABCDETC) (Weston et al., 2006): The data consists of images of handwritten digits 0-9, uppercase letters A-Z, lowercase letters a-z and some additional symbols: ! ? , . ; : = - + / / ( ) $. The goal here is to identify the handwritten digits 0-3 based on their pixel values. We use four different types of universum: Upper (A-Z), Lower (a-z), Symbols (all additional symbols) and Random Averaging (RA), obtained by randomly averaging the training samples.

Speech-based Isolated Letter Recognition (ISOLET) (Fanty & Cole, 1991): This is a speech recognition dataset where 150 subjects spoke the name of each letter a-z twice. The goal is to identify the spoken letters a-e using the spectral coefficients, contour features, sonorant features, pre-sonorant features, and post-sonorant features. We use two different types of universum: Others, which consists of all other speech, i.e. f-z, and Random Averaging (RA).

Note that, to simplify our analysis (in Section 4.2.1), we used a subset of the training classes. However, similar results can be expected using all the training classes (Appendix B.2). Our initial experiments suggest that linear parameterization is optimal for the GTSRB dataset; hence only the linear kernel has been used for it. For the ABCDETC and ISOLET datasets an RBF kernel of the form $K(x_i, x_j) = \exp(-\gamma\, \|x_i - x_j\|^2)$ with $\gamma = 2^{-7}$ provided optimal results for SVM.
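The Random Averaging (RA) universum described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `random_averaging`, the seeded generator, and the list-of-lists feature representation are all assumptions made for the example. Each synthetic universum sample is the element-wise mean of one randomly drawn training sample per class.

```python
import random


def random_averaging(X, y, m, seed=0):
    """Create m universum samples by averaging one random sample from each class."""
    rng = random.Random(seed)
    # Group the training samples by class label.
    by_class = {}
    for x, label in zip(X, y):
        by_class.setdefault(label, []).append(x)
    classes = sorted(by_class)
    universum = []
    for _ in range(m):
        # Draw one sample per class, then average feature by feature.
        picks = [rng.choice(by_class[c]) for c in classes]
        universum.append([sum(vals) / len(picks) for vals in zip(*picks)])
    return universum
```

Because every class contributes equally to the average, such samples tend to fall between the classes, which is the intuition behind using them as contradictions.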
For all the experiments model selection is done over the range of parameters $C = [10^{-4}, \ldots, 10^{3}]$, $C^*/C = \frac{n}{mL}$ and $\Delta = [0, 0.01, 0.05, 0.1]$ using stratified 5-fold cross validation.

4.2. Results

4.2.1. Comparison between SVM vs. MU-SVM

Performance comparisons between SVM and MU-SVM for the different types of Universum are shown in Table 2. The table shows the average test error over 10 random training/test partitionings of the data in similar proportions as shown in Table 1.

Table 2: Mean (± standard deviation) of the test errors (in %) over 10 runs of the experimental setting in Table 1. MU-SVM columns: no. of universum samples.

GTSRB
  SVM                      7.54 ± 0.82
  MU-SVM, Priority-road    6.97 ± ± ± 0.78
  MU-SVM, RA               7.08 ± ± ± 0.43
  MU-SVM, Non-speed        ± ± ± 0.93
ABCDETC
  SVM                      27.1 ± 3.5
  MU-SVM, Upper            26.5 ± ± ± 4.0
  MU-SVM, Lower            25 ± ± ± 3.1
  MU-SVM, Symbols          23.5 ± ± ± 3.2
  MU-SVM, RA               23.2 ± ± ± 3.2
ISOLET
  SVM                      3.6 ± 0.3
  MU-SVM, RA               3.05 ± ± ± 0.28
  MU-SVM, Others           3.50 ± ± ± 0.3

As seen from Table 2, MU-SVM provides better generalization than SVM. In fact, for certain universum types, like Priority-road for GTSRB, and RA for ABCDETC and ISOLET, MU-SVM significantly outperforms the multiclass SVM model. In such cases, the performance gains improve significantly, up to 20-25%, with the increase in the number of universum samples, and stagnate for a significantly large universum set size. This indicates that for a sufficiently large universum set the effectiveness of MU-SVM depends mostly on the type (statistical characteristics) of the universum data.

For a better understanding of such statistical characteristics, we adopt the technique of histogram of projections originally introduced for binary classification in (Cherkassky & Dhar, 2010). However, different from binary classification, here we project a training sample (x, y = k) onto the decision space for that class, i.e. $w_k^\top x - \max_{l \neq k} w_l^\top x = 0$, and the universum samples onto the decision spaces of all the classes. Finally, we generate the histograms of the projection values for our analysis.
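The projection step can be illustrated with a short sketch for the linear case. The names (`scores`, `projection`, `histogram`) and the zero-based class index k are assumptions of this example, not the paper's notation; `W` holds one weight vector $w_l$ per class. A training sample (x, y = k) is projected only for its own class k, while a universum sample is projected for every k.

```python
def scores(W, x):
    """Decision values f_l(x) = w_l^T x for each class l."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for w in W]


def projection(W, x, k):
    """Projection value w_k^T x - max_{l != k} w_l^T x (k is zero-based here)."""
    f = scores(W, x)
    return f[k] - max(f[l] for l in range(len(f)) if l != k)


def histogram(values, edges):
    """Count values in the half-open bins [edges[b], edges[b + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(counts)):
            if edges[b] <= v < edges[b + 1]:
                counts[b] += 1
                break
    return counts
```

A projection value of at least +1 means the sample lies outside the margin border of class k, while values near 0 mean it lies close to the decision boundary, which is where the universum samples are expected to concentrate after applying MU-SVM.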
Further, in addition to the histograms, we also generate the frequency plot of the predicted labels for the universum samples. Fig. 3 shows the typical histograms and frequency plots for the SVM and MU-SVM models for the GTSRB dataset using the priority-road sign (as universum). As seen from Fig. 3, the optimal SVM model has high separability for the training samples, i.e. most of the training samples lie outside the margin borders. In fact, similar to binary SVM (Cherkassky & Dhar, 2010), we see data-piling effects for the training samples near the "+1" margin borders of the decision functions for all the classes. This is typically seen under high-dimensional low sample size settings. However, the universum samples (priority-road) are widely spread about the margin borders. Moreover, here the universum samples are biased towards the positive side of the decision boundary of the sign "30" (Fig. 3(a)) and hence predominantly get classified as sign "30" (Fig. 3(d)). As seen from Figs. 3(e)-(h), applying the MU-SVM model preserves the separability of the training samples and additionally reduces the spread of the universum samples. Such a model exhibits uncertainty on the universum samples' class membership, and uniformly assigns them over all the classes, i.e. signs "30", "70" and "80" (Fig. 3(h)). This shows that the resulting MU-SVM model has higher contradiction (uncertainty) on the universum samples and hence provides better generalization compared to SVM.

Figure 3: Typical histogram of projections for training samples (n = 300) (shown in blue) and universum samples priority-road (m = 500) (shown in red). SVM decision functions (with C = 1) for (a) sign "30", (b) sign "70", (c) sign "80". (d) Frequency plot of predicted labels for universum samples using the SVM model. MU-SVM decision functions (with $C^*/C = 0.2$, $\Delta = 0.01$) for (e) sign "30", (f) sign "70", (g) sign "80". (h) Frequency plot of predicted labels for universum samples using the MU-SVM model.

Figure 4: Typical histogram of projections for training samples (n = 300) (shown in blue) and universum samples Random Averaging (m = 500) (shown in red). SVM decision functions (with C = 1) for (a) sign "30", (b) sign "70", (c) sign "80". (d) Frequency plot of predicted labels for universum samples using the SVM model. MU-SVM decision functions (with $C^*/C = 0.2$, $\Delta = 0$) for (e) sign "30", (f) sign "70", (g) sign "80". (h) Frequency plot of predicted labels for universum samples using the MU-SVM model.

Figure 5: Typical histogram of projections for training samples (n = 300) (shown in blue) and universum samples Others (m = 500) (shown in red). SVM decision functions (with C = 1) for (a) sign "30", (b) sign "70", (c) sign "80". (d) Frequency plot of predicted labels for universum samples using the SVM model. MU-SVM decision functions (with $C^*/C = 0.2$, $\Delta = 0.05$) for (e) sign "30", (f) sign "70", (g) sign "80". (h) Frequency plot of predicted labels for universum samples using the MU-SVM model.

Fig. 4 shows the histograms and the frequency plots for the SVM and MU-SVM models for the RA universum. As shown in Figs. 4(a)-(c), the SVM model already results in a narrow distribution of the universum samples and in turn provides near-random prediction on the universum samples (Fig. 4(d)). Applying MU-SVM in this case provides no significant change to the multiclass SVM solution and hence no additional improvement in generalization (see Table 2 and Figs. 4(e)-(h)).

Finally, we provide the histograms and the frequency plots for the SVM and MU-SVM models for the Non-Speed universum samples. In this case, although the universum samples are widely spread about the SVM margin borders (Figs. 5(a)-(c)), the uncertainty on the universum samples' class membership is uniform across all the classes (Fig. 5(d)). Applying MU-SVM reduces the spread of the universum samples (Figs. 5(e)-(g)). However, it does not significantly increase the contradiction (uncertainty) on the universum samples (compare Figs. 5(d) vs. (h)). Hence, applying MU-SVM does not provide any significant improvement over the SVM model (see Table 2). The histograms for the other datasets provide similar insights and have been provided in Appendix B.4.

This section shows that for high-dimensional low sample size settings, MU-SVM provides better generalization than multiclass SVM. Under such settings the training data exhibits large data-piling effects near the margin border ("+1"). For such ill-posed settings, introducing the Universum can provide improved generalization over the multiclass SVM solution. However, the effectiveness of MU-SVM also depends on the properties of the universum data. Such statistical characteristics of the training and universum samples, which determine the effectiveness of MU-SVM, can be conveniently captured using the histogram-of-projections method introduced in this paper.

Table 3: Performance comparisons for model selection using cross validation vs. the analytic bound in Theorem 2. Train/test partitioning follows Table 1. No. of universum samples m = 1000. Model parameters used: $C^*/C = \frac{n}{mL}$, $\Delta = [0, 0.01, 0.05, 0.1]$. Columns: 5-fold CV test error (in %), 5-fold CV time ($\times 10^{4}$ sec), Theorem 2 test error (in %), Theorem 2 time ($\times 10^{4}$ sec).

GTSRB
  MU-SVM, Priority-road    5.5 ± ± ± ± 0.1
  MU-SVM, RA               6.9 ± ± ± ± 0.3
  MU-SVM, Non-speed        6.9 ± ± ± ± 0.5
ABCDETC
  MU-SVM, Upper            26.1 ± ± ± ± 0.1
  MU-SVM, Lower            24.2 ± ± ± ± 0.1
  MU-SVM, Symbols          23.3 ± ± ± ± 0.09
  MU-SVM, RA               22.1 ± ± ± ± 0.1
ISOLET
  MU-SVM, RA               2.8 ± ± ± ± 0.7
  MU-SVM, Others           3.3 ± ± ± ± 0.5

4.2.2. Effectiveness using analytic bound in Theorem 2

Next we illustrate the practical utility of the bound in Theorem 2 for model selection.
First, we provide a comparison between the error estimates using 5-fold cross validation (CV) vs. Theorem 2 (footnote 3). For illustration we use the GTSRB dataset under the experimental setting provided in Table 1. Fig. 6(a) shows the average error estimates using 5-fold CV and Theorem 2, as well as the true test error, for the MU-SVM model using priority-road over the range of parameters $C^*/C = [10^{-3}, 10^{-2}, 10^{-1}, 10^{0}]$ with fixed $\Delta = 0$. The results are obtained over 10 random partitionings of the training/test dataset. Fig. 6(a) shows that the error estimates using Theorem 2 follow a very similar pattern to 5-fold CV and the test error. This shows that the model parameter $C^*/C = 10^{-1}$ that minimizes the l.o.o error estimate in Theorem 2 also minimizes the test error and the 5-fold CV error. Hence, Theorem 2 provides a practical alternative to model selection using resampling techniques.

Throughout our results we observe that the error estimates using Theorem 2 are uniformly lower than the 5-fold CV and test errors. This can be attributed to two main reasons. First, for high-dimensional low sample size settings, the majority of the training samples lie outside the margin borders (see Figs. 3-5). This results in a significantly low proportion of training SVs, and hence a low l.o.o error in general. Secondly, Theorem 2 holds under the additional Assumptions (i) & (ii), and is further constrained compared to Theorem 1. Hence, Theorem 2 is an under-estimator of the l.o.o bound in Theorem 1. Of course, for the purpose of model selection we are only interested in the pattern, rather than the scale, of the error estimates. Hence, such a difference in scale will not impact the model selection. However, to further simplify our illustrations, we also provide a scale-invariant ranking curve of the model parameters in Fig. 6(b). The figure shows the average rankings of the model parameters based on the error estimate values over each experiment.
Here, for each experiment we rank the model parameter with the smallest error estimate as 1 and the parameter with the largest estimate as 4, and average these rank values over the 10 experiments. The parameter with the smallest average rank value (close to 1 in Fig. 6(b)) is typically selected through the model selection strategy.

Footnote 3: Note that Theorem 2 approximates Theorem 1 to provide an upper bound on the l.o.o. error. Hence, a good comparison would be between Theorem 2 vs. Theorem 1 and the l.o.o. error. However, results using l.o.o. and Theorem 1 were prohibitively slow and hence could not be reported in this paper. As an alternative, we compare the error estimates from Theorem 2 with 5-fold cross validation (CV) and the test error. The objective is to illustrate that, similar to 5-fold CV, using Theorem 2 we can obtain the optimal model parameters providing the smallest test error.

Figure 6: Performance of MU-SVM with priority-road universum for the GTSRB dataset. Here, no. of training samples (n = 300), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $C^*/C = [10^{-3}, 10^{-2}, 10^{-1}, 10^{0}]$, $C = 1$, $\Delta = 0$. (b) Ranking of the model parameters based on the error estimate values over the experiments.

Figure 7: Performance of MU-SVM with priority-road universum for the GTSRB dataset. Here, no. of training samples (n = 300), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = [0, 0.01, 0.05, 0.1]$, $C = 1$, $C^*/C = \frac{n}{mL} = 0.1$. (b) Ranking of the model parameters based on the error estimate values over the experiments.

Finally, as seen from Figs. 6(a)-(b), although different in scale, the error estimates using Theorem 2 correctly capture the pattern of the test error and select the model parameter with the smallest test error (i.e. $C^*/C = 10^{-1}$). A similar comparison over the range of parameters $\Delta = [0.001, 0.01, 0.1, 1]$ with fixed $C^*/C = \frac{n}{mL} = 0.3$ is also provided in Fig. 7. Here, compared to 5-fold CV, Theorem 2 correctly selects the optimal parameter $\Delta = 0.01$ with the smallest test error (Fig. 7(b)). As seen from Figs. 6 and 7, the model parameters minimizing the error estimates in Theorem 2 also minimize the true test error. This can also be seen for all other datasets in Table 1 (Appendix B.5). Hence, Theorem 2 provides a practical alternative to resampling techniques for model selection. This is further confirmed by the results in Table 3. Table 3 shows the average test error over 10 random training/test partitionings of the data in similar proportions as shown in Table 1. Here, the MU-SVM models selected using Theorem 2 provide similar generalization error compared to the models selected through 5-fold CV. Further, the proposed model selection strategy using Theorem 2 involves an $O((n + mL)^3)$ operation, and provides a computational edge over standard resampling techniques.
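The scale-invariant ranking used for Figs. 6(b) and 7(b) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `average_ranks` and the array layout (one row per experiment, one column per candidate parameter) are hypothetical, and ties are broken by column order, which the paper does not specify.

```python
import numpy as np

def average_ranks(errors):
    """Average rank of each candidate parameter over all experiments.

    errors: array-like of shape (n_experiments, n_params) holding an error
    estimate per experiment and parameter (hypothetical layout).
    Rank 1 goes to the smallest estimate in an experiment, rank n_params
    to the largest; ties are broken by column order (an assumption).
    """
    errors = np.asarray(errors, dtype=float)
    n_exp, n_par = errors.shape
    # Column indices sorted by increasing error within each experiment.
    order = np.argsort(errors, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(n_exp)[:, None]
    # Scatter the ranks 1..n_params back to the original column positions.
    ranks[rows, order] = np.arange(1, n_par + 1)
    return ranks.mean(axis=0)

def select_parameter(errors):
    """Index of the parameter with the smallest average rank."""
    return int(np.argmin(average_ranks(errors)))
```

With four candidate values of $C^*/C$ and 10 experiments, `errors` would have shape (10, 4), and `select_parameter` returns the column that the ranking curve places closest to 1.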
Table 3 provides the average time (in seconds) for the MU-SVM model selection using Theorem 2 vs. 5-fold CV for 10 runs over the entire range of parameters. The experiments were run on a desktop with a 12-core Intel processor and 32 GB RAM. As seen from Table 3, the bound-based model selection is 2-4 times faster than the standard 5-fold resampling technique.

5. Conclusions

We introduced a new universum-based formulation for multiclass SVM (MU-SVM). The proposed formulation embodies several useful mathematical properties amenable to: a) an efficient implementation of the MU-SVM formulation using existing multiclass SVM solvers, and b) deriving practical analytic bounds for model selection. We empirically demonstrated the effectiveness of the proposed formulation as well as the bound on real-world datasets. In addition, we also provided insights into the underlying behavior of universum learning and its dependence on the choice of universum samples using the proposed histogram-of-projections method.

References

Chen, Shuo and Zhang, Changshui. Selecting informative universum sample for semi-supervised learning. In IJCAI, 2009.

Chen, Xiaohong, Yin, Hujun, Hu, Menglei, and Wang, Liping. Universum Discriminant Canonical Correlation Analysis. Springer International Publishing, Cham, 2017.

Cherkassky, Vladimir and Dhar, Sauptik. Simple method for interpretation of high-dimensional nonlinear SVM classification models. In Stahlbock, Robert, Crone, Sven F., Abou-Nasr, Mahmoud, Arabnia, Hamid R., Kourentzes, Nikolaos, Lenca, Philippe, Lippe, Wolfram-Manfred, and Weiss, Gary M. (eds.), DMIN. CSREA Press.

Cherkassky, Vladimir and Mulier, Filip M. Learning from Data: Concepts, Theory, and Methods. Wiley-IEEE Press, 2007.

Cherkassky, Vladimir, Dhar, Sauptik, and Dai, Wuyang. Practical conditions for effectiveness of the universum learning. Neural Networks, IEEE Transactions on, 22(8), 2011.

Crammer, Koby and Singer, Yoram. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2-3), 2002.

Dhar, Sauptik and Cherkassky, Vladimir. Development and evaluation of cost-sensitive universum-SVM. Cybernetics, IEEE Transactions on, 45(4), 2015.

Dhar, Sauptik and Cherkassky, Vladimir. Universum learning for SVM regression. In Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017.

Dhar, Sauptik, Ramakrishnan, Naveen, Cherkassky, Vladimir, and Shah, Mohak. On multiclass universum learning.

Dhar, Sauptik, Ramakrishnan, Naveen, Cherkassky, Vladimir, and Shah, Mohak. Universum learning for multiclass SVM. arXiv preprint.

Dietterich, Thomas G and Bakiri, Ghulum. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2.

Fanty, Mark and Cole, Ronald. Spoken letter recognition. In Advances in Neural Information Processing Systems.

Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. MIT Press, 2016. deeplearningbook.org.

Hsu, Chih-Wei and Lin, Chih-Jen. A comparison of methods for multiclass support vector machines. Neural Networks, IEEE Transactions on, 13(2), 2002.

Japkowicz, Nathalie and Shah, Mohak. Evaluating learning algorithms: a classification perspective. Cambridge University Press.

Lu, Shuxia and Tong, Le. Weighted twin support vector machine with universum. Advances in Computer Science: an International Journal, 3(2):17-23, 2014.

Luntz, Aleksandr. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica.

Platt, John C, Cristianini, Nello, and Shawe-Taylor, John. Large margin DAGs for multiclass classification. In NIPS, volume 12.

Qi, Zhiquan, Tian, Yingjie, and Shi, Yong. A nonparallel support vector machine for a classification problem with universum learning. Journal of Computational and Applied Mathematics, 263, 2014.

Schölkopf, Bernhard and Smola, Alexander J. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press.

Shen, Chunhua, Wang, Peng, Shen, Fumin, and Wang, Hanzi. UBoost: Boosting with the universum. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(4), 2012.

Sinz, F. A priori knowledge from non-examples. PhD thesis.

Sinz, F. H., Chapelle, O., Agarwal, A., and Schölkopf, B. An analysis of inference with the universum. In Advances in Neural Information Processing Systems 20, NY, USA, 2008. Curran.

Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks.

Vapnik, V. Estimation of Dependences Based on Empirical Data (Information Science and Statistics). Springer.

Vapnik, Vladimir and Chapelle, Olivier. Bounds on error expectation for support vector machines. Neural Computation, 12(9), 2000.

Vapnik, Vladimir N. Statistical Learning Theory. Wiley-Interscience.

Wang, Zhe and Xue, Xiangyang. Multi-class support vector machine. In Support Vector Machines Applications. Springer.

Wang, Zhe, Zhu, Yujin, Liu, Wenwen, Chen, Zhihua, and Gao, Daqi. Multi-view learning with universum. Knowledge-Based Systems, 70, 2014.

Weston, Jason and Watkins, Chris. Multi-class support vector machines. Technical report, Citeseer.

Weston, Jason, Collobert, Ronan, Sinz, Fabian, Bottou, Léon, and Vapnik, Vladimir. Inference with the universum. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.

Xu, Yitian, Chen, Mei, and Li, Guohui. Least squares twin support vector machine with universum data for classification. International Journal of Systems Science, pp. 1-9, 2015.

Xu, Yitian, Chen, Mei, Yang, Zhiji, and Li, Guohui. ν-twin support vector machine with universum data for classification. Applied Intelligence, 44(4), 2016.

Zhang, Dan, Wang, Jingdong, Wang, Fei, and Zhang, Changshui. Semi-supervised classification with universum. In SDM. SIAM, 2008.

Zhang, Xiang and LeCun, Yann. Universum prescription: Regularization using unlabeled data. In AAAI.

Zhu, Changming. Improved multi-kernel classification machine with Nyström approximation technique and universum data. Neurocomputing, 175, 2016.

Appendix

Contents

A Proofs
A.1 Proof of Proposition 1
A.2 Proof of Proposition 2
A.3 Derivation of Algorithm 1
A.4 Proof of Lemma 1
A.5 Proof of Theorem 1
A.6 Proof of Proposition 3
A.7 Proof of Lemma 2
A.8 Proof of Lemma 3
A.9 Proof of Theorem 2
B Additional Results
B.1 ECOC vs. Direct Approach
B.2 SVM vs. MU-SVM using all training classes
B.3 Performance comparisons for several Universum types with varying training set size for GTSRB dataset
B.4 Additional Histogram of Projections
B.5 Comparison of the error estimates using 5-fold CV vs. Theorem 2

A Proofs

The references cited in this document follow the numbering used in the main paper.

A.1 Proof of Proposition 1

Such a proposition is available for multiclass SVMs (Crammer & Singer, 2002). Here, we provide a proof for the MU-SVM formulation. Formulation (2) for binary classification becomes,

$$
\begin{aligned}
\min_{w_1, w_2, \xi, \zeta} \quad & \frac{1}{2}\left(\|w_1\|_2^2 + \|w_2\|_2^2\right) + C\sum_{i=1}^{n}\xi_i + C^*\sum_{i'=1}^{m}\left(\zeta_{i'1} + \zeta_{i'2}\right) \\
\text{s.t.} \quad & (w_{y_i} - w_l)^\top x_i \ge e_{il} - \xi_i; \quad e_{il} = 1 - \delta_{il}, \quad l = 1, 2 \\
& \left|\left(w_k^\top x^*_{i'} - \max_{l=1,2} w_l^\top x^*_{i'}\right)\right| \le \Delta + \zeta_{i'k}; \quad \zeta_{i'k} \ge 0, \quad k = 1, 2 \\
& i = 1 \ldots n, \quad i' = 1 \ldots m, \quad \delta_{il} = \begin{cases} 1; & y_i = l \\ 0; & y_i \ne l \end{cases}
\end{aligned}
\tag{10}
$$

The constraints become,

Training samples ($i = 1 \ldots n$)

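For any $x_i \in$ class 1 labeled as $y_i = +1$, we have

$$
\begin{aligned}
(w_1 - w_1)^\top x_i \ge -\xi_i \;&\Rightarrow\; \xi_i \ge 0 \\
(w_1 - w_2)^\top x_i \ge 1 - \xi_i \;&\Rightarrow\; y_i (w_1 - w_2)^\top x_i \ge 1 - \xi_i
\end{aligned}
$$

Similarly, for any $x_i \in$ class 2 labeled as $y_i = -1$, we have

$$
\begin{aligned}
(w_2 - w_1)^\top x_i \ge 1 - \xi_i \;&\Rightarrow\; y_i (w_1 - w_2)^\top x_i \ge 1 - \xi_i \\
(w_2 - w_2)^\top x_i \ge -\xi_i \;&\Rightarrow\; \xi_i \ge 0
\end{aligned}
$$

Universum samples ($i' = 1 \ldots m$)

For any universum sample $x^*_{i'}$, WLOG we assume $w_1^\top x^*_{i'} \ge w_2^\top x^*_{i'}$. Then:

When $k = 1$ we have $|w_1^\top x^*_{i'} - \max_{l=1,2} w_l^\top x^*_{i'}| \le \Delta + \zeta_{i'k} \Rightarrow \zeta_{i'k} \ge -\Delta$ (true, since $\zeta_{i'k} \ge 0$).

When $k = 2$ we have $|w_2^\top x^*_{i'} - \max_{l=1,2} w_l^\top x^*_{i'}| \le \Delta + \zeta_{i'k} \Rightarrow |w_2^\top x^*_{i'} - w_1^\top x^*_{i'}| \le \Delta + \zeta_{i'k}$, $\zeta_{i'k} \ge 0$.

Hence, eq. (10) can be re-written as,

$$
\begin{aligned}
\min_{w_1, w_2, \xi, \zeta} \quad & \frac{1}{2}\left(\|w_1\|_2^2 + \|w_2\|_2^2\right) + C\sum_{i=1}^{n}\xi_i + C^*\sum_{i'=1}^{m}\zeta_{i'} \\
\text{s.t.} \quad & y_i (w_1 - w_2)^\top x_i \ge 1 - \xi_i; \quad \xi_i \ge 0, \quad i = 1 \ldots n \\
& |(w_1 - w_2)^\top x^*_{i'}| \le \Delta + \zeta_{i'}; \quad \zeta_{i'} \ge 0, \quad i' = 1 \ldots m
\end{aligned}
\tag{11}
$$

The solution to the KKT system of (11) satisfies $w_1 = -w_2$. Hence replacing $w = w_1 - w_2$ in (11) still solves (10). This is the U-SVM formulation in (Weston et al., 2006) with $b = 0$.

A.2 Proof of Proposition 2

The contributions of the universum samples are the same for both (2) and (3). For any universum sample $x^*_{i'}$ we identify the active constraints and its overall contribution to the objective function through the slack variables, i.e.

Equation (2): the overall contribution of the universum sample $x^*_{i'}$ is,

$$
C^* \sum_{k=1}^{L} \zeta_{i'k} \quad \text{s.t.} \quad \left|w_k^\top x^*_{i'} - \max_{l=1 \ldots L} w_l^\top x^*_{i'}\right| \le \Delta + \zeta_{i'k}, \quad \zeta_{i'k} \ge 0, \quad k = 1 \ldots L
$$

Case 1: If $k = \arg\max_{l=1 \ldots L} w_l^\top x^*_{i'}$, the constraint is inactive and $\zeta_{i'k} = 0$.

Case 2: Let $k \ne \arg\max_{l=1 \ldots L} w_l^\top x^*_{i'}$. Since $\zeta_{i'k} \ge 0$, the constraint is active if $|(w_k^\top x^*_{i'} - \max_{l \ne k} w_l^\top x^*_{i'})| > \Delta$. Then $\zeta_{i'k} = [-\Delta + |(w_k^\top x^*_{i'} - \max_{l \ne k} w_l^\top x^*_{i'})|]$.

Hence, keeping only the active constraints, the overall contribution of the sample $x^*_{i'}$ is,

$$
C^* \sum_{k \in K_{i'}} \left[-\Delta + \left|w_k^\top x^*_{i'} - \max_{l \ne k} w_l^\top x^*_{i'}\right|\right] \quad \text{where } K_{i'} = \left\{k \;\middle|\; \left|\left(w_k^\top x^*_{i'} - \max_{l \ne k} w_l^\top x^*_{i'}\right)\right| > \Delta\right\}
\tag{12}
$$

Equation (3): following eq. (3), for the universum sample $x^*_{i'}$ we have $L$ artificial samples $(x^*_{i'}, y_{i'1} = 1), \ldots, (x^*_{i'}, y_{i'L} = L)$ stacked at indices $i = n + (i'-1)L + 1, \ldots, n + i'L$. Hence for $x^*_{i'}$ we have the overall contribution as,

$$
C^* \sum_{i = n + (i'-1)L + 1}^{n + i'L} \xi_i \quad \text{s.t.} \quad (w_{y_i} - w_l)^\top x_i \ge -\Delta(1 - \delta_{il}) - \xi_i
$$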

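Now, for $i = n + (i'-1)L + k$, we have $x_i = x^*_{i'}$, $y_i = k$. The constraints are,

$$
\begin{aligned}
(w_k - w_1)^\top x^*_{i'} \ge -\Delta - \xi_i \;&\Rightarrow\; -(w_k - w_1)^\top x^*_{i'} \le \Delta + \xi_i \\
&\;\;\vdots \\
(w_k - w_k)^\top x^*_{i'} \ge -\xi_i \;&\quad \text{(inactive, but ensures } \xi_i \ge 0\text{)} \\
&\;\;\vdots \\
(w_k - w_L)^\top x^*_{i'} \ge -\Delta - \xi_i \;&\Rightarrow\; -(w_k - w_L)^\top x^*_{i'} \le \Delta + \xi_i
\end{aligned}
$$

This is equivalent to $|(w_k^\top x^*_{i'} - \max_{l \ne k} w_l^\top x^*_{i'})| \le \Delta + \xi_i$. Since $\xi_i \ge 0$, the constraint is active if $|(w_k^\top x^*_{i'} - \max_{l \ne k} w_l^\top x^*_{i'})| > \Delta$, and the contribution becomes $\xi_i = [-\Delta + |w_k^\top x^*_{i'} - \max_{l \ne k} w_l^\top x^*_{i'}|]$. Combining all contributions we get,

$$
\begin{aligned}
& C^* \sum_{i = n + (i'-1)L + 1}^{n + i'L} \xi_i \quad \text{s.t.} \quad (w_{y_i} - w_l)^\top x_i \ge -\Delta(1 - \delta_{il}) - \xi_i \\
&= C^* \sum_{k \in K_{i'}} \left[-\Delta + \left|w_k^\top x^*_{i'} - \max_{l \ne k} w_l^\top x^*_{i'}\right|\right] \quad \text{where } K_{i'} = \left\{k \;\middle|\; \left|\left(w_k^\top x^*_{i'} - \max_{l \ne k} w_l^\top x^*_{i'}\right)\right| > \Delta\right\}
\end{aligned}
\tag{13}
$$

Comparing (12) and (13), the universum sample has the same contribution to both objective functions in (2) and (4). This is valid for all universum samples.

A.3 Derivation of Algorithm 1

In this section we provide the KKT system for (4) and the derivation of the dual form in (5). The proof is available in (Crammer & Singer, 2002) and (Hsu & Lin, 2002a). We reproduce it for completeness and for better readability of the subsequent proofs. The Lagrangian of the MU-SVM formulation is given as,

$$
\mathcal{L} = \frac{1}{2}\sum_{l} \|w_l\|_2^2 + \sum_{i=1}^{n+mL} C_i \xi_i - \sum_{i,l} \eta_{il}\left[(w_{y_i} - w_l)^\top x_i - e_{il} + \xi_i\right]
\tag{14}
$$

KKT system:

$$
\begin{aligned}
& \nabla_{w_l} \mathcal{L} = 0 \;\Rightarrow\; w_l = \sum_{i} (C_i \delta_{il} - \eta_{il})\, x_i \\
& \nabla_{\xi_i} \mathcal{L} = 0 \;\Rightarrow\; \sum_{l} \eta_{il} = C_i \\
& \text{Complementary slackness: } \eta_{il}\left[(w_{y_i} - w_l)^\top x_i - e_{il} + \xi_i\right] = 0 \quad \forall (i, l) \\
& \text{Constraints: } (w_{y_i} - w_l)^\top x_i - e_{il} + \xi_i \ge 0 \quad \forall (i, l); \qquad \eta_{il} \ge 0
\end{aligned}
\tag{15}
$$

Finally the dual problem is,

$$
\begin{aligned}
\max_{\eta} \quad & -\frac{1}{2}\sum_{i,j}\sum_{l} (C_i \delta_{il} - \eta_{il})(C_j \delta_{jl} - \eta_{jl})\, K(x_i, x_j) + \sum_{i,l} \eta_{il} e_{il} \\
\text{s.t.} \quad & \sum_{l} \eta_{il} = C_i; \quad \eta_{il} \ge 0
\end{aligned}
\tag{16}
$$

Setting $\alpha_{il} = C_i \delta_{il} - \eta_{il}$ we get (5).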
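The stacking of artificial samples used in A.2, and the per-sample quantities $C_i$ and $e_{il}$ that enter the Lagrangian in A.3, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `expand_universum` is hypothetical, labels are assumed to be 0-based (the paper uses $1 \ldots L$), and the choices $C_i = C$ for training rows, $C_i = C^*$ for artificial rows, and $e_{il} = -\Delta(1 - \delta_{il})$ for artificial rows follow my reading of the constraints above.

```python
import numpy as np

def expand_universum(X, y, X_univ, n_classes, C, C_star, delta):
    """Stack n training samples with L artificial copies of each of the
    m universum samples, one copy per class label.

    Returns (X_all, y_all, e, C_vec) where
      X_all : (n + m*L, d) stacked inputs
      y_all : (n + m*L,)   labels in 0..L-1 (assumed 0-based)
      e     : (n + m*L, L) margins e_il
      C_vec : (n + m*L,)   per-sample costs C_i
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    X_univ = np.asarray(X_univ, dtype=float)
    n, m, L = len(X), len(X_univ), n_classes

    # Universum sample i' occupies rows n + i'*L ... n + i'*L + L - 1,
    # carrying the labels 0, 1, ..., L-1 in that order.
    X_art = np.repeat(X_univ, L, axis=0)
    y_art = np.tile(np.arange(L), m)

    X_all = np.vstack([X, X_art])
    y_all = np.concatenate([y, y_art])

    onehot = np.eye(L)[y_all]                 # delta_il
    e = 1.0 - onehot                          # training rows: 1 - delta_il
    e[n:] = -delta * (1.0 - onehot[n:])       # artificial rows: -Delta (1 - delta_il)

    C_vec = np.concatenate([np.full(n, C), np.full(m * L, C_star)])
    return X_all, y_all, e, C_vec
```

The returned arrays have $n + mL$ rows, which is the problem size behind the $O((n + mL)^3)$ cost quoted for the bound-based model selection.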

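A.4 Proof of Lemma 1

First we prove some interesting properties specific to the MU-SVM solution.

Lemma A.1. $\forall \alpha_i \in SV_1 = \{i \mid 0 < \alpha_{il} < C_i;\; y_i = l\}$,

i. $\sum_k \alpha_{ik}\left[\sum_j \alpha_{jk} K(x_i, x_j) + e_{ik}\right] = 0$; $k = 1 \ldots L$.

ii. $\forall k \ne y_i$ with $\alpha_{ik} < 0$ (strict): $\sum_j \alpha_{jk} K(x_i, x_j) + e_{ik} = \sum_j \alpha_{jy_i} K(x_i, x_j) + e_{iy_i}$, i.e. the projection values for the type 1 support vectors for such classes are equal.

iii. For any $\gamma_i \in \{\gamma_i \mid \sum_k \gamma_{ik} = 0;\; \gamma_{ik} = 0 \text{ if } \alpha_i \in SV_1 \text{ and } \alpha_{ik} = 0\}$ we have $\sum_k \gamma_{ik}\left[\sum_j \alpha_{jk} K(x_i, x_j) + e_{ik}\right] = 0$.

Proof. For simplicity we provide the proof for the linear kernel. The same proof applies for non-linear transformations. The proof uses the KKT system for (4) (Appendix A.3).

i. From (15),

$$
\begin{aligned}
\sum_k \eta_{ik} (w_{y_i} - w_k)^\top x_i
&= \sum_k \eta_{ik} \Big(\sum_l \delta_{il} w_l\Big)^\top x_i - \sum_k \eta_{ik} w_k^\top x_i \\
&= C_i \sum_l \delta_{il} w_l^\top x_i - \sum_k \eta_{ik} w_k^\top x_i
= \sum_k (C_i \delta_{ik} - \eta_{ik})\, w_k^\top x_i \\
&= \sum_k \alpha_{ik} \sum_j \alpha_{jk} K(x_i, x_j)
\end{aligned}
$$

From complementary slackness, if $\alpha_{il} < C_i$ with $y_i = l$, then $\eta_{il} = (C_i \delta_{il} - \alpha_{il}) > 0$. This gives $(w_{y_i = l} - w_{k = l})^\top x_i - e_{ik=l} + \xi_i = 0 \Rightarrow \xi_i = 0$ (i.e. the sample lies on the margin). Now, from complementary slackness in (15),

$$
\sum_k \eta_{ik}\left[(w_{y_i} - w_k)^\top x_i - e_{ik}\right] = 0
\;\Rightarrow\;
\sum_k \alpha_{ik}\Big[\sum_j \alpha_{jk} K(x_i, x_j) + e_{ik}\Big] = 0
\qquad \Big[\because \sum_k \eta_{ik} e_{ik} = \sum_k (C_i \delta_{ik} - \alpha_{ik})\, e_{ik} = -\sum_k \alpha_{ik} e_{ik}\Big]
$$

ii. From complementary slackness (15),

$$
\begin{aligned}
& \eta_{ik}\left[(w_{y_i} - w_k)^\top x_i - e_{ik}\right] = 0 \qquad (\forall k \ne y_i;\; \alpha_{ik} < 0,\; \xi_i = 0) \\
\Rightarrow\; & (w_{y_i} - w_k)^\top x_i - e_{ik} = 0 \qquad (\because \eta_{ik} > 0) \\
\Rightarrow\; & w_{y_i}^\top x_i = w_k^\top x_i + e_{ik} \\
\Rightarrow\; & \sum_j \alpha_{jy_i} K(x_i, x_j) + e_{iy_i} = \sum_j \alpha_{jk} K(x_i, x_j) + e_{ik} \qquad (k \ne y_i)
\end{aligned}
$$

iii. For any such $\gamma_i$,

$$
\begin{aligned}
\sum_k \gamma_{ik}\Big[\sum_j \alpha_{jk} K(x_i, x_j) + e_{ik}\Big]
&= \gamma_{iy_i} \sum_j \alpha_{jy_i} K(x_i, x_j) + \sum_{k \ne y_i,\, \alpha_{ik} < 0} \gamma_{ik}\Big[\sum_j \alpha_{jk} K(x_i, x_j) + e_{ik}\Big] \\
&= \Big(\gamma_{iy_i} + \sum_{k \ne y_i} \gamma_{ik}\Big)\Big[\sum_j \alpha_{jy_i} K(x_i, x_j)\Big] \qquad (\text{from ii above and } e_{iy_i} = 1 - \delta_{iy_i} = 0) \\
&= 0 \qquad \Big(\because \sum_k \gamma_{ik} = 0 \text{ by construction}\Big)
\end{aligned}
$$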

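With the above properties of the MU-SVM solution, we provide the proof of Lemma 1 following similar lines as in (Vapnik & Chapelle, 2000). We restate the lemma here for better readability.

Lemma 1. If in the leave-one-out procedure a Type 1 (training) support vector $x_t \in SV_1 \cap T$ is recognized incorrectly, then we have

$$
S_t \max\left(\sqrt{2}\,D, \frac{1}{\sqrt{C}}\right) > 1
$$

where,

$$
\begin{aligned}
S_t^2 = \min_{\beta} \quad & \sum_{i,j}\Big(\sum_l \beta_{il}\beta_{jl}\Big) K(x_i, x_j) \\
\text{s.t.} \quad & \alpha_{il} - \beta_{il} \le C_i; \quad \forall \{(i \ne t, l) \mid \alpha_{il} < C_i;\; l = y_i\} \\
& \alpha_{il} - \beta_{il} \le 0; \quad \forall \{(i \ne t, l) \mid \alpha_{il} < 0;\; l \ne y_i\} \\
& \beta_{il} = 0; \quad \forall (i, l) \notin SV_1 \cup \{t\} \\
& \beta_{tl} = \alpha_{tl} \;\; \forall l; \quad \sum_l \beta_{il} = 0
\end{aligned}
$$

$D$ = diameter of the smallest hypersphere containing all training samples, and $T$ = training set.

Proof. The leave-one-out formulation for MU-SVM with the $t$-th sample ($t \in T$) dropped is,

$$
\begin{aligned}
\max_{\alpha} \quad & W(\alpha) = -\frac{1}{2}\sum_{i,j}\sum_l \alpha_{il}\alpha_{jl} K(x_i, x_j) - \sum_{i,l} \alpha_{il} e_{il} \\
\text{s.t.} \quad & \sum_l \alpha_{il} = 0 \\
& \alpha_{il} \le C_i \text{ if } l = y_i; \quad \alpha_{il} \le 0 \text{ if } l \ne y_i \\
& \alpha_{tl} = 0 \;\; \forall l \quad \text{(additional constraint)}
\end{aligned}
\tag{17}
$$

Then, the leave-one-out (l.o.o.) error is given as

$$
R_{l.o.o} = \frac{1}{n}\sum_{t=1}^{n} \mathbb{1}[y_t \ne \hat{y}_t]
$$

where $\alpha^t = [\underbrace{\alpha^t_{11}, \ldots, \alpha^t_{1L}}_{\alpha^t_1}, \ldots, \underbrace{\alpha^t_{t1} = 0, \ldots, \alpha^t_{tL} = 0}_{\alpha^t_t = 0}, \ldots]$ is the solution of (17) and $\hat{y}_t = \arg\max_l \sum_i \alpha^t_{il} K(x_i, x_t)$ is the estimated class label for the $t$-th sample. The overall proof of the bound on the l.o.o. error follows three major steps.

First, we construct a feasible solution for (5) using the optimal leave-one-out solution $\alpha^t$, i.e. we construct $\alpha^t + \gamma$ as shown below,

$$
\begin{aligned}
& \alpha^t_{il} + \gamma_{il} \le C_i; \quad \forall (i, l) \in \{(i, l) \mid 0 < \alpha^t_{il} < C_i;\; l = y_i\} := A^t_1 \\
& \alpha^t_{il} + \gamma_{il} \le 0; \quad \forall (i, l) \in \{(i, l) \mid \alpha^t_{il} < 0;\; l \ne y_i\} := A^t_2 \\
& \gamma_{il} = 0; \quad \forall (i, l) \notin SV^t_1 \quad [SV^t_1 = A^t_1 \cup A^t_2] \\
& \sum_l \gamma_{il} = 0
\end{aligned}
\tag{18}
$$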

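Now,

$$
\begin{aligned}
I_1 &= W(\alpha^t + \gamma) - W(\alpha^t) \\
&= -\frac{1}{2}\sum_{i,j}\sum_l (\alpha^t_{il} + \gamma_{il})(\alpha^t_{jl} + \gamma_{jl}) K(x_i, x_j) - \sum_{i,l} (\alpha^t_{il} + \gamma_{il})\, e_{il} + \frac{1}{2}\sum_{i,j}\sum_l \alpha^t_{il}\alpha^t_{jl} K(x_i, x_j) + \sum_{i,l} \alpha^t_{il} e_{il} \\
&= -\frac{1}{2}\sum_{i,j}\Big(\sum_l \gamma_{il}\gamma_{jl}\Big) K(x_i, x_j) - \sum_{i,j}\Big(\sum_l \gamma_{il}\alpha^t_{jl}\Big) K(x_i, x_j) - \sum_{i,l} \gamma_{il} e_{il} \\
&= -\frac{1}{2}\sum_{i,j}\Big(\sum_l \gamma_{il}\gamma_{jl}\Big) K(x_i, x_j) - \sum_{i,l} \gamma_{il}\Big[\sum_j \alpha^t_{jl} K(x_i, x_j) + e_{il}\Big] \\
&= -\frac{1}{2}\sum_{i,j}\Big(\sum_l \gamma_{il}\gamma_{jl}\Big) K(x_i, x_j) - \sum_l \gamma_{tl}\Big[\sum_j \alpha^t_{jl} K(x_j, x_t) + e_{tl}\Big] \qquad \text{(Lemma A.1 (iii))}
\end{aligned}
\tag{19}
$$

As a special case we set $\gamma_t = [\ldots, \underbrace{a}_{y_t\text{-th}}, \ldots, \underbrace{-a}_{k\text{-th}}, \ldots] = a\, g_{y_t k}$, where $k = \arg\max_{q \ne y_t} \sum_j \alpha^t_{jq} K(x_j, x_t)$ and $g_{y_t k} = [\ldots, \underbrace{1}_{y_t\text{-th}}, \ldots, \underbrace{-1}_{k\text{-th}}, \ldots]$. Further, we select another $p \in SV^t_1$ with $\gamma_p = -a\, g_{y_t k}$. Finally, we set $\gamma_i = 0 \;\forall i \notin \{t, p\}$. For such a case,

$$
I_1 = -a^2 \|x_t - x_p\|^2 + a\Big[1 - \Big(\sum_j \alpha^t_{jy_t} K(x_j, x_t) - \sum_j \alpha^t_{jk} K(x_j, x_t)\Big)\Big],
$$

and, since $\|x_t - x_p\| \le D$, choosing $a = \hat{a}$ gives

$$
I_1 \ge -\hat{a}^2 D^2 + \hat{a}\Big[1 - \Big(\sum_j \alpha^t_{jy_t} K(x_j, x_t) - \sum_j \alpha^t_{jk} K(x_j, x_t)\Big)\Big]
\tag{20}
$$

with $\hat{a} = \frac{1}{2D^2}\big[1 - \big(\sum_j \alpha^t_{jy_t} K(x_j, x_t) - \sum_j \alpha^t_{jk} K(x_j, x_t)\big)\big]$ (the value that maximizes the R.H.S. in (20)) and $D$ = diameter of the smallest hypersphere containing all training samples.

Now, if $\hat{a} \le C$,

$$
I_1 \ge \frac{1}{4D^2}\Big[1 - \Big(\sum_j \alpha^t_{jy_t} K(x_j, x_t) - \sum_j \alpha^t_{jk} K(x_j, x_t)\Big)\Big]^2 = \frac{1}{2}\hat{a}\Big[1 - \Big(\sum_j \alpha^t_{jy_t} K(x_j, x_t) - \sum_j \alpha^t_{jk} K(x_j, x_t)\Big)\Big];
$$

else (taking $a = C$),

$$
I_1 \ge -C^2 D^2 + C\Big[1 - \Big(\sum_j \alpha^t_{jy_t} K(x_j, x_t) - \sum_j \alpha^t_{jk} K(x_j, x_t)\Big)\Big] = 2CD^2\Big[\hat{a} - \frac{C}{2}\Big] \ge 2CD^2\,\frac{\hat{a}}{2}.
$$

If there is an error due to the leave-one-out procedure, then $\max_{q \ne y_t} \sum_j \alpha^t_{jq} K(x_j, x_t) > \sum_j \alpha^t_{jy_t} K(x_j, x_t)$. This gives,

$$
I_1 > \frac{1}{2}\min\Big(C, \frac{1}{2D^2}\Big) \qquad \text{(for l.o.o. error)}
\tag{21}
$$

Second, we construct a feasible solution for the leave-one-out formulation (17) using the optimal solution of (5), i.e. we construct $\alpha - \beta$ as shown below,

$$
\begin{aligned}
& \alpha_{il} - \beta_{il} \le C_i; \quad \forall (i, l) \in A_1 - \{t\}; \quad A_1 = \{(i, l) \mid 0 < \alpha_{il} < C_i;\; l = y_i\} \\
& \alpha_{il} - \beta_{il} \le 0; \quad \forall (i, l) \in A_2 - \{t\}; \quad A_2 = \{(i, l) \mid \alpha_{il} < 0;\; l \ne y_i\} \\
& \sum_l \beta_{il} = 0; \quad \beta_{il} = 0 \;\; \forall (i, l) \notin SV_1 \cup \{t\} \\
& \beta_{tl} = \alpha_{tl}
\end{aligned}
\tag{22}
$$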

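with $SV_1 = A_1 \cup A_2 = \{i \mid 0 < \alpha_{iy_i} < C_i\}$, such that it is a feasible solution for (17). As before, define

$$
\begin{aligned}
I_2 &= W(\alpha) - W(\alpha - \beta) \\
&= -\frac{1}{2}\sum_{i,j}\sum_l \alpha_{il}\alpha_{jl} K(x_i, x_j) - \sum_{i,l} \alpha_{il} e_{il} + \frac{1}{2}\sum_{i,j}\sum_l (\alpha_{il} - \beta_{il})(\alpha_{jl} - \beta_{jl}) K(x_i, x_j) + \sum_{i,l} (\alpha_{il} - \beta_{il})\, e_{il} \\
&= \frac{1}{2}\sum_{i,j}\Big(\sum_l \beta_{il}\beta_{jl}\Big) K(x_i, x_j) - \sum_{i,l} \beta_{il}\Big[\sum_j \alpha_{jl} K(x_j, x_i) + e_{il}\Big] \\
&= \frac{1}{2}\sum_{i,j}\Big(\sum_l \beta_{il}\beta_{jl}\Big) K(x_i, x_j) \qquad \text{(Lemma A.1 (iii))}
\end{aligned}
\tag{23}
$$

Third, as the final step define,

$$
\begin{aligned}
S_t^2 = \min_{\beta} \quad & \sum_{i,j}\Big(\sum_l \beta_{il}\beta_{jl}\Big) K(x_i, x_j) \\
\text{s.t.} \quad & \alpha_{il} - \beta_{il} \le C_i; \quad \forall (i, l) \in A_1 - \{t\} \\
& \alpha_{il} - \beta_{il} \le 0; \quad \forall (i, l) \in A_2 - \{t\} \\
& \beta_{il} = 0; \quad \forall (i, l) \notin SV_1 \cup \{t\} \\
& \beta_{tl} = \alpha_{tl}; \quad \sum_l \beta_{il} = 0
\end{aligned}
\tag{24}
$$

Now, let $\beta^*$ be the minimizer of (24). For such a $\beta^*$,

$$
I_2 \Big(= \frac{1}{2} S_t^2\Big) \ge I_1 \quad \big[\because W(\alpha) \ge W(\alpha^t + \gamma) \;\forall \gamma; \;\; W(\alpha - \beta) \le W(\alpha^t) \;\forall \beta\big] \quad > \frac{1}{2}\min\Big(C, \frac{1}{2D^2}\Big) \quad \text{(from (21))}
$$

A.5 Proof of Theorem 1

Theorem 1. The leave-one-out error is upper bounded as:

$$
R_{l.o.o} \le \frac{1}{n}\left(|\Psi_1| + |\Psi_2|\right)
\tag{25}
$$

$$
\Psi_1 := \{t \in SV_2 \cap T\}; \qquad \Psi_2 := \Big\{t \in SV_1 \cap T \;\Big|\; S_t \max\Big(\sqrt{2}\,D, \frac{1}{\sqrt{C}}\Big) > 1\Big\}
$$

where $|\cdot|$ := cardinality of a set and $T$ := training set.

Proof. The proof depends on the contribution of a sample to the leave-one-out error.

First, a sample $(x_t, y_t)$ which is not a support vector, i.e. $t \notin SV$ and $t \in T$ (training set), lies outside the margin borders. Dropping such a sample does not change the original solution of (5). Hence, it does not contribute to an error.

Secondly, for a sample $(x_t, y_t) \in SV_1 \cap T$ contributing to the leave-one-out error, Lemma 1 holds, i.e. $S_t \max\big(\sqrt{2}\,D, \frac{1}{\sqrt{C}}\big) > 1$.

Finally, each sample $(x_t, y_t)$ with $t \in SV_2 \cap T$ is counted as a leave-one-out error.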
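Once the spans $S_t$ are available, the bound of Theorem 1 reduces to a counting step. The sketch below illustrates only that counting, under stated assumptions: the function name `loo_bound_theorem1` and its argument layout are hypothetical, the spans and the Type 1 / Type 2 membership flags are taken as given for the training samples only, and the threshold $S_t \max(\sqrt{2}D, 1/\sqrt{C}) > 1$ follows my reading of Lemma 1.

```python
import numpy as np

def loo_bound_theorem1(spans, is_sv1, is_sv2, D, C, n):
    """Upper bound (|Psi_1| + |Psi_2|) / n on the leave-one-out error.

    spans  : S_t for each training sample (ignored unless the sample is a
             Type 1 support vector)
    is_sv1 : boolean flags, True if sample t is a Type 1 support vector
    is_sv2 : boolean flags, True if sample t is a Type 2 support vector
    D      : diameter of the smallest hypersphere containing the training data
    C      : training-sample cost parameter
    n      : number of training samples
    """
    spans = np.asarray(spans, dtype=float)
    is_sv1 = np.asarray(is_sv1, dtype=bool)
    is_sv2 = np.asarray(is_sv2, dtype=bool)

    scale = max(np.sqrt(2.0) * D, 1.0 / np.sqrt(C))
    # Psi_2: Type 1 support vectors whose scaled span exceeds 1.
    psi2 = int(np.sum(is_sv1 & (spans * scale > 1.0)))
    # Psi_1: every Type 2 support vector is counted as an error.
    psi1 = int(np.sum(is_sv2))
    return (psi1 + psi2) / n
```

Samples that are not support vectors contribute nothing, which matches the first case in the proof above.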

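A.6 Proof of Proposition 3

Remark 2. If the Type 1 training support vectors, i.e. $\{t \mid t \in SV_1 \cap T\}$, remain the same for the SVM and MU-SVM solutions, then we have $S_t^{SVM} \ge S_t^{MU\text{-}SVM}$.

Proof. By definition in Lemma 1,

$$
S_t^2 = \min_{\beta} \sum_{i,j}\Big(\sum_l \beta_{il}\beta_{jl}\Big) K(x_i, x_j)
\quad \text{s.t.} \quad
\beta^{MU\text{-}SVM} :=
\begin{cases}
\alpha_{il} - \beta_{il} \le C_i; & \forall (i, l) \in A_1 - \{t\} \\
\alpha_{il} - \beta_{il} \le 0; & \forall (i, l) \in A_2 - \{t\} \\
\beta_{il} = 0; & \forall (i, l) \notin SV_1 \cup \{t\} \\
\beta_{tl} = \alpha_{tl}; \;\; \sum_l \beta_{il} = 0
\end{cases}
$$

If the Type 1 (training) support vectors for the SVM and MU-SVM solutions remain the same, we get the same relation as in Lemma 1 for the C&S SVM with $\beta^{SVM} = \{\beta_{il} \in \beta^{MU\text{-}SVM} \mid \beta_{il} = \alpha_{il};\; \forall i \in SV_1 \cap U\}$, i.e. $\beta^{SVM} \subseteq \beta^{MU\text{-}SVM} \Rightarrow S_t(\beta^{SVM}) \ge S_t(\beta^{MU\text{-}SVM})$, since a minimum over a subset of the feasible set cannot be smaller. Here $U$ = universum samples.

A.7 Proof of Lemma 2

Lemma 2. Under the assumptions (i) and (ii) in Section 3.3, the following equality holds for both Type 1 and Type 2 training support vectors, i.e. $\forall x_t \in SV \cap T$,

$$
S_t^2 = \Big[\alpha_t^\top \sum_{i \in SV} \alpha_i K(x_i, x_t) - \alpha_{ty_t}\, g_{y_t k}^\top \sum_{i \in SV^t} \alpha^t_i K(x_i, x_t)\Big]
$$

with $S_t^2 = \{\min_{\beta} \sum_{i,j}(\sum_l \beta_{il}\beta_{jl}) K(x_i, x_j) \mid \beta_{tl} = \alpha_{tl};\; \sum_l \beta_{il} = 0;\; \forall (i, j) \in SV_1\}$ and $g_{y_t k} = [0, \ldots, \underbrace{1}_{y_t\text{-th}}, \ldots, \underbrace{-1}_{k\text{-th}}, \ldots, 0]$; $k = \arg\max_{q \ne y_t} \sum_j \alpha^t_{jq} K(x_j, x_t)$.

Proof. Under Assumption (i) we set $\beta = \gamma = (\alpha - \alpha^t)$. Then $I_1 = W(\alpha) - W(\alpha^t) = I_2$. A similar analysis as in (19) gives,

$$
I_1 = -\frac{1}{2}\sum_{(i,j) \in SV_1}\Big(\sum_l \gamma_{il}\gamma_{jl}\Big) K(x_i, x_j) - \sum_l \alpha_{tl}\Big[\sum_{j \in SV} \alpha^t_{jl} K(x_j, x_t) + e_{tl}\Big]
\tag{26}
$$

Note the difference in form compared to (19). This is because the analysis now applies to both Type 1 and Type 2 support vectors. Similarly,

$$
I_2 = \frac{1}{2}\sum_{i,j}\Big(\sum_l \beta_{il}\beta_{jl}\Big) K(x_i, x_j) - \sum_l \alpha_{tl}\Big[\sum_{j \in SV} \alpha_{jl} K(x_j, x_t) + e_{tl}\Big]
\tag{27}
$$

Combining (26) and (27),

$$
\sum_{(i,j) \in SV_1}\Big(\sum_l \beta_{il}\beta_{jl}\Big) K(x_i, x_j) = \sum_l \alpha_{tl}\Big[\sum_{j \in SV} \alpha_{jl} K(x_j, x_t) + e_{tl}\Big] - \sum_l \alpha_{tl}\Big[\sum_{j \in SV} \alpha^t_{jl} K(x_j, x_t) + e_{tl}\Big]
\tag{28}
$$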

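Next, let $\beta^*$ be the minimizer of (24). Then $(\alpha - \beta^*)$ is a feasible solution for (17). Hence,

$$
W(\alpha^t) \ge W(\alpha - \beta^*) \;\Rightarrow\; W(\alpha) - W(\alpha^t) \le W(\alpha) - W(\alpha - \beta^*) \;\Rightarrow\; \sum_{i,j}\Big(\sum_l \beta_{il}\beta_{jl}\Big) K(x_i, x_j) \le S_t^2
$$

However, from Assumption (i), $\beta = (\alpha - \alpha^t)$ is a feasible solution for (24). Hence for such a $\beta$ we have $S_t^2 \le \sum_{i,j}(\sum_l \beta_{il}\beta_{jl}) K(x_i, x_j)$. Combining the above inequalities,

$$
S_t^2 = \sum_{i,j}\Big(\sum_l \beta_{il}\beta_{jl}\Big) K(x_i, x_j)
\tag{29}
$$

Further, under Assumption (i) the inequality constraints in (24) are not activated. Hence, $S_t^2 = \{\min_{\beta} \sum_{i,j}(\sum_l \beta_{il}\beta_{jl}) K(x_i, x_j) \mid \beta_{tl} = \alpha_{tl};\; \sum_l \beta_{il} = 0;\; \forall (i, j) \in SV_1\}$. Finally, combining (28) and (29) we get,

$$
S_t^2 = \sum_l \alpha_{tl}\Big[\sum_{j \in SV} \alpha_{jl} K(x_j, x_t) + e_{tl}\Big] - \sum_l \alpha_{tl}\Big[\sum_{j \in SV} \alpha^t_{jl} K(x_j, x_t) + e_{tl}\Big]
\tag{30}
$$

For a leave-one-out error (under Assumption (ii)),

$$
\sum_l \alpha_{tl}\Big[\sum_{j \in SV} \alpha^t_{jl} K(x_j, x_t)\Big] = -\alpha_{ty_t}\Big[\sum_{j \in SV} \alpha^t_{jk} K(x_j, x_t) - \sum_{j \in SV} \alpha^t_{jy_t} K(x_j, x_t)\Big] \le 0 \qquad \Big(k = \arg\max_{m \ne y_t} \sum_j \alpha^t_{jm} K(x_j, x_t)\Big)
$$

$$
\Rightarrow\; S_t^2 \ge \sum_l \alpha_{tl}\Big[\sum_{j \in SV} \alpha_{jl} K(x_j, x_t)\Big]
$$

A.8 Proof of Lemma 3

Lemma 3. The span $S_t^2$ can be efficiently computed as

$$
S_t^2 =
\begin{cases}
\alpha_t^\top \left[(H^{-1})_{tt}\right]^{-1} \alpha_t & t \in SV_1 \cap T \\
\alpha_t^\top \left[K(x_t, x_t) \otimes I_L - K_t^\top H^{-1} K_t\right] \alpha_t & t \in SV_2 \cap T
\end{cases}
$$

Here,

$$
H := \begin{bmatrix} K_{SV_1} \otimes I_L & A^\top \\ A & 0 \end{bmatrix}; \qquad A := I_{|SV_1|} \otimes (1_L); \qquad 1_L = [\underbrace{1 \; 1 \; \ldots \; 1}_{L \text{ elements}}]
$$

$(H^{-1})_{tt}$ := sub-matrix of $H^{-1}$ for indices $i = (t-1)L + 1, \ldots, tL$; $K_{SV_1}$ := kernel matrix of the Type 1 support vectors; and $K_t = [(k_t^\top \otimes I_L) \;\; 0_{L \times |SV_1|}]^\top$, where $k_t$ is the $n_{SV_1} \times 1$ dimensional vector whose $i$-th element is $K(x_i, x_t)$, $\forall x_i \in SV_1$, and $\otimes$ is the Kronecker product.

Proof. The span is defined as:

$$
\begin{aligned}
S_t^2 = \min_{\beta} \quad & \sum_{i,j}\Big(\sum_l \beta_{il}\beta_{jl}\Big) K(x_i, x_j) \\
\text{s.t.} \quad & \beta_{tl} = \alpha_{tl}; \quad l = 1, \ldots, L \\
& \sum_l \beta_{il} = 0; \quad \forall (i, j) \in SV_1
\end{aligned}
\tag{31}
$$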

Case ($t \in SV_1$)

$$
\begin{aligned}
S_t^2 &= \min_{\beta} \; (\alpha_t^\top \alpha_t) K(x_t, x_t) + 2\sum_{i \in SV_1 - \{t\}} (\alpha_t^\top \beta_i) K(x_t, x_i) + \sum_{(i,j) \in SV_1 - \{t\}} (\beta_i^\top \beta_j) K(x_i, x_j) \\
&\qquad \text{s.t.} \;\; \underbrace{(I_{|SV_1 - \{t\}|} \otimes 1_L)}_{A}\, \beta = 0 \\
&= \min_{\beta} \max_{\mu} \; \alpha_t^\top [K(x_t, x_t) \otimes I_L]\, \alpha_t + 2\sum_{i \in SV_1 - \{t\}} (\alpha_t^\top \beta_i) K(x_t, x_i) + \sum_{(i,j) \in SV_1 - \{t\}} (\beta_i^\top \beta_j) K(x_i, x_j) + 2\mu^\top A \beta + 2\alpha_t^\top A_{tt}\, \mu \\
&\qquad \Big(\mu := \text{Lagrange multiplier}; \;\; \sum_l \alpha_{tl} = 0 \Rightarrow \alpha_t^\top A_{tt}\, \mu = 0\Big) \\
&= \alpha_t^\top [K(x_t, x_t) \otimes I_L]\, \alpha_t + \min_{\beta} \max_{\mu} \underbrace{2\alpha_t^\top (H^{(-t)}_t)^\top \lambda + \lambda^\top H^{(-t)} \lambda}_{L(\lambda)} \qquad (\text{with } \lambda = [\beta; \mu])
\end{aligned}
$$

where $I_{|SV_1 - \{t\}|}$ := identity matrix of size $|SV_1 - \{t\}|$; $A_{tt}$ := sub-matrix of $A$ for indices $(t-1)L + 1, \ldots, tL$; $H^{(-t)}$ := matrix $H$ (in Lemma 3) with the $(t-1)L + 1, \ldots, tL$ rows/columns removed; and $H^{(-t)}_t$ := the $(t-1)L + 1, \ldots, tL$ columns of $H$.

Further, at the saddle point: $\nabla_\lambda L(\lambda) = 0 \Rightarrow \lambda^* = -[H^{(-t)}]^{-1} H^{(-t)}_t \alpha_t$. Hence,

$$
S_t^2 = \alpha_t^\top \left[(K(x_t, x_t) \otimes I_L) - (H^{(-t)}_t)^\top (H^{(-t)})^{-1} H^{(-t)}_t\right] \alpha_t = \alpha_t^\top \left[(H^{-1})_{tt}\right]^{-1} \alpha_t
\tag{32}
$$

where $(H^{-1})_{tt}$ := sub-matrix of $H^{-1}$ for indices $i = (t-1)L + 1, \ldots, tL$.

Case ($t \in SV_2$)

A similar analysis as above gives,

$$
S_t^2 = \alpha_t^\top \left[K(x_t, x_t) \otimes I_L - K_t^\top H^{-1} K_t\right] \alpha_t
\tag{33}
$$

where $K_t = [(k_t^\top \otimes I_L) \;\; 0_{L \times |SV_1|}]^\top$ and $k_t$ is the $n_{SV_1} \times 1$ dimensional vector whose $i$-th element is $K(x_i, x_t)$, $\forall x_i \in SV_1$.

A.9 Proof of Theorem 2

The proof has two steps. First, a sample $(x_t, y_t)$ which is not a support vector does not contribute to an error. Secondly, for a sample $(x_t, y_t)$ with $t \in SV \cap T$, Lemma 2 holds. Finally, combining this with the form of $S_t^2$ in Lemma 3 completes the proof.

B Additional Results

B.1 ECOC vs. Direct Approach for MU-SVM

This section provides performance comparisons between two major ECOC-based approaches, one-vs-all (OVA) and one-vs-one (OVO), and the direct formulation (C&S based MU-SVM in (2)). For the ECOC-based approaches we use the standard U-SVM formulation (Weston et al., 2006) to solve the binary problems. Further, we use the same datasets and experimental settings as discussed in Section 4. For all the datasets we show the results for the Universum types which provided the best performance in Table 2.
As shown in Table 4, for the datasets and experimental settings used in this paper, the C&S based direct formulation (MU-SVM) performs as well as (or better than) the ensemble-based methods.

Table 4: Mean (± standard deviation) test error in % over 10 runs.

DATASET    METHOD                   ONE VS ALL    ONE VS ONE    C&S (MU-SVM)
GTSRB      SVM                      7.07 ± …      … ± …         … ± 1.16
GTSRB      U-SVM (PRIORITY-ROAD)    6.05 ± …      … ± …         … ± 0.32
ABCDETC    SVM                      28.1 ± …      … ± …         … ± 3.34
ABCDETC    U-SVM (RA)               … ± …         … ± …         … ± 2.89
ISOLET     SVM                      3.72 ± …      … ± …         … ± 0.31
ISOLET     U-SVM (RA)               3.56 ± …      … ± …         … ± 0.32

B.2 SVM vs. MU-SVM using all training classes

Table 5: Performance comparisons between SVM vs. MU-SVM using all training classes.

GTSRB: # train / test = 700 / 3500 (100 / 500 per class), # universum (m) = 500
SVM          MU-SVM (PRIORITY-ROAD)    MU-SVM (RA)    MU-SVM (NON-SPEED)
… ± …        … ± …                     … ± …          … ± …

ABCDETC: # train / test = 1500 / 1000 (150 / 100 per class), # universum (m) = 300
SVM          MU-SVM (UPPER)    MU-SVM (LOWER)    MU-SVM (SYMBOLS)    MU-SVM (RA)
42.1 ± …     … ± …             … ± …             … ± …               … ± 2.1


B.3 Performance comparisons for several Universum types with varying training set size for GTSRB dataset

The experiments follow the same setting as in Table 2. However, in this case we vary the number of training samples. The universum set size is fixed to m = 500 following Table 2, i.e. a further increase in universum samples does not provide significant performance gains. Table 6 provides the mean and standard deviation of the test errors for the SVM and MU-SVM models over 10 random training/test partitionings of the dataset.

Table 6: Mean (± standard deviation) of the test errors (in %) over 10 runs for the GTSRB dataset. Columns give the no. of training samples (per class).

METHOD                                300 (100)      750 (250)    1500 (500)
C&S SVM                               7.54 ± 0.82    4.… ± …      … ± 0.38
MU-SVM (no. of universum samples = 500):
  RA                                  6.98 ± …       … ± …        … ± 0.54
  NON SPEED                           7.46 ± …       … ± …        … ± 0.4
  (NO PASSING), (NO PASSING FOR TRUCKS): 6.98 ± 0.…, …, … ± 0.41
  (RIGHT OF WAY), (PRIORITY ROAD), (YIELD RIGHT OF WAY), (STOP), (NO VEHICLES): 6.17 ± …, …, … ± 0.24
  (NO ENTRY), (DANGER), (SLIPPERY ROAD): 6.17 ± …, …, … ± 0.62
  Further MU-SVM entries, in extraction order: …52 ± 0.68, 3.52 ± 0.37, 3.15 ± 0.44, 6.2 ± 0.7, 3.83 ± 0.…


Figure 9: Typical histogram of projection of training samples (n = 750) (shown in blue) and universum samples priority-road (m = 500) (shown in red). SVM decision functions (with C = 0.1) for (a) sign 30, (b) sign 70, (c) sign 80; (d) frequency plot of predicted labels for universum samples using the SVM model. MU-SVM decision functions (with $C^*/C = 0.5$, $\Delta = 0.1$) for (e) sign 30, (f) sign 70, (g) sign 80; (h) frequency plot of predicted labels for universum samples using the MU-SVM model.

Figure 10: Typical histogram of projection of training samples (n = 1500) (shown in blue) and universum samples priority-road (m = 500) (shown in red). SVM decision functions (with C = 0.1) for (a) sign 30, (b) sign 70, (c) sign 80; (d) frequency plot of predicted labels for universum samples using the SVM model. MU-SVM decision functions (with $C^*/C = 1$, $\Delta = 0.05$) for (e) sign 30, (f) sign 70, (g) sign 80; (h) frequency plot of predicted labels for universum samples using the MU-SVM model.

Table 6 shows that MU-SVM with the priority-road universum provides the best performance. Further, the performance gain due to MU-SVM reduces as the number of training samples increases. For further analysis of this result we use the histogram of projections method. The histograms of projections for the priority-road universum with increased training samples n = 750 and n = 1500 are provided in Figs. 9 and 10 respectively. As seen from the figures, when the number of training samples is large, the estimation problem becomes well-posed and the SVM model does not exhibit a large data-piling effect about the +1 margin borders (compared to Fig. 3). In such cases, application of MU-SVM does not provide a significant improvement over the SVM solution. This is consistent with the results reported in (Cherkassky et al., 2011) for binary U-SVM. This shows that MU-SVM is typically effective for (ill-conditioned) high-dimensional low sample size settings.

B.4 Additional Histogram of Projections

This section provides the histograms of projections for the modeling results on the ABCDETC and ISOLET datasets. The experimental settings are discussed in Section 4.1.

B.4.1 ABCDETC Dataset

Figure 11: Typical histogram of projection of training samples (n = 600) (shown in blue) and universum samples upper case letters (m = 1000) (shown in red). SVM decision functions (with $C = 1$, $\gamma = 2^{-7}$) for (a) digit 0, (b) digit 1, (c) digit 2, (d) digit 3; (e) frequency plot of predicted labels for universum samples using the SVM model. MU-SVM decision functions (with $C^*/C = 0.15$, $\Delta = 0$) for (f) digit 0, (g) digit 1, (h) digit 2, (i) digit 3; (j) frequency plot of predicted labels for universum samples using the MU-SVM model.

Figure 12: Typical histogram of projection of training samples (n = 600) (shown in blue) and universum samples lower case letters (m = 1000) (shown in red). SVM decision functions (with $C = 1$, $\gamma = 2^{-7}$) for (a) digit 0, (b) digit 1, (c) digit 2, (d) digit 3; (e) frequency plot of predicted labels for universum samples using the SVM model. MU-SVM decision functions (with $C^*/C = 0.15$, $\Delta = 0$) for (f) digit 0, (g) digit 1, (h) digit 2, (i) digit 3; (j) frequency plot of predicted labels for universum samples using the MU-SVM model.

Figure 13: Typical histogram of projection of training samples (n = 600) (shown in blue) and universum samples symbols (m = 1000) (shown in red). SVM decision functions (with $C = 1$, $\gamma = 2^{-7}$) for (a) digit 0, (b) digit 1, (c) digit 2, (d) digit 3; (e) frequency plot of predicted labels for universum samples using the SVM model. MU-SVM decision functions (with $C^*/C = 0.15$, $\Delta = 0$) for (f) digit 0, (g) digit 1, (h) digit 2, (i) digit 3; (j) frequency plot of predicted labels for universum samples using the MU-SVM model.

Figure 14: Typical histogram of projection of training samples (n = 600) (shown in blue) and universum samples random averaging (RA) (m = 1000) (shown in red). SVM decision functions (with $C = 1$, $\gamma = 2^{-7}$) for (a) digit 0, (b) digit 1, (c) digit 2, (d) digit 3; (e) frequency plot of predicted labels for universum samples using the SVM model. MU-SVM decision functions (with $C^*/C = 0.15$, $\Delta = 0$) for (f) digit 0, (g) digit 1, (h) digit 2, (i) digit 3; (j) frequency plot of predicted labels for universum samples using the MU-SVM model.

As seen from Figs. 11-14:

Upper: the SVM model results in a narrow distribution of the universum samples and in turn provides near-random prediction on the universum samples. Applying MU-SVM in this case provides no significant change to the multiclass SVM solution and hence no additional improvement in generalization (see Table 2).

Lower: the SVM model results in a relatively wider distribution of the universum samples (compared to Upper). Applying MU-SVM in this case provides some improvement over the multiclass SVM (see Table 2).

Symbol and RA: the SVM model results in a wide distribution of the universum samples. Further, in both cases the universum samples are mostly predicted as digit 1. Applying MU-SVM in this case results in a narrow distribution of the universum samples and increases the uncertainty on the universum samples. This results in a significant improvement over the multiclass SVM solution (see Table 2).

B.4.2 ISOLET Dataset

Figure 15: Typical histogram of projection of training samples (n = 500) (shown in blue) and universum samples Others (m = 1000) (shown in red). SVM decision functions (with $C = 1$, $\gamma = 2^{-7}$) for (a) letter a, (b) letter b, (c) letter c, (d) letter d, (e) letter e; (f) frequency plot of predicted labels for universum samples using the SVM model. MU-SVM decision functions (with $C^*/C = 0.1$, $\Delta = 0.05$) for (g) letter a, (h) letter b, (i) letter c, (j) letter d, (k) letter e; (l) frequency plot of predicted labels for universum samples using the MU-SVM model.

Figure 16: Typical histogram of projection of training samples (n = 500) (shown in blue) and universum samples RA (m = 1000) (shown in red). SVM decision functions (with $C = 1$, $\gamma = 2^{-7}$) for (a) letter a, (b) letter b, (c) letter c, (d) letter d, (e) letter e; (f) frequency plot of predicted labels for universum samples using the SVM model. MU-SVM decision functions (with $C^*/C = 0.1$, $\Delta = 0.1$) for (g) letter a, (h) letter b, (i) letter c, (j) letter d, (k) letter e; (l) frequency plot of predicted labels for universum samples using the MU-SVM model.

As seen from Figs. 15-16:

Others: the SVM model results in a near-random prediction on the universum samples. Applying MU-SVM in this case reduces the projection of the universum samples but does not result in a significant increase in the uncertainty of the universum samples, and hence gives no additional improvement in generalization (see Table 2).

RA: the SVM model results in a wide distribution of the universum samples. Further, the universum samples are mostly predicted as letter d. Applying MU-SVM in this case results in a narrow distribution of the universum samples and increases the uncertainty on the universum samples. This results in a significant improvement over the multiclass SVM solution (see Table 2).

B.5 Comparison of the error estimates using 5-Fold CV vs. Theorem 2

This section provides the error-estimate curves for the different model parameters for the datasets in Table 1.

B.5.1 GTSRB dataset

WITH VARYING $C^*/C$

Figure 17: Performance of MU-SVM with RA universum for the GTSRB dataset. Here, no. of training samples (n = 300), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $C^*/C = [10^{-3}, 10^{-2}, 10^{-1}, 10^{0}]$, $C = 1$, $\Delta = 0$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.

Figure 18: Performance of MU-SVM with Non-Speed universum for the GTSRB dataset. Here, no. of training samples (n = 300), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $C^*/C = [10^{-3}, 10^{-2}, 10^{-1}, 10^{0}]$, $C = 1$, $\Delta = 0$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.

WITH VARYING $\Delta$

Figure 19: Performance of MU-SVM with RA universum for the GTSRB dataset. Here, no. of training samples (n = 300), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = [0, 0.01, 0.05, 0.1]$, $C = 1$, $C^*/C = \frac{n}{mL} = 0.1$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.
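The fixed cost ratio quoted in the captions, read here as $C^*/C = n/(mL)$ with $L$ the number of classes, can be checked with a few lines of arithmetic. This is a small sketch rather than anything from the paper's code. The class counts for ABCDETC (4) and ISOLET (5) are taken from the digits 0 to 3 and letters a to e shown in Figures 13-16; the class count of 3 for GTSRB is an assumption, chosen because it reproduces the reported value of 0.1.

```python
from fractions import Fraction


def universum_cost_ratio(n, m, num_classes):
    """Return n / (m * L) exactly, the ratio C*/C quoted in the captions."""
    return Fraction(n, m * num_classes)


# (n, m, L) per dataset; L = 3 for GTSRB is an assumption (see lead-in).
settings = {
    "GTSRB": (300, 1000, 3),
    "ABCDETC": (600, 1000, 4),
    "ISOLET": (500, 1000, 5),
}

ratios = {name: float(universum_cost_ratio(*args)) for name, args in settings.items()}
```

With these inputs the ratios come out as 0.1, 0.15 and 0.1, matching the values of $C^*/C$ reported for the three datasets.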

Figure 20: Performance of MU-SVM with Non-Speed universum for the GTSRB dataset. Here, no. of training samples (n = 300), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = [0, 0.01, 0.05, 0.1]$, $C = 1$, $C^*/C = \frac{n}{mL} = 0.1$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.

B.5.2 ABCDETC dataset

WITH VARYING $C^*/C$

Figure 21: Performance of MU-SVM with Upper-case universum for the ABCDETC dataset. Here, no. of training samples (n = 600), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = 0$, $C = 1$, $\gamma = 2^{-7}$, $C^*/C = [10^{-3}, 10^{-2}, 10^{-1}, 10^{0}]$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.

Figure 22: Performance of MU-SVM with Lower-case universum for the ABCDETC dataset. Here, no. of training samples (n = 600), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = 0$, $C = 1$, $\gamma = 2^{-7}$, $C^*/C = [10^{-3}, 10^{-2}, 10^{-1}, 10^{0}]$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.

Figure 23: Performance of MU-SVM with Symbol universum for the ABCDETC dataset. Here, no. of training samples (n = 600), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = 0$, $C = 1$, $\gamma = 2^{-7}$, $C^*/C = [10^{-3}, 10^{-2}, 10^{-1}, 10^{0}]$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.

Figure 24: Performance of MU-SVM with RA universum for the ABCDETC dataset. Here, no. of training samples (n = 600), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = 0$, $C = 1$, $\gamma = 2^{-7}$, $C^*/C = [10^{-3}, 10^{-2}, 10^{-1}, 10^{0}]$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.

WITH VARYING $\Delta$

Figure 25: Performance of MU-SVM with Upper universum for the ABCDETC dataset. Here, no. of training samples (n = 600), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = [0, 0.01, 0.05, 0.1]$, $C = 1$, $\gamma = 2^{-7}$, $C^*/C = \frac{n}{mL} = 0.15$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.


Figure 26: Performance of MU-SVM with Lower universum for the ABCDETC dataset. Here, no. of training samples (n = 600), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = [0, 0.01, 0.05, 0.1]$, $C = 1$, $\gamma = 2^{-7}$, $C^*/C = \frac{n}{mL} = 0.15$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.

Figure 27: Performance of MU-SVM with Symbol universum for the ABCDETC dataset. Here, no. of training samples (n = 600), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = [0, 0.01, 0.05, 0.1]$, $C = 1$, $\gamma = 2^{-7}$, $C^*/C = \frac{n}{mL} = 0.15$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.

Figure 28: Performance of MU-SVM with RA universum for the ABCDETC dataset. Here, no. of training samples (n = 600), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = [0, 0.01, 0.05, 0.1]$, $C = 1$, $\gamma = 2^{-7}$, $C^*/C = \frac{n}{mL} = 0.15$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.

B.5.3 ISOLET dataset

WITH VARYING $C^*/C$


Figure 29: Performance of MU-SVM with Others universum for the ISOLET dataset. Here, no. of training samples (n = 500), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $C^*/C = [10^{-3}, 10^{-2}, 10^{-1}, 10^{0}]$, $C = 1$, $\Delta = 0$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.

Figure 30: Performance of MU-SVM with RA universum for the ISOLET dataset. Here, no. of training samples (n = 500), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $C^*/C = [10^{-3}, 10^{-2}, 10^{-1}, 10^{0}]$, $C = 1$, $\Delta = 0$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.

WITH VARYING $\Delta$

Figure 31: Performance of MU-SVM with RA universum for the ABCDETC dataset. Here, no. of training samples (n = 600), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = [0, 0.01, 0.05, 0.1]$, $C = 1$, $C^*/C = \frac{n}{mL} = 0.1$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.


Figure 32: Performance of MU-SVM with Others universum for the ABCDETC dataset. Here, no. of training samples (n = 600), no. of universum samples (m = 1000). (a) Error estimates for the model parameters $\Delta = [0, 0.01, 0.05, 0.1]$, $C = 1$, $C^*/C = \frac{n}{mL} = 0.1$. (b) Ranking of the model parameters with the smallest error estimate over each experiment.
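One plausible reading of the ranking panels in Section B.5 is a count of how often each candidate parameter value attains the smallest error estimate across repeated experiments, computed separately for 5-fold CV and for the Theorem 2 estimate. The sketch below illustrates that reading only. It assumes the error estimates are already available as arrays with one row per experiment and one column per candidate parameter; it does not implement the Theorem 2 bound, and both function names are hypothetical.

```python
import numpy as np


def selection_counts(error_estimates):
    """Count how often each candidate has the smallest estimate.

    `error_estimates` has shape (num_experiments, num_candidates).
    Ties are resolved in favour of the first candidate (np.argmin behaviour).
    """
    errors = np.asarray(error_estimates, dtype=float)
    best = np.argmin(errors, axis=1)
    return np.bincount(best, minlength=errors.shape[1])


def agreement_rate(errors_a, errors_b):
    """Fraction of experiments in which two estimators pick the same candidate."""
    best_a = np.argmin(np.asarray(errors_a, dtype=float), axis=1)
    best_b = np.argmin(np.asarray(errors_b, dtype=float), axis=1)
    return float(np.mean(best_a == best_b))
```

Passing the cross-validation estimates and the bound-based estimates for the same experiments to `agreement_rate` gives a single number summarizing how often the two model-selection procedures choose the same parameter value.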
