Efficient Minimax Clustering Probability Machine by Generalized Probability Product Kernel

Haiqin Yang, Kaizhu Huang, Irwin King and Michael R. Lyu

Abstract—Minimax Probability Machine (MPM), which learns a decision function by minimizing the maximum probability of misclassification, has demonstrated very promising performance in classification and regression. However, MPM is often challenged for its slow training and test procedures. To solve this problem, we propose an efficient model named Minimax Clustering Probability Machine (MCPM). Following many traditional methods, we represent the training data points by several clusters. Different from these methods, a Generalized Probability Product Kernel is appropriately defined to grasp the inner distributional information over the clusters. By incorporating clustering information via a non-linear kernel, MCPM trains and tests quickly on classification problems with promising performance. Another appealing property of the proposed approach is that MCPM can still derive an explicit worst-case accuracy bound for the decision boundary. Experimental results on synthetic and real data validate the effectiveness of MCPM for classification while attaining high accuracy.

I. INTRODUCTION
Minimax Probability Machine (MPM) is a recently proposed learning model that has demonstrated advantages in solving classification problems [10]. By minimizing the maximum probability of misclassification of future data points, MPM has shown competitive classification accuracy against the state-of-the-art classifier, the Support Vector Machine (SVM). One appealing feature of MPM is that it can derive an explicit worst-case accuracy bound for the decision boundary. Following the idea of MPM, there have been many important extensions, e.g., the worst-case optimal Bayesian classification model [7], its regression extension [19], and the Biased Minimax Probability Machine for imbalanced classification [5] and medical diagnosis [6].
However, MPM and its extensions are often challenged for their time-consuming training and test procedures. Training MPM is equivalent to solving a Second Order Cone Programming (SOCP) problem, whose worst-case complexity is O(n³) (n is the number of training samples for the kernelized MPM). The test complexity of the kernelized MPM is also related to the number of training samples. This makes MPM-based models inefficient for classifying large datasets.

Haiqin Yang, Kaizhu Huang, Irwin King and Michael R. Lyu are with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong (email: {hqyang,kzhuang,king,lyu}@cse.cuhk.edu.hk). The work described in this paper is supported by two grants, one from the CUHK Direct Grant # , and the other from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4150/07E).

In solving large-scale classification problems, the state-of-the-art classifier, the Support Vector Machine, faces the same problem. Although various improvements, e.g., Sequential Minimal Optimization [16], [9] and parallel methods [2], [4], have been made to speed up SVM training, the training complexity of SVMs is still high when the number of training samples is large. To solve this problem, clustering-based SVMs, e.g., CB-SVMs [20] and the Support Clustering Machine (SCM) [21], [12], have been proposed to select representative quanta, e.g., typical points or clusters, for SVM training, so as to reduce the training complexity. Motivated by the idea of clustering-based SVMs, we propose the Minimax Clustering Probability Machine (MCPM) to extend MPM to large-scale classification problems. The main idea of MCPM is as follows. The training samples are clustered in advance to produce a set of generative models. The obtained clusters, described by certain distributions, are then input as the training units, while the test samples are interpreted as special clusters centered on each specific data point.
Instead of applying the probability product kernel to measure similarity as used in [12], we define a novel generalized probability product kernel, in particular a Radial Basis Function on the probability product kernel, to measure the similarity either between any two clusters (in training) or between a cluster and a test vector (in test). Finally, the decision function can be constructed in a kernel form that depends only on the training clusters. Experiments on both synthetic and real data show that the proposed MCPM reduces the computational costs in both the training phase and the test phase, while preserving the classification accuracy.

The proposed generalized probability product kernel has several advantages over the traditional probability product kernel as used in SCM [12]. First, as we show in this paper, the traditional probability product kernel is actually a linear kernel defined in the probability space, while our generalized kernel describes a non-linear kernel which can generate more complex similarity measures. Second, numerical problems such as large variance in the kernel matrix sometimes occur when the traditional probability product kernel is employed. These numerical problems often require careful data adaptation, e.g., scaling up the kernel matrix, making the training sometimes not as straightforward as expected. In contrast, the proposed generalized probability product kernel avoids such problems by projecting the probabilities into a non-linear space; the whole learning process is easy to implement and requires no data adaptation. Third, the generalized probability product kernel is more flexible in measuring the similarity. This is analogous to the case that non-linear RBF kernels can usually outperform the linear kernel. Hence the proposed generalized kernel defined over probabilities is often more accurate than the standard kernel. Empirical evidence on real data also supports this statement, as seen later in the experiments.

/08/$25.00 © 2008 IEEE

The contributions of the paper are summarized as follows. (1) The proposed MCPM largely reduces both the computational and the spatial costs for both training and test, while keeping the classification accuracy; (2) MCPM keeps the statistical information of the training samples by representing them as generative models; (3) by defining Radial Basis Functions on the probability product kernel, the similarity measurement can deliver more information for classification; (4) MCPM provides a worst-case accuracy bound for classifying future data points; (5) MCPM can be implemented easily by using the generalized kernel.

The rest of this paper is organized as follows. Section II derives the MCPM under a probability framework similar to that of the original MPM. Section III defines the probability product kernel and introduces a generalized probability product kernel to measure the similarity either between a pair of clusters or between a cluster and a test sample. Section IV reports the experimental setup and results on both a synthetic dataset and real datasets. Finally, the paper is concluded in Section V.

II. CLASSIFICATION MODEL
In this section, we first give a sketch introduction to the MPM. We then formulate the MCPM in Subsection II-B.

A. Minimax Probability Machine for Binary Classification
Consider a binary classification problem and suppose the data are generated from two classes, x and y. Data of class x are drawn from a class of distributions with mean and covariance matrix {x̄, Σ_x}, while data of class y are from another class of distributions with mean and covariance matrix {ȳ, Σ_y}, where x, y, x̄, ȳ ∈ R^d and Σ_x, Σ_y ∈ R^{d×d}.
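For concreteness, the class statistics {x̄, Σ_x} and {ȳ, Σ_y} assumed above would in practice be plug-in estimates from the training samples; a minimal sketch on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-class training sample in R^2, 100 points per class.
X = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
Y = rng.normal(loc=3.0, scale=1.0, size=(100, 2))

# Plug-in estimates of {x_bar, Sigma_x} and {y_bar, Sigma_y}.
x_bar, Sigma_x = X.mean(axis=0), np.cov(X, rowvar=False)
y_bar, Sigma_y = Y.mean(axis=0), np.cov(Y, rowvar=False)

print(x_bar.shape, Sigma_x.shape)  # (2,) (2, 2)
```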
Assuming {x̄, Σ_x} and {ȳ, Σ_y} for the two classes are reliable, MPM attempts to determine the hyperplane H(a, b) = {z | aᵀz = b} (a ∈ R^d \ {0}, b ∈ R, where the superscript ᵀ denotes the transpose) that separates the two classes of data with maximal probability. The formulation of the MPM model is written as follows:

max_{α, a≠0, b} α
s.t. inf_{x ~ (x̄, Σ_x)} Pr{aᵀx ≥ b} ≥ α,    (1)
     inf_{y ~ (ȳ, Σ_y)} Pr{aᵀy ≤ b} ≥ α,

where α represents the worst-case accuracy of classifying future data points. A future point z with aᵀz ≥ b is then classified as belonging to the class associated with x; otherwise it is judged as belonging to the class associated with y. This derived decision hyperplane is claimed to minimize the worst-case (maximal) probability of misclassification, or the worst-case error rate, of future data. Further, applying the generalization of Marshall and Olkin's result [14], [17], the optimization of MPM can be transformed into a Second Order Cone Programming (SOCP) problem as follows [13], [15]:

min_a ‖Σ_x^{1/2} a‖₂ + ‖Σ_y^{1/2} a‖₂   s.t.  aᵀ(x̄ − ȳ) = 1.    (2)

The worst-case complexity of solving the MPM optimization, i.e., the SOCP problem in Eq. (2), is O(n³), where n is the number of training samples for the kernelized MPM. In the test phase, the complexity of the kernelized MPM also depends on the number of training samples. This high computational complexity is a main obstacle to applying MPM in real applications.

B. Minimax Clustering Probability Machine
Aiming at reducing the computational complexity of MPM, we propose the Minimax Clustering Probability Machine (MCPM). The idea is as follows. The training samples of class x and the training samples of class y are clustered into M_cx and M_cy training clusters, respectively. Following the Gaussian distribution assumption, we denote the training clusters as generative models, i.e., c_j = (P_j, μ_j, Σ_j), where P_j, μ_j, Σ_j are the prior (weight), the mean, and the covariance matrix of the j-th cluster.
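As a concrete illustration, the generative models c_j = (P_j, μ_j, Σ_j) can be produced by any clustering step; the sketch below uses plain k-means purely as a hypothetical stand-in for the clustering algorithm, with priors taken as cluster fractions and only diagonal covariances kept:

```python
import numpy as np

def class_to_clusters(X, M, iters=20, seed=0):
    """Cluster one class into M generative models c_j = (P_j, mu_j, diag(Sigma_j)).

    Plain k-means is used here only as a stand-in for the clustering step;
    priors are cluster fractions, and only diagonal covariances are kept.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=M, replace=False)].copy()
    for _ in range(iters):
        # Assign every point to its nearest center, then recompute the centers.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        label = dist.argmin(axis=1)
        for j in range(M):
            if np.any(label == j):
                centers[j] = X[label == j].mean(axis=0)
    clusters = []
    for j in range(M):
        pts = X[label == j]
        if len(pts) == 0:
            continue                          # drop clusters that lost all points
        P_j = len(pts) / len(X)               # prior (weight) of the cluster
        mu_j = pts.mean(axis=0)               # cluster mean
        var_j = pts.var(axis=0) + 1e-6        # diagonal covariance, regularized
        clusters.append((P_j, mu_j, var_j))
    return clusters

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
clusters = class_to_clusters(X, M=2)
print(sum(P for P, _, _ in clusters))  # priors sum to one (up to rounding)
```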
For the positive clusters, j ranges from 1 to M_cx; for the negative clusters, j ranges from 1 to M_cy. Hence, the total number of training clusters is M = M_cx + M_cy. In the following, we denote the space of generative models as R_G = R × R^d × R^{d×d}. The problem therefore becomes to find a linear decision boundary H(c, b) = {z ∈ R_G | cᵀz = b} (c ∈ R_G \ {0}, b ∈ R). We then transform the above generative models c_j, j = 1, ..., M, from R_G to a feature space R_f via a mapping φ : R_G → R_f. A linear decision boundary H(c, b) = {φ(z) ∈ R_f | cᵀφ(z) = b} in the feature space R_f then corresponds to a non-linear decision boundary D(c, b) = {z ∈ R_G | cᵀφ(z) = b} in the space R_G (c ∈ R_f \ {0} and b ∈ R).

Now, let the training clusters be mapped as c_x → φ(c_x) ~ {φ̄(c_x), Σ_{φ(c_x)}} and c_y → φ(c_y) ~ {φ̄(c_y), Σ_{φ(c_y)}}. A non-linear decision boundary in R_G can then be obtained by solving the minimax probability decision problem of Eq. (1) in the feature space R_f:

max_{α, c≠0, b} α
s.t. inf_{φ(c_x) ~ (φ̄(c_x), Σ_{φ(c_x)})} Pr{cᵀφ(c_x) ≥ b} ≥ α,
     inf_{φ(c_y) ~ (φ̄(c_y), Σ_{φ(c_y)})} Pr{cᵀφ(c_y) ≤ b} ≥ α.

Similar to the optimization of Eq. (1), the above optimization can be solved via

1/τ := min_c ‖Σ_{φ(c_x)}^{1/2} c‖₂ + ‖Σ_{φ(c_y)}^{1/2} c‖₂   s.t.  cᵀ(φ̄(c_x) − φ̄(c_y)) = 1.

By adopting the kernel trick similar to that in [11], [7], we can write c as a linear combination of the training clusters and then find the coefficients. Without loss of generality, c can be written as c = Σ_{i=1}^{M_cx} ν_i φ(c_xi) + Σ_{j=1}^{M_cy} ω_j φ(c_yj). Let {t_i}_{i=1}^{M} denote the set of all M = M_cx + M_cy training clusters, with t_j = c_xj for j = 1, 2, ..., M_cx and t_j = c_y(j−M_cx) for j = M_cx + 1, M_cx + 2, ..., M. The Gram matrix G can be defined as

G_ij = G(φ(t_i), φ(t_j)),   i, j = 1, 2, ..., M.    (3)

Denoting the first M_cx rows and the last M_cy rows of G as G_cx and G_cy, respectively, we get G = [G_cx; G_cy]. The block-row-averaged Gram matrix K is then obtained by setting the row average of the G_cx block and of the G_cy block to zero:

K = ( G_cx − 1_{M_cx} l̃_cx ; G_cy − 1_{M_cy} l̃_cy ) =: ( K_cx ; K_cy ),

where 1_n is a column vector of ones of dimension n. The row averages l̃_cx and l̃_cy are M-dimensional row vectors given by

(l̃_cx)_i = (1/M_cx) Σ_{j=1}^{M_cx} K(c_xj, t_i),
(l̃_cy)_i = (1/M_cy) Σ_{j=1}^{M_cy} K(c_yj, t_i).

Hence, the objective of MCPM becomes

1/τ := min_v ‖K_cx v‖₂ + ‖K_cy v‖₂   s.t.  vᵀ(l̃_cx − l̃_cy) = 1,    (4)

where v = [ν_1, ν_2, ..., ν_{M_cx}, ω_1, ω_2, ..., ω_{M_cy}]ᵀ. The decision function of MCPM is then calculated as

f_MCPM(c_z) = Σ_{i=1}^{M_cx} v_i K(c_z, c_xi) + Σ_{i=1}^{M_cy} v_{M_cx+i} K(c_z, c_yi) − b_MCPM,    (5)

and the bias term is obtained by

b_MCPM = vᵀ l̃_cx − τ ‖K_cx v‖₂ = vᵀ l̃_cy + τ ‖K_cy v‖₂.

From Eq. (4), we can see that the optimization is similar to that of the kernelized MPM; however, the number of training units is largely reduced from the number of training samples to the number of training clusters.

2008 International Joint Conference on Neural Networks (IJCNN 2008)

III. GENERALIZED PROBABILITY PRODUCT KERNEL
In solving Eq. (4), we still need a suitable distance definition to measure the similarity between two clusters in the training phase, or between a cluster and a sample vector in the test phase. In the following, we first introduce general kernels in the feature space. We then derive the generalized probability product kernel. After that, we present how to apply the generalized probability product kernel, more specifically the linear probability product kernel and the Radial Basis Function on the probability product kernel, to real applications.

A. Kernel in Feature Space
The kernel is defined in Eq. (3). Considering a linear kernel in the feature space, we can define it as

G_L(φ(t_i), φ(t_j)) = φ(t_i)ᵀ φ(t_j).    (6)

Similarly, we can define an RBF kernel in the feature space:

G_RBF(φ(t_i), φ(t_j)) = exp{−γ ‖φ(t_i) − φ(t_j)‖²}.    (7)

We can also extend the kernel in the feature space to general forms via other functions, e.g., the polynomial kernel and the hyperbolic tangent kernel [18]. This yields generalized kernels in the feature space.

B. Probability Product Kernel
Here we still need to define the inner product of two vectors in the feature space. Considering the property of the generative models we have obtained, we turn to the probability product kernel [8]. The probability product kernel defines the similarity between two distributions p and p′ by

K(p, p′) = ∫_{R^d} p(z)^ρ p′(z)^ρ dz,    (8)

where K(p, p′) is positive definite and the exponent ρ derives a set of candidate kernels. When ρ = 1, it leads to the expected likelihood kernel [8]. When p and p′ are both Gaussian distributions, i.e., p = P·p(z | μ, Σ) and p′ = P′·p(z | μ′, Σ′), K(p, p′) can be written as a function of the two generative models, i.e., K(c, c′). Further, K(c, c′) can be computed directly from the parameters of the two generative models, which avoids integrating the probability distributions over the entire input space. Hence, when ρ = 1, we have

K(c, c′) = φ(t_i)ᵀ φ(t_j)
         = P P′ (2π)^{−d/2} |Σ̃|^{1/2} |Σ|^{−1/2} |Σ′|^{−1/2} exp{ −(1/2)( μᵀ Σ^{−1} μ + μ′ᵀ Σ′^{−1} μ′ − μ̃ᵀ Σ̃ μ̃ ) },    (9)

where Σ̃ = (Σ^{−1} + Σ′^{−1})^{−1} and μ̃ = Σ^{−1} μ + Σ′^{−1} μ′.

C. Practical Solution
For real classification problems, in order to avoid computing the inverse matrices in Eq. (9), we simplify the kernel calculation by using only the diagonal entries of the covariance matrices, i.e., Σ = diag(σ²_1, ..., σ²_d). Thus, the kernel becomes

K(c, c′) = φ(c)ᵀ φ(c′) = P P′ ∏_{i=1}^{d} (2π(σ²_i + σ′²_i))^{−1/2} exp{ −(1/2) Σ_{i=1}^{d} (μ_i − μ′_i)² / (σ²_i + σ′²_i) }.    (10)
In the test phase, a test sample z is considered as the extreme case of a Gaussian distribution, with only one point in the distribution, a fixed prior, and all elements of the covariance matrix vanishing, i.e., c_z = (P_z = 1, μ_z = z, Σ_z = 0). Hence, the similarity between a training cluster and a test vector is defined by

K(c_z, c′) = φ(c_z)ᵀ φ(c′) = P′ ∏_{i=1}^{d} (2πσ′²_i)^{−1/2} exp{ −(1/2) Σ_{i=1}^{d} (z_i − μ′_i)² / σ′²_i }.    (11)

We term the calculations of Eq. (10) and Eq. (11) the linear probability product kernel (LPPK), for calculating the similarity between clusters in the training phase and between a cluster and a test vector in the test phase, respectively. After plugging Eq. (10) and Eq. (11) into the RBF form of Eq. (7), we obtain the Radial Basis Function on the probability product kernel (RBF-PPK). Similarly, by defining other measurements in the feature space, e.g., polynomial or hyperbolic tangent functions, we can extend this to a generalized probability product kernel.

Comparing the linear kernel in Eq. (6) and the RBF kernel in Eq. (7), we can see that the RBF kernel has the advantage of taking unit value when the two inputs are the same, while the linear kernel has no such property. In real applications, data usually lie in a high-dimensional space. This makes the calculated values of Eq. (10) and Eq. (7) very small and of different scales, which makes it difficult to tune parameters to obtain good results. The RBF-PPK, however, normalizes the kernel matrix and hence avoids these problems of the linear PPK.

IV. EXPERIMENTS
We carry out experiments on a two-class toy dataset and two benchmark datasets to demonstrate the effectiveness of MCPM. In the toy dataset, 2,000 data points, 1,000 points for each class, were randomly generated from a mixture of Gaussian distributions in order to visualize the learning results of the MCPM in the 2-D space. The real datasets used are two benchmark binary classification datasets, the Pima Indians diabetes dataset and the Twonorm dataset, from the machine learning repository [1], [11], [7]. The Pima Indians diabetes dataset consists of 768 instances with 8 attributes. The Twonorm dataset, consisting of 7,400 samples with 20 attributes, was generated from a multivariate normal distribution [1], [11].

A. Toy Data

B. Experimental Setup and Model Selection
In the experiments, each dataset is partitioned into 90% training and 10% test sets. The final results are averages over 10 random partitions. Comparisons are performed on MCPM, the Support Clustering Machine (SCM) [12] (both with LPPK and RBF-PPK), and the MPM. For fair comparison, we adopt the Threshold Order Dependent (TOD) algorithm [3], the same clustering method as used in [12]. The experiments are performed on a PC with a 2.13 GHz Intel Core 2 CPU and 1 GB RAM. We use Matlab 7.1 to conduct the comparisons. Several parameters need to be tuned in training the different models. For the kernelized MPM, we use the Gaussian kernel, exp(−‖x − y‖²/σ), with parameter σ. For SCM, the parameters are the trade-off parameter C and the width parameter γ when the RBF-PPK is used. For the MCPM with RBF-PPK, only the width parameter γ needs to be tuned. All these parameters are chosen via cross-validation on the training dataset.

The 2,000 data points, generated from a mixture of Gaussian distributions, are plotted in Fig. 1(a). The TOD algorithm is applied to group the x-class data into 15 positive clusters and the y-class samples into 15 negative clusters. As shown in Fig. 1(b), the training clusters are denoted by ellipses, whose sizes are proportional to the cluster weights. In the experiment, the obtained weights (priors), means, and covariance matrices of the training clusters are used as the input for SCM and MCPM. Table I reports the average training time, test time, and accuracies obtained by the kernelized MPM, SCM, and MCPM with LPPK and RBF-PPK. We can see that the time cost of MCPM is largely reduced compared to that of MPM, while the accuracy is only slightly decreased. In particular, the training time of the kernelized MPM is reduced by over 4,500 times, and there is an over 30 times reduction in the test time.
The training and test times of SCM and MCPM are nearly the same for LPPK and RBF-PPK, while SCM and MCPM with RBF-PPK outperform SCM and MCPM with LPPK, respectively, in terms of accuracy. This shows that the generalized non-linear probability product kernel, i.e., the RBF-PPK, is superior to the linear PPK on the toy data. Moreover, different from SCM, both MPM and MCPM can generate an explicit worst-case accuracy bound α. Furthermore, the bound of MCPM with RBF-PPK is tighter than those of MPM and of MCPM with LPPK. This again demonstrates the superiority of the MCPM using the generalized PPK over the other methods.

C. Benchmark Datasets
In this section we compare the performance of the proposed MCPM with the other methods on two benchmark datasets. Table II reports the average training time, test time, worst-case accuracy bounds, and accuracies on the Pima Indians diabetes dataset, while Table III reports the average results on the Twonorm dataset. From Table II and Table III, we have the following observations:
- Although the kernelized MPM has better accuracy than the linear MPM, it costs too much time in the training procedure. The MCPM overcomes this shortcoming of the kernelized MPM: it reduces the training time largely, an over 10,000 times reduction, while maintaining an accuracy comparable to that of the kernelized MPM on both datasets.
- The MCPM can also output a worst-case accuracy bound α. The bound is once again tighter than that of the MPM.
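The bound α reported in the tables can be made concrete on a small example. The sketch below solves the linear-MPM problem of Eq. (2) with a general-purpose solver rather than the dedicated SOCP machinery of [13], [15], and converts the optimal objective value into the worst-case bound via the standard MPM relation α = κ²/(1 + κ²), with κ the reciprocal of that optimal value [11]; the two-class data here are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (200, 2))            # class x samples (hypothetical)
Y = rng.normal(0.0, 1.0, (200, 2)) + 4.0      # class y samples (hypothetical)
x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
Sx_half = np.linalg.cholesky(np.cov(X, rowvar=False))  # Sigma_x^{1/2} factor
Sy_half = np.linalg.cholesky(np.cov(Y, rowvar=False))  # Sigma_y^{1/2} factor

# Eq. (2): min_a ||Sigma_x^{1/2} a|| + ||Sigma_y^{1/2} a||  s.t.  a^T(x_bar - y_bar) = 1.
obj = lambda a: np.linalg.norm(Sx_half.T @ a) + np.linalg.norm(Sy_half.T @ a)
con = {"type": "eq", "fun": lambda a: a @ (x_bar - y_bar) - 1.0}
res = minimize(obj, x0=np.full(2, 0.1), constraints=[con])

a = res.x
kappa = 1.0 / res.fun                         # kappa = 1 / optimal objective value
alpha = kappa ** 2 / (1.0 + kappa ** 2)       # worst-case accuracy bound (cf. [11])
b = a @ x_bar - kappa * np.linalg.norm(Sx_half.T @ a)  # bias: a^T z >= b -> class x
print(0.0 < alpha < 1.0, a @ x_bar > b > a @ y_bar)    # True True
```

With well-separated classes, α comes out close to 1, matching the intuition that the worst-case error of the separating hyperplane is small.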

[Fig. 1 appears here: (a) samples; (b) toy-data classification results of SCM, MPM and MCPM.]
Fig. 1. Kernelized MPM, SCM and MCPM with LPPK and RBF-PPK in a 2-D space. Training data are indicated with blue '+'s for class x and red marks for class y. Test samples are indicated with black marks for class x and green 'o's for class y. The training clusters are represented by ellipses with size proportional to the priors: blue ellipses for class c_x and red ellipses for class c_y. The decision boundaries constructed by the SCM with LPPK (thin green dotted line), the SCM with RBF-PPK (thick red dash-dot line), the MPM (thin blue solid line), the MCPM with LPPK (thick magenta solid line), and the MCPM with RBF-PPK (thick black dashed line) are shown. Notice that SCM and MCPM with RBF-PPK improve the test-set performance compared to SCM and MCPM with LPPK.

TABLE I
AVERAGE RESULTS ON THE SYNTHETIC DATASET.
(Methods compared: SCM (LPPK), SCM (RBF-PPK), MPM, MCPM (LPPK), MCPM (RBF-PPK); columns: Training (s), Test (s), α (%), Accuracy (%).)

TABLE II
AVERAGE RESULTS ON THE PIMA INDIANS DIABETES DATASET.
(Methods compared: SCM (LPPK), SCM (RBF-PPK), MPM (Linear), MPM (Kernel), MCPM (LPPK), MCPM (RBF-PPK); columns: Training (s), Test (s), α (%), Accuracy (%).)

- The proposed generalized probability product kernel, i.e., the RBF-PPK, can largely improve the accuracy of the traditional probability product kernel, i.e., the LPPK, in both SCM and MCPM.

The above observations validate the advantages of our proposed method and show that both the training and the test time can be reduced greatly by our MCPM method while the accuracy is maintained. Moreover, the proposed generalized probability product kernel can deliver better accuracies than the traditional probability product kernel.
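To make the comparison concrete, here is a minimal sketch of both kernels for diagonal-Gaussian clusters c = (P, μ, diag σ²): the LPPK of Eq. (10) in closed form, and the RBF-PPK of Eq. (7) expanded through the kernel trick ‖φ(c) − φ(c′)‖² = K(c, c) + K(c′, c′) − 2K(c, c′); the example clusters are hypothetical.

```python
import numpy as np

def lppk(c, c2):
    """Linear probability product kernel, Eq. (10), for c = (P, mu, var)."""
    P, mu, var = c
    P2, mu2, var2 = c2
    s = var + var2
    return P * P2 * np.prod(1.0 / np.sqrt(2.0 * np.pi * s)) \
               * np.exp(-0.5 * np.sum((mu - mu2) ** 2 / s))

def rbf_ppk(c, c2, gamma=1.0):
    """RBF on the probability product kernel, Eq. (7), via the kernel trick."""
    sq_dist = lppk(c, c) + lppk(c2, c2) - 2.0 * lppk(c, c2)
    return np.exp(-gamma * sq_dist)

c_x = (0.5, np.zeros(2), np.ones(2))
c_y = (0.5, np.full(2, 3.0), np.ones(2))
# A test point as a degenerate cluster (Eq. (11)); a tiny variance stands in
# numerically for Sigma_z = 0.
c_z = (1.0, np.array([0.1, -0.2]), np.full(2, 1e-9))

print(rbf_ppk(c_x, c_x))                 # unit self-similarity: 1.0
print(lppk(c_x, c_z) > lppk(c_y, c_z))   # z lies nearer the c_x cluster: True
```

The unit self-similarity of the RBF-PPK is exactly the normalization property the text credits for avoiding the scaling problems of the linear PPK.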
In order to examine the performance when different cluster numbers are chosen, we show the average test error rates of SCM and MCPM using LPPK and RBF-PPK with respect to the number of training clusters in Fig. 2. From this figure, we have the following observations. First, the best results of SCM and MCPM with LPPK and RBF-PPK are obtained at different numbers of training clusters. This shows that the number of clusters can indeed influence the overall accuracy; to obtain the best performance, this parameter may need to be chosen carefully. Second, in all cases, the learning algorithms using the RBF-PPK consistently outperform those using the LPPK. This clearly demonstrates the advantages of the proposed non-linear generalized PPK.

V. CONCLUSION
In this paper, we have proposed an efficient Minimax Clustering Probability Machine model. This model elegantly incorporates cluster information into the learning process so as to greatly reduce the training and test time complexity. We have also proposed a generalized probability product kernel, which has demonstrated desirable properties in measuring the similarity defined either between a pair of clusters or between a cluster and a test vector. Experimental results on both synthetic and real data show that the proposed algorithm can reduce the training and test time significantly while preserving the accuracy. Moreover, the proposed generalized probability product kernel has been shown to consistently outperform the traditional linear probability product kernel.

TABLE III
AVERAGE RESULTS ON THE TWONORM DATASET.
(Methods compared: SCM (LPPK), SCM (RBF-PPK), MPM (Linear), MPM (Kernel), MCPM (LPPK), MCPM (RBF-PPK); columns: Training (s), Test (s), α (%), Accuracy (%).)

[Fig. 2 appears here: error rates of SCM (LPPK), SCM (RBF-PPK), MCPM (LPPK), and MCPM (RBF-PPK) versus the number of clusters; (a) Pima dataset, (b) Twonorm dataset.]
Fig. 2. Average error rates of SCM and MCPM with respect to the number of clusters on the Pima dataset and the Twonorm dataset.

Several important issues deserve our attention in the future. First, the clustering and the classifier learning are currently implemented as two separate steps; it remains interesting whether these two steps can be unified into one. Second, although both theoretical justification and empirical verification have demonstrated the advantages of the proposed generalized probability product kernel, further exploration of its mathematical properties is still an important research topic. Third, for simplicity we mainly evaluate our algorithm on two-class data in this paper; extensive investigations on large-scale multi-class real data are also necessary. Finally, how to choose the optimal cluster number is also an important future research topic.

REFERENCES
[1] L. Breiman. Arcing classifiers. Technical Report 460, Statistics Department, University of California.
[2] R. Collobert and S. Bengio. SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1.
[3] M. Friedman and A. Kandel. Introduction to Pattern Recognition: Statistical, Structural, Neural, and Fuzzy Logic Approaches. World Scientific, Singapore.
[4] H. P. Graf, E. Cosatto, L. Bottou, I. Durdanovic, and V. Vapnik. Parallel support vector machines: The cascade SVM. In NIPS 17.
[5] K. Huang, H. Yang, I. King, and M. R. Lyu. Learning classifiers from imbalanced data based on biased minimax probability machine. In CVPR 2004, volume 2.
[6] K. Huang, H. Yang, I. King, and M. R. Lyu. Maximizing sensitivity in medical diagnosis using biased minimax probability machine. IEEE Transactions on Biomedical Engineering, 53.
[7] K. Huang, H. Yang, I. King, M. R. Lyu, and L. Chan. The minimum error minimax probability machine. Journal of Machine Learning Research, 5.
[8] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. Journal of Machine Learning Research, Special Topic on Learning Theory.
[9] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3).
[10] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan. Minimax probability machine. In NIPS 15.
[11] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3.
[12] B. Li, M. Chi, J. Fan, and X. Xue. Support cluster machine. In ICML '07.
[13] M. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret. Applications of second order cone programming. Linear Algebra and its Applications, 284.
[14] A. W. Marshall and I. Olkin. Multivariate Chebyshev inequalities. Annals of Mathematical Statistics, 31(4).
[15] Y. Nesterov and A. Nemirovsky. Interior Point Polynomial Methods in Convex Programming: Theory and Applications. Studies in Applied Mathematics, Philadelphia.
[16] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning. MIT Press.
[17] I. Popescu and D. Bertsimas. Optimal inequalities in probability theory: A convex optimization approach. Technical Report TM62, INSEAD.
[18] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA.
[19] T. Strohmann and G. Grudic. A formulation for minimax probability machine regression. In S. Becker, S. Thrun, and K. Obermayer, editors, NIPS 15. MIT Press.
[20] H. Yu, J. Yang, and J. Han. Classifying large data sets using SVMs with hierarchical clusters. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[21] J. Yuan, J. Li, and B. Zhang. Learning concepts from large scale imbalanced data sets using support cluster machines. In ACM Multimedia.


More information

SVM-based Supervised and Unsupervised Classification Schemes

SVM-based Supervised and Unsupervised Classification Schemes SVM-based Supervised and Unsupervised Cassification Schemes LUMINITA STATE University of Pitesti Facuty of Mathematics and Computer Science 1 Targu din Vae St., Pitesti 110040 ROMANIA state@cicknet.ro

More information

Adaptive Regularization for Transductive Support Vector Machine

Adaptive Regularization for Transductive Support Vector Machine Adaptive Reguarization for Transductive Support Vector Machine Zengin Xu Custer MMCI Saarand Univ. & MPI INF Saarbrucken, Germany zxu@mpi-inf.mpg.de Rong Jin Computer Sci. & Eng. Michigan State Univ. East

More information

II. PROBLEM. A. Description. For the space of audio signals

II. PROBLEM. A. Description. For the space of audio signals CS229 - Fina Report Speech Recording based Language Recognition (Natura Language) Leopod Cambier - cambier; Matan Leibovich - matane; Cindy Orozco Bohorquez - orozcocc ABSTRACT We construct a rea time

More information

Moreau-Yosida Regularization for Grouped Tree Structure Learning

Moreau-Yosida Regularization for Grouped Tree Structure Learning Moreau-Yosida Reguarization for Grouped Tree Structure Learning Jun Liu Computer Science and Engineering Arizona State University J.Liu@asu.edu Jieping Ye Computer Science and Engineering Arizona State

More information

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron Neura Information Processing - Letters and Reviews Vo. 5, No. 2, November 2004 LETTER A Soution to the 4-bit Parity Probem with a Singe Quaternary Neuron Tohru Nitta Nationa Institute of Advanced Industria

More information

From Margins to Probabilities in Multiclass Learning Problems

From Margins to Probabilities in Multiclass Learning Problems From Margins to Probabiities in Muticass Learning Probems Andrea Passerini and Massimiiano Ponti 2 and Paoo Frasconi 3 Abstract. We study the probem of muticass cassification within the framework of error

More information

Multilayer Kerceptron

Multilayer Kerceptron Mutiayer Kerceptron Zotán Szabó, András Lőrincz Department of Information Systems, Facuty of Informatics Eötvös Loránd University Pázmány Péter sétány 1/C H-1117, Budapest, Hungary e-mai: szzoi@csetehu,

More information

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones ASummaryofGaussianProcesses Coryn A.L. Baier-Jones Cavendish Laboratory University of Cambridge caj@mrao.cam.ac.uk Introduction A genera prediction probem can be posed as foows. We consider that the variabe

More information

Evolutionary Product-Unit Neural Networks for Classification 1

Evolutionary Product-Unit Neural Networks for Classification 1 Evoutionary Product-Unit Neura Networs for Cassification F.. Martínez-Estudio, C. Hervás-Martínez, P. A. Gutiérrez Peña A. C. Martínez-Estudio and S. Ventura-Soto Department of Management and Quantitative

More information

Lecture Note 3: Stationary Iterative Methods

Lecture Note 3: Stationary Iterative Methods MATH 5330: Computationa Methods of Linear Agebra Lecture Note 3: Stationary Iterative Methods Xianyi Zeng Department of Mathematica Sciences, UTEP Stationary Iterative Methods The Gaussian eimination (or

More information

(This is a sample cover image for this issue. The actual cover is not yet available at this time.)

(This is a sample cover image for this issue. The actual cover is not yet available at this time.) (This is a sampe cover image for this issue The actua cover is not yet avaiabe at this time) This artice appeared in a journa pubished by Esevier The attached copy is furnished to the author for interna

More information

Algorithms to solve massively under-defined systems of multivariate quadratic equations

Algorithms to solve massively under-defined systems of multivariate quadratic equations Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations

More information

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7 6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17 Soution 7 Probem 1: Generating Random Variabes Each part of this probem requires impementation in MATLAB. For the

More information

Gauss Law. 2. Gauss s Law: connects charge and field 3. Applications of Gauss s Law

Gauss Law. 2. Gauss s Law: connects charge and field 3. Applications of Gauss s Law Gauss Law 1. Review on 1) Couomb s Law (charge and force) 2) Eectric Fied (fied and force) 2. Gauss s Law: connects charge and fied 3. Appications of Gauss s Law Couomb s Law and Eectric Fied Couomb s

More information

Kernel Matching Pursuit

Kernel Matching Pursuit Kerne Matching Pursuit Pasca Vincent and Yoshua Bengio Dept. IRO, Université demontréa C.P. 6128, Montrea, Qc, H3C 3J7, Canada {vincentp,bengioy}@iro.umontrea.ca Technica Report #1179 Département d Informatique

More information

PARSIMONIOUS VARIATIONAL-BAYES MIXTURE AGGREGATION WITH A POISSON PRIOR. Pierrick Bruneau, Marc Gelgon and Fabien Picarougne

PARSIMONIOUS VARIATIONAL-BAYES MIXTURE AGGREGATION WITH A POISSON PRIOR. Pierrick Bruneau, Marc Gelgon and Fabien Picarougne 17th European Signa Processing Conference (EUSIPCO 2009) Gasgow, Scotand, August 24-28, 2009 PARSIMONIOUS VARIATIONAL-BAYES MIXTURE AGGREGATION WITH A POISSON PRIOR Pierric Bruneau, Marc Gegon and Fabien

More information

Multicategory Classification by Support Vector Machines

Multicategory Classification by Support Vector Machines Muticategory Cassification by Support Vector Machines Erin J Bredensteiner Department of Mathematics University of Evansvie 800 Lincon Avenue Evansvie, Indiana 47722 eb6@evansvieedu Kristin P Bennett Department

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA ON THE SYMMETRY OF THE POWER INE CHANNE T.C. Banwe, S. Gai {bct, sgai}@research.tecordia.com Tecordia Technoogies, Inc., 445 South Street, Morristown, NJ 07960, USA Abstract The indoor power ine network

More information

Research Article On the Lower Bound for the Number of Real Roots of a Random Algebraic Equation

Research Article On the Lower Bound for the Number of Real Roots of a Random Algebraic Equation Appied Mathematics and Stochastic Anaysis Voume 007, Artice ID 74191, 8 pages doi:10.1155/007/74191 Research Artice On the Lower Bound for the Number of Rea Roots of a Random Agebraic Equation Takashi

More information

Data Mining Technology for Failure Prognostic of Avionics

Data Mining Technology for Failure Prognostic of Avionics IEEE Transactions on Aerospace and Eectronic Systems. Voume 38, #, pp.388-403, 00. Data Mining Technoogy for Faiure Prognostic of Avionics V.A. Skormin, Binghamton University, Binghamton, NY, 1390, USA

More information

CURRENT patent classification mainly relies on human

CURRENT patent classification mainly relies on human Large-Scae Patent Cassification with in-ax oduar Support Vector achines Xiao-Lei Chu, Chao a, Jing Li, Bao-Liang Lu Senior ember, IEEE, asao Utiyama, and Hitoshi Isahara Abstract Patent cassification is

More information

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1 Inductive Bias: How to generaize on nove data CS 478 - Inductive Bias 1 Overfitting Noise vs. Exceptions CS 478 - Inductive Bias 2 Non-Linear Tasks Linear Regression wi not generaize we to the task beow

More information

FORECASTING TELECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS

FORECASTING TELECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS FORECASTING TEECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODES Niesh Subhash naawade a, Mrs. Meenakshi Pawar b a SVERI's Coege of Engineering, Pandharpur. nieshsubhash15@gmai.com

More information

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with?

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with? Bayesian Learning A powerfu and growing approach in machine earning We use it in our own decision making a the time You hear a which which coud equay be Thanks or Tanks, which woud you go with? Combine

More information

Two view learning: SVM-2K, Theory and Practice

Two view learning: SVM-2K, Theory and Practice Two view earning: SVM-2K, Theory and Practice Jason D.R. Farquhar jdrf99r@ecs.soton.ac.uk Hongying Meng hongying@cs.york.ac.uk David R. Hardoon drh@ecs.soton.ac.uk John Shawe-Tayor jst@ecs.soton.ac.uk

More information

Active Learning & Experimental Design

Active Learning & Experimental Design Active Learning & Experimenta Design Danie Ting Heaviy modified, of course, by Lye Ungar Origina Sides by Barbara Engehardt and Aex Shyr Lye Ungar, University of Pennsyvania Motivation u Data coection

More information

Appendix of the Paper The Role of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Model

Appendix of the Paper The Role of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Model Appendix of the Paper The Roe of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Mode Caio Ameida cameida@fgv.br José Vicente jose.vaentim@bcb.gov.br June 008 1 Introduction In this

More information

Paragraph Topic Classification

Paragraph Topic Classification Paragraph Topic Cassification Eugene Nho Graduate Schoo of Business Stanford University Stanford, CA 94305 enho@stanford.edu Edward Ng Department of Eectrica Engineering Stanford University Stanford, CA

More information

arxiv: v1 [cs.db] 1 Aug 2012

arxiv: v1 [cs.db] 1 Aug 2012 Functiona Mechanism: Regression Anaysis under Differentia Privacy arxiv:208.029v [cs.db] Aug 202 Jun Zhang Zhenjie Zhang 2 Xiaokui Xiao Yin Yang 2 Marianne Winsett 2,3 ABSTRACT Schoo of Computer Engineering

More information

Cryptanalysis of PKP: A New Approach

Cryptanalysis of PKP: A New Approach Cryptanaysis of PKP: A New Approach Éiane Jaumes and Antoine Joux DCSSI 18, rue du Dr. Zamenhoff F-92131 Issy-es-Mx Cedex France eiane.jaumes@wanadoo.fr Antoine.Joux@ens.fr Abstract. Quite recenty, in

More information

A proposed nonparametric mixture density estimation using B-spline functions

A proposed nonparametric mixture density estimation using B-spline functions A proposed nonparametric mixture density estimation using B-spine functions Atizez Hadrich a,b, Mourad Zribi a, Afif Masmoudi b a Laboratoire d Informatique Signa et Image de a Côte d Opae (LISIC-EA 4491),

More information

Asynchronous Control for Coupled Markov Decision Systems

Asynchronous Control for Coupled Markov Decision Systems INFORMATION THEORY WORKSHOP (ITW) 22 Asynchronous Contro for Couped Marov Decision Systems Michae J. Neey University of Southern Caifornia Abstract This paper considers optima contro for a coection of

More information

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION Hsiao-Chang Chen Dept. of Systems Engineering University of Pennsyvania Phiadephia, PA 904-635, U.S.A. Chun-Hung Chen

More information

arxiv: v1 [cs.lg] 31 Oct 2017

arxiv: v1 [cs.lg] 31 Oct 2017 ACCELERATED SPARSE SUBSPACE CLUSTERING Abofaz Hashemi and Haris Vikao Department of Eectrica and Computer Engineering, University of Texas at Austin, Austin, TX, USA arxiv:7.26v [cs.lg] 3 Oct 27 ABSTRACT

More information

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract Stochastic Compement Anaysis of Muti-Server Threshod Queues with Hysteresis John C.S. Lui The Dept. of Computer Science & Engineering The Chinese University of Hong Kong Leana Goubchik Dept. of Computer

More information

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance Send Orders for Reprints to reprints@benthamscience.ae 340 The Open Cybernetics & Systemics Journa, 015, 9, 340-344 Open Access Research of Data Fusion Method of Muti-Sensor Based on Correation Coefficient

More information

https://doi.org/ /epjconf/

https://doi.org/ /epjconf/ HOW TO APPLY THE OPTIMAL ESTIMATION METHOD TO YOUR LIDAR MEASUREMENTS FOR IMPROVED RETRIEVALS OF TEMPERATURE AND COMPOSITION R. J. Sica 1,2,*, A. Haefee 2,1, A. Jaai 1, S. Gamage 1 and G. Farhani 1 1 Department

More information

Soft Clustering on Graphs

Soft Clustering on Graphs Soft Custering on Graphs Kai Yu 1, Shipeng Yu 2, Voker Tresp 1 1 Siemens AG, Corporate Technoogy 2 Institute for Computer Science, University of Munich kai.yu@siemens.com, voker.tresp@siemens.com spyu@dbs.informatik.uni-muenchen.de

More information

Appendix for Stochastic Gradient Monomial Gamma Sampler

Appendix for Stochastic Gradient Monomial Gamma Sampler 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 3 3 33 34 35 36 37 38 39 4 4 4 43 44 45 46 47 48 49 5 5 5 53 54 Appendix for Stochastic Gradient Monomia Gamma Samper A The Main Theorem We provide the foowing

More information

Determining The Degree of Generalization Using An Incremental Learning Algorithm

Determining The Degree of Generalization Using An Incremental Learning Algorithm Determining The Degree of Generaization Using An Incrementa Learning Agorithm Pabo Zegers Facutad de Ingeniería, Universidad de os Andes San Caros de Apoquindo 22, Las Condes, Santiago, Chie pzegers@uandes.c

More information

Research on liquid sloshing performance in vane type tank under microgravity

Research on liquid sloshing performance in vane type tank under microgravity IOP Conference Series: Materias Science and Engineering PAPER OPEN ACCESS Research on iquid soshing performance in vane type tan under microgravity Reated content - Numerica simuation of fuid fow in the

More information

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees Improving the Accuracy of Booean Tomography by Expoiting Path Congestion Degrees Zhiyong Zhang, Gaoei Fei, Fucai Yu, Guangmin Hu Schoo of Communication and Information Engineering, University of Eectronic

More information

$, (2.1) n="# #. (2.2)

$, (2.1) n=# #. (2.2) Chapter. Eectrostatic II Notes: Most of the materia presented in this chapter is taken from Jackson, Chap.,, and 4, and Di Bartoo, Chap... Mathematica Considerations.. The Fourier series and the Fourier

More information

Chapter 2 Multi-Class Support Vector Machine

Chapter 2 Multi-Class Support Vector Machine hapter Muti-ass Support Vector Machine Zhe Wang and Xiangyang Xue Abstract Support vector machine (SVM) was initiay designed for binary cassification. To extend SVM to the muti-cass scenario, a number

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Schoo of Computer Science Probabiistic Graphica Modes Gaussian graphica modes and Ising modes: modeing networks Eric Xing Lecture 0, February 0, 07 Reading: See cass website Eric Xing @ CMU, 005-07 Network

More information

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries c 26 Noninear Phenomena in Compex Systems First-Order Corrections to Gutzwier s Trace Formua for Systems with Discrete Symmetries Hoger Cartarius, Jörg Main, and Günter Wunner Institut für Theoretische

More information

Learning Structural Changes of Gaussian Graphical Models in Controlled Experiments

Learning Structural Changes of Gaussian Graphical Models in Controlled Experiments Learning Structura Changes of Gaussian Graphica Modes in Controed Experiments Bai Zhang and Yue Wang Bradey Department of Eectrica and Computer Engineering Virginia Poytechnic Institute and State University

More information

Appendix for Stochastic Gradient Monomial Gamma Sampler

Appendix for Stochastic Gradient Monomial Gamma Sampler Appendix for Stochastic Gradient Monomia Gamma Samper A The Main Theorem We provide the foowing theorem to characterize the stationary distribution of the stochastic process with SDEs in (3) Theorem 3

More information

Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers

Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers Ant Coony Agorithms for Constructing Bayesian Muti-net Cassifiers Khaid M. Saama and Aex A. Freitas Schoo of Computing, University of Kent, Canterbury, UK. {kms39,a.a.freitas}@kent.ac.uk December 5, 2013

More information

Convolutional Networks 2: Training, deep convolutional networks

Convolutional Networks 2: Training, deep convolutional networks Convoutiona Networks 2: Training, deep convoutiona networks Hakan Bien Machine Learning Practica MLP Lecture 8 30 October / 6 November 2018 MLP Lecture 8 / 30 October / 6 November 2018 Convoutiona Networks

More information

Alberto Maydeu Olivares Instituto de Empresa Marketing Dept. C/Maria de Molina Madrid Spain

Alberto Maydeu Olivares Instituto de Empresa Marketing Dept. C/Maria de Molina Madrid Spain CORRECTIONS TO CLASSICAL PROCEDURES FOR ESTIMATING THURSTONE S CASE V MODEL FOR RANKING DATA Aberto Maydeu Oivares Instituto de Empresa Marketing Dept. C/Maria de Moina -5 28006 Madrid Spain Aberto.Maydeu@ie.edu

More information

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1654 March23, 1999

More information

Kernel Trick Embedded Gaussian Mixture Model

Kernel Trick Embedded Gaussian Mixture Model Kerne Trick Embedded Gaussian Mixture Mode Jingdong Wang, Jianguo Lee, and Changshui Zhang State Key Laboratory of Inteigent Technoogy and Systems Department of Automation, Tsinghua University Beijing,

More information

Distributed average consensus: Beyond the realm of linearity

Distributed average consensus: Beyond the realm of linearity Distributed average consensus: Beyond the ream of inearity Usman A. Khan, Soummya Kar, and José M. F. Moura Department of Eectrica and Computer Engineering Carnegie Meon University 5 Forbes Ave, Pittsburgh,

More information

A Brief Introduction to Markov Chains and Hidden Markov Models

A Brief Introduction to Markov Chains and Hidden Markov Models A Brief Introduction to Markov Chains and Hidden Markov Modes Aen B MacKenzie Notes for December 1, 3, &8, 2015 Discrete-Time Markov Chains You may reca that when we first introduced random processes,

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

Discriminant Analysis: A Unified Approach

Discriminant Analysis: A Unified Approach Discriminant Anaysis: A Unified Approach Peng Zhang & Jing Peng Tuane University Eectrica Engineering & Computer Science Department New Oreans, LA 708 {zhangp,jp}@eecs.tuane.edu Norbert Riede Tuane University

More information

A Sparse Covariance Function for Exact Gaussian Process Inference in Large Datasets

A Sparse Covariance Function for Exact Gaussian Process Inference in Large Datasets A Covariance Function for Exact Gaussian Process Inference in Large Datasets Arman ekumyan Austraian Centre for Fied Robotics The University of Sydney NSW 26, Austraia a.mekumyan@acfr.usyd.edu.au Fabio

More information

V.B The Cluster Expansion

V.B The Cluster Expansion V.B The Custer Expansion For short range interactions, speciay with a hard core, it is much better to repace the expansion parameter V( q ) by f(q ) = exp ( βv( q )) 1, which is obtained by summing over

More information

Fast Spectral Clustering via the Nyström Method

Fast Spectral Clustering via the Nyström Method Fast Spectra Custering via the Nyström Method Anna Choromanska, Tony Jebara, Hyungtae Kim, Mahesh Mohan 3, and Caire Monteeoni 3 Department of Eectrica Engineering, Coumbia University, NY, USA Department

More information

Target Location Estimation in Wireless Sensor Networks Using Binary Data

Target Location Estimation in Wireless Sensor Networks Using Binary Data Target Location stimation in Wireess Sensor Networks Using Binary Data Ruixin Niu and Pramod K. Varshney Department of ectrica ngineering and Computer Science Link Ha Syracuse University Syracuse, NY 344

More information

An explicit Jordan Decomposition of Companion matrices

An explicit Jordan Decomposition of Companion matrices An expicit Jordan Decomposition of Companion matrices Fermín S V Bazán Departamento de Matemática CFM UFSC 88040-900 Forianópois SC E-mai: fermin@mtmufscbr S Gratton CERFACS 42 Av Gaspard Coriois 31057

More information

Notes on Backpropagation with Cross Entropy

Notes on Backpropagation with Cross Entropy Notes on Backpropagation with Cross Entropy I-Ta ee, Dan Gowasser, Bruno Ribeiro Purue University October 3, 07. Overview This note introuces backpropagation for a common neura network muti-cass cassifier.

More information

Width of Percolation Transition in Complex Networks

Width of Percolation Transition in Complex Networks APS/23-QED Width of Percoation Transition in Compex Networs Tomer Kaisy, and Reuven Cohen 2 Minerva Center and Department of Physics, Bar-Ian University, 52900 Ramat-Gan, Israe 2 Department of Computer

More information

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION SAHAR KARIMI AND STEPHEN VAVASIS Abstract. In this paper we present a variant of the conjugate gradient (CG) agorithm in which we invoke a subspace minimization

More information

Efficient Generation of Random Bits from Finite State Markov Chains

Efficient Generation of Random Bits from Finite State Markov Chains Efficient Generation of Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

Worst Case Analysis of the Analog Circuits

Worst Case Analysis of the Analog Circuits Proceedings of the 11th WSEAS Internationa Conference on CIRCUITS, Agios Nikoaos, Crete Isand, Greece, Juy 3-5, 7 9 Worst Case Anaysis of the Anaog Circuits ELENA NICULESCU*, DORINA-MIOARA PURCARU* and

More information

Testing for the Existence of Clusters

Testing for the Existence of Clusters Testing for the Existence of Custers Caudio Fuentes and George Casea University of Forida November 13, 2008 Abstract The detection and determination of custers has been of specia interest, among researchers

More information

ORTHOGONAL MULTI-WAVELETS FROM MATRIX FACTORIZATION

ORTHOGONAL MULTI-WAVELETS FROM MATRIX FACTORIZATION J. Korean Math. Soc. 46 2009, No. 2, pp. 281 294 ORHOGONAL MLI-WAVELES FROM MARIX FACORIZAION Hongying Xiao Abstract. Accuracy of the scaing function is very crucia in waveet theory, or correspondingy,

More information

THE OUT-OF-PLANE BEHAVIOUR OF SPREAD-TOW FABRICS

THE OUT-OF-PLANE BEHAVIOUR OF SPREAD-TOW FABRICS ECCM6-6 TH EUROPEAN CONFERENCE ON COMPOSITE MATERIALS, Sevie, Spain, -6 June 04 THE OUT-OF-PLANE BEHAVIOUR OF SPREAD-TOW FABRICS M. Wysocki a,b*, M. Szpieg a, P. Heström a and F. Ohsson c a Swerea SICOMP

More information

Improving the Reliability of a Series-Parallel System Using Modified Weibull Distribution

Improving the Reliability of a Series-Parallel System Using Modified Weibull Distribution Internationa Mathematica Forum, Vo. 12, 217, no. 6, 257-269 HIKARI Ltd, www.m-hikari.com https://doi.org/1.12988/imf.217.611155 Improving the Reiabiity of a Series-Parae System Using Modified Weibu Distribution

More information

On the Goal Value of a Boolean Function

On the Goal Value of a Boolean Function On the Goa Vaue of a Booean Function Eric Bach Dept. of CS University of Wisconsin 1210 W. Dayton St. Madison, WI 53706 Lisa Heerstein Dept of CSE NYU Schoo of Engineering 2 Metrotech Center, 10th Foor

More information

STA 216 Project: Spline Approach to Discrete Survival Analysis

STA 216 Project: Spline Approach to Discrete Survival Analysis : Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing

More information

Interactive Fuzzy Programming for Two-level Nonlinear Integer Programming Problems through Genetic Algorithms

Interactive Fuzzy Programming for Two-level Nonlinear Integer Programming Problems through Genetic Algorithms Md. Abu Kaam Azad et a./asia Paciic Management Review (5) (), 7-77 Interactive Fuzzy Programming or Two-eve Noninear Integer Programming Probems through Genetic Agorithms Abstract Md. Abu Kaam Azad a,*,

More information

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents MARKOV CHAINS AND MARKOV DECISION THEORY ARINDRIMA DATTA Abstract. In this paper, we begin with a forma introduction to probabiity and expain the concept of random variabes and stochastic processes. After

More information

CONVERGENCE RATES OF COMPACTLY SUPPORTED RADIAL BASIS FUNCTION REGULARIZATION

CONVERGENCE RATES OF COMPACTLY SUPPORTED RADIAL BASIS FUNCTION REGULARIZATION Statistica Sinica 16(2006), 425-439 CONVERGENCE RATES OF COMPACTLY SUPPORTED RADIAL BASIS FUNCTION REGULARIZATION Yi Lin and Ming Yuan University of Wisconsin-Madison and Georgia Institute of Technoogy

More information

V.B The Cluster Expansion

V.B The Cluster Expansion V.B The Custer Expansion For short range interactions, speciay with a hard core, it is much better to repace the expansion parameter V( q ) by f( q ) = exp ( βv( q )), which is obtained by summing over

More information

Partial permutation decoding for MacDonald codes

Partial permutation decoding for MacDonald codes Partia permutation decoding for MacDonad codes J.D. Key Department of Mathematics and Appied Mathematics University of the Western Cape 7535 Bevie, South Africa P. Seneviratne Department of Mathematics

More information

On the Statistical Consistency of Algorithms for Binary Classification under Class Imbalance

On the Statistical Consistency of Algorithms for Binary Classification under Class Imbalance On the Statistica Consistency of Agorithms for Binary Cassification under Cass Imbaance Aditya Krishna Menon University of Caifornia, San iego, La Joa CA 92093, USA Harikrishna Narasimhan Shivani Agarwa

More information

On generalized quantum Turing machine and its language classes

On generalized quantum Turing machine and its language classes Proceedings of the 11th WSEAS Internationa onference on APPLIED MATHEMATIS, Daas, Texas, USA, March -4, 007 51 On generaized quantum Turing machine and its anguage casses SATOSHI IRIYAMA Toyo University

More information

Robust Sensitivity Analysis for Linear Programming with Ellipsoidal Perturbation

Robust Sensitivity Analysis for Linear Programming with Ellipsoidal Perturbation Robust Sensitivity Anaysis for Linear Programming with Eipsoida Perturbation Ruotian Gao and Wenxun Xing Department of Mathematica Sciences Tsinghua University, Beijing, China, 100084 September 27, 2017

More information

Supervised i-vector Modeling - Theory and Applications

Supervised i-vector Modeling - Theory and Applications Supervised i-vector Modeing - Theory and Appications Shreyas Ramoji, Sriram Ganapathy Learning and Extraction of Acoustic Patterns LEAP) Lab, Eectrica Engineering, Indian Institute of Science, Bengauru,

More information