Theoretical Study on Reduced Support Vector Machines

Su-Yun Huang
Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan

Yuh-Jye Lee
Computer Science & Information Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan

Abstract

The reduced support vector machine (RSVM) was proposed with the practical objective of overcoming computational difficulties, as well as reducing model complexity, in generating a nonlinear separating surface for a massive dataset. It has been successfully applied to other kernel-based learning algorithms. There is also an experimental study on RSVM that shows its efficiency.

In this paper we study the RSVM from the viewpoint of model building and consider the nonlinear separating surface as a mixture of kernels. The RSVM uses a compressed model representation instead of a saturated full model; fitting data to a saturated model is deemed to be unstable and causes the prediction error to rise due to over-fitting. Our main result shows that uniform random selection of the reduced set in RSVM is the optimal selection scheme in terms of the following criteria: (1) it minimizes an intrinsic model variation measure; (2) it minimizes the maximal model bias between the compressed model and the full model; (3) it maximizes the minimal ability to distinguish the compressed model from the full model.

1 Introduction

In recent years support vector machines (SVMs) with linear or nonlinear kernels [1, 2, 18] have become one of the most promising learning algorithms for classification as well as regression [3, 13, 14], which are fundamental tasks in data mining [22]. Via the use of a kernel mapping, the SVM allows effective and flexible nonlinear classification. However, large-scale classification faces major computational problems because it must deal with a fully dense nonlinear kernel matrix. To overcome these problems, many approximation schemes have been proposed for approximating a symmetric positive semidefinite matrix [17, 21]. Lee and Mangasarian proposed, as an alternative, the reduced support vector machine (RSVM) [8]. The key ideas of the RSVM are as follows. RSVM randomly selects a small portion of the data set to generate a thin rectangular kernel matrix. It then uses this small rectangular kernel matrix to replace the fully dense square kernel matrix in the nonlinear SVM formulation. Computational time, as well as memory usage, is much smaller for RSVM than for a conventional SVM using the full kernel matrix.

As a result, RSVM also simplifies the characterization of the nonlinear separating surface. According to Occam's razor, as well as the Minimum Description Length (MDL) principle [15] and the numerical comparisons in [8], RSVM has better generalization ability than a conventional SVM. This reduced kernel technique has been successfully applied to other kernel-based learning algorithms, such as the proximal support vector machine (PSVM) [6] and ε-smooth support vector regression (ε-SSVR) [9]. There is also an experimental study on RSVM [11] that showed its efficiency. The RSVM results reported in [11] could be further improved if a larger tuning parameter had been properly chosen. Since the RSVM has reduced the model complexity by using a much smaller rectangular kernel matrix, we suggest using a larger tuning parameter to put more weight on the goodness of fit to the data.

In this paper we study the RSVM from the viewpoint of model building. We give the uniform randomness some optimality interpretations and show that this uniform random subset approach has an active effect on reducing model complexity. This uniform random subset selection in RSVM also has a link to the popular uniform design [4]. We briefly outline the contents of the paper. Section 2 provides the main idea and formulation of RSVM. Section 3 gives the uniform randomness an optimality interpretation in terms of a model variation measure. Section 4 further provides the uniform randomness with some minimaxity and maximinity properties. We show that the RSVM uniform design minimizes the maximal model bias between the compressed model and the full model and that it maximizes the minimal ability to distinguish the compressed model from the full one. Section 5 concludes the paper. All proofs are put in the appendix.

A word about our notation and background material is given below. All vectors will be column vectors unless otherwise specified or transposed to a row vector by a prime superscript.

For a vector $x$ in the $n$-dimensional real space $R^n$, the plus function $x_+$ is defined componentwise as $(x_+)_j = \max\{0, x_j\}$. The scalar (inner) product of two vectors $x$ and $z$ in $R^n$ will be denoted by $x'z$ and the $p$-norm of $x$ will be denoted by $\|x\|_p$. For a matrix $A \in R^{m \times n}$, $A_i$ is the $i$th row of $A$. A column vector of ones of arbitrary dimension will be denoted by $e$. For $A \in R^{m \times n}$ and $B \in R^{n \times l}$, the kernel $K(A, B)$ maps $R^{m \times n} \times R^{n \times l}$ into $R^{m \times l}$. In particular, $K(x', z)$ is a real number, $K(x', A')$ is a row vector in $R^m$, $K(A, x)$ is a column vector in $R^m$, and $K(A, A')$ is an $m \times m$ matrix. The base of the natural logarithm will be denoted by $\varepsilon$.

2 Reduced Support Vector Machines

Consider the problem of classifying points into two classes, $A^-$ and $A^+$. We are given a training data set $\{(x_i, y_i)\}_{i=1}^m$, where $x_i \in X \subseteq R^n$ is an input vector and $y_i \in \{-1, 1\}$ is a class label indicating to which of the two classes, $A^-$ and $A^+$, the input point belongs. We represent these data points by an $m \times n$ matrix $A$, where the $i$th row $A_i$ corresponds to the $i$th input data point. We use $A_i$ (a row vector) and $x_i$ (a column vector) alternately for the same $i$th data point, depending on convenience. We use an $m \times m$ diagonal matrix $D$, $D_{ii} = y_i$, to specify the class membership of each input point. The main goal of the classification problem is to find a classifier that correctly predicts the unseen class labels of new data inputs. This can be achieved by constructing a linear or nonlinear separating surface, $f(x) = 0$, which is implicitly defined by a kernel function. We classify a test point $x$ to $A^+$ if $f(x) \geq 0$, and to $A^-$ otherwise. We will focus on the nonlinear case in this paper. In a conventional SVM, as well as in many kernel-based learning algorithms [1, 2, 18], generating a nonlinear separating surface requires dealing with a fully dense kernel matrix. When training a nonlinear SVM on a massive data set, the huge and dense full kernel matrix leads to the following computational difficulties [8]:

- the size of the mathematical programming problem;
- the dependence of the nonlinear separating surface on the entire data set, which creates unwieldy storage problems that prevent the use of nonlinear kernels on anything but a small data set;
- the overfitting of the separating surface, which causes high variation and less stable fitting.

To avoid these difficulties, the RSVM uses a very small random subset of size $\tilde{m}$ of the original $m$ data points, where $\tilde{m} \ll m$, to build up the separating surface. We denote this random subset by $\tilde{A}$ and use it to generate a much smaller rectangular kernel matrix $K(A, \tilde{A}') \in R^{m \times \tilde{m}}$. The reduced kernel matrix serves to replace the huge and fully dense square matrix $K(A, A')$ used in the conventional SVM, cutting the problem size, computation time, and memory usage, as well as simplifying the characterization of the nonlinear separating surface. This reduced kernel technique can be applied to most kernel-based learning algorithms.

We now briefly describe the RSVM formulation, which is derived from the generalized support vector machine (GSVM) [12] and the smooth support vector machine (SSVM) [10]. The RSVM solves the following unconstrained minimization problem for a general kernel $K(A, \tilde{A}')$:
$$\min_{(\tilde{u}, \gamma) \in R^{\tilde{m}+1}} \ \frac{\nu}{2}\,\big\|p\big(e - D\{K(A, \tilde{A}')\tilde{D}\tilde{u} - e\gamma\}, \alpha\big)\big\|_2^2 + \frac{1}{2}(\tilde{u}'\tilde{u} + \gamma^2), \qquad (1)$$
where the function $p(x, \alpha)$ is a very accurate smooth approximation to $(x)_+$ [10], applied componentwise to the vector $e - D\{K(A, \tilde{A}')\tilde{D}\tilde{u} - e\gamma\}$ and defined componentwise by
$$\{\text{the } j\text{th component of } p(x, \alpha)\} = x_j + \frac{1}{\alpha}\log(1 + \varepsilon^{-\alpha x_j}), \quad \alpha > 0, \ j = 1, \ldots, n. \qquad (2)$$
The function $p(x, \alpha)$ converges to $(x)_+$ as $\alpha$ goes to infinity. The positive tuning parameter $\nu$ here controls the tradeoff between the classification error and the suppression of $(\tilde{u}, \gamma)$.
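To make the smoothing step in (2) concrete, the following minimal Python sketch (our illustration, not the authors' code; the grid of α values is arbitrary) evaluates p(x, α) componentwise and checks numerically that it approaches (x)_+ as α grows.

```python
import numpy as np

def smooth_plus(x, alpha):
    """Smooth approximation p(x, alpha) to the plus function (x)_+,
    applied componentwise: x + (1/alpha) * log(1 + exp(-alpha * x))."""
    # np.logaddexp(0, -alpha * x) computes log(1 + exp(-alpha * x)) stably.
    return x + np.logaddexp(0.0, -alpha * x) / alpha

x = np.linspace(-2.0, 2.0, 9)
for alpha in (1.0, 5.0, 25.0):
    gap = np.max(np.abs(smooth_plus(x, alpha) - np.maximum(x, 0.0)))
    print(f"alpha = {alpha:5.1f}  max |p(x, alpha) - (x)_+| = {gap:.6f}")
```

The largest gap is (log 2)/α, attained at x = 0, so the approximation error shrinks at rate 1/α.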

Since RSVM has essentially reduced the model complexity by using a much smaller rectangular kernel matrix, we suggest using a larger tuning parameter ν in RSVM than in a conventional SVM. The diagonal matrix $\tilde{D} \in R^{\tilde{m} \times \tilde{m}}$, with ones or minus ones along its diagonal, specifies the class membership of each point in the reduced set. If we let $\tilde{v} = \tilde{D}\tilde{u}$, i.e., $\tilde{v}_i = \tilde{D}_{ii}\tilde{u}_i$, then $\tilde{v}'\tilde{v} = \tilde{u}'\tilde{u}$ and problem (1) is equivalent to
$$\min_{(\tilde{v}, \gamma) \in R^{\tilde{m}+1}} \ \frac{\nu}{2}\,\big\|p\big(e - D\{K(A, \tilde{A}')\tilde{v} - e\gamma\}, \alpha\big)\big\|_2^2 + \frac{1}{2}(\tilde{v}'\tilde{v} + \gamma^2). \qquad (3)$$
A solution to the minimization problem (3) leads to the nonlinear separating surface
$$f(x) = K(x', \tilde{A}')\tilde{v} - \gamma = 0, \quad \text{or} \quad \sum_{i=1}^{\tilde{m}} \tilde{v}_i K(x', \tilde{A}_i') - \gamma = 0. \qquad (4)$$
The minimization problem (3) retains the strong convexity and differentiability properties in the space $R^{\tilde{m}+1}$ of $(\tilde{v}, \gamma)$ for an arbitrary rectangular kernel. Hence we can apply the Newton-Armijo method [10] directly to solve (3), and the existence and uniqueness of the optimal solution are guaranteed. Moreover, the matrix $\tilde{D}$ has disappeared from the formulation (3). This means that we do not need to record the class label of each data point in the reduced set. We note that the nonlinear separating surface (4) is a linear combination of a set of kernel functions plus a constant.

In a nutshell, the RSVM can be split into two parts. First, it selects a small random subset $\{K(\cdot, \tilde{A}_1'), K(\cdot, \tilde{A}_2'), \ldots, K(\cdot, \tilde{A}_{\tilde{m}}')\}$ from the likely highly correlated full-data atom set $\{K(\cdot, A_i')\}_{i=1}^m$ for building the separating function. The full-data atom set is inefficient for function representation, yet the conventional SVM uses it. Secondly, the RSVM determines the best coefficients of the selected kernel functions by solving the unconstrained minimization problem (3) using the entire data set, so that the resulting surface fits the whole data set well.
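For illustration, here is a compact end-to-end sketch of the reduced-kernel approach: a uniformly random reduced set, the m × m̃ rectangular Gaussian kernel, and the smooth objective (3). It is our own mock-up under stated assumptions (a Gaussian kernel and a generic quasi-Newton solver from SciPy standing in for the Newton-Armijo method of [10]); all function and variable names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, sigma):
    """K(A, B') with Gaussian entries exp(-||A_i - B_j||^2 / (2 sigma^2)).
    (The (2*pi*sigma^2)^(-n/2) factor used in Section 3 is a constant and is omitted here.)"""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def rsvm_fit(A, y, m_tilde, nu=100.0, alpha=5.0, sigma=1.0, seed=0):
    """Reduced SVM sketch: uniform random reduced set + smooth unconstrained problem (3).
    A is the m x n input matrix and y a vector of +/-1 labels."""
    m = A.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(m, size=m_tilde, replace=False)    # uniform random reduced set
    A_tilde = A[idx]
    K = gaussian_kernel(A, A_tilde, sigma)              # m x m_tilde rectangular kernel

    def objective(theta):
        v, gamma = theta[:m_tilde], theta[m_tilde]
        r = 1.0 - y * (K @ v - gamma)                   # e - D(K(A, A_tilde') v - e gamma)
        p = r + np.logaddexp(0.0, -alpha * r) / alpha   # smooth plus function, componentwise
        return 0.5 * nu * (p @ p) + 0.5 * (v @ v + gamma ** 2)

    res = minimize(objective, np.zeros(m_tilde + 1), method="BFGS")
    v, gamma = res.x[:m_tilde], res.x[m_tilde]
    # Separating surface (4): f(x) = K(x, A_tilde') v - gamma
    return lambda X: np.sign(gaussian_kernel(X, A_tilde, sigma) @ v - gamma)
```

With labels y in {−1, +1}, `predict = rsvm_fit(A, y, m_tilde=50)` returns a classifier evaluating the sign of the surface (4); only the m̃ kernel columns for the reduced set are ever formed, which is what cuts storage from the m × m full kernel to m × m̃.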

In the next section, we study the RSVM from the model-building point of view.

3 Optimal sampling design on atom set

The conventional nonlinear SVM uses a saturated representation for the separating surface:
$$f(x) = \sum_{i=1}^{m} v_i K(x', A_i') - \gamma = 0. \qquad (5)$$
It is a linear combination of the atom functions $\{1\} \cup \{K(\cdot, A_i')\}_{i=1}^m$. One may regard this collection as an overcomplete dictionary of atoms when $m$ is large or approaching infinity. The representation (5) is said to be saturated, as there are as many parameters $v_i$ as data points. Fitting data to a saturated model is deemed to be unstable and causes the generalization error to rise due to over-fitting [7]. Therefore, it is desirable to cut down the model complexity and fit a compressed model:
$$f(x) = \sum_{i=1}^{\tilde{m}} \tilde{v}_i K(x', \tilde{A}_i') - \gamma = 0. \qquad (6)$$
We call $\{K(\cdot, A_i')\}_{i=1}^m$ the full atom set based on the training data and $\{K(\cdot, \tilde{A}_i')\}_{i=1}^{\tilde{m}}$ a reduced atom set. Concerning the choice of a reduced set, a simple way is the uniform random subset approach used in RSVM. The RSVM randomly selects a small portion of atom functions from the full set to generate the compressed model, while it fits the entire data set to this compressed model. Each candidate atom in the full set has an equal chance of being selected. This uniform random subset selection is simple and straightforward, but is it the most effective way to build up a compressed model? There have been experimental results showing its effectiveness for modeling classification surfaces [8, 11] as well as regression surfaces [9]. In this article, we justify and complement these experimental results with a theoretical study. We show that uniform random selection is the most effective sampling scheme in terms of a few criteria stated later. Before proceeding any further, we define some notation and state some conditions.

Let $\xi_0$ be a probability measure on $X$. Assume the training inputs follow the distribution $\xi_0$. In statistical terms, $\xi_0$ is called the training design. Let $H$ denote the reproducing kernel Hilbert space generated by the kernel $K(x, z)$. Let $\mathcal{D} := \{K(\cdot, z)\}_{z \in X}$, which is an overcomplete dictionary for $H$. Let $\xi$ be a probability measure absolutely continuous with respect to $\xi_0$, denoted by $\xi \ll \xi_0$, and let $\mathcal{P}$ be the collection of all such probability measures, i.e., $\mathcal{P} = \{\xi : \text{probability measure on } X \text{ satisfying } \xi \ll \xi_0\}$. The Radon-Nikodym derivative of $\xi$ with respect to $\xi_0$ is denoted by $p(x)$, i.e., $p(x) = d\xi(x)/d\xi_0(x)$. If a random subset of size $\tilde{m}$, $\{K(\cdot, z_j)\}_{j=1}^{\tilde{m}}$, is selected from the dictionary $\mathcal{D}$ according to $\xi$, i.e., the $z_j$ are i.i.d. from $\xi$, then we say that the distribution $\xi$ is a sampling design for atom selection.

For convenience, throughout this paper a Gaussian kernel,
$$K_\sigma(x, z) = (2\pi\sigma^2)^{-n/2} \exp\{-\|x - z\|^2/(2\sigma^2)\},$$
is used for the RSVM. However, the theoretical results obtained in this article easily extend to other kernels. We often suppress the subscript $\sigma$ in $K_\sigma$ and let $K(x|z) := K_\sigma(x, z)$. The value of $K(x|z)$ represents the inner product of the images of $x$ and $z$ in the feature space after the nonlinear mapping implicitly defined by the Gaussian kernel. We can also interpret it as a measure of similarity between $x$ and $z$. In other words, $K(A_i, A')$ records the similarity between $A_i$ and all training inputs. In particular, if we use the Gaussian kernel in the RSVM, we can interpret RSVM as an instance-based learning algorithm [15].

The reduced kernel matrix records only the similarity between the reduced set and the entire training data set. In contrast to the $k$-nearest-neighbor algorithm, which uses a simple voting strategy, RSVM uses a weighted voting strategy, where the weights and the threshold are determined by a training procedure. That is, if the weighted sum, $K(x', \tilde{A}')\tilde{v}$, is greater than a threshold, $\gamma$, the point $x$ is classified as a positive example.

The underlying separating function $f(x)$ is modelled as a mixture of kernels plus a bias term. However, as our main focus here is to assess the variability of model building due to atom sampling from the dictionary $\mathcal{D}$, the bias term $\gamma$ may, without loss of generality, be assumed zero. The separating function $f(x)$ is allowed to vary over the function class $H$, subject to the following conditions:

C1. The function $f$ is a mixture of Gaussians via the mixing distribution $\xi_0$:
$$f(x) = \int K(x|z)\,v(z)\,d\xi_0(z), \qquad (7)$$
where $v(z)$ is a coefficient function satisfying conditions C2 and C3 below.

C2. Assume $\int v^2(z)\,d\xi_0(z) < \infty$.

C3. To fix a scale for the separating function, it is assumed, without loss of generality, that $\int |v(z)|\,d\xi_0(z) = 1$.

By re-expressing the above mixture via the sampling design $\xi$, we have
$$f(x) = \int K(x|z)\,v(z)\,\frac{d\xi_0(z)}{d\xi(z)}\,d\xi(z) = \int K(x|z)\,v(z)/p(z)\,d\xi(z). \qquad (8)$$
The separating function $f(x)$ can thus be written as the expectation of $f(x, Z)$:
$$f(x) = E f(x, Z) := E\{K(x|Z)\,v(Z)/p(Z)\}, \quad Z \sim \xi,$$
where $\sim$ stands for "follows the distribution." Sampling from $f(x, Z) := K(x|Z)\,v(Z)/p(Z)$ is used to build a compressed model for the separating function. The integrated variance of $f(x, Z)$, as a measure of model variation, is given below:
$$\int \mathrm{Var}_\xi\{f(x, Z)\}\,dx = \int\!\!\int K^2(x|z)\,v^2(z)/p^2(z)\,d\xi(z)\,dx - \int f^2(x)\,dx. \qquad (9)$$
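As a quick numerical illustration of (8), the sketch below (our own construction; the choices ξ0 = N(0,1), the coefficient function v, and the exponentially tilted design are arbitrary, and the scaling in C3 is ignored) verifies by Monte Carlo that E_ξ{K(x|Z)v(Z)/p(Z)} recovers f(x) for any admissible design ξ, while the variance of f(x, Z) depends on the design.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n_mc = 0.5, 200_000

def K(x, z):                                   # Gaussian kernel K(x|z) in dimension n = 1
    return np.exp(-(x - z) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

v = lambda z: 0.5 + (z > 0)                    # an arbitrary coefficient function

# Training design xi_0 = N(0,1), and two sampling designs xi << xi_0:
#   "uniform": p(z) = 1 (xi = xi_0);  "tilted": p(z) = exp(z - 1/2), i.e. Z ~ N(1,1).
designs = {
    "uniform": (lambda n: rng.normal(0.0, 1.0, n), lambda z: np.ones_like(z)),
    "tilted":  (lambda n: rng.normal(1.0, 1.0, n), lambda z: np.exp(z - 0.5)),
}

x = 0.3
z0 = rng.normal(0.0, 1.0, n_mc)
f_ref = np.mean(K(x, z0) * v(z0))              # f(x) = int K(x|z) v(z) d xi_0(z)

for name, (draw, p) in designs.items():
    z = draw(n_mc)
    fxz = K(x, z) * v(z) / p(z)                # f(x, Z) = K(x|Z) v(Z) / p(Z)
    print(f"{name:8s}  mean {fxz.mean():.4f}  (reference f(x) {f_ref:.4f})  var {fxz.var():.4f}")
```

Both designs reproduce the same mean, i.e., the same underlying f(x), but the spread of f(x, Z) around it differs, which is exactly the design-dependent part isolated in (9).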

The integrated model variance is used to assess model robustness. The smaller it is, the more stable (less variable) the model built from $\xi$ will be. Therefore, we aim to find a good sampling design $\xi$ with a small integrated model variance. As the quantity $\int f^2(x)\,dx$ does not involve $\xi$, we may drop it in the minimization step and use the following model variation measure.

Definition 1 (Model variation) We use the following measure for the model variation due to the sampling design $\xi$:
$$V(\xi) = \int\!\!\int K^2(x|z)\,v^2(z)/p^2(z)\,d\xi(z)\,dx. \qquad (10)$$

By Lemma 1 in the Appendix, it is easy to obtain the optimal design with minimum $V(\xi)$ over the set $\mathcal{P}$; its pdf with respect to $\xi_0$ is given by
$$p(z) \propto |v(z)|\left(\int K^2(x|z)\,dx\right)^{1/2} \propto |v(z)|,$$
where the second proportionality holds because $\int K^2(x|z)\,dx$ does not depend on $z$ for the Gaussian kernel. The coefficient function $v(z)$ is not known and has yet to be estimated in the training process. Here we use a constant as a reference function to replace $v(z)$, and the resulting $p(z)$ is uniform with respect to $\xi_0$. There are two main reasons for using a constant reference function for $v(z)$. One is to indicate no prior preference for, or information on, $v(z)$ before the training process. The other is to reflect the intrinsic mechanism in the SVM that tends to minimize the 2-norm, $\int v^2(z)\,d\xi_0(z)$, of $v(z)$. Under the fixed-scale condition C3, the 2-norm of $v(z)$ is minimized when $v(z)$ is a constant function. Thus, the uniform $p(z)$ is justified. In other words, atom functions $K(\cdot, z)$ are sampled from the dictionary $\mathcal{D}$ according to the training design $\xi = \xi_0$.

Theorem 1 (Optimal sampling design) The optimal sampling design for atom selection is given by $\xi_0$. The optimality is in terms of minimizing $V(\xi)$ over $\xi \in \mathcal{P}$ prior to the training process.

This justifies the uniform sampling scheme (on training data) used in the original RSVM of Lee and Mangasarian [8].
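To make Theorem 1 concrete on a toy example: with the Gaussian kernel, ∫K²(x|z)dx is a constant c, so V(ξ) = c ∫ v²(z)/p(z) dξ0(z). The sketch below (again our own construction, with ξ0 = N(0,1), the constant reference v ≡ 1, and an arbitrary family of exponentially tilted designs) estimates V(ξ) by Monte Carlo; the uniform design (tilt a = 0) attains the smallest value, as Theorem 1 predicts.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n_mc = 0.5, 400_000
c = 1.0 / (2.0 * sigma * np.sqrt(np.pi))   # int K^2(x|z) dx for the 1-D Gaussian kernel (constant in z)

z0 = rng.normal(0.0, 1.0, n_mc)            # draws from the training design xi_0 = N(0,1)
for a in (0.0, 0.5, 1.0):                  # a = 0 is the uniform design, p(z) = 1
    p = np.exp(a * z0 - 0.5 * a ** 2)      # density of the tilted design xi_a = N(a,1) w.r.t. xi_0
    V = c * np.mean(1.0 / p)               # V(xi_a) = c * int v^2(z)/p(z) d xi_0(z) with v = 1
    print(f"tilt a = {a:3.1f}   V(xi_a) ~ {V:.4f}")
```

Analytically, V(ξ_a) = c·exp(a²) in this toy setting, so any tilt away from ξ0 inflates the model variation.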

In practice, a discrete approximation to $\xi_0$ is necessary. Naturally, we use its empirical distribution, i.e., putting probability mass $1/m$ at each training point $A_i$. Atom functions are then uniformly sampled from $\{K(\cdot, A_i')\}_{i=1}^m$. This random subset approach can drastically cut down the model complexity, while the sampling design helps to guide the atom selection in terms of minimal model variation. However, we remind the reader that the quantity $V(\xi)$ does not account for the parameter estimation variance, but purely for the model variation due to the random sampling of atoms from the dictionary $\mathcal{D}$. The resulting optimal sampling design in Theorem 1 is a uniform design with respect to $\xi_0$, which seeks atom points uniformly distributed over the training points. The uniform design has been in popular use since the 1980s; see articles [4] and [5] for a nice survey of the theory and applications of uniform design. The uniform design is a space-filling design and seeks to obtain maximal model robustness.

4 Minimaxity and maximinity

In this section we show that the above-mentioned optimal sampling design also possesses some minimaxity and maximinity properties. The ideas are inspired by [19, 20] and adapted to the context of RSVM. Recall the saturated and compressed models:
$$\text{saturated:} \quad f(x) = \sum_{i=1}^{m} v_i K(x', A_i') - \gamma, \qquad (11)$$
$$\text{compressed:} \quad f(x) = \sum_{i=1}^{\tilde{m}} \tilde{v}_i K(x', \tilde{A}_i') - \gamma. \qquad (12)$$
Using the kernel $K$ corresponds to mapping the data into a feature Hilbert space with a map $\Phi : X \to U$. In the feature space $U$, the normal vector of the SVM separating hyperplane, $w'u - \gamma = 0$, can be expanded in terms of support vectors.

The saturated model has the following full expansion for the normal vector:
$$\text{saturated:} \quad w = \sum_{i=1}^{m} v_i\,\Phi(A_i), \qquad (13)$$
while the compressed model has the reduced expansion
$$\text{compressed:} \quad w = \sum_{i=1}^{\tilde{m}} \tilde{v}_i\,\Phi(\tilde{A}_i). \qquad (14)$$
The expansions (13) and (14) are called pre-images. See [16] for approximating the full pre-image expansion by a reduced-set expansion. The error induced by the reduced-set approximation is investigated here. We define some more notation.

$F := \text{linear span}\{K(\cdot, A_i')\}_{i=1}^m$ and $R := \text{linear span}\{K(\cdot, \tilde{A}_i')\}_{i=1}^{\tilde{m}}$.

$F_\eta^- := \{g \in F : \int_X g^2(z)\,d\xi_0(z) \leq \eta^2, \ \int_X K(\tilde{A}_i, z)g(z)\,d\xi_0(z) = 0, \ i = 1, \ldots, \tilde{m}\}$.

$F_\eta^+ := \{g \in F : \int_X g^2(z)\,d\xi_0(z) \geq \eta^2, \ \int_X K(\tilde{A}_i, z)g(z)\,d\xi_0(z) = 0, \ i = 1, \ldots, \tilde{m}\}$.

$\tilde{V}_\xi := \int_X K(\tilde{A}, z)K(z', \tilde{A}')\,d\xi(z)$, which is an $\tilde{m} \times \tilde{m}$ matrix.

$B(f, \xi) := \int \{f(x) - K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)f(z)\,d\xi(z)\}^2\,d\xi_0(x)$, where $f \in F$, $\xi \in \mathcal{P}$ and $\tilde{V}_\xi^-$ is a generalized inverse of $\tilde{V}_\xi$.

We introduce below a measure of the model bias between the full model (11) and the compressed model (12). For an arbitrary function $f \in F$, the $L_2(d\xi)$-projection of $f$ onto the reduced space $R$ is given by
$$P_R f(x) = K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)f(z)\,d\xi(z).$$
We use the following integrated square bias to account for the model bias:
$$\text{model bias}(f) := \int \{f(x) - P_R f(x)\}^2\,d\xi_0 = B(f, \xi). \qquad (15)$$
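On the empirical training design used in practice (probability mass 1/m at each A_i) and with ξ = ξ0, the projection P_R f and the bias B(f, ξ0) reduce to finite-dimensional linear algebra. A minimal sketch with our own synthetic data and an f drawn from F (all names and the parameter choices are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, m, m_tilde, n = 1.0, 200, 10, 2

def gaussian_kernel(A, B):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

A = rng.normal(size=(m, n))                               # training inputs; xi_0 = empirical distribution
A_tilde = A[rng.choice(m, size=m_tilde, replace=False)]   # uniformly drawn reduced set

f_vals = gaussian_kernel(A, A) @ rng.normal(size=m) / m   # values f(A_i) of some f in F
Kr = gaussian_kernel(A, A_tilde)                          # K(A, A_tilde'), an m x m_tilde matrix

V = Kr.T @ Kr / m                                         # V_xi0 = int K(A_tilde, z) K(z', A_tilde') d xi_0(z)
c = Kr.T @ f_vals / m                                     # int K(A_tilde, z) f(z) d xi_0(z)
P_R_f = Kr @ (np.linalg.pinv(V) @ c)                      # L2(d xi_0)-projection of f onto R, at training points
B = np.mean((f_vals - P_R_f) ** 2)                        # B(f, xi_0) = int {f - P_R f}^2 d xi_0
print(f"integrated square bias B(f, xi_0) = {B:.6f}")
```

With ξ = ξ0, P_R is the least-squares projection of the values f(A_i) onto the columns of K(A, Ã'), so B(f, ξ0) is simply the residual sum of squares divided by m.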

The quantity $B(f, \xi)$ reflects the degree of inadequacy in representing $f$ by the compressed representation. To guard against the maximum possible model bias (or inadequacy), we would use a design that minimizes the maximum bias.

Theorem 2 (Minimaxity) The minimax design, which minimizes the maximum bias $B(f, \xi)$ for $f \in F_\eta^- \oplus R$, is achieved by $\xi = \xi_0$, i.e.,
$$\sup_{f \in F_\eta^- \oplus R} B(f, \xi_0) = \inf_{\xi \in \mathcal{P}} \ \sup_{f \in F_\eta^- \oplus R} B(f, \xi). \qquad (16)$$
Equivalently, the uniform design with respect to $\xi_0$ is the minimax design that minimizes the maximum model bias between the full model (11) and the compressed model (12).

The optimal sampling design, $\xi = \xi_0$, minimizes the maximum bias as the saturated model ranges over a neighborhood of the compressed model. The first condition in the definition of $F_\eta^-$ specifies a neighborhood around the compressed model, and the second condition ensures the identifiability of the parameter $\tilde{v}$ in this $F_\eta^-$-neighborhood. Below we further provide the optimal sampling design with a maximinity interpretation.

Theorem 3 (Maximinity) For $f \in F_\eta^+ \oplus R$, the maximin design for $B(f, \xi)$ is achieved by $\xi = \xi_0$, i.e.,
$$\inf_{f \in F_\eta^+ \oplus R} B(f, \xi_0) = \sup_{\xi \in \mathcal{P}} \ \inf_{f \in F_\eta^+ \oplus R} B(f, \xi). \qquad (17)$$
Equivalently, the uniform design with respect to $\xi_0$ is the maximin design.

Recall the definition of $F_\eta^+$. The first condition specifies a class that deviates from the compressed model by at least a distance $\eta$. If the true model is at least $\eta$-distance away from $R$ and falls into the class specified by $F_\eta^+$, we would hope to be able to distinguish between the true model and the compressed model in a lack-of-fit test. That is, we would hope to have as large a model bias as possible in order to have adequate ability to distinguish between the two models in a lack-of-fit test.

Theorem 3 tells us that the sampling design $\xi = \xi_0$ achieves the maximum such ability for a worst possible $f \in F_\eta^+$.

5 Conclusion

We study the RSVM from a model-building point of view. We consider the nonlinear classifier $f(x)$ as a mixture of kernels and express it as a linear combination of kernel functions sampled from the full atom set on the training data. We also interpret the RSVM as an instance-based learning algorithm. The reduced kernel matrix records the similarity between the reduced set and the entire training data set. The RSVM uses a weighted voting strategy to classify a new data point. That is, if the weighted sum $K(x', \tilde{A}')\tilde{v}$ is greater than a threshold $\gamma$, then the point $x$ is classified as a positive example, where $\tilde{v}$ and $\gamma$ are determined by solving an unconstrained minimization problem. Our main result shows that uniform random selection of the reduced set in RSVM is the optimal sampling strategy for atom kernel selection in the sense of minimizing an intrinsic model variation measure. We also give this RSVM sampling strategy a minimaxity and a maximinity interpretation in terms of model bias and test ability, respectively. This theoretical study provides a better understanding of the RSVM learning algorithm and helps to extend the random subset approach to most kernel-based learning algorithms.

6 Appendix

Lemma 1 For a given probability density function $t(z)$ with respect to $\xi_0$, let $\mathcal{P}_t$ denote the collection of probability density functions $p(z)$ such that $\int_X (t(z)/p(z))\,d\xi_0(z) < \infty$. (Here $0/0$ is defined to be zero.) Then the solution to the optimality problem
$$\arg\min_{p \in \mathcal{P}_t} \int_X (t(z)/p(z))\,d\xi_0(z)$$
is given by $p(z) = c^{-1} t^{1/2}(z)$, where $c = \int_X t^{1/2}(z)\,d\xi_0(z)$.

Proof: Since $\int_X (t(z)/p(z))\,d\xi_0(z) < \infty$ and $\int_X p(z)\,d\xi_0(z) = 1$, by the Hölder inequality we have
$$\int_X (t(z)/p(z))\,d\xi_0(z) = \int_X (t(z)/p(z))\,d\xi_0(z) \int_X p(z)\,d\xi_0(z) \geq \left(\int_X t^{1/2}(z)\,d\xi_0(z)\right)^2.$$
Equality holds if and only if there exists some nonzero constant $\beta$ such that $t(z)/p(z) = \beta\,p(z)$ a.e. Since $p(z)$ is a pdf, equality therefore holds if and only if $p(z) = c^{-1} t^{1/2}(z)$.

Lemma 2 $\sup_{g \in F_\eta^-} B(g, \xi_0) = \eta^2$.

Proof: Recall $B(g, \xi) := \int \{g(x) - K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)g(z)\,d\xi(z)\}^2\,d\xi_0(x)$. For $g \in F_\eta^-$, we have $\int K(\tilde{A}, z)g(z)\,d\xi_0(z) = 0$. Then
$$B(g, \xi_0) = \int \left\{g(x) - K(x', \tilde{A}')\tilde{V}_{\xi_0}^- \int K(\tilde{A}, z)g(z)\,d\xi_0(z)\right\}^2 d\xi_0(x) = \int g^2(x)\,d\xi_0 \leq \eta^2, \quad \text{for } g \in F_\eta^-.$$
The proof can be completed by finding a function $g_0 \in F_\eta^-$ such that the above bound holds with equality. A construction of such a function goes as follows. We first find a partition of $X$ into measurable sets, $X = \cup_{i=1}^{\tilde{m}+1} E_i$, with the $E_i$ disjoint and $\xi_0(E_i) > 0$. Let $\beta = (\beta_1, \ldots, \beta_{\tilde{m}+1})'$ be a non-null solution to the following system of $\tilde{m}$ equations in $\tilde{m}+1$ unknowns:
$$\sum_{i=1}^{\tilde{m}+1} \beta_i \int_{E_i} K(\tilde{A}_j, z)\,d\xi_0(z) = 0, \quad j = 1, \ldots, \tilde{m}. \qquad (18)$$

Let
$$g_0(x) = \eta \left(\sum_{i=1}^{\tilde{m}+1} \beta_i^2\,\xi_0(E_i)\right)^{-1/2} \sum_{i=1}^{\tilde{m}+1} \beta_i\,I_{E_i}(x). \qquad (19)$$
Then $\int g_0^2(x)\,d\xi_0(x) = \eta^2$ and, by (18) and (19), it is easy to see that, for $j = 1, \ldots, \tilde{m}$,
$$\int g_0(x)K(\tilde{A}_j, x)\,d\xi_0(x) = \eta \left(\sum_{i=1}^{\tilde{m}+1} \beta_i^2\,\xi_0(E_i)\right)^{-1/2} \sum_{i=1}^{\tilde{m}+1} \beta_i \int_{E_i} K(\tilde{A}_j, x)\,d\xi_0(x) = 0. \qquad (20)$$
Thus, $g_0 \in F_\eta^-$. Also, by (20), we have $B(g_0, \xi_0) = \int_X g_0^2(x)\,d\xi_0(x) = \eta^2$.

Lemma 3 $\inf_{g \in F_\eta^+} B(g, \xi_0) = \eta^2$.

Proof: Start with, for $g \in F_\eta^+$,
$$B(g, \xi_0) = \int_X \left\{g(x) - K(x', \tilde{A}')\tilde{V}_{\xi_0}^- \int K(\tilde{A}, z)g(z)\,d\xi_0(z)\right\}^2 d\xi_0(x) = \int g^2(x)\,d\xi_0 \geq \eta^2.$$
The function $g_0$ given in (19) is also in $F_\eta^+$ and it satisfies the equality $B(g_0, \xi_0) = \eta^2$.

Lemma 4 $B(h, \xi) = 0$ for all $h \in R$.

Proof: For any $h \in R$, there exists an $\tilde{h} \in R^{\tilde{m}}$ such that $h$ can be represented as $h(z) = K(z', \tilde{A}')\tilde{h}$. Then
$$K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)h(z)\,d\xi(z) = K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)K(z', \tilde{A}')\tilde{h}\,d\xi(z) = K(x', \tilde{A}')\tilde{h} = h(x).$$
Thus, we have $B(h, \xi) = 0$ for $h \in R$.

Lemma 5 We have the following minimaxity and maximinity properties for $B(g, \xi)$: (1) $\inf_{\xi \in \mathcal{P}} \sup_{g \in F_\eta^- \oplus R} B(g, \xi) = \sup_{g \in F_\eta^- \oplus R} B(g, \xi_0)$ and (2) $\sup_{\xi \in \mathcal{P}} \inf_{g \in F_\eta^+ \oplus R} B(g, \xi) = \inf_{g \in F_\eta^+ \oplus R} B(g, \xi_0)$.

Proof: For an arbitrary $f \in F_\eta^- \oplus R$ (or $f \in F_\eta^+ \oplus R$), $f$ can be decomposed uniquely as $f = h + g$, where $h \in R$ and $g \in F_\eta^-$ (or $g \in F_\eta^+$). As $B(h, \xi) = 0$ for $h \in R$, we have $B(f, \xi) = B(g, \xi)$. Therefore, it is sufficient to prove statement (1) for $g$ in the class $F_\eta^-$ and statement (2) for $g$ in the class $F_\eta^+$. That is, instead of (1) and (2), it is sufficient to show

(1') $\inf_{\xi \in \mathcal{P}} \sup_{g \in F_\eta^-} B(g, \xi) = \sup_{g \in F_\eta^-} B(g, \xi_0)$ and (2') $\sup_{\xi \in \mathcal{P}} \inf_{g \in F_\eta^+} B(g, \xi) = \inf_{g \in F_\eta^+} B(g, \xi_0)$.

(1') We will show that for any $\xi \in \mathcal{P}$ there exists a $g_0 \in F_\eta^-$ such that $B(g_0, \xi) \geq \eta^2$. Then, combined with Lemma 2, we get
$$B(g_0, \xi) \geq \eta^2 = \sup_{g \in F_\eta^-} B(g, \xi_0). \qquad (21)$$
If so, statement (1') is then proved. Below we construct a $g_0$ such that $g_0 \in F_\eta^-$ and $B(g_0, \xi) \geq \eta^2$. Let $\nu = \xi - \xi_0$, a signed measure. By the Hahn decomposition theorem there exists a measurable set $E$ such that $\nu$ is positive on $E$ and negative on $E^c$. We can then find a partition $E = \cup_{i=1}^{2\tilde{m}+1} E_i$ with the $E_i$ disjoint and $\nu(E_i) > 0$. Since $\xi = \nu + \xi_0$ and $\xi$ is absolutely continuous with respect to $\xi_0$, $\xi_0(E_i) > 0$, too. Let $\beta = (\beta_1, \ldots, \beta_{2\tilde{m}+1})'$ be a non-null solution to the following system of $2\tilde{m}$ equations in $2\tilde{m}+1$ unknowns:
$$\sum_{i=1}^{2\tilde{m}+1} \beta_i \int_{E_i} K(\tilde{A}_j, z)\,d\xi_0(z) = 0, \quad j = 1, \ldots, \tilde{m}, \qquad (22)$$
$$\sum_{i=1}^{2\tilde{m}+1} \beta_i \int_{E_i} K(\tilde{A}_j, z)\,d\xi(z) = 0, \quad j = 1, \ldots, \tilde{m}. \qquad (23)$$
Let
$$g_0(x) := \eta \left(\sum_{i=1}^{2\tilde{m}+1} \beta_i^2\,\xi_0(E_i)\right)^{-1/2} \sum_{i=1}^{2\tilde{m}+1} \beta_i\,I_{E_i}(x). \qquad (24)$$
Since $\xi = \nu + \xi_0$ and $\nu$ is positive on $E$, we have
$$\int g_0^2(x)\,d\xi(x) \geq \int g_0^2(x)\,d\xi_0(x) = \eta^2.$$

By (22), we get $\int g_0(x)K(\tilde{A}_j, x)\,d\xi_0(x) = 0$ for $1 \leq j \leq \tilde{m}$. Thus, $g_0$ is in $F_\eta^-$. By (23), we get $\int g_0(x)K(\tilde{A}_j, x)\,d\xi(x) = 0$ for $1 \leq j \leq \tilde{m}$. Thus,
$$B(g_0, \xi) = \int \left\{g_0(x) - K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)g_0(z)\,d\xi(z)\right\}^2 d\xi_0(x) = \int g_0^2(x)\,d\xi_0(x) = \eta^2.$$
Therefore, the proof of (1') is completed.

(2') We will show that for any $\xi \in \mathcal{P}$ there exists an $h_0 \in F_\eta^+$ such that $B(h_0, \xi) \leq \eta^2$. Then, combined with Lemma 3, we get
$$B(h_0, \xi) \leq \eta^2 = \inf_{g \in F_\eta^+} B(g, \xi_0). \qquad (25)$$
If so, statement (2') can then be proved. Below we construct such an $h_0$. Let $\nu = \xi - \xi_0$ and $E^c$ be as defined earlier in this proof. We then find a partition of $E^c$, $E^c = \cup_{i=1}^{\tilde{m}+1} U_i$, with the $U_i$ disjoint and $\nu(U_i) < 0$. Then $\xi_0(U_i) = \xi(U_i) - \nu(U_i) > 0$. Let $\beta = (\beta_1, \ldots, \beta_{\tilde{m}+1})'$ be a non-null solution to the following system of $\tilde{m}$ equations in $\tilde{m}+1$ unknowns:
$$\sum_{i=1}^{\tilde{m}+1} \beta_i \int_{U_i} K(\tilde{A}_j, z)\,d\xi_0(z) = 0, \quad j = 1, \ldots, \tilde{m}. \qquad (26)$$
Let
$$h_0(x) := \eta \left(\sum_{i=1}^{\tilde{m}+1} \beta_i^2\,\xi_0(U_i)\right)^{-1/2} \sum_{i=1}^{\tilde{m}+1} \beta_i\,I_{U_i}(x). \qquad (27)$$
Then $\int h_0^2(x)\,d\xi_0(x) = \eta^2$ and $\int h_0(x)K(\tilde{A}_j, x)\,d\xi_0(x) = 0$ for $j = 1, \ldots, \tilde{m}$. Thus, $h_0 \in F_\eta^+$. Also,
$$B(h_0, \xi) = \int \left\{h_0(x) - K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)h_0(z)\,d\xi(z)\right\}^2 d\xi_0(x) \leq \int h_0^2(x)\,d\xi_0(x) = \eta^2.$$
Therefore, the proof of (2') is completed.

References

[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.

[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, 2000.

[3] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support vector regression machines. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. MIT Press.

[4] K. T. Fang, D. K. J. Lin, P. Winker, and Y. Zhang. Uniform design: theory and application. Technometrics, 42, 2000.

[5] K. T. Fang, Y. Wang, and P. M. Bentler. Some applications of number-theoretic methods in statistics. Statistical Science, 9, 1994.

[6] G. Fung and O. L. Mangasarian. Proximal support vector machine classifiers. In F. Provost and R. Srikant, editors, Proceedings KDD-2001: Knowledge Discovery and Data Mining, August 26-29, 2001, San Francisco, CA, pages 77-86, New York, 2001. Association for Computing Machinery. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-02.ps.

[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, 2001.

[8] Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. Technical Report 00-07, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, July 2000. Proceedings of the First SIAM International Conference on Data Mining, Chicago, April 5-7, 2001, CD-ROM Proceedings. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-07.ps.

[9] Yuh-Jye Lee and Wen-Feng Hsieh. ε-SSVR: A smooth support vector machine for ε-insensitive regression. Technical report, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology. IEEE Transactions on Knowledge and Data Engineering, submitted.

[10] Yuh-Jye Lee and O. L. Mangasarian. SSVM: A smooth support vector machine for classification. Computational Optimization and Applications, 20:5-22, 2001. Data Mining Institute, University of Wisconsin, Technical Report 99-03. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.ps.

[11] K.-M. Lin and C.-J. Lin. A study on reduced support vector machines. IEEE Transactions on Neural Networks, 14, 2003.

[12] O. L. Mangasarian. Generalized support vector machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps.

[13] O. L. Mangasarian and D. R. Musicant. Large scale kernel regression via linear programming. Technical Report 99-02, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, August 1999. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-02.ps.

[14] O. L. Mangasarian and D. R. Musicant. Robust linear and support vector regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9), 2000. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-09.ps.

[15] T. M. Mitchell. Machine Learning. McGraw-Hill, Boston, 1997.

[16] A. Smola and B. Schölkopf. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[17] Alex J. Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2000.

[18] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[19] D. P. Wiens. Designs for approximately linear regression: Two optimality properties of uniform designs. Statist. & Probab. Letters, 12, 1991.

[20] D. P. Wiens. Minimax designs for approximately linear regression. J. Statist. Plann. Inference, 31, 1992.

[21] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, Cambridge, MA, 2001. MIT Press.

[22] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA.


More information

Approximate Kernel PCA with Random Features

Approximate Kernel PCA with Random Features Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,

More information

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors R u t c o r Research R e p o r t The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors Martin Anthony a Joel Ratsaby b RRR 17-2011, October 2011 RUTCOR Rutgers Center for Operations

More information

Equivalence of Minimal l 0 and l p Norm Solutions of Linear Equalities, Inequalities and Linear Programs for Sufficiently Small p

Equivalence of Minimal l 0 and l p Norm Solutions of Linear Equalities, Inequalities and Linear Programs for Sufficiently Small p Equivalence of Minimal l 0 and l p Norm Solutions of Linear Equalities, Inequalities and Linear Programs for Sufficiently Small p G. M. FUNG glenn.fung@siemens.com R&D Clinical Systems Siemens Medical

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM

An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM Wei Chu Chong Jin Ong chuwei@gatsby.ucl.ac.uk mpeongcj@nus.edu.sg S. Sathiya Keerthi mpessk@nus.edu.sg Control Division, Department

More information

below, kernel PCA Eigenvectors, and linear combinations thereof. For the cases where the pre-image does exist, we can provide a means of constructing

below, kernel PCA Eigenvectors, and linear combinations thereof. For the cases where the pre-image does exist, we can provide a means of constructing Kernel PCA Pattern Reconstruction via Approximate Pre-Images Bernhard Scholkopf, Sebastian Mika, Alex Smola, Gunnar Ratsch, & Klaus-Robert Muller GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany fbs,

More information

Support Vector Machines on General Confidence Functions

Support Vector Machines on General Confidence Functions Support Vector Machines on General Confidence Functions Yuhong Guo University of Alberta yuhong@cs.ualberta.ca Dale Schuurmans University of Alberta dale@cs.ualberta.ca Abstract We present a generalized

More information

Convex Optimization in Classification Problems

Convex Optimization in Classification Problems New Trends in Optimization and Computational Algorithms December 9 13, 2001 Convex Optimization in Classification Problems Laurent El Ghaoui Department of EECS, UC Berkeley elghaoui@eecs.berkeley.edu 1

More information

Relevance Vector Machines for Earthquake Response Spectra

Relevance Vector Machines for Earthquake Response Spectra 2012 2011 American American Transactions Transactions on on Engineering Engineering & Applied Applied Sciences Sciences. American Transactions on Engineering & Applied Sciences http://tuengr.com/ateas

More information

Kernel Machines and Additive Fuzzy Systems: Classification and Function Approximation

Kernel Machines and Additive Fuzzy Systems: Classification and Function Approximation Kernel Machines and Additive Fuzzy Systems: Classification and Function Approximation Yixin Chen * and James Z. Wang * Dept. of Computer Science and Engineering, School of Information Sciences and Technology

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Polyhedral Computation. Linear Classifiers & the SVM

Polyhedral Computation. Linear Classifiers & the SVM Polyhedral Computation Linear Classifiers & the SVM mcuturi@i.kyoto-u.ac.jp Nov 26 2010 1 Statistical Inference Statistical: useful to study random systems... Mutations, environmental changes etc. life

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

MinOver Revisited for Incremental Support-Vector-Classification

MinOver Revisited for Incremental Support-Vector-Classification MinOver Revisited for Incremental Support-Vector-Classification Thomas Martinetz Institute for Neuro- and Bioinformatics University of Lübeck D-23538 Lübeck, Germany martinetz@informatik.uni-luebeck.de

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Support vector machines (SVMs) are one of the central concepts in all of machine learning. They are simply a combination of two ideas: linear classification via maximum (or optimal

More information

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,

More information