Theoretical Study on Reduced Support Vector Machines

Su-Yun Huang
Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan

Yuh-Jye Lee
Computer Science & Information Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan

Abstract

The reduced support vector machine (RSVM) was proposed with the practical objective of overcoming computational difficulties, as well as reducing model complexity, in generating a nonlinear separating surface for a massive dataset. It has been successfully applied to other kernel-based learning algorithms. There is also an experimental study on RSVM that shows its efficiency.

In this paper we study the RSVM from the viewpoint of model building and consider the nonlinear separating surface as a mixture of kernels. The RSVM uses a compressed model representation instead of a saturated full model; fitting data to a saturated model is deemed to be unstable and causes the prediction error to rise due to over-fitting. Our main result shows that uniform random selection of the reduced set in RSVM is the optimal selection scheme in terms of the following criteria: (1) it minimizes an intrinsic model variation measure; (2) it minimizes the maximal model bias between the compressed model and the full model; (3) it maximizes the minimal ability to distinguish the compressed model from the full model.

1 Introduction

In recent years support vector machines (SVMs) with linear or nonlinear kernels [1, 2, 18] have become one of the most promising learning algorithms for classification as well as regression [3, 13, 14], which are fundamental tasks in data mining [22]. Via the use of a kernel mapping, the SVM allows effective and flexible nonlinear classification. However, large-scale classification faces major computational problems because it must deal with a fully dense nonlinear kernel matrix. To overcome these problems, many approximation schemes have been proposed for approximating a symmetric positive semidefinite matrix [17, 21]. Lee and Mangasarian proposed, as an alternative, the reduced support vector machine (RSVM) [8]. The key ideas of the RSVM are as follows. RSVM randomly selects a small portion of the data set to generate a thin rectangular kernel matrix. It then uses this small rectangular kernel matrix to replace the fully dense square kernel matrix in the nonlinear SVM formulation. Computational time, as well as memory usage, is much smaller for RSVM than for a conventional SVM using the full kernel matrix.

As a result, RSVM also simplifies the characterization of the nonlinear separating surface. According to Occam's razor, as well as the Minimum Description Length (MDL) principle [15] and the numerical comparisons in [8], RSVM has better generalization ability than a conventional SVM. This reduced kernel technique has been successfully applied to other kernel-based learning algorithms, such as the proximal support vector machine (PSVM) [6] and ε-smooth support vector regression (ε-SSVR) [9]. There is also an experimental study on RSVM [11] that showed its efficiency. The RSVM results reported in [11] could be further improved if a larger tuning parameter had been properly chosen. Since the RSVM has reduced the model complexity by using a much smaller rectangular kernel matrix, we suggest using a larger tuning parameter to put more weight on the goodness of fit to the data.

In this paper we study the RSVM from the viewpoint of model building. We give the uniform randomness some optimality interpretations and show that this uniform random subset approach has an active effect on reducing model complexity. This uniform random subset selection in RSVM also has a link to the popular uniform design [4]. We briefly outline the contents of the paper. Section 2 provides the main idea and formulation of RSVM. Section 3 gives the uniform randomness an optimality interpretation in terms of a model variation measure. Section 4 further provides the uniform randomness with some minimaxity and maximinity properties. We show that the RSVM uniform design minimizes the maximal model bias between the compressed model and the full model and that it maximizes the minimal ability to distinguish the compressed model from the full one. Section 5 concludes the paper. All proofs are put in the appendix.

A word about our notation and background material is given below. All vectors will be column vectors unless otherwise specified or transposed to a row vector by a prime superscript.

For a vector $x$ in the $n$-dimensional real space $R^n$, the plus function $x_+$ is defined componentwise as $(x_+)_j = \max\{0, x_j\}$. The scalar (inner) product of two vectors $x$ and $z$ in $R^n$ will be denoted by $x'z$ and the $p$-norm of $x$ will be denoted by $\|x\|_p$. For a matrix $A \in R^{m \times n}$, $A_i$ is the $i$th row of $A$. A column vector of ones of arbitrary dimension will be denoted by $e$. For $A \in R^{m \times n}$ and $B \in R^{n \times l}$, the kernel $K(A, B)$ maps $R^{m \times n} \times R^{n \times l}$ into $R^{m \times l}$. In particular, $K(x', z)$ is a real number, $K(x', A')$ is a row vector in $R^m$, $K(A, x)$ is a column vector in $R^m$, and $K(A, A')$ is an $m \times m$ matrix. The base of the natural logarithm will be denoted by $\varepsilon$.

2 Reduced Support Vector Machines

Consider the problem of classifying points into two classes, $A^-$ and $A^+$. We are given a training data set $\{(x_i, y_i)\}_{i=1}^m$, where $x_i \in X \subseteq R^n$ is an input vector and $y_i \in \{-1, 1\}$ is a class label indicating to which of the two classes, $A^-$ and $A^+$, the input point belongs. We represent these data points by an $m \times n$ matrix $A$, where the $i$th row $A_i$ corresponds to the $i$th input data point. We use $A_i$ (a row vector) and $x_i$ (a column vector) alternately for the same $i$th data point, depending on convenience. We use an $m \times m$ diagonal matrix $D$, $D_{ii} = y_i$, to specify the class membership of each input point. The main goal of the classification problem is to find a classifier that correctly predicts the unseen class labels of new data inputs. This can be achieved by constructing a linear or nonlinear separating surface, $f(x) = 0$, which is implicitly defined by a kernel function. We classify a test point $x$ to $A^+$ if $f(x) \geq 0$, and to $A^-$ otherwise. We will focus on the nonlinear case in this paper. In a conventional SVM, as well as in many kernel-based learning algorithms [1, 2, 18], generating a nonlinear separating surface requires dealing with a fully dense kernel matrix. When training a nonlinear SVM on a massive data set, the huge and dense full kernel matrix leads to the following computational difficulties [8]:

- the size of the mathematical programming problem;
- the dependence of the nonlinear separating surface on the entire data set, which creates unwieldy storage problems that prevent the use of nonlinear kernels on anything but a small data set;
- the overfitting of the separating surface, which causes high variation and less stable fitting.

To avoid these difficulties, the RSVM uses a very small random subset of size $\tilde{m}$ of the original $m$ data points, where $\tilde{m} \ll m$, to build up the separating surface. We denote this random subset by $\tilde{A}$ and use it to generate a much smaller rectangular kernel matrix $K(A, \tilde{A}') \in R^{m \times \tilde{m}}$. The reduced kernel matrix serves to replace the huge and fully dense square matrix $K(A, A')$ used in the conventional SVM, cutting the problem size, computation time, and memory usage, as well as simplifying the characterization of the nonlinear separating surface. This reduced kernel technique can be applied to most kernel-based learning algorithms.

We now briefly describe the RSVM formulation, which is derived from the generalized support vector machine (GSVM) [12] and the smooth support vector machine (SSVM) [10]. The RSVM solves the following unconstrained minimization problem for a general kernel $K(A, \tilde{A}')$:
$$\min_{(\tilde{u}, \gamma) \in R^{\tilde{m}+1}} \ \frac{\nu}{2}\,\big\|p\big(e - D\{K(A, \tilde{A}')\tilde{D}\tilde{u} - e\gamma\}, \alpha\big)\big\|_2^2 + \frac{1}{2}(\tilde{u}'\tilde{u} + \gamma^2), \qquad (1)$$
where the function $p(x, \alpha)$ is a very accurate smooth approximation to $(x)_+$ [10], applied componentwise to the vector $e - D\{K(A, \tilde{A}')\tilde{D}\tilde{u} - e\gamma\}$ and defined componentwise by
$$\{\text{the } j\text{th component of } p(x, \alpha)\} = x_j + \frac{1}{\alpha}\log(1 + \varepsilon^{-\alpha x_j}), \quad \alpha > 0, \ j = 1, \ldots, n. \qquad (2)$$
The function $p(x, \alpha)$ converges to $(x)_+$ as $\alpha$ goes to infinity. The positive tuning parameter $\nu$ here controls the tradeoff between the classification error and the suppression of $(\tilde{u}, \gamma)$.
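To make the smoothing step in (2) concrete, the following minimal Python sketch (our illustration, not the authors' code; the grid of α values is arbitrary) evaluates p(x, α) componentwise and checks numerically that it approaches (x)_+ as α grows.

```python
import numpy as np

def smooth_plus(x, alpha):
    """Smooth approximation p(x, alpha) to the plus function (x)_+,
    applied componentwise: x + (1/alpha) * log(1 + exp(-alpha * x))."""
    # np.logaddexp(0, -alpha * x) computes log(1 + exp(-alpha * x)) stably.
    return x + np.logaddexp(0.0, -alpha * x) / alpha

x = np.linspace(-2.0, 2.0, 9)
for alpha in (1.0, 5.0, 25.0):
    gap = np.max(np.abs(smooth_plus(x, alpha) - np.maximum(x, 0.0)))
    print(f"alpha = {alpha:5.1f}  max |p(x, alpha) - (x)_+| = {gap:.6f}")
```

The largest gap is (log 2)/α, attained at x = 0, so the approximation error shrinks at rate 1/α.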

Since RSVM has essentially reduced the model complexity by using a much smaller rectangular kernel matrix, we suggest using a larger tuning parameter ν in RSVM than in a conventional SVM. The diagonal matrix $\tilde{D} \in R^{\tilde{m} \times \tilde{m}}$, with ones or minus ones along its diagonal, specifies the class membership of each point in the reduced set. If we let $\tilde{v} = \tilde{D}\tilde{u}$, i.e., $\tilde{v}_i = \tilde{D}_{ii}\tilde{u}_i$, then $\tilde{v}'\tilde{v} = \tilde{u}'\tilde{u}$ and problem (1) is equivalent to
$$\min_{(\tilde{v}, \gamma) \in R^{\tilde{m}+1}} \ \frac{\nu}{2}\,\big\|p\big(e - D\{K(A, \tilde{A}')\tilde{v} - e\gamma\}, \alpha\big)\big\|_2^2 + \frac{1}{2}(\tilde{v}'\tilde{v} + \gamma^2). \qquad (3)$$
A solution to the minimization problem (3) leads to the nonlinear separating surface
$$f(x) = K(x', \tilde{A}')\tilde{v} - \gamma = 0, \quad \text{or} \quad \sum_{i=1}^{\tilde{m}} \tilde{v}_i K(x', \tilde{A}_i') - \gamma = 0. \qquad (4)$$
The minimization problem (3) retains the strong convexity and differentiability properties in the space $R^{\tilde{m}+1}$ of $(\tilde{v}, \gamma)$ for an arbitrary rectangular kernel. Hence we can apply the Newton-Armijo method [10] directly to solve (3), and the existence and uniqueness of the optimal solution are guaranteed. Moreover, the matrix $\tilde{D}$ has disappeared from the formulation (3). This means that we do not need to record the class label of each data point in the reduced set. We note that the nonlinear separating surface (4) is a linear combination of a set of kernel functions plus a constant.

In a nutshell, the RSVM can be split into two parts. First, it selects a small random subset $\{K(\cdot, \tilde{A}_1'), K(\cdot, \tilde{A}_2'), \ldots, K(\cdot, \tilde{A}_{\tilde{m}}')\}$ from the likely highly correlated full-data atom set $\{K(\cdot, A_i')\}_{i=1}^m$ for building the separating function. The full-data atom set is inefficient for function representation, yet the conventional SVM uses it. Secondly, the RSVM determines the best coefficients of the selected kernel functions by solving the unconstrained minimization problem (3) using the entire data set, so that the resulting surface fits the whole data set well.
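For illustration, here is a compact end-to-end sketch of the reduced-kernel approach: a uniformly random reduced set, the m × m̃ rectangular Gaussian kernel, and the smooth objective (3). It is our own mock-up under stated assumptions (a Gaussian kernel and a generic quasi-Newton solver from SciPy standing in for the Newton-Armijo method of [10]); all function and variable names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, sigma):
    """K(A, B') with Gaussian entries exp(-||A_i - B_j||^2 / (2 sigma^2)).
    (The (2*pi*sigma^2)^(-n/2) factor used in Section 3 is a constant and is omitted here.)"""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def rsvm_fit(A, y, m_tilde, nu=100.0, alpha=5.0, sigma=1.0, seed=0):
    """Reduced SVM sketch: uniform random reduced set + smooth unconstrained problem (3).
    A is the m x n input matrix and y a vector of +/-1 labels."""
    m = A.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(m, size=m_tilde, replace=False)    # uniform random reduced set
    A_tilde = A[idx]
    K = gaussian_kernel(A, A_tilde, sigma)              # m x m_tilde rectangular kernel

    def objective(theta):
        v, gamma = theta[:m_tilde], theta[m_tilde]
        r = 1.0 - y * (K @ v - gamma)                   # e - D(K(A, A_tilde') v - e gamma)
        p = r + np.logaddexp(0.0, -alpha * r) / alpha   # smooth plus function, componentwise
        return 0.5 * nu * (p @ p) + 0.5 * (v @ v + gamma ** 2)

    res = minimize(objective, np.zeros(m_tilde + 1), method="BFGS")
    v, gamma = res.x[:m_tilde], res.x[m_tilde]
    # Separating surface (4): f(x) = K(x, A_tilde') v - gamma
    return lambda X: np.sign(gaussian_kernel(X, A_tilde, sigma) @ v - gamma)
```

With labels y in {−1, +1}, `predict = rsvm_fit(A, y, m_tilde=50)` returns a classifier evaluating the sign of the surface (4); only the m̃ kernel columns for the reduced set are ever formed, which is what cuts storage from the m × m full kernel to m × m̃.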

In the next section, we study the RSVM from the model-building point of view.

3 Optimal sampling design on atom set

The conventional nonlinear SVM uses a saturated representation for the separating surface:
$$f(x) = \sum_{i=1}^{m} v_i K(x', A_i') - \gamma = 0. \qquad (5)$$
It is a linear combination of the atom functions $\{1\} \cup \{K(\cdot, A_i')\}_{i=1}^m$. One may regard this collection as an overcomplete dictionary of atoms when $m$ is large or approaching infinity. The representation (5) is said to be saturated, as there are as many parameters $v_i$ as data points. Fitting data to a saturated model is deemed to be unstable and causes the generalization error to rise due to over-fitting [7]. Therefore, it is desirable to cut down the model complexity and fit a compressed model:
$$f(x) = \sum_{i=1}^{\tilde{m}} \tilde{v}_i K(x', \tilde{A}_i') - \gamma = 0. \qquad (6)$$
We call $\{K(\cdot, A_i')\}_{i=1}^m$ the full atom set based on the training data and $\{K(\cdot, \tilde{A}_i')\}_{i=1}^{\tilde{m}}$ a reduced atom set. Concerning the choice of a reduced set, a simple way is the uniform random subset approach used in RSVM. The RSVM randomly selects a small portion of atom functions from the full set to generate the compressed model, while it fits the entire data set to this compressed model. Each candidate atom in the full set has an equal chance of being selected. This uniform random subset selection is simple and straightforward, but is it the most effective way to build up a compressed model? There have been experimental results showing its effectiveness for modeling classification surfaces [8, 11] as well as regression surfaces [9]. In this article, we justify and complement these experimental results with a theoretical study. We show that uniform random selection is the most effective sampling scheme in terms of a few criteria stated later. Before proceeding any further, we define some notation and state some conditions.

Let $\xi_0$ be a probability measure on $X$. Assume the training inputs follow the distribution $\xi_0$. In statistical terms, $\xi_0$ is called the training design. Let $H$ denote the reproducing kernel Hilbert space generated by the kernel $K(x, z)$. Let $\mathcal{D} := \{K(\cdot, z)\}_{z \in X}$, which is an overcomplete dictionary for $H$. Let $\xi$ be a probability measure absolutely continuous with respect to $\xi_0$, denoted by $\xi \ll \xi_0$, and let $\mathcal{P}$ be the collection of all such probability measures, i.e., $\mathcal{P} = \{\xi : \text{probability measure on } X \text{ satisfying } \xi \ll \xi_0\}$. The Radon-Nikodym derivative of $\xi$ with respect to $\xi_0$ is denoted by $p(x)$, i.e., $p(x) = d\xi(x)/d\xi_0(x)$. If a random subset of size $\tilde{m}$, $\{K(\cdot, z_j)\}_{j=1}^{\tilde{m}}$, is selected from the dictionary $\mathcal{D}$ according to $\xi$, i.e., the $z_j$ are i.i.d. from $\xi$, then we say that the distribution $\xi$ is a sampling design for atom selection.

For convenience, throughout this paper a Gaussian kernel,
$$K_\sigma(x, z) = (2\pi\sigma^2)^{-n/2} \exp\{-\|x - z\|^2/(2\sigma^2)\},$$
is used for the RSVM. However, the theoretical results obtained in this article easily extend to other kernels. We often suppress the subscript $\sigma$ in $K_\sigma$ and let $K(x|z) := K_\sigma(x, z)$. The value of $K(x|z)$ represents the inner product of the images of $x$ and $z$ in the feature space after the nonlinear mapping implicitly defined by the Gaussian kernel. We can also interpret it as a measure of similarity between $x$ and $z$. In other words, $K(A_i, A')$ records the similarity between $A_i$ and all training inputs. In particular, if we use the Gaussian kernel in the RSVM, we can interpret RSVM as an instance-based learning algorithm [15].

The reduced kernel matrix records only the similarity between the reduced set and the entire training data set. In contrast to the $k$-nearest-neighbor algorithm, which uses a simple voting strategy, RSVM uses a weighted voting strategy, where the weights and the threshold are determined by a training procedure. That is, if the weighted sum, $K(x', \tilde{A}')\tilde{v}$, is greater than a threshold, $\gamma$, the point $x$ is classified as a positive example.

The underlying separating function $f(x)$ is modelled as a mixture of kernels plus a bias term. However, as our main focus here is to assess the variability of model building due to atom sampling from the dictionary $\mathcal{D}$, the bias term $\gamma$ may, without loss of generality, be assumed zero. The separating function $f(x)$ is allowed to vary over the function class $H$, subject to the following conditions:

C1. The function $f$ is a mixture of Gaussians via the mixing distribution $\xi_0$:
$$f(x) = \int K(x|z)\,v(z)\,d\xi_0(z), \qquad (7)$$
where $v(z)$ is a coefficient function satisfying conditions C2 and C3 below.

C2. Assume $\int v^2(z)\,d\xi_0(z) < \infty$.

C3. To fix a scale for the separating function, it is assumed, without loss of generality, that $\int |v(z)|\,d\xi_0(z) = 1$.

By re-expressing the above mixture via the sampling design $\xi$, we have
$$f(x) = \int K(x|z)\,v(z)\,\frac{d\xi_0(z)}{d\xi(z)}\,d\xi(z) = \int K(x|z)\,v(z)/p(z)\,d\xi(z). \qquad (8)$$
The separating function $f(x)$ can thus be written as the expectation of $f(x, Z)$:
$$f(x) = E f(x, Z) := E\{K(x|Z)\,v(Z)/p(Z)\}, \quad Z \sim \xi,$$
where $\sim$ stands for "follows the distribution." Sampling from $f(x, Z) := K(x|Z)\,v(Z)/p(Z)$ is used to build a compressed model for the separating function. The integrated variance of $f(x, Z)$, as a measure of model variation, is given below:
$$\int \mathrm{Var}_\xi\{f(x, Z)\}\,dx = \int\!\!\int K^2(x|z)\,v^2(z)/p^2(z)\,d\xi(z)\,dx - \int f^2(x)\,dx. \qquad (9)$$
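As a quick numerical illustration of (8), the sketch below (our own construction; the choices ξ0 = N(0,1), the coefficient function v, and the exponentially tilted design are arbitrary, and the scaling in C3 is ignored) verifies by Monte Carlo that E_ξ{K(x|Z)v(Z)/p(Z)} recovers f(x) for any admissible design ξ, while the variance of f(x, Z) depends on the design.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n_mc = 0.5, 200_000

def K(x, z):                                   # Gaussian kernel K(x|z) in dimension n = 1
    return np.exp(-(x - z) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

v = lambda z: 0.5 + (z > 0)                    # an arbitrary coefficient function

# Training design xi_0 = N(0,1), and two sampling designs xi << xi_0:
#   "uniform": p(z) = 1 (xi = xi_0);  "tilted": p(z) = exp(z - 1/2), i.e. Z ~ N(1,1).
designs = {
    "uniform": (lambda n: rng.normal(0.0, 1.0, n), lambda z: np.ones_like(z)),
    "tilted":  (lambda n: rng.normal(1.0, 1.0, n), lambda z: np.exp(z - 0.5)),
}

x = 0.3
z0 = rng.normal(0.0, 1.0, n_mc)
f_ref = np.mean(K(x, z0) * v(z0))              # f(x) = int K(x|z) v(z) d xi_0(z)

for name, (draw, p) in designs.items():
    z = draw(n_mc)
    fxz = K(x, z) * v(z) / p(z)                # f(x, Z) = K(x|Z) v(Z) / p(Z)
    print(f"{name:8s}  mean {fxz.mean():.4f}  (reference f(x) {f_ref:.4f})  var {fxz.var():.4f}")
```

Both designs reproduce the same mean, i.e., the same underlying f(x), but the spread of f(x, Z) around it differs, which is exactly the design-dependent part isolated in (9).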

The integrated model variance is used to assess model robustness. The smaller it is, the more stable (less variable) the model built from $\xi$ will be. Therefore, we aim to find a good sampling design $\xi$ with a small integrated model variance. As the quantity $\int f^2(x)\,dx$ does not involve $\xi$, we may drop it in the minimization step and use the following model variation measure.

Definition 1 (Model variation) We use the following measure for the model variation due to the sampling design $\xi$:
$$V(\xi) = \int\!\!\int K^2(x|z)\,v^2(z)/p^2(z)\,d\xi(z)\,dx. \qquad (10)$$

By Lemma 1 in the Appendix, it is easy to obtain the optimal design with minimum $V(\xi)$ over the set $\mathcal{P}$; its pdf with respect to $\xi_0$ is given by
$$p(z) \propto |v(z)|\left(\int K^2(x|z)\,dx\right)^{1/2} \propto |v(z)|,$$
where the second proportionality holds because $\int K^2(x|z)\,dx$ does not depend on $z$ for the Gaussian kernel. The coefficient function $v(z)$ is not known and has yet to be estimated in the training process. Here we use a constant as a reference function to replace $v(z)$, and the resulting $p(z)$ is uniform with respect to $\xi_0$. There are two main reasons for using a constant reference function for $v(z)$. One is to indicate no prior preference for, or information on, $v(z)$ before the training process. The other is to reflect the intrinsic mechanism in the SVM that tends to minimize the 2-norm, $\int v^2(z)\,d\xi_0(z)$, of $v(z)$. Under the fixed-scale condition C3, the 2-norm of $v(z)$ is minimized when $v(z)$ is a constant function. Thus, the uniform $p(z)$ is justified. In other words, atom functions $K(\cdot, z)$ are sampled from the dictionary $\mathcal{D}$ according to the training design $\xi = \xi_0$.

Theorem 1 (Optimal sampling design) The optimal sampling design for atom selection is given by $\xi_0$. The optimality is in terms of minimizing $V(\xi)$ over $\xi \in \mathcal{P}$ prior to the training process.

This justifies the uniform sampling scheme (on training data) used in the original RSVM of Lee and Mangasarian [8].
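To make Theorem 1 concrete on a toy example: with the Gaussian kernel, ∫K²(x|z)dx is a constant c, so V(ξ) = c ∫ v²(z)/p(z) dξ0(z). The sketch below (again our own construction, with ξ0 = N(0,1), the constant reference v ≡ 1, and an arbitrary family of exponentially tilted designs) estimates V(ξ) by Monte Carlo; the uniform design (tilt a = 0) attains the smallest value, as Theorem 1 predicts.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n_mc = 0.5, 400_000
c = 1.0 / (2.0 * sigma * np.sqrt(np.pi))   # int K^2(x|z) dx for the 1-D Gaussian kernel (constant in z)

z0 = rng.normal(0.0, 1.0, n_mc)            # draws from the training design xi_0 = N(0,1)
for a in (0.0, 0.5, 1.0):                  # a = 0 is the uniform design, p(z) = 1
    p = np.exp(a * z0 - 0.5 * a ** 2)      # density of the tilted design xi_a = N(a,1) w.r.t. xi_0
    V = c * np.mean(1.0 / p)               # V(xi_a) = c * int v^2(z)/p(z) d xi_0(z) with v = 1
    print(f"tilt a = {a:3.1f}   V(xi_a) ~ {V:.4f}")
```

Analytically, V(ξ_a) = c·exp(a²) in this toy setting, so any tilt away from ξ0 inflates the model variation.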

In practice, a discrete approximation to $\xi_0$ is necessary. Naturally, we use its empirical distribution, i.e., putting probability mass $1/m$ at each training point $A_i$. Atom functions are then uniformly sampled from $\{K(\cdot, A_i')\}_{i=1}^m$. This random subset approach can drastically cut down the model complexity, while the sampling design helps to guide the atom selection in terms of minimal model variation. However, we remind the reader that the quantity $V(\xi)$ does not account for the parameter estimation variance, but purely for the model variation due to the random sampling of atoms from the dictionary $\mathcal{D}$. The resulting optimal sampling design in Theorem 1 is a uniform design with respect to $\xi_0$, which seeks atom points uniformly distributed over the training points. The uniform design has been in popular use since the 1980s; see articles [4] and [5] for a nice survey of the theory and applications of uniform design. The uniform design is a space-filling design and seeks to obtain maximal model robustness.

4 Minimaxity and maximinity

In this section we show that the above-mentioned optimal sampling design also possesses some minimaxity and maximinity properties. The ideas are inspired by [19, 20] and adapted to the context of RSVM. Recall the saturated and compressed models:
$$\text{saturated:} \quad f(x) = \sum_{i=1}^{m} v_i K(x', A_i') - \gamma, \qquad (11)$$
$$\text{compressed:} \quad f(x) = \sum_{i=1}^{\tilde{m}} \tilde{v}_i K(x', \tilde{A}_i') - \gamma. \qquad (12)$$
Using the kernel $K$ corresponds to mapping the data into a feature Hilbert space with a map $\Phi : X \to U$. In the feature space $U$, the normal vector of the SVM separating hyperplane, $w'u - \gamma = 0$, can be expanded in terms of support vectors.

The saturated model has the following full expansion for the normal vector:
$$\text{saturated:} \quad w = \sum_{i=1}^{m} v_i\,\Phi(A_i), \qquad (13)$$
while the compressed model has the reduced expansion
$$\text{compressed:} \quad w = \sum_{i=1}^{\tilde{m}} \tilde{v}_i\,\Phi(\tilde{A}_i). \qquad (14)$$
The expansions (13) and (14) are called pre-images. See [16] for approximating the full pre-image expansion by a reduced-set expansion. The error induced by the reduced-set approximation is investigated here. We define some more notation.

$F := \text{linear span}\{K(\cdot, A_i')\}_{i=1}^m$ and $R := \text{linear span}\{K(\cdot, \tilde{A}_i')\}_{i=1}^{\tilde{m}}$.

$F_\eta^- := \{g \in F : \int_X g^2(z)\,d\xi_0(z) \leq \eta^2, \ \int_X K(\tilde{A}_i, z)g(z)\,d\xi_0(z) = 0, \ i = 1, \ldots, \tilde{m}\}$.

$F_\eta^+ := \{g \in F : \int_X g^2(z)\,d\xi_0(z) \geq \eta^2, \ \int_X K(\tilde{A}_i, z)g(z)\,d\xi_0(z) = 0, \ i = 1, \ldots, \tilde{m}\}$.

$\tilde{V}_\xi := \int_X K(\tilde{A}, z)K(z', \tilde{A}')\,d\xi(z)$, which is an $\tilde{m} \times \tilde{m}$ matrix.

$B(f, \xi) := \int \{f(x) - K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)f(z)\,d\xi(z)\}^2\,d\xi_0(x)$, where $f \in F$, $\xi \in \mathcal{P}$ and $\tilde{V}_\xi^-$ is a generalized inverse of $\tilde{V}_\xi$.

We introduce below a measure of the model bias between the full model (11) and the compressed model (12). For an arbitrary function $f \in F$, the $L_2(d\xi)$-projection of $f$ onto the reduced space $R$ is given by
$$P_R f(x) = K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)f(z)\,d\xi(z).$$
We use the following integrated square bias to account for the model bias:
$$\text{model bias}(f) := \int \{f(x) - P_R f(x)\}^2\,d\xi_0 = B(f, \xi). \qquad (15)$$
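On the empirical training design used in practice (probability mass 1/m at each A_i) and with ξ = ξ0, the projection P_R f and the bias B(f, ξ0) reduce to finite-dimensional linear algebra. A minimal sketch with our own synthetic data and an f drawn from F (all names and the parameter choices are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, m, m_tilde, n = 1.0, 200, 10, 2

def gaussian_kernel(A, B):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

A = rng.normal(size=(m, n))                               # training inputs; xi_0 = empirical distribution
A_tilde = A[rng.choice(m, size=m_tilde, replace=False)]   # uniformly drawn reduced set

f_vals = gaussian_kernel(A, A) @ rng.normal(size=m) / m   # values f(A_i) of some f in F
Kr = gaussian_kernel(A, A_tilde)                          # K(A, A_tilde'), an m x m_tilde matrix

V = Kr.T @ Kr / m                                         # V_xi0 = int K(A_tilde, z) K(z', A_tilde') d xi_0(z)
c = Kr.T @ f_vals / m                                     # int K(A_tilde, z) f(z) d xi_0(z)
P_R_f = Kr @ (np.linalg.pinv(V) @ c)                      # L2(d xi_0)-projection of f onto R, at training points
B = np.mean((f_vals - P_R_f) ** 2)                        # B(f, xi_0) = int {f - P_R f}^2 d xi_0
print(f"integrated square bias B(f, xi_0) = {B:.6f}")
```

With ξ = ξ0, P_R is the least-squares projection of the values f(A_i) onto the columns of K(A, Ã'), so B(f, ξ0) is simply the residual sum of squares divided by m.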

The quantity $B(f, \xi)$ reflects the degree of inadequacy in representing $f$ by the compressed representation. To guard against the maximum possible model bias (or inadequacy), we would use a design that minimizes the maximum bias.

Theorem 2 (Minimaxity) The minimax design, which minimizes the maximum bias $B(f, \xi)$ for $f \in F_\eta^- \oplus R$, is achieved by $\xi = \xi_0$, i.e.,
$$\sup_{f \in F_\eta^- \oplus R} B(f, \xi_0) = \inf_{\xi \in \mathcal{P}} \ \sup_{f \in F_\eta^- \oplus R} B(f, \xi). \qquad (16)$$
Equivalently, the uniform design with respect to $\xi_0$ is the minimax design that minimizes the maximum model bias between the full model (11) and the compressed model (12).

The optimal sampling design, $\xi = \xi_0$, minimizes the maximum bias as the saturated model ranges over a neighborhood of the compressed model. The first condition in the definition of $F_\eta^-$ specifies a neighborhood around the compressed model, and the second condition ensures the identifiability of the parameter $\tilde{v}$ in this $F_\eta^-$-neighborhood. Below we further provide the optimal sampling design with a maximinity interpretation.

Theorem 3 (Maximinity) For $f \in F_\eta^+ \oplus R$, the maximin design for $B(f, \xi)$ is achieved by $\xi = \xi_0$, i.e.,
$$\inf_{f \in F_\eta^+ \oplus R} B(f, \xi_0) = \sup_{\xi \in \mathcal{P}} \ \inf_{f \in F_\eta^+ \oplus R} B(f, \xi). \qquad (17)$$
Equivalently, the uniform design with respect to $\xi_0$ is the maximin design.

Recall the definition of $F_\eta^+$. The first condition specifies a class that deviates from the compressed model by at least a distance $\eta$. If the true model is at least $\eta$-distance away from $R$ and falls into the class specified by $F_\eta^+$, we would hope to be able to distinguish between the true model and the compressed model in a lack-of-fit test. That is, we would hope to have as large a model bias as possible in order to have adequate ability to distinguish between the two models in a lack-of-fit test.

Theorem 3 tells us that the sampling design $\xi = \xi_0$ achieves the maximum such ability for a worst possible $f \in F_\eta^+$.

5 Conclusion

We study the RSVM from a model-building point of view. We consider the nonlinear classifier $f(x)$ as a mixture of kernels and express it as a linear combination of kernel functions sampled from the full atom set on the training data. We also interpret the RSVM as an instance-based learning algorithm. The reduced kernel matrix records the similarity between the reduced set and the entire training data set. The RSVM uses a weighted voting strategy to classify a new data point. That is, if the weighted sum $K(x', \tilde{A}')\tilde{v}$ is greater than a threshold $\gamma$, then the point $x$ is classified as a positive example, where $\tilde{v}$ and $\gamma$ are determined by solving an unconstrained minimization problem. Our main result shows that uniform random selection of the reduced set in RSVM is the optimal sampling strategy for atom kernel selection in the sense of minimizing an intrinsic model variation measure. We also give this RSVM sampling strategy a minimaxity and a maximinity interpretation in terms of model bias and test ability, respectively. This theoretical study provides a better understanding of the RSVM learning algorithm and helps to extend the random subset approach to most kernel-based learning algorithms.

6 Appendix

Lemma 1 For a given probability density function $t(z)$ with respect to $\xi_0$, let $\mathcal{P}_t$ denote the collection of probability density functions $p(z)$ such that $\int_X (t(z)/p(z))\,d\xi_0(z) < \infty$. (Here $0/0$ is defined to be zero.) Then the solution to the optimality problem
$$\arg\min_{p \in \mathcal{P}_t} \int_X (t(z)/p(z))\,d\xi_0(z)$$
is given by $p(z) = c^{-1} t^{1/2}(z)$, where $c = \int_X t^{1/2}(z)\,d\xi_0(z)$.

Proof: Since $\int_X (t(z)/p(z))\,d\xi_0(z) < \infty$ and $\int_X p(z)\,d\xi_0(z) = 1$, by the Hölder inequality we have
$$\int_X (t(z)/p(z))\,d\xi_0(z) = \int_X (t(z)/p(z))\,d\xi_0(z) \int_X p(z)\,d\xi_0(z) \geq \left(\int_X t^{1/2}(z)\,d\xi_0(z)\right)^2.$$
Equality holds if and only if there exists some nonzero constant $\beta$ such that $t(z)/p(z) = \beta\,p(z)$ a.e. Since $p(z)$ is a pdf, equality therefore holds if and only if $p(z) = c^{-1} t^{1/2}(z)$.

Lemma 2 $\sup_{g \in F_\eta^-} B(g, \xi_0) = \eta^2$.

Proof: Recall $B(g, \xi) := \int \{g(x) - K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)g(z)\,d\xi(z)\}^2\,d\xi_0(x)$. For $g \in F_\eta^-$, we have $\int K(\tilde{A}, z)g(z)\,d\xi_0(z) = 0$. Then
$$B(g, \xi_0) = \int \left\{g(x) - K(x', \tilde{A}')\tilde{V}_{\xi_0}^- \int K(\tilde{A}, z)g(z)\,d\xi_0(z)\right\}^2 d\xi_0(x) = \int g^2(x)\,d\xi_0 \leq \eta^2, \quad \text{for } g \in F_\eta^-.$$
The proof can be completed by finding a function $g_0 \in F_\eta^-$ such that the above bound holds with equality. A construction of such a function goes as follows. We first find a partition of $X$ into measurable sets, $X = \cup_{i=1}^{\tilde{m}+1} E_i$, with the $E_i$ disjoint and $\xi_0(E_i) > 0$. Let $\beta = (\beta_1, \ldots, \beta_{\tilde{m}+1})'$ be a non-null solution to the following system of $\tilde{m}$ equations in $\tilde{m}+1$ unknowns:
$$\sum_{i=1}^{\tilde{m}+1} \beta_i \int_{E_i} K(\tilde{A}_j, z)\,d\xi_0(z) = 0, \quad j = 1, \ldots, \tilde{m}. \qquad (18)$$

Let
$$g_0(x) = \eta \left(\sum_{i=1}^{\tilde{m}+1} \beta_i^2\,\xi_0(E_i)\right)^{-1/2} \sum_{i=1}^{\tilde{m}+1} \beta_i\,I_{E_i}(x). \qquad (19)$$
Then $\int g_0^2(x)\,d\xi_0(x) = \eta^2$ and, by (18) and (19), it is easy to see that, for $j = 1, \ldots, \tilde{m}$,
$$\int g_0(x)K(\tilde{A}_j, x)\,d\xi_0(x) = \eta \left(\sum_{i=1}^{\tilde{m}+1} \beta_i^2\,\xi_0(E_i)\right)^{-1/2} \sum_{i=1}^{\tilde{m}+1} \beta_i \int_{E_i} K(\tilde{A}_j, x)\,d\xi_0(x) = 0. \qquad (20)$$
Thus, $g_0 \in F_\eta^-$. Also, by (20), we have $B(g_0, \xi_0) = \int_X g_0^2(x)\,d\xi_0(x) = \eta^2$.

Lemma 3 $\inf_{g \in F_\eta^+} B(g, \xi_0) = \eta^2$.

Proof: Start with, for $g \in F_\eta^+$,
$$B(g, \xi_0) = \int_X \left\{g(x) - K(x', \tilde{A}')\tilde{V}_{\xi_0}^- \int K(\tilde{A}, z)g(z)\,d\xi_0(z)\right\}^2 d\xi_0(x) = \int g^2(x)\,d\xi_0 \geq \eta^2.$$
The function $g_0$ given in (19) is also in $F_\eta^+$ and it satisfies the equality $B(g_0, \xi_0) = \eta^2$.

Lemma 4 $B(h, \xi) = 0$ for all $h \in R$.

Proof: For any $h \in R$, there exists an $\tilde{h} \in R^{\tilde{m}}$ such that $h$ can be represented as $h(z) = K(z', \tilde{A}')\tilde{h}$. Then
$$K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)h(z)\,d\xi(z) = K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)K(z', \tilde{A}')\tilde{h}\,d\xi(z) = K(x', \tilde{A}')\tilde{h} = h(x).$$
Thus, we have $B(h, \xi) = 0$ for $h \in R$.

Lemma 5 We have the following minimaxity and maximinity properties for $B(g, \xi)$: (1) $\inf_{\xi \in \mathcal{P}} \sup_{g \in F_\eta^- \oplus R} B(g, \xi) = \sup_{g \in F_\eta^- \oplus R} B(g, \xi_0)$ and (2) $\sup_{\xi \in \mathcal{P}} \inf_{g \in F_\eta^+ \oplus R} B(g, \xi) = \inf_{g \in F_\eta^+ \oplus R} B(g, \xi_0)$.

Proof: For an arbitrary $f \in F_\eta^- \oplus R$ (or $f \in F_\eta^+ \oplus R$), $f$ can be decomposed uniquely as $f = h + g$, where $h \in R$ and $g \in F_\eta^-$ (or $g \in F_\eta^+$). As $B(h, \xi) = 0$ for $h \in R$, we have $B(f, \xi) = B(g, \xi)$. Therefore, it is sufficient to prove statement (1) for $g$ in the class $F_\eta^-$ and statement (2) for $g$ in the class $F_\eta^+$. That is, instead of (1) and (2), it is sufficient to show

(1') $\inf_{\xi \in \mathcal{P}} \sup_{g \in F_\eta^-} B(g, \xi) = \sup_{g \in F_\eta^-} B(g, \xi_0)$ and (2') $\sup_{\xi \in \mathcal{P}} \inf_{g \in F_\eta^+} B(g, \xi) = \inf_{g \in F_\eta^+} B(g, \xi_0)$.

(1') We will show that for any $\xi \in \mathcal{P}$ there exists a $g_0 \in F_\eta^-$ such that $B(g_0, \xi) \geq \eta^2$. Then, combined with Lemma 2, we get
$$B(g_0, \xi) \geq \eta^2 = \sup_{g \in F_\eta^-} B(g, \xi_0). \qquad (21)$$
If so, statement (1') is then proved. Below we construct a $g_0$ such that $g_0 \in F_\eta^-$ and $B(g_0, \xi) \geq \eta^2$. Let $\nu = \xi - \xi_0$, a signed measure. By the Hahn decomposition theorem there exists a measurable set $E$ such that $\nu$ is positive on $E$ and negative on $E^c$. We can then find a partition $E = \cup_{i=1}^{2\tilde{m}+1} E_i$ with the $E_i$ disjoint and $\nu(E_i) > 0$. Since $\xi = \nu + \xi_0$ and $\xi$ is absolutely continuous with respect to $\xi_0$, $\xi_0(E_i) > 0$, too. Let $\beta = (\beta_1, \ldots, \beta_{2\tilde{m}+1})'$ be a non-null solution to the following system of $2\tilde{m}$ equations in $2\tilde{m}+1$ unknowns:
$$\sum_{i=1}^{2\tilde{m}+1} \beta_i \int_{E_i} K(\tilde{A}_j, z)\,d\xi_0(z) = 0, \quad j = 1, \ldots, \tilde{m}, \qquad (22)$$
$$\sum_{i=1}^{2\tilde{m}+1} \beta_i \int_{E_i} K(\tilde{A}_j, z)\,d\xi(z) = 0, \quad j = 1, \ldots, \tilde{m}. \qquad (23)$$
Let
$$g_0(x) := \eta \left(\sum_{i=1}^{2\tilde{m}+1} \beta_i^2\,\xi_0(E_i)\right)^{-1/2} \sum_{i=1}^{2\tilde{m}+1} \beta_i\,I_{E_i}(x). \qquad (24)$$
Since $\xi = \nu + \xi_0$ and $\nu$ is positive on $E$, we have
$$\int g_0^2(x)\,d\xi(x) \geq \int g_0^2(x)\,d\xi_0(x) = \eta^2.$$

By (22), we get $\int g_0(x)K(\tilde{A}_j, x)\,d\xi_0(x) = 0$ for $1 \leq j \leq \tilde{m}$. Thus, $g_0$ is in $F_\eta^-$. By (23), we get $\int g_0(x)K(\tilde{A}_j, x)\,d\xi(x) = 0$ for $1 \leq j \leq \tilde{m}$. Thus,
$$B(g_0, \xi) = \int \left\{g_0(x) - K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)g_0(z)\,d\xi(z)\right\}^2 d\xi_0(x) = \int g_0^2(x)\,d\xi_0(x) = \eta^2.$$
Therefore, the proof of (1') is completed.

(2') We will show that for any $\xi \in \mathcal{P}$ there exists an $h_0 \in F_\eta^+$ such that $B(h_0, \xi) \leq \eta^2$. Then, combined with Lemma 3, we get
$$B(h_0, \xi) \leq \eta^2 = \inf_{g \in F_\eta^+} B(g, \xi_0). \qquad (25)$$
If so, statement (2') can then be proved. Below we construct such an $h_0$. Let $\nu = \xi - \xi_0$ and $E^c$ be as defined earlier in this proof. We then find a partition of $E^c$, $E^c = \cup_{i=1}^{\tilde{m}+1} U_i$, with the $U_i$ disjoint and $\nu(U_i) < 0$. Then $\xi_0(U_i) = \xi(U_i) - \nu(U_i) > 0$. Let $\beta = (\beta_1, \ldots, \beta_{\tilde{m}+1})'$ be a non-null solution to the following system of $\tilde{m}$ equations in $\tilde{m}+1$ unknowns:
$$\sum_{i=1}^{\tilde{m}+1} \beta_i \int_{U_i} K(\tilde{A}_j, z)\,d\xi_0(z) = 0, \quad j = 1, \ldots, \tilde{m}. \qquad (26)$$
Let
$$h_0(x) := \eta \left(\sum_{i=1}^{\tilde{m}+1} \beta_i^2\,\xi_0(U_i)\right)^{-1/2} \sum_{i=1}^{\tilde{m}+1} \beta_i\,I_{U_i}(x). \qquad (27)$$
Then $\int h_0^2(x)\,d\xi_0(x) = \eta^2$ and $\int h_0(x)K(\tilde{A}_j, x)\,d\xi_0(x) = 0$ for $j = 1, \ldots, \tilde{m}$. Thus, $h_0 \in F_\eta^+$. Also,
$$B(h_0, \xi) = \int \left\{h_0(x) - K(x', \tilde{A}')\tilde{V}_\xi^- \int K(\tilde{A}, z)h_0(z)\,d\xi(z)\right\}^2 d\xi_0(x) \leq \int h_0^2(x)\,d\xi_0(x) = \eta^2.$$
Therefore, the proof of (2') is completed.

References

[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.

[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, 2000.

[3] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support vector regression machines. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. MIT Press.

[4] K. T. Fang, D. K. J. Lin, P. Winker, and Y. Zhang. Uniform design: theory and application. Technometrics, 42, 2000.

[5] K. T. Fang, Y. Wang, and P. M. Bentler. Some applications of number-theoretic methods in statistics. Statistical Science, 9, 1994.

[6] G. Fung and O. L. Mangasarian. Proximal support vector machine classifiers. In F. Provost and R. Srikant, editors, Proceedings KDD-2001: Knowledge Discovery and Data Mining, August 26-29, 2001, San Francisco, CA, pages 77-86, New York, 2001. Association for Computing Machinery. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-02.ps.

[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, 2001.

[8] Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. Technical Report 00-07, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, July 2000. Proceedings of the First SIAM International Conference on Data Mining, Chicago, April 5-7, 2001, CD-ROM Proceedings. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-07.ps.

[9] Yuh-Jye Lee and Wen-Feng Hsieh. ε-SSVR: A smooth support vector machine for ε-insensitive regression. Technical report, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology. IEEE Transactions on Knowledge and Data Engineering, submitted.

[10] Yuh-Jye Lee and O. L. Mangasarian. SSVM: A smooth support vector machine for classification. Computational Optimization and Applications, 20:5-22, 2001. Data Mining Institute, University of Wisconsin, Technical Report 99-03. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.ps.

[11] K.-M. Lin and C.-J. Lin. A study on reduced support vector machines. IEEE Transactions on Neural Networks, 14, 2003.

[12] O. L. Mangasarian. Generalized support vector machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps.

[13] O. L. Mangasarian and D. R. Musicant. Large scale kernel regression via linear programming. Technical Report 99-02, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, August 1999. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-02.ps.

[14] O. L. Mangasarian and D. R. Musicant. Robust linear and support vector regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9), 2000. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-09.ps.

[15] T. M. Mitchell. Machine Learning. McGraw-Hill, Boston, 1997.

[16] A. Smola and B. Schölkopf. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[17] Alex J. Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2000.

[18] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[19] D. P. Wiens. Designs for approximately linear regression: Two optimality properties of uniform designs. Statist. & Probab. Letters, 12, 1991.

[20] D. P. Wiens. Minimax designs for approximately linear regression. J. Statist. Plann. Inference, 31, 1992.

[21] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, Cambridge, MA, 2001. MIT Press.

[22] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA.


More information

Approximate Kernel PCA with Random Features

Approximate Kernel PCA with Random Features Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,

More information

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors R u t c o r Research R e p o r t The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors Martin Anthony a Joel Ratsaby b RRR 17-2011, October 2011 RUTCOR Rutgers Center for Operations

More information

Equivalence of Minimal l 0 and l p Norm Solutions of Linear Equalities, Inequalities and Linear Programs for Sufficiently Small p

Equivalence of Minimal l 0 and l p Norm Solutions of Linear Equalities, Inequalities and Linear Programs for Sufficiently Small p Equivalence of Minimal l 0 and l p Norm Solutions of Linear Equalities, Inequalities and Linear Programs for Sufficiently Small p G. M. FUNG glenn.fung@siemens.com R&D Clinical Systems Siemens Medical

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM

An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM Wei Chu Chong Jin Ong chuwei@gatsby.ucl.ac.uk mpeongcj@nus.edu.sg S. Sathiya Keerthi mpessk@nus.edu.sg Control Division, Department

More information

below, kernel PCA Eigenvectors, and linear combinations thereof. For the cases where the pre-image does exist, we can provide a means of constructing

below, kernel PCA Eigenvectors, and linear combinations thereof. For the cases where the pre-image does exist, we can provide a means of constructing Kernel PCA Pattern Reconstruction via Approximate Pre-Images Bernhard Scholkopf, Sebastian Mika, Alex Smola, Gunnar Ratsch, & Klaus-Robert Muller GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany fbs,

More information

Support Vector Machines on General Confidence Functions

Support Vector Machines on General Confidence Functions Support Vector Machines on General Confidence Functions Yuhong Guo University of Alberta yuhong@cs.ualberta.ca Dale Schuurmans University of Alberta dale@cs.ualberta.ca Abstract We present a generalized

More information

Convex Optimization in Classification Problems

Convex Optimization in Classification Problems New Trends in Optimization and Computational Algorithms December 9 13, 2001 Convex Optimization in Classification Problems Laurent El Ghaoui Department of EECS, UC Berkeley elghaoui@eecs.berkeley.edu 1

More information

Relevance Vector Machines for Earthquake Response Spectra

Relevance Vector Machines for Earthquake Response Spectra 2012 2011 American American Transactions Transactions on on Engineering Engineering & Applied Applied Sciences Sciences. American Transactions on Engineering & Applied Sciences http://tuengr.com/ateas

More information

Kernel Machines and Additive Fuzzy Systems: Classification and Function Approximation

Kernel Machines and Additive Fuzzy Systems: Classification and Function Approximation Kernel Machines and Additive Fuzzy Systems: Classification and Function Approximation Yixin Chen * and James Z. Wang * Dept. of Computer Science and Engineering, School of Information Sciences and Technology

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Polyhedral Computation. Linear Classifiers & the SVM

Polyhedral Computation. Linear Classifiers & the SVM Polyhedral Computation Linear Classifiers & the SVM mcuturi@i.kyoto-u.ac.jp Nov 26 2010 1 Statistical Inference Statistical: useful to study random systems... Mutations, environmental changes etc. life

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

MinOver Revisited for Incremental Support-Vector-Classification

MinOver Revisited for Incremental Support-Vector-Classification MinOver Revisited for Incremental Support-Vector-Classification Thomas Martinetz Institute for Neuro- and Bioinformatics University of Lübeck D-23538 Lübeck, Germany martinetz@informatik.uni-luebeck.de

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Support vector machines (SVMs) are one of the central concepts in all of machine learning. They are simply a combination of two ideas: linear classification via maximum (or optimal

More information

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,

More information