Theoretical Study on Reduced Support Vector Machines
Su-Yun Huang
Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan

Yuh-Jye Lee
Computer Science & Information Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan

Abstract

The reduced support vector machine (RSVM) was proposed with the practical objective of overcoming the computational difficulties, as well as reducing the model complexity, in generating a nonlinear separating surface for a massive dataset. It has been successfully applied to other kernel-based learning algorithms, and an experimental study has demonstrated its efficiency. In
this paper we study the RSVM from the viewpoint of model building and consider the nonlinear separating surface as a mixture of kernels. The RSVM uses a compressed model representation instead of a saturated full model; fitting data to a saturated model is deemed unstable and causes the prediction error to rise due to over-fitting. Our main result shows that uniform random selection of the reduced set in the RSVM is the optimal selection scheme in terms of the following criteria: (1) it minimizes an intrinsic model variation measure; (2) it minimizes the maximal model bias between the compressed model and the full model; (3) it maximizes the minimal ability to distinguish the compressed model from the full model.

1 Introduction

In recent years, support vector machines (SVMs) with linear or nonlinear kernels [1, 2, 18] have become one of the most promising learning algorithms for classification as well as regression [3, 13, 14], which are fundamental tasks in data mining [22]. Via the use of a kernel mapping, the SVM allows effective and flexible nonlinear classification. Some major problems confront large-scale data classification because of the need to deal with a fully dense nonlinear kernel matrix. To overcome these computational problems, many approximation schemes have been proposed for approximating a symmetric positive semidefinite matrix [17, 21]. Lee and Mangasarian proposed as an alternative the reduced support vector machine (RSVM) [8]. The key ideas of the RSVM are as follows. The RSVM randomly selects a small portion of the data set to generate a thin rectangular kernel matrix, and then uses this small rectangular kernel matrix to replace the fully dense square kernel matrix in the nonlinear SVM formulation. Computational time, as well as memory usage, is much smaller for the RSVM than for a conventional SVM using the full kernel matrix. As a result, the RSVM
also simplifies the characterization of the nonlinear separating surface. According to Occam's razor, as well as the Minimum Description Length (MDL) principle [15] and the numerical comparisons in [8], the RSVM has better generalization ability than a conventional SVM. This reduced kernel technique has been successfully applied to other kernel-based learning algorithms, such as the proximal support vector machine (PSVM) [6] and ɛ-smooth support vector regression (ɛ-SSVR) [9]. There is also an experimental study [11] that showed the efficiency of the RSVM. The RSVM results reported in [11] could be further improved if a larger tuning parameter had been chosen properly: since the RSVM reduces the model complexity by using a much smaller rectangular kernel matrix, we suggest using a larger tuning parameter to put more weight on goodness of fit to the data.

In this paper we study the RSVM from the viewpoint of model building. We give the uniform random selection several optimality interpretations and show that this uniform random subset approach has an active effect in reducing model complexity. The uniform random subset selection in the RSVM is also linked to the popular uniform design [4].

We briefly outline the contents of the paper. Section 2 provides the main idea and formulation of the RSVM. Section 3 gives the uniform random selection an optimality interpretation in terms of a model variation measure. Section 4 further provides minimaxity and maximinity properties: we show that the RSVM uniform design minimizes the maximal model bias between the compressed model and the full model, and that it maximizes the minimal ability to distinguish the compressed model from the full one. Section 5 concludes the paper. All proofs are given in the Appendix.

A word about our notation and background material is given below. All vectors are column vectors unless otherwise specified or transposed to a row vector by a prime superscript.
For a vector x in the n-dimensional real space R^n, the plus
function x₊ is defined componentwise as (x₊)_j = max{0, x_j}. The scalar (inner) product of two vectors x and z in R^n will be denoted by x′z, and the p-norm of x by ‖x‖_p. For a matrix A ∈ R^{m×n}, A_i is the ith row of A. A column vector of ones of arbitrary dimension will be denoted by e. For A ∈ R^{m×n} and B ∈ R^{n×l}, the kernel K(A, B) maps R^{m×n} × R^{n×l} into R^{m×l}. In particular, K(x′, z) is a real number, K(x′, A′) is a row vector in R^m, K(A, x) is a column vector in R^m, and K(A, A′) is an m×m matrix. The base of the natural logarithm will be denoted by ε.

2 Reduced Support Vector Machines

Consider the problem of classifying points into two classes, A− and A+. We are given a training data set {(x_i, y_i)}_{i=1}^m, where x_i ∈ X ⊆ R^n is an input vector and y_i ∈ {−1, 1} is a class label indicating to which of the two classes the input point belongs. We represent these data points by an m×n matrix A, where the ith row A_i corresponds to the ith input data point. We use A_i (a row vector) and x_i (a column vector) interchangeably for the same ith data point, depending on convenience. We use an m×m diagonal matrix D with D_ii = y_i to specify the class membership of each input point. The main goal of the classification problem is to find a classifier that correctly predicts the unseen class labels of new data inputs. This can be achieved by constructing a linear or nonlinear separating surface, f(x) = 0, implicitly defined by a kernel function. We classify a test point x into A+ if f(x) ≥ 0 and into A− otherwise. We will focus on the nonlinear case in this paper. In a conventional SVM, as in many kernel-based learning algorithms [1, 2, 18], generating a nonlinear separating surface requires dealing with a fully dense kernel matrix. When training a nonlinear SVM on a massive data set, the huge and dense full kernel matrix leads to the following computational difficulties [8]:
- the size of the mathematical programming problem;
- the dependence of the nonlinear separating surface on the entire data set, which creates unwieldy storage problems that prevent the use of nonlinear kernels for anything but small data sets;
- the overfitting of the separating surface, which causes high variation and less stable fitting.

To avoid these difficulties, the RSVM uses a very small random subset of size m̃ of the original m data points, where m̃ ≪ m, to build up the separating surface. We denote this random subset by Ã; it is used to generate a much smaller rectangular kernel matrix K(A, Ã′) ∈ R^{m×m̃}. The reduced kernel matrix serves to replace the huge and fully dense square matrix K(A, A′) used in the conventional SVM, cutting the problem size, computation time and memory usage, as well as simplifying the characterization of the nonlinear separating surface. This reduced kernel technique can be applied to most kernel-based learning algorithms.

We now briefly describe the RSVM formulation, which is derived from the generalized support vector machine (GSVM) [12] and the smooth support vector machine (SSVM) [10]. The RSVM solves the following unconstrained minimization problem for a general kernel K(A, Ã′):

min_{(ũ,γ) ∈ R^{m̃+1}}  (ν/2) ‖p(e − D(K(A, Ã′)D̃ũ − eγ), α)‖²₂ + (1/2)(ũ′ũ + γ²),  (1)

where the function p(x, α) is a very accurate smooth approximation to (x)₊ [10], applied componentwise to the vector e − D(K(A, Ã′)D̃ũ − eγ) and defined componentwise by

{the jth component of p(x, α)} = x_j + (1/α) log(1 + ε^{−αx_j}),  α > 0, j = 1, …, n.  (2)

The function p(x, α) converges to (x)₊ as α goes to infinity. The positive tuning parameter ν here controls the tradeoff between the classification error and the suppression of (ũ, γ). Since the RSVM has essentially reduced the model complexity via
using a much smaller rectangular kernel matrix, we suggest using a larger tuning parameter ν in the RSVM than in a conventional SVM. The diagonal matrix D̃ ∈ R^{m̃×m̃}, with ones or minus ones along its diagonal, specifies the class membership of each point in the reduced set. If we let ṽ = D̃ũ, i.e., ṽ_i = D̃_ii ũ_i, then ṽ′ṽ = ũ′ũ and problem (1) is equivalent to

min_{(ṽ,γ) ∈ R^{m̃+1}}  (ν/2) ‖p(e − D(K(A, Ã′)ṽ − eγ), α)‖²₂ + (1/2)(ṽ′ṽ + γ²).  (3)

A solution of the minimization problem (3) leads to the nonlinear separating surface

f(x) = K(x′, Ã′)ṽ − γ = 0,  or  Σ_{i=1}^{m̃} ṽ_i K(x′, Ã′_i) − γ = 0.  (4)

The minimization problem (3) retains the strong convexity and differentiability properties in the space R^{m̃+1} of (ṽ, γ) for any arbitrary rectangular kernel. Hence we can apply the Newton-Armijo method [10] directly to solve (3), and the existence and uniqueness of the optimal solution are guaranteed. Moreover, the matrix D̃ has disappeared from the formulation (3), which means that we do not need to record the class label of each data point in the reduced set. We note that the nonlinear separating surface (4) is a linear combination of a set of kernel functions plus a constant.

In a nutshell, the RSVM can be split into two parts. First, it selects a small random subset {K(·, Ã′_1), K(·, Ã′_2), …, K(·, Ã′_m̃)} from the likely highly correlated full-data atom set {K(·, A′_i)}_{i=1}^m for building the separating function. The full-data atom set is inefficient in function representation, but the conventional SVM has been using it. Second, the RSVM determines the best coefficients of the selected kernel functions by solving the unconstrained minimization problem (3) using the entire data set, so that the surface fits the whole data well. In the next section, we study the RSVM from the model building point of view.
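The two-part procedure above can be sketched in code. The following is a minimal illustration (not the authors' implementation) using NumPy and SciPy: a reduced set Ã is drawn uniformly at random, the rectangular kernel K(A, Ã′) is formed, and the smooth objective (3) is minimized over (ṽ, γ). The function names, the choice of a generic BFGS solver in place of Newton-Armijo, and all parameter values (ν, α, σ, m̃) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, sigma=1.0):
    # K(A, B')_{ij} = exp(-||A_i - B_j||^2 / (2 sigma^2)), normalization constant dropped
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def smooth_plus(x, alpha):
    # p(x, alpha) = x + (1/alpha) log(1 + exp(-alpha x)), in an overflow-safe form;
    # converges to max(0, x) as alpha -> infinity
    return np.maximum(x, 0.0) + np.log1p(np.exp(-alpha * np.abs(x))) / alpha

def rsvm_train(A, y, m_tilde=10, nu=100.0, alpha=5.0, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(A.shape[0], size=m_tilde, replace=False)  # uniform random reduced set
    A_tilde = A[idx]
    K = gaussian_kernel(A, A_tilde, sigma)                     # m x m_tilde reduced kernel

    def objective(w):
        v, gamma = w[:-1], w[-1]
        margins = 1.0 - y * (K @ v - gamma)                    # e - D(K(A, A~') v - e gamma)
        hinge = smooth_plus(margins, alpha)
        return 0.5 * nu * hinge @ hinge + 0.5 * (v @ v + gamma ** 2)

    w = minimize(objective, np.zeros(m_tilde + 1), method="BFGS").x
    return A_tilde, w[:-1], w[-1], sigma

def rsvm_predict(x, A_tilde, v, gamma, sigma):
    # separating surface (4): f(x) = K(x', A~') v - gamma
    return np.sign(gaussian_kernel(np.atleast_2d(x), A_tilde, sigma) @ v - gamma)
```

Note that the class labels of the reduced-set points are never used, mirroring the disappearance of D̃ from (3); only the m×m̃ kernel block is ever stored.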
3 Optimal sampling design on the atom set

The conventional nonlinear SVM uses a saturated representation for the separating surface:

f(x) = Σ_{i=1}^m v_i K(x′, A′_i) − γ = 0.  (5)

It is a linear combination of the atom functions {1} ∪ {K(·, A′_i)}_{i=1}^m. One may regard the collection {1} ∪ {K(·, A′_i)}_{i=1}^m as an overcomplete dictionary of atoms when m is large or approaching infinity. The representation (5) is said to be saturated because there are as many parameters v_i as data points. Fitting data to a saturated model is deemed unstable and causes the generalization error to rise due to over-fitting [7]. It is therefore desirable to cut down the model complexity and fit a compressed model:

f(x) = Σ_{i=1}^{m̃} ṽ_i K(x′, Ã′_i) − γ = 0.  (6)

We call {K(·, A′_i)}_{i=1}^m the full atom set based on the training data and {K(·, Ã′_i)}_{i=1}^{m̃} a reduced atom set. Concerning the choice of a reduced set, a simple way is the uniform random subset approach used in the RSVM. The RSVM randomly selects a small portion of atom functions from the full set to generate the compressed model, while it fits the entire data set to this compressed model. Each candidate atom in the full set has an equal chance of being selected. This uniform random subset selection is simple and straightforward, but is it the most effective way of building a compressed model? There have been experimental results showing its effectiveness for modeling classification surfaces [8, 11] as well as regression surfaces [9]. In this article, we justify and complement these experimental results with a theoretical study: we show that uniform random selection is the most effective sampling scheme in terms of the criteria stated later. Before proceeding any further, we define some notation and state some conditions.
Let ξ₀ be a probability measure on X and assume the training inputs follow the distribution ξ₀. In statistical terms, ξ₀ is called the training design. Let H denote the reproducing kernel Hilbert space generated by the kernel K(x′, z), and let D := {K(·, z)}_{z ∈ X}, which is an overcomplete dictionary for H. Let ξ be a probability measure absolutely continuous with respect to ξ₀, denoted by ξ ≪ ξ₀, and let P be the collection of all such probability measures, i.e., P = {ξ : probability measure on X satisfying ξ ≪ ξ₀}. The Radon-Nikodym derivative of ξ with respect to ξ₀ is denoted by p(x), i.e., p(x) = dξ(x)/dξ₀(x). If a random subset of size m̃, {K(·, z_j)}_{j=1}^{m̃}, is selected from the dictionary D according to ξ, i.e., the z_j are i.i.d. from ξ, then we say that the distribution ξ is a sampling design for atom selection.

For convenience, a Gaussian kernel

K_σ(x′, z) = (2πσ²)^{−n/2} exp{−‖x − z‖²/(2σ²)}

is used for the RSVM throughout this paper. However, the theoretical results obtained in this article extend easily to other kernels. We often suppress the subscript σ in K_σ and let K(x − z) := K_σ(x′, z). The value of K(z − x) represents the inner product of the images of x and z in the feature space, after the nonlinear mapping implicitly defined by the Gaussian kernel. We can also interpret it as a measure of similarity between x and z. In other words, K(A_i, A′) records the similarity between A_i and all training inputs. In particular, if we use the Gaussian kernel, we can interpret the RSVM as an instance-based learning algorithm [15]. The reduced kernel matrix arranges only the similarity between the reduced set and the entire training data set. In contrast to the k-nearest-neighbor algorithm, which uses a simple voting strategy, the RSVM uses a
weighted voting strategy, in which the weights and the threshold are determined by a training procedure. That is, if the weighted sum K(x′, Ã′)ṽ is greater than the threshold γ, the point x is classified as a positive example.

The underlying separating function f(x) is modelled as a mixture of kernels plus a bias term. However, as our main focus here is to assess the variability of model building due to atom sampling from the dictionary D, the bias term γ may, without loss of generality, be assumed to be zero. The separating function f(x) is allowed to vary over the function class H, subject to the following conditions:

C1. The function f is a mixture of Gaussians via the mixing distribution ξ₀,

f(x) = ∫ K(x − z) v(z) dξ₀(z),  (7)

where v(z) is a coefficient function satisfying conditions C2 and C3 below.

C2. Assume ∫ v²(z) dξ₀(z) < ∞.

C3. To fix a scale for the separating function, it is assumed, without loss of generality, that ∫ |v(z)| dξ₀(z) = 1.

Re-expressing the above mixture via the sampling design ξ, we have

f(x) = ∫ K(x − z) v(z) (dξ₀(z)/dξ(z)) dξ(z) = ∫ K(x − z) (v(z)/p(z)) dξ(z).  (8)

The separating function f(x) can thus be written as the expectation of f(x, Z):

f(x) = E f(x, Z) := E{K(x − Z) v(Z)/p(Z)},  Z ∼ ξ,

where ∼ stands for "follows the distribution." Sampling from f(x, Z) := K(x − Z) v(Z)/p(Z) is used to build a compressed model for the separating function. The integrated variance of f(x, Z), as a measure of model variation, is given by

∫ Var_ξ{f(x, Z)} dx = ∫∫ K²(x − z) v²(z)/p²(z) dξ(z) dx − ∫ f²(x) dx.  (9)
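As a small numerical illustration of (8), assumed here for concreteness and not taken from the paper, let ξ₀ be the empirical measure on a set of design points and take the sampling design ξ = ξ₀, so p ≡ 1. Averaging K(x − z_j)v(z_j) over a uniformly drawn reduced set then gives an unbiased Monte Carlo estimate of the full mixture f(x); the specific v(z) and kernel width are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
K = lambda x, z, s=0.5: np.exp(-(x - z) ** 2 / (2 * s ** 2))  # 1-D Gaussian kernel

# training design xi_0: empirical measure on 2000 points; coefficient function v
z0 = rng.uniform(-1.0, 1.0, size=2000)
v = np.cos(np.pi * z0)
v = v / np.mean(np.abs(v))            # rescale so that the empirical version of C3 holds

x = 0.3
f_full = np.mean(K(x, z0) * v)        # full mixture f(x) = E_{xi_0}[K(x - Z) v(Z)]

# sampling design xi = xi_0 (p(z) = 1): draw a reduced set uniformly and average
idx = rng.choice(2000, size=400, replace=False)
f_hat = np.mean(K(x, z0[idx]) * v[idx])   # Monte Carlo estimate of f(x)
```

The fluctuation of f_hat around f_full is exactly the model variation that (9) integrates over x, and a non-uniform p would reweight each summand by 1/p(z_j).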
The integrated model variance is used to assess model robustness: the smaller it is, the more stable (less variable) the model built from ξ will be. Therefore, we aim to find a good sampling design ξ with a small integrated model variance. As the quantity ∫ f²(x) dx does not involve ξ, we may drop it in the minimization step and use the following model variation measure.

Definition 1 (Model variation) We use the following measure for the model variation due to a sampling design ξ:

V(ξ) = ∫∫ K²(x − z) v²(z)/p²(z) dξ(z) dx.  (10)

By Lemma 1 in the Appendix, the optimal design minimizing V(ξ) over the set P has density with respect to ξ₀ given by

p(z) ∝ |v(z)| (∫ K²(x − z) dx)^{1/2} ∝ |v(z)|,

since ∫ K²(x − z) dx does not depend on z for the Gaussian kernel. The coefficient function v(z) is not known and has yet to be estimated in the training process. Here we use a constant reference function in place of v(z), and the resulting p(z) is uniform with respect to ξ₀. There are two main reasons for using a constant reference function for v(z). One is to indicate no prior preference for, or information about, v(z) prior to the training process. The other is to reflect the intrinsic mechanism of the SVM, which tends to minimize the 2-norm, ∫ v²(z) dξ₀(z), of v(z); under the fixed-scale condition C3, this 2-norm is minimized when v(z) is a constant function. Thus the uniform p(z) is justified. In other words, atom functions K(·, z) are sampled from the dictionary D according to the training design ξ = ξ₀.

Theorem 1 (Optimal sampling design) The optimal sampling design for atom selection is given by ξ₀. The optimality is in terms of minimizing V(ξ) over ξ ∈ P prior to the training process.

This justifies the uniform sampling scheme (on the training data) used in the original RSVM of Lee and Mangasarian [8].
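The optimality behind Theorem 1 can be checked numerically. The sketch below, with an arbitrary made-up t(z) purely for illustration, treats ξ₀ as the empirical measure on a set of design points, so densities with respect to ξ₀ are vectors averaging to one. The density proportional to t^{1/2} attains the Hölder lower bound of ∫ t/p dξ₀ and beats other candidate densities; with t proportional to v² and v constant, this optimal density is the uniform one.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(0.1, 2.0, size=1000)       # values of t(z) at the design points

def V(p):
    # \int t(z)/p(z) d xi_0(z) under the empirical measure (xi_0 uniform on the points)
    return float(np.mean(t / p))

p_star = np.sqrt(t) / np.mean(np.sqrt(t))  # Lemma 1: p*(z) proportional to t(z)^{1/2}
p_unif = np.ones_like(t)                   # uniform density with respect to xi_0
p_prop = t / np.mean(t)                    # a competing density, proportional to t

lower_bound = float(np.mean(np.sqrt(t)) ** 2)   # (\int t^{1/2} d xi_0)^2
```

Here V(p_star) equals the lower bound exactly, while both competitors are strictly worse unless t happens to be constant.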
In practice, a discrete approximation to ξ₀ is necessary. Naturally, we use its empirical distribution, i.e., we put probability mass 1/m at each training point A_i. Atom functions are then uniformly sampled from {K(·, A′_i)}_{i=1}^m. This random subset approach can drastically cut down the model complexity, while the sampling design helps to guide the atom selection in terms of minimal model variation. However, we remind the reader that the quantity V(ξ) does not account for the parameter estimation variance, but purely for the model variation due to random sampling of atoms from the dictionary D.

The optimal sampling design resulting from Theorem 1 is a uniform design with respect to ξ₀, which seeks atom points uniformly distributed over the training points. The uniform design has been popular; see [4] and [5] for a survey of its theory and applications. The uniform design is a space-filling design, and it seeks to obtain maximal model robustness.

4 Minimaxity and maximinity

In this section we show that the optimal sampling design identified above also possesses minimaxity and maximinity properties. The ideas are inspired by [19, 20] and adapted to the context of the RSVM. Recall the saturated and compressed models:

saturated:   f(x) = Σ_{i=1}^m v_i K(x′, A′_i) − γ,  (11)
compressed:  f(x) = Σ_{i=1}^{m̃} ṽ_i K(x′, Ã′_i) − γ.  (12)

Using the kernel K corresponds to mapping the data into a feature Hilbert space with a map Φ : X → U. In the feature space U, the normal vector of the SVM separating hyperplane, w′u − γ = 0, can be expanded in terms of support vectors. The saturated
model has the following full expansion for the normal vector,

saturated:   w = Σ_{i=1}^m v_i Φ(A_i),  (13)

while the compressed model has the reduced expansion

compressed:  w̃ = Σ_{i=1}^{m̃} ṽ_i Φ(Ã_i).  (14)

The expansions (13) and (14) are called pre-images; see [16] for approximating the full pre-image expansion by a reduced-set expansion. The error induced by the reduced-set approximation is investigated here. We define some more notation:

F := linear span{K(·, A′_i)}_{i=1}^m and R := linear span{K(·, Ã′_i)}_{i=1}^{m̃};

F_η := {g ∈ F : ∫_X g²(z) dξ₀(z) ≤ η² and ∫_X K(Ã_i, z) g(z) dξ₀(z) = 0, i = 1, …, m̃};

F⁺_η := {g ∈ F : ∫_X g²(z) dξ₀(z) ≥ η² and ∫_X K(Ã_i, z) g(z) dξ₀(z) = 0, i = 1, …, m̃};

Ṽ_ξ := ∫_X K(Ã, z) K(z′, Ã′) dξ(z), which is an m̃ × m̃ matrix;

B(f, ξ) := ∫ {f(x) − K(x′, Ã′) Ṽ⁻_ξ ∫ K(Ã, z) f(z) dξ(z)}² dξ₀(x), where f ∈ F, ξ ∈ P and Ṽ⁻_ξ is a generalized inverse of Ṽ_ξ.

We introduce below a measure of the model bias between the full model (11) and the compressed model (12). For an arbitrary function f ∈ F, the L₂(dξ)-projection of f onto the reduced space R is given by

P_R f(x) = K(x′, Ã′) Ṽ⁻_ξ ∫ K(Ã, z) f(z) dξ(z).

We use the following integrated squared bias to account for the model bias:

model bias(f) := ∫ {f(x) − P_R f(x)}² dξ₀ = B(f, ξ).  (15)
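When ξ and ξ₀ are replaced by an empirical measure, the projection P_R f and the bias B(f, ξ) can be computed directly. The one-dimensional sketch below, whose reduced set, kernel width, and test functions are invented for illustration, forms Ṽ_ξ from design points, projects with a pseudo-inverse as the generalized inverse, and checks the statement of Lemma 4 in the Appendix that B(h, ξ) = 0 for h in the reduced span R, whereas a function outside R retains a positive bias.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.7
K = lambda X, Z: np.exp(-(X[:, None] - Z[None, :]) ** 2 / (2 * sigma ** 2))

z = rng.uniform(-1.0, 1.0, 500)     # design points carrying xi = xi_0 (empirical measure)
A_t = np.array([-0.5, 0.0, 0.6])    # reduced set A~ with m_tilde = 3 atoms

Kz = K(z, A_t)                      # K(z, A~'), a 500 x 3 matrix
V_xi = Kz.T @ Kz / len(z)           # V~_xi = \int K(A~, z) K(z', A~') d xi(z)

def proj(f_vals):
    # L2(d xi)-projection P_R f, evaluated at the design points
    c = Kz.T @ f_vals / len(z)                 # \int K(A~, z) f(z) d xi(z)
    return Kz @ np.linalg.pinv(V_xi) @ c

def bias(f_vals):
    r = f_vals - proj(f_vals)
    return float(np.mean(r ** 2))              # B(f, xi) = \int (f - P_R f)^2 d xi_0

h = Kz @ np.array([1.0, -2.0, 0.5])            # h in the reduced span R
g = np.sin(3 * z)                              # a generic function outside R
```

As expected, bias(h) vanishes to numerical precision while bias(g) stays strictly positive, which is the gap that the minimax and maximin results of the next passages control.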
The quantity B(f, ξ) reflects the degree of inadequacy of representing f by the compressed representation. To guard against the maximum possible model bias (or inadequacy), we use a design that minimizes the maximum bias.

Theorem 2 (Minimaxity) The minimax design, which minimizes the maximum bias B(f, ξ) over f ∈ F_η ⊕ R, is achieved by ξ = ξ₀, i.e.,

sup_{f ∈ F_η ⊕ R} B(f, ξ₀) = inf_{ξ ∈ P} sup_{f ∈ F_η ⊕ R} B(f, ξ).  (16)

Equivalently, the uniform design with respect to ξ₀ is the minimax design minimizing the maximum model bias between the full model (11) and the compressed model (12).

The optimal sampling design ξ = ξ₀ minimizes the maximum bias as the saturated model ranges over a neighborhood of the compressed model. The first condition in the definition of F_η specifies a neighborhood around the compressed model, and the second condition ensures the identifiability of the parameter ṽ in this F_η-neighborhood. Below we further provide the optimal sampling design with a maximinity interpretation.

Theorem 3 (Maximinity) For f ∈ F⁺_η ⊕ R, the maximin design for B(f, ξ) is achieved by ξ = ξ₀, i.e.,

inf_{f ∈ F⁺_η ⊕ R} B(f, ξ₀) = sup_{ξ ∈ P} inf_{f ∈ F⁺_η ⊕ R} B(f, ξ).  (17)

Equivalently, the uniform design with respect to ξ₀ is the maximin design.

Recall the definition of F⁺_η. Its first condition specifies a class that deviates from the compressed model by at least a distance η. If the true model is at least η-distance away from R and falls into the class specified by F⁺_η, we would hope to be able to distinguish between the true model and the compressed model in a lack-of-fit test. That is, we would hope to have as large a model bias as possible in order to have
adequate ability to distinguish between the two models in a lack-of-fit (LOF) test. Theorem 3 tells us that the sampling design ξ = ξ₀ achieves the maximum such ability for the worst possible f ∈ F⁺_η ⊕ R.

5 Conclusion

We have studied the RSVM from a model building point of view. We consider the nonlinear classifier f(x) as a mixture of kernels and express it as a linear combination of kernel functions sampled from the full atom set on the training data. We also interpret the RSVM as an instance-based learning algorithm: the reduced kernel matrix records the similarity between the reduced set and the entire training data set, and the RSVM uses a weighted voting strategy to classify a new data point. That is, if the weighted sum K(x′, Ã′)ṽ is greater than a threshold γ, then the point x is classified as a positive example, where ṽ and γ are determined by solving an unconstrained minimization problem. Our main result shows that uniform random selection of the reduced set in the RSVM is the optimal sampling strategy for atom kernel selection, in the sense of minimizing an intrinsic model variation measure. We also give this RSVM sampling strategy a minimaxity and a maximinity interpretation, in terms of model bias and test ability respectively. This theoretical study provides a better understanding of the RSVM learning algorithm and helps to advance the random subset approach for kernel-based learning algorithms.

6 Appendix

Lemma 1 For a given probability density function t(z) with respect to ξ₀, let P_t denote the collection of probability density functions p(z) such that ∫_X (t(z)/p(z)) dξ₀(z) < ∞.
(Here 0/0 is defined to be zero.) Then the solution to the optimization problem

arg min_{p ∈ P_t} ∫_X (t(z)/p(z)) dξ₀(z)

is given by p(z) = c⁻¹ t^{1/2}(z), where c = ∫_X t^{1/2}(z) dξ₀(z).

Proof: Since ∫_X (t(z)/p(z)) dξ₀(z) < ∞ and ∫_X p(z) dξ₀(z) = 1, by the Hölder inequality we have

∫_X (t(z)/p(z)) dξ₀(z) = ∫_X (t(z)/p(z)) dξ₀(z) · ∫_X p(z) dξ₀(z) ≥ (∫_X t^{1/2}(z) dξ₀(z))².

Equality holds if and only if there exists some nonzero constant β such that t(z)/p(z) = βp(z) a.e. Since p(z) is a pdf, equality holds if and only if p(z) = c⁻¹ t^{1/2}(z). ∎

Lemma 2 sup_{g ∈ F_η} B(g, ξ₀) = η².

Proof: Recall that B(g, ξ) := ∫ {g(x) − K(x′, Ã′) Ṽ⁻_ξ ∫ K(Ã, z) g(z) dξ(z)}² dξ₀(x). For g ∈ F_η, we have ∫ K(Ã, z) g(z) dξ₀(z) = 0. Then

B(g, ξ₀) = ∫ {g(x) − K(x′, Ã′) Ṽ⁻_{ξ₀} ∫ K(Ã, z) g(z) dξ₀(z)}² dξ₀(x) = ∫ g²(x) dξ₀ ≤ η², for g ∈ F_η.

The proof is completed by finding a function g₀ ∈ F_η for which the above bound holds with equality. Such a function can be constructed as follows. We first find a partition of X into measurable sets, X = ∪_{i=1}^{m̃+1} E_i, with the E_i disjoint and ξ₀(E_i) > 0. Let β = (β₁, …, β_{m̃+1})′ be a non-null solution to the following system of m̃ equations in m̃ + 1 unknowns:

Σ_{i=1}^{m̃+1} β_i ∫_{E_i} K(Ã_j, z) dξ₀(z) = 0,  j = 1, …, m̃.  (18)
Let

g₀(x) = η (Σ_{i=1}^{m̃+1} β_i² ξ₀(E_i))^{−1/2} Σ_{i=1}^{m̃+1} β_i I_{E_i}(x).  (19)

Then ∫ g₀²(x) dξ₀(x) = η² and, by (18) and (19), it is easy to see that, for j = 1, …, m̃,

∫ g₀(x) K(Ã_j, x) dξ₀(x) = η (Σ_{i=1}^{m̃+1} β_i² ξ₀(E_i))^{−1/2} Σ_{i=1}^{m̃+1} β_i ∫_{E_i} K(Ã_j, x) dξ₀(x) = 0.  (20)

Thus, g₀ ∈ F_η. Also, by (20), we have B(g₀, ξ₀) = ∫_X g₀²(x) dξ₀(x) = η². ∎

Lemma 3 inf_{g ∈ F⁺_η} B(g, ξ₀) = η².

Proof: Start with, for g ∈ F⁺_η,

B(g, ξ₀) = ∫ {g(x) − K(x′, Ã′) Ṽ⁻_{ξ₀} ∫ K(Ã, z) g(z) dξ₀(z)}² dξ₀(x) = ∫_X g²(x) dξ₀ ≥ η².

The function g₀ given in (19) is also in F⁺_η, and it satisfies the equality B(g₀, ξ₀) = η². ∎

Lemma 4 B(h, ξ) = 0 for all h ∈ R.

Proof: For any h ∈ R, there exists an h̃ ∈ R^{m̃} such that h can be represented as h(z) = K(z′, Ã′)h̃. Then

K(x′, Ã′) Ṽ⁻_ξ ∫ K(Ã, z) h(z) dξ(z) = K(x′, Ã′) Ṽ⁻_ξ ∫ K(Ã, z) K(z′, Ã′) h̃ dξ(z) = K(x′, Ã′) h̃ = h(x).

Thus, we have B(h, ξ) = 0 for h ∈ R. ∎

Lemma 5 We have the following minimaxity and maximinity properties for B(g, ξ):
(1) inf_{ξ ∈ P} sup_{g ∈ F_η ⊕ R} B(g, ξ) = sup_{g ∈ F_η ⊕ R} B(g, ξ₀), and
(2) sup_{ξ ∈ P} inf_{g ∈ F⁺_η ⊕ R} B(g, ξ) = inf_{g ∈ F⁺_η ⊕ R} B(g, ξ₀).
Proof: An arbitrary f ∈ F_η ⊕ R (or f ∈ F⁺_η ⊕ R) can be decomposed uniquely as f = h + g, where h ∈ R and g ∈ F_η (or g ∈ F⁺_η). As B(h, ξ) = 0 for h ∈ R, we have B(f, ξ) = B(g, ξ). Therefore, it is sufficient to prove statement (1) for g in the class F_η and statement (2) for g in the class F⁺_η. That is, instead of (1) and (2), it suffices to show

(1′) inf_{ξ ∈ P} sup_{g ∈ F_η} B(g, ξ) = sup_{g ∈ F_η} B(g, ξ₀), and
(2′) sup_{ξ ∈ P} inf_{g ∈ F⁺_η} B(g, ξ) = inf_{g ∈ F⁺_η} B(g, ξ₀).

(1′) We will show that for any ξ ∈ P there exists a g₀ ∈ F_η such that B(g₀, ξ) ≥ η². Then, combined with Lemma 2, we get

B(g₀, ξ) ≥ η² = sup_{g ∈ F_η} B(g, ξ₀),  (21)

which proves statement (1′). Below we construct a g₀ with g₀ ∈ F_η and B(g₀, ξ) ≥ η². Let ν = ξ − ξ₀, a signed measure. By the Hahn decomposition theorem, there exists a measurable set E such that ν is positive on E and negative on E^c. We can then find a partition E = ∪_{i=1}^{2m̃+1} E_i with the E_i disjoint and ν(E_i) > 0. Since ξ = ν + ξ₀ and ξ is absolutely continuous with respect to ξ₀, we also have ξ₀(E_i) > 0. Let β = (β₁, …, β_{2m̃+1})′ be a non-null solution to the following system of 2m̃ equations in 2m̃ + 1 unknowns:

Σ_{i=1}^{2m̃+1} β_i ∫_{E_i} K(Ã_j, z) dξ₀(z) = 0,  j = 1, …, m̃,  (22)

Σ_{i=1}^{2m̃+1} β_i ∫_{E_i} K(Ã_j, z) dξ(z) = 0,  j = 1, …, m̃.  (23)

Let

g₀(x) := η (Σ_{i=1}^{2m̃+1} β_i² ξ₀(E_i))^{−1/2} Σ_{i=1}^{2m̃+1} β_i I_{E_i}(x).  (24)

Since ξ = ν + ξ₀ and ν is positive on E,

∫ g₀²(x) dξ(x) ≥ ∫ g₀²(x) dξ₀(x) = η².
By (22), we get ∫ g₀(x) K(Ã_j, x) dξ₀(x) = 0 for 1 ≤ j ≤ m̃; thus g₀ is in F_η. By (23), we get ∫ g₀(x) K(Ã_j, x) dξ(x) = 0 for 1 ≤ j ≤ m̃. Thus,

B(g₀, ξ) = ∫ {g₀(x) − K(x′, Ã′) Ṽ⁻_ξ ∫ K(Ã, z) g₀(z) dξ(z)}² dξ₀(x) = ∫ g₀²(x) dξ₀(x) = η².

This completes the proof of (1′).

(2′) We will show that for any ξ ∈ P there exists an h₀ ∈ F⁺_η such that B(h₀, ξ) ≤ η². Then, combined with Lemma 3, we get

B(h₀, ξ) ≤ η² = inf_{g ∈ F⁺_η} B(g, ξ₀),  (25)

which proves statement (2′). Below we construct such an h₀. Let ν = ξ − ξ₀ and let E^c be as defined earlier in this proof. We then find a partition E^c = ∪_{i=1}^{m̃+1} U_i with the U_i disjoint and ν(U_i) < 0. Then ξ₀(U_i) = ξ(U_i) − ν(U_i) > 0. Let β = (β₁, …, β_{m̃+1})′ be a non-null solution to the following system of m̃ equations in m̃ + 1 unknowns:

Σ_{i=1}^{m̃+1} β_i ∫_{U_i} K(Ã_j, z) dξ₀(z) = 0,  j = 1, …, m̃.  (26)

Let

h₀(x) := η (Σ_{i=1}^{m̃+1} β_i² ξ₀(U_i))^{−1/2} Σ_{i=1}^{m̃+1} β_i I_{U_i}(x).  (27)

Then ∫ h₀²(x) dξ₀(x) = η² and ∫ h₀(x) K(Ã_j, x) dξ₀(x) = 0 for j = 1, …, m̃; thus h₀ ∈ F⁺_η. Also,

B(h₀, ξ) = ∫ {h₀(x) − K(x′, Ã′) Ṽ⁻_ξ ∫ K(Ã, z) h₀(z) dξ(z)}² dξ₀(x) ≤ ∫ h₀²(x) dξ₀(x) = η².

This completes the proof of (2′). ∎

References

[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2).
[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.

[3] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support vector regression machines. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, Cambridge, MA. MIT Press.

[4] K. T. Fang, D. K. J. Lin, P. Winker, and Y. Zhang. Uniform design: theory and application. Technometrics, 42.

[5] K. T. Fang, Y. Wang, and P. M. Bentler. Some applications of number-theoretic methods in statistics. Statistical Science, 9.

[6] G. Fung and O. L. Mangasarian. Proximal support vector machine classifiers. In F. Provost and R. Srikant, editors, Proceedings KDD-2001: Knowledge Discovery and Data Mining, August 26-29, 2001, San Francisco, CA, pages 77-86, New York. Association for Computing Machinery. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-02.ps.

[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, New York.

[8] Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. Technical Report 00-07, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, July. Proceedings of the First SIAM International Conference on Data Mining, Chicago, April 5-7, 2001, CD-ROM Proceedings. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-07.ps.

[9] Y.-J. Lee and W.-F. Hsieh. ɛ-SSVR: A smooth support vector machine for ɛ-insensitive regression. Technical report, Department of Computer Science and
Information Engineering, National Taiwan University of Science and Technology. IEEE Transactions on Knowledge and Data Engineering, submitted.

[10] Y.-J. Lee and O. L. Mangasarian. SSVM: A smooth support vector machine. Computational Optimization and Applications, 20:5-22. Data Mining Institute, University of Wisconsin, Technical Report. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.ps.

[11] K.-M. Lin and C.-J. Lin. A study on reduced support vector machines. IEEE Transactions on Neural Networks, 14.

[12] O. L. Mangasarian. Generalized support vector machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA. MIT Press. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps.

[13] O. L. Mangasarian and D. R. Musicant. Large scale kernel regression via linear programming. Technical Report 99-02, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, August. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-02.ps.

[14] O. L. Mangasarian and D. R. Musicant. Robust linear and support vector regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9). ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-09.ps.

[15] T. M. Mitchell. Machine Learning. McGraw-Hill, Boston.

[16] A. Smola and B. Schölkopf. Learning with Kernels. MIT Press, Cambridge, MA.
[17] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA.

[18] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York.

[19] D. P. Wiens. Designs for approximately linear regression: Two optimality properties of uniform designs. Statistics & Probability Letters, 12.

[20] D. P. Wiens. Minimax designs for approximately linear regression. Journal of Statistical Planning and Inference, 31.

[21] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, Cambridge, MA. MIT Press.

[22] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA.
More informationBrief Introduction to Machine Learning
Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector
More informationLinear Dependency Between and the Input Noise in -Support Vector Regression
544 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 14, NO. 3, MAY 2003 Linear Dependency Between the Input Noise in -Support Vector Regression James T. Kwok Ivor W. Tsang Abstract In using the -support vector
More informationMachine Learning : Support Vector Machines
Machine Learning Support Vector Machines 05/01/2014 Machine Learning : Support Vector Machines Linear Classifiers (recap) A building block for almost all a mapping, a partitioning of the input space into
More informationA GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong
A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,
More informationScale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract
Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses
More informationVC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms
03/Feb/2010 VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms Presented by Andriy Temko Department of Electrical and Electronic Engineering Page 2 of
More informationMargin Maximizing Loss Functions
Margin Maximizing Loss Functions Saharon Rosset, Ji Zhu and Trevor Hastie Department of Statistics Stanford University Stanford, CA, 94305 saharon, jzhu, hastie@stat.stanford.edu Abstract Margin maximizing
More informationStatistical Properties and Adaptive Tuning of Support Vector Machines
Machine Learning, 48, 115 136, 2002 c 2002 Kluwer Academic Publishers. Manufactured in The Netherlands. Statistical Properties and Adaptive Tuning of Support Vector Machines YI LIN yilin@stat.wisc.edu
More informationA Tutorial on Support Vector Machine
A Tutorial on School of Computing National University of Singapore Contents Theory on Using with Other s Contents Transforming Theory on Using with Other s What is a classifier? A function that maps instances
More informationSSVM: A Smooth Support Vector Machine for Classification
SSVM: A Smooth Support Vector Machine for Classification Yuh-Jye Lee and O. L. Mangasarian Computer Sciences Department University of Wisconsin 1210 West Dayton Street Madison, WI 53706 yuh-jye@cs.wisc.edu,
More informationSupport Vector Machines
Support Vector Machines Tobias Pohlen Selected Topics in Human Language Technology and Pattern Recognition February 10, 2014 Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6
More informationDiscriminative Direction for Kernel Classifiers
Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering
More informationLecture 10: Support Vector Machine and Large Margin Classifier
Lecture 10: Support Vector Machine and Large Margin Classifier Applied Multivariate Analysis Math 570, Fall 2014 Xingye Qiao Department of Mathematical Sciences Binghamton University E-mail: qiao@math.binghamton.edu
More informationThe Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors
The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors Martin Anthony Department of Mathematics London School of Economics and Political Science Houghton Street, London WC2A2AE
More informationAn introduction to Support Vector Machines
1 An introduction to Support Vector Machines Giorgio Valentini DSI - Dipartimento di Scienze dell Informazione Università degli Studi di Milano e-mail: valenti@dsi.unimi.it 2 Outline Linear classifiers
More informationSemi-Supervised Learning through Principal Directions Estimation
Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de
More informationSupport Vector Machine & Its Applications
Support Vector Machine & Its Applications A portion (1/3) of the slides are taken from Prof. Andrew Moore s SVM tutorial at http://www.cs.cmu.edu/~awm/tutorials Mingyue Tan The University of British Columbia
More informationSample width for multi-category classifiers
R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University
More informationOutliers Treatment in Support Vector Regression for Financial Time Series Prediction
Outliers Treatment in Support Vector Regression for Financial Time Series Prediction Haiqin Yang, Kaizhu Huang, Laiwan Chan, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering
More informationAnalysis of Multiclass Support Vector Machines
Analysis of Multiclass Support Vector Machines Shigeo Abe Graduate School of Science and Technology Kobe University Kobe, Japan abe@eedept.kobe-u.ac.jp Abstract Since support vector machines for pattern
More informationSupport Vector Machines
Support Vector Machines Stephan Dreiseitl University of Applied Sciences Upper Austria at Hagenberg Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support Overview Motivation
More informationPredicting the Probability of Correct Classification
Predicting the Probability of Correct Classification Gregory Z. Grudic Department of Computer Science University of Colorado, Boulder grudic@cs.colorado.edu Abstract We propose a formulation for binary
More informationSPLINE FUNCTION SMOOTH SUPPORT VECTOR MACHINE FOR CLASSIFICATION. Yubo Yuan. Weiguo Fan. Dongmei Pu. (Communicated by David Yang Gao)
JOURNAL OF INDUSTRIAL AND Website: http://aimsciences.org MANAGEMENT OPTIMIZATION Volume 3, Number 3, August 2007 pp. 529 542 SPLINE FUNCTION SMOOTH SUPPORT VECTOR MACHINE FOR CLASSIFICATION Yubo Yuan
More informationSUPPORT VECTOR MACHINE
SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More informationMachine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015
Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and
More informationLinear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationLinear Classification and SVM. Dr. Xin Zhang
Linear Classification and SVM Dr. Xin Zhang Email: eexinzhang@scut.edu.cn What is linear classification? Classification is intrinsically non-linear It puts non-identical things in the same class, so a
More informationSVM-based Feature Selection by Direct Objective Minimisation
SVM-based Feature Selection by Direct Objective Minimisation Julia Neumann, Christoph Schnörr, and Gabriele Steidl Dept. of Mathematics and Computer Science University of Mannheim, 683 Mannheim, Germany
More informationA note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil
MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 68 November 999 C.B.C.L
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationSupport Vector Machines Explained
December 23, 2008 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationNearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2
Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal
More informationSVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION
International Journal of Pure and Applied Mathematics Volume 87 No. 6 2013, 741-750 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu doi: http://dx.doi.org/10.12732/ijpam.v87i6.2
More informationSupport Vector Machine
Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
More informationMachine Learning And Applications: Supervised Learning-SVM
Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationLinear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training
More informationSupport'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan
Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination
More informationCS 231A Section 1: Linear Algebra & Probability Review
CS 231A Section 1: Linear Algebra & Probability Review 1 Topics Support Vector Machines Boosting Viola-Jones face detector Linear Algebra Review Notation Operations & Properties Matrix Calculus Probability
More informationSupport Vector Machines. Maximizing the Margin
Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the
More informationRobust Kernel-Based Regression
Robust Kernel-Based Regression Budi Santosa Department of Industrial Engineering Sepuluh Nopember Institute of Technology Kampus ITS Surabaya Surabaya 60111,Indonesia Theodore B. Trafalis School of Industrial
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationSupport Vector Machines II. CAP 5610: Machine Learning Instructor: Guo-Jun QI
Support Vector Machines II CAP 5610: Machine Learning Instructor: Guo-Jun QI 1 Outline Linear SVM hard margin Linear SVM soft margin Non-linear SVM Application Linear Support Vector Machine An optimization
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationEfficient Kernel Machines Using the Improved Fast Gauss Transform
Efficient Kernel Machines Using the Improved Fast Gauss Transform Changjiang Yang, Ramani Duraiswami and Larry Davis Department of Computer Science University of Maryland College Park, MD 20742 {yangcj,ramani,lsd}@umiacs.umd.edu
More informationSupport Vector Machines
Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)
More informationMultisurface Proximal Support Vector Machine Classification via Generalized Eigenvalues
Multisurface Proximal Support Vector Machine Classification via Generalized Eigenvalues O. L. Mangasarian and E. W. Wild Presented by: Jun Fang Multisurface Proximal Support Vector Machine Classification
More informationComputational Learning Theory
Computational Learning Theory Pardis Noorzad Department of Computer Engineering and IT Amirkabir University of Technology Ordibehesht 1390 Introduction For the analysis of data structures and algorithms
More informationApproximate Kernel Methods
Lecture 3 Approximate Kernel Methods Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Machine Learning Summer School Tübingen, 207 Outline Motivating example Ridge regression
More informationProbabilistic Regression Using Basis Function Models
Probabilistic Regression Using Basis Function Models Gregory Z. Grudic Department of Computer Science University of Colorado, Boulder grudic@cs.colorado.edu Abstract Our goal is to accurately estimate
More information1162 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 5, SEPTEMBER The Evidence Framework Applied to Support Vector Machines
1162 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 5, SEPTEMBER 2000 Brief Papers The Evidence Framework Applied to Support Vector Machines James Tin-Yau Kwok Abstract In this paper, we show that
More informationThe Decision List Machine
The Decision List Machine Marina Sokolova SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 sokolova@site.uottawa.ca Nathalie Japkowicz SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 nat@site.uottawa.ca
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationNonlinear Support Vector Machines through Iterative Majorization and I-Splines
Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationExact 1-Norm Support Vector Machines via Unconstrained Convex Differentiable Minimization
Journal of Machine Learning Research 7 (2006) - Submitted 8/05; Published 6/06 Exact 1-Norm Support Vector Machines via Unconstrained Convex Differentiable Minimization Olvi L. Mangasarian Computer Sciences
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationLogistic Regression and Boosting for Labeled Bags of Instances
Logistic Regression and Boosting for Labeled Bags of Instances Xin Xu and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {xx5, eibe}@cs.waikato.ac.nz Abstract. In
More informationCS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang
CS 231A Section 1: Linear Algebra & Probability Review Kevin Tang Kevin Tang Section 1-1 9/30/2011 Topics Support Vector Machines Boosting Viola Jones face detector Linear Algebra Review Notation Operations
More informationSupport Vector Regression with Automatic Accuracy Control B. Scholkopf y, P. Bartlett, A. Smola y,r.williamson FEIT/RSISE, Australian National University, Canberra, Australia y GMD FIRST, Rudower Chaussee
More informationBeyond the Point Cloud: From Transductive to Semi-Supervised Learning
Beyond the Point Cloud: From Transductive to Semi-Supervised Learning Vikas Sindhwani, Partha Niyogi, Mikhail Belkin Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of
More informationSolving Classification Problems By Knowledge Sets
Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose
More informationDoes Modeling Lead to More Accurate Classification?
Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang
More informationSupport Vector Machines with Example Dependent Costs
Support Vector Machines with Example Dependent Costs Ulf Brefeld, Peter Geibel, and Fritz Wysotzki TU Berlin, Fak. IV, ISTI, AI Group, Sekr. FR5-8 Franklinstr. 8/9, D-0587 Berlin, Germany Email {geibel
More informationClassifier Complexity and Support Vector Classifiers
Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl
More informationSparse Support Vector Machines by Kernel Discriminant Analysis
Sparse Support Vector Machines by Kernel Discriminant Analysis Kazuki Iwamura and Shigeo Abe Kobe University - Graduate School of Engineering Kobe, Japan Abstract. We discuss sparse support vector machines
More informationPrivacy-Preserving Linear Programming
Optimization Letters,, 1 7 (2010) c 2010 Privacy-Preserving Linear Programming O. L. MANGASARIAN olvi@cs.wisc.edu Computer Sciences Department University of Wisconsin Madison, WI 53706 Department of Mathematics
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationProximal Knowledge-Based Classification
Proximal Knowledge-Based Classification O. L. Mangasarian E. W. Wild G. M. Fung June 26, 2008 Abstract Prior knowledge over general nonlinear sets is incorporated into proximal nonlinear kernel classification
More informationA Study of Relative Efficiency and Robustness of Classification Methods
A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics
More informationClassification and Support Vector Machine
Classification and Support Vector Machine Yiyong Feng and Daniel P. Palomar The Hong Kong University of Science and Technology (HKUST) ELEC 5470 - Convex Optimization Fall 2017-18, HKUST, Hong Kong Outline
More informationApproximate Kernel PCA with Random Features
Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,
More informationThe Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors
R u t c o r Research R e p o r t The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors Martin Anthony a Joel Ratsaby b RRR 17-2011, October 2011 RUTCOR Rutgers Center for Operations
More informationEquivalence of Minimal l 0 and l p Norm Solutions of Linear Equalities, Inequalities and Linear Programs for Sufficiently Small p
Equivalence of Minimal l 0 and l p Norm Solutions of Linear Equalities, Inequalities and Linear Programs for Sufficiently Small p G. M. FUNG glenn.fung@siemens.com R&D Clinical Systems Siemens Medical
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More informationAn Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM
An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM Wei Chu Chong Jin Ong chuwei@gatsby.ucl.ac.uk mpeongcj@nus.edu.sg S. Sathiya Keerthi mpessk@nus.edu.sg Control Division, Department
More informationbelow, kernel PCA Eigenvectors, and linear combinations thereof. For the cases where the pre-image does exist, we can provide a means of constructing
Kernel PCA Pattern Reconstruction via Approximate Pre-Images Bernhard Scholkopf, Sebastian Mika, Alex Smola, Gunnar Ratsch, & Klaus-Robert Muller GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany fbs,
More informationSupport Vector Machines on General Confidence Functions
Support Vector Machines on General Confidence Functions Yuhong Guo University of Alberta yuhong@cs.ualberta.ca Dale Schuurmans University of Alberta dale@cs.ualberta.ca Abstract We present a generalized
More informationConvex Optimization in Classification Problems
New Trends in Optimization and Computational Algorithms December 9 13, 2001 Convex Optimization in Classification Problems Laurent El Ghaoui Department of EECS, UC Berkeley elghaoui@eecs.berkeley.edu 1
More informationRelevance Vector Machines for Earthquake Response Spectra
2012 2011 American American Transactions Transactions on on Engineering Engineering & Applied Applied Sciences Sciences. American Transactions on Engineering & Applied Sciences http://tuengr.com/ateas
More informationKernel Machines and Additive Fuzzy Systems: Classification and Function Approximation
Kernel Machines and Additive Fuzzy Systems: Classification and Function Approximation Yixin Chen * and James Z. Wang * Dept. of Computer Science and Engineering, School of Information Sciences and Technology
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationPolyhedral Computation. Linear Classifiers & the SVM
Polyhedral Computation Linear Classifiers & the SVM mcuturi@i.kyoto-u.ac.jp Nov 26 2010 1 Statistical Inference Statistical: useful to study random systems... Mutations, environmental changes etc. life
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationMinOver Revisited for Incremental Support-Vector-Classification
MinOver Revisited for Incremental Support-Vector-Classification Thomas Martinetz Institute for Neuro- and Bioinformatics University of Lübeck D-23538 Lübeck, Germany martinetz@informatik.uni-luebeck.de
More informationSupport Vector Machines
Support Vector Machines Support vector machines (SVMs) are one of the central concepts in all of machine learning. They are simply a combination of two ideas: linear classification via maximum (or optimal
More informationMINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava
MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,
More information