Support Vector Machines


Support Vector Machines: Statistical Inference with Reproducing Kernel Hilbert Space
Kenji Fukumizu
Institute of Statistical Mathematics, ROIS / Department of Statistical Science, Graduate University for Advanced Studies
May 23, 2008 / Statistical Learning Theory II



Convexity I

For details on convex optimization, see [BV04].

Convex set: A set $C$ in a vector space is convex if for every $x, y \in C$ and $t \in [0, 1]$,
$$tx + (1-t)y \in C.$$

Convex function: Let $C$ be a convex set. $f : C \to \mathbb{R}$ is called a convex function if for every $x, y \in C$ and $t \in [0, 1]$,
$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y).$$

Concave function: Let $C$ be a convex set. $f : C \to \mathbb{R}$ is called a concave function if for every $x, y \in C$ and $t \in [0, 1]$,
$$f(tx + (1-t)y) \ge t f(x) + (1-t) f(y).$$

Convexity II (figure slide; no text content)

Convexity III

Fact: If $f : C \to \mathbb{R}$ is a convex function, the sublevel set $\{x \in C \mid f(x) \le \alpha\}$ is a convex set for every $\alpha \in \mathbb{R}$.

If $f_t : C \to \mathbb{R}$ ($t \in T$) are convex, then
$$f(x) = \sup_{t \in T} f_t(x)$$
is also convex.

Convex optimization I

A general form of convex optimization: $f(x)$, $h_i(x)$ ($1 \le i \le \ell$) are convex functions $D \to \mathbb{R}$ on $D \subset \mathbb{R}^n$; $a_j \in \mathbb{R}^n$, $b_j \in \mathbb{R}$ ($1 \le j \le m$).
$$\min_{x \in D} f(x) \quad \text{subject to} \quad \begin{cases} h_i(x) \le 0 & (1 \le i \le \ell), \\ a_j^T x + b_j = 0 & (1 \le j \le m). \end{cases}$$
$h_i$: inequality constraints; $r_j(x) = a_j^T x + b_j$: linear equality constraints.

Feasible set:
$$\mathcal{F} = \{x \in D \mid h_i(x) \le 0 \ (1 \le i \le \ell), \ r_j(x) = 0 \ (1 \le j \le m)\}.$$
The optimization problem is called feasible if $\mathcal{F} \neq \emptyset$.

In convex optimization there are no non-global local minima (every local minimum is a global minimum), so it is possible to find a minimizer numerically.

Convex optimization II

Fact 1. The feasible set is a convex set.
Fact 2. The set of minimizers
$$X_{\mathrm{opt}} = \{ x \in \mathcal{F} \mid f(x) = \inf\{f(y) \mid y \in \mathcal{F}\} \}$$
is convex.

Proof. The feasible set is an intersection of convex sets (sublevel sets of convex functions and affine sets), and the intersection of convex sets is convex; this proves Fact 1. Let $p^* = \inf_{x \in \mathcal{F}} f(x)$. Then
$$X_{\mathrm{opt}} = \{x \in D \mid f(x) \le p^*\} \cap \mathcal{F}.$$
Both sets on the right-hand side are convex, which proves Fact 2.

Examples

Linear program (LP):
$$\min \ c^T x \quad \text{subject to} \quad \begin{cases} Ax = b, \\ Gx \preceq h. \end{cases}$$
The objective function and the equality and inequality constraints are all linear.

Quadratic program (QP):
$$\min \ \tfrac{1}{2} x^T P x + q^T x + r \quad \text{subject to} \quad \begin{cases} Ax = b, \\ Gx \preceq h, \end{cases}$$
where $P$ is a positive semidefinite matrix. The objective function is quadratic, while the equality and inequality constraints are linear.

(Here $Gx \preceq h$ denotes $g_j^T x \le h_j$ for all $j$, where $G = (g_1, \ldots, g_m)^T$.)
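To make the QP class concrete, here is a minimal sketch (not part of the original slides) that solves a problem of exactly this form with the cvxpy modeling library; the matrices P, q, A, b, G, h are made-up example data.

```python
import numpy as np
import cvxpy as cp

# A small quadratic program: min 1/2 x^T P x + q^T x  s.t.  Ax = b, Gx <= h.
# P, q, A, b, G, h are arbitrary example data (not from the lecture).
P = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite
q = np.array([-1.0, 1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
G = -np.eye(2)                           # encodes x >= 0 as -x <= 0
h = np.zeros(2)

x = cp.Variable(2)
objective = cp.Minimize(0.5 * cp.quad_form(x, P) + q @ x)
constraints = [A @ x == b, G @ x <= h]
prob = cp.Problem(objective, constraints)
prob.solve()

print("optimal value p* =", prob.value)
print("minimizer x* =", x.value)
print("dual variables:", constraints[0].dual_value, constraints[1].dual_value)
```

The `dual_value` attributes returned by the solver are the Lagrange multipliers of the corresponding constraints, which connects directly to the duality discussion that follows.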


Lagrange duality I

Consider an optimization problem (which may not be convex):
$$\text{(primal)} \quad \min_{x \in D} f(x) \quad \text{subject to} \quad \begin{cases} h_i(x) \le 0 & (1 \le i \le \ell), \\ r_j(x) = 0 & (1 \le j \le m). \end{cases}$$

Lagrangian:
$$L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{\ell} \lambda_i h_i(x) + \sum_{j=1}^{m} \nu_j r_j(x).$$

Lagrange dual function: $g : \mathbb{R}^{\ell} \times \mathbb{R}^{m} \to [-\infty, \infty)$, where
$$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu).$$

$\lambda_i$ and $\nu_j$ are called Lagrange multipliers. $g$ is a concave function (a pointwise infimum of functions that are affine in $(\lambda, \nu)$).

Lagrange duality II

Dual problem:
$$\text{(dual)} \quad \max \ g(\lambda, \nu) \quad \text{subject to} \quad \lambda \ge 0.$$

The dual and primal problems are closely connected.

Theorem (weak duality). Let
$$p^* = \inf\{f(x) \mid h_i(x) \le 0 \ (1 \le i \le \ell), \ r_j(x) = 0 \ (1 \le j \le m)\}$$
and
$$d^* = \sup\{g(\lambda, \nu) \mid \lambda \ge 0, \ \nu \in \mathbb{R}^m\}.$$
Then
$$d^* \le p^*.$$

Weak duality does not require convexity of the primal optimization problem.

Lagrange duality III

Proof. Let $\lambda \ge 0$ and $\nu \in \mathbb{R}^m$. For any feasible point $x$,
$$\sum_{i=1}^{\ell} \lambda_i h_i(x) + \sum_{j=1}^{m} \nu_j r_j(x) \le 0$$
(the first sum is non-positive and the second is zero). This means that for any feasible point $x$,
$$L(x, \lambda, \nu) \le f(x).$$
Taking the infimum over feasible points,
$$\inf_{x:\,\text{feasible}} L(x, \lambda, \nu) \le p^*.$$
Thus
$$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) \le \inf_{x:\,\text{feasible}} L(x, \lambda, \nu) \le p^*$$
for any $\lambda \ge 0$, $\nu \in \mathbb{R}^m$.

Strong duality

Some conditions are needed to obtain strong duality, $d^* = p^*$.

Convexity of the problem: $f$ and $h_i$ are convex, and the $r_j$ are linear.

Slater's condition: there is $x \in D$ such that
$$h_i(x) < 0 \ (1 \le i \le \ell), \qquad r_j(x) = 0 \ (1 \le j \le m).$$

Theorem (strong duality). Suppose the primal problem is convex and Slater's condition holds. Then there exist $\lambda^* \ge 0$ and $\nu^* \in \mathbb{R}^m$ such that
$$g(\lambda^*, \nu^*) = d^* = p^*.$$

The proof is omitted (see [BV04], Sec. 5.3.2). There are also other conditions that guarantee strong duality.
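A small worked illustration (not from the original slides): consider $\min_{x \in \mathbb{R}} x^2$ subject to $h(x) = 1 - x \le 0$. Then
$$L(x, \lambda) = x^2 + \lambda(1 - x), \qquad g(\lambda) = \inf_{x \in \mathbb{R}} L(x, \lambda) = \lambda - \frac{\lambda^2}{4} \quad (\text{minimized at } x = \lambda/2),$$
$$d^* = \sup_{\lambda \ge 0} g(\lambda) = 1 \ \ (\text{at } \lambda^* = 2), \qquad p^* = \inf\{x^2 \mid x \ge 1\} = 1 \ \ (\text{at } x^* = 1).$$
Weak duality $d^* \le p^*$ holds with equality here: the problem is convex and $x = 2$ satisfies $h(x) < 0$, so Slater's condition guarantees strong duality.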

Geometric interpretation of duality

Let
$$G = \{(h(x), r(x), f(x)) \in \mathbb{R}^{\ell} \times \mathbb{R}^{m} \times \mathbb{R} \mid x \in \mathbb{R}^n\}.$$
Then
$$p^* = \inf\{t \mid (u, 0, t) \in G, \ u \le 0\}.$$
For $\lambda \ge 0$ (neglecting $\nu$),
$$g(\lambda) = \inf\{(\lambda, 1)^T (u, t) \mid (u, t) \in G\}.$$
The set $\{(u, t) \mid (\lambda, 1)^T (u, t) = g(\lambda)\}$ is a non-vertical hyperplane supporting $G$, which intersects the $t$-axis at $g(\lambda)$. (Figure omitted.)

From this picture, weak duality $d^* = \sup_{\lambda \ge 0} g(\lambda) \le p^*$ is easy to see. Strong duality $d^* = p^*$ holds under some conditions.

Complementary slackness I

Consider the (not necessarily convex) optimization problem:
$$\min f(x) \quad \text{subject to} \quad \begin{cases} h_i(x) \le 0 & (1 \le i \le \ell), \\ r_j(x) = 0 & (1 \le j \le m). \end{cases}$$

Assumption: the optima of the primal and dual problems are attained by $x^*$ and $(\lambda^*, \nu^*)$, and they satisfy strong duality:
$$g(\lambda^*, \nu^*) = f(x^*).$$

Observation:
$$\begin{aligned}
f(x^*) = g(\lambda^*, \nu^*) &= \inf_{x \in D} L(x, \lambda^*, \nu^*) && \text{[definition]} \\
&\le L(x^*, \lambda^*, \nu^*) \\
&= f(x^*) + \sum_{i=1}^{\ell} \lambda_i^* h_i(x^*) + \sum_{j=1}^{m} \nu_j^* r_j(x^*) \\
&\le f(x^*) && \text{[2nd term $\le 0$, 3rd term $= 0$]}
\end{aligned}$$

The two inequalities are in fact equalities.

Complementary slackness II

Consequence 1: $x^*$ minimizes $L(x, \lambda^*, \nu^*)$ over $x \in D$.

Consequence 2: $\lambda_i^* h_i(x^*) = 0$ for all $i$. This is called complementary slackness. Equivalently,
$$\lambda_i^* > 0 \ \Rightarrow \ h_i(x^*) = 0, \qquad \text{or} \qquad h_i(x^*) < 0 \ \Rightarrow \ \lambda_i^* = 0.$$

Solving the primal problem via the dual

If the dual problem is easier to solve, we can sometimes recover the primal solution from the dual one.

Assumption: strong duality holds (e.g., by Slater's condition), and we have the dual solution $(\lambda^*, \nu^*)$.

The primal solution $x^*$, if it exists, must minimize $L(x, \lambda^*, \nu^*)$ over $x \in D$ (previous slide). Therefore, if a solution of
$$\min_{x \in D} \ f(x) + \sum_{i=1}^{\ell} \lambda_i^* h_i(x) + \sum_{j=1}^{m} \nu_j^* r_j(x)$$
is obtained and it is primal feasible, then it must be the primal solution.
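Continuing the illustrative example from the strong duality slide (not from the original slides): given the dual solution $\lambda^* = 2$, minimizing the Lagrangian recovers the primal solution,
$$\min_{x \in \mathbb{R}} L(x, \lambda^*) = x^2 + 2(1 - x) \quad \Rightarrow \quad x = 1,$$
which is primal feasible ($1 - x \le 0$), so $x^* = 1$, exactly as the recipe above prescribes.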

KKT condition I

The KKT conditions give useful relations between the primal and dual solutions.

Consider the convex optimization problem. Assume $D$ is open and $f(x)$, $h_i(x)$ are differentiable.
$$\min f(x) \quad \text{subject to} \quad \begin{cases} h_i(x) \le 0 & (1 \le i \le \ell), \\ r_j(x) = 0 & (1 \le j \le m). \end{cases}$$

Let $x^*$ and $(\lambda^*, \nu^*)$ be any optimal points of the primal and dual problems, and assume strong duality holds. Since $x^*$ minimizes $L(x, \lambda^*, \nu^*)$,
$$\nabla f(x^*) + \sum_{i=1}^{\ell} \lambda_i^* \nabla h_i(x^*) + \sum_{j=1}^{m} \nu_j^* \nabla r_j(x^*) = 0.$$

KKT condition II

The following are necessary conditions.

Karush-Kuhn-Tucker (KKT) conditions:
$$\begin{aligned}
& h_i(x^*) \le 0 \quad (i = 1, \ldots, \ell) && \text{[primal constraints]} \\
& r_j(x^*) = 0 \quad (j = 1, \ldots, m) && \text{[primal constraints]} \\
& \lambda_i^* \ge 0 \quad (i = 1, \ldots, \ell) && \text{[dual constraints]} \\
& \lambda_i^* h_i(x^*) = 0 \quad (i = 1, \ldots, \ell) && \text{[complementary slackness]} \\
& \nabla f(x^*) + \sum_{i=1}^{\ell} \lambda_i^* \nabla h_i(x^*) + \sum_{j=1}^{m} \nu_j^* \nabla r_j(x^*) = 0.
\end{aligned}$$

Theorem (KKT condition). For a convex optimization problem with differentiable functions, $x^*$ and $(\lambda^*, \nu^*)$ are the primal-dual solutions with strong duality if and only if they satisfy the KKT conditions.

KKT condition III

Proof. $x^*$ is primal feasible by the first two conditions. Since $\lambda_i^* \ge 0$, $L(x, \lambda^*, \nu^*)$ is convex (and differentiable) in $x$. The last condition, $\nabla_x L(x^*, \lambda^*, \nu^*) = 0$, implies that $x^*$ is a minimizer. It follows that
$$\begin{aligned}
g(\lambda^*, \nu^*) &= \inf_{x \in D} L(x, \lambda^*, \nu^*) && \text{[by definition]} \\
&= L(x^*, \lambda^*, \nu^*) && [x^*\text{: minimizer}] \\
&= f(x^*) + \sum_{i=1}^{\ell} \lambda_i^* h_i(x^*) + \sum_{j=1}^{m} \nu_j^* r_j(x^*) \\
&= f(x^*) && \text{[complementary slackness and } r_j(x^*) = 0\text{]}.
\end{aligned}$$
Hence strong duality holds, and $x^*$ and $(\lambda^*, \nu^*)$ must be the optimizers.

Example

Quadratic minimization under equality constraints:
$$\min \ \tfrac{1}{2} x^T P x + q^T x + r \quad \text{subject to} \quad Ax = b,$$
where $P$ is (strictly) positive definite.

KKT conditions:
$$Ax^* = b \quad \text{[primal constraint]}, \qquad \nabla_x L(x^*, \nu^*) = P x^* + q + A^T \nu^* = 0.$$

The solution is given by the linear system
$$\begin{pmatrix} P & A^T \\ A & O \end{pmatrix} \begin{pmatrix} x^* \\ \nu^* \end{pmatrix} = \begin{pmatrix} -q \\ b \end{pmatrix}.$$
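For illustration only (not part of the original slides), the sketch below forms and solves this KKT linear system with NumPy; the matrices P, q, A, b are made-up example data.

```python
import numpy as np

# Example data (illustrative only): P strictly positive definite, one equality constraint.
P = np.array([[3.0, 0.5], [0.5, 2.0]])
q = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

n, m = P.shape[0], A.shape[0]
# KKT system:  [P  A^T] [x ]   [-q]
#              [A   O ] [nu] = [ b]
KKT = np.block([[P, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(KKT, rhs)
x_star, nu_star = sol[:n], sol[n:]

print("x* =", x_star)               # primal solution
print("nu* =", nu_star)             # Lagrange multiplier of Ax = b
print("Ax* - b =", A @ x_star - b)  # should be ~0
```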


Lagrange dual of SVM I

The QP for the SVM can be solved in the primal form, but the dual form is easier.

SVM: primal problem
$$\min_{w, b, \xi} \ \frac{1}{2} \sum_{i,j=1}^{N} w_i w_j k(X_i, X_j) + C \sum_{i=1}^{N} \xi_i, \quad \text{subj. to} \quad \begin{cases} Y_i \bigl( \sum_{j=1}^{N} k(X_i, X_j) w_j + b \bigr) \ge 1 - \xi_i, \\ \xi_i \ge 0. \end{cases}$$

Lagrange function of the SVM:
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2} \sum_{i,j=1}^{N} w_i w_j k(X_i, X_j) + C \sum_{i=1}^{N} \xi_i + \sum_{i=1}^{N} \alpha_i \bigl(1 - Y_i f(X_i) - \xi_i\bigr) + \sum_{i=1}^{N} \beta_i (-\xi_i),$$
where $f(X_i) = \sum_{j=1}^{N} w_j k(X_i, X_j) + b$.

Lagrange dual of SVM II

SVM dual problem:
$$\max_{\alpha} \ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j Y_i Y_j K_{ij} \quad \text{subj. to} \quad \begin{cases} 0 \le \alpha_i \le C, \\ \sum_{i=1}^{N} \alpha_i Y_i = 0. \end{cases}$$
Notation: $K_{ij} = k(X_i, X_j)$.

Derivation. Lagrange dual function:
$$g(\alpha, \beta) = \min_{w, b, \xi} L(w, b, \xi, \alpha, \beta).$$
The minimizer $(w^*, b^*, \xi^*)$ is characterized by
$$\begin{aligned}
w &: \ \sum_{j=1}^{N} K_{ij} w_j^* = \sum_{j=1}^{N} \alpha_j Y_j K_{ij} \quad (\forall i), \\
b &: \ \sum_{j=1}^{N} \alpha_j Y_j = 0, \\
\xi &: \ C - \alpha_i - \beta_i = 0 \quad (\forall i).
\end{aligned}$$

Lagrange dual of SVM III

From these relations,
$$\frac{1}{2} \sum_{i,j=1}^{N} w_i^* w_j^* K_{ij} = \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j Y_i Y_j K_{ij}, \qquad \sum_{i=1}^{N} (C - \alpha_i - \beta_i)\, \xi_i^* = 0,$$
$$\sum_{i=1}^{N} \alpha_i \Bigl(1 - Y_i \bigl( \textstyle\sum_j K_{ij} w_j^* + b^* \bigr)\Bigr) = \sum_{i=1}^{N} \alpha_i - \sum_{i,j=1}^{N} \alpha_i \alpha_j Y_i Y_j K_{ij}.$$
Thus
$$g(\alpha, \beta) = L(w^*, b^*, \xi^*, \alpha, \beta) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j Y_i Y_j K_{ij}.$$
From $\alpha_i \ge 0$, $\beta_i \ge 0$, and $C - \alpha_i - \beta_i = 0$, the constraint on $\alpha_i$ is
$$0 \le \alpha_i \le C \quad (\forall i).$$
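As an illustration of the dual problem (not part of the original slides), the sketch below solves the SVM dual with cvxpy for a linear kernel on small synthetic data; all names and data here are made up. For the linear kernel the quadratic term equals $\|X^T (Y \odot \alpha)\|^2$, which keeps the problem in a form cvxpy accepts directly.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)

# Tiny synthetic 2-class data set (illustrative only), linear kernel K_ij = X_i . X_j.
N, C = 40, 1.0
X = np.vstack([rng.normal(-1.0, 0.6, (N // 2, 2)), rng.normal(+1.0, 0.6, (N // 2, 2))])
y = np.concatenate([-np.ones(N // 2), np.ones(N // 2)])

# Dual problem: max  sum(a) - 1/2 * sum_ij a_i a_j y_i y_j K_ij
#               s.t. 0 <= a_i <= C,  sum_i a_i y_i = 0.
a = cp.Variable(N)
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, a)))
constraints = [a >= 0, a <= C, y @ a == 0]
cp.Problem(objective, constraints).solve()

alpha = a.value
sv = alpha > 1e-6                    # support vectors: alpha_i > 0
w = X.T @ (y * alpha)                # primal weight vector (linear kernel only)
print("number of support vectors:", sv.sum())
print("w =", w)
```

The intercept $b^*$ is recovered from the support vectors as described on the "How to solve b" slide below.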

KKT conditions of SVM

KKT conditions:
$$\begin{aligned}
&(1)\ \ 1 - Y_i f^*(X_i) - \xi_i^* \le 0 \quad (\forall i), \\
&(2)\ \ \xi_i^* \ge 0 \quad (\forall i), \\
&(3)\ \ \alpha_i^* \ge 0 \quad (\forall i), \\
&(4)\ \ \beta_i^* \ge 0 \quad (\forall i), \\
&(5)\ \ \alpha_i^* \bigl(1 - Y_i f^*(X_i) - \xi_i^*\bigr) = 0 \quad (\forall i), \\
&(6)\ \ \beta_i^* \xi_i^* = 0 \quad (\forall i), \\
&(7)\ \ w:\ \sum_{j=1}^{N} K_{ij} w_j^* = \sum_{j=1}^{N} \alpha_j^* Y_j K_{ij} \ (\forall i), \qquad
       b:\ \sum_{j=1}^{N} \alpha_j^* Y_j = 0, \qquad
       \xi:\ C - \alpha_i^* - \beta_i^* = 0 \ (\forall i).
\end{aligned}$$

Support vectors I

Complementary slackness:
$$\alpha_i^* \bigl(1 - Y_i f^*(X_i) - \xi_i^*\bigr) = 0 \quad (\forall i), \qquad (C - \alpha_i^*)\, \xi_i^* = 0 \quad (\forall i).$$

If $\alpha_i^* = 0$, then $\xi_i^* = 0$ and $Y_i f^*(X_i) \ge 1$. [well separated]

Support vectors:
- If $0 < \alpha_i^* < C$, then $\xi_i^* = 0$ and $Y_i f^*(X_i) = 1$.
- If $\alpha_i^* = C$, then $Y_i f^*(X_i) \le 1$.

Support vectors II

Sparse representation: the optimal classifier is expressed using only the support vectors:
$$f^*(x) = \sum_{i:\ \text{support vector}} \alpha_i^* Y_i k(x, X_i) + b^*.$$

How to solve b

The optimal value of $b$ is given by complementary slackness. For any $i$ with $0 < \alpha_i^* < C$,
$$Y_i \Bigl( \sum_j k(X_i, X_j) Y_j \alpha_j^* + b^* \Bigr) = 1.$$
Use this relation for any such $i$, or take the average over all such $i$.
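Putting the last three slides together, a minimal sketch (again not from the original slides; it assumes at least one margin support vector and reuses the `alpha` from the dual example above) of recovering $b^*$ and evaluating the sparse decision function might look as follows.

```python
import numpy as np

def linear_kernel(A, B):
    """k(x, x') = x . x' for rows of A and B (illustrative choice of kernel)."""
    return A @ B.T

def recover_b_and_predict(X, y, alpha, C, Xnew, kernel=linear_kernel, tol=1e-6):
    """Compute b* from margin support vectors (0 < alpha_i < C) and evaluate
    f(x) = sum_i alpha_i y_i k(x, X_i) + b on new points.
    Simplified sketch; assumes at least one margin support vector exists."""
    K = kernel(X, X)
    margin_sv = (alpha > tol) & (alpha < C - tol)
    # For margin support vectors: y_i (sum_j K_ij y_j alpha_j + b) = 1
    #   =>  b = y_i - sum_j K_ij y_j alpha_j ;  average over all such i.
    b = np.mean(y[margin_sv] - K[margin_sv] @ (y * alpha))
    sv = alpha > tol
    f = kernel(Xnew, X[sv]) @ (y[sv] * alpha[sv]) + b   # only support vectors contribute
    return b, np.sign(f)
```

With the `alpha` from the dual example above, `recover_b_and_predict(X, y, alpha, C, X)` would return the intercept and the predicted labels on the training points.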


Computational problem in solving SVM

The dual QP problem of the SVM has $N$ variables, where $N$ is the sample size. If $N$ is very large, say $N = 100{,}000$, the optimization is very hard.

Some approaches have been proposed for optimizing subsets of the variables sequentially:
- Chunking [Vap82]
- Osuna's method [OFG]
- Sequential minimal optimization (SMO) [Pla99]
- SVMlight (http://svmlight.joachims.org/)

Sequential minimal optimization (SMO) I

Solve small QP problems sequentially for a pair of variables $(\alpha_i, \alpha_j)$.

How to choose the pair? Intuition from the KKT conditions is used. After removing $w$, $\xi$, and $\beta$, the KKT conditions of the SVM are equivalent to
$$0 \le \alpha_i^* \le C, \qquad \sum_{i=1}^{N} Y_i \alpha_i^* = 0,$$
$$(*) \quad \begin{cases} \alpha_i^* = 0 & \Rightarrow \ Y_i f^*(X_i) \ge 1, \\ 0 < \alpha_i^* < C & \Rightarrow \ Y_i f^*(X_i) = 1, \\ \alpha_i^* = C & \Rightarrow \ Y_i f^*(X_i) \le 1 \end{cases}$$
(shown later). The conditions $(*)$ can be checked for each data point. Choose $(i, j)$ such that at least one of them breaks the KKT conditions.
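For illustration (not from the slides), the per-point check of $(*)$ might be coded as follows; `f_vals` is assumed to hold the current decision values $f(X_i)$, and the tolerance handling is deliberately simplified.

```python
import numpy as np

def kkt_violations(alpha, y, f_vals, C, tol=1e-3):
    """Return a boolean mask of points violating the SVM KKT conditions (*).
    alpha: current dual variables, y: labels in {-1, +1},
    f_vals: current decision values f(X_i). Simplified illustration only."""
    yf = y * f_vals
    lower = (alpha < tol) & (yf < 1 - tol)             # alpha_i = 0 requires y_i f(X_i) >= 1
    upper = (alpha > C - tol) & (yf > 1 + tol)         # alpha_i = C requires y_i f(X_i) <= 1
    middle = (alpha >= tol) & (alpha <= C - tol) & (np.abs(yf - 1) > tol)  # else y_i f(X_i) = 1
    return lower | upper | middle
```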

Sequential minimal optimization (SMO) II

The QP problem for $(\alpha_i, \alpha_j)$ is analytically solvable! For simplicity, assume $(i, j) = (1, 2)$.

Constraint on $\alpha_1$ and $\alpha_2$:
$$\alpha_1 + s_{12} \alpha_2 = \gamma, \qquad 0 \le \alpha_1, \alpha_2 \le C,$$
where $s_{12} = Y_1 Y_2$ and $\gamma = \pm \sum_{l \ge 3} Y_l \alpha_l$ is constant.

Objective function:
$$\alpha_1 + \alpha_2 - \frac{1}{2} \alpha_1^2 K_{11} - \frac{1}{2} \alpha_2^2 K_{22} - s_{12} \alpha_1 \alpha_2 K_{12} - Y_1 \alpha_1 \sum_{j \ge 3} Y_j \alpha_j K_{1j} - Y_2 \alpha_2 \sum_{j \ge 3} Y_j \alpha_j K_{2j} + \text{const}.$$

This is a quadratic optimization of one variable on an interval, and it can be solved directly.
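The closed-form update for one pair follows Platt's SMO [Pla99]; the sketch below is a simplified illustration (error caching, the choice between b1 and b2, and degenerate cases such as eta <= 0 are handled only minimally), not a faithful reimplementation.

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, C, b):
    """One SMO step on the pair (alpha_i, alpha_j); returns updated (alpha, b).
    Simplified sketch: K is the kernel Gram matrix; degenerate pairs are skipped."""
    f = K @ (alpha * y) + b                  # current decision values f(X_k)
    E_i, E_j = f[i] - y[i], f[j] - y[j]      # prediction errors
    s = y[i] * y[j]

    # Feasible interval [L, H] for the new alpha_j under the equality constraint.
    if s < 0:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]  # curvature along the constraint line
    if L >= H or eta <= 0:
        return alpha, b                       # skip degenerate pairs in this sketch

    alpha_j_new = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    alpha_i_new = alpha[i] + s * (alpha[j] - alpha_j_new)

    # Keep b consistent with the updated pair (here: average of the two candidates).
    b1 = b - E_i - y[i] * (alpha_i_new - alpha[i]) * K[i, i] - y[j] * (alpha_j_new - alpha[j]) * K[i, j]
    b2 = b - E_j - y[i] * (alpha_i_new - alpha[i]) * K[i, j] - y[j] * (alpha_j_new - alpha[j]) * K[j, j]
    alpha = alpha.copy()
    alpha[i], alpha[j] = alpha_i_new, alpha_j_new
    return alpha, 0.5 * (b1 + b2)
```

Repeatedly applying this update to pairs flagged by `kkt_violations` until no violations remain is the basic loop of an SMO-style solver.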

KKT conditions revisited I

$\beta$ and $w$ can be removed by
$$\xi:\ \beta_i^* = C - \alpha_i^* \quad (\forall i), \qquad w:\ \sum_{j=1}^{N} K_{ij} w_j^* = \sum_{j=1}^{N} \alpha_j^* Y_j K_{ij} \quad (\forall i).$$
From (4) and (6),
$$\alpha_i^* \le C, \qquad \xi_i^* (C - \alpha_i^*) = 0 \quad (\forall i).$$

The KKT conditions are then equivalent to
$$\begin{aligned}
&(a)\ \ 1 - Y_i f^*(X_i) - \xi_i^* \le 0 \quad (\forall i), \\
&(b)\ \ \xi_i^* \ge 0 \quad (\forall i), \\
&(c)\ \ 0 \le \alpha_i^* \le C \quad (\forall i), \\
&(d)\ \ \alpha_i^* \bigl(1 - Y_i f^*(X_i) - \xi_i^*\bigr) = 0 \quad (\forall i), \\
&(e)\ \ \xi_i^* (C - \alpha_i^*) = 0 \quad (\forall i), \\
&(f)\ \ \sum_{i=1}^{N} Y_i \alpha_i^* = 0.
\end{aligned}$$

KKT conditions revisited II

We can further remove $\xi$.

Case $\alpha_i^* = 0$: from (e), $\xi_i^* = 0$. Then, from (a), $Y_i f^*(X_i) \ge 1$.
Case $0 < \alpha_i^* < C$: from (e), $\xi_i^* = 0$. From (d), $Y_i f^*(X_i) = 1$.
Case $\alpha_i^* = C$: from (d), $\xi_i^* = 1 - Y_i f^*(X_i)$.

Note that in all cases (a) and (b) are satisfied. The KKT conditions are therefore equivalent to
$$0 \le \alpha_i^* \le C \quad (\forall i), \qquad \sum_{i=1}^{N} Y_i \alpha_i^* = 0,$$
$$\begin{cases} \alpha_i^* = 0 & \Rightarrow \ Y_i f^*(X_i) \ge 1 \quad (\xi_i^* = 0), \\ 0 < \alpha_i^* < C & \Rightarrow \ Y_i f^*(X_i) = 1 \quad (\xi_i^* = 0), \\ \alpha_i^* = C & \Rightarrow \ Y_i f^*(X_i) \le 1 \quad (\xi_i^* = 1 - Y_i f^*(X_i)). \end{cases}$$


Recent studies (not a complete list)

Solution in the primal:
- O. Chapelle [Cha07]
- T. Joachims, SVMperf [Joa06]
- S. Shalev-Shwartz et al. [SSSS07]

Online SVM:
- Tax and Laskov [TL03]
- LaSVM [BEWB05] (http://leon.bottou.org/projects/lasvm/)

Parallel computation:
- Cascade SVM [GCB+05]
- Zanni et al. [ZSZ06]

Others:
- Column generation technique for large-scale problems [DBS02]

References I

[BEWB05] Antoine Bordes, Seyda Ertekin, Jason Weston, and Léon Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579-1619, 2005.

[BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. http://www.stanford.edu/~boyd/cvxbook/

[Cha07] Olivier Chapelle. Training a support vector machine in the primal. Neural Computation, 19:1155-1178, 2007.

[DBS02] Ayhan Demiriz, Kristin P. Bennett, and John Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1-3):225-254, 2002.

References II

[GCB+05] Hans Peter Graf, Eric Cosatto, Léon Bottou, Igor Durdanovic, and Vladimir Vapnik. Parallel support vector machines: The Cascade SVM. In Lawrence Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems, volume 17. MIT Press, 2005.

[Joa06] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006.

[OFG] Edgar Osuna, Robert Freund, and Federico Girosi. An improved training algorithm for support vector machines. In Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing (IEEE NNSP 1997), pages 276-285.

References III

[Pla99] John Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Christopher Burges, and Alexander Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208. MIT Press, 1999.

[SSSS07] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the International Conference on Machine Learning, 2007.

[TL03] D. M. J. Tax and P. Laskov. Online SVM learning: from classification to data description and back. In Proceedings of the IEEE 13th Workshop on Neural Networks for Signal Processing (NNSP 2003), pages 499-508, 2003.

References IV

[Vap82] Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

[ZSZ06] Luca Zanni, Thomas Serafini, and Gaetano Zanghirati. Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research, 7:1467-1492, 2006.