MATH 829: Introduction to Data Mining and Analysis - Support vector machines and kernels

MATH 829: Introduction to Data Mining and Analysis. Support vector machines and kernels (1/12)
Dominique Guillot, Department of Mathematical Sciences, University of Delaware. March 14, 2016.

Separating sets: mapping the features (2/12)

We saw in the previous lecture how support vector machines provide a robust way of finding a separating hyperplane. What if the data is not separable? We can map the features into a high-dimensional space.
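
A standard example of such a mapping (not on the slide, added for illustration): in $\mathbb{R}^2$, the degree-2 monomial map
$$h(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in \mathbb{R}^3, \qquad \langle h(x), h(x') \rangle = (x_1 x_1' + x_2 x_2')^2 = \langle x, x' \rangle^2,$$
turns a circular class boundary in the plane into one that a hyperplane can separate in $\mathbb{R}^3$, while the inner product in the new space is still a simple function of the original inner product. This is exactly the structure the kernel trick below exploits.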

A brief intro to duality in optimization (3/12)

Consider the problem:
$$\min_{x \in D \subseteq \mathbb{R}^n} f_0(x) \quad \text{subject to} \quad f_i(x) \le 0, \ i = 1, \dots, m, \qquad h_i(x) = 0, \ i = 1, \dots, p.$$
Denote by $p^*$ the optimal value of the problem.

Lagrangian: $L : D \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$,
$$L(x, \lambda, \nu) := f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x).$$

Lagrange dual function: $g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$,
$$g(\lambda, \nu) := \inf_{x \in D} L(x, \lambda, \nu).$$

Claim: for every $\lambda \ge 0$, $g(\lambda, \nu) \le p^*$.
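
A short justification of the claim (standard argument, filled in here): if $x$ is feasible and $\lambda \ge 0$, then each $\lambda_i f_i(x) \le 0$ and each $h_i(x) = 0$, so
$$g(\lambda, \nu) \le L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \le f_0(x).$$
Taking the infimum over feasible $x$ on the right gives $g(\lambda, \nu) \le p^*$.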

A brief intro to duality in optimization (4/12)

Dual problem:
$$\max_{\lambda \in \mathbb{R}^m, \ \nu \in \mathbb{R}^p} g(\lambda, \nu) \quad \text{subject to} \quad \lambda \ge 0.$$
Denote by $d^*$ the optimal value of the dual problem. Clearly $d^* \le p^*$ (weak duality).

Strong duality: $d^* = p^*$. It does not hold in general, but it usually holds for convex problems (see e.g. Slater's constraint qualification).
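
A tiny worked example (my own, for illustration): minimize $f_0(x) = x^2$ over $x \in \mathbb{R}$ subject to $1 - x \le 0$, so $p^* = 1$ (attained at $x = 1$). Then
$$L(x, \lambda) = x^2 + \lambda(1 - x), \qquad g(\lambda) = \inf_{x \in \mathbb{R}} L(x, \lambda) = \lambda - \frac{\lambda^2}{4},$$
which is maximized over $\lambda \ge 0$ at $\lambda = 2$, giving $d^* = 1 = p^*$: strong duality holds, as expected for this convex problem.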

The kernel trick (5/12)

Recall that SVM solves:
$$\min_{\beta_0, \beta, \xi} \ \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(x_i^T \beta + \beta_0) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$

The associated Lagrangian is
$$L_P = \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[ y_i(x_i^T \beta + \beta_0) - (1 - \xi_i) \right] - \sum_{i=1}^n \mu_i \xi_i,$$
which we minimize w.r.t. $\beta$, $\beta_0$, $\xi$. Setting the respective derivatives to 0, we obtain:
$$\beta = \sum_{i=1}^n \alpha_i y_i x_i, \qquad 0 = \sum_{i=1}^n \alpha_i y_i, \qquad \alpha_i = C - \mu_i \quad (i = 1, \dots, n).$$
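
Spelling out the substitution announced on the next slide (standard algebra, written out here for completeness): plugging $\beta = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$ and $\alpha_i = C - \mu_i$ into $L_P$ gives
$$\frac{1}{2}\|\beta\|^2 = \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j, \qquad \sum_{i=1}^n \alpha_i y_i x_i^T \beta = \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j,$$
the $\beta_0$ term vanishes because $\sum_i \alpha_i y_i = 0$, and the slack terms cancel since $C - \alpha_i - \mu_i = 0$. What remains is $L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$.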

The kernel trick (cont.) (6/12)

Substituting into $L_P$, we obtain the Lagrange (dual) objective function:
$$L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j.$$

The function $L_D$ provides a lower bound on the original objective function at any feasible point (weak duality). The solution of the original SVM problem can be obtained by maximizing $L_D$ under the previous constraints (strong duality).

Now suppose $h : \mathbb{R}^p \to \mathbb{R}^m$ transforms our features to $h(x_i) = (h_1(x_i), \dots, h_m(x_i)) \in \mathbb{R}^m$. The Lagrange dual function becomes:
$$L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \, h(x_i)^T h(x_j).$$

Important observation: $L_D$ only depends on the inner products $\langle h(x_i), h(x_j) \rangle$.
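
A minimal numpy sketch of this observation (illustrative only; the helper names and data are mine, not from the lecture): the dual objective can be evaluated from the Gram matrix of inner products alone, without ever forming $h(x_i)$ explicitly once that matrix is available.

    import numpy as np

    def dual_objective(alpha, y, K):
        # L_D = sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j K_ij
        v = alpha * y
        return alpha.sum() - 0.5 * v @ K @ v

    # Illustrative data: n points in R^p with labels in {-1, +1}
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))
    y = np.sign(X[:, 0] + X[:, 1])
    y[y == 0] = 1.0

    # Gram matrix of a feature map h (here h = identity, i.e. the linear kernel)
    K = X @ X.T

    alpha = rng.uniform(0, 1, size=20)   # some candidate dual variables
    print(dual_objective(alpha, y, K))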

Positive definite kernels (7/12)

Important observation: $L_D$ only depends on $\langle h(x_i), h(x_j) \rangle$. In fact, we don't even need to specify $h$; we only need
$$K(x, x') = \langle h(x), h(x') \rangle.$$

Question: Given $K : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$, when can we guarantee that $K(x, x') = \langle h(x), h(x') \rangle$ for some function $h$?

Observation: Suppose $K$ has the desired form. Then, for $x_1, \dots, x_N \in \mathbb{R}^p$ and $v_i := h(x_i)$,
$$(K(x_i, x_j)) = (\langle h(x_i), h(x_j) \rangle) = (\langle v_i, v_j \rangle) = V^T V,$$
where $V$ is the matrix whose columns are $v_1, \dots, v_N$.

Conclusion: the matrix $(K(x_i, x_j))$ is positive semidefinite.
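
A quick numerical illustration of this conclusion (a sketch with a feature map of my own choosing, not part of the lecture): build $V$ from the mapped points and check that the eigenvalues of $V^T V$ are nonnegative.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(10, 2))               # x_1, ..., x_N in R^p (N = 10, p = 2)

    def h(x):
        # an arbitrary feature map, used only for illustration
        return np.array([x[0], x[1], x[0] * x[1], x[0]**2 + x[1]**2])

    V = np.column_stack([h(x) for x in X])     # columns are v_i = h(x_i)
    K = V.T @ V                                # K_ij = <h(x_i), h(x_j)>

    eigenvalues = np.linalg.eigvalsh(K)
    print(eigenvalues.min() >= -1e-10)         # True: the Gram matrix is psd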

Positive definite kernels (cont.) (8/12)

Necessary condition to have $K(x, x') = \langle h(x), h(x') \rangle$: $(K(x_i, x_j))_{i,j=1}^N$ is psd for any $x_1, \dots, x_N$ and any $N \ge 1$. Note also that $K(x, x') = K(x', x)$ whenever $K(x, x') = \langle h(x), h(x') \rangle$.

Definition: Let $X$ be a set. A symmetric kernel $K : X \times X \to \mathbb{R}$ is said to be a positive (semi)definite kernel if $(K(x_i, x_j))_{i,j=1}^N$ is positive (semi)definite for all $x_1, \dots, x_N \in X$ and all $N \ge 1$.

A reproducing kernel Hilbert space (RKHS) over a set $X$ is a Hilbert space $H$ of functions on $X$ such that for each $x \in X$, there is a function $k_x \in H$ with $\langle f, k_x \rangle_H = f(x)$ for all $f \in H$. Write $k(\cdot, x) := k_x(\cdot)$ ($k$ is called the reproducing kernel of $H$).
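
A concrete example of the reproducing property (standard, added here for illustration): take $X = \mathbb{R}^p$ and $k(x, x') = \langle x, x' \rangle$. The RKHS is the space of linear functions $H = \{ f_w : f_w(x) = \langle w, x \rangle, \ w \in \mathbb{R}^p \}$ with inner product $\langle f_w, f_v \rangle_H = \langle w, v \rangle$. Here $k_x = f_x$, and indeed
$$\langle f_w, k_x \rangle_H = \langle w, x \rangle = f_w(x) \quad \text{for all } f_w \in H.$$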

Positive definite kernels (cont.) (9/12)

One can show that $H$ is a RKHS over $X$ if and only if the evaluation functionals $\Lambda_x : H \to \mathbb{C}$, $f \mapsto \Lambda_x(f) = f(x)$, are continuous on $H$ (use Riesz's representation theorem).

Theorem: Let $k : X \times X \to \mathbb{R}$ be a positive definite kernel. Then there exists a RKHS $H_k$ over $X$ such that:
1. $k(\cdot, x) \in H_k$ for all $x \in X$;
2. $\mathrm{span}\{k(\cdot, x) : x \in X\}$ is dense in $H_k$;
3. $k$ is a reproducing kernel on $H_k$.

Now define $h : X \to H_k$ by $h(x) := k(\cdot, x)$. Then
$$\langle h(x), h(x') \rangle_{H_k} = \langle k(\cdot, x), k(\cdot, x') \rangle_{H_k} = k(x, x').$$

Moral: positive definite kernels arise exactly as inner products $\langle h(x), h(x') \rangle_H$.

Back to SVM (10/12)

We can replace $h$ by any positive definite kernel in the SVM problem:
$$L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \, h(x_i)^T h(x_j) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \, K(x_i, x_j).$$

Three popular choices in the SVM literature:
$K(x, x') = e^{-\gamma \|x - x'\|_2^2}$ (Gaussian kernel),
$K(x, x') = (1 + \langle x, x' \rangle)^d$ ($d$-th degree polynomial),
$K(x, x') = \tanh(\kappa_1 \langle x, x' \rangle + \kappa_2)$ (neural networks).
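
A small numpy sketch of these three kernels (my own helper functions; parameter names follow the slide). If scikit-learn is available, these choices correspond, up to parametrization, to SVC(kernel='rbf'), SVC(kernel='poly') and SVC(kernel='sigmoid'), and a precomputed Gram matrix can be passed via SVC(kernel='precomputed').

    import numpy as np

    def gaussian_kernel(x, xp, gamma=1.0):
        return np.exp(-gamma * np.sum((x - xp) ** 2))

    def polynomial_kernel(x, xp, d=3):
        return (1.0 + x @ xp) ** d

    def sigmoid_kernel(x, xp, kappa1=1.0, kappa2=0.0):
        return np.tanh(kappa1 * (x @ xp) + kappa2)

    # Gram matrix of the Gaussian kernel on a small illustrative sample
    rng = np.random.default_rng(2)
    X = rng.normal(size=(5, 3))
    K = np.array([[gaussian_kernel(x, xp, gamma=0.5) for xp in X] for x in X])
    print(np.linalg.eigvalsh(K).min() >= -1e-10)   # Gaussian Gram matrices are psd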

Constructing pd kernels (11/12)

Properties of pd kernels:
1. If $k_1, \dots, k_n$ are pd, then $\sum_{i=1}^n \lambda_i k_i$ is pd for any $\lambda_1, \dots, \lambda_n \ge 0$.
2. If $k_1, k_2$ are pd, then $k_1 k_2$ is pd (Schur product theorem).
3. If $(k_i)_{i \ge 1}$ are kernels, then $\lim_{i \to \infty} k_i$ is a kernel (if the limit exists).

Exercise: Use the above properties to show that $e^{-\gamma \|x - x'\|_2^2}$ and $(1 + \langle x, x' \rangle)^d$ are positive definite kernels (one possible route is sketched after this slide).

Definition: A function $h : \mathbb{R}^p \to \mathbb{R}$ is said to be positive definite if $K(x, x') := h(x - x')$ is a positive definite kernel. P.d. functions thus provide a way of constructing pd kernels.

Theorem (Bochner): A continuous function $h : \mathbb{R}^p \to \mathbb{C}$ is positive definite if and only if
$$h(x) = \int_{\mathbb{R}^p} e^{i \langle x, \omega \rangle} \, d\mu(\omega)$$
for some finite nonnegative Borel measure $\mu$ on $\mathbb{R}^p$.
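
One possible route for the exercise (a sketch, not the lecture's own solution). For the polynomial kernel: $\langle x, x' \rangle$ is pd (it is $\langle h(x), h(x') \rangle$ with $h = \mathrm{id}$), the constant kernel $1$ is pd, so $1 + \langle x, x' \rangle$ is pd by property 1, and $(1 + \langle x, x' \rangle)^d$ is pd by $d - 1$ applications of property 2. For the Gaussian kernel, write
$$e^{-\gamma\|x - x'\|_2^2} = e^{-\gamma\|x\|_2^2}\, e^{-\gamma\|x'\|_2^2}\, e^{2\gamma\langle x, x'\rangle}.$$
The factor $e^{2\gamma\langle x, x'\rangle} = \lim_{N\to\infty} \sum_{k=0}^{N} \frac{(2\gamma)^k}{k!} \langle x, x'\rangle^k$ is pd by properties 1-3, and $e^{-\gamma\|x\|_2^2} e^{-\gamma\|x'\|_2^2} = \langle g(x), g(x')\rangle$ with $g(x) := e^{-\gamma\|x\|_2^2} \in \mathbb{R}$ is pd, so the product is pd by property 2.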

Example: decision function (12/12)

[Figure: ESL, Figure 12.3. Solid black line = decision boundary; dotted lines = margin.]
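
Since the figure itself is not reproduced in this transcription, here is a hedged sketch (my own code and synthetic data, in the spirit of the ESL figure rather than a reproduction) of how such a plot can be generated: fit a kernel SVM and contour its decision function $f(x) = \sum_i \alpha_i y_i K(x, x_i) + \beta_0$ at level $0$ (boundary) and levels $\pm 1$ (margin).

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.svm import SVC

    # Two noisy, non-separable classes (illustrative data, not ESL's mixture example)
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(loc=[-1, -1], scale=1.0, size=(100, 2)),
                   rng.normal(loc=[1, 1], scale=1.0, size=(100, 2))])
    y = np.array([-1] * 100 + [1] * 100)

    clf = SVC(kernel='rbf', C=1.0, gamma=0.5).fit(X, y)

    # Evaluate the decision function on a grid; draw boundary (0) and margins (+/- 1)
    xx, yy = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', s=10)
    plt.contour(xx, yy, Z, levels=[-1, 0, 1], colors='k',
                linestyles=['dotted', 'solid', 'dotted'])
    plt.show()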