MATH 829: Introduction to Data Mining and Analysis
Support vector machines and kernels
Dominique Guillot
Departments of Mathematical Sciences, University of Delaware
March 14, 2016
Separating sets: mapping the features

We saw in the previous lecture how support vector machines provide a robust way of finding a separating hyperplane. What if the data is not separable? We can map the features into a high-dimensional space where the classes may become separable.
A brief intro to duality in optimization

Consider the problem:
\[
\begin{aligned}
\min_{x \in D \subseteq \mathbb{R}^n} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \dots, m, \\
& h_i(x) = 0, \quad i = 1, \dots, p.
\end{aligned}
\]
Denote by $p^*$ the optimal value of the problem.

Lagrangian: $L : D \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$,
\[
L(x, \lambda, \nu) := f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x).
\]

Lagrange dual function: $g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$,
\[
g(\lambda, \nu) := \inf_{x \in D} L(x, \lambda, \nu).
\]

Claim: for every $\lambda \ge 0$, $g(\lambda, \nu) \le p^*$. Indeed, if $x$ is feasible and $\lambda \ge 0$, then $\sum_i \lambda_i f_i(x) \le 0$ and $\sum_i \nu_i h_i(x) = 0$, so $L(x, \lambda, \nu) \le f_0(x)$; taking the infimum over feasible $x$ gives the bound.
A brief intro to duality in optimization (cont.)

Dual problem:
\[
\max_{\lambda \in \mathbb{R}^m,\ \nu \in \mathbb{R}^p} g(\lambda, \nu) \quad \text{subject to} \quad \lambda \ge 0.
\]
Denote by $d^*$ the optimal value of the dual problem. Clearly $d^* \le p^*$ (weak duality).

Strong duality: $d^* = p^*$. This does not hold in general, but it usually holds for convex problems (see e.g. Slater's constraint qualification).
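As a concrete illustration (not on the slides), here is a minimal numerical sketch of weak and strong duality for the toy convex problem $\min x^2$ subject to $x \ge 1$, whose Lagrangian can be minimized in closed form:

import numpy as np

# Toy problem: minimize f0(x) = x^2 subject to f1(x) = 1 - x <= 0.
# Optimal value: p* = 1 (attained at x = 1).
# Lagrangian: L(x, lam) = x^2 + lam * (1 - x), minimized at x = lam / 2,
# which gives the dual function g(lam) = lam - lam^2 / 4.

def g(lam):
    return lam - lam**2 / 4.0

lams = np.linspace(0, 4, 401)
print(np.all(g(lams) <= 1.0 + 1e-12))  # True: weak duality, g(lam) <= p*
print(g(lams).max())                   # ~1.0: d* = p* (Slater holds here)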
The kernel trick

Recall that SVM solves:
\[
\begin{aligned}
\min_{\beta_0, \beta, \xi} \quad & \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^n \xi_i \\
\text{subject to} \quad & y_i(x_i^T \beta + \beta_0) \ge 1 - \xi_i, \\
& \xi_i \ge 0.
\end{aligned}
\]
The associated Lagrangian is
\[
L_P = \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[ y_i(x_i^T \beta + \beta_0) - (1 - \xi_i) \right] - \sum_{i=1}^n \mu_i \xi_i,
\]
which we minimize w.r.t. $\beta$, $\beta_0$, $\xi$. Setting the respective derivatives to 0, we obtain:
\[
\beta = \sum_{i=1}^n \alpha_i y_i x_i, \qquad 0 = \sum_{i=1}^n \alpha_i y_i, \qquad \alpha_i = C - \mu_i \quad (i = 1, \dots, n).
\]
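The substitution step is standard but worth seeing once; expanding $L_P$ and applying the three stationarity conditions (a short worked derivation, following ESL):
\[
L_P = \frac{1}{2} \beta^T \beta - \sum_{i=1}^n \alpha_i y_i x_i^T \beta - \beta_0 \sum_{i=1}^n \alpha_i y_i + \sum_{i=1}^n (C - \alpha_i - \mu_i)\, \xi_i + \sum_{i=1}^n \alpha_i.
\]
The $\beta_0$ and $\xi_i$ terms vanish since $\sum_i \alpha_i y_i = 0$ and $\alpha_i = C - \mu_i$, while $\beta = \sum_i \alpha_i y_i x_i$ gives $\beta^T \beta = \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$; what remains is exactly the dual objective $L_D$ below.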
The kernel trick (cont.) 6/12 Substituting into L P, we obtain the Lagrange (dual) objective function: L D = α i 1 α i α j y i y j x T i x j. 2 i,j=1
The kernel trick (cont.) 6/12 Substituting into L P, we obtain the Lagrange (dual) objective function: L D = α i 1 α i α j y i y j x T i x j. 2 i,j=1 The function L D provides a lower bound on the original objective function at any feasible point (weak duality).
The kernel trick (cont.) 6/12 Substituting into L P, we obtain the Lagrange (dual) objective function: L D = α i 1 α i α j y i y j x T i x j. 2 i,j=1 The function L D provides a lower bound on the original objective function at any feasible point (weak duality). The solution of the original SVM problem can be obtained by maximizing L D under the previous constraints (strong duality).
The kernel trick (cont.) 6/12 Substituting into L P, we obtain the Lagrange (dual) objective function: L D = α i 1 α i α j y i y j x T i x j. 2 i,j=1 The function L D provides a lower bound on the original objective function at any feasible point (weak duality). The solution of the original SVM problem can be obtained by maximizing L D under the previous constraints (strong duality). Now suppose h : R p R m, transforming our features to h(x i ) = (h 1 (x i ),..., h m (x i )) R m.
The kernel trick (cont.) 6/12 Substituting into L P, we obtain the Lagrange (dual) objective function: L D = α i 1 α i α j y i y j x T i x j. 2 i,j=1 The function L D provides a lower bound on the original objective function at any feasible point (weak duality). The solution of the original SVM problem can be obtained by maximizing L D under the previous constraints (strong duality). Now suppose h : R p R m, transforming our features to h(x i ) = (h 1 (x i ),..., h m (x i )) R m. The Lagrange dual function becomes: L D = α i 1 α i α j y i y j h(x i ) T h(x j ). 2 i,j=1
The kernel trick (cont.) Substituting into L P, we obtain the Lagrange (dual) objective function: L D = α i 1 α i α j y i y j x T i x j. 2 i,j=1 The function L D provides a lower bound on the original objective function at any feasible point (weak duality). The solution of the original SVM problem can be obtained by maximizing L D under the previous constraints (strong duality). Now suppose h : R p R m, transforming our features to h(x i ) = (h 1 (x i ),..., h m (x i )) R m. The Lagrange dual function becomes: L D = α i 1 α i α j y i y j h(x i ) T h(x j ). 2 i,j=1 Important observation: L D only depends on h(x i ), h(x j ). 6/12
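A minimal sketch of solving this dual directly on a toy dataset (illustrative only; the data, parameters, and use of a general-purpose solver are my choices, and a dedicated QP solver would be used in practice):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

C = 1.0
K = X @ X.T                            # linear kernel: only inner products enter
Q = (y[:, None] * y[None, :]) * K

def neg_LD(a):                         # maximize L_D <=> minimize -L_D
    return 0.5 * a @ Q @ a - a.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ y},)   # sum_i alpha_i y_i = 0
bnds = [(0.0, C)] * len(y)                         # 0 <= alpha_i <= C
alpha = minimize(neg_LD, np.zeros(len(y)), bounds=bnds,
                 constraints=cons, method='SLSQP').x

beta = (alpha * y) @ X                 # beta = sum_i alpha_i y_i x_i
sv = (alpha > 1e-6) & (alpha < C - 1e-6)
beta0 = np.mean(y[sv] - X[sv] @ beta)  # from margin support vectors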
Positive definite kernels

Important observation: $L_D$ only depends on the inner products $\langle h(x_i), h(x_j) \rangle$. In fact, we don't even need to specify $h$; we only need
\[
K(x, x') = \langle h(x), h(x') \rangle.
\]
Question: Given $K : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$, when can we guarantee that $K(x, x') = \langle h(x), h(x') \rangle$ for some function $h$?

Observation: Suppose $K$ has the desired form. Then, for $x_1, \dots, x_N \in \mathbb{R}^p$ and $v_i := h(x_i)$,
\[
(K(x_i, x_j)) = (\langle h(x_i), h(x_j) \rangle) = (\langle v_i, v_j \rangle) = V^T V,
\]
where $V$ is the matrix with columns $v_1, \dots, v_N$. Conclusion: the matrix $(K(x_i, x_j))$ is positive semidefinite.
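A quick numerical sanity check of this observation (the feature map here is arbitrary and hypothetical, chosen just to illustrate that any explicit $h$ yields a psd Gram matrix):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))            # N = 10 points in R^3

# An arbitrary explicit feature map h: R^3 -> R^5 (hypothetical example).
W = rng.normal(size=(3, 5))
def h(x):
    return np.tanh(x @ W)

V = h(X).T                              # columns are v_i = h(x_i)
K = V.T @ V                             # Gram matrix (K(x_i, x_j)) = V^T V
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True: K is psd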
Positive definite kernels (cont.)

Necessary condition to have $K(x, x') = \langle h(x), h(x') \rangle$: the matrix $(K(x_i, x_j))_{i,j=1}^N$ is psd for any $x_1, \dots, x_N$ and any $N \ge 1$. Note also that $K(x, x') = K(x', x)$ if $K(x, x') = \langle h(x), h(x') \rangle$.

Definition: Let $X$ be a set. A symmetric kernel $K : X \times X \to \mathbb{R}$ is said to be a positive (semi)definite kernel if $(K(x_i, x_j))_{i,j=1}^N$ is positive (semi)definite for all $x_1, \dots, x_N \in X$ and all $N \ge 1$.

A reproducing kernel Hilbert space (RKHS) over a set $X$ is a Hilbert space $H$ of functions on $X$ such that for each $x \in X$, there is a function $k_x \in H$ satisfying
\[
\langle f, k_x \rangle_H = f(x) \qquad \forall f \in H.
\]
We write $k(\cdot, x) := k_x(\cdot)$ ($k$ = the reproducing kernel of $H$).
Positive definite kernels (cont.)

One can show that $H$ is a RKHS over $X$ iff the evaluation functionals
\[
\Lambda_x : H \to \mathbb{C}, \qquad f \mapsto \Lambda_x(f) = f(x),
\]
are continuous on $H$ (use Riesz's representation theorem).

Theorem: Let $k : X \times X \to \mathbb{R}$ be a positive definite kernel. Then there exists a RKHS $H_k$ over $X$ such that
1. $k(\cdot, x) \in H_k$ for all $x \in X$.
2. $\operatorname{span}(k(\cdot, x) : x \in X)$ is dense in $H_k$.
3. $k$ is a reproducing kernel of $H_k$.

Now, define $h : X \to H_k$ by $h(x) := k(\cdot, x)$. Then
\[
\langle h(x), h(x') \rangle_{H_k} = \langle k(\cdot, x), k(\cdot, x') \rangle_{H_k} = k(x, x').
\]
Moral: positive definite kernels arise exactly as inner products $\langle h(x), h(x') \rangle_H$.
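A simple worked example (not on the slides): for the linear kernel on $\mathbb{R}^p$, the RKHS can be written down explicitly. Take
\[
k(x, x') = x^T x', \qquad H_k = \{ f_w : f_w(x) = w^T x,\ w \in \mathbb{R}^p \}, \qquad \langle f_w, f_v \rangle_{H_k} := w^T v.
\]
Here $k(\cdot, x) = f_x$, and the reproducing property holds: $\langle f_w, k(\cdot, x) \rangle_{H_k} = w^T x = f_w(x)$.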
Back to SVM

We can replace the inner product $h(x_i)^T h(x_j)$ by any positive definite kernel in the SVM problem:
\[
L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j h(x_i)^T h(x_j)
    = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j).
\]
Three popular choices in the SVM literature:
- $K(x, x') = e^{-\gamma \|x - x'\|_2^2}$ (Gaussian kernel),
- $K(x, x') = (1 + \langle x, x' \rangle)^d$ ($d$-th degree polynomial),
- $K(x, x') = \tanh(\kappa_1 \langle x, x' \rangle + \kappa_2)$ (neural networks).
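All three kernels are available in scikit-learn's sklearn.metrics.pairwise module; a quick sketch (the parameter values below are arbitrary choices for illustration):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel, sigmoid_kernel

X = np.random.default_rng(2).normal(size=(5, 3))

K_gauss = rbf_kernel(X, gamma=0.5)                              # exp(-gamma ||x - x'||^2)
K_poly = polynomial_kernel(X, degree=3, gamma=1.0, coef0=1.0)   # (<x, x'> + 1)^3
K_tanh = sigmoid_kernel(X, gamma=0.1, coef0=0.0)                # tanh(gamma <x, x'>)

# The first two are always psd; the sigmoid "kernel" need not be.
print(np.linalg.eigvalsh(K_gauss).min())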
Constructing pd kernels

Properties of pd kernels:
1. If $k_1, \dots, k_n$ are pd, then $\sum_{i=1}^n \lambda_i k_i$ is pd for any $\lambda_1, \dots, \lambda_n \ge 0$.
2. If $k_1, k_2$ are pd, then $k_1 k_2$ is pd (Schur product theorem).
3. If $(k_i)_{i \ge 1}$ are kernels, then $\lim k_i$ is a kernel (if the limit exists).

Exercise: Use the above properties to show that $e^{-\gamma \|x - x'\|_2^2}$ and $(1 + \langle x, x' \rangle)^d$ are positive definite kernels.

Definition: A function $h : \mathbb{R}^p \to \mathbb{R}$ is said to be positive definite if
\[
K(x, x') := h(x - x')
\]
is a positive definite kernel. P.d. functions thus provide a way of constructing pd kernels.

Theorem (Bochner): A continuous function $h : \mathbb{R}^p \to \mathbb{C}$ is positive definite if and only if
\[
h(x) = \int_{\mathbb{R}^p} e^{i \langle x, \omega \rangle} \, d\mu(\omega)
\]
for some finite nonnegative Borel measure $\mu$ on $\mathbb{R}^p$.
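Bochner's theorem also underlies the random Fourier features approximation (Rahimi-Recht, not covered on these slides): for the Gaussian kernel the measure $\mu$ is itself Gaussian, so sampling frequencies $\omega \sim N(0, 2\gamma I)$ gives an explicit finite-dimensional feature map whose inner products approximate the kernel. A minimal sketch under those assumptions:

import numpy as np

rng = np.random.default_rng(3)
gamma, D, p = 0.5, 2000, 3
x, xp = rng.normal(size=p), rng.normal(size=p)

# Sample frequencies from mu = N(0, 2*gamma*I) (Bochner measure for the
# Gaussian kernel) and uniform random phases.
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, p))
b = rng.uniform(0, 2 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

approx = z(x) @ z(xp)
exact = np.exp(-gamma * np.sum((x - xp) ** 2))
print(approx, exact)   # close for large D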
Example: decision function

[Figure: ESL, Figure 12.3 — solid black line = decision boundary, dotted line = margin.]
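A picture of this kind can be reproduced with scikit-learn's SVC; a hedged sketch (the dataset and parameters below are illustrative, not those of ESL):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # not linearly separable

clf = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X, y)

# Evaluate the decision function on a grid; the 0-level set is the decision
# boundary and the +/-1 level sets are the margins.
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.scatter(X[:, 0], X[:, 1], c=y, s=15)
plt.contour(xx, yy, Z, levels=[-1, 0, 1],
            colors='k', linestyles=['dotted', 'solid', 'dotted'])
plt.show()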