Learning From Data, Lecture 25: The Kernel Trick
Learning with only inner products; the Kernel
M. Magdon-Ismail, CSCI 400/600
Recap: Large Margin is Better; Controlling Overfitting; Non-Separable Data

[Figure: $E_{\text{out}}$ of a random hyperplane versus $\gamma(\text{random hyperplane})/\gamma(\text{svm})$; hyperplanes with larger margin achieve lower $E_{\text{out}}$.]

Soft-margin SVM:
$$\min_{b,w,\xi}\ \tfrac{1}{2}w^tw + C\sum_{n=1}^N \xi_n$$
subject to: $y_n(w^tx_n+b) \ge 1-\xi_n$ and $\xi_n \ge 0$ for $n=1,\dots,N$.

Theorem. $d_{\text{vc}}(\gamma) \le \left\lceil \dfrac{R^2}{\gamma^2} \right\rceil + 1$.

$$E_{\text{cv}} \le \frac{\#\text{ support vectors}}{N}$$

[Figure: decision boundaries for $\Phi_2$ + SVM, $\Phi_3$ + SVM, and $\Phi_3$ + pseudoinverse algorithm.]

A complex hypothesis that does not overfit because it is effectively simple: controlled by only a few support vectors.
Recall: Mechanics of the Nonlinear Transform

X-space is $\mathbb{R}^d$; Z-space is $\mathbb{R}^{\tilde d}$.
$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \qquad z = \Phi(x) = \begin{bmatrix} \Phi_1(x) \\ \vdots \\ \Phi_{\tilde d}(x) \end{bmatrix} = \begin{bmatrix} z_1 \\ \vdots \\ z_{\tilde d} \end{bmatrix}$$

1. Original data: $x_1,\dots,x_N \in X$ with labels $y_1,\dots,y_N$; no weights yet ($d_{\text{vc}} = d+1$).
2. Transform the data: $z_n = \Phi(x_n) \in Z$, giving $z_1,\dots,z_N$ with $y_1,\dots,y_N$ and weights $\tilde w = (\tilde w_0, \tilde w_1, \dots, \tilde w_{\tilde d})$ ($d_{\text{vc}} = \tilde d+1$).
3. Separate the data in Z-space: $\tilde g(z) = \text{sign}(\tilde w^t z)$.
4. Classify in X-space: $g(x) = \tilde g(\Phi(x)) = \text{sign}(\tilde w^t \Phi(x))$.

You have to transform the data to the Z-space.
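To make the four steps concrete, here is a minimal numpy sketch (not from the slides) that explicitly transforms the data to Z-space; the particular $\Phi$ and the placeholder weights `w_tilde`, `b` are illustrative assumptions, not learned values.

```python
import numpy as np

def Phi(x):
    # An illustrative 2nd-order polynomial transform: R^2 -> R^5
    x1, x2 = x
    return np.array([x1, x2, x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

# Steps 1-2: take the original data and transform it, z_n = Phi(x_n)
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
Z = np.array([Phi(x) for x in X])               # N x d~ matrix of Z-space points

# Step 3: suppose some algorithm produced a linear separator (w_tilde, b) in Z-space
w_tilde = np.array([1.0, -1.0, 0.0, 0.0, 0.0])  # placeholder weights (assumed)
b = -1.0

# Step 4: classify in X-space by composing with Phi: g(x) = sign(w_tilde^t Phi(x) + b)
g = lambda x: int(np.sign(w_tilde @ Phi(x) + b))
print([g(x) for x in X])                        # [-1, -1, 1, 1]
```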
This Lecture

How to use nonlinear transforms without physically transforming the data to Z-space.
Primal Versus Dual

Primal:
$$\min_{b,w}\ \tfrac{1}{2}w^tw$$
subject to: $y_n(w^tx_n+b) \ge 1$ for $n=1,\dots,N$.

Dual:
$$\min_{\alpha}\ \tfrac{1}{2}\sum_{n,m=1}^N \alpha_n\alpha_m y_ny_m (x_n^tx_m) - \sum_{n=1}^N \alpha_n$$
subject to: $\sum_{n=1}^N \alpha_n y_n = 0$ and $\alpha_n \ge 0$ for $n=1,\dots,N$.

From the dual solution:
$$w^* = \sum_{n=1}^N \alpha_n^* y_n x_n, \qquad b^* = y_s - w^{*t}x_s \quad (\text{any } s \text{ with } \alpha_s^* > 0, \text{ a support vector}),$$
$$g(x) = \text{sign}(w^{*t}x + b^*) = \text{sign}\Big(\sum_{n=1}^N \alpha_n^* y_n x_n^t(x - x_s) + y_s\Big).$$

Primal: $d+1$ optimization variables $(b, w)$. Dual: $N$ optimization variables $\alpha$.
Primal Versus Dual: Matrix-Vector Form

Primal:
$$\min_{b,w}\ \tfrac{1}{2}w^tw \quad \text{subject to: } y_n(w^tx_n+b)\ge 1 \text{ for } n=1,\dots,N.$$

Dual:
$$\min_{\alpha}\ \tfrac{1}{2}\alpha^tG\alpha - 1^t\alpha \qquad (G_{nm} = y_ny_m x_n^tx_m)$$
subject to: $y^t\alpha = 0$ and $\alpha \ge 0$.

As before, $w^* = \sum_{n=1}^N \alpha_n^* y_n x_n$ and $b^* = y_s - w^{*t}x_s$ for a support vector ($\alpha_s^* > 0$), and
$$g(x) = \text{sign}(w^{*t}x+b^*) = \text{sign}\Big(\sum_{n=1}^N \alpha_n^*y_nx_n^t(x-x_s)+y_s\Big).$$

Primal: $d+1$ optimization variables $(b,w)$. Dual: $N$ optimization variables $\alpha$.
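In code, $G$ is one line once you form the signed data matrix $X_s$ (row $n$ is $y_nx_n^t$, so $G = X_sX_s^t$). A small numpy sketch, using the toy data set that appears later in the lecture:

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])  # N x d inputs
y = np.array([-1.0, -1.0, 1.0, 1.0])                            # labels

Xs = y[:, None] * X   # signed data matrix: row n is y_n x_n^t
G = Xs @ Xs.T         # G[n, m] = y_n y_m x_n^t x_m
print(G)
```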
Deriving the Dual: The Lagrangian

$$\mathcal{L}(b,w,\alpha) = \tfrac{1}{2}w^tw + \sum_{n=1}^N \alpha_n\big(1 - y_n(w^tx_n+b)\big)$$
The $\alpha_n$ are Lagrange multipliers attached to the constraints. Minimize over $(b,w)$ unconstrained; maximize over $\alpha \ge 0$.

Intuition:
- If $1 - y_n(w^tx_n+b) > 0$ (a violated constraint), then $\alpha_n \to \infty$ sends $\mathcal{L} \to \infty$. So choosing $(b,w)$ to minimize $\mathcal{L}$ forces $y_n(w^tx_n+b) \ge 1$.
- If $1 - y_n(w^tx_n+b) < 0$, then maximizing $\mathcal{L}$ over $\alpha_n$ gives $\alpha_n = 0$: these are the non-support vectors.

Formally: use the KKT conditions to transform the primal.

Conclusion: at the optimum, $\alpha_n\big(y_n(w^tx_n+b) - 1\big) = 0$, so $\mathcal{L} = \tfrac{1}{2}w^tw$ and the constraints $y_n(w^tx_n+b) \ge 1$ are satisfied.
Unconstrained Minimization w.r.t. $(b, w)$

$$\mathcal{L} = \tfrac{1}{2}w^tw - \sum_{n=1}^N \alpha_n\big(y_n(w^tx_n+b) - 1\big)$$

Set $\dfrac{\partial\mathcal{L}}{\partial b} = 0$: $\quad \dfrac{\partial\mathcal{L}}{\partial b} = -\sum_{n=1}^N \alpha_ny_n = 0 \implies \sum_{n=1}^N \alpha_ny_n = 0.$

Set $\dfrac{\partial\mathcal{L}}{\partial w} = 0$: $\quad \dfrac{\partial\mathcal{L}}{\partial w} = w - \sum_{n=1}^N \alpha_ny_nx_n = 0 \implies w = \sum_{n=1}^N \alpha_ny_nx_n.$

Substitute back into $\mathcal{L}$ to get the function to maximize over $\alpha \ge 0$:
$$\mathcal{L} = \tfrac{1}{2}w^tw - w^t\sum_{n=1}^N \alpha_ny_nx_n - b\sum_{n=1}^N \alpha_ny_n + \sum_{n=1}^N \alpha_n = -\tfrac{1}{2}w^tw + \sum_{n=1}^N\alpha_n = -\tfrac{1}{2}\sum_{n,m=1}^N \alpha_n\alpha_my_ny_m x_n^tx_m + \sum_{n=1}^N \alpha_n.$$

Equivalently, minimize over $\alpha$:
$$\tfrac{1}{2}\alpha^tG\alpha - 1^t\alpha \qquad (G_{nm}=y_ny_mx_n^tx_m)$$
subject to $y^t\alpha = 0$, $\alpha \ge 0$. Then $w^* = \sum_{n=1}^N \alpha_n^*y_nx_n$, and $\alpha_s^* > 0 \implies y_s(w^{*t}x_s+b^*) = 1 \implies b^* = y_s - w^{*t}x_s$.
Example: Our Toy Data Set

$$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix} \quad y = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix} \quad X_s = \begin{bmatrix} 0 & 0 \\ -2 & -2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix} \text{ (signed data matrix)} \quad G = X_sX_s^t = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 8 & -4 & -6 \\ 0 & -4 & 4 & 6 \\ 0 & -6 & 6 & 9 \end{bmatrix}$$

Quadratic programming solves the dual SVM. The dual
$$\min_\alpha\ \tfrac{1}{2}\alpha^tG\alpha - 1^t\alpha \quad \text{subject to } y^t\alpha=0,\ \alpha\ge 0$$
is the QP
$$\min_u\ \tfrac{1}{2}u^tQu + p^tu \quad \text{subject to } Au \ge c,$$
with $u=\alpha$, $Q=G$, $p=-1_N$, $A = \begin{bmatrix} y^t \\ -y^t \\ I_N \end{bmatrix}$, $c = \begin{bmatrix} 0 \\ 0 \\ 0_N \end{bmatrix}$, solved by $\alpha^* \leftarrow \text{QP}(Q,p,A,c)$.

The solution:
$$\alpha^* = \begin{bmatrix} \tfrac12 \\ \tfrac12 \\ 1 \\ 0 \end{bmatrix}, \quad w^* = \sum_{n=1}^N \alpha_n^*y_nx_n = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad b^* = y_1 - w^{*t}x_1 = -1, \quad \gamma = \frac{1}{\|w^*\|} = \frac{1}{\sqrt2}.$$

Non-support vectors $\implies \alpha_n^* = 0$ (here $\alpha_4^* = 0$); only support vectors can have $\alpha_n^* > 0$.
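As a sanity check (not part of the slide), the claimed solution can be verified numerically with a few lines of numpy:

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = np.array([0.5, 0.5, 1.0, 0.0])     # claimed dual optimum

# Dual feasibility: y^t alpha = 0 and alpha >= 0
assert np.isclose(y @ alpha, 0.0) and np.all(alpha >= 0)

# Recover the primal solution: w* = sum_n alpha_n y_n x_n, b* = y_s - w*^t x_s
w = (alpha * y) @ X                        # gives (1, -1)
s = 0                                      # x_1 is a support vector (alpha_1 > 0)
b = y[s] - w @ X[s]                        # gives -1
print(w, b, 1.0 / np.linalg.norm(w))      # margin gamma = 1/||w*|| = 1/sqrt(2)

# Every support vector sits on the margin: y_n (w^t x_n + b) = 1
print(y * (X @ w + b))                     # [1, 1, 1, 2]; x_4 is non-support
```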
Dual QP Algorithm for Hard-Margin Linear-SVM

1: Input: $X$, $y$.
2: Let $p = -1_N$ (the $N$-vector of minus ones) and $c = 0_{N+2}$ (the $(N+2)$-vector of zeros). Construct the matrices $Q$ and $A$, where
$$X_s = \begin{bmatrix} y_1x_1^t \\ \vdots \\ y_Nx_N^t \end{bmatrix} \text{ (signed data matrix)}, \qquad Q = X_sX_s^t, \qquad A = \begin{bmatrix} y^t \\ -y^t \\ I_N \end{bmatrix}.$$
3: $\alpha^* \leftarrow \text{QP}(Q, p, A, c)$, which solves
$$\min_\alpha\ \tfrac{1}{2}\alpha^tG\alpha - 1^t\alpha \quad \text{subject to } y^t\alpha = 0,\ \alpha \ge 0.$$
(Some packages accept equality and bound constraints directly, so this type of QP can be solved without stacking $\pm y^t$ into $A$.)
4: Return $w^* = \sum_{\alpha_n^*>0} \alpha_n^*y_nx_n$ and $b^* = y_s - w^{*t}x_s$ for any $s$ with $\alpha_s^* > 0$.
5: The final hypothesis is $g(x) = \text{sign}(w^{*t}x + b^*)$.
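A possible implementation in Python, using the off-the-shelf solver `cvxopt.solvers.qp` (which takes equality and inequality constraints separately, so the $\pm y^t$ stacking is not needed). The tiny ridge on $Q$ and the `1e-8` support-vector tolerance are practical assumptions, not part of the algorithm above.

```python
import numpy as np
from cvxopt import matrix, solvers   # pip install cvxopt

def hard_margin_svm(X, y):
    """Solve the hard-margin dual QP and return (w*, b*, alpha*)."""
    N = len(y)
    Xs = y[:, None] * X
    # cvxopt solves: min (1/2) a^t Q a + p^t a  s.t.  Gc a <= h,  A a = b0
    Q = matrix(Xs @ Xs.T + 1e-10 * np.eye(N))   # tiny ridge for numerical stability
    p = matrix(-np.ones(N))
    Gc = matrix(-np.eye(N))                     # -alpha <= 0  <=>  alpha >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, N))                 # y^t alpha = 0
    b0 = matrix(0.0)
    alpha = np.array(solvers.qp(Q, p, Gc, h, A, b0)['x']).ravel()
    sv = alpha > 1e-8                           # support vectors (tolerance assumed)
    w = (alpha[sv] * y[sv]) @ X[sv]
    s = np.argmax(alpha)                        # an index with alpha_s > 0
    b = y[s] - w @ X[s]
    return w, b, alpha

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b, alpha = hard_margin_svm(X, y)
print(w, b, np.round(alpha, 3))                 # expect w* = (1, -1), b* = -1
```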
Primal Versus Dual (Non-Separable)

Primal:
$$\min_{b,w,\xi}\ \tfrac{1}{2}w^tw + C\sum_{n=1}^N \xi_n$$
subject to: $y_n(w^tx_n+b) \ge 1-\xi_n$ and $\xi_n \ge 0$ for $n=1,\dots,N$.

Dual:
$$\min_\alpha\ \tfrac{1}{2}\alpha^tG\alpha - 1^t\alpha$$
subject to: $y^t\alpha = 0$ and $C \ge \alpha \ge 0$.

$$w^* = \sum_{n=1}^N \alpha_n^*y_nx_n, \qquad b^* = y_s - w^{*t}x_s \quad (C > \alpha_s^* > 0),$$
$$g(x) = \text{sign}(w^{*t}x+b^*) = \text{sign}\Big(\sum_{n=1}^N \alpha_n^*y_nx_n^t(x-x_s)+y_s\Big).$$

Primal: $N+d+1$ optimization variables $(b,w,\xi)$. Dual: $N$ optimization variables $\alpha$.
Dual SVM is an Inner Product Algorithm: X-Space

$$\min_\alpha\ \tfrac{1}{2}\alpha^tG\alpha - 1^t\alpha \quad \text{subject to } y^t\alpha = 0,\ C \ge \alpha \ge 0, \qquad G_{nm} = y_ny_m(x_n^tx_m)$$

$$g(x) = \text{sign}\Big(\sum_{\alpha_n^*>0} \alpha_n^*y_n(x_n^tx) + b^*\Big), \qquad b^* = y_s - \sum_{\alpha_n^*>0} \alpha_n^*y_n(x_n^tx_s) \quad (C > \alpha_s^* > 0).$$

The data enter only through inner products $x_n^tx_m$ and $x_n^tx$.
Dual SVM is an Inner Product Algorithm: Z-Space

$$\min_\alpha\ \tfrac{1}{2}\alpha^tG\alpha - 1^t\alpha \quad \text{subject to } y^t\alpha = 0,\ C \ge \alpha \ge 0, \qquad G_{nm} = y_ny_m(z_n^tz_m)$$

$$g(x) = \text{sign}\Big(\sum_{\alpha_n^*>0} \alpha_n^*y_n(z_n^tz) + b^*\Big), \qquad b^* = y_s - \sum_{\alpha_n^*>0} \alpha_n^*y_n(z_n^tz_s) \quad (C > \alpha_s^* > 0).$$

Can we compute $z^tz'$ without needing $z = \Phi(x)$, i.e. without ever visiting Z-space?
The Kernel $K(\cdot,\cdot)$ for a Transform $\Phi(\cdot)$

The kernel tells you how to compute the inner product in Z-space:
$$K(x,x') = \Phi(x)^t\Phi(x') = z^tz'.$$

Example: 2nd-order polynomial transform
$$\Phi(x) = \begin{bmatrix} x_1 \\ x_2 \\ x_1^2 \\ \sqrt2\,x_1x_2 \\ x_2^2 \end{bmatrix} \qquad (O(d^2) \text{ to compute}),$$
$$K(x,x') = \Phi(x)^t\Phi(x') = x_1x_1' + x_2x_2' + x_1^2x_1'^2 + 2x_1x_2x_1'x_2' + x_2^2x_2'^2 = \Big(\tfrac12 + x^tx'\Big)^2 - \tfrac14,$$
which is computed quickly in X-space, in $O(d)$.
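A quick numerical check of this identity (a sketch, not from the slides): the kernel evaluated directly in X-space agrees with the explicit Z-space inner product.

```python
import numpy as np

def Phi(x):
    # The slide's 2nd-order polynomial transform: R^2 -> R^5
    x1, x2 = x
    return np.array([x1, x2, x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def K(x, xp):
    # Kernel evaluated directly in X-space: O(d) work
    return (0.5 + x @ xp) ** 2 - 0.25

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)
print(Phi(x) @ Phi(xp), K(x, xp))   # the two numbers agree
```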
The Gaussian Kernel is Infinite-Dimensional

$$K(x,x') = e^{-\gamma\|x-x'\|^2}$$

Example: the Gaussian kernel in 1 dimension, with
$$\Phi(x) = e^{-x^2}\begin{bmatrix} \sqrt{\tfrac{2^0}{0!}} \\[2pt] \sqrt{\tfrac{2^1}{1!}}\,x \\[2pt] \sqrt{\tfrac{2^2}{2!}}\,x^2 \\[2pt] \sqrt{\tfrac{2^3}{3!}}\,x^3 \\[2pt] \sqrt{\tfrac{2^4}{4!}}\,x^4 \\ \vdots \end{bmatrix} \qquad \text{(infinite-dimensional } \Phi\text{)}.$$

$$K(x,x') = \Phi(x)^t\Phi(x') = e^{-x^2}e^{-x'^2}\sum_{i=0}^\infty \frac{(2xx')^i}{i!} = e^{-x^2}e^{-x'^2}e^{2xx'} = e^{-(x-x')^2}.$$
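To see the series converge, one can truncate the infinite feature map and compare with the closed form (an illustrative sketch; the truncation length 25 is an arbitrary choice):

```python
import numpy as np
from math import factorial

def phi_truncated(x, D=25):
    # First D coordinates of the infinite-dimensional Gaussian feature map
    return np.exp(-x**2) * np.array(
        [np.sqrt(2.0**i / factorial(i)) * x**i for i in range(D)])

x, xp = 0.7, -0.3
print(phi_truncated(x) @ phi_truncated(xp))  # ~ 0.3679
print(np.exp(-(x - xp) ** 2))                # e^{-(x-x')^2} = e^{-1} ~ 0.3679
```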
The Kernel Allows Us to Bypass Z-Space

1: Input: $X$, $y$, regularization parameter $C$, kernel $K(\cdot,\cdot)$.
2: Compute $G$: $G_{nm} = y_ny_mK(x_n,x_m)$.
3: Solve the QP:
$$\alpha^* \leftarrow \min_\alpha\ \tfrac{1}{2}\alpha^tG\alpha - 1^t\alpha \quad \text{subject to } y^t\alpha = 0,\ C \ge \alpha \ge 0.$$
4: Pick an index $s$ with $C > \alpha_s^* > 0$ and compute $b^* = y_s - \sum_{\alpha_n^*>0} \alpha_n^*y_nK(x_n,x_s)$.
5: The final hypothesis is
$$g(x) = \text{sign}\Big(\sum_{\alpha_n^*>0} \alpha_n^*y_nK(x_n,x) + b^*\Big).$$
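A possible end-to-end implementation of this algorithm (a sketch assuming the `cvxopt` QP solver; the Gaussian kernel with $\gamma = 1$, the `1e-10` ridge, and the tolerances are illustrative choices, not prescribed by the slide):

```python
import numpy as np
from cvxopt import matrix, solvers   # pip install cvxopt

def kernel_svm(X, y, C, K):
    """Soft-margin kernel SVM via the dual QP; returns the classifier g(x)."""
    N = len(y)
    Kmat = np.array([[K(xn, xm) for xm in X] for xn in X])
    G = np.outer(y, y) * Kmat                    # G_nm = y_n y_m K(x_n, x_m)
    Q = matrix(G + 1e-10 * np.eye(N))            # tiny ridge for stability
    p = matrix(-np.ones(N))
    # Box constraints 0 <= alpha <= C, stacked as -alpha <= 0 and alpha <= C
    Gc = matrix(np.vstack([-np.eye(N), np.eye(N)]))
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A, b0 = matrix(y.reshape(1, N)), matrix(0.0)
    alpha = np.array(solvers.qp(Q, p, Gc, h, A, b0)['x']).ravel()
    sv = np.where(alpha > 1e-8)[0]               # support vectors
    s = next(n for n in sv if alpha[n] < C - 1e-8)   # margin SV: C > alpha_s > 0
    b = y[s] - sum(alpha[n] * y[n] * K(X[n], X[s]) for n in sv)
    return lambda x: np.sign(sum(alpha[n] * y[n] * K(X[n], x) for n in sv) + b)

gauss = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))   # Gaussian kernel, gamma = 1
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
g = kernel_svm(X, y, C=10.0, K=gauss)
print([g(x) for x in X])   # should reproduce the training labels
```

Note that the data never appear except inside $K(\cdot,\cdot)$: the classifier works entirely in X-space even though the Gaussian kernel's Z-space is infinite-dimensional.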
The Kernel-Support Vector Machine

Overfitting: high $\tilde d$ means a complicated separator, but a small number of support vectors means low effective complexity. With the SVM, we can go to high (even infinite) $\tilde d$.

Computation: high $\tilde d$ makes the transform expensive or infeasible, but computing inner products with the kernel $K(\cdot,\cdot)$ makes it computationally feasible to go to high (even infinite) $\tilde d$.