Learning From Data Lecture 25 The Kernel Trick

Learning From Data, Lecture 25: The Kernel Trick
Learning with only inner products: the Kernel
M. Magdon-Ismail, CSCI 4100/6100

Recap: Large Margin is Better; Controlling Overfitting; Non-Separable Data

[Figure: $E_{\text{out}}$ of a random separating hyperplane versus the SVM, plotted against the margin ratio $\gamma(\text{random hyperplane})/\gamma(\text{SVM})$.]

Theorem. $d_{\text{vc}}(\gamma) \le \left\lceil \frac{R^2}{\gamma^2} \right\rceil + 1$.

Soft-margin SVM for non-separable data:
$$\min_{b,\mathbf{w},\boldsymbol{\xi}}\ \tfrac{1}{2}\mathbf{w}^{\mathrm{t}}\mathbf{w} + C\sum_{n=1}^{N}\xi_n \qquad \text{subject to: } y_n(\mathbf{w}^{\mathrm{t}}\mathbf{x}_n + b) \ge 1 - \xi_n,\ \ \xi_n \ge 0,\ \ n = 1,\dots,N.$$

Cross-validation bound: $E_{\text{cv}} \le \dfrac{\#\,\text{support vectors}}{N}$.

[Figure: decision boundaries for $\Phi_2$ + SVM, $\Phi_3$ + SVM, and $\Phi_3$ + pseudoinverse algorithm.]

A complex hypothesis that does not overfit because it is effectively simple, controlled by only a few support vectors.

Recall: Mechanics of the Nonlinear Transform

X-space is $\mathbb{R}^d$; Z-space is $\mathbb{R}^{\tilde{d}}$, with
$$\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \ \longmapsto\ \mathbf{z} = \Phi(\mathbf{x}) = \begin{bmatrix} \Phi_1(\mathbf{x}) \\ \vdots \\ \Phi_{\tilde{d}}(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} z_1 \\ \vdots \\ z_{\tilde{d}} \end{bmatrix}.$$

1. Original data in X-space: $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$; no weights yet; $d_{\text{vc}} = d + 1$.
2. Transform the data: $\mathbf{z}_n = \Phi(\mathbf{x}_n) \in \mathcal{Z}$, giving $(\mathbf{z}_1, y_1), \dots, (\mathbf{z}_N, y_N)$; weights $\tilde{\mathbf{w}} = (\tilde{w}_0, \tilde{w}_1, \dots, \tilde{w}_{\tilde{d}})$; $d_{\text{vc}} = \tilde{d} + 1$.
3. Separate the data in Z-space: $\tilde{g}(\mathbf{z}) = \operatorname{sign}(\tilde{\mathbf{w}}^{\mathrm{t}}\mathbf{z})$.
4. Classify in X-space: $g(\mathbf{x}) = \tilde{g}(\Phi(\mathbf{x})) = \operatorname{sign}(\tilde{\mathbf{w}}^{\mathrm{t}}\Phi(\mathbf{x}))$.

We have to transform the data to the Z-space.
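As a concrete illustration of these four steps, here is a minimal numpy sketch (not from the slides): it uses the 2nd-order polynomial transform that appears later in this lecture as $\Phi$, a small made-up XOR-like data set, and a plain perceptron update as the linear model in Z-space; all three choices are illustrative.

```python
import numpy as np

def phi2(x):
    """2nd-order polynomial transform Phi: R^2 -> R^5 (illustrative choice)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, np.sqrt(2) * x1 * x2, x2**2])

# 1. Original data in X-space (XOR-like: not linearly separable in X-space)
X = np.array([[1., 1.], [-1., -1.], [1., -1.], [-1., 1.]])
y = np.array([1, 1, -1, -1])

# 2. Transform the data to Z-space: z_n = Phi(x_n), prepending z_0 = 1
Z = np.column_stack([np.ones(len(X)), np.array([phi2(x) for x in X])])

# 3. Separate the data in Z-space with a linear model (perceptron updates)
w = np.zeros(Z.shape[1])
for _ in range(10_000):
    mis = np.where(np.sign(Z @ w) != y)[0]
    if len(mis) == 0:
        break
    w += y[mis[0]] * Z[mis[0]]

# 4. Classify in X-space: g(x) = sign(w~^t Phi(x))
g = lambda x: np.sign(w @ np.concatenate(([1.], phi2(x))))
print([g(x) for x in X])   # should reproduce the training labels
```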

This Lecture

How to use nonlinear transforms without physically transforming the data to Z-space.

Primal Versus Dual

Primal:
$$\min_{b,\mathbf{w}}\ \tfrac{1}{2}\mathbf{w}^{\mathrm{t}}\mathbf{w} \qquad \text{subject to: } y_n(\mathbf{w}^{\mathrm{t}}\mathbf{x}_n + b) \ge 1,\ \ n = 1,\dots,N.$$
Final hypothesis: $g(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^{\mathrm{t}}\mathbf{x} + b)$. There are $d + 1$ optimization variables $(b, \mathbf{w})$.

Dual:
$$\min_{\boldsymbol{\alpha}}\ \tfrac{1}{2}\sum_{n,m=1}^{N} \alpha_n \alpha_m y_n y_m (\mathbf{x}_n^{\mathrm{t}}\mathbf{x}_m) - \sum_{n=1}^{N} \alpha_n \qquad \text{subject to: } \sum_{n=1}^{N} \alpha_n y_n = 0,\ \ \alpha_n \ge 0,\ \ n = 1,\dots,N.$$
Recover the primal solution from the support vectors (any $\alpha_s^* > 0$):
$$\mathbf{w}^* = \sum_{n=1}^{N} \alpha_n^* y_n \mathbf{x}_n, \qquad b^* = y_s - \mathbf{w}^{*\mathrm{t}}\mathbf{x}_s.$$
Final hypothesis: $g(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^{*\mathrm{t}}\mathbf{x} + b^*) = \operatorname{sign}\!\Big(\sum_{n=1}^{N} \alpha_n^* y_n\, \mathbf{x}_n^{\mathrm{t}}(\mathbf{x} - \mathbf{x}_s) + y_s\Big)$. There are $N$ optimization variables $\boldsymbol{\alpha}$.

Primal Versus Dual - Matrix-Vector Form

Primal (as before):
$$\min_{b,\mathbf{w}}\ \tfrac{1}{2}\mathbf{w}^{\mathrm{t}}\mathbf{w} \qquad \text{subject to: } y_n(\mathbf{w}^{\mathrm{t}}\mathbf{x}_n + b) \ge 1,\ \ n = 1,\dots,N; \qquad d+1 \text{ optimization variables } (b,\mathbf{w}).$$

Dual:
$$\min_{\boldsymbol{\alpha}}\ \tfrac{1}{2}\boldsymbol{\alpha}^{\mathrm{t}} G \boldsymbol{\alpha} - \mathbf{1}^{\mathrm{t}}\boldsymbol{\alpha} \qquad (G_{nm} = y_n y_m\, \mathbf{x}_n^{\mathrm{t}}\mathbf{x}_m) \qquad \text{subject to: } \mathbf{y}^{\mathrm{t}}\boldsymbol{\alpha} = 0,\ \ \boldsymbol{\alpha} \ge \mathbf{0}; \qquad N \text{ optimization variables } \boldsymbol{\alpha}.$$
$$\mathbf{w}^* = \sum_{n=1}^{N} \alpha_n^* y_n \mathbf{x}_n, \qquad b^* = y_s - \mathbf{w}^{*\mathrm{t}}\mathbf{x}_s \ \ (\alpha_s^* > 0,\ \text{a support vector}), \qquad g(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{n=1}^{N} \alpha_n^* y_n\, \mathbf{x}_n^{\mathrm{t}}(\mathbf{x} - \mathbf{x}_s) + y_s\Big).$$
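A small numpy sketch (illustrative, with random data) of how $G$ is built from the signed data matrix and why it matches the element-wise definition $G_{nm} = y_n y_m\, \mathbf{x}_n^{\mathrm{t}}\mathbf{x}_m$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 2
X = rng.standard_normal((N, d))
y = rng.choice([-1.0, 1.0], size=N)

Xs = y[:, None] * X                      # signed data matrix, rows y_n x_n^t
G = Xs @ Xs.T                            # G = X_s X_s^t
G_check = np.outer(y, y) * (X @ X.T)     # element-wise G_nm = y_n y_m x_n^t x_m
print(np.allclose(G, G_check))           # True
```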

Deriving the Dual: The Lagrangian

$$\mathcal{L}(b, \mathbf{w}, \boldsymbol{\alpha}) = \tfrac{1}{2}\mathbf{w}^{\mathrm{t}}\mathbf{w} + \sum_{n=1}^{N} \alpha_n \big(1 - y_n(\mathbf{w}^{\mathrm{t}}\mathbf{x}_n + b)\big),$$
where the $\alpha_n \ge 0$ are the Lagrange multipliers for the constraints. Minimize $\mathcal{L}$ w.r.t. $(b, \mathbf{w})$, unconstrained; maximize w.r.t. $\boldsymbol{\alpha} \ge \mathbf{0}$.

Intuition:
- If $1 - y_n(\mathbf{w}^{\mathrm{t}}\mathbf{x}_n + b) > 0$, then $\alpha_n \to \infty$ drives $\mathcal{L} \to \infty$; so choosing $(b, \mathbf{w})$ to minimize $\mathcal{L}$ forces $1 - y_n(\mathbf{w}^{\mathrm{t}}\mathbf{x}_n + b) \le 0$.
- If $1 - y_n(\mathbf{w}^{\mathrm{t}}\mathbf{x}_n + b) < 0$, then $\alpha_n = 0$ (maximizing $\mathcal{L}$ w.r.t. $\alpha_n$): these are the non-support vectors.

Formally: use the KKT conditions to transform the primal.

Conclusion: at the optimum, $\alpha_n\big(y_n(\mathbf{w}^{\mathrm{t}}\mathbf{x}_n + b) - 1\big) = 0$ for every $n$, so $\mathcal{L} = \tfrac{1}{2}\mathbf{w}^{\mathrm{t}}\mathbf{w}$ is minimized and the constraints $y_n(\mathbf{w}^{\mathrm{t}}\mathbf{x}_n + b) - 1 \ge 0$ are satisfied.

Unconstrained Minimization w.r.t. $(b, \mathbf{w})$

$$\mathcal{L} = \tfrac{1}{2}\mathbf{w}^{\mathrm{t}}\mathbf{w} - \sum_{n=1}^{N} \alpha_n \big(y_n(\mathbf{w}^{\mathrm{t}}\mathbf{x}_n + b) - 1\big).$$

Set $\partial\mathcal{L}/\partial b = 0$: $\ \partial\mathcal{L}/\partial b = -\sum_{n=1}^{N}\alpha_n y_n = 0 \ \Longrightarrow\ \sum_{n=1}^{N}\alpha_n y_n = 0.$

Set $\partial\mathcal{L}/\partial \mathbf{w} = \mathbf{0}$: $\ \partial\mathcal{L}/\partial \mathbf{w} = \mathbf{w} - \sum_{n=1}^{N}\alpha_n y_n \mathbf{x}_n = \mathbf{0} \ \Longrightarrow\ \mathbf{w} = \sum_{n=1}^{N}\alpha_n y_n \mathbf{x}_n.$

Substitute back into $\mathcal{L}$, which is then maximized w.r.t. $\boldsymbol{\alpha} \ge \mathbf{0}$:
$$\mathcal{L} = \tfrac{1}{2}\mathbf{w}^{\mathrm{t}}\mathbf{w} - \mathbf{w}^{\mathrm{t}}\sum_{n=1}^{N}\alpha_n y_n \mathbf{x}_n - b\sum_{n=1}^{N}\alpha_n y_n + \sum_{n=1}^{N}\alpha_n = -\tfrac{1}{2}\mathbf{w}^{\mathrm{t}}\mathbf{w} + \sum_{n=1}^{N}\alpha_n = -\tfrac{1}{2}\sum_{n,m=1}^{N}\alpha_n\alpha_m y_n y_m\, \mathbf{x}_n^{\mathrm{t}}\mathbf{x}_m + \sum_{n=1}^{N}\alpha_n.$$

Equivalently, minimize
$$\tfrac{1}{2}\boldsymbol{\alpha}^{\mathrm{t}} G \boldsymbol{\alpha} - \mathbf{1}^{\mathrm{t}}\boldsymbol{\alpha} \qquad (G_{nm} = y_n y_m\, \mathbf{x}_n^{\mathrm{t}}\mathbf{x}_m) \qquad \text{subject to: } \mathbf{y}^{\mathrm{t}}\boldsymbol{\alpha} = 0,\ \ \boldsymbol{\alpha} \ge \mathbf{0}.$$

Then $\mathbf{w}^* = \sum_{n=1}^{N}\alpha_n^* y_n \mathbf{x}_n$, and for any $\alpha_s^* > 0$: $\ y_s(\mathbf{w}^{*\mathrm{t}}\mathbf{x}_s + b^*) = 1 \ \Longrightarrow\ b^* = y_s - \mathbf{w}^{*\mathrm{t}}\mathbf{x}_s.$
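The substitution step can be sanity-checked numerically: for any $\boldsymbol{\alpha}$ with $\mathbf{y}^{\mathrm{t}}\boldsymbol{\alpha} = 0$ and $\mathbf{w} = \sum_n \alpha_n y_n \mathbf{x}_n$, the Lagrangian should equal $-\tfrac{1}{2}\boldsymbol{\alpha}^{\mathrm{t}} G \boldsymbol{\alpha} + \mathbf{1}^{\mathrm{t}}\boldsymbol{\alpha}$. A small numpy sketch with random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 6, 3
X = rng.standard_normal((N, d))
y = rng.choice([-1.0, 1.0], size=N)

alpha = rng.uniform(0, 1, size=N)
alpha -= y * (alpha @ y) / N           # project so that sum_n alpha_n y_n = 0
w = (alpha * y) @ X                    # w = sum_n alpha_n y_n x_n
b = rng.standard_normal()              # b drops out once sum_n alpha_n y_n = 0

lagrangian = 0.5 * w @ w - np.sum(alpha * (y * (X @ w + b) - 1))
G = np.outer(y, y) * (X @ X.T)
dual_value = -0.5 * alpha @ G @ alpha + np.sum(alpha)
print(np.isclose(lagrangian, dual_value))   # True
```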

Example: Our Toy Data Set

$$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}, \qquad X_s = \begin{bmatrix} 0 & 0 \\ -2 & -2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix} \ \text{(signed data matrix)}, \qquad G = X_s X_s^{\mathrm{t}} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 8 & -4 & -6 \\ 0 & -4 & 4 & 6 \\ 0 & -6 & 6 & 9 \end{bmatrix}.$$

Quadratic programming form of the dual SVM:
$$\min_{\mathbf{u}}\ \tfrac{1}{2}\mathbf{u}^{\mathrm{t}} Q \mathbf{u} + \mathbf{p}^{\mathrm{t}}\mathbf{u} \quad \text{subject to: } A\mathbf{u} \ge \mathbf{c}, \qquad \text{with } \mathbf{u} = \boldsymbol{\alpha},\ \ Q = G,\ \ \mathbf{p} = -\mathbf{1}_N,\ \ A = \begin{bmatrix} \mathbf{y}^{\mathrm{t}} \\ -\mathbf{y}^{\mathrm{t}} \\ I_N \end{bmatrix},\ \ \mathbf{c} = \mathbf{0}_{N+2},$$
which encodes $\min_{\boldsymbol{\alpha}} \tfrac{1}{2}\boldsymbol{\alpha}^{\mathrm{t}} G \boldsymbol{\alpha} - \mathbf{1}^{\mathrm{t}}\boldsymbol{\alpha}$ subject to $\mathbf{y}^{\mathrm{t}}\boldsymbol{\alpha} = 0$, $\boldsymbol{\alpha} \ge \mathbf{0}$. Calling $\boldsymbol{\alpha}^* \leftarrow \text{QP}(Q, \mathbf{p}, A, \mathbf{c})$ gives
$$\boldsymbol{\alpha}^* = \begin{bmatrix} 1/2 \\ 1/2 \\ 1 \\ 0 \end{bmatrix}, \qquad \mathbf{w}^* = \sum_{n=1}^{N}\alpha_n^* y_n \mathbf{x}_n = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \qquad b^* = y_1 - \mathbf{w}^{*\mathrm{t}}\mathbf{x}_1 = -1, \qquad \gamma = \frac{1}{\lVert \mathbf{w}^* \rVert} = \frac{1}{\sqrt{2}}.$$

[Figure: the four data points and the optimal separator $x_1 - x_2 - 1 = 0$, annotated with $\alpha_1^* = \tfrac{1}{2}$, $\alpha_2^* = \tfrac{1}{2}$, $\alpha_3^* = 1$, $\alpha_4^* = 0$.]

Non-support vectors have $\alpha_n^* = 0$; only support vectors can have $\alpha_n^* > 0$.
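To make the toy example concrete, here is a sketch that solves this dual QP numerically. It assumes the cvxopt package as the QP solver, uses the labels $\mathbf{y} = (-1,-1,+1,+1)$ as reconstructed above, and adds a tiny ridge to $G$ for numerical stability; these are illustrative choices, not part of the slide.

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy data set from the slide
X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., 1., 1.])
N = len(y)

# Signed data matrix and Gram matrix G_nm = y_n y_m x_n^t x_m
Xs = y[:, None] * X
G = Xs @ Xs.T

# Dual QP: minimize (1/2) a^t G a - 1^t a  s.t.  y^t a = 0,  a >= 0
P = matrix(G + 1e-8 * np.eye(N))   # tiny ridge for numerical stability
q = matrix(-np.ones(N))
Gineq = matrix(-np.eye(N))         # -a <= 0  encodes  a >= 0
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, -1))
b = matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.array(solvers.qp(P, q, Gineq, h, A, b)['x']).ravel()

# Recover the primal solution from the support vectors
w = Xs.T @ alpha
s = np.argmax(alpha)               # any index with alpha_s > 0
b_star = y[s] - w @ X[s]
print(alpha.round(3), w.round(3), round(b_star, 3))
# should be roughly alpha = [0.5, 0.5, 1, 0], w = [1, -1], b = -1
```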

Dual QP Algorithm for the Hard-Margin Linear-SVM

1: Input: $X$, $\mathbf{y}$.
2: Let $\mathbf{p} = -\mathbf{1}_N$ be the $N$-vector of minus ones and $\mathbf{c} = \mathbf{0}_{N+2}$ the $(N{+}2)$-vector of zeros. Construct matrices $Q$ and $A$:
$$X_s = \begin{bmatrix} y_1\mathbf{x}_1^{\mathrm{t}} \\ \vdots \\ y_N\mathbf{x}_N^{\mathrm{t}} \end{bmatrix} \ \text{(signed data matrix)}, \qquad Q = X_s X_s^{\mathrm{t}}, \qquad A = \begin{bmatrix} \mathbf{y}^{\mathrm{t}} \\ -\mathbf{y}^{\mathrm{t}} \\ I_N \end{bmatrix}.$$
   This sets up $\min_{\boldsymbol{\alpha}} \tfrac{1}{2}\boldsymbol{\alpha}^{\mathrm{t}} G \boldsymbol{\alpha} - \mathbf{1}^{\mathrm{t}}\boldsymbol{\alpha}$ subject to $\mathbf{y}^{\mathrm{t}}\boldsymbol{\alpha} = 0$, $\boldsymbol{\alpha} \ge \mathbf{0}$.
3: $\boldsymbol{\alpha}^* \leftarrow \text{QP}(Q, \mathbf{p}, A, \mathbf{c})$. (Some packages allow equality and bound constraints directly, so this type of QP can be passed to them as-is.)
4: Return $\ \mathbf{w}^* = \sum_{\alpha_n^* > 0} \alpha_n^* y_n \mathbf{x}_n\ $ and $\ b^* = y_s - \mathbf{w}^{*\mathrm{t}}\mathbf{x}_s\ $ (for any $\alpha_s^* > 0$).
5: The final hypothesis is $g(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^{*\mathrm{t}}\mathbf{x} + b^*)$.

Primal Versus Dual (Non-Separable)

Primal:
$$\min_{b,\mathbf{w},\boldsymbol{\xi}}\ \tfrac{1}{2}\mathbf{w}^{\mathrm{t}}\mathbf{w} + C\sum_{n=1}^{N}\xi_n \qquad \text{subject to: } y_n(\mathbf{w}^{\mathrm{t}}\mathbf{x}_n + b) \ge 1 - \xi_n,\ \ \xi_n \ge 0,\ \ n = 1,\dots,N;$$
final hypothesis $g(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^{\mathrm{t}}\mathbf{x} + b)$; there are $N + d + 1$ optimization variables $(b, \mathbf{w}, \boldsymbol{\xi})$.

Dual:
$$\min_{\boldsymbol{\alpha}}\ \tfrac{1}{2}\boldsymbol{\alpha}^{\mathrm{t}} G \boldsymbol{\alpha} - \mathbf{1}^{\mathrm{t}}\boldsymbol{\alpha} \qquad \text{subject to: } \mathbf{y}^{\mathrm{t}}\boldsymbol{\alpha} = 0,\ \ \mathbf{0} \le \boldsymbol{\alpha} \le C\,\mathbf{1};$$
$$\mathbf{w}^* = \sum_{n=1}^{N}\alpha_n^* y_n \mathbf{x}_n, \qquad b^* = y_s - \mathbf{w}^{*\mathrm{t}}\mathbf{x}_s \ \ (C > \alpha_s^* > 0), \qquad g(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{n=1}^{N}\alpha_n^* y_n\, \mathbf{x}_n^{\mathrm{t}}(\mathbf{x} - \mathbf{x}_s) + y_s\Big);$$
there are $N$ optimization variables $\boldsymbol{\alpha}$. The only change from the separable case is the upper bound $C$ on each $\alpha_n$.
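From a QP solver's point of view, the soft-margin dual differs from the hard-margin dual only in the extra upper bound $\boldsymbol{\alpha} \le C\,\mathbf{1}$. A minimal sketch of how that box constraint can be encoded for a generic solver that accepts inequalities of the form $A_{\text{ineq}}\,\boldsymbol{\alpha} \le \mathbf{c}$ (the function name and calling convention are illustrative):

```python
import numpy as np

def box_constraints(N, C):
    """Encode 0 <= alpha <= C*1 as A_ineq @ alpha <= c for a generic QP solver."""
    A_ineq = np.vstack([-np.eye(N),   # -alpha <= 0  (i.e. alpha >= 0)
                         np.eye(N)])  #  alpha <= C*1
    c = np.concatenate([np.zeros(N), C * np.ones(N)])
    return A_ineq, c
```

With cvxopt, for instance, these two arrays would be passed as the inequality arguments of `solvers.qp`.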

Dual SVM is an Inner Product Algorithm (X-space)

$$\min_{\boldsymbol{\alpha}}\ \tfrac{1}{2}\boldsymbol{\alpha}^{\mathrm{t}} G \boldsymbol{\alpha} - \mathbf{1}^{\mathrm{t}}\boldsymbol{\alpha} \qquad \text{subject to: } \mathbf{y}^{\mathrm{t}}\boldsymbol{\alpha} = 0,\ \ \mathbf{0} \le \boldsymbol{\alpha} \le C\,\mathbf{1}, \qquad G_{nm} = y_n y_m\, (\mathbf{x}_n^{\mathrm{t}}\mathbf{x}_m);$$
$$g(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{\alpha_n^* > 0} \alpha_n^* y_n\, (\mathbf{x}_n^{\mathrm{t}}\mathbf{x}) + b^*\Big), \qquad b^* = y_s - \sum_{\alpha_n^* > 0} \alpha_n^* y_n\, (\mathbf{x}_n^{\mathrm{t}}\mathbf{x}_s) \ \ (C > \alpha_s^* > 0).$$

The data enter only through inner products.

Dual SVM is an Inner Product Algorithm (Z-space)

After the nonlinear transform, replace $\mathbf{x}$ by $\mathbf{z} = \Phi(\mathbf{x})$:
$$\min_{\boldsymbol{\alpha}}\ \tfrac{1}{2}\boldsymbol{\alpha}^{\mathrm{t}} G \boldsymbol{\alpha} - \mathbf{1}^{\mathrm{t}}\boldsymbol{\alpha} \qquad \text{subject to: } \mathbf{y}^{\mathrm{t}}\boldsymbol{\alpha} = 0,\ \ \mathbf{0} \le \boldsymbol{\alpha} \le C\,\mathbf{1}, \qquad G_{nm} = y_n y_m\, (\mathbf{z}_n^{\mathrm{t}}\mathbf{z}_m);$$
$$g(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{\alpha_n^* > 0} \alpha_n^* y_n\, (\mathbf{z}_n^{\mathrm{t}}\mathbf{z}) + b^*\Big), \qquad b^* = y_s - \sum_{\alpha_n^* > 0} \alpha_n^* y_n\, (\mathbf{z}_n^{\mathrm{t}}\mathbf{z}_s) \ \ (C > \alpha_s^* > 0).$$

Can we compute $\mathbf{z}^{\mathrm{t}}\mathbf{z}'$ without needing $\mathbf{z} = \Phi(\mathbf{x})$, that is, without ever visiting Z-space?


The Kernel $K(\cdot,\cdot)$ for a Transform $\Phi(\cdot)$

The kernel tells you how to compute the Z-space inner product directly from X-space:
$$K(\mathbf{x}, \mathbf{x}') = \Phi(\mathbf{x})^{\mathrm{t}}\Phi(\mathbf{x}') = \mathbf{z}^{\mathrm{t}}\mathbf{z}'.$$

Example: the 2nd-order polynomial transform
$$\Phi(\mathbf{x}) = \big(x_1,\ x_2,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\big),$$
$$K(\mathbf{x}, \mathbf{x}') = \Phi(\mathbf{x})^{\mathrm{t}}\Phi(\mathbf{x}') = x_1 x_1' + x_2 x_2' + x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 \qquad \text{(computed directly: } O(d^2)\text{)}$$
$$= \big(\tfrac{1}{2} + \mathbf{x}^{\mathrm{t}}\mathbf{x}'\big)^2 - \tfrac{1}{4}, \qquad \text{computed quickly in X-space, in } O(d).$$
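A quick numerical check of this identity, as a standalone numpy sketch using only the transform and kernel stated above (the random test points are arbitrary):

```python
import numpy as np

def phi2(x):
    """2nd-order transform from the slide: (x1, x2, x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k_poly(x, xp):
    """Same kernel computed directly in X-space, in O(d)."""
    return (0.5 + x @ xp) ** 2 - 0.25

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)
print(np.isclose(phi2(x) @ phi2(xp), k_poly(x, xp)))   # True
```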

The Gaussian Kernel is Infinite-Dimensional

$$K(\mathbf{x}, \mathbf{x}') = e^{-\gamma\, \lVert \mathbf{x} - \mathbf{x}' \rVert^2}.$$

Example: the Gaussian kernel in one dimension (with $\gamma = 1$) corresponds to the infinite-dimensional transform
$$\Phi(x) = e^{-x^2}\left(\sqrt{\tfrac{2^0}{0!}},\ \sqrt{\tfrac{2^1}{1!}}\,x,\ \sqrt{\tfrac{2^2}{2!}}\,x^2,\ \sqrt{\tfrac{2^3}{3!}}\,x^3,\ \sqrt{\tfrac{2^4}{4!}}\,x^4,\ \dots\right),$$
and indeed
$$K(x, x') = \Phi(x)^{\mathrm{t}}\Phi(x') = e^{-x^2} e^{-x'^2} \sum_{i=0}^{\infty} \frac{(2 x x')^i}{i!} = e^{-x^2} e^{-x'^2} e^{2 x x'} = e^{-(x - x')^2}.$$
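A short numerical check that the (truncated) series above recovers the 1-d Gaussian kernel; the test points and the number of terms are arbitrary illustrative choices:

```python
import numpy as np
from math import factorial

def k_gauss(x, xp):
    """1-d Gaussian kernel with gamma = 1."""
    return np.exp(-(x - xp) ** 2)

def k_series(x, xp, terms=30):
    """Truncated Phi(x)^t Phi(x') = e^{-x^2} e^{-x'^2} * sum_i (2 x x')^i / i!."""
    s = sum((2 * x * xp) ** i / factorial(i) for i in range(terms))
    return np.exp(-x ** 2) * np.exp(-xp ** 2) * s

print(k_gauss(0.3, -1.2), k_series(0.3, -1.2))   # should agree to many decimals
```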

The Kernel Allows Us to Bypass Z-space

1: Input: $X$, $\mathbf{y}$, regularization parameter $C$, kernel $K(\cdot,\cdot)$.
2: Compute $G$: $\ G_{nm} = y_n y_m K(\mathbf{x}_n, \mathbf{x}_m)$.
3: Solve the QP: $\ \min_{\boldsymbol{\alpha}} \tfrac{1}{2}\boldsymbol{\alpha}^{\mathrm{t}} G \boldsymbol{\alpha} - \mathbf{1}^{\mathrm{t}}\boldsymbol{\alpha}$ subject to $\mathbf{y}^{\mathrm{t}}\boldsymbol{\alpha} = 0$, $\mathbf{0} \le \boldsymbol{\alpha} \le C\,\mathbf{1}$, giving $\boldsymbol{\alpha}^*$.
4: Pick an index $s$ with $C > \alpha_s^* > 0$ and compute $\ b^* = y_s - \sum_{\alpha_n^* > 0} \alpha_n^* y_n K(\mathbf{x}_n, \mathbf{x}_s)$.
5: The final hypothesis is
$$g(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{\alpha_n^* > 0} \alpha_n^* y_n K(\mathbf{x}_n, \mathbf{x}) + b^*\Big).$$
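Here is a compact end-to-end sketch of this routine. It assumes the cvxopt QP solver, an RBF kernel as the illustrative choice of $K(\cdot,\cdot)$, and a small ridge on $G$ for numerical stability; the function names, tolerances, and hyperparameters are illustrative, not from the slides.

```python
import numpy as np
from cvxopt import matrix, solvers

def rbf(x, xp, gamma=1.0):
    """Gaussian kernel K(x, x') = exp(-gamma ||x - x'||^2) (illustrative choice)."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def kernel_svm_fit(X, y, C, K=rbf):
    N = len(y)
    # Step 2: G_nm = y_n y_m K(x_n, x_m)
    Kmat = np.array([[K(X[n], X[m]) for m in range(N)] for n in range(N)])
    G = np.outer(y, y) * Kmat
    # Step 3: minimize (1/2) a^t G a - 1^t a  s.t.  y^t a = 0,  0 <= a <= C
    P, q = matrix(G + 1e-8 * np.eye(N)), matrix(-np.ones(N))
    Gineq = matrix(np.vstack([-np.eye(N), np.eye(N)]))
    h = matrix(np.concatenate([np.zeros(N), C * np.ones(N)]))
    A, b = matrix(y.astype(float).reshape(1, -1)), matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.array(solvers.qp(P, q, Gineq, h, A, b)['x']).ravel()
    # Step 4: pick s with C > alpha_s > 0 (assumes a margin support vector exists)
    sv = alpha > 1e-6
    s = np.where(sv & (alpha < C - 1e-6))[0][0]
    b_star = y[s] - sum(alpha[n] * y[n] * K(X[n], X[s]) for n in np.where(sv)[0])
    # Step 5: the final hypothesis touches the data only through K(., .)
    def g(x):
        return np.sign(sum(alpha[n] * y[n] * K(X[n], x) for n in np.where(sv)[0]) + b_star)
    return g

# Usage on the toy data from earlier in the lecture
X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., 1., 1.])
g = kernel_svm_fit(X, y, C=10.0)
print([g(x) for x in X])   # should reproduce the training labels
```

Note that both training and prediction use only kernel evaluations $K(\cdot,\cdot)$, which is exactly what lets us bypass Z-space.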

The Kernel-Support Vector Machine

Going to a high-dimensional Z-space raises two concerns: overfitting and computation.

Overfitting: with high $\tilde{d}$ the separator is complicated, but the SVM keeps the number of support vectors small, so the effective complexity stays low.

Computation: with high $\tilde{d}$ the computation is expensive or infeasible, but the kernel $K(\cdot,\cdot)$ delivers the needed inner products from X-space, so it is computationally feasible to go to high $\tilde{d}$.

Together, the kernel-SVM can go to high (even infinite) $\tilde{d}$.