Support Vector Machines
Machine Learning Series
Jerry Jeychandra, Blohm Lab
Outline
Main goal: to understand how support vector machines (SVMs) perform optimal classification for labelled data sets, with a quick peek at the implementation level.
1. What is a support vector machine?
2. Formulating the SVM problem
3. Optimization using Sequential Minimal Optimization
4. Use of kernels for non-linear classification
5. Implementation of SVM in the MATLAB environment
6. Stochastic Gradient Descent (extra)
Support Vector Machines (SVM)
[Plot: two labelled classes of points on axes x1 and x2, with a poorly placed separating line.]
Not a very good decision boundary!
Support Vector Machines (SVM)
A great decision boundary!
[Plot: the line ax_1 + bx_2 + c = 0 separating the two classes, with ax_1 + bx_2 + c > 0 on one side and ax_1 + bx_2 + c < 0 on the other.]
In machine learning notation the boundary is w^T x + b = 0, with w^T x + b > 0 on one side and w^T x + b < 0 on the other.
Support Vector Machines (SVM)
[Plot: the maximum-margin boundary w^T x + b = 0, with margin lines w^T x + b = +1 and w^T x + b = −1 on either side.]
Support Vector Machines (SVM)
The margin lines w^T x + b = +1 and w^T x + b = −1 lie at distances +d and −d from the boundary w^T x + b = 0.
How do we maximize d?
Support Vector Machines (SVM)
How do we maximize d? For a point on the margin (where w^T x + b = 1):
d = |w^T x + b| / ||w|| = 1 / ||w||
So maximizing d means minimizing ||w||; equivalently, minimize (1/2)||w||^2.
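To spell out where d = 1/||w|| comes from, here is the standard point-to-hyperplane distance argument, added for completeness (it is implicit on the slide above):

\[
d(x) = \frac{|w^{T}x + b|}{\lVert w \rVert},
\qquad\text{so on the margin, } w^{T}x + b = \pm 1 \;\Rightarrow\; d = \frac{1}{\lVert w \rVert}.
\]
\[
\max_{w,b}\, \frac{1}{\lVert w \rVert}
\;\Longleftrightarrow\;
\min_{w,b}\, \lVert w \rVert
\;\Longleftrightarrow\;
\min_{w,b}\, \tfrac{1}{2}\lVert w \rVert^{2},
\]

the squared form being used because it is convex and differentiable everywhere.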
SVM Initial Constraints
If our data point is labelled y^(i) = 1:  w^T x^(i) + b ≥ 1
If y^(i) = −1:  w^T x^(i) + b ≤ −1
This can be succinctly summarized as: y^(i)(w^T x^(i) + b) ≥ 1   (hard-margin SVM)
We're going to allow for some margin violation: y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i, i.e. y^(i)(w^T x^(i) + b) − 1 + ξ_i ≥ 0, with ξ_i ≥ 0   (soft-margin SVM)
Support Vector Machines (SVM)
[Plot: a point inside the margin; its violation distance from its margin line w^T x + b = ±1 is ξ/||w||.]
Optimization Objective
Minimize: (1/2)||w||^2 + C Σ_i ξ_i   (the C Σ_i ξ_i term penalizes margin violations)
Subject to: y^(i)(w^T x^(i) + b) − 1 + ξ_i ≥ 0 and ξ_i ≥ 0
Outline
Main goal: to understand how support vector machines (SVMs) perform optimal classification for labelled data sets, with a quick peek at the implementation level.
1. What is a support vector machine?
2. Formulating the SVM problem
3. Optimization using Sequential Minimal Optimization
4. Use of kernels for non-linear classification
5. Implementation of SVM in the MATLAB environment
6. Stochastic Gradient Descent (extra)
Optimization Objective
Minimize (1/2)||w||^2 + C Σ_i ξ_i  s.t.  y^(i)(w^T x^(i) + b) − 1 + ξ_i ≥ 0 and ξ_i ≥ 0.
We need to use the Lagrangian. Using a Lagrangian with inequality constraints requires us to fulfill the Karush-Kuhn-Tucker (KKT) conditions.
Karush-Kuhn-Tucker Conditions
There must exist a w* that solves the primal problem, while α* and b* are solutions to the dual problem. All three parameters must satisfy the following conditions:
1. ∂L(w, α, b)/∂w_i = 0
2. ∂L(w, α, b)/∂b = 0
3. α_i g_i(w) = 0
4. g_i(w) ≤ 0
5. α_i ≥ 0
6. ξ_i ≥ 0
7. r_i ξ_i = 0
Condition 3: if α_i* is a non-zero number, g_i must be 0!
Recall: g_i(w) = 1 − ξ_i − y^(i)(w^T x^(i) + b) ≤ 0.
An example x must lie directly on the functional margin in order for g_i(w) to equal 0; there α > 0. These are our support vectors!
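One standard consequence of condition 3 worth spelling out (not stated explicitly on the slide, but it follows directly from the KKT conditions above):

\[
\alpha_i = 0 \;\Rightarrow\; y^{(i)}(w^{T}x^{(i)} + b) \ge 1 \quad\text{(point off the margin: no influence on } w\text{)}
\]
\[
0 < \alpha_i < C \;\Rightarrow\; y^{(i)}(w^{T}x^{(i)} + b) = 1 \quad\text{(point exactly on the margin: a support vector)}
\]
\[
\alpha_i = C \;\Rightarrow\; y^{(i)}(w^{T}x^{(i)} + b) \le 1 \quad\text{(point on or inside the margin; possibly misclassified)}
\]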
Optimization Objective
Minimize (1/2)||w||^2 + C Σ_i ξ_i  subject to  y^(i)(w^T x^(i) + b) − 1 + ξ_i ≥ 0 and ξ_i ≥ 0.
We need to use the Lagrangian; recall the general form L(x, α) = f(x) + Σ_i λ_i h_i(x):
L_P(w, b, α, ξ, r) = (1/2)||w||^2 + C Σ_i ξ_i − Σ_i α_i [y^(i)(w^T x^(i) + b) − 1 + ξ_i] − Σ_i r_i ξ_i
minimization function: (1/2)||w||^2 + C Σ_i ξ_i
margin constraint: Σ_i α_i [y^(i)(w^T x^(i) + b) − 1 + ξ_i]
error constraint: Σ_i r_i ξ_i
We then compute min_{w,b} L_P.
The Problem with the Primal
L_P = (1/2)||w||^2 + C Σ_i ξ_i − Σ_i α_i [y^(i)(w^T x^(i) + b) − 1 + ξ_i] − Σ_i r_i ξ_i
We could just optimize the primal equation and be finished with SVM optimization, BUT we would miss out on one of the most important aspects of the SVM: the kernel trick.
We need the Lagrangian dual!
http://stats.stackexchange.com/questions/19181/why-bother-with-the-dual-problem-when-fitting-svm
Finding the Dual
L_P = (1/2)||w||^2 + C Σ_i ξ_i − Σ_i α_i [y^(i)(w^T x^(i) + b) − 1 + ξ_i] − Σ_i r_i ξ_i
First, compute the minimizing w, b and ξ by setting the partial derivatives to zero:
∇_w L(w, b, α) = w − Σ_i α_i y^(i) x^(i) = 0  ⇒  w = Σ_i α_i y^(i) x^(i)   (we can solve for w and substitute it back in!)
∂L(w, b, α)/∂b = −Σ_i α_i y^(i) = 0   (new constraint: Σ_i α_i y^(i) = 0)
∂L(w, b, α)/∂ξ_i = C − α_i − r_i = 0  ⇒  α_i = C − r_i   (with r_i ≥ 0, this gives the new constraint 0 ≤ α_i ≤ C)
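Writing out the substitution that produces the dual on the next slide (a sketch of the algebra, using the three results above):

\begin{aligned}
\tfrac{1}{2}\lVert w \rVert^{2} &= \tfrac{1}{2}\sum_{i}\sum_{j} \alpha_i \alpha_j\, y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle
&&\text{(from } w = \textstyle\sum_i \alpha_i y^{(i)} x^{(i)}\text{)}\\
\sum_i \alpha_i y^{(i)} b &= 0
&&\text{(from } \textstyle\sum_i \alpha_i y^{(i)} = 0\text{)}\\
(C - \alpha_i - r_i)\,\xi_i &= 0 \;\;\text{for each } i
&&\text{(from } \alpha_i = C - r_i\text{)}
\end{aligned}

so the b term and every ξ term cancel, leaving only the α terms:
L_D(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩.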
Lagrangian Dual
Substituting w = Σ_i α_i y^(i) x^(i) into L_P:
L_D = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩
s.t. ∀i: Σ_i α_i y^(i) = 0, ξ_i ≥ 0 and 0 ≤ α_i ≤ C (from the KKT conditions and the previous derivation)
Lagrangian Dual
max_α L_D = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩
s.t. ∀i: Σ_i α_i y^(i) = 0, ξ_i ≥ 0, 0 ≤ α_i ≤ C, and the KKT conditions.
- Computes the inner product of x^(i) and x^(j); we no longer rely on w or b.
- The inner product can be replaced with K(x, z) for kernels.
- Can solve for α explicitly.
- All points not on the functional margin have α = 0: more efficient!
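Once the α's are known, classifying a new point needs only kernel evaluations against the support vectors. A minimal MATLAB sketch (the function and variable names here are mine, not from the deck):

% Decision rule for a trained soft-margin SVM:
%   f(x) = sum_i alpha_i * y_i * K(x_i, x) + b,  predict sign(f(x))
% X: m-by-n training matrix, y: m-by-1 labels in {-1,+1},
% alphas: m-by-1 dual variables, b: intercept,
% kernel: handle operating on matrices, e.g. @(U, V) U * V' for linear.
function pred = svmDecision(Xtest, X, y, alphas, b, kernel)
    sv = alphas > 1e-8;                  % only support vectors contribute
    K = kernel(Xtest, X(sv, :));         % kernel values vs. support vectors
    f = K * (alphas(sv) .* y(sv)) + b;   % weighted sum plus intercept
    pred = sign(f);                      % labels in {-1, +1}
end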
Outline
Main goal: to understand how support vector machines (SVMs) perform optimal classification for labelled data sets, with a quick peek at the implementation level.
1. What is a support vector machine?
2. Formulating the SVM problem
3. Optimization using Sequential Minimal Optimization
4. Use of kernels for non-linear classification
5. Implementation of SVM in the MATLAB environment
6. Stochastic Gradient Descent (extra)
Sequential Minimal Optimization
max_α L_D = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩
Ideally we would hold all α but one constant and optimize over it, but the constraint Σ_i α_i y^(i) = 0 gives
α_1 = −y^(1) Σ_{i=2} α_i y^(i)
so fixing all the other α's fixes α_1 as well.
Solution: change two α at a time! (Platt, 1998)
Sequential Minimal Optimization Algorithm
Initialize all α to 0.
Repeat {
1. Select an α_j that violates the KKT conditions, and an additional α_k to update.
2. Re-optimize L_D(α) with respect to the two chosen α's while holding all other α's constant, and compute w and b.
}
In order to keep Σ_i α_i y^(i) = 0 true: α_j y^(j) + α_k y^(k) = ζ. Even after the update, this linear combination must remain constant! (Platt, 1998)
Sequential Minimal Optimization
0 ≤ α ≤ C from the box constraint; the linear constraint α_j y^(j) + α_k y^(k) = ζ confines the pair to a line. Clip the updated values if they exceed the endpoints H or L.
[Plot: the (α_j, α_k) box [0, C] × [0, C] with the constraint line and the clipping endpoints L and H, for the two cases below.]
If y^(j) ≠ y^(k), then α_j − α_k = ζ.
If y^(j) = y^(k), then α_j + α_k = ζ.
Sequential Minimal Optimization
We can analytically solve for the new α, w and b at each update step (closed-form computation):
α_k^new = α_k^old + y^(k)(E_j − E_k)/η, where E_i = f(x^(i)) − y^(i) is the prediction error and η is the second derivative of the dual along the constraint line, η = K(x_j, x_j) + K(x_k, x_k) − 2K(x_j, x_k).
α_j^new = α_j^old + y^(j) y^(k)(α_k^old − α_k^new,clipped)
b_1 = E_j + y^(j)(α_j^new − α_j^old) K(x_j, x_j) + y^(k)(α_k^new,clipped − α_k^old) K(x_j, x_k) + b
b_2 = E_k + y^(j)(α_j^new − α_j^old) K(x_j, x_k) + y^(k)(α_k^new,clipped − α_k^old) K(x_k, x_k) + b
b = (b_1 + b_2)/2
w = Σ_i α_i y^(i) x^(i)   (Platt, 1998)
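For concreteness, here is a condensed MATLAB sketch of a simplified SMO loop, in the spirit of the MIT/CS229 SMO handout linked in the resources (the second multiplier is chosen at random rather than by Platt's heuristic, sign conventions follow that handout, and all names are mine):

% Simplified SMO for a linear kernel. X: m-by-n data, y: m-by-1 in {-1,+1}.
% Returns dual variables alphas and intercept b. A sketch, not Platt's full algorithm.
function [alphas, b] = simpleSMO(X, y, C, tol, maxPasses)
    m = size(X, 1);
    alphas = zeros(m, 1); b = 0;
    K = X * X';                                      % precompute kernel matrix
    passes = 0;
    while passes < maxPasses
        changed = 0;
        for j = 1:m
            Ej = (alphas .* y)' * K(:, j) + b - y(j);        % error on example j
            if (y(j)*Ej < -tol && alphas(j) < C) || (y(j)*Ej > tol && alphas(j) > 0)
                k = j; while k == j, k = randi(m); end       % pick a second index
                Ek = (alphas .* y)' * K(:, k) + b - y(k);
                aj = alphas(j); ak = alphas(k);
                if y(j) == y(k), L = max(0, aj+ak-C); H = min(C, aj+ak);
                else,            L = max(0, ak-aj);   H = min(C, C+ak-aj); end
                if L == H, continue; end
                eta = 2*K(j,k) - K(j,j) - K(k,k);            % curvature along the line
                if eta >= 0, continue; end
                ak = ak - y(k)*(Ej - Ek)/eta;                % unclipped update
                ak = min(max(ak, L), H);                     % clip to [L, H]
                if abs(ak - alphas(k)) < 1e-5, continue; end
                aj = aj + y(j)*y(k)*(alphas(k) - ak);        % keep sum constraint
                b1 = b - Ej - y(j)*(aj-alphas(j))*K(j,j) - y(k)*(ak-alphas(k))*K(j,k);
                b2 = b - Ek - y(j)*(aj-alphas(j))*K(j,k) - y(k)*(ak-alphas(k))*K(k,k);
                if aj > 0 && aj < C, b = b1;
                elseif ak > 0 && ak < C, b = b2;
                else, b = (b1 + b2)/2; end
                alphas(j) = aj; alphas(k) = ak;
                changed = changed + 1;
            end
        end
        if changed == 0, passes = passes + 1; else, passes = 0; end
    end
end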
Outline
Main goal: to understand how support vector machines (SVMs) perform optimal classification for labelled data sets, with a quick peek at the implementation level.
1. What is a support vector machine?
2. Formulating the SVM problem
3. Optimization using Sequential Minimal Optimization
4. Use of kernels for non-linear classification
5. Implementation of SVM in the MATLAB environment
6. Stochastic Gradient Descent (extra)
Kernels
Recall: max_α L_D = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩
We can generalize this using a kernel function K(x_i, x_j)!
K(x_i, x_j) = φ(x_i)^T φ(x_j), where φ is a feature-mapping function.
It turns out that we don't need to explicitly compute φ(x), thanks to the kernel trick!
Kernel Trick
Suppose K(x, z) = (x^T z)^2. Then:
K(x, z) = (Σ_i x_i z_i)(Σ_j x_j z_j)
        = Σ_i Σ_j x_i x_j z_i z_j
        = Σ_{i,j} (x_i x_j)(z_i z_j)
So K corresponds to the feature map φ(x) whose entries are all products x_i x_j.
Representing each feature explicitly: O(n^2). Kernel trick (one dot product, then square): O(n).
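A quick numerical check of this identity in MATLAB (the explicit feature map phi below is my own illustration):

% Verify (x'*z)^2 == phi(x)'*phi(z) for the quadratic kernel,
% where phi(x) is the n^2-dimensional vector of all products x_i*x_j.
n = 4;
x = randn(n, 1); z = randn(n, 1);
phi = @(v) reshape(v * v', [], 1);    % explicit O(n^2) feature map
kTrick = (x' * z)^2;                  % O(n): one dot product, then square
kExplicit = phi(x)' * phi(z);         % O(n^2) explicit features
fprintf('kernel trick: %.6f, explicit: %.6f\n', kTrick, kExplicit);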
Types of Kernels
Linear kernel: K(x, z) = x^T z
Gaussian kernel (radial basis function): K(x, z) = exp(−||x − z||^2 / (2σ^2))
Polynomial kernel: K(x, z) = (x^T z + c)^d
The dual becomes: max_α L_D = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y^(i) y^(j) K(x_i, x_j)
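For reference, each of these kernels is a one-liner in MATLAB for a pair of column vectors (a sketch; the handle names are mine):

% Kernels for two column vectors x and z:
kLinear   = @(x, z) x' * z;
kGaussian = @(x, z, sigma) exp(-((x - z)' * (x - z)) / (2 * sigma^2));
kPoly     = @(x, z, c, d) (x' * z + c)^d;

% Example: two nearby points are highly similar under the Gaussian kernel.
x = [1; 2]; z = [1.1; 1.9];
kGaussian(x, z, 0.5)   % near 1 for nearby points, approaches 0 far apart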
Outline
Main goal: to understand how support vector machines (SVMs) perform optimal classification for labelled data sets, with a quick peek at the implementation level.
1. What is a support vector machine?
2. Formulating the SVM problem
3. Optimization using Sequential Minimal Optimization
4. Use of kernels for non-linear classification
5. Implementation of SVM in the MATLAB environment
6. Stochastic Gradient Descent (extra)
Example 1: Linear Kernel
[Figure: linearly separable training data with the learned decision boundary.]
Example 1: Linear Kernel
In MATLAB, I used svmtrain:
model = svmtrain(x, y, C, @linearkernel, 1e-3, 500);
C = 1000, KKT tolerance = 1e-3, 500 max iterations
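To evaluate the fit, the matching prediction call would look something like the sketch below, assuming a predict helper in the style of the Coursera ML exercise code that this deck's svmtrain appears to come from (svmpredict here is my assumed name, not a built-in MATLAB function):

% Predict labels for held-out points with the trained model:
pred = svmpredict(model, xtest);              % rows of xtest are test points
accuracy = mean(double(pred == ytest));       % fraction classified correctly
fprintf('test accuracy: %.2f%%\n', accuracy * 100);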
Example 2: Gaussian Kernel (RBF)
[Figure: training data that is not linearly separable.]
Example 2: Gaussian Kernel (RBF)
In MATLAB, the Gaussian kernel:
gsim = exp(-1*((x1-x2)'*(x1-x2))/(2*sigma^2));
model = svmtrain(x, y, C, @(x1, x2) gaussiankernel(x1, x2, sigma), 0.1, 300);
C = 1, KKT tolerance = 0.1, 300 max iterations
Example 2: Gaussian Kernel (RBF)
In MATLAB, the Gaussian kernel:
gsim = exp(-1*((x1-x2)'*(x1-x2))/(2*sigma^2));
model = svmtrain(x, y, C, @(x1, x2) gaussiankernel(x1, x2, sigma), 0.001, 300);
C = 1, KKT tolerance = 0.001, 300 max iterations
Example 2: Gaussian Kernel (RBF)
In MATLAB, the Gaussian kernel:
gsim = exp(-1*((x1-x2)'*(x1-x2))/(2*sigma^2));
model = svmtrain(x, y, C, @(x1, x2) gaussiankernel(x1, x2, sigma), 0.001, 300);
C = 10, KKT tolerance = 0.001, 300 max iterations
Resources
http://research.microsoft.com/pubs/69644/tr-98-14.pdf
ftp://www.ai.mit.edu/pub/users/tlp/projects/svm/svm-smo/smo.pdf
http://www.cs.toronto.edu/~urtasun/courses/csc2515/09svm-2515.pdf
http://www.pstat.ucsb.edu/student%20seminar%20doc/svm2.pdf
http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/lectures/lecture11.pdf
http://stats.stackexchange.com/questions/19181/why-bother-with-the-dual-problem-when-fitting-svm
http://cs229.stanford.edu/notes/cs229-notes3.pdf
http://www.svms.org/tutorials/berwick2003.pdf
http://www.med.nyu.edu/chibi/sites/default/files/chibi/final.pdf
https://share.coursera.org/wiki/index.php/ML:Main
https://www.youtube.com/watch?v=vqovichkm7i
Outline
Main goal: to understand how support vector machines (SVMs) perform optimal classification for labelled data sets, with a quick peek at the implementation level.
1. What is a support vector machine?
2. Formulating the SVM problem
3. Optimization using Sequential Minimal Optimization
4. Use of kernels for non-linear classification
5. Implementation of SVM in the MATLAB environment
6. Stochastic Gradient Descent (extra)
Hinge-Loss Primal SVM
Recall the original problem: minimize (1/2) w^T w + C Σ_i ξ_i  s.t.  y^(i)(w^T x^(i) + b) − 1 + ξ_i ≥ 0.
Re-write in terms of the hinge-loss function:
L_P = (λ/2) w^T w + (1/n) Σ_i h(1 − y^(i)(w^T x^(i) + b))
regularization term + hinge-loss cost function
Hinge Loss Function
h(1 − y^(i)(w^T x^(i) + b)), where h(z) = max(0, z): the loss penalizes a point when the signs of f(x^(i)) and y^(i) don't match (or the margin is violated).
[Plot: hinge loss versus y^(i)·f(x^(i)) over the range −3 to 3; zero for values ≥ 1, increasing linearly below 1.]
Hinge-Loss Primal SVM
Hinge-loss primal: L_P = (λ/2) w^T w + (1/n) Σ_i h(1 − y^(i)(w^T x^(i) + b))
Compute the gradient (sub-gradient):
∇_w = λw − (1/n) Σ_i y^(i) x^(i) · 1[y^(i)(w^T x^(i) + b) < 1]
where 1[·] is the indicator function: the loss term contributes 0 if the point is classified correctly with margin, and the hinge slope otherwise.
Stochastic Gradient Descent
We use random sampling to pick out data points, replacing the average over all points with the expectation over a uniformly sampled index i:
∇_w ≈ λw − E_{i∼U}[ y^(i) x^(i) · 1[y^(i)(w^T x^(i) + b) < 1] ]
Use this to compute the update rule:
w_t ← w_{t−1} + (1/t) ( y^(i) x^(i) · 1[y^(i)(x^(i)T w_{t−1} + b) < 1] − λ w_{t−1} )
where i is sampled from a uniform distribution. (Shalev-Shwartz et al., 2007)
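A compact MATLAB sketch of this scheme (Pegasos-style, following Shalev-Shwartz et al. 2007; the step size 1/(λt) and the omission of the bias term are my simplifications, not from the slide):

% Pegasos-style stochastic sub-gradient descent for the hinge-loss SVM.
% X: n-by-d data (rows are examples), y: n-by-1 labels in {-1,+1}.
function w = svmSGD(X, y, lambda, T)
    [n, d] = size(X);
    w = zeros(d, 1);
    for t = 1:T
        i = randi(n);                     % sample i uniformly at random
        eta = 1 / (lambda * t);           % decaying step size
        margin = y(i) * (X(i, :) * w);    % y_i * f(x_i), bias omitted
        if margin < 1
            % hinge active: sub-gradient is lambda*w - y_i*x_i
            w = (1 - eta * lambda) * w + eta * y(i) * X(i, :)';
        else
            % hinge inactive: only the regularizer contributes
            w = (1 - eta * lambda) * w;
        end
    end
end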