Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab


Outline Main goal: To understand how support vector machines (SVMs) perform optimal classification for labelled data sets, also a quick peek at the implementational level. 1. What is a support vector machine? 2. Formulating the SVM problem 3. Optimization using Sequential Minimal Optimization 4. Use of Kernels for non-linear classification 5. Implementation of SVM in MATLAB environment 6. Stochastic Gradient Descent (extra)


Support Vector Machines (SVM) [scatter plot of two labelled classes on axes x1 and x2]

Support Vector Machines (SVM) Not a very good decision boundary! [plot: a candidate boundary between the two classes, axes x1 and x2]

Support Vector Machines (SVM) Great decision boundary! The boundary is the line ax1 + bx2 + c = 0 (in machine learning notation, w^T x + b = 0), with ax1 + bx2 + c < 0 on one side and ax1 + bx2 + c > 0 on the other. [plot: axes x1 and x2]

Support Vector Machines (SVM) Great decision boundary! ax1 + bx2 + c = 0; in machine learning notation, w^T x + b = 0, with w^T x + b < 0 on one side and w^T x + b > 0 on the other. [plot: axes x1 and x2]

Support Vector Machines (SVM) Margin hyperplanes w^T x + b = +1 and w^T x + b = -1 on either side of the decision boundary w^T x + b = 0. [plot: axes x1 and x2]


Support Vector Machines (SVM) The margin hyperplanes w^T x + b = +1 and w^T x + b = -1 lie at distances +d and -d from the boundary w^T x + b = 0. How do we maximize d? [plot: axes x1 and x2]

Support Vector Machines (SVM) How do we maximize d? For a point on the margin hyperplane, d = (w^T x + b) / ||w|| = 1 / ||w||, so maximizing d means minimizing ||w||. Minimize this! [plot: axes x1 and x2]

Support Vector Machines (SVM) How do we maximize d? d = (w^T x + b) / ||w|| = 1 / ||w||, so maximizing d means minimizing ||w||, or equivalently minimizing (1/2)||w||^2. [plot: axes x1 and x2]
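To make the step from maximizing the margin to minimizing (1/2)||w||^2 explicit, here is the same argument written out in LaTeX (standard derivation; nothing beyond what the slides state):

\[
d = \frac{w^\top x + b}{\lVert w \rVert} = \frac{1}{\lVert w \rVert}
\quad \text{for a point } x \text{ on the margin hyperplane } w^\top x + b = 1,
\]
\[
\max_{w,b}\, \frac{1}{\lVert w \rVert}
\;\Longleftrightarrow\;
\min_{w,b}\, \lVert w \rVert
\;\Longleftrightarrow\;
\min_{w,b}\, \tfrac{1}{2}\lVert w \rVert^{2},
\]

where the squared form is preferred because it is convex and differentiable, which makes the optimization a quadratic program.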

SVM Initial Constraints If our data point is labelled y = 1, we require w^T x^(i) + b ≥ 1. If y = -1, we require w^T x^(i) + b ≤ -1. This can be succinctly summarized as: y^(i) (w^T x^(i) + b) ≥ 1 (Hard Margin SVM). We're going to allow for some margin violation: y^(i) (w^T x^(i) + b) ≥ 1 - ξ_i, i.e. y^(i) (w^T x^(i) + b) - 1 + ξ_i ≥ 0 (Soft Margin SVM).

Support Vector Machines (SVM) [plot: margin hyperplanes w^T x + b = ±1 around the boundary w^T x + b = 0, with a violating point a distance ξ/||w|| inside the margin; axes x1 and x2]

Optimization Objective Minimize: (1/2)||w||^2 + C Σ_i ξ_i, where C Σ_i ξ_i is the penalizing term. Subject to: y^(i) (w^T x^(i) + b) - 1 + ξ_i ≥ 0 and ξ_i ≥ 0.
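Collected in one place, the soft-margin primal problem from this slide (LaTeX transcription; the remark about C is mine):

\[
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\lVert w \rVert^{2} + C\sum_{i}\xi_{i}
\qquad \text{s.t.} \qquad
y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) \ge 1 - \xi_{i},
\quad \xi_{i} \ge 0 \;\; \forall i .
\]

A larger C punishes margin violations more heavily (approaching the hard-margin SVM), while a smaller C tolerates more violations in exchange for a wider margin.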

Outline Main goal: To understand how support vector machines (SVMs) perform optimal classification for labelled data sets, also a quick peek at the implementational level. 1. What is a support vector machine? 2. Formulating the SVM problem 3. Optimization using Sequential Minimal Optimization 4. Use of Kernels for non-linear classification 5. Implementation of SVM in MATLAB environment

Optimization Objective Minimize (1/2)||w||^2 + C Σ_i ξ_i s.t. y^(i) (w^T x^(i) + b) - 1 + ξ_i ≥ 0 and ξ_i ≥ 0. Need to use the Lagrangian. Using a Lagrangian with inequality constraints requires us to fulfill the Karush-Kuhn-Tucker (KKT) conditions.

Karush-Kuhn-Tucker Conditions There must exist a w* that solves the primal problem, whereas α* and b* are solutions to the dual problem. All three parameters must satisfy the following conditions: 1. ∂L(w, α, b)/∂w_i = 0 2. ∂L(w, α, b)/∂b = 0 3. α_i g_i(w) = 0 4. g_i(w) ≤ 0 5. α_i ≥ 0 6. ξ_i ≥ 0 7. r_i ξ_i = 0. Constraint 3: if α_i* is a non-zero number, g_i must be 0! Recall: g_i(w) = 1 - ξ_i - y^(i) (w^T x^(i) + b) ≤ 0. For example, x must lie directly on the functional margin in order for g_i(w) to equal 0; here α > 0. These are our support vectors!
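For readability, the same conditions in LaTeX, with the inequality constraint written as g_i(w) = 1 - ξ_i - y^(i)(w^T x^(i) + b) ≤ 0 (the convention the slide appears to follow):

\[
\frac{\partial L}{\partial w} = 0, \qquad
\frac{\partial L}{\partial b} = 0, \qquad
\alpha_i\, g_i(w) = 0, \qquad
g_i(w) \le 0, \qquad
\alpha_i \ge 0, \qquad
\xi_i \ge 0, \qquad
r_i\, \xi_i = 0 .
\]

The complementary slackness condition α_i g_i(w) = 0 is what forces α_i = 0 for every point strictly outside the margin, leaving only the support vectors with α_i > 0.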

Optimization Objective Minimize (1/2)||w||^2 + C Σ_i ξ_i subject to y^(i) (w^T x^(i) + b) - 1 + ξ_i ≥ 0 and ξ_i ≥ 0. Need to use the Lagrangian (general form: L(x, α) = f(x) + Σ_i λ_i h_i(x)): L_P(w, b, α, ξ, r) = (1/2)||w||^2 + C Σ_i ξ_i - Σ_i α_i [y^(i) (w^T x^(i) + b) - 1 + ξ_i] - Σ_i r_i ξ_i, and we take min over w, b of L_P. The three pieces are the minimization function, the margin constraint, and the error constraint.
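Written out in LaTeX (this is the standard soft-margin primal Lagrangian that the slide transcribes, with α_i and r_i as the Lagrange multipliers):

\[
L_P(w, b, \xi, \alpha, r) \;=\;
\underbrace{\tfrac{1}{2}\lVert w \rVert^{2} + C\sum_i \xi_i}_{\text{minimization function}}
\;-\; \underbrace{\sum_i \alpha_i \left[ y^{(i)}\bigl(w^{\top} x^{(i)} + b\bigr) - 1 + \xi_i \right]}_{\text{margin constraint}}
\;-\; \underbrace{\sum_i r_i\, \xi_i}_{\text{error constraint}},
\qquad \alpha_i \ge 0,\; r_i \ge 0 .
\]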

The problem with the Primal L_P = (1/2)||w||^2 + C Σ_i ξ_i - Σ_i α_i [y^(i) (w^T x^(i) + b) - 1 + ξ_i] - Σ_i r_i ξ_i. We could just optimize the primal equation and be finished with SVM optimization, BUT!!! you would miss out on one of the most important aspects of the SVM: the Kernel Trick. We need the Lagrangian Dual!

http://stats.stackexchange.com/questions/19181/why-bother-with-the-dual-problem-when-fitting-svm

Finding the Dual L_P = (1/2)||w||^2 + C Σ_i ξ_i - Σ_i α_i [y^(i) (w^T x^(i) + b) - 1 + ξ_i] - Σ_i r_i ξ_i. First, compute the minimizing w, b and ξ by setting the gradients to zero: ∂L(w, b, α)/∂w = w - Σ_i α_i y^(i) x^(i) = 0, so w = Σ_i α_i y^(i) x^(i) (can solve for w!); ∂L(w, b, α)/∂b = Σ_i α_i y^(i) = 0 (new constraint); ∂L(w, b, α)/∂ξ_i = C - α_i - r_i = 0, so α_i = C - r_i, which gives the new constraint 0 ≤ α_i ≤ C. Substitute these back in.

Lagrangian Dual L_P = (1/2)||w||^2 + C Σ_i ξ_i - Σ_i α_i [y^(i) (w^T x^(i) + b) - 1 + ξ_i] - Σ_i r_i ξ_i. Substituting w = Σ_i α_i y^(i) x^(i) gives L_D = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y^(i) y^(j) (x^(i) · x^(j)), s.t. ∀i: Σ_i α_i y^(i) = 0, ξ_i ≥ 0 and 0 ≤ α_i ≤ C (from KKT and the previous derivation).

Lagrangian Dual max_α L_D = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y^(i) y^(j) (x^(i) · x^(j)), s.t. ∀i: Σ_i α_i y^(i) = 0, ξ_i ≥ 0, 0 ≤ α_i ≤ C and the KKT conditions. The objective only computes inner products of x^(i) and x^(j) and no longer relies on w or b, so we can replace the inner product with K(x, z) for kernels. We can solve for α explicitly, and all values not on the functional margin have α = 0, which is more efficient!
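The dual problem in LaTeX (standard soft-margin SVM dual, matching the slide's plain-text version):

\[
\max_{\alpha}\;\; \sum_{i} \alpha_i \;-\; \tfrac{1}{2} \sum_{i}\sum_{j}
\alpha_i \alpha_j\, y^{(i)} y^{(j)} \bigl\langle x^{(i)}, x^{(j)} \bigr\rangle
\qquad \text{s.t.} \qquad
\sum_{i} \alpha_i y^{(i)} = 0, \quad 0 \le \alpha_i \le C .
\]

Because the data enter only through the inner products ⟨x^(i), x^(j)⟩, those inner products can later be swapped for a kernel K(x^(i), x^(j)).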

Outline Main goal: To understand how support vector machines (SVMs) perform optimal classification for labelled data sets, also a quick peek at the implementational level. 1. What is a support vector machine? 2. Formulating the SVM problem 3. Optimization using Sequential Minimal Optimization 4. Use of Kernels for non-linear classification 5. Implementation of SVM in MATLAB environment 6. Stochastic SVM Gradient Descent (extra)

Sequential Minimal Optimization max_α L_D = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y^(i) y^(j) (x^(i) · x^(j)). Ideally we would hold all α but one α_j constant and optimize, but the constraint Σ_i α_i y^(i) = 0 means α_1 = -y^(1) Σ_{i≥2} α_i y^(i), so one multiplier is completely determined by the others. Solution: change two α at a time! (Platt 1998)

Sequential Minimal Optimization Algorithm: Initialize all α to 0. Repeat { 1. Select an α_j that violates the KKT conditions and an additional α_k to update. 2. Re-optimize L_D(α) with respect to the two chosen α while holding all other α constant, and compute w, b. } In order to keep Σ_i α_i y^(i) = 0 true, we need α_j y^(j) + α_k y^(k) = ζ: even after the update, this linear combination must remain constant! (Platt 1998)

Sequential Minimal Optimization 0 ≤ α ≤ C from the constraint, and the pair lies on a line given by the linear constraint α_j y^(j) + α_k y^(k) = ζ, so clip values if they exceed the bounds H or L of the [0, C] box. If y^(j) ≠ y^(k) then α_j - α_k = ζ; if y^(j) = y^(k) then α_j + α_k = ζ. [diagrams: the box 0 ≤ α_1, α_2 ≤ C with the constraint line and the clipping bounds L and H]

Sequential Minimal Optimization We can analytically solve for the new α, w and b at each update step (closed-form computation): α_k^new = α_k^old + y^(k) (E_j - E_k) / η, where η is the second derivative of the dual with respect to the second multiplier and E_i is the prediction error on example i. α_j^new = α_j^old + y^(k) y^(j) (α_k^old - α_k^new,clipped). b_1 = E_1 + y^(j) (α_j^new - α_j^old) K(x_j, x_j) + y^(k) (α_k^new,clipped - α_k^old) K(x_j, x_k), b_2 = E_2 + y^(j) (α_j^new - α_j^old) K(x_j, x_k) + y^(k) (α_k^new,clipped - α_k^old) K(x_k, x_k), b = (b_1 + b_2) / 2, and w = Σ_i α_i y^(i) x^(i). (Platt 1998)
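To make the update concrete, here is a minimal MATLAB sketch of a single SMO pair update in the style of the simplified SMO from the references (the function name smoPairUpdate and its conventions are mine, not the lecture's; the signs follow a decision function f(x) = Σ_i α_i y^(i) K(x^(i), x) + b, so the threshold update is written with minus signs rather than the slide's form):

function [alpha, b] = smoPairUpdate(K, y, alpha, b, C, i, j)
    % One SMO update for the multiplier pair (i, j).
    % K: n-by-n kernel matrix, y: n-by-1 labels in {-1,+1}, alpha: n-by-1, C: box constraint.
    f  = @(k) (alpha .* y)' * K(:, k) + b;   % decision value for example k
    Ei = f(i) - y(i);                        % prediction errors E_i, E_j
    Ej = f(j) - y(j);
    if y(i) ~= y(j)                          % bounds L, H keep the pair inside the [0, C] box
        L = max(0, alpha(j) - alpha(i));  H = min(C, C + alpha(j) - alpha(i));
    else
        L = max(0, alpha(i) + alpha(j) - C);  H = min(C, alpha(i) + alpha(j));
    end
    eta = 2*K(i,j) - K(i,i) - K(j,j);        % (negative) second derivative of the dual
    if L == H || eta >= 0, return; end
    aIold = alpha(i);  aJold = alpha(j);
    alpha(j) = min(H, max(L, alpha(j) - y(j)*(Ei - Ej)/eta));   % unconstrained step, then clip
    alpha(i) = alpha(i) + y(i)*y(j)*(aJold - alpha(j));         % keep sum(alpha .* y) constant
    b1 = b - Ei - y(i)*(alpha(i)-aIold)*K(i,i) - y(j)*(alpha(j)-aJold)*K(i,j);
    b2 = b - Ej - y(i)*(alpha(i)-aIold)*K(i,j) - y(j)*(alpha(j)-aJold)*K(j,j);
    b  = (b1 + b2)/2;                        % threshold: average of the two candidates
end

An outer loop in the spirit of the algorithm slide would repeatedly pick a KKT-violating index i, pick a second index j, and call this update until no multiplier changes.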

Outline Main goal: To understand how support vector machines (SVMs) perform optimal classification for labelled data sets, also a quick peek at the implementational level. 1. What is a support vector machine? 2. Formulating the SVM problem 3. Optimization using Sequential Minimal Optimization 4. Use of Kernels for non-linear classification 5. Implementation of SVM in MATLAB environment 6. Stochastic SVM Gradient Descent (extra)

Kernels Recall: max_α L_D = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y^(i) y^(j) (x^(i) · x^(j)). We can generalize this using a kernel function K(x_i, x_j)! K(x_i, x_j) = φ(x_i)^T φ(x_j), where φ is a feature mapping function. It turns out that we don't need to explicitly compute φ(x), due to the Kernel Trick!

Kernel Trick Suppose K(x, z) = (x^T z)^2. Then K(x, z) = (Σ_i x_i z_i)(Σ_j x_j z_j) = Σ_i Σ_j x_i z_i x_j z_j = Σ_{i,j} (x_i x_j)(z_i z_j), i.e. the inner product of feature vectors φ(x) with components x_i x_j. Representing each feature explicitly: O(n^2). Kernel trick: O(n).
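A tiny illustrative MATLAB check of this identity (my example, not part of the lecture): the quadratic kernel (x^T z)^2 equals the inner product of explicit feature vectors containing all n^2 products x_i x_j, but costs O(n) instead of O(n^2) to evaluate.

n = 4;                               % small example dimension
x = randn(n, 1);  z = randn(n, 1);
phi = @(v) reshape(v * v', [], 1);   % explicit feature map: all n^2 products v_i * v_j
explicitIP  = phi(x)' * phi(z);      % O(n^2) features, then inner product
kernelTrick = (x' * z)^2;            % O(n) kernel evaluation
fprintf('explicit: %.6f   kernel trick: %.6f\n', explicitIP, kernelTrick);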

Types of Kernels Linear kernel: K(x, z) = x^T z. Gaussian kernel (Radial Basis Function): K(x, z) = exp(-||x - z||^2 / (2σ^2)). Polynomial kernel: K(x, z) = (x^T z + c)^d. The dual then reads max_α L_D = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y^(i) y^(j) K(x_i, x_j).
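As MATLAB one-liners (sketches assuming column-vector inputs; each function would live in its own .m file, and the names echo the kernel functions used in the examples below, but the bodies here are mine):

function sim = linearkernel(x1, x2)
    sim = x1' * x2;                                        % K(x,z) = x'z
end

function sim = gaussiankernel(x1, x2, sigma)
    sim = exp(-((x1 - x2)' * (x1 - x2)) / (2*sigma^2));    % RBF kernel
end

function sim = polynomialkernel(x1, x2, c, d)
    sim = (x1' * x2 + c)^d;                                % polynomial kernel
end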

Outline Main goal: To understand how support vector machines (SVMs) perform optimal classification for labelled data sets, also a quick peek at the implementational level. 1. What is a support vector machine? 2. Formulating the SVM problem 3. Optimization using Sequential Minimal Optimization 4. Use of Kernels for non-linear classification 5. Implementation of SVM in MATLAB environment 6. Stochastic SVM Gradient Descent (extra)

Example 1: Linear Kernel

Example 1: Linear Kernel In MATLAB I used svmtrain: model = svmtrain(x, y, C, @linearkernel, 1e-3, 500); 500 max iterations, C = 1000, KKT tol = 1e-3

Example 2: Gaussian Kernel (RBF)

Example 2: Gaussian Kernel (RBF) In MATLAB the Gaussian kernel is gsim = exp(-1*((x1-x2)'*(x1-x2))/(2*sigma^2)); model = svmtrain(x, y, C, @(x1, x2) gaussiankernel(x1, x2, sigma), 0.1, 300); 300 max iterations, C = 1, KKT tol = 0.1

Example 2: Gaussian Kernel (RBF) In MATLAB the Gaussian kernel is gsim = exp(-1*((x1-x2)'*(x1-x2))/(2*sigma^2)); model = svmtrain(x, y, C, @(x1, x2) gaussiankernel(x1, x2, sigma), 0.001, 300); 300 max iterations, C = 1, KKT tol = 0.001

Example 2: Gaussian Kernel (RBF) In MATLAB the Gaussian kernel is gsim = exp(-1*((x1-x2)'*(x1-x2))/(2*sigma^2)); model = svmtrain(x, y, C, @(x1, x2) gaussiankernel(x1, x2, sigma), 0.001, 300); 300 max iterations, C = 10, KKT tol = 0.001

Resources
http://research.microsoft.com/pubs/69644/tr-98-14.pdf
ftp://www.ai.mit.edu/pub/users/tlp/projects/svm/svm-smo/smo.pdf
http://www.cs.toronto.edu/~urtasun/courses/csc2515/09svm-2515.pdf
http://www.pstat.ucsb.edu/student%20seminar%20doc/svm2.pdf
http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/lectures/lecture11.pdf
http://stats.stackexchange.com/questions/19181/why-bother-with-the-dual-problem-when-fitting-svm
http://cs229.stanford.edu/notes/cs229-notes3.pdf
http://www.svms.org/tutorials/berwick2003.pdf
http://www.med.nyu.edu/chibi/sites/default/files/chibi/final.pdf
https://share.coursera.org/wiki/index.php/ml:main
https://www.youtube.com/watch?v=vqovichkm7i

Outline Main goal: To understand how support vector machines (SVMs) perform optimal classification for labelled data sets, also a quick peek at the implementational level. 1. What is a support vector machine? 2. Formulating the SVM problem 3. Optimization using Sequential Minimal Optimization 4. Use of Kernels for non-linear classification 5. Implementation of SVM in MATLAB environment 6. Stochastic SVM Gradient Descent (extra)

Hinge-Loss Primal SVM Recall the original problem: minimize (1/2) w^T w + C Σ_i ξ_i s.t. y^(i) (w^T x^(i) + b) - 1 + ξ_i ≥ 0. Re-write in terms of the hinge-loss function: L_P = (λ/2) w^T w + (1/n) Σ_i h(1 - y^(i) (w^T x^(i) + b)), a regularization term plus a hinge-loss cost function.

Hinge Loss Function h(1 - y^(i) (w^T x^(i) + b)): penalize when the signs of f(x^(i)) and y^(i) don't match! [plot: the hinge loss as a function of y^(i)·f(x^(i)) over the range -3 to 3]

Hinge-Loss Primal SVM Hinge-Loss Primal: L_P = (λ/2) w^T w + (1/n) Σ_i h(1 - y^(i) (w^T x^(i) + b)). Compute the gradient (sub-gradient): ∇_w = λw - (1/n) Σ_i y^(i) x^(i) 1[y^(i) (w^T x^(i) + b) < 1], where 1[·] is the indicator function: it is 0 if the example is classified correctly (beyond the margin) and equals 1, the magnitude of the hinge slope, if not.
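The sub-gradient written out in LaTeX (standard result for the hinge loss; the indicator notation 1[·] is the same as above):

\[
\nabla_w L_P \;=\; \lambda w \;-\; \frac{1}{n}\sum_{i=1}^{n}
\mathbf{1}\!\left[\, y^{(i)}\bigl(w^{\top} x^{(i)} + b\bigr) < 1 \,\right]\, y^{(i)} x^{(i)} ,
\]

so an example contributes to the gradient only when it violates the margin.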

Stochastic Gradient Descent We implement random sampling here to pick out data points: the stochastic sub-gradient is λw - E_{i~U}[ y^(i) x^(i) 1[y^(i) (w^T x^(i) + b) < 1] ]. Use this to compute the update rule: w_t ← w_{t-1} + (1/t) ( y^(i) x^(i) 1[y^(i) (x^(i)^T w_{t-1} + b) < 1] - λ w_{t-1} ). Here i is sampled from a uniform distribution. (Shalev-Shwartz et al. 2007)
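A minimal MATLAB sketch of this stochastic sub-gradient loop (my illustration in the spirit of Pegasos, Shalev-Shwartz et al. 2007; the function name, the 1/(λt) step size, and the omission of the bias term are assumptions of this sketch, not taken from the slide):

function w = sgdHingeSVM(X, y, lambda, T)
    % Stochastic sub-gradient descent on the hinge-loss primal.
    % X: n-by-d data matrix, y: n-by-1 labels in {-1,+1}, T: number of iterations.
    [n, d] = size(X);
    w = zeros(d, 1);
    for t = 1:T
        i   = randi(n);                      % sample one example uniformly at random
        eta = 1 / (lambda * t);              % decreasing step size
        if y(i) * (X(i, :) * w) < 1          % margin violated: hinge term is active
            w = (1 - eta*lambda) * w + eta * y(i) * X(i, :)';
        else                                 % only the regularizer contributes
            w = (1 - eta*lambda) * w;
        end
    end
end

Each iteration touches a single example and costs O(d), which is what makes this approach attractive for large data sets.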