Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

Machine Learning, Lecture 6: Support Vector Machine
Feng Li (fli@sdu.edu.cn, https://funglee.github.io)
School of Computer Science and Technology, Shandong University, Fall 2018

Warm Up (slides 2-10; no text content was transcribed) 2-10 / 80

Hyperplane
Separates an n-dimensional space into two half-spaces. Defined by an outward-pointing normal vector $\omega \in \mathbb{R}^n$.
Assumption: the hyperplane passes through the origin. If not, we add a bias term $b$, and then need both $\omega$ and $b$ to define it. $b > 0$ means moving the hyperplane parallel to $\omega$ ($b < 0$ means moving it in the opposite direction). 11 / 80

Support Vector Machine
A hyperplane-based linear classifier defined by $\omega$ and $b$. Prediction rule: $y = \mathrm{sign}(\omega^T x + b)$.
Given: training data $\{(x^{(i)}, y^{(i)})\}_{i=1,\dots,m}$. Goal: learn $\omega$ and $b$ that achieve the maximum margin.
For now, assume the entire training set is correctly classified by $(\omega, b)$, i.e., zero loss on the training examples (non-zero loss later). 12 / 80

Margin (Contd.)
Hyperplane: $w^T x + b = 0$, where $w$ is the normal vector. The margin $\gamma^{(i)}$ is the distance between $x^{(i)}$ and the hyperplane. Since $x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|}$ lies on the hyperplane,
$w^T \left( x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|} \right) + b = 0 \;\Rightarrow\; \gamma^{(i)} = \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|}$ 13 / 80

Margin (Contd.)
Hyperplane: $w^T x + b = 0$, where $w$ is the normal vector. The margin $\gamma^{(i)}$ is the distance between $x^{(i)}$ and the hyperplane:
$w^T \left( x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|} \right) + b = 0 \;\Rightarrow\; \gamma^{(i)} = \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|}$
Now the margin is signed: if $y^{(i)} = 1$, then $\gamma^{(i)} \ge 0$; otherwise $\gamma^{(i)} < 0$. 14 / 80

Margin (Contd.)
Geometric margin:
$\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right)$
Scaling $(w, b)$ does not change $\gamma^{(i)}$. With respect to the whole training set, the margin is written as
$\gamma = \min_i \gamma^{(i)}$ 15 / 80
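
As a quick numerical illustration (a minimal sketch; the hyperplane and sample below are made up for the example), the geometric margin and its scale invariance can be checked directly from the definition:

    import numpy as np

    # Hypothetical hyperplane (w, b) and one labeled sample
    w = np.array([2.0, 1.0])
    b = -1.0
    x = np.array([1.5, 2.0])
    y = 1  # label in {-1, +1}

    # Geometric margin: gamma = y * ((w/||w||)^T x + b/||w||)
    gamma = y * (w @ x + b) / np.linalg.norm(w)
    print(gamma)

    # Scaling (w, b) by a positive constant leaves gamma unchanged
    gamma_scaled = y * ((10 * w) @ x + 10 * b) / np.linalg.norm(10 * w)
    print(np.isclose(gamma, gamma_scaled))  # True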

Maximizing The Margin
The hyperplane serves as a decision boundary differentiating positive labels from negative labels. The larger the margin, i.e., the farther a data sample is from the hyperplane, the more confident the decision. There exist an infinite number of separating hyperplanes, but which one is the best?
$\max_{\omega, b} \min_i \{\gamma^{(i)}\}$ 16 / 80

Maximizing The Margin (Contd.)
There exist an infinite number of separating hyperplanes, but which one is the best?
$\max_{\omega, b} \min_i \{\gamma^{(i)}\}$
It is equivalent to
$\max_{\gamma, \omega, b} \ \gamma \quad \text{s.t.} \quad \gamma^{(i)} \ge \gamma, \ \forall i$
Since $\gamma^{(i)} = y^{(i)} \left( \left( \frac{\omega}{\|\omega\|} \right)^T x^{(i)} + \frac{b}{\|\omega\|} \right)$, the constraint becomes $y^{(i)}(\omega^T x^{(i)} + b) \ge \gamma \|\omega\|, \ \forall i$ 17 / 80

Maximizing The Margin (Contd.)
Formally,
$\max_{\gamma, \omega, b} \ \gamma \quad \text{s.t.} \quad y^{(i)}(\omega^T x^{(i)} + b) \ge \gamma \|\omega\|, \ \forall i$ 18 / 80

Maximizing The Margin (Contd.)
Scaling $(\omega, b)$ such that $\min_i \{y^{(i)}(\omega^T x^{(i)} + b)\} = 1$, the margin becomes $1/\|\omega\|$:
$\gamma = \min_i \left\{ y^{(i)} \left( \left( \frac{\omega}{\|\omega\|} \right)^T x^{(i)} + \frac{b}{\|\omega\|} \right) \right\} = \frac{1}{\|\omega\|}$
The problem becomes
$\max_{\omega, b} \ 1/\|\omega\| \quad \text{s.t.} \quad y^{(i)}(\omega^T x^{(i)} + b) \ge 1, \ \forall i$ 19 / 80

Support Vector Machine (Primal Form)
Maximizing $1/\|\omega\|$ is equivalent to minimizing $\|\omega\|^2 = \omega^T \omega$:
$\min_{\omega, b} \ \omega^T \omega \quad \text{s.t.} \quad y^{(i)}(\omega^T x^{(i)} + b) \ge 1, \ \forall i$
This is a quadratic programming (QP) problem! Generic methods include the interior point method (https://en.wikipedia.org/wiki/interior-point_method), the active set method (https://en.wikipedia.org/wiki/active_set_method), the gradient projection method (http://www.ifp.illinois.edu/~angelia/l13_constrained_gradient.pdf), etc. However, generic QP solvers are inefficient, especially in the face of a large training set. 20 / 80

Convex Optimization Review
Optimization problems, Lagrangian duality, KKT conditions, convex optimization. 21 / 80

Optimization Problems
Consider the following optimization problem:
$\min_{\omega} \ f(\omega) \quad \text{s.t.} \quad g_i(\omega) \le 0, \ i = 1, \dots, k; \quad h_j(\omega) = 0, \ j = 1, \dots, l$
with variable $\omega \in \mathbb{R}^n$, domain $D = \bigcap_{i=1}^{k} \mathrm{dom}\, g_i \cap \bigcap_{j=1}^{l} \mathrm{dom}\, h_j$, and optimal value $p^*$.
Objective function $f(\omega)$; $k$ inequality constraints $g_i(\omega) \le 0$, $i = 1, \dots, k$; $l$ equality constraints $h_j(\omega) = 0$, $j = 1, \dots, l$. 22 / 80

Lagrangian
Lagrangian $L : \mathbb{R}^n \times \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}$, with $\mathrm{dom}\, L = D \times \mathbb{R}^k \times \mathbb{R}^l$:
$L(\omega, \alpha, \beta) = f(\omega) + \sum_{i=1}^{k} \alpha_i g_i(\omega) + \sum_{j=1}^{l} \beta_j h_j(\omega)$
A weighted sum of the objective and constraint functions. $\alpha_i$ is the Lagrange multiplier associated with $g_i(\omega) \le 0$; $\beta_j$ is the Lagrange multiplier associated with $h_j(\omega) = 0$. 23 / 80

Lagrange Dual Function
The Lagrange dual function $G : \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}$:
$G(\alpha, \beta) = \inf_{\omega \in D} L(\omega, \alpha, \beta) = \inf_{\omega \in D} \left( f(\omega) + \sum_{i=1}^{k} \alpha_i g_i(\omega) + \sum_{j=1}^{l} \beta_j h_j(\omega) \right)$
$G$ is concave, and can be $-\infty$ for some $\alpha, \beta$. 24 / 80

The Lower Bounds Property
If $\alpha \succeq 0$, then $G(\alpha, \beta) \le p^*$, where $p^*$ is the optimal value of the primal problem.
Proof: If $\tilde{\omega}$ is feasible and $\alpha \succeq 0$, then
$f(\tilde{\omega}) \ge L(\tilde{\omega}, \alpha, \beta) \ge \inf_{\omega \in D} L(\omega, \alpha, \beta) = G(\alpha, \beta)$
Minimizing over all feasible $\tilde{\omega}$ gives $p^* \ge G(\alpha, \beta)$. 25 / 80

Lagrange Dual Problem
$\max_{\alpha, \beta} \ G(\alpha, \beta) \quad \text{s.t.} \quad \alpha_i \ge 0, \ i = 1, \dots, k$
Finds the best lower bound on $p^*$ obtainable from the Lagrange dual function. It is a convex optimization problem (optimal value denoted by $d^*$). $(\alpha, \beta)$ are dual feasible if $\alpha \succeq 0$, $(\alpha, \beta) \in \mathrm{dom}\, G$, and $G(\alpha, \beta) > -\infty$. Often simplified by making the implicit constraint $(\alpha, \beta) \in \mathrm{dom}\, G$ explicit. 26 / 80
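
As a small worked example (not from the slides), take the one-variable problem $\min_{\omega} \omega^2$ subject to $g(\omega) = 1 - \omega \le 0$, whose optimal value is $p^* = 1$ at $\omega^* = 1$:

$L(\omega, \alpha) = \omega^2 + \alpha (1 - \omega), \qquad G(\alpha) = \inf_{\omega} L(\omega, \alpha) = \alpha - \frac{\alpha^2}{4} \quad (\text{attained at } \omega = \alpha/2)$

$\max_{\alpha \ge 0} G(\alpha) = 1, \quad \text{attained at } \alpha^* = 2$

Every $\alpha \ge 0$ gives a lower bound $G(\alpha) \le p^* = 1$, and the best bound attains $d^* = p^*$ (strong duality), with $\alpha^* g(\omega^*) = 0$ at $\omega^* = 1$.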

Weak Duality
Weak duality: $d^* \le p^*$. Always holds. Can be used to find nontrivial lower bounds for difficult problems. Optimal duality gap: $p^* - d^*$. 27 / 80

Complementary Slackness
Let $\omega^*$ be a primal optimal point and $(\alpha^*, \beta^*)$ be a dual optimal point. If strong duality holds, then
$\alpha_i^* g_i(\omega^*) = 0, \quad i = 1, 2, \dots, k$ 28 / 80

Complementary Slackness (Proof)
We have
$f(\omega^*) = G(\alpha^*, \beta^*) = \inf_{\omega} \left( f(\omega) + \sum_{i=1}^{k} \alpha_i^* g_i(\omega) + \sum_{j=1}^{l} \beta_j^* h_j(\omega) \right) \le f(\omega^*) + \sum_{i=1}^{k} \alpha_i^* g_i(\omega^*) + \sum_{j=1}^{l} \beta_j^* h_j(\omega^*) \le f(\omega^*)$
The last two inequalities hold with equality, so (using $h_j(\omega^*) = 0$) we have $\sum_{i=1}^{k} \alpha_i^* g_i(\omega^*) = 0$. Since each term $\alpha_i^* g_i(\omega^*)$ is nonpositive, we thus conclude
$\alpha_i^* g_i(\omega^*) = 0, \quad i = 1, 2, \dots, k$ 29 / 80

Complementary Slackness (Contd.)
Since the inequality in the third line holds with equality,
$f(\omega^*) = G(\alpha^*, \beta^*) = \inf_{\omega} \left( f(\omega) + \sum_{i=1}^{k} \alpha_i^* g_i(\omega) + \sum_{j=1}^{l} \beta_j^* h_j(\omega) \right) = f(\omega^*) + \sum_{i=1}^{k} \alpha_i^* g_i(\omega^*) + \sum_{j=1}^{l} \beta_j^* h_j(\omega^*) = f(\omega^*)$
$\omega^*$ actually minimizes $L(\omega, \alpha^*, \beta^*)$ over $\omega$, where
$L(\omega, \alpha^*, \beta^*) = f(\omega) + \sum_{i=1}^{k} \alpha_i^* g_i(\omega) + \sum_{j=1}^{l} \beta_j^* h_j(\omega)$ 30 / 80

Karush-Kuhn-Tucker (KKT) Conditions
Let $\omega^*$ and $(\alpha^*, \beta^*)$ be any primal and dual optimal points with zero duality gap (i.e., strong duality holds). The following conditions must be satisfied:
Stationarity (the gradient of the Lagrangian with respect to $\omega$ vanishes): $\nabla f(\omega^*) + \sum_{i=1}^{k} \alpha_i^* \nabla g_i(\omega^*) + \sum_{j=1}^{l} \beta_j^* \nabla h_j(\omega^*) = 0$
Primal feasibility: $g_i(\omega^*) \le 0$, $i = 1, \dots, k$; $h_j(\omega^*) = 0$, $j = 1, \dots, l$
Dual feasibility: $\alpha_i^* \ge 0$, $i = 1, \dots, k$
Complementary slackness: $\alpha_i^* g_i(\omega^*) = 0$, $i = 1, \dots, k$ 31 / 80

Convex Optimization Problem
Problem formulation:
$\min_{\omega} \ f(\omega) \quad \text{s.t.} \quad g_i(\omega) \le 0, \ i = 1, \dots, k; \quad A\omega - b = 0$
where $f$ and the $g_i$ are convex, $A$ is an $l \times n$ matrix, and $b \in \mathbb{R}^l$. 32 / 80

Weak Duality vs. Strong Duality
Weak duality: $d^* \le p^*$. Always holds. Can be used to find nontrivial lower bounds for difficult problems.
Strong duality: $d^* = p^*$. Does not hold in general, but (usually) holds for convex problems. Conditions that guarantee strong duality in convex problems are called constraint qualifications. 33 / 80

Slater's Constraint Qualification
Strong duality holds for a convex problem
$\min_{\omega} \ f(\omega) \quad \text{s.t.} \quad g_i(\omega) \le 0, \ i = 1, \dots, k; \quad A\omega - b = 0$
if it is strictly feasible, i.e.,
$\exists \omega \in \mathrm{relint}\, D : \ g_i(\omega) < 0, \ i = 1, \dots, k, \quad A\omega = b$ 34 / 80

KKT Conditions for Convex Optimization
For a convex optimization problem, the KKT conditions are also sufficient for points to be primal and dual optimal. Suppose $\tilde{\omega}$, $\tilde{\alpha}$, and $\tilde{\beta}$ are any points satisfying the following KKT conditions:
$g_i(\tilde{\omega}) \le 0, \ i = 1, \dots, k$; $\quad h_j(\tilde{\omega}) = 0, \ j = 1, \dots, l$; $\quad \tilde{\alpha}_i \ge 0, \ i = 1, \dots, k$; $\quad \tilde{\alpha}_i g_i(\tilde{\omega}) = 0, \ i = 1, \dots, k$
$\nabla f(\tilde{\omega}) + \sum_{i=1}^{k} \tilde{\alpha}_i \nabla g_i(\tilde{\omega}) + \sum_{j=1}^{l} \tilde{\beta}_j \nabla h_j(\tilde{\omega}) = 0$
Then they are primal and dual optimal, with strong duality holding. 35 / 80

Optimal Margin Classifier
Primal (convex) problem formulation:
$\min_{\omega, b} \ \frac{1}{2} \|\omega\|^2 \quad \text{s.t.} \quad y^{(i)}(\omega^T x^{(i)} + b) \ge 1, \ \forall i$
The Lagrangian:
$L(\omega, b, \alpha) = \frac{1}{2} \|\omega\|^2 - \sum_{i=1}^{m} \alpha_i \left( y^{(i)}(\omega^T x^{(i)} + b) - 1 \right)$
The Lagrange dual function:
$G(\alpha) = \inf_{\omega, b} L(\omega, b, \alpha)$ 36 / 80

Optimal Margin Classifier
Dual problem formulation:
$\max_{\alpha} \ \inf_{\omega, b} L(\omega, b, \alpha) \quad \text{s.t.} \quad \alpha_i \ge 0, \ \forall i$
where the Lagrangian is
$L(\omega, b, \alpha) = \frac{1}{2} \|\omega\|^2 - \sum_{i=1}^{m} \alpha_i \left( y^{(i)}(\omega^T x^{(i)} + b) - 1 \right)$
and the Lagrange dual function is $G(\alpha) = \inf_{\omega, b} L(\omega, b, \alpha)$. 37 / 80

Optimal Margin Classifier (Contd.)
Dual problem formulation:
$\max_{\alpha} \ G(\alpha) = \inf_{\omega, b} L(\omega, b, \alpha) \quad \text{s.t.} \quad \alpha_i \ge 0, \ \forall i$ 38 / 80

Optimal Margin Classifier (Contd.)
According to the KKT conditions, minimizing $L(\omega, b, \alpha)$ over $\omega$ and $b$:
$\nabla_{\omega} L(\omega, b, \alpha) = \omega - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \ \Rightarrow \ \omega = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$
$\frac{\partial}{\partial b} L(\omega, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
The Lagrange dual function becomes
$G(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)}$
with $\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$ and $\alpha_i \ge 0$. 39 / 80

Optimal Margin Classifier (Contd.)
Dual problem formulation:
$\max_{\alpha} \ G(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)}$
$\text{s.t.} \quad \alpha_i \ge 0, \ \forall i; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
It is a convex optimization problem, so strong duality ($p^* = d^*$) holds and the KKT conditions are respected. It is a quadratic programming problem in $\alpha$, and several off-the-shelf solvers exist for such QPs, e.g., quadprog (MATLAB), CVXOPT, CPLEX, IPOPT, etc. 40 / 80
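
As a minimal sketch (not from the slides) of handing this dual to one of the off-the-shelf QP solvers listed above, the following uses CVXOPT; X is assumed to be an m x n NumPy array of samples and y a length-m array of +/-1 labels:

    import numpy as np
    from cvxopt import matrix, solvers

    def svm_dual_hard_margin(X, y):
        # Hard-margin SVM dual written as the QP: minimize (1/2) a^T P a - 1^T a
        # subject to -a <= 0 and y^T a = 0, where P_ij = y_i y_j <x_i, x_j>.
        m = X.shape[0]
        P = np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(m)  # tiny ridge for numerical stability
        q = -np.ones(m)
        G = -np.eye(m)                       # encodes alpha_i >= 0
        h = np.zeros(m)
        A = y.astype(float).reshape(1, -1)   # encodes sum_i alpha_i y_i = 0
        b = 0.0
        sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h), matrix(A), matrix(b))
        return np.ravel(sol['x'])            # optimal alpha

The maximization is turned into a minimization by negating the objective; the resulting alpha is then plugged into the KKT conditions to recover the separating hyperplane, as on the next slides.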

SVM: The Solution
Once we have the optimal $\alpha^*$,
$\omega^* = \sum_{i=1}^{m} \alpha_i^* y^{(i)} x^{(i)}$
Given $\omega^*$, how do we calculate the optimal value of $b$? 41 / 80

SVM: The Solution
Since $\alpha_i^* \left( y^{(i)}(\omega^{*T} x^{(i)} + b^*) - 1 \right) = 0$ for all $i$, we have $y^{(i)}(\omega^{*T} x^{(i)} + b^*) = 1$ for $\{i : \alpha_i^* > 0\}$.
Then, for $i$ such that $\alpha_i^* > 0$ (and since $y^{(i)} \in \{-1, +1\}$), we have
$b^* = y^{(i)} - \omega^{*T} x^{(i)}$
For robustness, we calculate the optimal value of $b$ by taking the average
$b^* = \frac{\sum_{i : \alpha_i^* > 0} \left( y^{(i)} - \omega^{*T} x^{(i)} \right)}{\sum_{i=1}^{m} \mathbb{1}(\alpha_i^* > 0)}$ 42 / 80
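
A short companion sketch (hypothetical helper, not from the slides) that recovers $\omega^*$ and $b^*$ from the dual solution using the formulas above:

    import numpy as np

    def recover_w_b(alpha, X, y, tol=1e-6):
        # Support vectors are the points with alpha_i > 0 (here: > tol, to absorb numerical noise)
        sv = alpha > tol
        w = (alpha[sv] * y[sv]) @ X[sv]     # w* = sum_i alpha_i y_i x_i
        b = np.mean(y[sv] - X[sv] @ w)      # average of y_i - w*^T x_i over support vectors
        return w, b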

SVM: The Solution (Contd.)
Most $\alpha_i$'s in the solution are zero (sparse solution). According to the KKT conditions, for the optimal $\alpha_i$'s,
$\alpha_i \left( 1 - y^{(i)}(\omega^T x^{(i)} + b) \right) = 0$
$\alpha_i$ is non-zero only if $x^{(i)}$ lies on one of the two margin boundaries, i.e., for which $y^{(i)}(\omega^T x^{(i)} + b) = 1$. 43 / 80

SVM: The Solution (Contd.)
These data samples are called support vectors (i.e., the support vectors "support" the margin boundaries). 44 / 80

SVM: The Solution (Contd.)
Redefine $\omega$:
$\omega = \sum_{s \in S} \alpha_s y^{(s)} x^{(s)}$
where $S$ denotes the set of indices of the support vectors. 45 / 80

Kernel Methods
Motivation: linear models (e.g., linear regression, linear SVM, etc.) cannot capture nonlinear patterns in the data.
Kernels make linear models work in nonlinear settings: by mapping data to higher dimensions where it exhibits linear patterns, we can apply the linear model in the new input space. The mapping is equivalent to changing the feature representation. 46 / 80

Feature Mapping
Consider the following binary classification problem: each sample is represented by a single feature $x$. No linear separator exists for this data. 47 / 80

Feature Mapping (Contd.) Now map each example as x {x, x 2 } Each example now has two features ( derived from the old representation) Data now becomes linearly separable in the new representation 48 / 80

Feature Mapping (Contd.)
Another example: each sample is defined by $x = \{x_1, x_2\}$. No linear separator exists for this data. 49 / 80

Feature Mapping (Contd.)
Now map each example as $x = \{x_1, x_2\} \to z = \{x_1^2, \sqrt{2} x_1 x_2, x_2^2\}$. Each example now has three features (derived from the old representation). The data now become linearly separable in the new representation. 50 / 80

Feature Mapping (Contd.)
Consider the following feature mapping $\phi$ for an example $x = \{x_1, \dots, x_n\}$:
$\phi : x \to \{x_1^2, x_2^2, \dots, x_n^2, x_1 x_2, x_1 x_3, \dots, x_1 x_n, \dots, x_{n-1} x_n\}$
It is an example of a quadratic mapping. Each new feature uses a pair of the original features. 51 / 80

Feature Mapping (Contd.)
Problem: the mapping usually causes the number of features to blow up! Computing the mapping itself can be inefficient, especially when the new space is very high dimensional. Storing and using these mappings in later computations can be expensive (e.g., we may have to compute inner products in a very high dimensional space). Using the mapped representation could be inefficient too.
Thankfully, kernels help us avoid both these issues! The mapping does not have to be explicitly computed, and computations with the mapped features remain efficient. 52 / 80

Kernels as High Dimensional Feature Mapping
Let's assume we are given a function $K$ (kernel) that takes as inputs $x$ and $z$:
$K(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)^T (z_1^2, \sqrt{2} z_1 z_2, z_2^2)$
The above function $K$ implicitly defines a mapping $\phi$ to a higher-dimensional space:
$\phi(x) = \{x_1^2, \sqrt{2} x_1 x_2, x_2^2\}$
Simply defining the kernel in a certain way gives a higher-dimensional mapping $\phi$. The mapping does not have to be explicitly computed, and computations with the mapped features remain efficient. 53 / 80
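
A quick numerical check (illustrative vectors only) that this quadratic kernel really is a dot product in the mapped space:

    import numpy as np

    def phi(v):
        # Explicit feature map associated with K(x, z) = (x^T z)^2 for 2-D inputs
        return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    print((x @ z) ** 2)      # kernel value computed in the original 2-D space
    print(phi(x) @ phi(z))   # same value, computed via the explicit 3-D mapping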

Kernels: Formal Definition
Each kernel $K$ has an associated feature mapping $\phi$, which takes an input $x \in \mathcal{X}$ (input space) and maps it to $\mathcal{F}$ (feature space):
$\phi : \mathcal{X} \to \mathcal{F}, \qquad K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
The kernel $K(x, z) = \phi(x)^T \phi(z)$ takes two inputs and gives their similarity in the $\mathcal{F}$ space. $\mathcal{F}$ needs to be a vector space with a dot product defined upon it, also called a Hilbert space.
Can just any function be used as a kernel function? No. It must satisfy Mercer's condition. 54 / 80

Mercer's Condition
For $K$ to be a kernel function, there must exist a Hilbert space $\mathcal{F}$ for which $K$ defines a dot product. The above is true if $K$ is a positive definite function:
$\int \int f(x) K(x, z) f(z)\, dx\, dz > 0 \quad (\forall f \in L_2)$
for all functions $f$ that are square integrable, i.e., $\int f^2(x)\, dx < \infty$. 55 / 80

Mercer's Condition (Contd.)
Let $K_1$ and $K_2$ be two kernel functions; then the following are kernel functions as well:
Direct sum: $K(x, z) = K_1(x, z) + K_2(x, z)$
Scalar product: $K(x, z) = \alpha K_1(x, z)$ (for $\alpha \ge 0$)
Direct product: $K(x, z) = K_1(x, z) K_2(x, z)$
Kernels can also be constructed by composing these rules. 56 / 80

The Kernel Matrix
The kernel function $K$ also defines the kernel matrix over the data (also denoted by $K$). Given $m$ samples $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, the $(i, j)$-th entry of $K$ is defined as
$K_{i,j} = K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^T \phi(x^{(j)})$
$K_{i,j}$: similarity between the $i$-th and $j$-th examples in the feature space $\mathcal{F}$. $K$ is an $m \times m$ matrix of pairwise similarities between samples in the $\mathcal{F}$ space. $K$ is a symmetric matrix, and $K$ is a positive semi-definite matrix. 57 / 80
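
A small sketch (made-up data) that builds the kernel matrix for the Gaussian kernel introduced on the next slide and checks these two properties numerically:

    import numpy as np

    def gaussian_kernel(x, z, sigma=1.0):
        return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))   # 20 samples, 3 features

    K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

    print(np.allclose(K, K.T))                      # symmetric
    print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # positive semi-definite up to rounding error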

Some Examples of Kernels
Linear (trivial) kernel: $K(x, z) = x^T z$
Quadratic kernel: $K(x, z) = (x^T z)^2$ or $(1 + x^T z)^2$
Polynomial kernel (of degree $d$): $K(x, z) = (x^T z)^d$ or $(1 + x^T z)^d$
Gaussian kernel: $K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$
Sigmoid kernel: $K(x, z) = \tanh(\alpha x^T z + c)$ 58 / 80

Using Kernels
Kernels can turn a linear model into a nonlinear one. The kernel $K(x, z)$ represents a dot product in some high-dimensional feature space $\mathcal{F}$, e.g., $K(x, z) = (x^T z)^2$ or $(1 + x^T z)^2$.
Any learning algorithm in which examples appear only as dot products $x^{(i)T} x^{(j)}$ can be kernelized (i.e., made non-linear) by replacing the $x^{(i)T} x^{(j)}$ terms with $\phi(x^{(i)})^T \phi(x^{(j)}) = K(x^{(i)}, x^{(j)})$. Most learning algorithms are like that: SVM, linear regression, etc. Many unsupervised learning algorithms can be kernelized too (e.g., K-means clustering, Principal Component Analysis, etc.). 59 / 80

Kernelized SVM Training
SVM dual Lagrangian:
$\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$
$\text{s.t.} \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0; \quad \alpha_i \ge 0, \ \forall i$ 60 / 80

Kernelized SVM Training (Contd.)
Replacing $\langle x^{(i)}, x^{(j)} \rangle$ with $\phi(x^{(i)})^T \phi(x^{(j)}) = K(x^{(i)}, x^{(j)}) = K_{i,j}$:
$\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j K_{i,j}$
$\text{s.t.} \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0; \quad \alpha_i \ge 0, \ \forall i$
The SVM now learns a linear separator in the kernel-defined feature space $\mathcal{F}$; this corresponds to a non-linear separator in the original space $\mathcal{X}$. 61 / 80

Kernelized SVM Prediction
Prediction for a test example $x$ (assume $b = 0$):
$y = \mathrm{sign}(\omega^T x) = \mathrm{sign}\left( \sum_{s \in S} \alpha_s y^{(s)} x^{(s)T} x \right)$
Replacing each example with its feature-mapped representation ($x \to \phi(x)$):
$y = \mathrm{sign}\left( \sum_{s \in S} \alpha_s y^{(s)} \phi(x^{(s)})^T \phi(x) \right) = \mathrm{sign}\left( \sum_{s \in S} \alpha_s y^{(s)} K(x^{(s)}, x) \right)$
The kernelized SVM needs the support vectors at test time (except when you can write $\phi(x)$ as an explicit, reasonably-sized vector). In the unkernelized version, $\omega = \sum_{s \in S} \alpha_s y^{(s)} x^{(s)}$ can be computed and stored as an $n \times 1$ vector, so the support vectors need not be stored. 62 / 80

Kernelized SVM Prediction (Contd.)
What if $b \ne 0$?
$y = \mathrm{sign}\left( \omega^T \phi(x) + b \right) = \mathrm{sign}\left( \sum_{s \in S} \alpha_s y^{(s)} \phi(x^{(s)})^T \phi(x) + b \right) = \mathrm{sign}\left( \sum_{s \in S} \alpha_s y^{(s)} K(x^{(s)}, x) + b \right)$ 63 / 80
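
A minimal prediction sketch (hypothetical variable names, not from the slides): alpha, b, and the support vectors are assumed to come from training, and kernel can be any valid kernel function, e.g., the Gaussian kernel above:

    import numpy as np

    def svm_predict(x, support_X, support_y, support_alpha, b, kernel):
        # y = sign( sum_s alpha_s y_s K(x_s, x) + b )
        score = sum(a * ys * kernel(xs, x)
                    for a, ys, xs in zip(support_alpha, support_y, support_X)) + b
        return np.sign(score)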

Soft-Margin SVM We allow some training examples to be misclassified, and some training examples to fall within the margin region 64 / 80

Soft-Margin SVM (Contd.)
Recall that, for the separable case (training loss = 0), the constraints were $y^{(i)}(\omega^T x^{(i)} + b) \ge 1$ for all $i$. For the non-separable case, we relax the above constraints to
$y^{(i)}(\omega^T x^{(i)} + b) \ge 1 - \xi_i, \quad \forall i$
where $\xi_i$ is called a slack variable. 65 / 80

Non-separable case: we allow misclassified training examples, but we want their number to be minimized, by minimizing the sum of the slack variables $\sum_i \xi_i$. Reformulating the SVM problem by introducing slack variables $\xi_i$:
$\min_{\omega, b, \xi} \ \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{m} \xi_i$
$\text{s.t.} \quad y^{(i)}(\omega^T x^{(i)} + b) \ge 1 - \xi_i, \ i = 1, \dots, m; \quad \xi_i \ge 0, \ i = 1, \dots, m$
The parameter $C$ controls the relative weighting between the following two goals:
Small $C$: $\|\omega\|^2 / 2$ dominates; prefer large margins, but allow a potentially large number of misclassified training examples.
Large $C$: $C \sum_{i=1}^{m} \xi_i$ dominates; prefer a small number of misclassified examples, at the expense of a small margin. 66 / 80

Lagrangian:
$L(\omega, b, \xi, \alpha, r) = \frac{1}{2} \omega^T \omega + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(\omega^T x^{(i)} + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i$
KKT conditions (the optimal values of $\omega$, $b$, $\xi$, $\alpha$, and $r$ should satisfy the following):
$\nabla_{\omega} L(\omega, b, \xi, \alpha, r) = 0 \ \Rightarrow \ \omega = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$
$\frac{\partial}{\partial b} L(\omega, b, \xi, \alpha, r) = 0 \ \Rightarrow \ \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
$\frac{\partial}{\partial \xi_i} L(\omega, b, \xi, \alpha, r) = 0 \ \Rightarrow \ \alpha_i + r_i = C, \ \forall i$
$\alpha_i, r_i, \xi_i \ge 0, \ \forall i$
$y^{(i)}(\omega^T x^{(i)} + b) + \xi_i - 1 \ge 0, \ \forall i$
$\alpha_i \left( y^{(i)}(\omega^T x^{(i)} + b) + \xi_i - 1 \right) = 0, \ \forall i$
$r_i \xi_i = 0, \ \forall i$ 67 / 80

Dual problem (derived according to the KKT conditions):
$\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$
$\text{s.t.} \quad 0 \le \alpha_i \le C, \ i = 1, \dots, m; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
Use existing QP solvers to address the above optimization problem. 68 / 80
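
Compared with the hard-margin dual sketched earlier, only the inequality constraints change: each alpha_i is now boxed into [0, C]. A hedged sketch of that modification in the same QP setup (hypothetical helper):

    import numpy as np

    def soft_margin_box_constraints(m, C):
        # Encode 0 <= alpha_i <= C as G alpha <= h, stacking -alpha <= 0 on top of alpha <= C.
        # P, q, A, b stay the same as in the hard-margin dual.
        G = np.vstack([-np.eye(m), np.eye(m)])
        h = np.concatenate([np.zeros(m), C * np.ones(m)])
        return G, h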

Given the optimal values of $\alpha_i$ ($i = 1, \dots, m$), how do we calculate the optimal values of $\omega$ and $b$? Use the KKT conditions! 69 / 80

By solving the above optimization problem, we get the optimal values of $\alpha_i$ ($i = 1, \dots, m$). How do we calculate the optimal values of $\omega$ and $b$? According to the KKT conditions, we have
$\omega = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$
How about $b$? 70 / 80

Since $\alpha_i + r_i = C$ for all $i$, we have $r_i = C - \alpha_i$ for all $i$. Since $r_i \xi_i = 0$ for all $i$, we have $(C - \alpha_i)\xi_i = 0$ for all $i$. For $i$ such that $\alpha_i \ne C$, we have $\xi_i = 0$, and thus $\alpha_i \left( y^{(i)}(\omega^T x^{(i)} + b) - 1 \right) = 0$ for those $i$'s. 71 / 80

For $i$ such that $0 < \alpha_i < C$, we have $y^{(i)}(\omega^T x^{(i)} + b) = 1$ for those $i$'s. Hence, for $\{i : 0 < \alpha_i < C\}$,
$\omega^T x^{(i)} + b = y^{(i)}$
We finally calculate $b$ as
$b = \frac{\sum_{i : 0 < \alpha_i < C} \left( y^{(i)} - \omega^T x^{(i)} \right)}{\sum_{i=1}^{m} \mathbb{1}(0 < \alpha_i < C)}$ 72 / 80

Some useful corollaries according to the KKT conditions:
When $\alpha_i = 0$: $y^{(i)}(\omega^T x^{(i)} + b) \ge 1$
When $\alpha_i = C$: $y^{(i)}(\omega^T x^{(i)} + b) \le 1$
When $0 < \alpha_i < C$: $y^{(i)}(\omega^T x^{(i)} + b) = 1$ 73 / 80

Coordinate Ascent Algorithm
Consider the following unconstrained optimization problem:
$\max_{\alpha} \ f(\alpha_1, \alpha_2, \dots, \alpha_m)$
Repeat the following step until convergence: for each $i$,
$\alpha_i := \arg\max_{\hat{\alpha}_i} f(\alpha_1, \dots, \alpha_{i-1}, \hat{\alpha}_i, \alpha_{i+1}, \dots, \alpha_m)$
That is, for some $\alpha_i$, fix the other variables and re-optimize $f(\alpha)$ with respect to $\alpha_i$. 74 / 80
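
A toy sketch of coordinate ascent on a simple concave function (not the SVM dual; purely illustrative):

    import numpy as np

    # f(a1, a2) = -(a1 - 1)^2 - (a2 + 2)^2 - a1*a2 is concave, so coordinate ascent converges.
    def argmax_coordinate(alpha, i):
        # Closed-form maximizer of f over one coordinate with the other held fixed:
        #   df/da1 = -2(a1 - 1) - a2 = 0  =>  a1 = 1 - a2/2
        #   df/da2 = -2(a2 + 2) - a1 = 0  =>  a2 = -2 - a1/2
        return 1 - alpha[1] / 2 if i == 0 else -2 - alpha[0] / 2

    alpha = np.zeros(2)
    for _ in range(50):              # repeat until (approximate) convergence
        for i in range(2):
            alpha[i] = argmax_coordinate(alpha, i)
    print(alpha)                     # approaches the joint maximizer (8/3, -10/3)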

Sequential Minimal Optimization (SMO) Algorithm
The coordinate ascent algorithm cannot be applied directly, since $\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$ (changing a single $\alpha_i$ would violate this constraint). The basic idea of SMO is to repeat the following steps until convergence:
Select some pair $\alpha_i$ and $\alpha_j$ to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
Re-optimize $G(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while holding all the other $\alpha_k$'s ($k \ne i, j$) fixed. 75 / 80

Take $\alpha_1$ and $\alpha_2$ for example:
$\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m} \alpha_i y^{(i)} = \zeta \ \Rightarrow \ \alpha_1 = (\zeta - \alpha_2 y^{(2)}) y^{(1)}$
Given $\alpha_1$, $\alpha_2$ is bounded by $L$ and $H$ such that $L \le \alpha_2 \le H$:
If $y^{(1)} y^{(2)} = -1$: $H = \min(C, C + \alpha_2 - \alpha_1)$ and $L = \max(0, \alpha_2 - \alpha_1)$
If $y^{(1)} y^{(2)} = 1$: $H = \min(C, \alpha_1 + \alpha_2)$ and $L = \max(0, \alpha_1 + \alpha_2 - C)$ 76 / 80

Our objective function becomes
$f(\alpha) = f(\alpha_1, \alpha_2, \dots, \alpha_m) = f((\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \alpha_3, \dots, \alpha_m)$
so we solve the following optimization problem:
$\max_{\alpha_2} \ f((\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \alpha_3, \dots, \alpha_m) \quad \text{s.t.} \quad L \le \alpha_2 \le H$
where $\{\alpha_3, \dots, \alpha_m\}$ are treated as constants. 77 / 80

Solving $\frac{\partial}{\partial \alpha_2} f((\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2) = 0$ gives the updating rule for $\alpha_2$ (without considering the corresponding constraints):
$\alpha_2^{+} = \alpha_2 + \frac{y^{(2)}(E_1 - E_2)}{K_{11} - 2 K_{12} + K_{22}}$
where $K_{i,j} = \langle x^{(i)}, x^{(j)} \rangle$, $f_i = \sum_{j=1}^{m} y^{(j)} \alpha_j K_{i,j} + b$, $E_i = f_i - y^{(i)}$, and $V_i = \sum_{j=3}^{m} y^{(j)} \alpha_j K_{i,j} = f_i - \sum_{j=1}^{2} y^{(j)} \alpha_j K_{i,j} - b$.
Considering $L \le \alpha_2 \le H$, we have
$\alpha_2 = \begin{cases} H & \text{if } \alpha_2^{+} > H \\ \alpha_2^{+} & \text{if } L \le \alpha_2^{+} \le H \\ L & \text{if } \alpha_2^{+} < L \end{cases}$ 78 / 80
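
A hedged code sketch of this single pair update (variable names follow the slide: E1, E2 are the prediction errors and K11, K12, K22 the kernel values; this is only the inner step, not Platt's full algorithm with its selection heuristics):

    def smo_pair_update(alpha1, alpha2, y1, y2, E1, E2, K11, K12, K22, C):
        # Feasible interval [L, H] for alpha2 given the equality constraint (previous slide)
        if y1 != y2:
            L, H = max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
        else:
            L, H = max(0.0, alpha1 + alpha2 - C), min(C, alpha1 + alpha2)

        eta = K11 - 2 * K12 + K22          # curvature along the constraint line
        if eta <= 0:
            return alpha1, alpha2          # degenerate direction; skip this pair

        alpha2_new = alpha2 + y2 * (E1 - E2) / eta
        alpha2_new = min(H, max(L, alpha2_new))                  # clip to [L, H]
        alpha1_new = alpha1 + y1 * y2 * (alpha2 - alpha2_new)    # keep y1*a1 + y2*a2 fixed
        return alpha1_new, alpha2_new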

How about $b$?
When $\alpha_1 \in (0, C)$: $b^{+} = (\alpha_1 - \alpha_1^{+}) y^{(1)} K_{11} + (\alpha_2 - \alpha_2^{+}) y^{(2)} K_{12} - E_1 + b = b_1$
When $\alpha_2 \in (0, C)$: $b^{+} = (\alpha_1 - \alpha_1^{+}) y^{(1)} K_{12} + (\alpha_2 - \alpha_2^{+}) y^{(2)} K_{22} - E_2 + b = b_2$
When $\alpha_1, \alpha_2 \in (0, C)$: $b^{+} = b_1 = b_2$
When $\alpha_1, \alpha_2 \in \{0, C\}$: $b^{+} = (b_1 + b_2)/2$
More details can be found in J. Platt, "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines," Microsoft Research Technical Report, 1998. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-98-14.pdf 79 / 80

Thanks! Q & A 80 / 80