Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

Machine Learning, Lecture 6: Support Vector Machine
Feng Li (fli@sdu.edu.cn, https://funglee.github.io)
School of Computer Science and Technology, Shandong University, Fall 2018

Warm Up (slides 2-10; no text content was transcribed) 2-10 / 80

Hyperplane
Separates an n-dimensional space into two half-spaces. Defined by an outward-pointing normal vector $\omega \in \mathbb{R}^n$.
Assumption: the hyperplane passes through the origin. If not, we add a bias term $b$, and then need both $\omega$ and $b$ to define it. $b > 0$ means moving the hyperplane parallel to $\omega$ ($b < 0$ means moving it in the opposite direction). 11 / 80

Support Vector Machine
A hyperplane-based linear classifier defined by $\omega$ and $b$. Prediction rule: $y = \mathrm{sign}(\omega^T x + b)$.
Given: training data $\{(x^{(i)}, y^{(i)})\}_{i=1,\dots,m}$. Goal: learn $\omega$ and $b$ that achieve the maximum margin.
For now, assume the entire training set is correctly classified by $(\omega, b)$, i.e., zero loss on the training examples (non-zero loss later). 12 / 80

Margin (Contd.)
Hyperplane: $w^T x + b = 0$, where $w$ is the normal vector. The margin $\gamma^{(i)}$ is the distance between $x^{(i)}$ and the hyperplane. Since $x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|}$ lies on the hyperplane,
$w^T \left( x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|} \right) + b = 0 \;\Rightarrow\; \gamma^{(i)} = \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|}$ 13 / 80

Margin (Contd.)
Hyperplane: $w^T x + b = 0$, where $w$ is the normal vector. The margin $\gamma^{(i)}$ is the distance between $x^{(i)}$ and the hyperplane:
$w^T \left( x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|} \right) + b = 0 \;\Rightarrow\; \gamma^{(i)} = \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|}$
Now the margin is signed: if $y^{(i)} = 1$, then $\gamma^{(i)} \ge 0$; otherwise $\gamma^{(i)} < 0$. 14 / 80

Margin (Contd.)
Geometric margin:
$\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right)$
Scaling $(w, b)$ does not change $\gamma^{(i)}$. With respect to the whole training set, the margin is written as
$\gamma = \min_i \gamma^{(i)}$ 15 / 80
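
As a quick numerical illustration (a minimal sketch; the hyperplane and sample below are made up for the example), the geometric margin and its scale invariance can be checked directly from the definition:

    import numpy as np

    # Hypothetical hyperplane (w, b) and one labeled sample
    w = np.array([2.0, 1.0])
    b = -1.0
    x = np.array([1.5, 2.0])
    y = 1  # label in {-1, +1}

    # Geometric margin: gamma = y * ((w/||w||)^T x + b/||w||)
    gamma = y * (w @ x + b) / np.linalg.norm(w)
    print(gamma)

    # Scaling (w, b) by a positive constant leaves gamma unchanged
    gamma_scaled = y * ((10 * w) @ x + 10 * b) / np.linalg.norm(10 * w)
    print(np.isclose(gamma, gamma_scaled))  # True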

Maximizing The Margin
The hyperplane serves as a decision boundary differentiating positive labels from negative labels. The larger the margin, i.e., the farther a data sample is from the hyperplane, the more confident the decision. There exist an infinite number of separating hyperplanes, but which one is the best?
$\max_{\omega, b} \min_i \{\gamma^{(i)}\}$ 16 / 80

Maximizing The Margin (Contd.)
There exist an infinite number of separating hyperplanes, but which one is the best?
$\max_{\omega, b} \min_i \{\gamma^{(i)}\}$
It is equivalent to
$\max_{\gamma, \omega, b} \ \gamma \quad \text{s.t.} \quad \gamma^{(i)} \ge \gamma, \ \forall i$
Since $\gamma^{(i)} = y^{(i)} \left( \left( \frac{\omega}{\|\omega\|} \right)^T x^{(i)} + \frac{b}{\|\omega\|} \right)$, the constraint becomes $y^{(i)}(\omega^T x^{(i)} + b) \ge \gamma \|\omega\|, \ \forall i$ 17 / 80

Maximizing The Margin (Contd.)
Formally,
$\max_{\gamma, \omega, b} \ \gamma \quad \text{s.t.} \quad y^{(i)}(\omega^T x^{(i)} + b) \ge \gamma \|\omega\|, \ \forall i$ 18 / 80

Maximizing The Margin (Contd.)
Scaling $(\omega, b)$ such that $\min_i \{y^{(i)}(\omega^T x^{(i)} + b)\} = 1$, the margin becomes $1/\|\omega\|$:
$\gamma = \min_i \left\{ y^{(i)} \left( \left( \frac{\omega}{\|\omega\|} \right)^T x^{(i)} + \frac{b}{\|\omega\|} \right) \right\} = \frac{1}{\|\omega\|}$
The problem becomes
$\max_{\omega, b} \ 1/\|\omega\| \quad \text{s.t.} \quad y^{(i)}(\omega^T x^{(i)} + b) \ge 1, \ \forall i$ 19 / 80

Support Vector Machine (Primal Form)
Maximizing $1/\|\omega\|$ is equivalent to minimizing $\|\omega\|^2 = \omega^T \omega$:
$\min_{\omega, b} \ \omega^T \omega \quad \text{s.t.} \quad y^{(i)}(\omega^T x^{(i)} + b) \ge 1, \ \forall i$
This is a quadratic programming (QP) problem! Generic methods include the interior point method (https://en.wikipedia.org/wiki/interior-point_method), the active set method (https://en.wikipedia.org/wiki/active_set_method), the gradient projection method (http://www.ifp.illinois.edu/~angelia/l13_constrained_gradient.pdf), etc. However, generic QP solvers are inefficient, especially in the face of a large training set. 20 / 80

Convex Optimization Review
Optimization problems, Lagrangian duality, KKT conditions, convex optimization. 21 / 80

Optimization Problems
Consider the following optimization problem:
$\min_{\omega} \ f(\omega) \quad \text{s.t.} \quad g_i(\omega) \le 0, \ i = 1, \dots, k; \quad h_j(\omega) = 0, \ j = 1, \dots, l$
with variable $\omega \in \mathbb{R}^n$, domain $D = \bigcap_{i=1}^{k} \mathrm{dom}\, g_i \cap \bigcap_{j=1}^{l} \mathrm{dom}\, h_j$, and optimal value $p^*$.
Objective function $f(\omega)$; $k$ inequality constraints $g_i(\omega) \le 0$, $i = 1, \dots, k$; $l$ equality constraints $h_j(\omega) = 0$, $j = 1, \dots, l$. 22 / 80

Lagrangian
Lagrangian $L : \mathbb{R}^n \times \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}$, with $\mathrm{dom}\, L = D \times \mathbb{R}^k \times \mathbb{R}^l$:
$L(\omega, \alpha, \beta) = f(\omega) + \sum_{i=1}^{k} \alpha_i g_i(\omega) + \sum_{j=1}^{l} \beta_j h_j(\omega)$
A weighted sum of the objective and constraint functions. $\alpha_i$ is the Lagrange multiplier associated with $g_i(\omega) \le 0$; $\beta_j$ is the Lagrange multiplier associated with $h_j(\omega) = 0$. 23 / 80

Lagrange Dual Function
The Lagrange dual function $G : \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}$:
$G(\alpha, \beta) = \inf_{\omega \in D} L(\omega, \alpha, \beta) = \inf_{\omega \in D} \left( f(\omega) + \sum_{i=1}^{k} \alpha_i g_i(\omega) + \sum_{j=1}^{l} \beta_j h_j(\omega) \right)$
$G$ is concave, and can be $-\infty$ for some $\alpha, \beta$. 24 / 80

The Lower Bounds Property
If $\alpha \succeq 0$, then $G(\alpha, \beta) \le p^*$, where $p^*$ is the optimal value of the primal problem.
Proof: If $\tilde{\omega}$ is feasible and $\alpha \succeq 0$, then
$f(\tilde{\omega}) \ge L(\tilde{\omega}, \alpha, \beta) \ge \inf_{\omega \in D} L(\omega, \alpha, \beta) = G(\alpha, \beta)$
Minimizing over all feasible $\tilde{\omega}$ gives $p^* \ge G(\alpha, \beta)$. 25 / 80

Lagrange Dual Problem
$\max_{\alpha, \beta} \ G(\alpha, \beta) \quad \text{s.t.} \quad \alpha_i \ge 0, \ i = 1, \dots, k$
Finds the best lower bound on $p^*$ obtainable from the Lagrange dual function. It is a convex optimization problem (optimal value denoted by $d^*$). $(\alpha, \beta)$ are dual feasible if $\alpha \succeq 0$, $(\alpha, \beta) \in \mathrm{dom}\, G$, and $G(\alpha, \beta) > -\infty$. Often simplified by making the implicit constraint $(\alpha, \beta) \in \mathrm{dom}\, G$ explicit. 26 / 80
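
As a small worked example (not from the slides), take the one-variable problem $\min_{\omega} \omega^2$ subject to $g(\omega) = 1 - \omega \le 0$, whose optimal value is $p^* = 1$ at $\omega^* = 1$:

$L(\omega, \alpha) = \omega^2 + \alpha (1 - \omega), \qquad G(\alpha) = \inf_{\omega} L(\omega, \alpha) = \alpha - \frac{\alpha^2}{4} \quad (\text{attained at } \omega = \alpha/2)$

$\max_{\alpha \ge 0} G(\alpha) = 1, \quad \text{attained at } \alpha^* = 2$

Every $\alpha \ge 0$ gives a lower bound $G(\alpha) \le p^* = 1$, and the best bound attains $d^* = p^*$ (strong duality), with $\alpha^* g(\omega^*) = 0$ at $\omega^* = 1$.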

Weak Duality
Weak duality: $d^* \le p^*$. Always holds. Can be used to find nontrivial lower bounds for difficult problems. Optimal duality gap: $p^* - d^*$. 27 / 80

Complementary Slackness
Let $\omega^*$ be a primal optimal point and $(\alpha^*, \beta^*)$ be a dual optimal point. If strong duality holds, then
$\alpha_i^* g_i(\omega^*) = 0, \quad i = 1, 2, \dots, k$ 28 / 80

Complementary Slackness (Proof)
We have
$f(\omega^*) = G(\alpha^*, \beta^*) = \inf_{\omega} \left( f(\omega) + \sum_{i=1}^{k} \alpha_i^* g_i(\omega) + \sum_{j=1}^{l} \beta_j^* h_j(\omega) \right) \le f(\omega^*) + \sum_{i=1}^{k} \alpha_i^* g_i(\omega^*) + \sum_{j=1}^{l} \beta_j^* h_j(\omega^*) \le f(\omega^*)$
The last two inequalities hold with equality, so (using $h_j(\omega^*) = 0$) we have $\sum_{i=1}^{k} \alpha_i^* g_i(\omega^*) = 0$. Since each term $\alpha_i^* g_i(\omega^*)$ is nonpositive, we thus conclude
$\alpha_i^* g_i(\omega^*) = 0, \quad i = 1, 2, \dots, k$ 29 / 80

Complementary Slackness (Contd.)
Since the inequality in the third line holds with equality,
$f(\omega^*) = G(\alpha^*, \beta^*) = \inf_{\omega} \left( f(\omega) + \sum_{i=1}^{k} \alpha_i^* g_i(\omega) + \sum_{j=1}^{l} \beta_j^* h_j(\omega) \right) = f(\omega^*) + \sum_{i=1}^{k} \alpha_i^* g_i(\omega^*) + \sum_{j=1}^{l} \beta_j^* h_j(\omega^*) = f(\omega^*)$
$\omega^*$ actually minimizes $L(\omega, \alpha^*, \beta^*)$ over $\omega$, where
$L(\omega, \alpha^*, \beta^*) = f(\omega) + \sum_{i=1}^{k} \alpha_i^* g_i(\omega) + \sum_{j=1}^{l} \beta_j^* h_j(\omega)$ 30 / 80

Karush-Kuhn-Tucker (KKT) Conditions
Let $\omega^*$ and $(\alpha^*, \beta^*)$ be any primal and dual optimal points with zero duality gap (i.e., strong duality holds). The following conditions must be satisfied:
Stationarity (the gradient of the Lagrangian with respect to $\omega$ vanishes): $\nabla f(\omega^*) + \sum_{i=1}^{k} \alpha_i^* \nabla g_i(\omega^*) + \sum_{j=1}^{l} \beta_j^* \nabla h_j(\omega^*) = 0$
Primal feasibility: $g_i(\omega^*) \le 0$, $i = 1, \dots, k$; $h_j(\omega^*) = 0$, $j = 1, \dots, l$
Dual feasibility: $\alpha_i^* \ge 0$, $i = 1, \dots, k$
Complementary slackness: $\alpha_i^* g_i(\omega^*) = 0$, $i = 1, \dots, k$ 31 / 80

Convex Optimization Problem
Problem formulation:
$\min_{\omega} \ f(\omega) \quad \text{s.t.} \quad g_i(\omega) \le 0, \ i = 1, \dots, k; \quad A\omega - b = 0$
where $f$ and the $g_i$ are convex, $A$ is an $l \times n$ matrix, and $b \in \mathbb{R}^l$. 32 / 80

Weak Duality vs. Strong Duality
Weak duality: $d^* \le p^*$. Always holds. Can be used to find nontrivial lower bounds for difficult problems.
Strong duality: $d^* = p^*$. Does not hold in general, but (usually) holds for convex problems. Conditions that guarantee strong duality in convex problems are called constraint qualifications. 33 / 80

Slater's Constraint Qualification
Strong duality holds for a convex problem
$\min_{\omega} \ f(\omega) \quad \text{s.t.} \quad g_i(\omega) \le 0, \ i = 1, \dots, k; \quad A\omega - b = 0$
if it is strictly feasible, i.e.,
$\exists \omega \in \mathrm{relint}\, D : \ g_i(\omega) < 0, \ i = 1, \dots, k, \quad A\omega = b$ 34 / 80

KKT Conditions for Convex Optimization
For a convex optimization problem, the KKT conditions are also sufficient for points to be primal and dual optimal. Suppose $\tilde{\omega}$, $\tilde{\alpha}$, and $\tilde{\beta}$ are any points satisfying the following KKT conditions:
$g_i(\tilde{\omega}) \le 0, \ i = 1, \dots, k$; $\quad h_j(\tilde{\omega}) = 0, \ j = 1, \dots, l$; $\quad \tilde{\alpha}_i \ge 0, \ i = 1, \dots, k$; $\quad \tilde{\alpha}_i g_i(\tilde{\omega}) = 0, \ i = 1, \dots, k$
$\nabla f(\tilde{\omega}) + \sum_{i=1}^{k} \tilde{\alpha}_i \nabla g_i(\tilde{\omega}) + \sum_{j=1}^{l} \tilde{\beta}_j \nabla h_j(\tilde{\omega}) = 0$
Then they are primal and dual optimal, with strong duality holding. 35 / 80

Optimal Margin Classifier
Primal (convex) problem formulation:
$\min_{\omega, b} \ \frac{1}{2} \|\omega\|^2 \quad \text{s.t.} \quad y^{(i)}(\omega^T x^{(i)} + b) \ge 1, \ \forall i$
The Lagrangian:
$L(\omega, b, \alpha) = \frac{1}{2} \|\omega\|^2 - \sum_{i=1}^{m} \alpha_i \left( y^{(i)}(\omega^T x^{(i)} + b) - 1 \right)$
The Lagrange dual function:
$G(\alpha) = \inf_{\omega, b} L(\omega, b, \alpha)$ 36 / 80

Optimal Margin Classifier
Dual problem formulation:
$\max_{\alpha} \ \inf_{\omega, b} L(\omega, b, \alpha) \quad \text{s.t.} \quad \alpha_i \ge 0, \ \forall i$
where the Lagrangian is
$L(\omega, b, \alpha) = \frac{1}{2} \|\omega\|^2 - \sum_{i=1}^{m} \alpha_i \left( y^{(i)}(\omega^T x^{(i)} + b) - 1 \right)$
and the Lagrange dual function is $G(\alpha) = \inf_{\omega, b} L(\omega, b, \alpha)$. 37 / 80

Optimal Margin Classifier (Contd.)
Dual problem formulation:
$\max_{\alpha} \ G(\alpha) = \inf_{\omega, b} L(\omega, b, \alpha) \quad \text{s.t.} \quad \alpha_i \ge 0, \ \forall i$ 38 / 80

Optimal Margin Classifier (Contd.)
According to the KKT conditions, minimizing $L(\omega, b, \alpha)$ over $\omega$ and $b$:
$\nabla_{\omega} L(\omega, b, \alpha) = \omega - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \ \Rightarrow \ \omega = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$
$\frac{\partial}{\partial b} L(\omega, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
The Lagrange dual function becomes
$G(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)}$
with $\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$ and $\alpha_i \ge 0$. 39 / 80

Optimal Margin Classifier (Contd.)
Dual problem formulation:
$\max_{\alpha} \ G(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)}$
$\text{s.t.} \quad \alpha_i \ge 0, \ \forall i; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
It is a convex optimization problem, so strong duality ($p^* = d^*$) holds and the KKT conditions are respected. It is a quadratic programming problem in $\alpha$, and several off-the-shelf solvers exist for such QPs, e.g., quadprog (MATLAB), CVXOPT, CPLEX, IPOPT, etc. 40 / 80
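
As a minimal sketch (not from the slides) of handing this dual to one of the off-the-shelf QP solvers listed above, the following uses CVXOPT; X is assumed to be an m x n NumPy array of samples and y a length-m array of +/-1 labels:

    import numpy as np
    from cvxopt import matrix, solvers

    def svm_dual_hard_margin(X, y):
        # Hard-margin SVM dual written as the QP: minimize (1/2) a^T P a - 1^T a
        # subject to -a <= 0 and y^T a = 0, where P_ij = y_i y_j <x_i, x_j>.
        m = X.shape[0]
        P = np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(m)  # tiny ridge for numerical stability
        q = -np.ones(m)
        G = -np.eye(m)                       # encodes alpha_i >= 0
        h = np.zeros(m)
        A = y.astype(float).reshape(1, -1)   # encodes sum_i alpha_i y_i = 0
        b = 0.0
        sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h), matrix(A), matrix(b))
        return np.ravel(sol['x'])            # optimal alpha

The maximization is turned into a minimization by negating the objective; the resulting alpha is then plugged into the KKT conditions to recover the separating hyperplane, as on the next slides.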

SVM: The Solution
Once we have the optimal $\alpha^*$,
$\omega^* = \sum_{i=1}^{m} \alpha_i^* y^{(i)} x^{(i)}$
Given $\omega^*$, how do we calculate the optimal value of $b$? 41 / 80

SVM: The Solution
Since $\alpha_i^* \left( y^{(i)}(\omega^{*T} x^{(i)} + b^*) - 1 \right) = 0$ for all $i$, we have $y^{(i)}(\omega^{*T} x^{(i)} + b^*) = 1$ for $\{i : \alpha_i^* > 0\}$.
Then, for $i$ such that $\alpha_i^* > 0$ (and since $y^{(i)} \in \{-1, +1\}$), we have
$b^* = y^{(i)} - \omega^{*T} x^{(i)}$
For robustness, we calculate the optimal value of $b$ by taking the average
$b^* = \frac{\sum_{i : \alpha_i^* > 0} \left( y^{(i)} - \omega^{*T} x^{(i)} \right)}{\sum_{i=1}^{m} \mathbb{1}(\alpha_i^* > 0)}$ 42 / 80
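
A short companion sketch (hypothetical helper, not from the slides) that recovers $\omega^*$ and $b^*$ from the dual solution using the formulas above:

    import numpy as np

    def recover_w_b(alpha, X, y, tol=1e-6):
        # Support vectors are the points with alpha_i > 0 (here: > tol, to absorb numerical noise)
        sv = alpha > tol
        w = (alpha[sv] * y[sv]) @ X[sv]     # w* = sum_i alpha_i y_i x_i
        b = np.mean(y[sv] - X[sv] @ w)      # average of y_i - w*^T x_i over support vectors
        return w, b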

SVM: The Solution (Contd.)
Most $\alpha_i$'s in the solution are zero (sparse solution). According to the KKT conditions, for the optimal $\alpha_i$'s,
$\alpha_i \left( 1 - y^{(i)}(\omega^T x^{(i)} + b) \right) = 0$
$\alpha_i$ is non-zero only if $x^{(i)}$ lies on one of the two margin boundaries, i.e., for which $y^{(i)}(\omega^T x^{(i)} + b) = 1$. 43 / 80

SVM: The Solution (Contd.)
These data samples are called support vectors (i.e., the support vectors "support" the margin boundaries). 44 / 80

SVM: The Solution (Contd.)
Redefine $\omega$:
$\omega = \sum_{s \in S} \alpha_s y^{(s)} x^{(s)}$
where $S$ denotes the set of indices of the support vectors. 45 / 80

Kernel Methods
Motivation: linear models (e.g., linear regression, linear SVM, etc.) cannot capture nonlinear patterns in the data.
Kernels make linear models work in nonlinear settings: by mapping data to higher dimensions where it exhibits linear patterns, we can apply the linear model in the new input space. The mapping is equivalent to changing the feature representation. 46 / 80

Feature Mapping
Consider the following binary classification problem: each sample is represented by a single feature $x$. No linear separator exists for this data. 47 / 80

Feature Mapping (Contd.) Now map each example as x {x, x 2 } Each example now has two features ( derived from the old representation) Data now becomes linearly separable in the new representation 48 / 80

Feature Mapping (Contd.)
Another example: each sample is defined by $x = \{x_1, x_2\}$. No linear separator exists for this data. 49 / 80

Feature Mapping (Contd.)
Now map each example as $x = \{x_1, x_2\} \to z = \{x_1^2, \sqrt{2} x_1 x_2, x_2^2\}$. Each example now has three features (derived from the old representation). The data now become linearly separable in the new representation. 50 / 80

Feature Mapping (Contd.)
Consider the following feature mapping $\phi$ for an example $x = \{x_1, \dots, x_n\}$:
$\phi : x \to \{x_1^2, x_2^2, \dots, x_n^2, x_1 x_2, x_1 x_3, \dots, x_1 x_n, \dots, x_{n-1} x_n\}$
It is an example of a quadratic mapping. Each new feature uses a pair of the original features. 51 / 80

Feature Mapping (Contd.)
Problem: the mapping usually causes the number of features to blow up! Computing the mapping itself can be inefficient, especially when the new space is very high dimensional. Storing and using these mappings in later computations can be expensive (e.g., we may have to compute inner products in a very high dimensional space). Using the mapped representation could be inefficient too.
Thankfully, kernels help us avoid both these issues! The mapping does not have to be explicitly computed, and computations with the mapped features remain efficient. 52 / 80

Kernels as High Dimensional Feature Mapping
Let's assume we are given a function $K$ (kernel) that takes as inputs $x$ and $z$:
$K(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)^T (z_1^2, \sqrt{2} z_1 z_2, z_2^2)$
The above function $K$ implicitly defines a mapping $\phi$ to a higher-dimensional space:
$\phi(x) = \{x_1^2, \sqrt{2} x_1 x_2, x_2^2\}$
Simply defining the kernel in a certain way gives a higher-dimensional mapping $\phi$. The mapping does not have to be explicitly computed, and computations with the mapped features remain efficient. 53 / 80
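
A quick numerical check (illustrative vectors only) that this quadratic kernel really is a dot product in the mapped space:

    import numpy as np

    def phi(v):
        # Explicit feature map associated with K(x, z) = (x^T z)^2 for 2-D inputs
        return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    print((x @ z) ** 2)      # kernel value computed in the original 2-D space
    print(phi(x) @ phi(z))   # same value, computed via the explicit 3-D mapping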

Kernels: Formal Definition
Each kernel $K$ has an associated feature mapping $\phi$, which takes an input $x \in \mathcal{X}$ (input space) and maps it to $\mathcal{F}$ (feature space):
$\phi : \mathcal{X} \to \mathcal{F}, \qquad K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
The kernel $K(x, z) = \phi(x)^T \phi(z)$ takes two inputs and gives their similarity in the $\mathcal{F}$ space. $\mathcal{F}$ needs to be a vector space with a dot product defined upon it, also called a Hilbert space.
Can just any function be used as a kernel function? No. It must satisfy Mercer's condition. 54 / 80

Mercer's Condition
For $K$ to be a kernel function, there must exist a Hilbert space $\mathcal{F}$ for which $K$ defines a dot product. The above is true if $K$ is a positive definite function:
$\int \int f(x) K(x, z) f(z)\, dx\, dz > 0 \quad (\forall f \in L_2)$
for all functions $f$ that are square integrable, i.e., $\int f^2(x)\, dx < \infty$. 55 / 80

Mercer's Condition (Contd.)
Let $K_1$ and $K_2$ be two kernel functions; then the following are kernel functions as well:
Direct sum: $K(x, z) = K_1(x, z) + K_2(x, z)$
Scalar product: $K(x, z) = \alpha K_1(x, z)$ (for $\alpha \ge 0$)
Direct product: $K(x, z) = K_1(x, z) K_2(x, z)$
Kernels can also be constructed by composing these rules. 56 / 80

The Kernel Matrix
The kernel function $K$ also defines the kernel matrix over the data (also denoted by $K$). Given $m$ samples $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, the $(i, j)$-th entry of $K$ is defined as
$K_{i,j} = K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^T \phi(x^{(j)})$
$K_{i,j}$: similarity between the $i$-th and $j$-th examples in the feature space $\mathcal{F}$. $K$ is an $m \times m$ matrix of pairwise similarities between samples in the $\mathcal{F}$ space. $K$ is a symmetric matrix, and $K$ is a positive semi-definite matrix. 57 / 80
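
A small sketch (made-up data) that builds the kernel matrix for the Gaussian kernel introduced on the next slide and checks these two properties numerically:

    import numpy as np

    def gaussian_kernel(x, z, sigma=1.0):
        return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))   # 20 samples, 3 features

    K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

    print(np.allclose(K, K.T))                      # symmetric
    print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # positive semi-definite up to rounding error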

Some Examples of Kernels
Linear (trivial) kernel: $K(x, z) = x^T z$
Quadratic kernel: $K(x, z) = (x^T z)^2$ or $(1 + x^T z)^2$
Polynomial kernel (of degree $d$): $K(x, z) = (x^T z)^d$ or $(1 + x^T z)^d$
Gaussian kernel: $K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$
Sigmoid kernel: $K(x, z) = \tanh(\alpha x^T z + c)$ 58 / 80

Using Kernels
Kernels can turn a linear model into a nonlinear one. The kernel $K(x, z)$ represents a dot product in some high-dimensional feature space $\mathcal{F}$, e.g., $K(x, z) = (x^T z)^2$ or $(1 + x^T z)^2$.
Any learning algorithm in which examples appear only as dot products $x^{(i)T} x^{(j)}$ can be kernelized (i.e., made non-linear) by replacing the $x^{(i)T} x^{(j)}$ terms with $\phi(x^{(i)})^T \phi(x^{(j)}) = K(x^{(i)}, x^{(j)})$. Most learning algorithms are like that: SVM, linear regression, etc. Many unsupervised learning algorithms can be kernelized too (e.g., K-means clustering, Principal Component Analysis, etc.). 59 / 80

Kernelized SVM Training
SVM dual Lagrangian:
$\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$
$\text{s.t.} \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0; \quad \alpha_i \ge 0, \ \forall i$ 60 / 80

Kernelized SVM Training (Contd.)
Replacing $\langle x^{(i)}, x^{(j)} \rangle$ with $\phi(x^{(i)})^T \phi(x^{(j)}) = K(x^{(i)}, x^{(j)}) = K_{i,j}$:
$\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j K_{i,j}$
$\text{s.t.} \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0; \quad \alpha_i \ge 0, \ \forall i$
The SVM now learns a linear separator in the kernel-defined feature space $\mathcal{F}$; this corresponds to a non-linear separator in the original space $\mathcal{X}$. 61 / 80

Kernelized SVM Prediction
Prediction for a test example $x$ (assume $b = 0$):
$y = \mathrm{sign}(\omega^T x) = \mathrm{sign}\left( \sum_{s \in S} \alpha_s y^{(s)} x^{(s)T} x \right)$
Replacing each example with its feature-mapped representation ($x \to \phi(x)$):
$y = \mathrm{sign}\left( \sum_{s \in S} \alpha_s y^{(s)} \phi(x^{(s)})^T \phi(x) \right) = \mathrm{sign}\left( \sum_{s \in S} \alpha_s y^{(s)} K(x^{(s)}, x) \right)$
The kernelized SVM needs the support vectors at test time (except when you can write $\phi(x)$ as an explicit, reasonably-sized vector). In the unkernelized version, $\omega = \sum_{s \in S} \alpha_s y^{(s)} x^{(s)}$ can be computed and stored as an $n \times 1$ vector, so the support vectors need not be stored. 62 / 80

Kernelized SVM Prediction (Contd.)
What if $b \ne 0$?
$y = \mathrm{sign}\left( \omega^T \phi(x) + b \right) = \mathrm{sign}\left( \sum_{s \in S} \alpha_s y^{(s)} \phi(x^{(s)})^T \phi(x) + b \right) = \mathrm{sign}\left( \sum_{s \in S} \alpha_s y^{(s)} K(x^{(s)}, x) + b \right)$ 63 / 80
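
A minimal prediction sketch (hypothetical variable names, not from the slides): alpha, b, and the support vectors are assumed to come from training, and kernel can be any valid kernel function, e.g., the Gaussian kernel above:

    import numpy as np

    def svm_predict(x, support_X, support_y, support_alpha, b, kernel):
        # y = sign( sum_s alpha_s y_s K(x_s, x) + b )
        score = sum(a * ys * kernel(xs, x)
                    for a, ys, xs in zip(support_alpha, support_y, support_X)) + b
        return np.sign(score)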

Soft-Margin SVM We allow some training examples to be misclassified, and some training examples to fall within the margin region 64 / 80

Soft-Margin SVM (Contd.)
Recall that, for the separable case (training loss = 0), the constraints were $y^{(i)}(\omega^T x^{(i)} + b) \ge 1$ for all $i$. For the non-separable case, we relax the above constraints to
$y^{(i)}(\omega^T x^{(i)} + b) \ge 1 - \xi_i, \quad \forall i$
where $\xi_i$ is called a slack variable. 65 / 80

Non-separable case: we allow misclassified training examples, but we want their number to be minimized, by minimizing the sum of the slack variables $\sum_i \xi_i$. Reformulating the SVM problem by introducing slack variables $\xi_i$:
$\min_{\omega, b, \xi} \ \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{m} \xi_i$
$\text{s.t.} \quad y^{(i)}(\omega^T x^{(i)} + b) \ge 1 - \xi_i, \ i = 1, \dots, m; \quad \xi_i \ge 0, \ i = 1, \dots, m$
The parameter $C$ controls the relative weighting between the following two goals:
Small $C$: $\|\omega\|^2 / 2$ dominates; prefer large margins, but allow a potentially large number of misclassified training examples.
Large $C$: $C \sum_{i=1}^{m} \xi_i$ dominates; prefer a small number of misclassified examples, at the expense of a small margin. 66 / 80

Lagrangian:
$L(\omega, b, \xi, \alpha, r) = \frac{1}{2} \omega^T \omega + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(\omega^T x^{(i)} + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i$
KKT conditions (the optimal values of $\omega$, $b$, $\xi$, $\alpha$, and $r$ should satisfy the following):
$\nabla_{\omega} L(\omega, b, \xi, \alpha, r) = 0 \ \Rightarrow \ \omega = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$
$\frac{\partial}{\partial b} L(\omega, b, \xi, \alpha, r) = 0 \ \Rightarrow \ \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
$\frac{\partial}{\partial \xi_i} L(\omega, b, \xi, \alpha, r) = 0 \ \Rightarrow \ \alpha_i + r_i = C, \ \forall i$
$\alpha_i, r_i, \xi_i \ge 0, \ \forall i$
$y^{(i)}(\omega^T x^{(i)} + b) + \xi_i - 1 \ge 0, \ \forall i$
$\alpha_i \left( y^{(i)}(\omega^T x^{(i)} + b) + \xi_i - 1 \right) = 0, \ \forall i$
$r_i \xi_i = 0, \ \forall i$ 67 / 80

Dual problem (derived according to the KKT conditions):
$\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$
$\text{s.t.} \quad 0 \le \alpha_i \le C, \ i = 1, \dots, m; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$
Use existing QP solvers to address the above optimization problem. 68 / 80
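
Compared with the hard-margin dual sketched earlier, only the inequality constraints change: each alpha_i is now boxed into [0, C]. A hedged sketch of that modification in the same QP setup (hypothetical helper):

    import numpy as np

    def soft_margin_box_constraints(m, C):
        # Encode 0 <= alpha_i <= C as G alpha <= h, stacking -alpha <= 0 on top of alpha <= C.
        # P, q, A, b stay the same as in the hard-margin dual.
        G = np.vstack([-np.eye(m), np.eye(m)])
        h = np.concatenate([np.zeros(m), C * np.ones(m)])
        return G, h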

Given the optimal values of $\alpha_i$ ($i = 1, \dots, m$), how do we calculate the optimal values of $\omega$ and $b$? Use the KKT conditions! 69 / 80

By solving the above optimization problem, we get the optimal values of $\alpha_i$ ($i = 1, \dots, m$). How do we calculate the optimal values of $\omega$ and $b$? According to the KKT conditions, we have
$\omega = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$
How about $b$? 70 / 80

Since $\alpha_i + r_i = C$ for all $i$, we have $r_i = C - \alpha_i$ for all $i$. Since $r_i \xi_i = 0$ for all $i$, we have $(C - \alpha_i)\xi_i = 0$ for all $i$. For $i$ such that $\alpha_i \ne C$, we have $\xi_i = 0$, and thus $\alpha_i \left( y^{(i)}(\omega^T x^{(i)} + b) - 1 \right) = 0$ for those $i$'s. 71 / 80

For $i$ such that $0 < \alpha_i < C$, we have $y^{(i)}(\omega^T x^{(i)} + b) = 1$ for those $i$'s. Hence, for $\{i : 0 < \alpha_i < C\}$,
$\omega^T x^{(i)} + b = y^{(i)}$
We finally calculate $b$ as
$b = \frac{\sum_{i : 0 < \alpha_i < C} \left( y^{(i)} - \omega^T x^{(i)} \right)}{\sum_{i=1}^{m} \mathbb{1}(0 < \alpha_i < C)}$ 72 / 80

Some useful corollaries according to the KKT conditions:
When $\alpha_i = 0$: $y^{(i)}(\omega^T x^{(i)} + b) \ge 1$
When $\alpha_i = C$: $y^{(i)}(\omega^T x^{(i)} + b) \le 1$
When $0 < \alpha_i < C$: $y^{(i)}(\omega^T x^{(i)} + b) = 1$ 73 / 80

Coordinate Ascent Algorithm
Consider the following unconstrained optimization problem:
$\max_{\alpha} \ f(\alpha_1, \alpha_2, \dots, \alpha_m)$
Repeat the following step until convergence: for each $i$,
$\alpha_i := \arg\max_{\hat{\alpha}_i} f(\alpha_1, \dots, \alpha_{i-1}, \hat{\alpha}_i, \alpha_{i+1}, \dots, \alpha_m)$
That is, for some $\alpha_i$, fix the other variables and re-optimize $f(\alpha)$ with respect to $\alpha_i$. 74 / 80
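
A toy sketch of coordinate ascent on a simple concave function (not the SVM dual; purely illustrative):

    import numpy as np

    # f(a1, a2) = -(a1 - 1)^2 - (a2 + 2)^2 - a1*a2 is concave, so coordinate ascent converges.
    def argmax_coordinate(alpha, i):
        # Closed-form maximizer of f over one coordinate with the other held fixed:
        #   df/da1 = -2(a1 - 1) - a2 = 0  =>  a1 = 1 - a2/2
        #   df/da2 = -2(a2 + 2) - a1 = 0  =>  a2 = -2 - a1/2
        return 1 - alpha[1] / 2 if i == 0 else -2 - alpha[0] / 2

    alpha = np.zeros(2)
    for _ in range(50):              # repeat until (approximate) convergence
        for i in range(2):
            alpha[i] = argmax_coordinate(alpha, i)
    print(alpha)                     # approaches the joint maximizer (8/3, -10/3)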

Sequential Minimal Optimization (SMO) Algorithm
The coordinate ascent algorithm cannot be applied directly, since $\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$ (changing a single $\alpha_i$ would violate this constraint). The basic idea of SMO is to repeat the following steps until convergence:
Select some pair $\alpha_i$ and $\alpha_j$ to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
Re-optimize $G(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while holding all the other $\alpha_k$'s ($k \ne i, j$) fixed. 75 / 80

Take $\alpha_1$ and $\alpha_2$ for example:
$\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m} \alpha_i y^{(i)} = \zeta \ \Rightarrow \ \alpha_1 = (\zeta - \alpha_2 y^{(2)}) y^{(1)}$
Given $\alpha_1$, $\alpha_2$ is bounded by $L$ and $H$ such that $L \le \alpha_2 \le H$:
If $y^{(1)} y^{(2)} = -1$: $H = \min(C, C + \alpha_2 - \alpha_1)$ and $L = \max(0, \alpha_2 - \alpha_1)$
If $y^{(1)} y^{(2)} = 1$: $H = \min(C, \alpha_1 + \alpha_2)$ and $L = \max(0, \alpha_1 + \alpha_2 - C)$ 76 / 80

Our objective function becomes
$f(\alpha) = f(\alpha_1, \alpha_2, \dots, \alpha_m) = f((\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \alpha_3, \dots, \alpha_m)$
so we solve the following optimization problem:
$\max_{\alpha_2} \ f((\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \alpha_3, \dots, \alpha_m) \quad \text{s.t.} \quad L \le \alpha_2 \le H$
where $\{\alpha_3, \dots, \alpha_m\}$ are treated as constants. 77 / 80

Solving $\frac{\partial}{\partial \alpha_2} f((\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2) = 0$ gives the updating rule for $\alpha_2$ (without considering the corresponding constraints):
$\alpha_2^{+} = \alpha_2 + \frac{y^{(2)}(E_1 - E_2)}{K_{11} - 2 K_{12} + K_{22}}$
where $K_{i,j} = \langle x^{(i)}, x^{(j)} \rangle$, $f_i = \sum_{j=1}^{m} y^{(j)} \alpha_j K_{i,j} + b$, $E_i = f_i - y^{(i)}$, and $V_i = \sum_{j=3}^{m} y^{(j)} \alpha_j K_{i,j} = f_i - \sum_{j=1}^{2} y^{(j)} \alpha_j K_{i,j} - b$.
Considering $L \le \alpha_2 \le H$, we have
$\alpha_2 = \begin{cases} H & \text{if } \alpha_2^{+} > H \\ \alpha_2^{+} & \text{if } L \le \alpha_2^{+} \le H \\ L & \text{if } \alpha_2^{+} < L \end{cases}$ 78 / 80
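
A hedged code sketch of this single pair update (variable names follow the slide: E1, E2 are the prediction errors and K11, K12, K22 the kernel values; this is only the inner step, not Platt's full algorithm with its selection heuristics):

    def smo_pair_update(alpha1, alpha2, y1, y2, E1, E2, K11, K12, K22, C):
        # Feasible interval [L, H] for alpha2 given the equality constraint (previous slide)
        if y1 != y2:
            L, H = max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
        else:
            L, H = max(0.0, alpha1 + alpha2 - C), min(C, alpha1 + alpha2)

        eta = K11 - 2 * K12 + K22          # curvature along the constraint line
        if eta <= 0:
            return alpha1, alpha2          # degenerate direction; skip this pair

        alpha2_new = alpha2 + y2 * (E1 - E2) / eta
        alpha2_new = min(H, max(L, alpha2_new))                  # clip to [L, H]
        alpha1_new = alpha1 + y1 * y2 * (alpha2 - alpha2_new)    # keep y1*a1 + y2*a2 fixed
        return alpha1_new, alpha2_new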

How about $b$?
When $\alpha_1 \in (0, C)$: $b^{+} = (\alpha_1 - \alpha_1^{+}) y^{(1)} K_{11} + (\alpha_2 - \alpha_2^{+}) y^{(2)} K_{12} - E_1 + b = b_1$
When $\alpha_2 \in (0, C)$: $b^{+} = (\alpha_1 - \alpha_1^{+}) y^{(1)} K_{12} + (\alpha_2 - \alpha_2^{+}) y^{(2)} K_{22} - E_2 + b = b_2$
When $\alpha_1, \alpha_2 \in (0, C)$: $b^{+} = b_1 = b_2$
When $\alpha_1, \alpha_2 \in \{0, C\}$: $b^{+} = (b_1 + b_2)/2$
More details can be found in J. Platt, "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines," Microsoft Research Technical Report, 1998. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-98-14.pdf 79 / 80

Thanks! Q & A 80 / 80