Slide 1/26: Classification: Linear SVM. Huiping Cao.
Slide 2/26: Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data.
Slide 3/26: Support Vector Machines
One possible solution. (Figure: a separating hyperplane B_1.)
Slide 4/26: Support Vector Machines
Another possible solution. (Figure: a separating hyperplane B_2.)
Slide 5/26: Support Vector Machines
Other possible solutions. (Figure: several candidate separating hyperplanes.)
Slide 6/26: Support Vector Machines
Which one is better, B_1 or B_2? How do you define "better"? The bigger the margin, the better.
Slide 7/26: Support Vector Machines
Find the hyperplane that maximizes the margin, so B_1 is better than B_2. (Figure: B_1 and B_2 with their margin boundaries b_11, b_12 and b_21, b_22.)
Slide 8/26: Linear Support Vector Machine
The decision boundary B_1 and its margin hyperplanes:
B_1:  w · x + b = 0
b_11: w · x + b = +1
b_12: w · x + b = −1
Classification rule:
f(x) = 1 if w · x + b ≥ 1;  f(x) = −1 if w · x + b ≤ −1
Margin = 2 / ‖w‖
Slide 9/26: Support Vector Machines
The margin of the decision boundary: let x_1 lie on b_11 (w · x + b = +1) and x_2 lie on b_12 (w · x + b = −1). Subtracting the two hyperplane equations gives
w · (x_1 − x_2) = 2.
Also, w · (x_1 − x_2) = ‖w‖ ‖x_1 − x_2‖ cos(θ) = ‖w‖ d,
where ‖·‖ is the norm of a vector and d is the length of (x_1 − x_2) in the direction of w.
Thus ‖w‖ d = 2, and the margin is d = 2 / ‖w‖.
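A quick numeric check of d = 2 / ‖w‖ (a minimal sketch; the weight vector here is arbitrary, chosen only for illustration):

```python
import numpy as np

w = np.array([-10.0, -10.0])    # arbitrary weight vector for illustration
margin = 2 / np.linalg.norm(w)  # d = 2 / ||w||
print(margin)                   # 2 / sqrt(200) ~= 0.1414
```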
Slide 10/26: Learn a Linear SVM Model
The training phase of SVM estimates the parameters w and b. The parameters must satisfy two conditions:
y_i = 1 if w · x_i + b ≥ 1
y_i = −1 if w · x_i + b ≤ −1
Slide 11/26: Formulate the problem: rationale
We want to maximize the margin
d = 2 / ‖w‖,
which is equivalent to minimizing
L(w) = ‖w‖² / 2,
subject to the constraints
y_i = 1 if w · x_i + b ≥ 1, and y_i = −1 if w · x_i + b ≤ −1,
which are equivalent to
y_i (w · x_i + b) ≥ 1,  i = 1, …, N.
Slide 12/26: Formulate the problem
Formalized as the following constrained optimization problem.
Objective function: minimize ‖w‖² / 2
subject to y_i (w · x_i + b) ≥ 1, i = 1, …, N.
This is a convex optimization problem:
- The objective function is quadratic.
- The constraints are linear.
It can be solved using the standard Lagrange multiplier method.
Slide 13/26: Lagrange formulation
The Lagrangian for the optimization problem (takes the constraints into account by rewriting the objective function):
L_P(w, b, λ) = ‖w‖² / 2 − ∑_{i=1}^{N} λ_i (y_i (w · x_i + b) − 1),
where λ_i is the Lagrange multiplier for constraint i. Minimize w.r.t. w and b, and maximize w.r.t. each λ_i ≥ 0.
To minimize the Lagrangian, take the derivatives of L_P(w, b, λ) w.r.t. w and b and set them to 0:
∂L_P/∂w = 0  ⟹  w = ∑_{i=1}^{N} λ_i y_i x_i
∂L_P/∂b = 0  ⟹  ∑_{i=1}^{N} λ_i y_i = 0
Slide 14/26: Substituting: from L_P(w, b, λ) to L_D(λ)
Solving L_P(w, b, λ) directly is still difficult because it involves a large number of parameters: w, b, and the λ_i.
Idea: transform the Lagrangian into a function of the Lagrange multipliers only. Substituting w and b into L_P(w, b, λ), we get the dual problem:
L_D(λ) = ∑_{i=1}^{N} λ_i − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} λ_i λ_j y_i y_j x_i · x_j
Details: see the next slide. L_D(λ) is very convenient because it is a simple quadratic form in the vector λ.
Slide 15/26: Substituting details: get the dual problem L_D(λ)
First term:
‖w‖² / 2 = (1/2) w · w = (1/2) (∑_{i=1}^{N} λ_i y_i x_i) · (∑_{j=1}^{N} λ_j y_j x_j) = (1/2) ∑_i ∑_j λ_i λ_j y_i y_j x_i · x_j   (1)
Second term (negated constraint sum), using ∑_i λ_i y_i = 0 and w = ∑_j λ_j y_j x_j:
−∑_i λ_i (y_i (w · x_i + b) − 1) = −∑_i λ_i y_i w · x_i − b ∑_i λ_i y_i + ∑_i λ_i   (2)
= −∑_i λ_i y_i (∑_j λ_j y_j x_j) · x_i + ∑_i λ_i   (3)
= ∑_i λ_i − ∑_i ∑_j λ_i λ_j y_i y_j x_i · x_j   (4)
Adding (1) and (4), we get
L_D(λ) = ∑_i λ_i − (1/2) ∑_i ∑_j λ_i λ_j y_i y_j x_i · x_j
Slide 16/26: Finalized L_D(λ)
L_D(λ) = ∑_{i=1}^{N} λ_i − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} λ_i λ_j y_i y_j x_i · x_j
Maximize w.r.t. each λ_i, subject to
λ_i ≥ 0 for i = 1, 2, …, N, and ∑_{i=1}^{N} λ_i y_i = 0.
Solve L_D(λ) using quadratic programming (QP) to obtain all the λ_i (the QP algorithm itself is out of scope).
Slide 17/26: Solve L_D(λ): QP
We are maximizing L_D(λ):
max_λ ( ∑_{i=1}^{N} λ_i − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} λ_i λ_j y_i y_j x_i · x_j )
subject to the constraints (1) λ_i ≥ 0 for i = 1, 2, …, N, and (2) ∑_{i=1}^{N} λ_i y_i = 0.
Translate the objective into a minimization, because QP packages generally perform minimization:
min_λ ( (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} λ_i λ_j y_i y_j x_i · x_j − ∑_{i=1}^{N} λ_i )
Slide 18/26: Solve L_D(λ): QP
In matrix form:
min_λ ( (1/2) λᵀ Q λ + (−1)ᵀ λ )
subject to
yᵀ λ = 0   (linear constraint)
0 ≤ λ ≤ ∞   (lower and upper bounds)
where Q is the matrix of quadratic coefficients, Q_ij = y_i y_j x_i · x_j:
Q = [ y_1 y_1 x_1·x_1   y_1 y_2 x_1·x_2   …   y_1 y_N x_1·x_N ]
    [ y_2 y_1 x_2·x_1   y_2 y_2 x_2·x_2   …   y_2 y_N x_2·x_N ]
    [ …                                                        ]
    [ y_N y_1 x_N·x_1   y_N y_2 x_N·x_2   …   y_N y_N x_N·x_N ]
With Q defined, the problem is min_λ ( (1/2) λᵀ Q λ + (−1)ᵀ λ ) subject to yᵀ λ = 0 and λ ≥ 0.
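A minimal sketch of setting up and solving this QP in Python, assuming the third-party cvxopt package is available; its solvers.qp routine minimizes (1/2)xᵀPx + qᵀx subject to Gx ≤ h and Ax = b, which matches the form above. The toy training set is hypothetical:

```python
import numpy as np
from cvxopt import matrix, solvers

# Hypothetical toy training set: rows of X are the x_i, y holds the +/-1 labels.
X = np.array([[0.4, 0.5],
              [0.5, 0.6],
              [0.9, 0.4],
              [0.2, 0.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])
N = len(y)

# Quadratic coefficients Q_ij = y_i y_j (x_i . x_j); a tiny ridge keeps the
# positive-semidefinite matrix numerically well behaved for the solver.
Q = np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N)

P = matrix(Q)
q = matrix(-np.ones(N))       # the (-1)^T lambda linear term
G = matrix(-np.eye(N))        # -lambda <= 0 encodes lambda >= 0
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, -1))  # equality constraint y^T lambda = 0
b = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b)
lam = np.array(sol['x']).ravel()
print(lam)                    # most entries are (near) zero
```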
Slide 19/26: Lagrange multipliers
QP solves for λ = (λ_1, λ_2, …, λ_N), where most of the λ_i are zero.
Karush-Kuhn-Tucker (KKT) conditions:
- λ_i ≥ 0
- Complementary slackness: λ_i (y_i (w · x_i + b) − 1) = 0, i.e., either λ_i = 0, or y_i (w · x_i + b) − 1 = 0.
A support vector is a training instance x_i with y_i (w · x_i + b) − 1 = 0 and λ_i > 0; it lies on one of the margin hyperplanes. Training instances that do not lie on these hyperplanes have λ_i = 0.
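Continuing the QP sketch above, the support vectors can be read off from the solved multipliers; the tolerance is an assumption, since numerical solvers return tiny values rather than exact zeros:

```python
# lam, X, y come from the QP sketch above.
tol = 1e-6
sv_idx = np.where(lam > tol)[0]  # support vectors are those with lambda_i > 0
X_sv, y_sv, lam_sv = X[sv_idx], y[sv_idx], lam[sv_idx]
print(sv_idx)
```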
Slide 20/26: Get w and b
w and b depend only on the support vectors x_i and their class labels y_i.
w value: w = ∑_i λ_i y_i x_i
b value: b = y_i − w · x_i, for any support vector (x_i, y_i)
Idea: given a support vector (x_i, y_i), we have y_i (w · x_i + b) − 1 = 0. Multiply both sides by y_i: y_i² (w · x_i + b) − y_i = 0. Since y_i = 1 or y_i = −1, y_i² = 1. Then (w · x_i + b) − y_i = 0, so b = y_i − w · x_i.
Slide 21/26: Get w and b: Example

x_i1   x_i2   y_i   λ_i
0.4    0.5     1    100
0.5    0.6    −1    100
0.9    0.4    −1      0
0.7    0.9    −1      0
0.17   0.05    1      0
0.4    0.35    1      0
0.9    0.8    −1      0
0.2    0       1      0

Solve for λ using a quadratic programming package. Only the first two instances are support vectors (λ_i > 0), so for w = (w_1, w_2):
w_1 = ∑_i λ_i y_i x_i1 = 100 · 1 · 0.4 + 100 · (−1) · 0.5 = −10
w_2 = ∑_i λ_i y_i x_i2 = 100 · 1 · 0.5 + 100 · (−1) · 0.6 = −10
b = 1 − w · x_1 = 1 − ((−10) · 0.4 + (−10) · 0.5) = 10
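A short numpy check that replays this slide's arithmetic (a sketch; only the two support vectors contribute to the sums):

```python
import numpy as np

X = np.array([[0.4, 0.5], [0.5, 0.6], [0.9, 0.4], [0.7, 0.9],
              [0.17, 0.05], [0.4, 0.35], [0.9, 0.8], [0.2, 0.0]])
y = np.array([1, -1, -1, -1, 1, 1, -1, 1], dtype=float)
lam = np.array([100, 100, 0, 0, 0, 0, 0, 0], dtype=float)

w = (lam * y) @ X    # w = sum_i lambda_i y_i x_i
b = y[0] - w @ X[0]  # b = y_i - w . x_i for a support vector
print(w, b)          # [-10. -10.] 10.0
```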
Slide 22/26: SVM: classify a test data point z
y_z = sign(w · z + b) = sign((∑_{i=1}^{N} λ_i y_i x_i) · z + b)
If y_z = 1, the test instance is classified as the positive class.
If y_z = −1, the test instance is classified as the negative class.
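Continuing the numpy check from the previous slide, the decision rule applied to a hypothetical test point z:

```python
z = np.array([0.3, 0.2])  # hypothetical test point
y_z = np.sign(w @ z + b)  # y_z = sign(w . z + b)
print(y_z)                # 1.0, so z is classified as the positive class
```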
Slide 23/26: SVM discussions
What if the problem is not linearly separable?
Slide 24/26: SVM discussions: nonlinear separation
What if the decision boundary is not linear?
Do not work in the original X space. Transform the data into a higher-dimensional Z space such that the data are linearly separable:
L_D(λ) = ∑_i λ_i − (1/2) ∑_i ∑_j λ_i λ_j y_i y_j x_i · x_j   becomes
L_D(λ) = ∑_i λ_i − (1/2) ∑_i ∑_j λ_i λ_j y_i y_j z_i · z_j
Then apply the linear SVM; the support vectors live in Z space.
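A tiny illustration of such a transform (the quadratic feature map below is a common textbook choice, used here only as an assumed example): one-dimensional points that no single threshold can separate become linearly separable after mapping x to z = (x, x²).

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # 1-D inputs
y = np.array([1, -1, -1, -1, 1])           # not separable by any threshold on x

Z = np.column_stack([x, x ** 2])           # map x -> z = (x, x^2)
# In Z space, the line x^2 = 2.5 separates the classes:
print((Z[:, 1] > 2.5) == (y == 1))         # [ True  True  True  True  True ]
```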
Slide 25/26: Quadratic programming packages: Octave
https://www.gnu.org/software/octave/doc/interpreter/Quadratic-Programming.html
Octave solves the quadratic program
min_x (0.5 xᵀ H x + xᵀ q)
subject to
A x = b
lb ≤ x ≤ ub
A_lb ≤ A_in x ≤ A_ub
Slide 26/26: Quadratic programming packages: MATLAB
The Optimization Toolbox in MATLAB solves
min_x ( (1/2) xᵀ H x + fᵀ x )
such that A x ≤ b, Aeq x = beq, lb ≤ x ≤ ub.