Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Size: px

Start display at page:

Download "Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab"

Morris Hart
6 years ago
Views:

1 Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab

2 Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a quick peek at the ipleentational level. 1. What is a support vector achine? 2. Forulating the SVM proble 3. Optiization using Sequential Minial Optiization 4. Use of Kernels for non-linear classification 5. Ipleentation of SVM in MATLAB environent 6. Stochastic Gradient Descent (extra)

3 Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a quick peek at the ipleentational level. 1. What is a support vector achine? 2. Forulating the SVM proble 3. Optiization using Sequential Minial Optiization 4. Use of Kernels for non-linear classification 5. Ipleentation of SVM in MATLAB environent 6. Stochastic Gradient Descent (extra)

4 Support Vector Machines (SVM) x 2 x 1

5 Support Vector Machines (SVM) Not very good decision boundary! x 2 x 1

6 Support Vector Machines (SVM) Great decision boundary! ax 1 + bx 2 + c = 0 x 2 In achine learning w T x + b = 0 ax 1 + bx 2 + c < 0 ax 1 + bx 2 + c > 0 x 1

7 Support Vector Machines (SVM) Great decision boundary! ax 1 + bx 2 + c = 0 x 2 In achine learning w T x + b = 0 w T x + b < 0 w T x + b > 0 x 1

8 Support Vector Machines (SVM) w T x + b = 1 w T x + b = 0 x 2 w T x + b = 1 x 1

9 Support Vector Machines (SVM) w T x + b = 1 w T x + b = 0 x 2 w T x + b = 1 x 1

10 Support Vector Machines (SVM) w T x + b = 1 -d +d w T x + b = 0 How do we axiize d? x 2 w T x + b = 1 x 1

11 Support Vector Machines (SVM) w T x + b = 1 -d +d w T x + b = 0 How do we axiize d? x 2 d = wt x + b w = 1 w w T x + b = 1 w Miniize this! x 1

12 Support Vector Machines (SVM) w T x + b = 1 -d +d w T x + b = 0 How do we axiize d? x 2 d = wt x + b w = 1 w w T x + b = 1 w Miniize this! 1 w 2 2 x 1

13 SVM Initial Constraints If our data point is labelled as y = 1 w T x (i) + b 1 If y = -1 w T x (i) + b 1 This can be succinctly suarized as: y w T x (i) + b 1 We re going to allow for soe argin violation: y (i) w T x (i) + b 1 ξ i y (i) w T x (i) + b + 1 ξ i 0 Hard Margin SVM Soft Margin SVM

14 Support Vector Machines (SVM) w T x + b = 1 w T x + b = 0 x 2 ξ/ w w T x + b = 1 x 1

15 Optiization Objective Miniize: 1 2 w 2 + C σ i ξ i Penalizing Ter Subject to: y i w T x i + b + 1 ξ i 0 ξ i 0

16 Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a quick peek at the ipleentational level. 1. What is a support vector achine? 2. Forulating the SVM proble 3. Optiization using Sequential Minial Optiization 4. Use of Kernels for non-linear classification 5. Ipleentation of SVM in MATLAB environent

17 Optiization Objective Miniize 1 w 2 + C σ 2 i ξ i s.t y (i) w T x (i) + b + 1 ξ i 0 and ξ i 0 Need to use the Lagrangian Using Lagrangian with inequalities as a constraint requires us to fulfill Karush-Khan-Tucker conditions

18 Karush-Kahn-Tucker Conditions There ust exist a w* that solves the prial proble whereas α * and b * are solutions to the dual proble. All three paraeters ust satisfy the following conditions: L w, α, b w i = 0 L w, α, b b i = 0 3. α i g i w = 0 4. g i w 0 5. α i 0 6. ξ i 0 7. r i ξ i = 0 Constraint 3: If α * i is a non-zero nuber, g i ust be 0! Recall: g i w = y (i) w T x (i) + b + 1 ξ i 0 Exaple x ust lie directly on the functional argin in order for g i w to equal 0. Here α > 0. This is our support vectors!

19 Optiization Objective Miniize 1 2 w 2 + C σ i ξ i subject to y (i) w T x (i) + b + 1 ξ i 0 and ξ i 0 Need to use the Lagrangian L P w, b, α, ξ, r = 1 2 in w,b L P = 1 2 L x, α = f x + λ i h i x w 2 + C w 2 + C ξ i + α i i i ξ i α i y i w T x i + b 1 + ξ i r i ξ i i i α i y i [ w T x i + b + ξ i ] r i ξ i

20 Optiization Objective Miniize 1 2 w 2 + C σ i ξ i subject to y (i) w T x (i) + b + 1 ξ 0 and ξ i 0 Need to use the Lagrangian L P w, b, α, ξ, r = 1 2 in w,b L P = 1 2 L x, α = f x + λ i h i x w 2 + C w 2 + C ξ i + α i i i ξ i α i y i w T x i + b 1 + ξ i r i ξ i i i α i y i [ w T x i + b + ξ i ] Miniization function Margin Constraint Error Constraint r i ξ i

21 The proble with the Prial L P = 1 2 w 2 + C ξ i + α i α i [y i w T x i + b + ξ i ] r i ξ i We could just optiize the prial equation and be finished with SVM optiization BUT!!! You would iss out on one of the ost iportant aspects of the SVM The Kernel Trick We need the Lagrangian Dual!

23 Finding the Dual L P = 1 2 w 2 + C ξ i + α i α i [y i w T x i + b + ξ i ] r i ξ i First, copute inial w, b and ξ. Can solve for w! New constraint w L w, b, α = w α i y i x i = 0 w = α i y i x i b L w, b, α = α i y (i) = 0 ξ L w, b, α = C α i r i = 0 Substitute in 0 α i C New constraint α i = C r i

24 Lagrangian Dual L P = 1 2 w 2 + C ξ i + α i α i y i [ w T x i + b + ξ i ] r i ξ i w = α i y i x i L D = α i 1 2 α i α j y (i) y (j) x (i) x (j) j=1 s.t i : σ α i y (i) = 0, ξ 0 and 0 α i C(fro KKT and prev. derivation)

25 Lagrangian Dual ax α L D = α i 1 2 j=1 α i α j y (i) y (j) x (i) x (j) S.T: i : σ α i y (i) = 0, ξ 0, 0 α i C and KKT conditions Copute inner product of x (i) and x (j) No longer rely on w or b Can replace with K(x,z) for kernels Can solve for α explicitly All values not on functional argin have α = 0, ore efficient!

26 Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a quick peek at the ipleentational level. 1. What is a support vector achine? 2. Forulating the SVM proble 3. Optiization using Sequential Minial Optiization 4. Use of Kernels for non-linear classification 5. Ipleentation of SVM in MATLAB environent 6. Stochastic SVM Gradient Descent (extra)

27 Sequential Minial Optiization ax L D = α i 1 α 2 α i α j y (i) y (j) x (i) x (j) j=1 Ideally, hold all α but α j constant and optiize but: α i y (i) = 0 α 1 = y 1 α i y (i) i=2 Solution: Change two α at a tie! Platt 1998

28 Sequential Minial Optiization Algorith Initialize all α to 0 Repeat { 1. Select α j that violates KKT and additional α k to update 2. Reoptiize L D (α) with respect to two α s while holding all other α constant and copute w,b } In order to keep σ α i y (i) = 0 true: α j y (i) + α k y (j) = ζ Even after update, linear cobination ust reain constant! Platt 1998

29 Sequential Minial Optiization 0 α C fro constraint Line given due to linear constraint Clip values if they exceed C, H or L C α 2 α j y (i) + α k y (j) = ζ H C α 2 α j y (i) + α k y (j) = ζ H L α 1 C L α 1 C If y (j) y k then α j α k = ζ If y (j) = y (k) then α j + α k = ζ

30 Sequential Minial Optiization Can analytically solve for new α, w and b for each update step (closed for coputation) α new k = α old k + y k old old (E j Ek ) where η is second derivative of Dual with respect η to α 2. α new j = α old j + y k y j (α old k α new,clipped k ) b 1 = E 1 + y j α j new α j old K x j, x j + y k α k new,clipped α k old K(x j, x k ) b 2 = E 2 + y j α j new α j old K x j, x k + y k α k new,clipped α k old K(x k, x k ) b = b 1 + b 2 2 w = α i y (i) x (i) Platt 1998

31 Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a quick peek at the ipleentational level. 1. What is a support vector achine? 2. Forulating the SVM proble 3. Optiization using Sequential Minial Optiization 4. Use of Kernels for non-linear classification 5. Ipleentation of SVM in MATLAB environent 6. Stochastic SVM Gradient Descent (extra)

32 Kernels Recall: ax α L D = σ α i 1 σ 2 σ j=1 α i α j y (i) y (j) x (i) x (j) We can generalize this using a kernel function K(x i, x j )! K x i, x j = φ(x i ) T φ(x j ) Where φ is a feature apping function. Turns out that we don t need to explicitly copute φ x due to Kernel Trick!

33 Kernel Trick Suppose K x, z = x T z 2 K x, z = x i z i x j z j j=1 K x, z = x i z j x i z j j=1 K x, z = ( x i x j )(z i z j ) i,j=1 Representing each feature: O(n 2 ) Kernel trick: O(n)

34 Types of Kernels Linear Kernel K x, z = x T z Gaussian Kernal (Radial Basis Function) K x, z = exp Polynoial Kernel x z 2 2σ 2 K x, z = x T z + c d ax α L D = α i 1 2 α i α j y (i) y (j) K x i, x j j=1

35 Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a quick peek at the ipleentational level. 1. What is a support vector achine? 2. Forulating the SVM proble 3. Optiization using Sequential Minial Optiization 4. Use of Kernels for non-linear classification 5. Ipleentation of SVM in MATLAB environent 6. Stochastic SVM Gradient Descent (extra)

36 Exaple 1: Linear Kernel

37 Exaple 1: Linear Kernel In MATLAB: I used svtrain: odel = svtrain(x, y, 1e-3, 500); 500 ax iterations C = 1000 KKT tol = 1e-3

38 Exaple 2: Gaussian Kernel (RBF)

39 Exaple 2: Gaussian Kernel (RBF) In MATLAB Gaussian Kernel: gsi = exp(-1*((x1-x2)'* (x1-x2))/(2*siga^2)); odel= svtrain(x, y, x2) gaussiankernel(x1, x2, siga), 0.1, 300); 300 ax iterations C = 1 KKT tol = 0.1

40 Exaple 2: Gaussian Kernel (RBF) In MATLAB Gaussian Kernel: gsi = exp(-1*((x1-x2)'* (x1-x2))/(2*siga^2)); odel= svtrain(x, y, x2) gaussiankernel(x1, x2, siga), 0.001, 300); 300 ax iterations C = 1 KKT tol = 0.001

41 Exaple 2: Gaussian Kernel (RBF) In MATLAB Gaussian Kernel: gsi = exp(-1*((x1-x2)'* (x1-x2))/(2*siga^2)); odel= svtrain(x, y, x2) gaussiankernel(x1, x2, siga), 0.001, 300); 300 ax iterations C = 10 KKT tol = 0.001

42 Resources ftp://

43 Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a quick peek at the ipleentational level. 1. What is a support vector achine? 2. Forulating the SVM proble 3. Optiization using Sequential Minial Optiization 4. Use of Kernels for non-linear classification 5. Ipleentation of SVM in MATLAB environent 6. Stochastic SVM Gradient Descent (extra)

44 Hinge-Loss Prial SVM Recall: Original Proble 1 2 wt w + C σ ξ i S.T: y i w T x i + b 1 + ξ 0 Re-write in ters of hinge-loss function L P = λ 2 wt w + 1 n h 1 y i w T x i + b Regularization Hinge-Loss Cost Function

45 Hinge Loss Function h 1 y i w T x i + b Penalize when signs of x (i) and y (i) don t atch! Y (i) *f(x)

46 Hinge-Loss Prial SVM Hinge-Loss Prial L P = λ 2 wt w + 1 n h 1 y i w T x i + b Copute gradient (sub-gradient) n w = λw 1 n y i x i L y i w T x i + b < 1 Indicator function L(y (i) f(x)) = 0 if classified correctly L(y (i) f(x)) = hinge slope if classified incorrectly

47 Stochastic Gradient Descent We ipleent rando sapling here to pick out data points: λw E i U y i x i l y i w T x i + b < 1 Use this to copute update rule: w t w t t y i x i l y i Here i is sapled fro a unifor distribution. x i T w t 1 + b < 1 λw t 1 Shalev-Shwartz et al. 2007

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the