Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support vector machines
- Training by maximizing margin
- The SVM objective
- Solving the SVM optimization problem
- Support vectors, duals and kernels
SVM objective function
- Regularization term: maximizes the margin. Imposes a preference over the hypothesis space and pushes for better generalization. Can be replaced with other regularization terms which impose other preferences.
- Empirical loss: the hinge loss. Penalizes weight vectors that make mistakes. Can be replaced with other loss functions which impose other preferences.
- A hyper-parameter controls the tradeoff between a large margin and a small hinge loss.
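As a concrete reference point, a minimal sketch of evaluating this objective, assuming the standard form J(w) = (1/2) w^T w + C Σ_i max(0, 1 - y_i w^T x_i), which matches the updates used later in these slides (the function name and NumPy usage are illustrative, not from the slides):

```python
import numpy as np

def svm_objective(w, X, y, C):
    """Regularized hinge loss: 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * w.x_i)."""
    margins = y * (X @ w)                   # y_i * (w^T x_i) for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # hinge loss; zero once the margin reaches 1
    return 0.5 * (w @ w) + C * hinge.sum()
```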
Outline: Training SVM by optimization
1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Solving the SVM optimization problem: the SVM objective J(w) (the regularized hinge loss above) is convex in w.
Recall: Convex functions
A function f is convex if for every u, v in the domain, and for every λ ∈ [0,1], we have f(λu + (1-λ)v) ≤ λ f(u) + (1-λ) f(v).
From a geometric perspective: every tangent plane lies below the function.
Convex functions
Examples: linear functions; the max function.
Some ways to show that a function is convex:
1. Using the definition of convexity
2. Showing that the second derivative is positive (for one-dimensional functions)
3. Showing that the matrix of second derivatives is positive semi-definite (for functions of vectors)
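A quick numerical spot-check of method 1, the definition of convexity, at randomly drawn points (only a sketch: a failed check disproves convexity, while passing checks are merely evidence; the helper name and tolerance are illustrative):

```python
import numpy as np

def looks_convex(f, dim, trials=1000, seed=0):
    """Spot-check f(a*u + (1-a)*v) <= a*f(u) + (1-a)*f(v) at random points."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        u, v = rng.normal(size=dim), rng.normal(size=dim)
        a = rng.uniform()
        if f(a * u + (1 - a) * v) > a * f(u) + (1 - a) * f(v) + 1e-9:
            return False   # found a counterexample: not convex
    return True

# The hinge loss of a single example is convex in w:
x, y = np.array([1.0, -2.0]), 1.0
print(looks_convex(lambda w: max(0.0, 1.0 - y * (w @ x)), dim=2))  # True
```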
Not all functions are convex: some are concave, and some are neither; they violate f(λu + (1-λ)v) ≤ λ f(u) + (1-λ) f(v) for some choices of u, v, λ. (Figure: examples of concave and non-convex functions.)
Convex functions are convenient
A function f is convex if for every u, v in the domain, and for every λ ∈ [0,1], we have f(λu + (1-λ)v) ≤ λ f(u) + (1-λ) f(v).
In general, a necessary condition for x to be a minimum of the function f is ∇f(x) = 0. For convex functions, this condition is both necessary and sufficient.
Solving the SVM optimization problem
- The objective function is convex in w.
- This is a quadratic optimization problem because the objective is quadratic.
- Older methods used techniques from quadratic programming: very slow.
- There are no constraints, so we can use gradient descent: still very slow!
Gradient descent
We are trying to minimize J(w). General strategy:
- Start with an initial guess for w, say w_0
- Iterate till convergence:
  - Compute the gradient of J at w_t
  - Update w_t to get w_{t+1} by taking a step in the opposite direction of the gradient
Intuition: the gradient is the direction of steepest increase of the function. To get to the minimum, go in the opposite direction. (Figure: successive iterates w_0, w_1, w_2, w_3 descending toward the minimum of J(w).)
Gradient descent for SVM (we are trying to minimize J(w)):
1. Initialize w_0
2. For t = 0, 1, 2, ...:
   1. Compute the gradient of J(w) at w_t; call it ∇J(w_t)
   2. Update: w_{t+1} ← w_t - γ_t ∇J(w_t)
γ_t: called the learning rate.
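A generic sketch of this loop (the gradient function, step size, iteration count, and toy objective are illustrative placeholders):

```python
import numpy as np

def gradient_descent(grad_J, w0, lr=0.1, num_steps=100):
    """Batch gradient descent: w_{t+1} = w_t - lr * grad_J(w_t)."""
    w = np.array(w0, dtype=float)
    for _ in range(num_steps):
        w = w - lr * grad_J(w)   # step opposite to the gradient
    return w

# Minimize J(w) = ||w - 3||^2, whose gradient is 2*(w - 3); the minimizer is w = 3.
print(gradient_descent(lambda w: 2.0 * (w - 3.0), w0=[0.0]))  # ~[3.]
```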
Outline: Training SVM by optimization. ✓ Review of convex functions and gradient descent; next: stochastic gradient descent.
Gradient descent for SVM (we are trying to minimize J(w)):
1. Initialize w_0
2. For t = 0, 1, 2, ...: compute ∇J(w_t) and update w_{t+1} ← w_t - γ_t ∇J(w_t), where γ_t is the learning rate.
Problem: the gradient of the SVM objective requires summing over the entire training set. Slow; does not really scale.
Stochastic gradient descent for SVM
Given a training set S = {(x_i, y_i)}, x_i ∈ ℝ^n, y_i ∈ {-1, 1}:
1. Initialize w_0 = 0 ∈ ℝ^n
2. For epoch = 1 ... T:
   1. Pick a random example (x_i, y_i) from the training set S
   2. Treat (x_i, y_i) as a full dataset and take the derivative of the SVM objective at the current w_{t-1} to be ∇J_t(w_{t-1})
   3. Update: w_t ← w_{t-1} - γ_t ∇J_t(w_{t-1})
3. Return the final w
But what is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)
This algorithm is guaranteed to converge to the minimum of J if γ_t is small enough. Why? Because the objective J(w) is a convex function.
Outline: Training SVM by optimization. ✓ Convex functions and gradient descent, ✓ stochastic gradient descent; next: gradient descent vs stochastic gradient descent.
Gradient descent vs SGD
(Figures: contour plots of the objective comparing the smooth path taken by gradient descent with the noisier path taken by stochastic gradient descent.)
SGD makes many more updates than gradient descent, but each individual update is less computationally expensive.
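One way to see the tradeoff (a sketch under the objective form discussed earlier; the function name is illustrative): a single full-batch sub-gradient scans all N training examples, so one gradient-descent update costs on the order of N·n operations, whereas the stochastic update in the next sections uses one example and costs on the order of n.

```python
import numpy as np

def batch_subgradient(w, X, y, C):
    """Sub-gradient of 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * w.x_i).
    Every call scans all N rows of X; SGD instead uses a single row per update."""
    violated = (y * (X @ w)) <= 1.0          # examples whose hinge term is active
    return w - C * (y[violated, None] * X[violated]).sum(axis=0)
```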
Outline: Training SVM by optimization. ✓ Convex functions and gradient descent, ✓ stochastic gradient descent, ✓ gradient descent vs stochastic gradient descent; next: sub-derivatives of the hinge loss.
Stochastic gradient descent for SVM (as above) needs ∇J_t(w_{t-1}) for a single randomly chosen example. But the hinge loss is not differentiable! What, then, is the derivative of the hinge loss with respect to w?
Detour: Sub-gradients
Sub-gradients generalize gradients to non-differentiable functions. (Remember that for convex functions, every tangent lies below the function.) Informally, a sub-tangent at a point is any line that lies below the function at that point; a sub-gradient is the slope of such a line.
Sub-gradients
Formally, g is a sub-gradient of f at x if f(y) ≥ f(x) + g^T (y - x) for every y.
(Figure, example from Boyd: f is differentiable at x_1, so the tangent at that point is unique and g_1 is the gradient at x_1; at x_2, where f is not differentiable, g_2 and g_3 are both sub-gradients.)
Sub-gradient of the SVM objective
General strategy: first resolve the max in the hinge loss, then compute the gradient for each case. For a single example (x_i, y_i): if y_i w^T x_i ≤ 1, the sub-gradient of J_t is w - C y_i x_i; otherwise the hinge term is zero and the sub-gradient is just w.
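A small sketch of this case analysis for the hinge term alone (the function name is illustrative; at the kink y w^T x = 1 any convex combination of the two cases is also a valid sub-gradient):

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """Sub-gradient of max(0, 1 - y * w.x) with respect to w.
    If the margin y * w.x is at most 1, the active piece is 1 - y * w.x (gradient -y*x);
    otherwise the loss is flat and the gradient is 0."""
    return -y * x if y * (w @ x) <= 1.0 else np.zeros_like(w)
```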
Outline: Training SVM by optimization. ✓ Convex functions and gradient descent, ✓ stochastic gradient descent, ✓ gradient descent vs stochastic gradient descent, ✓ sub-derivatives of the hinge loss; next: stochastic sub-gradient descent for SVM.
Stochastic sub-gradient descent for SVM
Given a training set S = {(x_i, y_i)}, x_i ∈ ℝ^n, y_i ∈ {-1, 1}:
1. Initialize w = 0 ∈ ℝ^n
2. For epoch = 1 ... T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      - If y_i w^T x_i ≤ 1: w ← (1 - γ_t) w + γ_t C y_i x_i
      - else: w ← (1 - γ_t) w
3. Return w
(Each update is w ← w - γ_t ∇J_t(w), with the sub-gradient from the previous slide.) γ_t: the learning rate; many tweaks are possible. It is important to shuffle the examples at the start of each epoch.
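A runnable sketch of exactly this loop (the 1/(1+t) decay of the learning rate, the default hyper-parameters, and the toy data are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def sgd_svm(X, y, C=1.0, gamma0=0.1, epochs=20, seed=0):
    """Stochastic sub-gradient descent for the SVM objective, following the pseudocode above."""
    rng = np.random.default_rng(seed)
    n_examples, n_features = X.shape
    w = np.zeros(n_features)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n_examples):       # shuffle at the start of each epoch
            gamma_t = gamma0 / (1.0 + t)            # decaying learning rate (one possible tweak)
            if y[i] * (w @ X[i]) <= 1.0:            # margin violated: hinge term is active
                w = (1.0 - gamma_t) * w + gamma_t * C * y[i] * X[i]
            else:                                   # only the regularizer contributes
                w = (1.0 - gamma_t) * w
            t += 1
    return w

# Toy usage on linearly separable data:
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(np.sign(X @ sgd_svm(X, y)))   # expected: [ 1.  1. -1. -1.]
```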
Convergence and learning rates
With enough iterations, SGD will converge in expectation, provided the step sizes are square summable but not summable:
- Step sizes γ_t are positive
- The sum of the squares of the step sizes over t = 1 to ∞ is finite
- The sum of the step sizes over t = 1 to ∞ is infinite
Example schedules of this kind: γ_t = γ_0 / (1 + t) and similar 1/t decays.
Convergence and learning rates
Number of iterations to get to accuracy within ε, for strongly convex functions, with N examples in d dimensions:
- Gradient descent: O(Nd ln(1/ε))
- Stochastic gradient descent: O(d/ε)
More subtleties are involved, but SGD is generally preferable when the data size is huge.
Recently, many variants have been proposed that are based on this general strategy, targeting multilayer neural networks. Examples: Adagrad, momentum, Nesterov's accelerated gradient, Adam, RMSProp, etc.
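As a flavor of these variants, a sketch of classical (heavy-ball) momentum layered on the same stochastic updates (the interface and default constants are illustrative; the other variants named above additionally adapt the step size per coordinate or keep running moment estimates):

```python
import numpy as np

def sgd_momentum(stochastic_grad, w0, lr=0.01, beta=0.9, num_steps=1000, seed=0):
    """SGD with momentum: v <- beta*v + g; w <- w - lr*v,
    where g = stochastic_grad(w, rng) is a stochastic (sub-)gradient estimate at w."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(num_steps):
        v = beta * v + stochastic_grad(w, rng)   # decaying accumulation of past gradients
        w = w - lr * v
    return w
```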
Outline: Training SVM by optimization. ✓ Convex functions and gradient descent, ✓ stochastic gradient descent, ✓ gradient descent vs stochastic gradient descent, ✓ sub-derivatives of the hinge loss, ✓ stochastic sub-gradient descent for SVM; next: comparison to perceptron.
Stochastic sub-gradient descent for SVM (the algorithm above): if y_i w^T x_i ≤ 1, update w ← (1 - γ_t) w + γ_t C y_i x_i; else w ← (1 - γ_t) w.
Compare with the Perceptron update: if y w^T x ≤ 0, update w ← w + r y x.
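Side by side as code (a sketch; the function names are illustrative), which makes the two differences explicit: the Perceptron only reacts to outright mistakes and never shrinks w, while the SVM update reacts to any margin violation and always applies the (1 - γ_t) shrinkage coming from the regularizer.

```python
def perceptron_update(w, x, y, r=1.0):
    """Perceptron: update only on a mistake (y * w.x <= 0); no regularization."""
    return w + r * y * x if y * (w @ x) <= 0 else w

def svm_sgd_update(w, x, y, gamma_t, C):
    """SVM (SGD): update on any margin violation (y * w.x <= 1) and always shrink w."""
    if y * (w @ x) <= 1:
        return (1 - gamma_t) * w + gamma_t * C * y * x
    return (1 - gamma_t) * w
```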
Perceptron vs. SVM
- Perceptron: stochastic sub-gradient descent for a different loss; no regularization though.
- SVM: optimizes the hinge loss, with regularization.
SVM summary from the optimization perspective
- Minimize the regularized hinge loss.
- Solve using stochastic gradient descent: very fast; run time does not depend on the number of examples.
- Compare with the Perceptron algorithm: the Perceptron does not maximize the margin width, though Perceptron variants can force a margin.
- The convergence criterion is an issue: SGD can be too aggressive in the beginning and get to a reasonably good solution fast, but convergence is slow if a very accurate weight vector is needed.
- Other successful optimization algorithms exist, e.g., dual coordinate descent, implemented in liblinear.
Questions?