Support Vector Machines. Machine Learning Fall 2017

5 Where are we?
Learning algorithms: Decision Trees; Perceptron and AdaBoost (these produce linear classifiers)
General learning principles: overfitting; mistake-bound learning; PAC learning and sample complexity; hypothesis choice and VC dimensions; training and generalization errors
Coming up (next few lectures): Learning theory! Training linear classifiers by minimizing loss; the risk minimization principle

11 Big picture
Linear models: Perceptron, Winnow, Support Vector Machines
How good is a learning algorithm? Online learning; PAC and agnostic learning

13 This lecture: Support vector machines
Training by maximizing margin
The SVM objective
Solving the SVM optimization problem
Support vectors, duals and kernels

16 VC dimensions and linear classifiers
What we know so far:
1. If we have m examples, then with probability $1 - \delta$, the true error of a hypothesis h with training error $\mathrm{err}_S(h)$ is bounded by
   (generalization error) $\le$ (training error) + (a function of the VC dimension)
   A low VC dimension gives a tighter bound.
2. The VC dimension of a linear classifier in d dimensions is $d + 1$.
But are all linear classifiers the same?

18 Recall: Margin
The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
[Figure: a separating hyperplane, with the margin with respect to this hyperplane marked]
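
The definition of the margin can be sketched in a few lines of Python. The toy points and the hyperplane $x_1 + x_2 - 3 = 0$ are hypothetical and chosen only for illustration; the distance formula $|w^T x + b| / \|w\|$ is the standard point-to-hyperplane distance.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(score) / norm

def margin(w, b, points):
    """Margin of the hyperplane for a dataset: distance to the nearest point."""
    return min(distance_to_hyperplane(w, b, x) for x in points)

# Hypothetical toy data and hyperplane x1 + x2 - 3 = 0
points = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (4.0, 2.0)]
w, b = (1.0, 1.0), -3.0

# The nearest point is (1, 1), at distance 1 / sqrt(2)
print(margin(w, b, points))
```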

20 Which line is a better choice? Why?
[Figure: two separating lines, $h_1$ and $h_2$]
A new example that is not from the training set might be misclassified if the margin is smaller.

21 Data-dependent VC dimension
Intuitively, larger margins are better. Suppose we only consider linear separators with margins $\gamma_1$ and $\gamma_2$:
$H_1$ = linear separators that have a margin $\gamma_1$
$H_2$ = linear separators that have a margin $\gamma_2$
and $\gamma_1 > \gamma_2$. Then the entire set of functions $H_1$ is better.

23 Data-dependent VC dimension
Theorem (Vapnik): Let H be the set of linear classifiers that separate the training set by a margin of at least $\gamma$. Then
$VC(H) \le \min\left(\frac{R^2}{\gamma^2}, d\right) + 1$
where R is the radius of the smallest sphere containing the data and d is the number of dimensions.
Larger margin $\Rightarrow$ lower VC dimension
Lower VC dimension $\Rightarrow$ better generalization bound
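
To see how the margin enters the bound, here is a small sketch. It assumes the commonly stated form of Vapnik's result, $VC(H) \le \min(R^2 / \gamma^2, d) + 1$; the function name `margin_vc_bound` and the numbers are hypothetical.

```python
def margin_vc_bound(radius, gamma, dim):
    """Assumed form of the bound: min(R^2 / gamma^2, d) + 1."""
    return min(radius ** 2 / gamma ** 2, dim) + 1

R, d = 1.0, 50  # hypothetical data radius and dimensionality
for gamma in (0.125, 0.25, 0.5):
    # A larger margin gives a smaller bound on the VC dimension
    print(gamma, margin_vc_bound(R, gamma, d))
```

With the smallest margin the bound falls back to the dimension-based value d + 1; as the margin grows, the margin-based term takes over and the bound shrinks.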

24 Learning strategy
Find the linear separator that maximizes the margin.

25 This lecture: Support vector machines
Training by maximizing margin
The SVM objective
Solving the SVM optimization problem
Support vectors, duals and kernels

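28 Recall: The geometry of a linear classifier
Prediction = $\mathrm{sgn}(b + w_1 x_1 + w_2 x_2)$
The decision boundary is $b + w_1 x_1 + w_2 x_2 = 0$. The equations $2b + 2w_1 x_1 + 2w_2 x_2 = 0$ and $1000b + 1000w_1 x_1 + 1000w_2 x_2 = 0$ describe the same boundary.
We only care about the sign, not the magnitude.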
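
The scale invariance of the prediction is easy to check numerically. In this sketch the weights, bias and points are hypothetical, and a score of exactly 0 is mapped to +1 by assumption.

```python
def predict(w, b, x):
    """Sign of b + w.x (a score of exactly 0 is treated as +1 here)."""
    score = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1

w, b = (2.0, -1.0), 0.5  # hypothetical classifier
points = [(1.0, 1.0), (-1.0, 2.0), (0.0, 3.0), (3.0, -2.0)]

predictions = {}
for c in (1, 2, 1000):
    scaled_w = tuple(c * wi for wi in w)
    predictions[c] = [predict(scaled_w, c * b, x) for x in points]
    print(c, predictions[c])  # the same labels for every positive scale c
```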

30 Maximizing margin
Margin = distance of the closest point from the hyperplane:
$\gamma = \min_i \frac{y_i (w^T x_i + b)}{\|w\|}$
We want $\max_w \gamma$.
Some people call this the geometric margin. The numerator alone is called the functional margin.

32 Recall: The geometry of a linear classifier
Prediction = $\mathrm{sgn}(b + w_1 x_1 + w_2 x_2)$, with decision boundary $b + w_1 x_1 + w_2 x_2 = 0$.
We only care about the sign, not the magnitude. So we have the freedom to scale w and b up or down so that the numerator of the margin (the functional margin of the closest point) is 1.

33 Maximizing margin
Margin = distance of the closest point from the hyperplane. We want $\max_w \gamma$.
In the end we only care about the sign of the prediction, not the magnitude of w and b. So set the absolute score (functional margin) of the closest point to 1 and allow w to adjust itself.
In this setting, $\max_w \gamma$ is equivalent to $\max_w \frac{1}{\|w\|}$.
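
The rescaling argument can be checked on a toy problem. This is a sketch with a hypothetical dataset and separator: dividing w and b by the functional margin sets the closest point's score to 1, leaves the geometric margin unchanged, and makes it equal to $1 / \|w\|$.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def functional_margin(w, b, data):
    """Smallest score y * (w.x + b) over the dataset."""
    return min(y * (dot(w, x) + b) for x, y in data)

def geometric_margin(w, b, data):
    """Functional margin divided by the norm of w."""
    return functional_margin(w, b, data) / math.sqrt(dot(w, w))

# Hypothetical separable data, as (x, y) pairs, and a separating (w, b)
data = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((0.0, 0.0), -1), ((1.0, -1.0), -1)]
w, b = (3.0, 3.0), -6.0

scale = functional_margin(w, b, data)
w_hat = tuple(wi / scale for wi in w)
b_hat = b / scale

print(functional_margin(w_hat, b_hat, data))   # 1.0 after rescaling
print(geometric_margin(w, b, data), geometric_margin(w_hat, b_hat, data))
print(1 / math.sqrt(dot(w_hat, w_hat)))        # equals the geometric margin
```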

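37 Max-margin classifiers
Learning problem:
$\min_w \frac{1}{2} w^T w$ such that $y_i w^T x_i \ge 1$ for all i
(Here the bias b is assumed to be folded into w.)
Minimizing $w^T w$ gives us $\max_w \frac{1}{\|w\|}$. The constraint holds for every example and, specifically, for the example closest to the separator.
This is called the hard Support Vector Machine. We will look at how to solve this optimization problem later.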
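
A one-dimensional toy problem makes the learning problem concrete. This sketch is not a real solver: it scans a hypothetical grid of candidate weights, keeps those that satisfy $y_i w x_i \ge 1$ for every example, and picks the one with the smallest $\frac{1}{2} w^2$ (the factor 1/2 is an assumed convention and does not change the minimizer).

```python
# Hypothetical 1-D dataset as (x_i, y_i) pairs, with no bias term
data = [(2.0, 1), (3.0, 1), (-2.0, -1), (-4.0, -1)]

def is_feasible(w, data):
    """Every example must have a functional margin of at least 1."""
    return all(y * w * x >= 1 for x, y in data)

candidates = [k / 10 for k in range(1, 21)]  # w = 0.1, 0.2, ..., 2.0
feasible = [w for w in candidates if is_feasible(w, data)]
best_w = min(feasible, key=lambda w: 0.5 * w * w)

print(best_w)           # 0.5, the smallest feasible weight
print(1 / abs(best_w))  # 2.0, the distance from x = 2 to the boundary at 0
```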

39 What if the data is not separable?
Hard SVM: maximize the margin, subject to every example having a functional margin of at least 1.
This is a constrained optimization problem. If the data is not separable, there is no w that classifies all the data correctly, so the problem is infeasible: it has no solution.
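
A minimal illustration of infeasibility, using a hypothetical dataset in which the same point appears with both labels. The two functional margins are always negatives of each other, so no weight vector can make both at least 1; the random search below only illustrates this and is not a proof.

```python
import random

def functional_margins(w, data):
    """Scores y * w.x for every example (no bias term)."""
    return [y * sum(wi * xi for wi, xi in zip(w, x)) for x, y in data]

# Hypothetical non-separable data: one point carrying both labels
data = [((1.0, 2.0), 1), ((1.0, 2.0), -1)]

random.seed(0)
trials = [tuple(random.uniform(-10, 10) for _ in range(2)) for _ in range(1000)]
any_feasible = any(min(functional_margins(w, data)) >= 1 for w in trials)
print(any_feasible)  # False: the hard SVM constraints cannot all hold
```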

43 Dealing with non-separable data
Key idea: Allow some examples to break into the margin.
[Figure: a separator with a wide margin and a few examples inside it] This separator has a large enough margin that it should generalize well.
So, when computing the margin, ignore the examples that make the margin smaller or the data inseparable.

46 Soft SVM
Hard SVM: maximize the margin; every example has a functional margin of at least 1.
Introduce one slack variable $\xi_i$ per example and require $y_i w^T x_i \ge 1 - \xi_i$ and $\xi_i \ge 0$.
Intuition: The slack variable allows examples to break into the margin. If the slack value is zero, then the example is either on or outside the margin.

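48 Soft SVM
Hard SVM: maximize the margin; every example has a functional margin of at least 1.
Introduce one slack variable $\xi_i$ per example and require $y_i w^T x_i \ge 1 - \xi_i$ and $\xi_i \ge 0$.
New optimization problem for learning:
$\min_{w, \xi} \frac{1}{2} w^T w + C \sum_i \xi_i$ such that $y_i w^T x_i \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all i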
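
For a fixed weight vector, the smallest slack that satisfies both constraints can be read off directly. The sketch below uses a hypothetical w and four hypothetical examples, one for each situation an example can be in.

```python
def slack(w, x, y):
    """Smallest xi >= 0 that satisfies y * w.x >= 1 - xi."""
    return max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))

w = (1.0, 0.0)  # hypothetical weight vector, no bias term
data = [
    ((2.0, 0.0), 1),   # outside the margin:  y * w.x = 2
    ((1.0, 5.0), 1),   # exactly on the margin: y * w.x = 1
    ((0.5, 1.0), 1),   # inside the margin:   y * w.x = 0.5
    ((0.5, 0.0), -1),  # misclassified:       y * w.x = -0.5
]
slacks = [slack(w, x, y) for x, y in data]
print(slacks)  # [0.0, 0.0, 0.5, 1.5]
```

Zero slack corresponds to examples on or outside the margin; a slack above 1 means the example is on the wrong side of the separator.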

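54 Soft SVM
$\min_{w, \xi} \frac{1}{2} w^T w + C \sum_i \xi_i$ such that $y_i w^T x_i \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all i
The first term maximizes the margin. The second term minimizes the total slack (i.e., it allows as few examples as possible to violate the margin). C sets the tradeoff between the two terms.
Eliminate the slack variables to rewrite this as
$\min_w \frac{1}{2} w^T w + C \sum_i \max(0, 1 - y_i w^T x_i)$
This form is more interpretable.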
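
The elimination of the slack variables can be spelled out as a short derivation. This is a sketch that assumes the objective has the standard form $\frac{1}{2} w^T w + C \sum_i \xi_i$ with $C > 0$; the name C and the factor 1/2 are assumptions wherever the slide's own formula is not shown.

```latex
\begin{align*}
&\text{Constraints for example } i: && \xi_i \ge 1 - y_i w^T x_i, \quad \xi_i \ge 0 \\
&\text{so, for a fixed } w: && \xi_i \ge \max(0,\, 1 - y_i w^T x_i) \\
&\text{the objective increases with } \xi_i \text{, so at the optimum:} && \xi_i = \max(0,\, 1 - y_i w^T x_i) \\
&\text{substituting into the objective:} && \min_{w} \; \tfrac{1}{2} w^T w + C \sum_i \max(0,\, 1 - y_i w^T x_i)
\end{align*}
```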

60 Maximizing margin and minimizing loss
$\min_w \frac{1}{2} w^T w + C \sum_i \max(0, 1 - y_i w^T x_i)$
The first term maximizes the margin; the second is a penalty for the prediction. We can consider three cases:
  Example is correctly classified and is outside the margin: penalty = 0
  Example is incorrectly classified: penalty = $1 - y_i w^T x_i$
  Example is correctly classified but within the margin: penalty = $1 - y_i w^T x_i$
This gives us the hinge loss function, $\max(0, 1 - y\, w^T x)$.
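
The three cases can be written out explicitly and checked against the single expression $\max(0, 1 - y\, w^T x)$. In this sketch the function name and the example scores are hypothetical, and a score of exactly 0 is counted as incorrect by assumption.

```python
def hinge_penalty(y, score):
    """Return (case, penalty) for a label y in {-1, +1} and a score w.x."""
    m = y * score
    if m >= 1:
        return "correct, outside the margin", 0.0
    if m > 0:
        return "correct, within the margin", 1.0 - m
    return "incorrect", 1.0 - m

examples = [(+1, 2.0), (+1, 0.25), (-1, 1.0)]  # hypothetical (label, score) pairs
for y, score in examples:
    case, penalty = hinge_penalty(y, score)
    # All three cases agree with the closed form of the hinge loss
    assert penalty == max(0.0, 1.0 - y * score)
    print(y, score, case, penalty)
```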

63 The Hinge Loss
[Figure: loss plotted against $y\, w^T x$, showing the hinge loss and the 0-1 loss]
0-1 loss: If the signs of y and $w^T x$ are different, then penalty = 1.
0-1 loss: If the signs of y and $w^T x$ are the same, then no penalty.

64 The Hinge Loss
[Figure: loss plotted against $y\, w^T x$]
Hinge: Penalize predictions even if they are correct but fall within the margin.
Hinge: Incorrect predictions get a penalty that grows linearly with the magnitude of $w^T x$.
Hinge: No penalty if $w^T x$ is at least 1 for positive examples (at most $-1$ for negative examples).
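
A quick numerical comparison of the two losses as functions of $y\, w^T x$. This is a sketch: the sample values are hypothetical, and the 0-1 loss is assumed to charge a penalty when the score is exactly 0.

```python
def zero_one_loss(m):
    """Penalty 1 if the signs of y and w.x differ (m = y * w.x <= 0 here)."""
    return 1.0 if m <= 0 else 0.0

def hinge_loss(m):
    """max(0, 1 - m), where m = y * w.x."""
    return max(0.0, 1.0 - m)

margins = (-2.0, -0.5, 0.0, 0.5, 1.0, 2.0)
for m in margins:
    print(m, zero_one_loss(m), hinge_loss(m))
```

At every sampled value the hinge loss is at least as large as the 0-1 loss, and it stays positive for correct predictions with $y\, w^T x < 1$.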

66 General learning principle: risk minimization
Define the notion of loss over the training data as a function of a hypothesis.
Learning = find the hypothesis that has the lowest loss on the training data.

67 General learning principle: regularized risk minimization
Define a regularization function that penalizes over-complex hypotheses. Capacity control gives better generalization.
Define the notion of loss over the training data as a function of a hypothesis.
Learning = find the hypothesis that has the lowest [regularizer + loss on the training data].

69 SVM objective function
$\min_w \frac{1}{2} w^T w + C \sum_i \max(0, 1 - y_i w^T x_i)$
Regularization term ($\frac{1}{2} w^T w$): maximizes the margin. It imposes a preference over the hypothesis space and pushes for better generalization. It can be replaced with other regularization terms, which impose other preferences.
Empirical loss (the sum of hinge losses): penalizes weight vectors that make mistakes. It can be replaced with other loss functions, which impose other preferences.
C: a hyperparameter that controls the tradeoff between a large margin and a small hinge loss.
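
The effect of the tradeoff hyperparameter can be seen by evaluating the objective for two candidate weight vectors. This is a sketch under assumptions: the objective is taken to be $\frac{1}{2} w^T w + C \sum_i \max(0, 1 - y_i w^T x_i)$, the hyperparameter is called C, and the 1-D data and candidates are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_objective(w, data, C):
    """Regularizer plus C times the total hinge loss (assumed form)."""
    regularizer = 0.5 * dot(w, w)
    hinge = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in data)
    return regularizer + C * hinge

# Hypothetical 1-D data: two points far from the origin, two close to it
data = [((2.0,), 1), ((-2.0,), -1), ((0.5,), 1), ((-0.5,), -1)]
small_w = (0.5,)  # large margin, but the two close points violate it
large_w = (2.0,)  # small margin, no violations

for C in (0.1, 10.0):
    print(C, svm_objective(small_w, data, C), svm_objective(large_w, data, C))
```

With a small C the large-margin candidate has the lower objective; with a large C the candidate with zero hinge loss wins.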
