Support Vector Machines
Machine Learning, Fall 2017
Where are we?
Learning algorithms so far: Decision Trees; Perceptron and AdaBoost, which produce linear classifiers.
General learning principles: overfitting; mistake-bound learning; PAC learning and sample complexity; hypothesis choice and VC dimension; training and generalization errors.
Coming up (next few lectures): Learning theory! Training linear classifiers by minimizing loss; the risk minimization principle.
Big picture
Linear models: Perceptron, Winnow, Support Vector Machines.
How good is a learning algorithm? Online learning; PAC and agnostic learning.
This lecture: Support vector machines
- Training by maximizing margin
- The SVM objective
- Solving the SVM optimization problem
- Support vectors, duals and kernels
VC dimensions and linear classifiers: what we know so far
1. If we have m examples, then with probability at least $1 - \delta$, the true (generalization) error of a hypothesis h with training error $err_S(h)$ is bounded by
$$err_D(h) \le err_S(h) + \sqrt{\frac{VC(H)\left(\ln\frac{2m}{VC(H)} + 1\right) + \ln\frac{4}{\delta}}{m}}$$
The second term is a function of the VC dimension: a low VC dimension gives a tighter bound.
2. The VC dimension of a linear classifier in d dimensions is d + 1.
But are all linear classifiers the same?
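To get a feel for the bound, here is a minimal Python sketch that evaluates it for a few VC dimensions. The exact constants inside the square root vary across statements of the theorem, so treat the numbers as illustrative:

```python
import math

def vc_bound(train_err, m, vc_dim, delta=0.05):
    """Upper bound on the true error: training error plus a
    VC-dimension-dependent complexity term (one standard form)."""
    complexity = math.sqrt(
        (vc_dim * (math.log(2 * m / vc_dim) + 1) + math.log(4 / delta)) / m
    )
    return train_err + complexity

# Same training error and sample size: a lower VC dimension
# gives a tighter guarantee on the true error.
for d in (10, 100, 1000):
    print(d, round(vc_bound(train_err=0.05, m=10_000, vc_dim=d), 3))
```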
Recall: Margin
The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
[Figure: a dataset and a separating hyperplane, with the margin marked with respect to this hyperplane]
Which line is a better choice? Why?
[Figure: two candidate separators h_1 and h_2]
A new example, not from the training set, might be misclassified if the margin is smaller.
Data-dependent VC dimension
Intuitively, larger margins are better. Suppose we only consider linear separators with margins $\gamma_1$ and $\gamma_2$:
H_1 = linear separators that have a margin $\gamma_1$
H_2 = linear separators that have a margin $\gamma_2$
with $\gamma_1 > \gamma_2$. The entire set of functions H_1 is better.
Data-dependent VC dimension
Theorem (Vapnik): Let $H_\gamma$ be the set of linear classifiers that separate the training set by a margin of at least $\gamma$. Then
$$VC(H_\gamma) \le \min\left(\frac{R^2}{\gamma^2}, d\right) + 1$$
where R is the radius of the smallest sphere containing the data.
Larger margin ⇒ lower VC dimension. Lower VC dimension ⇒ better generalization bound.
Learning strategy
Find the linear separator that maximizes the margin.
Recall: The geometry of a linear classifier
Prediction = sgn(b + w_1 x_1 + w_2 x_2)
The decision boundary is b + w_1 x_1 + w_2 x_2 = 0. But 2b + 2w_1 x_1 + 2w_2 x_2 = 0 and 1000b + 1000w_1 x_1 + 1000w_2 x_2 = 0 describe the same line: we only care about the sign, not the magnitude.
Maximizing margin
Margin = distance of the closest point from the hyperplane. We want
$$\max_{w,b} \gamma(w, b) = \max_{w,b} \min_i \frac{y_i (w^T x_i + b)}{\|w\|}$$
Some people call this the geometric margin; the numerator alone is called the functional margin.
Recall: The geometry of a linear classifier
Prediction = sgn(b + w_1 x_1 + w_2 x_2), and we only care about the sign, not the magnitude. So we have the freedom to scale w and b up or down so that the numerator (the functional margin of the closest point) is 1.
Maximizing margin
Margin = distance of the closest point from the hyperplane. We only care about the sign of w and b in the end, not the magnitude. So set the absolute score (functional margin) of the closest point to 1 and allow w to adjust itself: in this setting, $\max_w \gamma$ is equivalent to $\max_w \frac{1}{\|w\|}$, i.e., to minimizing $w^T w$.
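As a quick illustration (with made-up toy data, not from the slides), the following NumPy sketch computes both margins and shows that rescaling w and b changes the functional margin but not the geometric one:

```python
import numpy as np

def margins(w, b, X, y):
    """Functional and geometric margins of a dataset (X: n x d, y in {-1,+1})
    with respect to the hyperplane w^T x + b = 0."""
    functional = y * (X @ w + b)              # y_i (w^T x_i + b), the numerator
    geometric = functional / np.linalg.norm(w)
    return functional.min(), geometric.min()

X = np.array([[2.0, 2.0], [1.0, -0.5], [-1.0, -2.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), 0.0

print(margins(w, b, X, y))           # margins of the original (w, b)
print(margins(2 * w, 2 * b, X, y))   # scaling doubles the functional margin
                                     # but leaves the geometric margin fixed
```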
Max-margin classifiers
Learning problem (the hard Support Vector Machine, or hard SVM):
$$\min_{w,b} \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 \text{ for all } i$$
Minimizing $\frac{1}{2} w^T w$ gives us $\max_w \frac{1}{\|w\|}$. The constraint holds for every example, in particular for the example closest to the separator.
We will look at how to solve this optimization problem later.
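Before we get to dedicated solvers, one way to see the hard SVM as a concrete quadratic program is to hand it to an off-the-shelf convex optimizer. A minimal sketch with the cvxpy library and a made-up separable toy set (an illustration only, not the method developed later in the course):

```python
import cvxpy as cp
import numpy as np

# Tiny linearly separable toy set; labels in {-1, +1}.
X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Hard SVM: minimize (1/2) w^T w subject to y_i (w^T x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)             # the max-margin separator
print(2 / np.linalg.norm(w.value))  # width of the margin, 2 / ||w||
```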
What if the data is not separable?
Hard SVM: maximize the margin; every example has a functional margin of at least 1. This is a constrained optimization problem.
If the data is not separable, there is no w that will classify the data: the problem is infeasible and has no solution!
Dealing with non-separable data
Key idea: Allow some examples to break into the margin.
[Figure: a separator with a few examples inside its margin] This separator has a large enough margin that it should generalize well. So, when computing the margin, ignore the examples that make the margin smaller or the data inseparable.
Soft SVM
Hard SVM: maximize the margin; every example has a functional margin of at least 1.
Now introduce one slack variable $\xi_i$ per example and require $y_i w^T x_i \ge 1 - \xi_i$ and $\xi_i \ge 0$.
Intuition: the slack variable allows examples to break into the margin. If the slack value is zero, then the example is either on or outside the margin.
New optimization problem for learning:
$$\min_{w,\xi} \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i w^T x_i \ge 1 - \xi_i,\ \xi_i \ge 0 \text{ for all } i$$
Soft SVM
$$\min_{w,\xi} \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i w^T x_i \ge 1 - \xi_i,\ \xi_i \ge 0$$
The first term maximizes the margin; the second minimizes the total slack (i.e., allows as few examples as possible to violate the margin); C trades off between the two terms.
Eliminate the slack variables to rewrite this as
$$\min_w \frac{1}{2} w^T w + C \sum_i \max(0, 1 - y_i w^T x_i)$$
This form is more interpretable.
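The unconstrained form is easy to optimize directly. Below is a minimal batch subgradient-descent sketch of a soft SVM in NumPy; the toy data, step size, and epoch count are made up for illustration, and this is not the training procedure the slides will develop:

```python
import numpy as np

def train_soft_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize (1/2) w^T w + C * sum_i max(0, 1 - y_i w^T x_i)
    by subgradient descent (bias folded into w via a constant feature)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        scores = y * (X @ w)
        viol = scores < 1                  # inside the margin or misclassified
        # Subgradient: w from the regularizer; -C * y_i x_i per violator.
        grad = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        w -= lr * grad
    return w

# Usage: append a constant-1 feature so w's last entry acts as the bias b.
X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [0.0, -2.0], [0.5, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])  # last point makes it non-separable-ish
Xb = np.hstack([X, np.ones((len(X), 1))])
w = train_soft_svm(Xb, y, C=1.0)
print(np.sign(Xb @ w))
```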
Maximizing margin and minimizing loss
$$\min_w \frac{1}{2} w^T w + C \sum_i \max(0, 1 - y_i w^T x_i)$$
The first term maximizes the margin; the second is a penalty for each prediction. We can consider three cases:
- The example is correctly classified and outside the margin: penalty = 0.
- The example is incorrectly classified: penalty = $1 - y_i w^T x_i$.
- The example is correctly classified but within the margin: penalty = $1 - y_i w^T x_i$.
This gives us the hinge loss function.
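A one-line implementation makes the three cases easy to check (the scores below are made-up values):

```python
import numpy as np

def hinge(y, score):
    """Hinge loss: max(0, 1 - y * w^T x)."""
    return np.maximum(0.0, 1.0 - y * score)

# The three cases, with y = +1:
print(hinge(1, 2.5))   # correct, outside the margin   -> 0.0
print(hinge(1, -0.5))  # incorrect                     -> 1.5  (= 1 - y w^T x)
print(hinge(1, 0.4))   # correct but inside the margin -> 0.6  (= 1 - y w^T x)
```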
The Hinge Loss
[Figure: hinge loss and 0-1 loss plotted against y w^T x]
0-1 loss: if the signs of y and w^T x differ, then penalty = 1; if they are the same, then no penalty.
Hinge loss: penalizes predictions even if they are correct but too close to the margin; incorrect predictions get a penalty that grows linearly as y w^T x decreases; no penalty once w^T x is at least 1 (at most -1 for negative examples), i.e., once $y\, w^T x \ge 1$.
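If you want to reproduce the figure, here is a short matplotlib sketch (my reconstruction, not the original figure code):

```python
import numpy as np
import matplotlib.pyplot as plt

s = np.linspace(-3, 3, 400)          # s = y * w^T x
hinge = np.maximum(0.0, 1.0 - s)     # hinge loss
zero_one = (s < 0).astype(float)     # 0-1 loss: penalty 1 iff signs disagree

plt.plot(s, hinge, label="hinge loss")
plt.step(s, zero_one, where="post", label="0-1 loss")
plt.xlabel(r"$y \, w^T x$")
plt.ylabel("Loss")
plt.legend()
plt.show()
```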
General learning principle: risk minimization
Define the notion of loss over the training data as a function of a hypothesis. Learning = find the hypothesis that has the lowest loss on the training data.

General learning principle: regularized risk minimization
In addition, define a regularization function that penalizes over-complex hypotheses; this capacity control gives better generalization. Learning = find the hypothesis that has the lowest [regularizer + loss on the training data].
SVM objective function
$$\min_w \underbrace{\frac{1}{2} w^T w}_{\text{regularization}} + C \underbrace{\sum_i \max(0, 1 - y_i w^T x_i)}_{\text{empirical loss}}$$
Regularization term: maximizes the margin; imposes a preference over the hypothesis space and pushes for better generalization; can be replaced with other regularization terms that impose other preferences.
Empirical loss: the hinge loss; penalizes weight vectors that make mistakes; can be replaced with other loss functions that impose other preferences.
C: a hyperparameter that controls the trade-off between a large margin and a small hinge loss.
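In practice this objective is available off-the-shelf; for example, scikit-learn's LinearSVC optimizes a regularized hinge loss of this form (up to details of how the bias is handled). A usage sketch with made-up data:

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [0.0, -2.0], [0.5, 0.0]])
y = np.array([1, 1, -1, -1, -1])

# C controls the margin / hinge-loss trade-off: small C favors a large
# margin, large C favors fewer hinge-loss violations.
clf = LinearSVC(C=1.0, loss="hinge")
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # learned w and b
print(clf.predict(X))
```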