Support Vector Machines
Machine Learning, Fall 2017
Where are we?
Learning algorithms so far: Decision Trees; Perceptron and AdaBoost, which produce linear classifiers.
General learning principles: overfitting; mistake-bound learning; PAC learning and sample complexity; hypothesis choice and VC dimension; training and generalization errors.
Coming up (next few lectures): Learning theory! Training linear classifiers by minimizing loss; the risk minimization principle.
Big picture
Linear models: Perceptron, Winnow, Support Vector Machines.
How good is a learning algorithm? Online learning; PAC and agnostic learning.
This lecture: Support vector machines
- Training by maximizing margin
- The SVM objective
- Solving the SVM optimization problem
- Support vectors, duals and kernels
VC dimensions and linear classifiers: what we know so far
1. If we have m examples, then with probability at least $1 - \delta$, the true (generalization) error of a hypothesis h with training error $err_S(h)$ is bounded by
$$err_D(h) \le err_S(h) + \sqrt{\frac{VC(H)\left(\ln\frac{2m}{VC(H)} + 1\right) + \ln\frac{4}{\delta}}{m}}$$
The second term is a function of the VC dimension: a low VC dimension gives a tighter bound.
2. The VC dimension of a linear classifier in d dimensions is d + 1.
But are all linear classifiers the same?
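To get a feel for the bound, here is a minimal Python sketch that evaluates it for a few VC dimensions. The exact constants inside the square root vary across statements of the theorem, so treat the numbers as illustrative:

```python
import math

def vc_bound(train_err, m, vc_dim, delta=0.05):
    """Upper bound on the true error: training error plus a
    VC-dimension-dependent complexity term (one standard form)."""
    complexity = math.sqrt(
        (vc_dim * (math.log(2 * m / vc_dim) + 1) + math.log(4 / delta)) / m
    )
    return train_err + complexity

# Same training error and sample size: a lower VC dimension
# gives a tighter guarantee on the true error.
for d in (10, 100, 1000):
    print(d, round(vc_bound(train_err=0.05, m=10_000, vc_dim=d), 3))
```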
Recall: Margin
The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
[Figure: a dataset and a separating hyperplane, with the margin marked with respect to this hyperplane]
Which line is a better choice? Why?
[Figure: two candidate separators h_1 and h_2]
A new example, not from the training set, might be misclassified if the margin is smaller.
Data-dependent VC dimension
Intuitively, larger margins are better. Suppose we only consider linear separators with margins $\gamma_1$ and $\gamma_2$:
H_1 = linear separators that have a margin $\gamma_1$
H_2 = linear separators that have a margin $\gamma_2$
with $\gamma_1 > \gamma_2$. The entire set of functions H_1 is better.
Data-dependent VC dimension
Theorem (Vapnik): Let $H_\gamma$ be the set of linear classifiers that separate the training set by a margin of at least $\gamma$. Then
$$VC(H_\gamma) \le \min\left(\frac{R^2}{\gamma^2}, d\right) + 1$$
where R is the radius of the smallest sphere containing the data.
Larger margin ⇒ lower VC dimension. Lower VC dimension ⇒ better generalization bound.
Learning strategy
Find the linear separator that maximizes the margin.
Recall: The geometry of a linear classifier
Prediction = sgn(b + w_1 x_1 + w_2 x_2)
The decision boundary is b + w_1 x_1 + w_2 x_2 = 0. But 2b + 2w_1 x_1 + 2w_2 x_2 = 0 and 1000b + 1000w_1 x_1 + 1000w_2 x_2 = 0 describe the same line: we only care about the sign, not the magnitude.
Maximizing margin
Margin = distance of the closest point from the hyperplane. We want
$$\max_{w,b} \gamma(w, b) = \max_{w,b} \min_i \frac{y_i (w^T x_i + b)}{\|w\|}$$
Some people call this the geometric margin; the numerator alone is called the functional margin.
Recall: The geometry of a linear classifier
Prediction = sgn(b + w_1 x_1 + w_2 x_2), and we only care about the sign, not the magnitude. So we have the freedom to scale w and b up or down so that the numerator (the functional margin of the closest point) is 1.
Maximizing margin
Margin = distance of the closest point from the hyperplane. We only care about the sign of w and b in the end, not the magnitude. So set the absolute score (functional margin) of the closest point to 1 and allow w to adjust itself: in this setting, $\max_w \gamma$ is equivalent to $\max_w \frac{1}{\|w\|}$, i.e., to minimizing $w^T w$.
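As a quick illustration (with made-up toy data, not from the slides), the following NumPy sketch computes both margins and shows that rescaling w and b changes the functional margin but not the geometric one:

```python
import numpy as np

def margins(w, b, X, y):
    """Functional and geometric margins of a dataset (X: n x d, y in {-1,+1})
    with respect to the hyperplane w^T x + b = 0."""
    functional = y * (X @ w + b)              # y_i (w^T x_i + b), the numerator
    geometric = functional / np.linalg.norm(w)
    return functional.min(), geometric.min()

X = np.array([[2.0, 2.0], [1.0, -0.5], [-1.0, -2.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), 0.0

print(margins(w, b, X, y))           # margins of the original (w, b)
print(margins(2 * w, 2 * b, X, y))   # scaling doubles the functional margin
                                     # but leaves the geometric margin fixed
```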
Max-margin classifiers
Learning problem (the hard Support Vector Machine, or hard SVM):
$$\min_{w,b} \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 \text{ for all } i$$
Minimizing $\frac{1}{2} w^T w$ gives us $\max_w \frac{1}{\|w\|}$. The constraint holds for every example, in particular for the example closest to the separator.
We will look at how to solve this optimization problem later.
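Before we get to dedicated solvers, one way to see the hard SVM as a concrete quadratic program is to hand it to an off-the-shelf convex optimizer. A minimal sketch with the cvxpy library and a made-up separable toy set (an illustration only, not the method developed later in the course):

```python
import cvxpy as cp
import numpy as np

# Tiny linearly separable toy set; labels in {-1, +1}.
X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Hard SVM: minimize (1/2) w^T w subject to y_i (w^T x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)             # the max-margin separator
print(2 / np.linalg.norm(w.value))  # width of the margin, 2 / ||w||
```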
What if the data is not separable?
Hard SVM: maximize the margin; every example has a functional margin of at least 1. This is a constrained optimization problem.
If the data is not separable, there is no w that will classify the data: the problem is infeasible and has no solution!
Dealing with non-separable data
Key idea: Allow some examples to break into the margin.
[Figure: a separator with a few examples inside its margin] This separator has a large enough margin that it should generalize well. So, when computing the margin, ignore the examples that make the margin smaller or the data inseparable.
Soft SVM
Hard SVM: maximize the margin; every example has a functional margin of at least 1.
Now introduce one slack variable $\xi_i$ per example and require $y_i w^T x_i \ge 1 - \xi_i$ and $\xi_i \ge 0$.
Intuition: the slack variable allows examples to break into the margin. If the slack value is zero, then the example is either on or outside the margin.
New optimization problem for learning:
$$\min_{w,\xi} \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i w^T x_i \ge 1 - \xi_i,\ \xi_i \ge 0 \text{ for all } i$$
Soft SVM
$$\min_{w,\xi} \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i w^T x_i \ge 1 - \xi_i,\ \xi_i \ge 0$$
The first term maximizes the margin; the second minimizes the total slack (i.e., allows as few examples as possible to violate the margin); C trades off between the two terms.
Eliminate the slack variables to rewrite this as
$$\min_w \frac{1}{2} w^T w + C \sum_i \max(0, 1 - y_i w^T x_i)$$
This form is more interpretable.
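The unconstrained form is easy to optimize directly. Below is a minimal batch subgradient-descent sketch of a soft SVM in NumPy; the toy data, step size, and epoch count are made up for illustration, and this is not the training procedure the slides will develop:

```python
import numpy as np

def train_soft_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize (1/2) w^T w + C * sum_i max(0, 1 - y_i w^T x_i)
    by subgradient descent (bias folded into w via a constant feature)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        scores = y * (X @ w)
        viol = scores < 1                  # inside the margin or misclassified
        # Subgradient: w from the regularizer; -C * y_i x_i per violator.
        grad = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        w -= lr * grad
    return w

# Usage: append a constant-1 feature so w's last entry acts as the bias b.
X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [0.0, -2.0], [0.5, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])  # last point makes it non-separable-ish
Xb = np.hstack([X, np.ones((len(X), 1))])
w = train_soft_svm(Xb, y, C=1.0)
print(np.sign(Xb @ w))
```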
Maximizing margin and minimizing loss
$$\min_w \frac{1}{2} w^T w + C \sum_i \max(0, 1 - y_i w^T x_i)$$
The first term maximizes the margin; the second is a penalty for each prediction. We can consider three cases:
- The example is correctly classified and outside the margin: penalty = 0.
- The example is incorrectly classified: penalty = $1 - y_i w^T x_i$.
- The example is correctly classified but within the margin: penalty = $1 - y_i w^T x_i$.
This gives us the hinge loss function.
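A one-line implementation makes the three cases easy to check (the scores below are made-up values):

```python
import numpy as np

def hinge(y, score):
    """Hinge loss: max(0, 1 - y * w^T x)."""
    return np.maximum(0.0, 1.0 - y * score)

# The three cases, with y = +1:
print(hinge(1, 2.5))   # correct, outside the margin   -> 0.0
print(hinge(1, -0.5))  # incorrect                     -> 1.5  (= 1 - y w^T x)
print(hinge(1, 0.4))   # correct but inside the margin -> 0.6  (= 1 - y w^T x)
```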
The Hinge Loss
[Figure: hinge loss and 0-1 loss plotted against y w^T x]
0-1 loss: if the signs of y and w^T x differ, then penalty = 1; if they are the same, then no penalty.
Hinge loss: penalizes predictions even if they are correct but too close to the margin; incorrect predictions get a penalty that grows linearly as y w^T x decreases; no penalty once w^T x is at least 1 (at most -1 for negative examples), i.e., once $y\, w^T x \ge 1$.
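If you want to reproduce the figure, here is a short matplotlib sketch (my reconstruction, not the original figure code):

```python
import numpy as np
import matplotlib.pyplot as plt

s = np.linspace(-3, 3, 400)          # s = y * w^T x
hinge = np.maximum(0.0, 1.0 - s)     # hinge loss
zero_one = (s < 0).astype(float)     # 0-1 loss: penalty 1 iff signs disagree

plt.plot(s, hinge, label="hinge loss")
plt.step(s, zero_one, where="post", label="0-1 loss")
plt.xlabel(r"$y \, w^T x$")
plt.ylabel("Loss")
plt.legend()
plt.show()
```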
General learning principle: risk minimization
Define the notion of loss over the training data as a function of a hypothesis. Learning = find the hypothesis that has the lowest loss on the training data.

General learning principle: regularized risk minimization
In addition, define a regularization function that penalizes over-complex hypotheses; this capacity control gives better generalization. Learning = find the hypothesis that has the lowest [regularizer + loss on the training data].
SVM objective function
$$\min_w \underbrace{\frac{1}{2} w^T w}_{\text{regularization}} + C \underbrace{\sum_i \max(0, 1 - y_i w^T x_i)}_{\text{empirical loss}}$$
Regularization term: maximizes the margin; imposes a preference over the hypothesis space and pushes for better generalization; can be replaced with other regularization terms that impose other preferences.
Empirical loss: the hinge loss; penalizes weight vectors that make mistakes; can be replaced with other loss functions that impose other preferences.
C: a hyperparameter that controls the trade-off between a large margin and a small hinge loss.
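In practice this objective is available off-the-shelf; for example, scikit-learn's LinearSVC optimizes a regularized hinge loss of this form (up to details of how the bias is handled). A usage sketch with made-up data:

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [0.0, -2.0], [0.5, 0.0]])
y = np.array([1, 1, -1, -1, -1])

# C controls the margin / hinge-loss trade-off: small C favors a large
# margin, large C favors fewer hinge-loss violations.
clf = LinearSVC(C=1.0, loss="hinge")
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # learned w and b
print(clf.predict(X))
```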