Support Vector Machines. Machine Learning Fall 2017

Where are we? Learning algorithms so far: decision trees, Perceptron, AdaBoost (the Perceptron and AdaBoost produce linear classifiers). General learning principles: overfitting, mistake-bound learning, PAC learning and sample complexity, hypothesis choice and VC dimension, training and generalization errors. Coming up (next few lectures): learning theory, training linear classifiers by minimizing loss, and the risk minimization principle.

Big picture. Linear models: Perceptron, Winnow, Support Vector Machines. Online learning; PAC and agnostic learning. How good is a learning algorithm?

This lecture: Support vector machines. Training by maximizing margin; the SVM objective; solving the SVM optimization problem; support vectors, duals, and kernels.

VC dimensions and linear classifiers: what we know so far. (1) If we have m examples, then with probability 1 − δ, the true (generalization) error of a hypothesis h is bounded by its training error err_S(h) plus a term that is a function of the VC dimension; a low VC dimension gives a tighter bound. (2) The VC dimension of a linear classifier in d dimensions is d + 1. But are all linear classifiers the same?
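The bound itself appears as an equation image on the slide and is not captured in this transcript. For reference, one standard form of this VC-dimension generalization bound (a common textbook variant; the constants on the slide may differ) is:

```latex
\operatorname{err}_D(h) \;\le\; \operatorname{err}_S(h)
  \;+\; \sqrt{\frac{VC(H)\left(\ln\frac{2m}{VC(H)} + 1\right) + \ln\frac{4}{\delta}}{m}}
```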

Recall: Margin. The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. (The figure illustrates the margin with respect to a particular hyperplane.)

Which line is a better choice, h_1 or h_2? Why? A new example, not from the training set, might be misclassified if the margin is smaller.

Data-dependent VC dimension. Intuitively, larger margins are better. Suppose we only consider linear separators with margins γ_1 and γ_2: H_1 = linear separators that have a margin γ_1, H_2 = linear separators that have a margin γ_2, and γ_1 > γ_2. Then the entire set of functions H_1 is better.

Data-dependent VC dimension. Theorem (Vapnik): let H be the set of linear classifiers that separate the training set by a margin of at least γ. Then the VC dimension of H is bounded in terms of γ and R, where R is the radius of the smallest sphere containing the data. A larger margin implies a lower VC dimension, and a lower VC dimension implies a better generalization bound.
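The bound in the theorem is not captured in the transcript; a commonly quoted form of Vapnik's result (possibly differing from the slide in minor details) is:

```latex
VC\!\left(H_{\gamma}\right) \;\le\; \min\!\left(\frac{R^{2}}{\gamma^{2}},\; d\right) + 1
```

where d is the dimensionality of the data and R is the radius of the smallest enclosing sphere.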

Learning strategy: find the linear separator that maximizes the margin.

This lecture: Support vector machines. Training by maximizing margin; the SVM objective; solving the SVM optimization problem; support vectors, duals, and kernels.

Recall: the geometry of a linear classifier. Prediction = sgn(b + w_1 x_1 + w_2 x_2). The equations b + w_1 x_1 + w_2 x_2 = 0, 2b + 2w_1 x_1 + 2w_2 x_2 = 0, and 1000b + 1000 w_1 x_1 + 1000 w_2 x_2 = 0 all describe the same hyperplane: we only care about the sign, not the magnitude.

Maximizing margin. The margin is the distance of the closest point from the hyperplane, and we want to maximize it over w. Some people call this the geometric margin; the numerator alone (without the normalization by the norm of w) is called the functional margin.

Recall: the geometry of a linear classifier. The hyperplane is b + w_1 x_1 + w_2 x_2 = 0 and the prediction is sgn(b + w_1 x_1 + w_2 x_2). Since we only care about the sign and not the magnitude, we have the freedom to scale w and b up or down so that the numerator (the functional margin of the closest point) is 1.

Maximizing margin. The margin is the distance of the closest point from the hyperplane, and we want to maximize it over w. We only care about the sign of w and b in the end, not the magnitude, so set the absolute score (functional margin) of the closest point to 1 and allow the norm of w to adjust itself. In this setting the margin becomes 1/||w||, so maximizing the margin γ is equivalent to minimizing w^T w.
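Written out (a reconstruction consistent with the transcript's conventions, with the bias b folded into w):

```latex
\gamma(\mathbf{w}) \;=\; \min_i \frac{y_i\,\mathbf{w}^{\top}\mathbf{x}_i}{\lVert\mathbf{w}\rVert}
\qquad\text{and, with } \min_i y_i\,\mathbf{w}^{\top}\mathbf{x}_i = 1:\qquad
\gamma(\mathbf{w}) \;=\; \frac{1}{\lVert\mathbf{w}\rVert},
\quad\text{so}\quad
\max_{\mathbf{w}} \gamma(\mathbf{w}) \;\Longleftrightarrow\; \min_{\mathbf{w}} \tfrac{1}{2}\,\mathbf{w}^{\top}\mathbf{w}.
```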

Max-margin classifiers. Learning problem: minimize (1/2) w^T w subject to every example having a functional margin of at least 1 (see the formulation below). Minimizing w^T w gives us the maximum margin. The constraint is true for every example, and in particular for the example closest to the separator. This is called the hard Support Vector Machine. We will look at how to solve this optimization problem later.
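For reference, the hard-margin SVM problem described above (written with the bias folded into w, matching the constraints used later in the transcript):

```latex
\min_{\mathbf{w}} \ \tfrac{1}{2}\,\mathbf{w}^{\top}\mathbf{w}
\qquad \text{s.t.} \quad y_i\,\mathbf{w}^{\top}\mathbf{x}_i \;\ge\; 1 \quad \text{for all } i.
```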

What if the data is not separable? The hard SVM maximizes the margin while requiring every example to have a functional margin of at least 1; this is a constrained optimization problem. If the data is not separable, there is no w that will correctly classify all the data: the problem is infeasible and has no solution!

Dealing with non-separable data. Key idea: allow some examples to break into the margin. This separator has a large enough margin that it should generalize well. So, when computing the margin, ignore the examples that make the margin smaller or the data inseparable.

Soft SVM. Start from the hard SVM: maximize the margin, with every example having a functional margin of at least 1. Introduce one slack variable ξ_i per example and require y_i w^T x_i ≥ 1 − ξ_i and ξ_i ≥ 0. Intuition: the slack variable allows an example to break into the margin. If the slack value is zero, the example is either on or outside the margin.

Soft SVM. With the slack variables and the constraints y_i w^T x_i ≥ 1 − ξ_i and ξ_i ≥ 0, we get a new optimization problem for learning.
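The optimization problem appears as an equation image on the slide. The standard soft-margin SVM formulation it refers to (with a tradeoff hyperparameter, here written C, consistent with the "tradeoff between the two terms" annotation below) is:

```latex
\min_{\mathbf{w},\,\boldsymbol{\xi}} \ \tfrac{1}{2}\,\mathbf{w}^{\top}\mathbf{w} \;+\; C\sum_i \xi_i
\qquad \text{s.t.} \quad y_i\,\mathbf{w}^{\top}\mathbf{x}_i \;\ge\; 1 - \xi_i, \quad \xi_i \ge 0 \quad \text{for all } i.
```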

Soft SVM. The first term maximizes the margin; the second term minimizes the total slack (i.e., allows as few examples as possible to violate the margin); the coefficient on the slack term is a tradeoff between the two. We can eliminate the slack variables to rewrite this problem in a form that is more interpretable (see below).
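At the optimum each slack variable equals max(0, 1 − y_i w^T x_i), so eliminating the slack variables gives the equivalent unconstrained problem that the next slides analyze:

```latex
\min_{\mathbf{w}} \ \tfrac{1}{2}\,\mathbf{w}^{\top}\mathbf{w} \;+\; C \sum_i \max\!\left(0,\; 1 - y_i\,\mathbf{w}^{\top}\mathbf{x}_i\right).
```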

Maximizing margin and minimizing loss. The first term maximizes the margin; the second term is a penalty for the prediction on each example. There are three cases: (1) the example is correctly classified and is outside the margin: penalty = 0; (2) the example is incorrectly classified: penalty = 1 − y_i w^T x_i; (3) the example is correctly classified but lies within the margin: penalty = 1 − y_i w^T x_i. This is the hinge loss function.
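All three cases are captured by a single expression, the hinge loss:

```latex
L_{\text{hinge}}\!\left(y,\ \mathbf{w}^{\top}\mathbf{x}\right) \;=\; \max\!\left(0,\; 1 - y\,\mathbf{w}^{\top}\mathbf{x}\right).
```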

The Hinge Loss. [Figure: the hinge loss and the 0-1 loss plotted as functions of y w^T x.]

The Hinge Loss. 0-1 loss: if the signs of y and w^T x are different, the penalty is 1; if the signs of y and w^T x are the same, there is no penalty.

The Hinge Loss. The hinge loss penalizes predictions even when they are correct but too close to the margin; incorrect predictions get a penalty that increases linearly as y w^T x decreases; and there is no penalty once w^T x is beyond 1 (beyond −1 for negative examples).
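A minimal sketch in Python/NumPy (not from the slides; the function names are mine) that computes the two losses described above, e.g. to reproduce the plotted curves:

```python
import numpy as np

def hinge_loss(y, score):
    """Hinge loss: max(0, 1 - y * score), where score = w^T x."""
    return np.maximum(0.0, 1.0 - y * score)

def zero_one_loss(y, score):
    """0-1 loss: 1 if the sign of the score disagrees with the label y, else 0."""
    return float(np.sign(score) != y)

y = 1.0  # a positive example
for score in np.linspace(-2.0, 2.0, 9):  # from confidently wrong to confidently right
    print(f"y*w^T x = {y * score:+.1f}   hinge = {hinge_loss(y, score):.2f}   0-1 = {zero_one_loss(y, score):.0f}")
```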

General learning principle: risk minimization. Define a notion of loss over the training data as a function of a hypothesis. Learning = finding the hypothesis that has the lowest loss on the training data.

General learning principle: regularized risk minimization. Define a regularization function that penalizes over-complex hypotheses; this capacity control gives better generalization. Define a notion of loss over the training data as a function of a hypothesis. Learning = finding the hypothesis that has the lowest [regularizer + loss on the training data].
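In symbols, the general template (the symbols Ω for the regularizer and L for the loss are mine, not from the slide):

```latex
\min_{h \in H} \ \underbrace{\Omega(h)}_{\text{regularizer}} \;+\; \underbrace{\sum_{i} L\!\left(h(\mathbf{x}_i),\, y_i\right)}_{\text{loss on the training data}}.
```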

SVM objective function. Regularization term: maximizes the margin; it imposes a preference over the hypothesis space and pushes for better generalization; it can be replaced with other regularization terms that impose other preferences. Empirical loss: the hinge loss; it penalizes weight vectors that make mistakes; it can be replaced with other loss functions that impose other preferences. A hyperparameter controls the tradeoff between a large margin and a small hinge loss.
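To make the objective concrete, here is a small self-contained sketch (an illustration with my own naming choices, not the training algorithm presented in the course) that minimizes (1/2) w^T w + C Σ_i max(0, 1 − y_i w^T x_i) by plain subgradient descent on toy data:

```python
import numpy as np

def svm_objective(w, X, y, C):
    """Regularized hinge loss: 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * w^T x_i)."""
    margins = y * (X @ w)
    return 0.5 * w @ w + C * np.sum(np.maximum(0.0, 1.0 - margins))

def svm_subgradient(w, X, y, C):
    """A subgradient: w minus C times the sum of y_i * x_i over margin violators."""
    violators = y * (X @ w) < 1.0
    return w - C * (y[violators][:, None] * X[violators]).sum(axis=0)

def train_svm(X, y, C=1.0, lr=0.01, steps=2000):
    """Constant-step subgradient descent; a sketch, not a tuned optimizer."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * svm_subgradient(w, X, y, C)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 2D data: two Gaussian blobs with labels in {-1, +1}; a constant 1 feature folds in the bias.
    X = np.vstack([rng.normal([+2, +2], 1.0, (50, 2)), rng.normal([-2, -2], 1.0, (50, 2))])
    X = np.hstack([X, np.ones((100, 1))])
    y = np.hstack([np.ones(50), -np.ones(50)])

    w = train_svm(X, y, C=1.0)
    print("objective:", svm_objective(w, X, y, C=1.0))
    print("training accuracy:", np.mean(np.sign(X @ w) == y))
```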