Support vector machines Lecture 4


Support vector machines, Lecture 4. David Sontag, New York University. Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin.

Q: What does the Perceptron mistake bound tell us? Theorem: the maximum number of mistakes made by the perceptron algorithm is bounded above by (R/γ)², where R bounds the norm of the data points and γ is the margin of separation. Batch learning is the setting we consider for most of the class: assume the training data are drawn from the same distribution as future test data, and use the training data to find the hypothesis. The mistake bound gives us an upper bound on the perceptron's running time: at least one mistake is made per pass through the data, so the algorithm makes at most (R/γ)² passes, with R and γ computed on the training data. The bound does not tell us anything about generalization; that is addressed by the concept of VC-dimension (in a couple of lectures).

Q: What does the Perceptron mistake bound tell us? Theorem: the maximum number of mistakes made by the perceptron algorithm is bounded above by (R/γ)². Demonstration in Matlab that the perceptron takes many more iterations to converge when the margin is smaller (relative to R).
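The lecture demo is in Matlab; as a stand-in, here is a minimal Python sketch (the toy data and function names are my own, not from the slides) that counts perceptron updates until convergence, so the mistake count can be compared on a large-margin and a small-margin version of the same data.

```python
import numpy as np

def perceptron_mistakes(X, y, max_passes=1000):
    """Run the perceptron on (X, y) with labels in {-1, +1}.
    Returns (w, b, number of mistakes made before convergence)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    mistakes = 0
    for _ in range(max_passes):
        made_mistake = False
        for i in range(n):
            if y[i] * (X[i] @ w + b) <= 0:   # misclassified (or on the boundary)
                w += y[i] * X[i]             # perceptron update
                b += y[i]
                mistakes += 1
                made_mistake = True
        if not made_mistake:                 # a full pass with no mistakes: converged
            break
    return w, b, mistakes

# Toy illustration: the same data pushed closer to the separator (smaller margin)
# typically needs more updates, in line with the (R/gamma)^2 bound.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, 1, -1)
X[:, 0] += 0.5 * y                           # large-margin version
_, _, m_large = perceptron_mistakes(X.copy(), y)
X[:, 0] -= 0.4 * y                           # shrink the margin
_, _, m_small = perceptron_mistakes(X, y)
print(m_large, m_small)
```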

Online versus batch learning. In the online setting we measure regret, i.e. the total cumulative loss, and make no assumptions at all about the order of the data points. R and γ now refer to all data points (seen and future). The perceptron mistake bound tells us that the algorithm has bounded regret. [Shai Shalev-Shwartz, Online Learning and Online Convex Optimization, '11]

Recall from last lecture: support vector machines (SVMs). The separating hyperplane is w.x + b = 0, with canonical hyperplanes w.x + b = +1 and w.x + b = -1, and the margin between them is 2γ. This is an example of a convex optimization problem, specifically a quadratic program, so there are polynomial-time algorithms to solve it. The hyperplane is defined by the support vectors: the data points lying on the canonical lines. Non-support vectors are everything else; moving them will not change w. We could use the support vectors as a lower-dimensional basis to write down the separating line, although we haven't seen how yet.
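The slide only states that the SVM is a quadratic program; as a hedged illustration, here is a minimal sketch that writes the hard-margin SVM as exactly that QP using the cvxpy modeling library (cvxpy is my choice, not prescribed by the lecture). It assumes linearly separable data with labels in {-1, +1}.

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve the hard-margin SVM QP:
    minimize (1/2) ||w||^2  subject to  y_i (w.x_i + b) >= 1 for all i.
    Assumes the data are linearly separable (otherwise the QP is infeasible)."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```

The support vectors are then the training points whose constraint is tight, i.e. y_i (w.x_i + b) = 1 up to solver tolerance.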

What if the data is not linearly separable? What about overfitting? One idea: add more features! For example, map x to a richer feature vector φ(x) = ( x^(1), ..., x^(n), x^(1)x^(2), x^(1)x^(3), ..., e^{x^(1)}, ... ).
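A small sketch of such a feature map, following the slide's pattern of raw coordinates, pairwise products, and exponential terms (the exact choice of features here is illustrative, not from the slides):

```python
import numpy as np

def phi(x):
    """Expand x = (x1, ..., xn) into raw coordinates,
    all pairwise products, and exp of each coordinate."""
    x = np.asarray(x, dtype=float)
    pairwise = [x[i] * x[j] for i in range(len(x)) for j in range(i + 1, len(x))]
    return np.concatenate([x, pairwise, np.exp(x)])

print(phi([1.0, 2.0, 3.0]))   # 3 raw + 3 pairwise + 3 exponential features
```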

What if the data is not linearly separable? First idea: jointly minimize w.w and the number of training mistakes, i.e. minimize w.w + C #(mistakes). How do we trade off the two criteria? The constant C trades off #(mistakes) against w.w; pick C using held-out data. Problems: with the 0/1 loss this is not a QP anymore, it doesn't distinguish near misses from really bad mistakes, and it is NP-hard to find the optimal solution!

Allowing for slack: soft margin SVM. Keep the hyperplanes w.x + b = +1, 0, -1, but now minimize w.w + C Σ_j ξ_j subject to y_j (w.x_j + b) ≥ 1 - ξ_j and ξ_j ≥ 0, where the ξ_j are slack variables. The slack penalty C > 0: as C → ∞ we have to separate the data; C = 0 ignores the data entirely! For each data point: if its margin is ≥ 1, we don't care; if its margin is < 1, we pay a linear penalty.

Allowing for slack: soft margin SVM (continued). Minimize w.w + C Σ_j ξ_j subject to y_j (w.x_j + b) ≥ 1 - ξ_j, ξ_j ≥ 0. What is the (optimal) value of ξ_j as a function of w and b? If y_j (w.x_j + b) ≥ 1, then ξ_j = 0. If y_j (w.x_j + b) < 1, then ξ_j = 1 - y_j (w.x_j + b). Sometimes written as ξ_j = max(0, 1 - y_j (w.x_j + b)).
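A tiny sketch (the function name is mine) that computes these optimal slacks for a fixed (w, b):

```python
import numpy as np

def optimal_slacks(w, b, X, y):
    """For fixed (w, b), the optimal slack for each training point is
    xi_j = max(0, 1 - y_j (w . x_j + b))."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)
```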

Equivalent hinge loss formulation. Start from: minimize w.w + C Σ_j ξ_j subject to y_j (w.x_j + b) ≥ 1 - ξ_j, ξ_j ≥ 0. Substituting ξ_j = max(0, 1 - y_j (w.x_j + b)) into the objective, we get the unconstrained problem: minimize w.w + C Σ_j max(0, 1 - y_j (w.x_j + b)). The hinge loss is defined as max(0, 1 - y ŷ), where ŷ = w.x + b. The w.w term is called regularization and is used to prevent overfitting; the second term is empirical risk minimization, using the hinge loss.
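The slides do not specify how to optimize this unconstrained form; as a hedged illustration only, here is a plain full-batch subgradient-descent sketch on the regularized hinge objective (the learning rate and epoch count are arbitrary assumptions):

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize  w.w + C * sum_j max(0, 1 - y_j (w.x_j + b))
    by full-batch subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                      # points currently paying hinge loss
        grad_w = 2 * w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```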

Hinge loss vs. 0/1 loss. [Figure: both losses plotted against the margin y(w.x + b).] The 0/1 loss is 1 when a point is misclassified and 0 otherwise; the hinge loss max(0, 1 - y(w.x + b)) is 0 once the margin reaches 1 and grows linearly as the margin decreases. The hinge loss upper bounds the 0/1 loss!
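A quick numeric check of the upper-bound claim (a sketch; the margin values are arbitrary):

```python
import numpy as np

margins = np.linspace(-2, 2, 9)            # y * (w.x + b) for a few hypothetical points
zero_one = (margins <= 0).astype(float)    # 0/1 loss: 1 iff misclassified
hinge = np.maximum(0.0, 1.0 - margins)     # hinge loss
assert np.all(hinge >= zero_one)           # hinge upper bounds 0/1 everywhere
```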

How do we do multi-class classification?

One versus all classification. Learn 3 classifiers: - vs {o, +} with weights w_-, + vs {o, -} with weights w_+, and o vs {+, -} with weights w_o. Predict the label using the classifier with the highest score, y = argmax_k w_k.x. Any problems? Could we learn this dataset?
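A hedged sketch of one-versus-all on top of any binary trainer (for instance the subgradient SVM sketched above; the helper names are mine):

```python
import numpy as np

def train_one_vs_all(X, y, classes, train_binary):
    """Train one binary classifier per class: class k vs. the rest."""
    models = {}
    for k in classes:
        y_binary = np.where(y == k, 1, -1)
        models[k] = train_binary(X, y_binary)    # returns (w, b)
    return models

def predict_one_vs_all(models, X):
    """Predict the class whose classifier gives the highest score w.x + b."""
    classes = list(models.keys())
    scores = np.stack([X @ w + b for (w, b) in models.values()], axis=1)
    return np.array([classes[i] for i in np.argmax(scores, axis=1)])
```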

Multi-class SVM. Simultaneously learn 3 sets of weights, w_+, w_-, w_o. How do we guarantee the correct labels? We need new constraints: the score of the correct class must be better than the score of every wrong class, i.e. w_{y_j}.x_j ≥ w_{y'}.x_j + 1 for all y' ≠ y_j.

Multi-class SVM (continued). As for the binary SVM, we introduce slack variables and maximize the margin: minimize Σ_y w_y.w_y + C Σ_j ξ_j subject to w_{y_j}.x_j ≥ w_{y'}.x_j + 1 - ξ_j for all y' ≠ y_j, with ξ_j ≥ 0. To predict, we use y = argmax_y w_y.x. Now can we learn it?
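A hedged sketch of the resulting multi-class hinge objective, assuming a weight matrix W with one row per class (the vectorized form and names are my own):

```python
import numpy as np

def multiclass_hinge_objective(W, X, y, C=1.0):
    """Objective value for the multi-class SVM:
    sum_y ||w_y||^2 + C * sum_j max_{y' != y_j} max(0, 1 + W[y'].x_j - W[y_j].x_j)."""
    scores = X @ W.T                               # (n, num_classes)
    correct = scores[np.arange(len(y)), y]         # score of the true class per point
    margins = 1.0 + scores - correct[:, None]      # violation against each wrong class
    margins[np.arange(len(y)), y] = 0.0            # the true class pays no penalty
    slack = np.max(np.maximum(margins, 0.0), axis=1)
    return np.sum(W * W) + C * np.sum(slack)
```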

What you need to know: the perceptron mistake bound; maximizing the margin; the derivation of the SVM formulation; the relationship between SVMs and empirical risk minimization; 0/1 loss versus hinge loss; and tackling multiple classes, via one-against-all and multi-class SVMs.

What's next? We'll learn one of the most interesting and exciting recent advancements in machine learning: the kernel trick, which gives high-dimensional feature spaces at no extra cost! But first, a detour: constrained optimization!