Support Vector Machines

Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University, February 22nd, 2005

Two SVM tutorials are linked on the class website (please read both):
- High-level presentation with applications (Hearst 1998)
- Detailed tutorial (Burges 1998)

Announcements
- The third homework is out; due March 1st.
- Final exam assigned by the registrar: May 12, 1-4 p.m. Location TBD.

Linear classifiers: which line is better?
Data: examples (x_i, y_i); a linear classifier scores each point by w·x = Σ_j w^(j) x^(j).

Pick the one with the largest margin!
The decision boundary is w·x + b = 0, where w·x = Σ_j w^(j) x^(j).

Maximize the margin
Choose the boundary w·x + b = 0 that is as far as possible from the closest training points.

But there are many planes
The same boundary w·x + b = 0 can be written with many (w, b) pairs: any rescaling (c·w, c·b) gives the same plane, so we need to fix a normalization.

Review: normal to a plane
For the plane w·x + b = 0, the vector w is normal (perpendicular) to the plane.

Normalized margin: canonical hyperplanes
Rescale (w, b) so that the closest positive point x⁺ satisfies w·x⁺ + b = +1 and the closest negative point x⁻ satisfies w·x⁻ + b = -1, with the decision boundary w·x + b = 0 in between. The margin is then 2γ = 2/‖w‖.

Margin maximization using canonical hyperplanes
With the canonical constraints y_i(w·x_i + b) ≥ 1 for all i, the margin is 2γ = 2/‖w‖, so maximizing the margin is equivalent to minimizing ½‖w‖² subject to y_i(w·x_i + b) ≥ 1 for all i.

Support vector machines (SVMs)
- Solve efficiently by quadratic programming (QP); well-studied solution algorithms.
- The hyperplane is defined by the support vectors: the points on the margin boundaries w·x + b = ±1.
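
As a concrete illustration (an addition, not from the slides), here is a minimal sketch of training a linear SVM with scikit-learn; the toy data and the large C value, which approximates the hard-margin QP above, are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (assumed for illustration).
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],
              [3.0, 0.5], [4.0, 1.0], [3.5, -0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", clf.coef_[0])                       # normal vector of the hyperplane
print("b =", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)  # the points defining the margin
```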

What if the data is not linearly separable?
Use features of features of features of features...

What if the data is still not linearly separable?
Minimize w·w and the number of training mistakes, trading off the two criteria: w·w + C·#(mistakes), where #(mistakes) is the 0/1 loss and C is the slack penalty. But this is not a QP anymore, and it doesn't distinguish near misses from really bad mistakes.

Slack variables and the hinge loss
Add a slack variable ξ_i ≥ 0 per example and minimize ½ w·w + C Σ_i ξ_i subject to y_i(w·x_i + b) ≥ 1 - ξ_i. Equivalently, each example pays the hinge loss max(0, 1 - y_i(w·x_i + b)): if the margin is ≥ 1, we don't care; if the margin is < 1, we pay a linear penalty.
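
A short numerical sketch (added here, not from the slides) of the hinge loss next to the 0/1 loss on a few margin values:

```python
import numpy as np

def hinge_loss(margins):
    """Hinge loss max(0, 1 - m) for an array of margins m = y * (w.x + b)."""
    return np.maximum(0.0, 1.0 - margins)

margins = np.array([2.0, 1.0, 0.5, 0.0, -1.5])
print("hinge:", hinge_loss(margins))             # [0.  0.  0.5 1.  2.5]
print("0/1:  ", (margins <= 0.0).astype(float))  # [0. 0. 0. 1. 1.]
```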

Side note: what's the difference between SVMs and logistic regression?
SVM: hinge loss, max(0, 1 - y(w·x + b)).
Logistic regression: log loss, log(1 + exp(-y(w·x + b))).

What about multiple classes?

One against all
Learn 3 classifiers, one per class, each separating its class from the other two; predict with the classifier that outputs the highest score.

Learn 1 classifier: multiclass SVM
Simultaneously learn 3 sets of weights, one per class; classify by the highest-scoring class, and train so that each example's correct class wins by a margin.
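
A minimal one-against-all sketch (an addition, not from the slides) using scikit-learn; the three-class toy data is an assumption:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

# Toy 3-class data (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2))
               for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 20)

# One binary linear SVM per class; predict the class with the highest score.
ova = OneVsRestClassifier(LinearSVC(C=1.0)).fit(X, y)
print(ova.predict([[2.8, 0.2], [0.1, 2.9]]))  # expected: [1 2]
```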

What you need to know
- Maximizing the margin; derivation of the SVM formulation
- Slack variables and the hinge loss
- Relationship between SVMs and logistic regression: 0/1 loss, hinge loss, log loss
- Tackling multiple classes: one against all, multiclass SVMs

SVMs, Duality and the Kernel Trick
Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University, February 22nd, 2005

SVMs: a reminder

You will now learn one of the most interesting and exciting recent advancements in machine learning: the kernel trick, which gives high-dimensional feature spaces at no extra cost! But first, a detour: constrained optimization.

Constrained optimization

Lagrange multipliers and dual variables
For a problem min_x f(x) subject to g(x) ≤ 0, introduce a dual variable (Lagrange multiplier) α ≥ 0 and form the Lagrangian L(x, α) = f(x) + α·g(x); optimizing over the dual variables gives the dual problem.
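
A toy numerical illustration (added, not from the slides) of solving a small constrained problem; the specific objective and constraint are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize f(x) = (x0 - 2)^2 + (x1 - 1)^2
# subject to g(x) = x0 + x1 - 2 <= 0 (scipy wants constraints as fun(x) >= 0).
res = minimize(
    fun=lambda x: (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2,
    x0=np.zeros(2),
    constraints=[{"type": "ineq", "fun": lambda x: 2.0 - x[0] - x[1]}],
)
print(res.x)  # constrained optimum, roughly [1.5, 0.5]
```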

Dual SVM derivation (1): the linearly separable case

Dual SVM derivation (2): the linearly separable case

Dual SVM interpretation
w = Σ_i α_i y_i x_i: the weight vector is a combination of the training points, and only the support vectors (points with α_i > 0, lying on the margin) contribute to the boundary w·x + b = 0.

Dual SVM formulation: the linearly separable case
maximize over α:  Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j)
subject to:  α_i ≥ 0 for all i,  Σ_i α_i y_i = 0
Recover w = Σ_i α_i y_i x_i, and b from any support vector via y_i(w·x_i + b) = 1.
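
A numerical sketch of this dual (an addition, not the slides' derivation), using scipy's general-purpose optimizer instead of a specialized QP solver; the tiny dataset is assumed:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny separable dataset (assumed).
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)  # Q_ij = y_i y_j (x_i . x_j)

def neg_dual(a):
    """Negative dual objective (minimized, which maximizes the dual)."""
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(neg_dual, np.zeros(n),
               bounds=[(0.0, None)] * n,                            # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X         # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))  # index of a support vector (alpha_i > 0)
b = y[sv] - w @ X[sv]       # from y_i (w . x_i + b) = 1
print("alpha =", alpha.round(3), " w =", w.round(3), " b =", round(float(b), 3))
```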

Dual SVM derivation: the non-separable case

Dual SVM formulation: the non-separable case
The same dual as in the separable case, except the slack penalty C becomes a box constraint on the dual variables: 0 ≤ α_i ≤ C for all i, with Σ_i α_i y_i = 0.

Why did we learn about the dual SVM?
There are quadratic programming algorithms that can solve the dual faster than the primal. But, more importantly: the kernel trick!!! (After another little detour.)

Reminder from last time: what if the data is not linearly separable?
Use features of features of features of features... But the feature space can get really large really quickly!

Higher-order polynomials
[Figure: number of monomial terms vs. number of input dimensions, for polynomial degrees d = 2, 3, 4.]
With m input features and polynomials of degree d, the number of monomial terms grows fast: for d = 6 and m = 100, there are about 1.6 billion terms.
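
A quick check of that count (an added sketch, assuming the slide counts monomials of degree exactly d, which by stars and bars is C(m + d - 1, d)):

```python
from math import comb

def num_monomials(m, d):
    """Number of degree-d monomials in m variables (stars and bars)."""
    return comb(m + d - 1, d)

print(num_monomials(100, 6))  # 1609344100, i.e. about 1.6 billion
```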

The dual formulation only depends on dot products, not on w!
The data enters the dual only through x_i·x_j, so we can replace every dot product with Φ(x_i)·Φ(x_j) for a feature map Φ.

Dot products of polynomials
For the feature map Φ containing all monomials of degree d (with suitable coefficients on the cross terms), the dot product has a closed form: Φ(u)·Φ(v) = (u·v)^d.
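
A small numerical check (added, not from the slides) for degree d = 2 in two dimensions, where Φ(x) = (x₁², √2·x₁x₂, x₂²):

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 monomial feature map for 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi2(u) @ phi2(v))  # explicit feature-space dot product: 1.0
print((u @ v) ** 2)       # kernel shortcut (u.v)^2: also 1.0
```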

Finally: the kernel trick!
- Never represent features explicitly; compute dot products in closed form.
- Constant-time high-dimensional dot products for many classes of features.
- Very interesting theory behind this: reproducing kernel Hilbert spaces. Not covered in detail in 10701/15781; more in 10702.

Polynomial kernels
All monomials of degree d in O(d) operations: K(u, v) = (u·v)^d.
How about all monomials of degree up to d?
Solution 0: add up the kernels for each degree from 0 to d.
Better solution: K(u, v) = (u·v + 1)^d.

Common kernels
- Polynomials of degree d: K(u, v) = (u·v)^d
- Polynomials of degree up to d: K(u, v) = (u·v + 1)^d
- Gaussian kernels: K(u, v) = exp(-‖u - v‖² / (2σ²))
- Sigmoid: K(u, v) = tanh(η·(u·v) + ν)
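
These are straightforward to implement; a minimal sketch (added here, with the default parameter values as assumptions):

```python
import numpy as np

def poly_kernel(u, v, d=3):
    """All monomials of degree exactly d."""
    return (u @ v) ** d

def poly_upto_kernel(u, v, d=3):
    """All monomials of degree up to d."""
    return (u @ v + 1.0) ** d

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(u, v, eta=0.1, nu=0.0):
    """Sigmoid kernel (not positive semidefinite for all parameter choices)."""
    return np.tanh(eta * (u @ v) + nu)
```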

Overfitting?
Huge feature space with kernels: what about overfitting???
- Maximizing the margin leads to a sparse set of support vectors.
- Some interesting theory says that SVMs search for a simple hypothesis with a large margin.
- Often robust to overfitting.

What about at classification time?
For a new input x, if we need to represent Φ(x) explicitly, we are in trouble! Recall the classifier: sign(w·Φ(x) + b). But since w = Σ_i α_i y_i Φ(x_i), we only ever need kernel evaluations: sign(Σ_i α_i y_i K(x_i, x) + b). Using kernels, we are cool!

SVMs with kernels
- Choose a set of features and a kernel function K.
- Solve the dual problem to obtain the support vector weights α_i.
- At classification time, for a new input x compute f(x) = Σ_i α_i y_i K(x_i, x) + b and classify as sign(f(x)).
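
A minimal prediction sketch (an addition, not from the slides); the dual variables, bias, and kernel in the usage example are made-up assumptions:

```python
import numpy as np

def svm_predict(x, X, y, alpha, b, kernel):
    """Kernel SVM prediction: sign(sum_i alpha_i y_i K(x_i, x) + b)."""
    k = np.array([kernel(x_i, x) for x_i in X])
    return np.sign((alpha * y) @ k + b)

# Usage with a Gaussian kernel and made-up dual variables.
rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 2.0)
X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0])
alpha = np.array([0.5, 0.5])
print(svm_predict(np.array([0.9, 1.2]), X, y, alpha, b=0.0, kernel=rbf))  # 1.0
```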

What's the difference between SVMs and logistic regression?

                                          SVMs         Logistic regression
  Loss function                           Hinge loss   Log loss
  High-dimensional features with kernels  Yes!         No

Kernels in logistic regression
Define the weights in terms of the training points, w = Σ_i α_i Φ(x_i), then derive a simple gradient descent rule on the α_i; the updates only need kernel evaluations K(x_i, x_j) = Φ(x_i)·Φ(x_j).
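
A compact sketch of that gradient descent (an addition, not from the slides; the step size and iteration count are assumptions):

```python
import numpy as np

def kernel_logreg(K, y01, lr=0.1, iters=500):
    """Gradient descent on alpha for kernelized logistic regression.

    K: (n, n) kernel matrix over the training set; y01: labels in {0, 1}.
    Model: P(y = 1 | x_i) = sigmoid(sum_j alpha_j K(x_j, x_i)).
    """
    alpha = np.zeros(len(y01))
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-K @ alpha))      # predicted probabilities
        alpha -= lr * (K @ (p - y01)) / len(y01)  # gradient of the mean log loss
    return alpha
```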

What's the difference between SVMs and logistic regression? (Revisited)

                                          SVMs         Logistic regression
  Loss function                           Hinge loss   Log loss
  High-dimensional features with kernels  Yes!         Yes!
  Sparse solution                         Often yes!   Almost always no!

What you need to know
- The dual SVM formulation and how it's derived
- The kernel trick: deriving the polynomial kernel; common kernels
- Kernelized logistic regression
- Differences between SVMs and logistic regression

Acknowledgment
SVM applet: http://www.site.uottawa.ca/~gcaron/applets.htm
