Support Vector Machines CS434



Linear Separators. Many linear separators exist that perfectly classify all training examples. Which of the linear separators is the best?

Intuition of Margin. Consider points A, B, and C. We are quite confident in our prediction for A because it is far from the decision boundary. In contrast, we are not so confident in our prediction for C because a slight change in the decision boundary may flip the decision. (Figure: the boundary w·x + b = 0 separates the region w·x + b > 0 from the region w·x + b < 0; A lies far from the boundary, C lies close to it.) Given a training set, we would like to make all of our predictions correct and confident! This can be captured by the concept of margin.

The best choice: all points in the training set are reasonably far away from it. How do we capture this concept mathematically?

Margin. Given a linear decision boundary defined by w·x + b = 0, the functional margin of a point (x_i, y_i) is defined as y_i(w·x_i + b). For a fixed w and b, the larger the functional margin, the more confident we are about the prediction. How about we set our goal to be: find w and b so that all training data points have a large (maximum) functional margin? This has the right intuition but a critical technical flaw.

Issue with Functional Margin. Let's say this purple point (x_i, y_i) has a functional margin of 10, i.e., y_i(w·x_i + b) = 10, with the decision boundary defined by w·x + b = 0. What if we change the values of w and b proportionally by multiplying by a large constant, say 1000? What is the new decision boundary? 1000w·x + 1000b = 0, which is the same boundary. The functional margin of the point (x_i, y_i) becomes y_i(1000w·x_i + 1000b) = 10 × 1000 = 10000. We can arbitrarily change the functional margin without changing the boundary at all.
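
As a quick sanity check of this scaling issue, here is a minimal NumPy sketch (the weight vector, bias, and points are made-up illustrative values, not from the slides): scaling (w, b) by a positive constant leaves every prediction unchanged but multiplies every functional margin by that constant.

```python
import numpy as np

# Made-up 2-D example: boundary w.x + b = 0 and a few labeled points.
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[3.0, 1.0], [0.0, 4.0], [1.0, 1.0]])
y = np.array([1, -1, 1])

def functional_margin(w, b, X, y):
    """Functional margin y_i (w . x_i + b) for each point."""
    return y * (X @ w + b)

for c in (1.0, 1000.0):
    scores = X @ (c * w) + c * b
    print("scale", c,
          "predictions", np.sign(scores),                              # identical for any c > 0
          "functional margins", functional_margin(c * w, c * b, X, y)) # scaled by c
```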

What we really want. (Figure: points A, B, and C with their distances to the boundary.) Geometric margin: the distance between an example and the decision boundary.

Basic facts about lines. For the line w^T x + b = 0, the signed distance d from a point x to the line satisfies w^T x + b = ‖w‖d. To see this: let x_0 be the projection of x onto the line, so that w^T x_0 + b = 0. The vector x - x_0 (the red vector in the figure) points along w and has length |d|, i.e., x - x_0 = d·w/‖w‖. Multiplying both sides by w^T gives w^T x - w^T x_0 = d‖w‖, and since w^T x_0 = -b, we get d = (w^T x + b)/‖w‖. Note: d can be positive or negative depending on which side of the line x is on. We predict positive if w^T x + b ≥ 0 and negative otherwise.
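
The same fact in code: a small NumPy sketch (w, b, and the query point are made-up values) that computes the signed distance d = (w^T x + b)/‖w‖ and checks that x - d·w/‖w‖ lands back on the line.

```python
import numpy as np

w = np.array([3.0, 4.0])   # made-up normal vector, ||w|| = 5
b = -10.0
x = np.array([6.0, 7.0])   # made-up query point

d = (w @ x + b) / np.linalg.norm(w)       # signed distance to the line w.x + b = 0
x0 = x - d * w / np.linalg.norm(w)        # projection of x onto the line

print("signed distance d =", d)
print("w.x0 + b =", w @ x0 + b)           # ~0: x0 lies on the line
print("prediction:", "positive" if w @ x + b >= 0 else "negative")
```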

Geometric margin. We define the geometric margin of a point (x_i, y_i) w.r.t. the boundary w·x + b = 0 to be: γ_i = y_i(w·x_i + b)/‖w‖. It measures the geometric distance between the point and the decision boundary. It can be either positive or negative: positive if the point is correctly classified, negative if the point is misclassified.

Geometric margin of a decision boundary. For a training set {(x_i, y_i) : i = 1, ..., N} and boundary w·x + b = 0, compute the geometric margin of all points: γ_i = y_i(w·x_i + b)/‖w‖, i = 1, ..., N. Note: γ_i > 0 if point i is correctly classified. How can we tell if w·x + b = 0 is good? The smallest γ_i is large. The geometric margin of a decision boundary for a training set is defined to be γ = min_{i=1,...,N} γ_i. Goal: we aim to find a decision boundary whose geometric margin w.r.t. the training set is maximized!
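
A minimal sketch of this definition (the dataset and the boundary are made up for illustration): compute γ_i for every training point and take the minimum.

```python
import numpy as np

# Made-up linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 0.5], [4.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, 1, -1, -1])

def geometric_margin(w, b, X, y):
    """gamma_i = y_i (w . x_i + b) / ||w|| for each point; the boundary's
    margin w.r.t. the training set is the smallest gamma_i."""
    gammas = y * (X @ w + b) / np.linalg.norm(w)
    return gammas, gammas.min()

gammas, gamma = geometric_margin(np.array([1.0, 1.0]), -1.0, X, y)
print("per-point margins:", gammas)
print("margin of the boundary:", gamma)   # a negative value would mean a misclassified point
```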

What we have done so far. We have established that we want to find a linear decision boundary whose geometric margin is the largest. We have a new learning objective: given a linearly separable (this will be relaxed later) training set {(x_i, y_i) : i = 1, ..., N}, we would like to find a linear classifier (w, b) with the maximum geometric margin. How can we achieve this? Mathematically formulate this as an optimization problem and then solve it. In this class, we just need to know some basics about this process.

Maximum Margin Classifier. This can be represented as a constrained optimization problem: max_{w,b} γ subject to y_i(w·x_i + b)/‖w‖ ≥ γ, i = 1, ..., N. This optimization problem is in a nasty form, so we need to do some rewriting. Let γ' be the functional margin (we have γ' = γ‖w‖); we can rewrite this as: max_{w,b} γ'/‖w‖ subject to y_i(w·x_i + b) ≥ γ', i = 1, ..., N.

Maximum Margin Classifier. Note that we can arbitrarily rescale w and b to make the functional margin γ' large or small, so we can rescale them such that γ' = 1: max_{w,b} 1/‖w‖ (or equivalently min_{w,b} ‖w‖²/2) subject to y_i(w·x_i + b) ≥ 1, i = 1, ..., N. Maximizing the geometric margin is equivalent to minimizing the magnitude of w subject to maintaining a functional margin of at least 1.

Solving the Optimization Problem. min_{w,b} ‖w‖²/2 subject to y_i(w·x_i + b) ≥ 1, i = 1, ..., N. This results in a quadratic optimization problem with linear inequality constraints. This is a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist. In practice, we can just regard the QP solver as a black box without bothering about how it works. You'll be spared the excruciating details and jump straight to the solution.
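
To make the "QP solver as a black box" point concrete, here is a hedged sketch using the cvxopt package (an assumed library choice; any QP solver would do) on made-up toy data. It hands the solver the standard dual form of this QP over the multipliers α_i, which is the form whose solution appears on the next slide.

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes the cvxopt package is installed

# Made-up linearly separable 2-D data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [3.5, 3.0], [0.0, 0.0], [-1.0, 1.0], [0.5, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

# Dual QP: min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b
K = X @ X.T
P = matrix(np.outer(y, y) * K)
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))          # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))    # sum_i alpha_i y_i = 0
b = matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                            # support vectors: alpha_i > 0
b_val = np.mean(y[sv] - X[sv] @ w)           # b recovered from the support vectors
print("alpha:", np.round(alpha, 3))
print("w:", w, "b:", b_val)
```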

The solution. We cannot give you a closed-form solution that you can directly plug numbers into for an arbitrary data set. But the solution can always be written in the following form: w = Σ_{i=1}^{N} α_i y_i x_i, s.t. α_i ≥ 0 and Σ_{i=1}^{N} α_i y_i = 0. This is the form of w; the value of b can be calculated accordingly using some additional steps. A few important things to note: The weight vector is a linear combination of all the training examples. Importantly, many of the α_i's are zero. The points that have non-zero α_i's are the support vectors; these are the points that have the smallest geometric margin.
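
If an off-the-shelf SVM package is used instead, the same structure can be inspected directly. A minimal sketch with scikit-learn's SVC (an assumed library choice; linear kernel, very large C to approximate the hard margin, made-up data): only a few points end up with non-zero α_i, and w matches Σ_i α_i y_i x_i.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [3.5, 3.0], [0.0, 0.0], [-1.0, 1.0], [0.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # huge C approximates the hard margin

# dual_coef_ stores alpha_i * y_i for the support vectors only.
print("support vector indices:", clf.support_)        # the points with alpha_i > 0
print("alpha_i * y_i:", clf.dual_coef_)
w_from_alphas = clf.dual_coef_ @ clf.support_vectors_  # sum_i alpha_i y_i x_i
print("w from alphas:", w_from_alphas, " w from solver:", clf.coef_)
```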

A Geometrical Interpretation. (Figure: points from Class 1 and Class 2 with the maximum margin boundary; most points have α_i = 0, and only the support vectors on the margin have non-zero values, here α_1 = 0.8, α_6 = 1.4, and α_8 = 0.6.)

A few important notes regarding the geometric interpretation. w·x + b = 0 gives the decision boundary. Positive support vectors lie on the line w·x + b = 1; negative support vectors lie on the line w·x + b = -1. We can think of these two lines as defining a fat decision boundary. The support vectors exactly touch the two sides of the fat boundary. Learning involves adjusting the location and orientation of the decision boundary so that it can be as fat as possible without eating up the training examples.

Summary So Far. We defined margin (functional, geometric). We demonstrated that we prefer linear classifiers with a large geometric margin. We formulated the problem of finding the maximum margin linear classifier as a quadratic optimization problem. This problem can be solved using efficient QP algorithms that are available. The solutions for w and b are very nicely formed. Do we have our perfect classifier yet?

Non-separable Data and Noise. What if the data is not linearly separable? The solution does not exist, i.e., the set of linear constraints is not satisfiable. But we should still be able to find a good decision boundary. What if we have noise in the data? The maximum margin classifier is not robust to noise!

Soft Margin. Allow functional margins to be less than 1. Originally functional margins need to satisfy y_i(w·x_i + b) ≥ 1. Now we allow them to be less than 1: y_i(w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0. The objective changes to: min_{w,b} ‖w‖²/2 + c Σ_{i=1}^{N} ξ_i.

Soft-Margin Maximization. Hard margin: min_{w,b} ‖w‖²/2 subject to y_i(w·x_i + b) ≥ 1, i = 1, ..., N. With slack variables: min_{w,b} ‖w‖²/2 + c Σ_{i=1}^{N} ξ_i subject to y_i(w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0, i = 1, ..., N. This allows some functional margins < 1 (they could even be < 0). The ξ_i's can be viewed as the errors of our fat decision boundary. Adding the ξ_i's to the objective function aims to minimize the errors. We have a tradeoff between making the decision boundary fat and minimizing the error. Parameter c controls the tradeoff: Large c: ξ_i's incur a large penalty, so the optimal solution will try to avoid them. Small c: small cost for ξ_i's, so we can sacrifice some training examples to have a larger classifier margin.
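
A minimal illustration of this tradeoff (scikit-learn calls the parameter C; the noisy toy data below is made up): a small C keeps the boundary fat (small ‖w‖, wide margin, many support vectors), while a large C penalizes the ξ_i's harder and gives a thinner margin.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Made-up noisy, roughly separable data.
X = np.vstack([rng.normal([2, 2], 1.0, size=(50, 2)),
               rng.normal([-2, -2], 1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)       # width of the "fat" boundary
    print(f"C={C:6}: margin width={margin:.3f}, "
          f"#support vectors={len(clf.support_)}, "
          f"training accuracy={clf.score(X, y):.2f}")
```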

Solutions to soft-margin SVM. No soft margin: w = Σ_{i=1}^{N} α_i y_i x_i, s.t. Σ_{i=1}^{N} α_i y_i = 0 and α_i ≥ 0. With soft margin: w = Σ_{i=1}^{N} α_i y_i x_i, s.t. Σ_{i=1}^{N} α_i y_i = 0 and 0 ≤ α_i ≤ c. c effectively puts a box constraint on α, the weights of the support vectors. It limits the influence of individual support vectors (which may be outliers). In practice, c is a parameter to be set, similar to k in k-nearest neighbors. It can be set using cross-validation.
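
Since the slide suggests cross-validation for c, here is one hedged way to do it with scikit-learn's GridSearchCV (the grid values and data are illustrative, not from the lecture).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 1.5, size=(60, 2)),
               rng.normal([-2, -2], 1.5, size=(60, 2))])
y = np.array([1] * 60 + [-1] * 60)

# 5-fold cross-validation over a logarithmic grid of C values.
search = GridSearchCV(SVC(kernel='linear'),
                      param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X, y)
print("best C:", search.best_params_['C'],
      "cross-validated accuracy:", round(search.best_score_, 3))
```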

How to make predictions? For classifying a new input z, compute w·z + b = (Σ_{j=1}^{s} α_{t_j} y_{t_j} x_{t_j})·z + b = Σ_{j=1}^{s} α_{t_j} y_{t_j} (x_{t_j}·z) + b, where t_1, ..., t_s are the indices of the s support vectors. Classify z as + if the value is positive, and - otherwise. Note: w need not be computed/stored explicitly; we can store the α's and classify z using the above calculation, which involves taking the dot product between training examples and the new example z. In fact, in SVM both learning and classification can be achieved using dot products between pairs of input points: this allows us to do the kernel trick, that is, to replace the dot product with something called a kernel function.
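
A small sketch of this prediction rule (using scikit-learn only to obtain the α's and b on made-up data): the decision value for a new z is computed purely from dot products with the stored support vectors, without ever forming w.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [3.5, 3.0], [0.0, 0.0], [-1.0, 1.0], [0.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])
clf = SVC(kernel='linear', C=1e6).fit(X, y)

sv = clf.support_vectors_          # the x_{t_j}
coef = clf.dual_coef_.ravel()      # alpha_{t_j} * y_{t_j}
b = clf.intercept_[0]

def predict(z):
    """sum_j alpha_j y_j (x_j . z) + b, using only dot products with support vectors."""
    score = np.sum(coef * (sv @ z)) + b
    return 1 if score >= 0 else -1

z = np.array([1.5, 1.0])           # made-up new input
print("prediction for z:", predict(z))
print("matches sklearn:", predict(z) == int(clf.predict([z])[0]))
```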

Non-linear SVMs. Datasets that are linearly separable with some noise work out great. (Figure: 1-D points on the x-axis separated by a single threshold at 0.) But what are we going to do if the dataset is just too hard? (Figure: 1-D points on the x-axis that no single threshold can separate.)

Mapping the input to a higher-dimensional space can solve the linearly inseparable cases. (Figure: the 1-D input x, which is not separable by a threshold, is mapped to the 2-D points (x, x²), where a line separates the two classes.)
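
A hedged sketch of exactly this trick on made-up 1-D data (negatives near the origin, positives further out): no threshold on x separates the classes, but after mapping x → (x, x²) a linear SVM separates them perfectly.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 1-D data: class -1 near the origin, class +1 further away.
x = np.array([-3.0, -2.5, -2.0, 2.0, 2.5, 3.0, -0.5, 0.0, 0.3, 0.8, -0.9, 0.6])
y = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1])

# Linear SVM on the raw 1-D input: cannot separate the two classes.
raw = SVC(kernel='linear', C=1e3).fit(x.reshape(-1, 1), y)
print("accuracy on raw x:", raw.score(x.reshape(-1, 1), y))

# Map x -> (x, x^2): the classes become linearly separable in 2-D.
Phi = np.column_stack([x, x ** 2])
mapped = SVC(kernel='linear', C=1e3).fit(Phi, y)
print("accuracy on (x, x^2):", mapped.score(Phi, y))
```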

Non-linear SVMs: Feature Spaces. General idea: for any data set, the original input space can always be mapped to some higher-dimensional feature space such that the data is linearly separable: x → Φ(x).

Example: Quadratic Feature Space. Assume m input dimensions: x = (x_1, x_2, ..., x_m). The quadratic feature map Φ(x) consists of a constant 1, the linear terms √2·x_i, the squares x_i², and the cross terms √2·x_i·x_j. Number of quadratic terms: 1 + m + m + m(m-1)/2 ≈ m². The number of dimensions increases rapidly! You may be wondering about the √2's. At least they won't hurt anything! You'll find out why they are there soon!

Dot product in quadratic feature space. Φ(a)·Φ(b) = 1 + 2 Σ_{i=1}^{m} a_i b_i + Σ_{i=1}^{m} a_i² b_i² + 2 Σ_{i=1}^{m} Σ_{j=i+1}^{m} a_i a_j b_i b_j. Now let's just look at another interesting function of (a·b): (a·b + 1)² = (Σ_{i=1}^{m} a_i b_i + 1)² = Σ_{i=1}^{m} Σ_{j=1}^{m} a_i a_j b_i b_j + 2 Σ_{i=1}^{m} a_i b_i + 1. They are the same! And the latter only takes O(m) to compute!
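
A quick numerical check of this identity (the explicit quadratic map below, with its √2 factors, is the standard one the slide assumes; the random vectors are arbitrary): Φ(a)·Φ(b) and (a·b + 1)² agree, but the first costs O(m²) work and the second O(m).

```python
import numpy as np

def quad_features(x):
    """Explicit quadratic feature map: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j (i<j)."""
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

explicit = quad_features(a) @ quad_features(b)   # O(m^2) work
kernel = (a @ b + 1.0) ** 2                      # O(m) work
print(explicit, kernel)                          # the two numbers match
```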

Kernel Functions. If every data point is mapped into a high-dimensional space via some transformation x → Φ(x), the dot product that we need to compute for classifying a point x becomes ⟨Φ(x_i), Φ(x)⟩ for all support vectors x_i. A kernel function is a function that is equivalent to a dot product in some feature space: k(a, b) = ⟨Φ(a), Φ(b)⟩. We have seen the example k(a, b) = (a·b + 1)², which is equivalent to mapping to the quadratic space! A kernel function can often be viewed as measuring similarity.

More kernel functions. Linear kernel: k(a, b) = a·b. Polynomial kernel: k(a, b) = (a·b + 1)^d. Radial-Basis-Function kernel: k(a, b) = exp(-‖a - b‖²/(2σ²)). In this case, the corresponding mapping Φ(x) is infinite-dimensional! Lucky that we don't have to compute the mapping explicitly! Prediction for a new input z: Σ_{j=1}^{s} α_{t_j} y_{t_j} (Φ(x_{t_j})·Φ(z)) + b = Σ_{j=1}^{s} α_{t_j} y_{t_j} K(x_{t_j}, z) + b. Note: We'll not get into the details, but the learning of the α's can be achieved by using kernel functions as well!
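
To see why the kernel choice matters, here is a hedged comparison on scikit-learn's make_circles toy data (the dataset and parameter values are illustrative): a linear kernel cannot separate concentric rings, while the polynomial and RBF kernels can.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

for kernel in ('linear', 'poly', 'rbf'):
    clf = SVC(kernel=kernel, degree=2, gamma='scale', C=1.0).fit(X, y)
    print(f"{kernel:>6} kernel: training accuracy = {clf.score(X, y):.2f}")
```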

Nonlinear SVM summary. Using kernel functions in place of the dot product, we, in effect, map the input space to a high-dimensional feature space and learn a linear decision boundary in the feature space. The decision boundary will be nonlinear in the original input space. There are many possible choices of kernel functions. How to choose? The most frequently used method: cross-validation.

Strength vs Weakness. Strengths: The solution is globally optimal. It scales well with high-dimensional data. It can handle non-traditional data like strings and trees, instead of the traditional fixed-length feature vectors. Why? Because as long as we can define a kernel (similarity) function for such input, we can apply SVM. Weaknesses: We need to specify a good kernel. Training time can be long if you use the wrong software package.