Support Vector Machines CS434

Linear Separators: Many linear separators exist that perfectly classify all training examples. Which of the linear separators is the best? (Figure: positive and negative training points with several candidate separating lines.)

Intuition of Margin: Consider points A, B, and C. We are quite confident in our prediction for A because it is far from the decision boundary. In contrast, we are not so confident in our prediction for C because a slight change in the decision boundary may flip the decision. (Figure: points A, B, and C relative to the boundary w · x + b = 0, with the regions w · x + b > 0 and w · x + b < 0 on either side.) Given a training set, we would like to make all of our predictions correct and confident! This can be captured by the concept of margin.

(Figure: the same training points with the preferred separator.) This is the best choice because all points in the training set are reasonably far away from it. How do we capture this concept mathematically?

Margin: Given a linear decision boundary defined by w · x + b = 0, the functional margin for a point (x_i, y_i) is defined as γ̂_i = y_i (w · x_i + b). For a fixed w and b, the larger the functional margin, the more confidence we have about the prediction. A possible goal: find w and b so that all training data points have a large (maximum) functional margin? This has the right intuition but a critical technical flaw.

Issue with Functional Margin: Say this purple point (x_i, y_i) has a functional margin of 10, i.e., y_i (w · x_i + b) = 10, for the decision boundary defined by w · x + b = 0. What if we change the values of w and b proportionally by multiplying by a large constant, say 1000? What is the new decision boundary? 1000 w · x + 1000 b = 0, which is exactly the same boundary. The functional margin of the point (x_i, y_i) becomes y_i (1000 w · x_i + 1000 b) = 10 × 1000 = 10000. We can arbitrarily change the functional margin without changing the boundary at all.

What we should measure: Geometric margin. We define the geometric margin of a point (x_i, y_i) with respect to w · x + b = 0 to be γ_i = y_i (w · x_i + b) / ‖w‖. It measures the geometric distance between the point and the decision boundary. It can be either positive or negative: positive if the point is correctly classified, negative if the point is misclassified.

Geometric margin of a decision boundary: For a training set {(x_i, y_i) : i = 1, …, N} and boundary w · x + b = 0, compute the geometric margin of all points: γ_i = y_i (w · x_i + b) / ‖w‖, i = 1, …, N. Note: γ_i > 0 if point i is correctly classified. How can we tell if w · x + b = 0 is good? The smallest γ_i is large. The geometric margin of a decision boundary for a training set is defined to be γ = min_{i=1,…,N} γ_i. (Figure: points A, B, and C with their distances to the boundary; γ_A marked.) Goal: we aim to find a decision boundary whose geometric margin w.r.t. the training set is maximized!
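
As a concrete illustration of these definitions, here is a minimal NumPy sketch; the data, w, and b below are invented for illustration, not from the lecture:

    import numpy as np

    # Toy training set: rows of X are points x_i, labels y_i in {-1, +1}
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])

    # A candidate boundary w . x + b = 0 (arbitrary values)
    w = np.array([1.0, 1.0])
    b = -0.5

    functional = y * (X @ w + b)                # functional margins y_i (w . x_i + b)
    geometric = functional / np.linalg.norm(w)  # geometric margins
    print(geometric, geometric.min())           # the boundary's margin is the smallest geometric margin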

What we have done so far: We have established that we want to find a linear decision boundary whose geometric margin is the largest. We have a new learning objective: given a linearly separable (will be relaxed later) training set {(x_i, y_i) : i = 1, …, N}, we would like to find a linear classifier (w, b) with the maximum geometric margin. How can we achieve this? Mathematically formulate this as an optimization problem and then solve it.

Maximum Margin Classifier: This can be represented as a constrained optimization problem: max_{w,b} γ subject to y_i (w · x_i + b) / ‖w‖ ≥ γ, i = 1, …, N. This optimization problem is in a nasty form, so we need to do some rewriting. Let γ′ be the functional margin (we have γ = γ′ / ‖w‖), so we can rewrite this as: max_{w,b} γ′ / ‖w‖ subject to y_i (w · x_i + b) ≥ γ′, i = 1, …, N.

Maximum Margin Classifier: Note that we can arbitrarily rescale w and b to make the functional margin γ′ large or small, so we can rescale them such that γ′ = 1: max_{w,b} 1 / ‖w‖, or equivalently min_{w,b} ‖w‖², subject to y_i (w · x_i + b) ≥ 1, i = 1, …, N. Maximizing the geometric margin is equivalent to minimizing the magnitude of w subject to maintaining a functional margin of at least 1.

Solving the Optimization Problem: min_{w,b} (1/2) w · w subject to y_i (w · x_i + b) ≥ 1, i = 1, …, N. This is quadratic optimization with linear inequality constraints, a well-known class of mathematical programming problems, and several (non-trivial) algorithms exist. In practice, one can regard the QP solver as a black box. It is useful to formulate an equivalent dual optimization problem and solve that one instead.
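
To make the primal program concrete, here is a minimal sketch that hands it to a generic convex solver; the choice of the cvxpy package and the toy data are assumptions for illustration, not part of the lecture:

    import numpy as np
    import cvxpy as cp

    # Linearly separable toy data (invented)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(X.shape[1])
    b = cp.Variable()

    # min (1/2) w . w   subject to   y_i (w . x_i + b) >= 1
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()

    print(w.value, b.value)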

The solution: We cannot give you a closed-form solution that you can directly plug the numbers into for an arbitrary data set. But the solution can always be written in the following form: w = Σ_{i=1}^{N} α_i y_i x_i, with α_i ≥ 0 and Σ_{i=1}^{N} α_i y_i = 0. This is the form of w; the value for b can be calculated accordingly using some additional steps. A few important things to note: the weight vector is a linear combination of all the training examples; importantly, many of the α_i's are zero; the points that have non-zero α_i are the support vectors; these are the points that have the smallest geometric margin.

Aside: Constrained Optimization. To solve an optimization problem of the form min_x f(x) subject to g_i(x) ≤ 0, consider the following function, known as the Lagrangian: L(x, α) = f(x) + Σ_i α_i g_i(x), with α_i ≥ 0. Under certain conditions it can be shown that for a solution x* to the above problem we have min_x max_{α ≥ 0} L(x, α) (the primal form) = max_{α ≥ 0} min_x L(x, α) (the dual form).

Back to the Original Problem: min_{w,b} (1/2) w · w subject to 1 − y_i (w · x_i + b) ≤ 0, i = 1, …, N. The Lagrangian is L(w, b, α) = (1/2) w · w + Σ_{i=1}^{N} α_i {1 − y_i (w · x_i + b)}, and we want to solve min_{w,b} max_{α ≥ 0} L(w, b, α). Setting the gradient of L w.r.t. w and b to zero, we have w − Σ_{i=1}^{N} α_i y_i x_i = 0, i.e., w = Σ_{i=1}^{N} α_i y_i x_i, and Σ_{i=1}^{N} α_i y_i = 0.

The Dual Problem: If we substitute w = Σ_{i=1}^{N} α_i y_i x_i into L, we have L(α) = (1/2) w · w − Σ_{i=1}^{N} α_i {y_i (w · x_i + b) − 1} = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j (x_i · x_j) − b Σ_{i=1}^{N} α_i y_i. Note that Σ_{i=1}^{N} α_i y_i = 0, so the last term vanishes. This is a function of α only.

The Dual Problem: The new objective function is in terms of the α_i only. It is known as the dual problem: if we know all the α_i, we know w. The original problem is known as the primal problem. The objective function of the dual problem needs to be maximized! The dual problem is therefore: max_α L(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j (x_i · x_j) subject to α_i ≥ 0, i = 1, …, N, and Σ_{i=1}^{N} α_i y_i = 0. The first constraint is a property of the Lagrange multipliers when we introduce them; the second is the result of differentiating the original Lagrangian w.r.t. b.

The Dual Problem: max_α L(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j (x_i · x_j) subject to α_i ≥ 0, i = 1, …, N, and Σ_{i=1}^{N} α_i y_i = 0. This is also a quadratic programming (QP) problem, and a global maximum of L(α) can always be found. w can be recovered by w = Σ_{i=1}^{N} α_i y_i x_i; b can be recovered as well (wait for a bit).
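
The dual can be handed to the same kind of solver. A minimal sketch, again assuming cvxpy and invented data; the quadratic term is written as a sum of squares, since α'Qα with Q_ij = y_i y_j (x_i · x_j) equals ‖Σ_i α_i y_i x_i‖²:

    import numpy as np
    import cvxpy as cp

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    G = y[:, None] * X                        # rows are y_i x_i
    alpha = cp.Variable(len(y), nonneg=True)  # alpha_i >= 0

    # max  sum_i alpha_i - (1/2) || sum_i alpha_i y_i x_i ||^2   s.t.  sum_i alpha_i y_i = 0
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(G.T @ alpha))
    cp.Problem(objective, [alpha @ y == 0]).solve()

    alphas = alpha.value
    w = (alphas * y) @ X                      # recover w = sum_i alpha_i y_i x_i
    print(alphas, w)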

Characteristics of the Solution: Many of the α_i are zero, so w is a linear combination of only a small number of data points. In fact, optimization theory requires the solution to satisfy the following KKT conditions: α_i ≥ 0, i = 1, …, N; y_i (w · x_i + b) − 1 ≥ 0 (functional margin at least 1); and α_i {y_i (w · x_i + b) − 1} = 0, so α_i is nonzero only when the functional margin equals 1. The x_i with non-zero α_i are called support vectors (SVs). The decision boundary is determined only by the SVs. Let t_j (j = 1, …, s) be the indices of the s support vectors. We can write w = Σ_{j=1}^{s} α_{t_j} y_{t_j} x_{t_j}.
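
An off-the-shelf solver exposes exactly these quantities. The sketch below, assuming scikit-learn is installed, fits a nearly hard-margin linear SVM on invented toy data and reads off the support vectors and their α_i y_i coefficients:

    import numpy as np
    from sklearn.svm import SVC

    # Invented, linearly separable toy data
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
                  [-1.0, -1.5], [-2.0, -1.0], [-1.5, -2.5]])
    y = np.array([1, 1, 1, -1, -1, -1])

    # A very large C approximates the hard-margin formulation
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    print(clf.support_)                # indices of the support vectors
    print(clf.dual_coef_)              # the products alpha_i * y_i for the support vectors
    print(clf.coef_, clf.intercept_)   # recovered w and b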

Solve for b: Note that for support vectors the functional margin = 1. We can use this information to solve for b, using any single support vector (x_{t_k}, y_{t_k}): y_{t_k} (Σ_{j=1}^{s} α_{t_j} y_{t_j} (x_{t_j} · x_{t_k}) + b) = 1. A numerically more stable solution is to use all support vectors (details in the book).
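
A small numerical sketch of that averaging trick, written as a standalone function; the argument names alphas, X, y are hypothetical, standing for the dual solution and training data from above:

    import numpy as np

    def solve_b(alphas, X, y, tol=1e-8):
        # Average over all support vectors k: b = mean_k ( y_k - sum_i alpha_i y_i (x_i . x_k) )
        sv = alphas > tol                         # support vectors have non-zero alpha
        decision = (alphas * y) @ (X @ X[sv].T)   # sum_i alpha_i y_i (x_i . x_k) for each SV k
        return np.mean(y[sv] - decision)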

Classifying new examples: For classifying a new input z, compute w · z + b = Σ_{j=1}^{s} α_{t_j} y_{t_j} (x_{t_j} · z) + b and classify z as positive if the sum is positive, and negative otherwise. Note: w need not be formed explicitly; rather, we can classify z by taking a weighted sum of the inner products with the support vectors (useful when we generalize from inner products to kernel functions later).
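
The corresponding prediction rule, again as a sketch that never forms w explicitly (same hypothetical arrays as above):

    import numpy as np

    def predict(z, alphas, X, y, b, tol=1e-8):
        # Classify a new input z using only the support vectors
        sv = alphas > tol
        score = np.sum(alphas[sv] * y[sv] * (X[sv] @ z)) + b   # weighted sum of inner products
        return 1 if score > 0 else -1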

The Quadratic Programming Problem: Many approaches have been proposed: LOQO, CPLEX, etc. (see http://www.numerical.rl.ac.uk/qp/qp.html). Most are interior-point methods: start with an initial solution that can violate the constraints, then improve this solution by optimizing the objective function and/or reducing the amount of constraint violation. For SVMs, sequential minimal optimization (SMO) seems to be the most popular. A QP with two variables is trivial to solve; each iteration of SMO picks a pair (α_i, α_j) and solves the QP with these two variables, repeating until convergence. In practice, we can just regard the QP solver as a black box without bothering with how it works.

A Geometrical Interpretation. (Figure: points from Class 1 and Class 2 with their multipliers; most points have α_i = 0, while the points on the margin have α_1 = 0.8, α_6 = 1.4, and α_8 = 0.6.)

A Geometrical Interpretation. (Figure: the same points, with the weight vector w perpendicular to the boundary and the margin width 2/‖w‖ between the two margin lines.)

A few important notes regarding the geometric interpretation: w · x + b = 0 gives the decision boundary. Positive support vectors lie on the line w · x + b = 1; negative support vectors lie on the line w · x + b = −1. We can think of these two lines as defining a fat decision boundary. The support vectors exactly touch the two sides of the fat boundary. Learning involves adjusting the location and orientation of the decision boundary so that it can be as fat as possible without eating up the training examples.

Summary So Far: We defined margin (functional, geometric). We demonstrated that we prefer linear classifiers with a large geometric margin. We formulated the problem of finding the maximum margin linear classifier as a quadratic optimization problem. This problem can be solved using efficient QP algorithms that are available. The solutions for w and b are very nicely formed. Do we have our perfect classifier yet?

Non-separable Data and Noise: What if the data is not linearly separable? The solution does not exist, i.e., the set of linear constraints is not satisfiable. But we should still be able to find a good decision boundary. What if we have noise in the data? The maximum margin classifier is not robust to noise!

Soft Margin: Allow functional margins to be less than 1. Originally functional margins need to satisfy y_i (w · x_i + b) ≥ 1. Now we allow them to be less than 1: y_i (w · x_i + b) ≥ 1 − ξ_i with ξ_i ≥ 0. The objective changes to min_{w,b} ‖w‖²/2 + c Σ_{i=1}^{N} ξ_i. (Figure: positive and negative classes with a few points falling inside the margin.)

Soft-Margin Maximization: Hard margin: min_{w,b} ‖w‖²/2 subject to y_i (w · x_i + b) ≥ 1, i = 1, …, N. With slack variables: min_{w,b} ‖w‖²/2 + c Σ_{i=1}^{N} ξ_i subject to y_i (w · x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, i = 1, …, N. This allows some functional margins < 1 (they could even be < 0). The ξ_i can be viewed as the errors of our fat decision boundary. Adding the ξ_i to the objective function minimizes the errors. We have a tradeoff between making the decision boundary fat and minimizing the error. Parameter c controls the tradeoff: with large c the ξ_i incur a large penalty, so the optimal solution will try to avoid them; with small c the cost for the ξ_i is small, so we can sacrifice some training examples to get a larger classifier margin.
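
For concreteness, the slack formulation can be written out directly as a program; a minimal sketch, assuming cvxpy and using invented data with one noisy point:

    import numpy as np
    import cvxpy as cp

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.5], [-2.0, -1.0], [1.5, 1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])   # the last point sits on the "wrong" side
    c = 1.0                                      # tradeoff parameter

    w, b = cp.Variable(2), cp.Variable()
    xi = cp.Variable(len(y), nonneg=True)        # slack variables xi_i >= 0

    # min ||w||^2 / 2 + c * sum_i xi_i   subject to   y_i (w . x_i + b) >= 1 - xi_i
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + c * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
    cp.Problem(objective, constraints).solve()

    print(w.value, b.value, xi.value)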

Solutions to soft-margin SVM: With no soft margin, w = Σ_{i=1}^{N} α_i y_i x_i, s.t. Σ_{i=1}^{N} α_i y_i = 0 and α_i ≥ 0. With soft margin, w = Σ_{i=1}^{N} α_i y_i x_i, s.t. Σ_{i=1}^{N} α_i y_i = 0 and 0 ≤ α_i ≤ c. c effectively puts a box constraint on the α_i, the weights of the support vectors; it limits the influence of individual support vectors (maybe outliers). In practice, c is a parameter to be set, similar to k in k-nearest neighbors; it can be set using cross-validation.

How to make predictions? For classifying a new input z, compute w · z + b = (Σ_{i=1}^{N} α_i y_i x_i) · z + b = Σ_{j=1}^{s} α_{t_j} y_{t_j} (x_{t_j} · z) + b, and classify z as + if the result is positive, and − otherwise. Note: w need not be computed/stored explicitly; we can store the α_i and classify z using the above calculation, which involves taking dot products between training examples and the new example z. In fact, in SVMs both learning and classification can be achieved using only dot products between pairs of input points; this allows us to do the kernel trick, that is, to replace the dot product with something called a kernel function.

Non-linear SVMs: Soft margin works out great for datasets that are linearly separable with some noise. (Figure: points on a 1-D axis x that can be separated by a threshold near 0.) But what are we going to do if the dataset is just too hard (not linearly separable)? (Figure: points on the x axis where positive and negative examples are interleaved around 0.)

Mapping the input to a higher-dimensional space can solve the linearly inseparable cases. (Figure: the 1-D data x becomes linearly separable after mapping each point x to (x, x²).)
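
A tiny sketch of that picture: 1-D points that no single threshold can separate become linearly separable after the mapping x ↦ (x, x²); the data and the separating line are invented for illustration:

    import numpy as np

    x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
    y = np.array([1, 1, -1, -1, -1, 1, 1])      # inner points negative, outer positive

    # No threshold on x separates the classes, but in (x, x^2) space
    # the line x2 = 2 (i.e., w = (0, 1), b = -2) does.
    phi = np.column_stack([x, x ** 2])
    print(np.all(y * (phi @ np.array([0.0, 1.0]) - 2.0) > 0))   # True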

Non-linear SVMs: Feature Spaces. General idea: for any data set, the original input space can always be mapped to some higher-dimensional feature space such that the data is linearly separable: x ↦ Φ(x).

Example: Quadratic Feature Space. Assume m input dimensions, x = (x_1, x_2, …, x_m). The quadratic feature map Φ(x) contains the constant 1, the linear terms x_i, the squares x_i², and the cross terms x_i x_j. Number of quadratic terms: 1 + m + m + m(m − 1)/2. The number of dimensions increases rapidly! You may be wondering about the √2 factors: at least they won't hurt anything! You will find out why they are there soon!

Dot Product in Quadratic Feature Space: Compute Φ(a) · Φ(b) = 1 + 2 Σ_{i=1}^{m} a_i b_i + Σ_{i=1}^{m} a_i² b_i² + 2 Σ_{i=1}^{m} Σ_{j=i+1}^{m} a_i a_j b_i b_j. Now let's just look at another interesting function of (a · b): (a · b + 1)² = (Σ_{i=1}^{m} a_i b_i + 1)² = 1 + 2 Σ_{i=1}^{m} a_i b_i + Σ_{i=1}^{m} a_i² b_i² + 2 Σ_{i=1}^{m} Σ_{j=i+1}^{m} a_i a_j b_i b_j. They are the same! And the latter only takes O(m) to compute!
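
A quick numerical check of that identity; the explicit feature map below, with its √2 factors, is the standard quadratic map the slide refers to, and the test vectors are made up:

    import numpy as np

    def phi(x):
        # Explicit quadratic feature map: 1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j (i < j)
        m = len(x)
        cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
        return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

    a = np.array([1.0, -2.0, 0.5])
    b = np.array([0.3, 1.5, -1.0])
    print(np.isclose(phi(a) @ phi(b), (a @ b + 1) ** 2))   # True: same value, but O(m) to compute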

Kernel Functions: The linear classifier relies on inner products between vectors. If every data point is mapped into a high-dimensional space via some transformation x ↦ Φ(x), the inner product becomes Φ(x_i) · Φ(x_j). Definition: a function k(x_i, x_j) is called a kernel function if k(x_i, x_j) = Φ(x_i) · Φ(x_j) for some Φ. We have seen the example k(x_i, x_j) = (x_i · x_j + 1)²; this is equivalent to mapping to the quadratic feature space!

Non-linear SVM with Kernels: Dual formulation of the learning: max_α Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j k(x_i, x_j) subject to 0 ≤ α_i ≤ c and Σ_{i=1}^{N} α_i y_i = 0. Dual formulation of prediction: classify a new input z by the sign of Σ_{j=1}^{s} α_{t_j} y_{t_j} k(x_{t_j}, z) + b.

More about kernel functions: In practice, we specify the kernel function without explicitly stating the transformation Φ. Given a kernel function, finding its corresponding transformation can be very cumbersome. A kernel function can be viewed as computing some similarity measure between objects. If you have a good similarity measure, can you use it as a kernel in an SVM?

Kernel function or not? Consider a finite set of m points x_1, …, x_m; we define the kernel matrix K as the m × m matrix with entries K_{ij} = k(x_i, x_j). Kernel matrices are square and symmetric. Mercer's theorem: a function k is a kernel function iff for any finite sample x_1, …, x_m its corresponding kernel matrix is positive semi-definite (non-negative eigenvalues).
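
Mercer's condition can be sanity-checked numerically on any finite sample; a sketch with made-up data and two candidate similarity functions, only the first of which is a kernel:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))                 # a finite sample of m = 20 points

    def kernel_matrix(k, X):
        return np.array([[k(a, b) for b in X] for a in X])

    good = kernel_matrix(lambda a, b: (a @ b + 1) ** 2, X)        # quadratic kernel
    bad = kernel_matrix(lambda a, b: -np.linalg.norm(a - b), X)   # negative distance: not a kernel

    print(np.linalg.eigvalsh(good).min())   # >= 0 up to rounding: passes Mercer's condition
    print(np.linalg.eigvalsh(bad).min())    # clearly negative: fails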

More kernel functions. Linear kernel: k(a, b) = a · b. Polynomial kernel: k(a, b) = (a · b + 1)^d. Radial-Basis-Function (RBF) kernel: k(a, b) = exp(−‖a − b‖² / (2σ²)); in this case, the corresponding mapping Φ(x) is infinite-dimensional! Closure properties of kernels: if k_1 and k_2 are kernel functions, then combinations such as k_1 + k_2, c·k_1 (for c > 0), and k_1·k_2 are also kernel functions.
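
For reference, these kernels are one-liners; the sketch below evaluates each on a pair of made-up vectors (σ and d are free parameters, not values from the lecture):

    import numpy as np

    def linear_kernel(a, b):
        return a @ b

    def poly_kernel(a, b, d=2):
        return (a @ b + 1) ** d

    def rbf_kernel(a, b, sigma=1.0):
        return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

    a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(linear_kernel(a, b), poly_kernel(a, b, d=3), rbf_kernel(a, b, sigma=0.5))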

Key Choices in Applying SVMs. Selecting the kernel function: in practice, we often try different kernel functions and use cross-validation to choose. Linear kernels, polynomial kernels (with low degrees), and RBF kernels are popular choices. One can also construct a kernel using linear combinations of different kernels and learn the combination parameters (kernel learning). Selecting the kernel parameter and c: these have a very strong impact on performance; the optimal range is often reasonably large; use grid search with cross-validation.
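
One way to carry out that grid search, assuming scikit-learn is available (the toy problem and the parameter grid below are arbitrary examples, not recommendations from the lecture):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # a non-linear toy problem

    grid = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)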

SVM summary: strengths vs. weaknesses. Strengths: the solution is globally optimal and exact; kernels allow for very flexible hypotheses; SVMs can handle non-traditional data like strings and trees instead of the traditional fixed-length feature vectors. Why? Because as long as we can define a kernel (similarity) function for such inputs, we can apply an SVM. Weaknesses: need to specify a good kernel and tune its parameters; training time can be long.