Which Separator? 6.034 Spring

Which Separator?

Which Separator? Maximize the margin to the closest points.

Margin of a point: $\gamma_i = y_i(w \cdot x_i + b)$, proportional to the perpendicular distance from the point to the hyperplane; the geometric margin is $\gamma_i / \|w\|$.

Margin: $\gamma_i = y_i(w \cdot x_i + b)$. Scaling $w$ changes the value of the margin but not the actual distances to the separator (the geometric margin). Pick the margin to the closest positive and negative points to be 1: $+1\,(w \cdot x_+ + b) = 1$ and $-1\,(w \cdot x_- + b) = 1$.

Margin: Pick the margin to the closest positive and negative points to be 1: $+1\,(w \cdot x_+ + b) = 1$ and $-1\,(w \cdot x_- + b) = 1$. Combining these: $w \cdot (x_+ - x_-) = 2$. Dividing by the length of $w$ gives the perpendicular distance between the margin lines (the geometric margin): $\frac{w}{\|w\|} \cdot (x_+ - x_-) = \frac{2}{\|w\|}$.
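A quick numeric check of the algebra above, using a made-up canonical separator and two points chosen to lie exactly on the margin lines (a minimal sketch; all numbers are hypothetical):

```python
import numpy as np

# Hypothetical canonical separator: w . x + b = 0 with w = (2, 0), b = 0.
w = np.array([2.0, 0.0])
b = 0.0

# Points chosen to sit exactly on the margin lines w . x + b = +1 and -1.
x_pos = np.array([0.5, 3.0])   # label +1:  +1 * (w . x_pos + b) = 1
x_neg = np.array([-0.5, 1.0])  # label -1:  -1 * (w . x_neg + b) = 1

print(w @ x_pos + b)                              # 1.0
print(w @ x_neg + b)                              # -1.0
print(w @ (x_pos - x_neg))                        # 2.0, from combining the two constraints
print((w / np.linalg.norm(w)) @ (x_pos - x_neg))  # 1.0, the geometric margin
print(2 / np.linalg.norm(w))                      # 1.0, i.e. 2 / ||w||, the same value
```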

Picking w to Maximize the Margin: Pick $w$ to maximize the geometric margin $\frac{2}{\|w\|}$; or, equivalently, minimize $\|w\| = \sqrt{w \cdot w}$; or, equivalently, minimize $\frac{1}{2}\|w\|^2 = \frac{1}{2}\, w \cdot w = \frac{1}{2}\sum_j w_j^2$, while classifying points correctly: $y_i(w \cdot x_i + b) \ge 1$, or, equivalently, $y_i(w \cdot x_i + b) - 1 \ge 0$.

Constrained Optimization: $\min_w \tfrac{1}{2}\|w\|^2$ subject to $y_i(w \cdot x_i + b) - 1 \ge 0, \; \forall i$.

Constrained optimization (example: minimize $x^2$ subject to $x \ge b$): with no constraint, $x^* = 0$; with $b = -1$ the constraint is inactive and still $x^* = 0$; with $b = 1$ the constraint is active and $x^* = 1$. How do we solve with constraints? Lagrange multipliers!

Lagrange multipliers (dual variables). Add a Lagrange multiplier $\alpha \ge 0$, rewrite the constraint as $x - b \ge 0$, and introduce the Lagrangian (objective) for the example: $L(x, \alpha) = x^2 - \alpha(x - b)$. We will solve $\min_x \max_{\alpha \ge 0} L(x, \alpha)$. Why does this work at all? The min is fighting the max: if $x < b$, then $(x - b) < 0$, so $\max_\alpha -\alpha(x - b) = \infty$, and the min won't let that happen. If $x > b$ and $\alpha > 0$, then $(x - b) > 0$, so $\max_\alpha -\alpha(x - b) = 0$ with $\alpha^* = 0$; the min is fine with 0, and $L(x, \alpha) =$ (original objective). If $x = b$, then $\alpha$ can be anything, and $L(x, \alpha) =$ (original objective). Since the min is on the outside, it can force the max to behave, and the constraints will be satisfied.
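To see the min-max argument concretely, here is a tiny numeric sketch of $\min_x \max_{\alpha \ge 0}\, x^2 - \alpha(x - b)$ with $b = 1$, approximating the inner maximization over a finite grid of $\alpha$ values (a stand-in for letting $\alpha \to \infty$; both grids are arbitrary choices):

```python
import numpy as np

b = 1.0
xs = np.linspace(-2, 3, 501)        # candidate x values (grid includes x = 1.0 exactly)
alphas = np.linspace(0, 10, 1001)   # alpha >= 0; finite grid approximates alpha -> infinity

# Lagrangian L(x, alpha) = x^2 - alpha * (x - b), evaluated on the whole grid.
L = xs[:, None] ** 2 - alphas[None, :] * (xs[:, None] - b)

inner_max = L.max(axis=1)            # max over alpha for each x (huge when x < b)
x_star = xs[np.argmin(inner_max)]    # outer min over x

print(x_star)                              # ~1.0: the constraint x >= 1 is active
print(min(x ** 2 for x in xs if x >= b))   # constrained optimum, also ~1.0
```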

Constrained Optimization: $\min_w \tfrac{1}{2}\|w\|^2$ subject to $y_i(w \cdot x_i + b) - 1 \ge 0, \; \forall i$. Convert to an unconstrained optimization by incorporating the constraints as an additional term (the method of Lagrange multipliers): $\min_{w,b}\left\{ \tfrac{1}{2}\, w \cdot w - \sum_i \alpha_i \left[ y_i(w \cdot x_i + b) - 1 \right] \right\}$, with Lagrange multipliers $\alpha_i \ge 0, \; \forall i$. To minimize the expression: minimize the first (original) term and maximize the second (constraint) term; since $\alpha_i \ge 0$, this encourages the constraints to be satisfied, but we want the least distortion of the original term.

Maximizing the Margin: $L(w, b) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w \cdot x_i + b) - 1 \right]$. Minimized when: $w^* = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$. Substituting $w^*$ into $L$ yields the dual Lagrangian: $L(\alpha) = \sum_{i=1}^{m} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{m} \sum_{k=1}^{m} \alpha_i \alpha_k y_i y_k \, x_i \cdot x_k$. Only dot products of the feature vectors appear.

Dual Lagrangian: $\max_\alpha L(\alpha)$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0, \; \forall i$. In general, since $\alpha_i \ge 0$, either $\alpha_i = 0$: the constraint is satisfied with no distortion at the optimum $w$; or $\alpha_i > 0$: the constraint is satisfied with equality (in this case $x_i$ is known as a support vector). Then $w^* = \sum_i \alpha_i y_i x_i$ and $b = \frac{1}{y_i} - w^* \cdot x_i$ for any support vector $x_i$. The dual has a unique maximum and can be found using quadratic programming or gradient ascent.
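As a concrete illustration of solving the dual by quadratic programming, here is a minimal sketch using scipy.optimize.minimize (SLSQP) on a tiny, hypothetical 2-D dataset; a real implementation would use a dedicated QP or SMO solver, and every number below is made up:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Q[i, k] = y_i y_k (x_i . x_k): only dot products of the samples are needed.
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(a):
    # Negated dual objective, since minimize() minimizes: L(a) = sum(a) - 1/2 a^T Q a.
    return 0.5 * a @ Q @ a - a.sum()

n = len(y)
res = minimize(
    neg_dual,
    x0=np.zeros(n),
    method="SLSQP",
    bounds=[(0.0, None)] * n,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x

# Recover w* = sum_i alpha_i y_i x_i and b from the support vectors (alpha_i > 0).
w = (alpha * y) @ X
sv = alpha > 1e-6
b = np.mean(y[sv] - X[sv] @ w)

print("alpha:", np.round(alpha, 3))
print("w*:", np.round(w, 3), "b:", round(float(b), 3))
print("y_i (w.x_i + b):", np.round(y * (X @ w + b), 3))   # >= 1; exactly 1 for support vectors
```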

SVM Classifier: Given an unknown vector $u$, predict the class ($+1$ or $-1$) as follows: $h(u) = \operatorname{sign}\left( \sum_{i=1}^{k} \alpha_i y_i \, x_i \cdot u + b \right)$. The sum is over the $k$ support vectors.
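Continuing the sketch above, prediction needs only dot products between the query point and the support vectors (the query point u here is arbitrary):

```python
def predict(u, X, y, alpha, b, sv_threshold=1e-6):
    # h(u) = sign( sum over support vectors of alpha_i y_i (x_i . u) + b ).
    sv = alpha > sv_threshold
    return np.sign(np.sum(alpha[sv] * y[sv] * (X[sv] @ u)) + b)

u = np.array([2.5, 1.0])           # hypothetical query point
print(predict(u, X, y, alpha, b))  # +1.0 or -1.0
```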

Bankruptcy Example (figure): the $\alpha_i y_i$ values for the support vectors are non-zero; all others are zero.

Key Points: Learning depends only on dot products of sample pairs. Recognition depends only on dot products of the unknown with the samples. Exclusive reliance on dot products enables an approach to non-linearly-separable problems. The classifier depends only on the support vectors, not on all the training points. Max margin lowers hypothesis variance. The optimal classifier is defined uniquely: there are no local maxima in the search space, and training is polynomial in the number of data points and the dimensionality.

Not Linearly Separable? Require $0 \le \alpha_i \le C$. $C$ is specified by the user; it controls the tradeoff between the size of the margin and the classification errors. $C = \infty$ corresponds to the separable (hard-margin) case.
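In practice the margin/error tradeoff is explored by sweeping C in an off-the-shelf solver. A minimal sketch with scikit-learn's SVC (assuming scikit-learn is installed; the overlapping two-blob data below is synthetic):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not perfectly separable.
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.1, 1, 10, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>5}: {clf.n_support_.sum():>3} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```

Small C tolerates margin violations (wider margin, typically more support vectors); large C penalizes them heavily, approaching hard-margin behaviour.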

C Change: C = 10 vs. C = 1 (figures).

C Change: C = 100 vs. C = 1 (figures).

Example: Linearly Separable. Image by Patrick Winston.

Another example: Not linearly separable. Image by Patrick Winston.

Isn't a linear classifier very limiting? (Figure: one class forms a ring around the other at radius about $R$; the data are not linearly separable in the original features, but become linearly separable using the squared values of the features.) Important: a linear separator in the transformed feature space maps into a non-linear separator in the original feature space.

Not separable? Try a higher dimensional space! Not separable with a 2D line; separable with a 3D plane.

What you need: To get into the new feature space, you use $\phi(x)$. The transformation can be to a higher-dimensional feature space and may be non-linear in the feature values. Recall that SVMs only use dot products of the data, so: to optimize the classifier, you need $\phi(x_i) \cdot \phi(x_k)$; to run the classifier, you need $\phi(x_i) \cdot \phi(u)$. So, all you need is a way to compute dot products in the transformed space as a function of vectors in the original space!

The Kernel Trick: If dot products in the transformed space can be efficiently computed by a function on the low-dimensional inputs, $\phi(x_i) \cdot \phi(x_k) = K(x_i, x_k)$, then all you need is $K(x_i, x_k)$; you never need to construct the high-dimensional $\phi(x_i)$.

Standard Choices For Kernels: No change (linear kernel): $K(x_i, x_k) = \phi(x_i) \cdot \phi(x_k) = x_i \cdot x_k$. Polynomial kernel ($n$-th order): $K(x_i, x_k) = (1 + x_i \cdot x_k)^n$.

Polynomial Kernel Example (one feature): data points at 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 are not separable on the line. With the mapping $\phi(x) = (x, x^2, 1)$ they become separable (figure: the points plotted against $x$ and $x^2$), and $\phi(x) \cdot \phi(z) = x z + x^2 z^2 + 1$.

Polynomial Kernel: The polynomial kernel for $n = 2$ and features $x = [x_1\; x_2]$, $K(x, z) = (1 + x \cdot z)^2$, is equivalent to the following feature mapping: $\phi(x) = [\,x_1^2 \;\; x_2^2 \;\; \sqrt{2}\,x_1 x_2 \;\; \sqrt{2}\,x_1 \;\; \sqrt{2}\,x_2 \;\; 1\,]$. We can verify that: $\phi(x) \cdot \phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1 = (1 + x_1 z_1 + x_2 z_2)^2 = (1 + x \cdot z)^2 = K(x, z)$.
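A quick numeric check of this identity (the √2 factors come from expanding the square; the test vectors below are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map whose dot product equals (1 + x . z)^2 for 2 features.
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

def K(x, z):
    return (1.0 + x @ z) ** 2

x = np.array([0.3, -1.2])
z = np.array([2.0, 1.0])
print(phi(x) @ phi(z))  # 0.16
print(K(x, z))          # 0.16, the same value without ever forming phi
```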

Polynomial Kernel (figures). Images by Patrick Winston.

Standard Choices For Kernels: No change (linear kernel): $K(x_i, x_k) = \phi(x_i) \cdot \phi(x_k) = x_i \cdot x_k$. Polynomial kernel ($n$-th order): $K(x_i, x_k) = (1 + x_i \cdot x_k)^n$. Radial basis kernel ($\sigma$ is the standard deviation): $K(x_i, x_k) = \phi(x_i) \cdot \phi(x_k) = e^{-\|x_i - x_k\|^2 / (2\sigma^2)}$.

Radial-basis kernel: a classifier based on a sum of Gaussian bumps with standard deviation $\sigma$, centered on the support vectors: $h(u) = \operatorname{sign}[h'(u)]$, where $h'(u) = \sum_{i=1}^{k} \alpha_i y_i K(x_i, u) + b$ and $K(x_i, u) = e^{-\frac{1}{2\sigma^2}\|x_i - u\|^2}$.
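A small sketch of this decision function, taking the support vectors, their $\alpha_i y_i$ weights, $b$ and $\sigma$ as given. The weights, $b$ and $\sigma$ echo the example on the following slides; the support-vector locations are guesses, since they appear only in the figure:

```python
import numpy as np

def rbf_kernel(x, u, sigma):
    # Gaussian bump centered at x with standard deviation sigma.
    return np.exp(-np.sum((x - u) ** 2) / (2.0 * sigma ** 2))

def rbf_classify(u, support_vectors, alpha_y, b, sigma):
    # h(u) = sign( sum_i alpha_i y_i K(x_i, u) + b )
    s = sum(ay * rbf_kernel(x, u, sigma) for x, ay in zip(support_vectors, alpha_y))
    return np.sign(s + b)

# Guessed 1-D support-vector locations; alpha_y, b, sigma echo the slide's example.
support_vectors = [np.array([0.2]), np.array([0.3]), np.array([0.5]), np.array([0.6])]
alpha_y = [1.76, -1.76, 1.76, -1.76]
print(rbf_classify(np.array([0.25]), support_vectors, alpha_y, b=0.55, sigma=0.2))
```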

Radial-basis kernel, $\sigma = 0.1$ (figure: 1-D data points at 0.1 through 0.6).

Radial-basis kernel, $\sigma = 0.2$, $b = 0.55$, with support-vector weights $y_1\alpha_1 = 1.76$, $y_2\alpha_2 = -1.76$, $y_3\alpha_3 = 1.76$, $y_4\alpha_4 = -1.76$ (figure: the resulting decision function over the 1-D data, with the support vectors marked). $h'(u) = \sum_{i=1}^{4} \alpha_i y_i K(x_i, u) + b$, where $K(x_i, u) = e^{-\frac{1}{2\sigma^2}\|x_i - u\|^2}$.

Radial-basis kernel (large σ). Images by Patrick Winston.

Another radial-basis example (small σ). Image by Patrick Winston.

Cross-Validation Error: Does mapping to a very high-dimensional space lead to over-fitting? Generally, no, thanks to the fact that only the support vectors determine the decision surface. The expected leave-one-out cross-validation error depends on the number of support vectors, not on the dimensionality of the feature space: Expected CV error $\le$ (Expected # support vectors) / (# training samples). If most data points are support vectors, that is a sign of possible overfitting, independent of the dimensionality of the feature space.
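The bound is easy to monitor in practice: the fraction of training points that become support vectors upper-bounds the expected leave-one-out error. A hedged sketch with scikit-learn on synthetic data (assuming scikit-learn is available):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)), rng.normal(1.0, 1.0, (100, 5))])
y = np.array([-1] * 100 + [1] * 100)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
n_sv = int(clf.n_support_.sum())
print(f"{n_sv} support vectors out of {len(y)} training points")
print(f"expected leave-one-out error is at most {n_sv / len(y):.2f}")
```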

Summary: A single global maximum, found by quadratic programming or gradient ascent. Few parameters: C and the kernel parameters (n for the polynomial kernel, σ for the radial basis kernel). Kernels: the quadratic program depends only on dot products of the sample vectors; recognition depends only on dot products of the unknown vector with the sample vectors; reliance on dot products alone enables efficient feature mapping to higher-dimensional spaces where linear separation is more effective.

Real Data: Wisconsin Breast Cancer Data, 9 features, C = 1. 37 support vectors are used from 512 training data points; 12 prediction errors on the training set (98% accuracy); 96% accuracy on 171 held-out points. Essentially the same performance as nearest neighbors and decision trees. Don't expect such good performance on every data set.

Success Stories: Gene microarray data: outperformed all other classifiers, using a specially designed kernel. Text categorization: linear kernel in a >10,000-dimensional input space; best prediction performance; 35 times faster to train than the next best classifier (decision trees). Many others: http://www.clopinet.com/isabelle/projects/svm/applist.html