UVA CS 6316, Fall 2015 Graduate: Machine Learning. Lecture 8: Supervised Classification with Support Vector Machine


UVA CS 6316, Fall 2015 Graduate: Machine Learning
Lecture 8: Supervised Classification with Support Vector Machine
9/30/15
Dr. Yanjun Qi, University of Virginia, Department of Computer Science

Where we are? Five major sections of this course:
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models

Today
- Supervised Classification
- Support Vector Machine (SVM)

Supervised learning
Find a function f that maps the input space X to the output space Y so that the difference between y and f(x) is small for each example x.
- Input X: e.g. a piece of English text, such as x = "I believe that this book is not at all helpful since it does not explain thoroughly the material. It just provides the reader with tables and calculations that sometimes are not easily understood."
- Output Y: {1 / Yes, -1 / No}, e.g. is this a positive product review? For the text above, y = -1.


A dataset for classification
- Output class: a categorical variable.
- Data / points / instances / examples / samples / records: the rows.
- Features / attributes / dimensions / independent variables / covariates / predictors / regressors: the columns, except the last.
- Target / outcome / response / label / dependent variable: the special column to be predicted (the last column).

Supervised linear binary classifier
The classifier is $f(x, w, b) = \mathrm{sign}(w^T x + b)$. Points with $w^T x + b > 0$ are predicted as +1 and points with $w^T x + b < 0$ as -1.
[Figure: +1 points, -1 points, and future points on either side of a linear boundary.]
Courtesy: slide from Prof. Andrew Moore's tutorial.
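The rule sign(w^T x + b) can be sketched in a few lines of NumPy. This is only an illustration: the weights, bias, and points below are made-up values, and sending a score of exactly zero to +1 is an arbitrary tie-break that the slides do not specify.

```python
import numpy as np

def linear_classify(X, w, b):
    """Predict +1 or -1 for each row x of X using sign(w^T x + b)."""
    scores = X @ w + b
    # A score of exactly 0 lies on the boundary; sending it to +1 is an
    # assumption made here for illustration only.
    return np.where(scores >= 0, 1, -1)

w = np.array([1.0, -1.0])    # hypothetical weight vector
b = 0.5                      # hypothetical bias
X = np.array([[2.0, 0.0],    # score  2.5
              [0.0, 2.0],    # score -1.5
              [1.0, 1.0]])   # score  0.5
predictions = linear_classify(X, w, b)
print(predictions)
```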

Application 1: Classifying Galaxies
Courtesy: http://aps.umn.
- Class: stage of formation (Early, Intermediate, Late)
- Attributes: image features, characteristics of light waves received, etc.
- Data size: 72 million stars, 20 million galaxies; object catalog 9 GB; image database 150 GB
From [Berry & Linoff], Data Mining Techniques, 1997.

Application 2: Cancer Classification using gene expression
The data matrix X holds p gene quantities in blood cells for each of N patient blood samples.

Application 3: Text Documents, e.g. Google News

Text document representation
Each document becomes a "term" vector. Each term is an attribute of the vector, and its value is the number of times the corresponding term occurs in the document.
[Table: term counts for three documents over the terms team, coach, play, ball, score, game, win, lost, timeout, season. The count values were not recovered.]
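A minimal sketch of the term-vector idea follows. The vocabulary reuses the terms from the slide's table, while the example sentence and the whitespace tokenizer are assumptions chosen for illustration. Real systems usually add stemming, stop-word removal, or weighting.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Count how many times each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer
    return [counts[term] for term in vocabulary]

vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]
# Hypothetical document, not taken from the lecture.
doc = "The coach said the team lost the game but the team will play next season"
vector = term_vector(doc, vocabulary)
print(dict(zip(vocabulary, vector)))
```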

Text Categorization
- Pre-given categories and labeled document examples (categories may form a hierarchy)
- Classify new documents
- A standard supervised learning problem: the categorization system computes ŷ = f(x)

Examples of Text Categorization
- News article classification
- Meta-data annotation
- Automatic sorting
- Web page classification

Application 4: Objective recognition / Image Labeling (label images into predefined classes)

Image representation for objective recognition
Image representation: bag of visual words.

Application 5: Audio Classification
Real-life applications:
- Customer service phone routing
- Voice recognition software

Music Information Retrieval Systems, e.g., automatic music classification
Many areas of research in music information retrieval (MIR) involve using computers to classify music in various ways:
- Genre or style classification
- Mood classification
- Performer or composer identification
- Music recommendation
- Playlist generation
- Hit prediction
- Audio to symbolic transcription
- etc.
Such areas often share similar central procedures:
- Musical data collection: the instances (basic entities) to classify, such as audio recordings, scores, cultural data, etc.
- Feature extraction: features represent characteristic information about instances. They must provide sufficient information to segment instances among classes (categories).
- Machine learning: algorithms ("classifiers" or "learners") learn to associate feature patterns of instances with their classes.
[Diagram, Basic Classification Tasks. Boxes: Music, Musical Data Collection, Feature Extraction, Machine Learning, Classifications, Metadata, Metadata Analysis, Classifier Training.]


Audio: types of features
- Low-level: associated with signal processing and basic auditory perception, e.g. spectral flux or RMS. Usually not intuitively musical.
- High-level: musical abstractions, e.g. meter or pitch class distributions.
- Cultural: sociocultural information outside the scope of auditory or musical content, e.g. playlist co-occurrence or purchase correlations.
[Diagram: Feature Extraction produces Low-Level Features, High-Level Features, and Cultural Features.]

Where we are? Three major sections for classification
We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary, e.g., support vector machine, decision tree
2. Generative: build a generative statistical model, e.g., Bayesian networks
3. Instance-based classifiers: use observations directly (no models), e.g., K nearest neighbors

A study comparing classifiers
Source: ICML '06, Proceedings of the 23rd International Conference on Machine Learning.
The study covers 11 binary classification problems.

A study comparing classifiers: 11 binary classification problems / 8 metrics

Today
- Supervised Classification
- Support Vector Machine (SVM)
  - History of SVM
  - Large Margin Linear Classifier
  - Define Margin (M) in terms of model parameter
  - Optimization to learn model parameters (w, b)
  - Non linearly separable case
  - Optimization with dual form
  - Nonlinear decision boundary
  - Multiclass SVM


History of SVM (young, theoretically sound, impactful)
- SVM is inspired from statistical learning theory [3].
- SVM was first introduced in 1992 [1].
- SVM became popular because of its success in handwritten digit recognition (1994): a 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4. See Section 5.11 in [2] or the discussion in [3] for details.
- SVM is now regarded as an important example of "kernel methods", arguably the hottest area in machine learning 15 years ago.

[1] B.E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer.

Applications of SVMs
- Computer Vision
- Text Categorization
- Ranking (e.g., Google searches)
- Handwritten Character Recognition
- Time series analysis
- Bioinformatics
Lots of very successful applications!


Handwritten digit recognition
In the 90s, SVM achieves the ...

A dataset for binary classification
- Output as binary class: only two possibilities.
- Data / points / instances / examples / samples / records: the rows.
- Features / attributes / dimensions / independent variables / covariates / predictors / regressors: the columns, except the last.
- Target / outcome / response / label / dependent variable: the special column to be predicted (the last column).

Linear Classifiers
[Figure: a scatter of +1 and -1 points; the classifier f maps an input x to an estimate y_est.]
How would you classify this data?


Linear Classifiers: any of these candidate boundaries would be fine, but which is best?

Classifier Margin
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data point.

Maximum Margin
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
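To make the margin definition concrete, here is a rough sketch that measures the margin of a given linear boundary. It assumes the band is widened symmetrically around w^T x + b = 0, so the width is twice the distance to the nearest point. The data and the two candidate boundaries are made up for illustration.

```python
import numpy as np

def margin_width(X, w, b):
    """Width of the widest band centred on w^T x + b = 0 that touches no point."""
    # Distance from each point to the boundary is |w^T x + b| / ||w||.
    distances = np.abs(X @ w + b) / np.linalg.norm(w)
    return 2.0 * distances.min()

# Hypothetical, linearly separable toy data.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
w = np.array([1.0, 0.0])

width_a = margin_width(X, w, 0.0)    # boundary at x1 = 0
width_b = margin_width(X, w, -0.5)   # boundary at x1 = 0.5
print(width_a, width_b)
```

Both boundaries separate the toy data, but the second sits midway between the closest points of the two groups and so has the larger margin.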

Maximum Margin: support vectors
Support vectors are those data points that the margin pushes up against. The classifier is still $f(x, w, b) = \mathrm{sign}(w^T x + b)$, and the maximum margin linear classifier (linear SVM) is the one whose margin is largest.
[Figure: +1 and -1 points in the (x1, x2) plane with the support vectors lying on the margin.]

Max margin classifiers
- Instead of fitting all points, focus on boundary points.
- Learn a boundary that leads to the largest margin from the points on both sides. Of all the possible boundary lines, this one leads to the largest margin on both sides.
Why?
- Intuitive, makes sense
- Some theoretical support
- Works well in practice


Max margin classifiers are also known as linear support vector machines (SVMs). The boundary points are the vectors supporting the boundary.

Max-margin & Decision Boundary
The decision boundary should be as far away from the data of both classes as possible. Here w is a p-dimensional vector and b is a scalar.
[Figure: Class 1 and Class -1 on either side of the decision boundary.]

Specifying a max margin classifier
- Class +1 plane: $w^T x + b = +1$
- Boundary: $w^T x + b = 0$
- Class -1 plane: $w^T x + b = -1$
Decision rule:
- Classify as +1 if $w^T x + b \geq 1$
- Classify as -1 if $w^T x + b \leq -1$
- Undefined if $-1 < w^T x + b < 1$
Is the linear separation assumption realistic? We will deal with this shortly, but let's assume it for now.
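The three-way decision rule can be written down directly. In this sketch the value 0 stands for "undefined" (a point strictly inside the margin). That encoding, along with the weights and points, is an assumption made for illustration.

```python
import numpy as np

def max_margin_predict(X, w, b):
    """Return +1, -1, or 0 (undefined: strictly inside the margin) per row of X."""
    scores = X @ w + b
    labels = np.zeros(len(scores), dtype=int)  # default: undefined
    labels[scores >= 1] = 1                    # on or beyond the +1 plane
    labels[scores <= -1] = -1                  # on or beyond the -1 plane
    return labels

w = np.array([1.0, 1.0])     # hypothetical weights
b = -1.0                     # hypothetical bias
X = np.array([[2.0, 1.0],    # score  2.0
              [0.0, -1.0],   # score -2.0
              [0.5, 1.0]])   # score  0.5, inside the margin
labels = max_margin_predict(X, w, b)
print(labels)
```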

Maximizing the margin
Let's define the width of the margin by M. How can we encode our goal of maximizing M in terms of our parameters (w and b)? Let's start with a few observations.

Maximizing the margin: observation 1
Observation 1: the vector w is orthogonal to the +1 plane.
Why? Let u and v be two points on the +1 plane, so that $w^T u + b = 1$ and $w^T v + b = 1$. Subtracting the two equations gives $w^T (u - v) = 0$ for the vector $u - v$, which lies in the plane.
Corollary: the vector w is orthogonal to the -1 plane.

Observation 1, review: vector subtraction
[Figure: vector subtraction.]

Observation 1, review: vector product, orthogonal, and norm
- For two vectors x and y, $x^T y$ is called the (inner) vector product.
- x and y are called orthogonal if $x^T y = 0$.
- The square root of the product of a vector with itself, $\sqrt{x^T x}$, is called the 2-norm ($\|x\|_2$), which can also be written as $\|x\|$.
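Observation 1 and the review definitions can be checked numerically. The sketch below uses an arbitrary w and b (assumed values, not from the lecture), picks two points on the +1 plane, and confirms that w is orthogonal to their difference.

```python
import numpy as np

w = np.array([3.0, 4.0])    # hypothetical weight vector
b = -4.0                    # hypothetical bias

# Two points chosen to lie on the +1 plane, i.e. 3*x1 + 4*x2 - 4 = 1.
u = np.array([1.0, 0.5])
v = np.array([-1.0, 2.0])

on_plane_u = w @ u + b      # 1.0
on_plane_v = w @ v + b      # 1.0
inner = w @ (u - v)         # inner product of w with a vector lying in the plane
norm_w = np.sqrt(w @ w)     # 2-norm of w, the same value as np.linalg.norm(w)

print(on_plane_u, on_plane_v, inner, norm_w)
```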


Maximizing the margin: observation 2
Observation 1: the vector w is orthogonal to the +1 and -1 planes.
Observation 2: if $x^+$ is a point on the +1 plane and $x^-$ is the closest point to $x^+$ on the -1 plane, then $x^+ = \lambda w + x^-$. Since w is orthogonal to both planes, we need to travel some distance along w to get from $x^+$ to $x^-$.

Putting it together
We can now define M in terms of w and b, starting from four facts:
- $w^T x^+ + b = +1$
- $w^T x^- + b = -1$
- $x^+ = \lambda w + x^-$
- $\|x^+ - x^-\| = M$

Putting it together: solving for $\lambda$
$w^T x^+ + b = +1$
$\Rightarrow w^T (\lambda w + x^-) + b = +1$
$\Rightarrow w^T x^- + b + \lambda w^T w = +1$
$\Rightarrow -1 + \lambda w^T w = +1$
$\Rightarrow \lambda = \frac{2}{w^T w}$

Putting it together: solving for M
$M = \|x^+ - x^-\| = \|\lambda w\| = \lambda \|w\| = \lambda \sqrt{w^T w} = \frac{2 \sqrt{w^T w}}{w^T w} = \frac{2}{\sqrt{w^T w}}$
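The derivation can be sanity-checked with numbers. The sketch below assumes an arbitrary w and b, starts from a point on the -1 plane, and steps along w by λ = 2 / (w^T w). It then confirms that the result lands on the +1 plane and that the step length equals 2 / sqrt(w^T w).

```python
import numpy as np

w = np.array([3.0, 4.0])          # hypothetical weight vector
b = -4.0                          # hypothetical bias

x_minus = np.array([1.0, 0.0])    # chosen so that w^T x + b = -1
lam = 2.0 / (w @ w)               # lambda from the derivation
x_plus = x_minus + lam * w        # closest point on the +1 plane

M = np.linalg.norm(x_plus - x_minus)   # margin width measured directly
M_formula = 2.0 / np.sqrt(w @ w)       # margin width from the closed form

print(w @ x_plus + b, M, M_formula)
```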


Finding the optimal parameters
With $M = \frac{2}{\sqrt{w^T w}}$, we can now search for the optimal parameters by finding a solution that:
1. Correctly classifies all points
2. Maximizes the margin (or equivalently minimizes $w^T w$)
Several optimization methods can be used: gradient descent, simulated annealing, EM, etc.

Optimization step, i.e. learning the optimal parameters for SVM
With $M = \frac{2}{\sqrt{w^T w}}$, minimize $\frac{w^T w}{2}$ subject to the following constraints:
- For all x in class +1: $w^T x + b \geq 1$
- For all x in class -1: $w^T x + b \leq -1$
This gives a total of n constraints if we have n input samples.

Support Vector Machine (summary)
- Task: classification
- Representation: kernel function $K(x_i, x_j)$, with $K(x, z) := \Phi(x)^T \Phi(z)$
- Score function: margin + hinge loss (optional)
- Search/optimization: QP with dual form
- Models, parameters: dual weights, $w = \sum_i \alpha_i y_i x_i$

Primal form:
$\arg\min_{w,b} \sum_{i=1}^{p} w_i^2 + C \sum_{i=1}^{n} \varepsilon_i$ subject to $\forall x_i \in D_{train}: y_i (x_i \cdot w + b) \geq 1 - \varepsilon_i$

Dual form:
$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$ subject to $\sum_i \alpha_i y_i = 0$, $\alpha_i \geq 0 \ \forall i$
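As a small worked check of the hard-margin problem and its dual, the sketch below uses a two-point toy dataset (an assumption, not from the lecture) whose solution can be found by hand: alpha = (0.25, 0.25), w = (0.5, 0.5), b = 0. It does not run a QP solver. It only verifies feasibility, rebuilds w from the dual weights, and compares the primal objective (w^T w)/2 with the dual objective. All helper names are hypothetical.

```python
import numpy as np

# Toy data: one point per class (hypothetical).
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

def is_feasible(w, b):
    """Check the n hard-margin constraints y_i (w^T x_i + b) >= 1."""
    return bool(np.all(y * (X @ w + b) >= 1.0 - 1e-12))

def primal_objective(w):
    """(w^T w) / 2, the quantity minimized in the optimization step."""
    return 0.5 * (w @ w)

def dual_objective(alpha):
    """sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j."""
    Z = y[:, None] * X            # row i is y_i * x_i
    G = Z @ Z.T                   # G[i, j] = y_i y_j x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ G @ alpha

alpha = np.array([0.25, 0.25])    # hand-derived dual solution for this toy data
w = (alpha * y) @ X               # w = sum_i alpha_i y_i x_i
b = 0.0

print(w, is_feasible(w, b), primal_objective(w), dual_objective(alpha))
```

Here both objectives come out equal and both points sit exactly on their margin planes, so both are support vectors. A feasible but larger w such as (1, 1) gives a larger primal objective and therefore a narrower margin.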

References
- ... for allowing me to reuse some of his slides
- Elements of Statistical Learning, by Hastie, Tibshirani and Friedman
- ...'s slides
