UVA CS / Introduction to Machine Learning and Data Mining. Lecture 10: Classification with Support Vector Machine (cont.)


UVA CS 4501-001 / 6501-007 Introduction to Machine Learning and Data Mining
Lecture 10: Classification with Support Vector Machine (cont.)
Yanjun Qi / Jane, University of Virginia, Department of Computer Science

Where we are? Five major sections of this course
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models

Where we are? Three major sections for classification
We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary, e.g., support vector machine, decision tree.
2. Generative: build a generative statistical model, e.g., Bayesian networks.
3. Instance-based classifiers: use observations directly (no models), e.g., K nearest neighbors.

Today / Last Lecture
- Support Vector Machine (SVM)
  - History of SVM
  - Large-margin linear classifier
  - Define margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non-linearly separable case
  - Optimization with dual form
  - Nonlinear decision boundary
  - Multiclass SVM

Today review
- Support Vector Machine (SVM)
  - History of SVM
  - Large-margin linear classifier
  - Define margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non-linearly separable case
  - Optimization with dual form
  - Nonlinear decision boundary
  - Multiclass SVM

History of SVM
- Young / theoretically sound: SVM is inspired by statistical learning theory [3]
- Impactful: SVM was first introduced in 1992 [1]
- SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate, the same as the error rate of a carefully constructed neural network, LeNet 4. See Section 5.11 in [2] or the discussion in [3] for details.
- SVM is now regarded as an important example of kernel methods, arguably the hottest area in machine learning 10 years ago.

[1] B. E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82, 1994.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.

Applications of SVMs
- Computer vision
- Text categorization
- Ranking (e.g., Google searches)
- Handwritten character recognition
- Time series analysis
- Bioinformatics
- ... Lots of very successful applications!

Handwritten digit recognition (1999, SVM)

Today review
- Support Vector Machine (SVM)
  - History of SVM
  - Large-margin linear classifier
  - Define margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non-linearly separable case
  - Optimization with dual form
  - Nonlinear decision boundary

A dataset for binary classification
- Output as binary class label: 1 or -1
- Data / points / instances / examples / samples / records: the rows
- Features / attributes / dimensions / independent variables / covariates / predictors / regressors: the columns, except the last
- Target / outcome / response / label / dependent variable: the special column to be predicted (the last column)

Max-margin classifiers
- Instead of fitting all points, focus on the boundary points
- Learn a boundary that leads to the largest margin from points on both sides
- Why? It is intuitive and makes sense, it has some theoretical support, and it works well in practice

Max margin & decision boundary
- The decision boundary should be as far away from the data of both classes (class 1 and class -1) as possible
- w is a p-dimensional vector; b is a scalar

Today review
- Support Vector Machine (SVM)
  - History of SVM
  - Large-margin linear classifier
  - Define margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non-linearly separable case
  - Optimization with dual form
  - Nonlinear decision boundary
  - Multiclass SVM

Maximizing the margin: observation 1
Observation 1: the vector w is orthogonal to the +1 plane.

Maximizing the margin: observation 2
Planes: w^T x + b = +1 (predict class +1), w^T x + b = 0 (decision boundary), w^T x + b = -1 (predict class -1).
- Classify as +1 if w^T x + b >= 1
- Classify as -1 if w^T x + b <= -1
- Undefined if -1 < w^T x + b < 1
Observation 1: the vector w is orthogonal to the +1 and -1 planes.
Observation 2: if x+ is a point on the +1 plane and x- is the closest point to x+ on the -1 plane, then x+ = λw + x-. Since w is orthogonal to both planes, we need to travel some distance along w to get from x+ to x-.

Putting it together
w^T x+ + b = +1
w^T x- + b = -1
x+ = λw + x-
|x+ - x-| = M
We can now define M in terms of w and b:
w^T x+ + b = +1
w^T (λw + x-) + b = +1
w^T x- + b + λ w^T w = +1
-1 + λ w^T w = +1
λ = 2 / (w^T w)

Putting it together
w^T x+ + b = +1
w^T x- + b = -1
x+ = λw + x-
|x+ - x-| = M
λ = 2 / (w^T w)
We can now define M in terms of w and b:
M = |x+ - x-| = |λw| = λ|w| = λ sqrt(w^T w) = 2 sqrt(w^T w) / (w^T w) = 2 / sqrt(w^T w)

Today review
- Support Vector Machine (SVM)
  - History of SVM
  - Large-margin linear classifier
  - Define margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non-linearly separable case
  - Optimization with dual form
  - Nonlinear decision boundary
  - Multiclass SVM

Optimization step, i.e., learning the optimal parameters for SVM
Planes: w^T x + b = +1, w^T x + b = 0, w^T x + b = -1; margin M = 2 / sqrt(w^T w).
Min (w^T w)/2 subject to the following constraints:
- For all x_i in class +1: w^T x_i + b >= 1
- For all x_i in class -1: w^T x_i + b <= -1
(a total of n constraints if we have n input samples)
Equivalently: argmin_{w,b} w^T w (the sum of squared weights over the p dimensions), subject to: for all x_i in D_train, y_i (x_i^T w + b) >= 1.

SVM as a QP problem
A quadratic program has the general form: min_u (1/2) u^T R u + d^T u + c, subject to n inequality constraints (a_11 u_1 + a_12 u_2 + ... <= b_1, ..., a_n1 u_1 + a_n2 u_2 + ... <= b_n) and k equality constraints (a_{n+1,1} u_1 + a_{n+1,2} u_2 + ... = b_{n+1}, ..., a_{n+k,1} u_1 + a_{n+k,2} u_2 + ... = b_{n+k}).
The SVM problem fits this form with R as the identity matrix, d as the zero vector, and c = 0:
Min (w^T w)/2 subject to the following inequality constraints:
- For all x_i in class +1: w^T x_i + b >= 1
- For all x_i in class -1: w^T x_i + b <= -1
(a total of n constraints if we have n input samples)
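
A minimal sketch (not from the slides) of what solving this QP gives you in practice: scikit-learn's SVC with a very large C approximates the hard-margin problem above, and from the fitted w and b we can read off the margin M = 2 / sqrt(w^T w). The dataset and the value C = 1e6 are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds, labels in {-1, +1} (made-up data)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [3.0, 3.0], rng.randn(20, 2) - [3.0, 3.0]])
y = np.hstack([np.ones(20), -np.ones(20)])

# Very large C approximates the hard-margin QP (essentially no slack allowed)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_.ravel()          # learned weight vector
b = clf.intercept_[0]          # learned bias
margin = 2.0 / np.sqrt(w @ w)  # M = 2 / sqrt(w^T w), as derived above
print("w =", w, "b =", b, "margin =", margin)

# Sanity check: every training point satisfies y_i (w^T x_i + b) >= 1 (up to tolerance)
print(np.all(y * (X @ w + b) >= 1 - 1e-6))
```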

Today review
- Support Vector Machine (SVM)
  - History of SVM
  - Large-margin linear classifier
  - Define margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non-linearly separable case
  - Optimization with dual form
  - Nonlinear decision boundary
  - Multiclass SVM

Non-linearly separable case
Instead of minimizing the number of misclassified points, we can minimize the distance between these points and their correct plane (the +1 plane or the -1 plane), using slack variables ε_i.
The new optimization problem is:
min_w (w^T w)/2 + C * sum_{i=1..n} ε_i
subject to the following inequality constraints:
- For all x_i in class +1: w^T x_i + b >= 1 - ε_i
- For all x_i in class -1: w^T x_i + b <= -1 + ε_i
(a total of n constraints)
- For all i: ε_i >= 0 (another n constraints)
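
An illustrative sketch (my own, not part of the slides) of how the penalty C trades margin width against slack: a small C tolerates margin violations and keeps the margin wide, a large C penalizes them heavily. At the optimum the slack for each point equals max(0, 1 - y_i (w^T x_i + b)), which the code uses to total up the violations. The overlapping dataset is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(50, 2) + [1.5, 1.5], rng.randn(50, 2) - [1.5, 1.5]])
y = np.hstack([np.ones(50), -np.ones(50)])   # overlapping classes

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_.ravel(), clf.intercept_[0]
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # epsilon_i at the optimum
    print(f"C={C:>7}: margin={2 / np.linalg.norm(w):.3f}, "
          f"total slack={slack.sum():.2f}, #support vectors={len(clf.support_)}")
```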

Where we are
Two optimization problems, for the separable and non-separable cases:
- Separable: min_w (w^T w)/2, subject to: for all x_i in class +1, w^T x_i + b >= 1; for all x_i in class -1, w^T x_i + b <= -1.
- Non-separable: min_w (w^T w)/2 + C * sum_{i=1..n} ε_i, subject to: for all x_i in class +1, w^T x_i + b >= 1 - ε_i; for all x_i in class -1, w^T x_i + b <= -1 + ε_i; for all i, ε_i >= 0.

Today
- Support Vector Machine (SVM)
  - History of SVM
  - Large-margin linear classifier
  - Define margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non-linearly separable case
  - Optimization with dual form
  - Nonlinear decision boundary
  - Multiclass SVM

Where we are
Two optimization problems, for the separable and non-separable cases (as above).
Instead of solving these QPs directly we will solve a dual formulation of the SVM optimization problem. The main reason for switching to this type of representation is that it allows us to use a neat trick that will make our lives easier (and the run time faster).

Optimization review: constrained optimization
Toy example: min_u u^2 subject to u >= b. Two cases are illustrated on the slide: if the unconstrained global minimum u = 0 already lies in the allowed region (b <= 0), the constraint is inactive and the global minimum is also the constrained minimum; otherwise (b > 0) the constrained minimum sits on the boundary u = b.

Optimization review: constrained optimization with Lagrange multipliers
With equality constraints, i.e., optimize f(x) subject to g_i(x) = 0, the method of Lagrange multipliers converts the problem to a higher-dimensional one: minimize f(x) + sum_i λ_i g_i(x) with respect to (x_1, ..., x_n; λ_1, ..., λ_k), introducing a Lagrange multiplier λ_i for each constraint. This construction is the Lagrangian of the original optimization problem.

Optimization review: dual problem
Using the dual problem turns a constrained optimization into an unconstrained one (and swaps maximization for minimization). This is only fully valid when the original optimization problem is convex/concave (strong duality), in which case the primal and dual optima coincide.
Primal problem: x* = argmax_x f(x) subject to g(x) = c.
Dual problem: λ* = argmin_λ l(λ), where l(λ) = sup_x ( f(x) + λ(g(x) - c) ).

An alternative (dual) representation of the SVM QP
We will start with the linearly separable case. Instead of encoding the correct classification rule as a constraint, we will use Lagrange multipliers to encode it as part of our minimization problem.
Original problem: Min (w^T w)/2 subject to: for all x_i in class +1, w^T x_i + b >= 1; for all x_i in class -1, w^T x_i + b <= -1. Why? The two constraints can be written as one: (w^T x_i + b) y_i >= 1.

Recall that Lagrange multipliers can be applied to turn the following problem:
min_x x^2 s.t. x >= b (i.e., b - x <= 0)
into:
min_x max_α x^2 - α(x - b) s.t. α >= 0
The SVM problem Min (w^T w)/2 s.t. (w^T x_i + b) y_i >= 1 can be handled in the same way.
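
A worked version of the toy problem above (my own derivation, following the slide's setup) makes the Lagrangian/dual machinery concrete before applying it to the SVM:

```latex
% Toy problem: minimize x^2 subject to x >= b
\begin{aligned}
L(x,\alpha) &= x^2 - \alpha\,(x-b), \qquad \alpha \ge 0\\
\frac{\partial L}{\partial x} = 2x - \alpha = 0
  &\;\Rightarrow\; x = \tfrac{\alpha}{2}\\
l(\alpha) = \min_x L(x,\alpha)
  &= -\tfrac{\alpha^2}{4} + \alpha b\\
\max_{\alpha \ge 0} l(\alpha)
  &\;\Rightarrow\; \alpha^* = \max(0,\, 2b)\\
\text{so } x^* = \tfrac{\alpha^*}{2} &=
  \begin{cases} b & b > 0 \ \text{(constraint active)}\\
                0 & b \le 0 \ \text{(constraint inactive)}\end{cases}
\end{aligned}
```

The multiplier α is nonzero only when the constraint is active, which is exactly the pattern the support vectors will exhibit below.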

Lagrange multipliers for SVMs
Original formulation: Min (w^T w)/2 s.t. (w^T x_i + b) y_i >= 1.
Dual formulation: min_{w,b} max_{α >= 0} (w^T w)/2 - sum_i α_i [ (w^T x_i + b) y_i - 1 ]
Using this new formulation we can derive w and b by setting the partial derivatives w.r.t. w to 0, leading to:
w = sum_i α_i y_i x_i
b = y_k - w^T x_k for any k such that α_k > 0
Finally, taking the derivative w.r.t. b we get: sum_i α_i y_i = 0

Dual SVM: interpretation
w = sum_i α_i y_i x_i
Samples whose α_i = 0 have no influence on w; only the support vectors (those with α_i > 0) contribute.
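
A quick check of w = sum_i α_i y_i x_i (a sketch of mine, not from the slides): in scikit-learn's SVC, dual_coef_ stores α_i y_i for the support vectors only, so multiplying it into the support vectors should reproduce coef_, the primal weight vector. The dataset is an illustrative assumption.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(2)
X = np.vstack([rng.randn(30, 2) + 2, rng.randn(30, 2) - 2])
y = np.hstack([np.ones(30), -np.ones(30)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ has shape (1, n_SV) and equals alpha_i * y_i for each support vector;
# points with alpha_i = 0 do not appear here and have no influence on w.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))   # True: both routes give the same w
```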

A geometrical interpretation
On the slide, only a few α_i are nonzero (the support vectors on the margin): α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, and α_2 = α_3 = α_4 = α_5 = α_7 = α_9 = α_10 = 0.

Dual SVM for the linearly separable case
Substituting w into our target function and using the additional constraint, we get the dual formulation:
max_α sum_i α_i - (1/2) sum_{i,j} α_i α_j y_i y_j x_i^T x_j
subject to: sum_i α_i y_i = 0 and α_i >= 0
(derived from min_{w,b} max_{α >= 0} (w^T w)/2 - sum_i α_i [ (w^T x_i + b) y_i - 1 ], with w = sum_i α_i y_i x_i, b = y_k - w^T x_k for k s.t. α_k > 0, and sum_i α_i y_i = 0)

Dual SVM for the linearly separable case
Our dual target function: max_α sum_i α_i - (1/2) sum_{i,j} α_i α_j y_i y_j x_i^T x_j, subject to sum_i α_i y_i = 0 and α_i >= 0.
Solving it requires the dot products between all pairs of training samples. To evaluate a new sample x_j we need to compute w^T x_j + b = sum_i α_i y_i x_i^T x_j + b, i.e., dot products with the training samples. Is this too much computational work (for example when using transformations of the data)?

Dual formulation for the non-linearly separable case
Dual target function: max_α sum_i α_i - (1/2) sum_{i,j} α_i α_j y_i y_j x_i^T x_j, subject to sum_i α_i y_i = 0 and C >= α_i >= 0.
The only difference is that the α_i are now bounded: there is an upper bound C on each α_i. The hyperparameter C should be tuned through k-fold cross-validation (a sketch follows below). To evaluate a new sample x_j we still compute w^T x_j + b = sum_i α_i y_i x_i^T x_j + b. This is very similar to the optimization problem in the linearly separable case, and once again a QP solver can be used to find the α_i.
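
A minimal sketch of the "tune C by k-fold cross-validation" remark, using scikit-learn's GridSearchCV; the grid of C values and the dataset are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(3)
X = np.vstack([rng.randn(100, 2) + 1, rng.randn(100, 2) - 1])
y = np.hstack([np.ones(100), -np.ones(100)])

# 5-fold CV over a small grid of C values
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"], "CV accuracy:", search.best_score_)
```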

Today
- Support Vector Machine (SVM)
  - History of SVM
  - Large-margin linear classifier
  - Define margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non-linearly separable case
  - Optimization with dual form
  - Nonlinear decision boundary
  - Multiclass SVM

Classifying in 1-d
Can an SVM correctly classify this data? What about this? (Two 1-d datasets on an axis X are shown on the slide; the second one is not linearly separable in one dimension.)

Classifying in 1-d
Can an SVM correctly classify this data? And now? (Extend the 1-d input X with a polynomial basis, e.g., add X^2 as a second feature, and the data becomes separable.)

Non-linear SVMs: 2D
The original input space (x) can be mapped to some higher-dimensional feature space (φ(x)) where the training set is separable:
x = (x_1, x_2),  φ(x) = (x_1^2, x_2^2, sqrt(2) x_1 x_2),  Φ: x → φ(x)
(This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt)

Non-linear SVMs: 2D
The original input space (x) can be mapped to some higher-dimensional feature space (φ(x)) where the training set is separable:
x = (x_1, x_2),  φ(x) = (x_1^2, x_2^2, sqrt(2) x_1 x_2),  Φ: x → φ(x)
If data is mapped into a sufficiently high dimension, then samples will in general be linearly separable; N data points are in general separable in a space of N-1 dimensions or more! (A small numerical sketch of this mapping follows below.)
(This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt)

A little bit of theory: Vapnik-Chervonenkis (VC) dimension
If data is mapped into a sufficiently high dimension, then samples will in general be linearly separable; N data points are in general separable in a space of N-1 dimensions or more!
The VC dimension of the set of oriented lines in R^2 is 3. It can be shown that the VC dimension of the family of oriented separating hyperplanes in R^N is at least N+1.
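
An illustrative sketch of the mapping idea (my own example, not from the slides): data that is not linearly separable in 2-D (an inner ring vs. an outer ring) becomes linearly separable after the quadratic map φ(x) = (x_1^2, x_2^2, sqrt(2) x_1 x_2), because the squared radius x_1^2 + x_2^2 is a linear function of the new features. The ring dataset is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

# Inner ring (class -1) vs. outer ring (class +1): not linearly separable in 2-D
rng = np.random.RandomState(4)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.hstack([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.hstack([-np.ones(100), np.ones(100)])

def phi(X):
    """Explicit quadratic feature map for 2-D inputs."""
    return np.c_[X[:, 0] ** 2, X[:, 1] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1]]

print("linear SVM in original 2-d space:", SVC(kernel="linear").fit(X, y).score(X, y))
print("linear SVM in phi feature space :", SVC(kernel="linear").fit(phi(X), y).score(phi(X), y))
```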

Transformation of inputs
Possible problems: a high computational burden due to the high dimensionality, and many more parameters. SVM solves these two issues simultaneously: kernel tricks give efficient computation, and the dual formulation only assigns parameters to samples, not to features.
Input space → (via φ(·)) → feature space.

Quadratic kernels
While working in higher dimensions is beneficial, it also increases our running time because of the dot product computation. However, there is a neat trick we can use.
Dual: max_α sum_i α_i - (1/2) sum_{i,j} α_i α_j y_i y_j Φ(x_i)^T Φ(x_j), subject to sum_i α_i y_i = 0 and α_i >= 0.
Consider all quadratic terms for x_1, x_2, ..., x_m (m is the number of features in each vector):
Φ(x) = ( 1, sqrt(2) x_1, ..., sqrt(2) x_m, x_1^2, ..., x_m^2, sqrt(2) x_1 x_2, ..., sqrt(2) x_{m-1} x_m )
i.e., m+1 linear terms, m quadratic terms, and m(m-1)/2 pairwise terms. The sqrt(2) factor will become clear on the next slide.

Dot product for quadratic kernels
How many operations do we need for the dot product?
Φ(x)^T Φ(z) = 1 + 2 sum_i x_i z_i + sum_i x_i^2 z_i^2 + 2 sum_i sum_{j>i} x_i x_j z_i z_j
(m+1 linear terms, m quadratic terms, m(m-1)/2 pairwise terms, so roughly m^2/2 operations)

The kernel trick
However, we can obtain dramatic savings by noting that
(x^T z + 1)^2 = (x·z)^2 + 2(x·z) + 1 = (sum_i x_i z_i)^2 + 2 sum_i x_i z_i + 1
             = sum_i x_i^2 z_i^2 + 2 sum_i sum_{j>i} x_i x_j z_i z_j + 2 sum_i x_i z_i + 1
which is exactly Φ(x)^T Φ(z), yet only needs about m operations!
So, if we define the kernel function as K(x, z) = (x^T z + 1)^2, there is no need to carry out φ(·) explicitly.
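
A quick numerical check of the identity above (my own sketch): building the explicit quadratic map Φ with the sqrt(2) weights and verifying that Φ(x)·Φ(z) equals (x^T z + 1)^2, so the kernel never needs Φ explicitly.

```python
import numpy as np
from itertools import combinations

def Phi(x):
    """Explicit quadratic feature map: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j (i<j)."""
    m = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * xi for xi in x]                                   # m+1 linear terms (incl. the 1)
    feats += [xi ** 2 for xi in x]                                           # m quadratic terms
    feats += [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(m), 2)]  # m(m-1)/2 pairwise terms
    return np.array(feats)

rng = np.random.RandomState(5)
x, z = rng.randn(6), rng.randn(6)

lhs = Phi(x) @ Phi(z)          # explicit feature-space dot product (~m^2 work)
rhs = (x @ z + 1.0) ** 2       # kernel trick: only ~m operations
print(np.isclose(lhs, rhs))    # True
```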

Where we are
Our dual target function: max_α sum_i α_i - (1/2) sum_{i,j} α_i α_j y_i y_j Φ(x_i)^T Φ(x_j), subject to sum_i α_i y_i = 0 and α_i >= 0; with explicit feature maps this costs roughly m·n^2 operations at each iteration.
To evaluate a new sample x_j we need to compute w^T Φ(x_j) + b = sum_i α_i y_i Φ(x_i)^T Φ(x_j) + b, about m·r operations, where r is the number of support vectors (α_i > 0).
So, if we define the kernel function as K(x, z) = (x^T z + 1)^2, there is no need to carry out φ(·) explicitly. (Slide adapted from Eric Xing @ CMU, 2006-2008.)

More examples of kernel functions
- Linear kernel (we've seen it): K(x, x') = x^T x'
- Polynomial kernel (we just saw an example): K(x, x') = (1 + x^T x')^p, where p = 2, 3, ... To get the feature vectors we concatenate all p-th order polynomial terms of the components of x (weighted appropriately).
- Radial basis kernel: K(x, x') = exp( -(1/2) ||x - x'||^2 ). In this case the feature space consists of functions and results in a non-parametric classifier.
- Never represent features explicitly: compute dot products in closed form. There is very interesting theory here (Reproducing Kernel Hilbert Spaces), not covered in detail in this course.
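
A hedged sketch comparing the kernels listed above with scikit-learn's SVC. In scikit-learn, "poly" with gamma=1, coef0=1, degree=2 corresponds to (1 + x^T x')^2, and "rbf" with gamma=0.5 corresponds to exp(-(1/2)||x - x'||^2); the circular toy dataset is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(6)
X = rng.randn(200, 2)
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular decision boundary

for name, clf in [
    ("linear", SVC(kernel="linear")),
    ("poly (1 + x.x')^2", SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0)),
    ("rbf exp(-0.5||x-x'||^2)", SVC(kernel="rbf", gamma=0.5)),
]:
    print(f"{name:>25}: training accuracy = {clf.fit(X, y).score(X, y):.2f}")
```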

Today
- Support Vector Machine (SVM)
  - History of SVM
  - Large-margin linear classifier
  - Define margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non-linearly separable case
  - Optimization with dual form
  - Nonlinear decision boundary
  - Multiclass SVM

Multi-class classification with SVMs
What if we have data from more than two classes? The most common solution is one vs. all:
- create a classifier for each class against all other data
- for a new point, apply all classifiers and compare the margins for the selected classes
Note that this is not necessarily valid, since this is not what we trained the SVM for, but it often works well in practice.
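
A minimal sketch of the one-vs-all strategy just described (my own example): train one binary SVM per class against the rest, then pick the class whose classifier gives the largest decision value. Using the iris dataset here is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One binary classifier per class: current class = +1, every other class = -1
clfs = [SVC(kernel="linear", C=1.0).fit(X, np.where(y == c, 1, -1)) for c in classes]

# For a new point, compare the (signed) margins and take the largest
scores = np.column_stack([clf.decision_function(X) for clf in clfs])
pred = classes[scores.argmax(axis=1)]
print("one-vs-all training accuracy:", (pred == y).mean())
```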

Handwritten digit recognition (1999, SVM)

Why do SVMs work?
If we are using huge feature spaces (with kernels), how come we are not overfitting the data?
- The number of parameters remains the same (and most are set to 0)
- While we have a lot of input values, in the end we only care about the support vectors, and these are usually a small group of samples
- The minimization (or, equivalently, the maximization of the margin) acts as a sort of regularization term, leading to reduced overfitting

Software
- A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
- Some implementations (such as LIBSVM) can handle multi-class classification
- SVMlight is among the earliest implementations of SVM
- Several Matlab toolboxes for SVM are also available

References
- Big thanks to Prof. Ziv Bar-Joseph @ CMU for allowing me to reuse some of his slides
- Prof. Andrew Moore @ CMU's slides
- The Elements of Statistical Learning, by Hastie, Tibshirani and Friedman