Andrew W. Moore Professor School of Computer Science Carnegie Mellon University

Size: px

Start display at page:

Download "Andrew W. Moore Professor School of Computer Science Carnegie Mellon University"

Shanon Warren
6 years ago
Views:

1 Spport Vector Machines Note to other teachers and sers of these slides. Andrew wold be delighted if yo fond this sorce material sefl in giving yor own lectres. Feel free to se these slides verbatim, or to modify them to fit yor own needs. PowerPoint originals are available. If yo make se of a significant portion of these slides in yor own lectre, please inclde this message, or the following link to the sorce repository of Andrew s ttorials: Comments and corrections grateflly received. Andrew W. Moore Professor School of Compter Science Carnegie Mellon University awm@cs.cm.ed Nov 23rd, 2001

2 Overviews Proposed by Vapnik and his colleages - Started in 1963, taking shape in late 70 s as part of his statistical learning theory (with Chervonenkis) - Crrent form established in early 90 s (with Cortes) Becomes poplar in last decade - Classification, regression (fnction approx.) optimization - Compared favorably to MLP Basic ideas - Overcoming linear seperability problem by transforming the problem into higher dimensional space sing kernel fnctions - (become eqiv. to 2-layer perceptron when kernel is sigmoid fnction) - Maximize margin of decision bondary Spport Vector Machines: Slide 2

3 Linear Classifiers x f y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How wold yo classify this data? Spport Vector Machines: Slide 3

4 Linear Classifiers x f y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How wold yo classify this data? Spport Vector Machines: Slide 4

5 Linear Classifiers x f y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How wold yo classify this data? Spport Vector Machines: Slide 5

6 Linear Classifiers x f y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How wold yo classify this data? Spport Vector Machines: Slide 6

7 Linear Classifiers x f y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) Any of these wold be fine....bt which is best? Spport Vector Machines: Slide 7

8 Classifier Margin x f y est denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) Define the margin of a linear classifier as the width that the bondary cold be increased by before hitting a datapoint. Spport Vector Machines: Slide 8

9 Maximm Margin x f y est denotes +1 denotes -1 Linear SVM f(x,w,b) = sign(w. x - b) The maximm margin linear classifier is the linear classifier with the, m, maximm margin. This is the simplest kind of SVM (Called an LSVM) Spport Vector Machines: Slide 9

10 Maximm Margin x f y est denotes +1 denotes -1 Spport Vectors are those datapoints that the margin pshes p against Linear SVM f(x,w,b) = sign(w. x - b) The maximm margin linear classifier is the linear classifier with the, m, maximm margin. This is the simplest kind of SVM (Called an LSVM) Spport Vector Machines: Slide 10

11 Why Maximm Margin? denotes +1 denotes Intitively this feels safest. 2. If we ve made f(x,w,b) a small = sign(w. error in xthe - b) location of the bondary (it s been jolted in its perpendiclar The maximm direction) this gives s least margin chance linear of casing a misclassification. classifier is the linear classifier with the, m, maximm margin. 3. CV is easy since the model is immne Spport Vectors to removal of any non-spport-vector are those datapoints. datapoints that the margin 4. There s some theory that this is a This is the pshes p good thing. against simplest kind of 5. Empirically it works very very well. SVM (Called an LSVM) Spport Vector Machines: Slide 11

12 Specifying a line and margin Pls-Plane Classifier Bondary Mins-Plane How do we represent this mathematically? in m inpt dimensions? Spport Vector Machines: Slide 12

13 Specifying a line and margin Pls-Plane Classifier Bondary Mins-Plane Pls-plane = { x : w. x + b = +1 } Mins-plane = { x : w. x + b = -1 } Classify as.. +1 if w. x + b >= 1-1 if w. x + b <= -1 Universe if -1 < w. x + b < 1 explodes Spport Vector Machines: Slide 13

14 Compting the margin width M = Margin Width How do we compte M in terms of w and b? Pls-plane = { x : w. x + b = +1 } Mins-plane = { x : w. x + b = -1 } Claim: The vector w is perpendiclar to the Pls Plane. Why? Spport Vector Machines: Slide 14

15 Compting the margin width M = Margin Width Pls-plane = { x : w. x + b = +1 } Mins-plane = { x : w. x + b = -1 } How do we compte M in terms of w and b? Claim: The vector w is perpendiclar to the Pls Plane. Why? And so of corse the vector w is also perpendiclar to the Mins Plane Let and v be two vectors on the Pls Plane. What is w. ( v )? Spport Vector Machines: Slide 15

16 Compting the margin width x + x - M = Margin Width How do we compte M in terms of w and b? Pls-plane = { x : w. x + b = +1 } Mins-plane = { x : w. x + b = -1 } The vector w is perpendiclar to the Pls Plane Let x - be any ypoint on the mins plane Let x + be the closest pls-plane-point to x -. Any location in R m : not necessarily a datapoint Spport Vector Machines: Slide 16

17 Compting the margin width x + x - M = Margin Width How do we compte M in terms of w and b? Pls-plane = { x : w. x + b = +1 } Mins-plane = { x : w. x + b = -1 } The vector w is perpendiclar to the Pls Plane Let x - be any ypoint on the mins plane Let x + be the closest pls-plane-point to x -. Claim: x + = x - + w for some vale of. Why? Spport Vector Machines: Slide 17

18 Compting the margin width x + M = Margin Width The line from x- to x+ is x perpendiclar to the - planes. How do we compte M in terms of w So to and get b? from x - to x + travel some distance in Pls-plane = { x : w. x + b = +1 direction } w. Mins-plane = { x : w. x + b = -1 } The vector w is perpendiclar to the Pls Plane Let x - be any ypoint on the mins plane Let x + be the closest pls-plane-point to x -. Claim: x + = x - + w for some vale of. Why? Spport Vector Machines: Slide 18

19 Compting the margin width x + M = Margin Width x - What we know: w. x + + b = +1 w. x - + b = -1 x + = x - + w x + - x - = M It s now easy to get M in terms of w and b Spport Vector Machines: Slide 19

20 Compting the margin width x + M = Margin Width x - w. (x - + w) + b = 1 What we know: w. x + + b = +1 w. x - + b = -1 x + = x - + w x + - x - = M It s now easy to get M in terms of w and b => w. x - + b + w.w = 1 => -1 + w.w = 1 => λ 2 w.w Spport Vector Machines: Slide 20

21 Compting the margin width x + M = Margin Width = 2 w. w x - M = x + - x - = w = What we know: w. x + + b = +1 w. x - + b = -1 x + = x - + w x + - x - = M 2 λ w.w λ w λ w. w 2 w. w 2 w. w w. w Spport Vector Machines: Slide 21

22 Learning the Maximm Margin Classifier x + M = Margin Width = 2 w. w x - Given a gess of w and b we can Compte whether all data points in the correct half-planes Compte the width of the margin So now we jst need to write a program to search the space of w s s andb s b to find the widest margin that matches all the datapoints. How? Gradient descent? Simlated Annealing? Matrix Inversion? EM? Newton s Method? Spport Vector Machines: Slide 22

23 Learning via Qadratic Programming QP is a well-stdied class of optimization algorithms to maximize a qadratic fnction of some real-valed variables sbject to linear constraints. Spport Vector Machines: Slide 23

24 Find arg max Qadratic Programming c d T R T 2 Qadratic criterion Sbject to And sbject to a a a a a 11 1 a a1 mm b1 1 a a2mm b2 21 n additional linear ineqality : constraints n1 ( n1)1 ( n 2)1 1 1 a... 1 n2 a a 2 ( n1)2 ( n 2)2 2 2 a nm... a... a : m b ( n1) m ( n 2) m n m m b b ( n1) ( n 2) a( ne)1 1 a( ne) a( ne) m m b( ne ) eqali constr e addition nal linea ity raints ar Spport Vector Machines: Slide 24

25 Find arg max Qadratic Programming c d T R T 2 Qadratic criterion Sbject to And sbject to a a a a a 11 1 a a1 mm b1 1 a a2mm b2 21 n additional linear ineqality : constraints n1 ( n1)1 ( n 2)1 1 1 a... 1 n2 a a 2 ( n1)2 ( n 2)2 2 2 a nm... a... a : m b ( n1) m ( n 2) m n m m b b ( n1) ( n 2) a( ne)1 1 a( ne) a( ne) m m b( ne ) eqali constr e addition nal linea ity raints ar Spport Vector Machines: Slide 25

26 Learning the Maximm Margin Classifier M = 2 w.w Given gess of w, b we can Compte whether all data points are in the correct half-planes Compte the margin width Assme R datapoints, each (x k,y k ) where y k =+/- 1 What shold or qadratic optimization criterion be? How many constraints will we have? What shold they be? Spport Vector Machines: Slide 26

27 Sppose we re in 1-dimension What wold SVMs do with this data? x=0 Spport Vector Machines: Slide 27

28 Sppose we re in 1-dimension Not a big srprise x=0 Positive plane Negative plane Spport Vector Machines: Slide 28

29 Harder 1-dimensional dataset That s wiped the smirk off SVM s face. What can be done abot this? x=0 Spport Vector Machines: Slide 29

30 Harder 1-dimensional dataset Remember how permitting nonlinear basis fnctions made linear regression so mch nicer? Let s permit them here too 2 x=0 z ( x, x ) k ( k k Spport Vector Machines: Slide 30

31 Harder 1-dimensional dataset Remember how permitting nonlinear basis fnctions made linear regression so mch nicer? Let s permit them here too 2 x=0 z ( x, x ) k ( k k Spport Vector Machines: Slide 31

32 Common SVM basis fnctions z k = ( polynomial terms of x k of degree 1 to q ) z k = ( radial basis fnctions of x k ) x k c j z k[ j] φ j ( xk ) KernelFn KW z k = ( sigmoid fnctions of x k ) Spport Vector Machines: Slide 32

33 SVM Kernel Fnctions K(a,b)=(a. b +1) d is an example of an SVM Kernel Fnction Beyond polynomials there are other very high dimensional basis fnctions that can be made practical by finding the right Kernel Fnction Radial-Basis-style style Kernel Fnction: K( a, b) exp ( a b) Neral-net-style Kernel Fnction: K( a, b) tanh( a. b ) 2, and are magic parameters that mst be chosen by a model selection method sch as CV or VCSRM* *see last lectre Spport Vector Machines: Slide 33

34 Spport Vector Machines: Slide 34

35 Spport Vector Machines: Slide 35

36 Spport Vector Machines: Slide 36

37 SVM Performance Anecdotally they work very very well indeed. Overcomes linear seperability problem Generalizes well (overfitting not as serios) Example: crrently the best-known classifier on a well-stdied hand-written-character recognition benchmark several reliable people doing practical real-world work claim that SVMs have saved them when their other favorite classifiers did poorly. There is a lot of excitement and religios fervor abot SVMs as of Despite this, some practitioners (inclding yor lectrer) are a little skeptical. Spport Vector Machines: Slide 37

38 Doing mlti-class classification SVMs can only handle two-class otpts (i.e. a categorical otpt variable with arity 2). What can be done? Answer: with otpt arity N, learn N SVM s SVM 1 learns Otpt==1 vs Otpt!= 1 SVM 2 learns Otpt==2 vs Otpt!= 2 : SVM N learns Otpt==N vs Otpt!= N Then to predict the otpt for a new inpt, jst predict with each SVM and find ot which one pts the prediction the frthest into the positive region. Spport Vector Machines: Slide 38

39 References An excellent ttorial on VC-dimension and Spport Vector Machines: C.J.C. Brges. A ttorial on spport vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2): , The VC/SRM/SVM Bible: Statistical Learning Theory by Vladimir Vapnik, Wiley- Interscience; 1998 Download SVM-light: Statistical Learning Theory by Vladimir Vapnik, Wiley- Interscience; 1998 Spport Vector Machines: Slide 39

FIND A FUNCTION TO CLASSIFY HIGH VALUE CUSTOMERS

LINEAR CLASSIFIER 1 FIND A FUNCTION TO CLASSIFY HIGH VALUE CUSTOMERS x f y High Value Customers Salary Task: Find Nb Orders 150 70 300 100 200 80 120 100 Low Value Customers Salary Nb Orders 40 80 220