LINEAR CLASSIFIER
FIND A FUNCTION TO CLASSIFY HIGH VALUE CUSTOMERS

High Value Customers          Low Value Customers
Salary   Nb Orders            Salary   Nb Orders
150      70                   40       80
300      100                  220      20
200      80                   100      20
120      100                  175      10

Task: find α1, α2, α3 such that
  α1 · salary + α2 · orders − α3 > 0  →  high value customer
  α1 · salary + α2 · orders − α3 < 0  →  low value customer

[Scatter plot: the two classes in the (salary, orders) plane, salary on the x-axis, orders on the y-axis.]
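As a minimal sketch of this task in Python (the coefficient values below are hand-picked for illustration; the slides do not give them), one choice of (α1, α2, α3) that separates the two tables is (1, 2, 275):

```python
# High/low value customers from the tables above: (salary, nb_orders).
high_value = [(150, 70), (300, 100), (200, 80), (120, 100)]
low_value = [(40, 80), (220, 20), (100, 20), (175, 10)]

def classify(salary, orders, a1, a2, a3):
    """+1 (high value) if a1*salary + a2*orders - a3 > 0, else -1 (low value)."""
    return 1 if a1 * salary + a2 * orders - a3 > 0 else -1

# Hand-picked coefficients (an assumption for illustration, not from the slides).
a1, a2, a3 = 1.0, 2.0, 275.0
assert all(classify(s, o, a1, a2, a3) == +1 for s, o in high_value)
assert all(classify(s, o, a1, a2, a3) == -1 for s, o in low_value)
```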
LINEAR CLASSIFIERS

[Diagram: input x → f(x, w, b) → y_est; points are labeled +1 or −1.]

f(x, w, b) = sign(w · x − b)

How would you classify this data?

x is the vector of feature values and w the vector of weights; w · x = b is a line (hyperplane).
LINEAR CLASSIFIERS

f(x, w, b) = sign(w · x − b)

How would you classify this data?

$w \cdot x = b$, i.e. $w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + \ldots = b$, is a hyperplane.
LINEAR CLASSIFIERS

f(x, w, b) = sign(w · x − b)

How would you classify this data?

[Repeated slides showing the same data with different candidate separating lines.]
LINEAR CLASSIFIERS

f(x, w, b) = sign(w · x − b)

Any of these would be fine… but which is best?
CLASSIFIER MARGIN

f(x, w, b) = sign(w · x − b)

Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.
MAXIMUM MARGIN

f(x, w, b) = sign(w · x − b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).

Support vectors are the datapoints that the margin pushes up against.
WHY MAXIMUM MARGIN?

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. Leave-one-out cross-validation (LOOCV) is easy, since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very, very well.
ESTIMATE THE MARGIN

For a separating hyperplane $w \cdot x + b = 0$ with $\forall x_i \in D: y_i (x_i \cdot w + b) > 0$: what is the distance from a point $x$ to the hyperplane?

$$d(x) = \frac{|x \cdot w + b|}{\|w\|_2} = \frac{|x \cdot w + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$$
ESTIMATE THE MARGIN

What is the expression for the margin?

$$\text{margin} \equiv \min_{x \in D} d(x) = \min_{x \in D} \frac{|x \cdot w + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$$
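A minimal sketch of this formula, assuming the datapoints are the rows of a NumPy array X:

```python
import numpy as np

def margin(w, b, X):
    """Margin of the hyperplane w.x + b = 0 on datapoints X (one row each):
    the smallest distance |x.w + b| / ||w||_2 over all x in X."""
    w, X = np.asarray(w, dtype=float), np.asarray(X, dtype=float)
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# Example: the closest point (1, 1) lies at distance sqrt(2) from x1 + x2 = 0.
print(margin([1.0, 1.0], 0.0, [[1.0, 1.0], [3.0, 3.0]]))  # ~1.414
```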
MAXIMIZE MARGIN

$$\underset{w,b}{\operatorname{argmax}}\; \text{margin}(w, b, D) = \underset{w,b}{\operatorname{argmax}}\; \min_{x_i \in D} d(x_i) = \underset{w,b}{\operatorname{argmax}}\; \min_{x_i \in D} \frac{|b + x_i \cdot w|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$$
MAXIMIZE MARGIN

$$\underset{w,b}{\operatorname{argmax}}\; \min_{x_i \in D} \frac{|b + x_i \cdot w|}{\sqrt{\sum_{i=1}^{d} w_i^2}} \quad \text{subject to}\quad \forall x_i \in D: y_i (x_i \cdot w + b) > 0$$

This is a min-max problem → a game problem.
MAXIMIZE MARGIN

Strategy: rescaling $(w, b)$ does not change the hyperplane, so fix the scale by requiring $\forall x_i \in D: |b + x_i \cdot w| \ge 1$, with equality for the closest points. Maximizing the margin then becomes

$$\underset{w,b}{\operatorname{argmin}} \sum_{i=1}^{d} w_i^2 \quad \text{subject to}\quad \forall x_i \in D: y_i (x_i \cdot w + b) \ge 1$$

If you want to learn why this holds, I highly recommend the following video: https://www.youtube.com/watch?v=_pwhiwxhk8o (not necessary to know for the …)
MAXIMUM MARGIN LINEAR CLASSIFIER

$$\{w^*, b^*\} = \underset{w,b}{\operatorname{argmin}} \sum_{k=1}^{d} w_k^2 \quad \text{subject to}\quad y_i (w \cdot x_i + b) \ge 1 \;\; \text{for } i = 1, \ldots, N$$

How to solve it?
LEARNING VIA QUADRATIC PROGRAMMING

Quadratic programming (QP) is a well-studied class of optimization problems: optimize a quadratic function of some real-valued variables subject to linear constraints. The maximum margin problem above is exactly of this form.
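As a rough illustration, assuming a small linearly separable toy dataset (a dedicated QP package would be the usual choice; here a generic constrained optimizer from SciPy stands in for it):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (an assumption for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

def objective(v):                      # v = [w_1 .. w_d, b]
    return 0.5 * np.dot(v[:d], v[:d])  # minimize (1/2)||w||^2

# One linear constraint per datapoint: y_i (w.x_i + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda v, i=i: y[i] * (X[i] @ v[:d] + v[d]) - 1.0}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(d + 1), constraints=constraints)
w, b = res.x[:d], res.x[d]
print("w =", w, "b =", b, "predictions:", np.sign(X @ w + b))
```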
SOFT MARGIN CLASSIFICATION

If the training data is not linearly separable, slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples.

- Allow some errors: let some points be moved to where they belong, at a cost.
- Still try to minimize training-set errors, and place the hyperplane far from each class (large margin).
REGULARIZATION: SOFT MARGIN CLASSIFICATION MATHEMATICALLY

Hard margin: find $w$ and $b$ such that $\Phi(w) = \tfrac{1}{2} w^T w$ is minimized, and for all $(x_i, y_i)$: $y_i (w^T x_i + b) \ge 1$.

Soft margin: find $w$ and $b$ such that $\Phi(w) = \tfrac{1}{2} w^T w + C \sum_i \xi_i$ is minimized, and for all $(x_i, y_i)$: $y_i (w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.
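A sketch of the effect of C using scikit-learn's linear soft-margin SVM (the library and the toy data are assumptions of this example, not from the slides): small C tolerates slack cheaply and keeps a wide margin; large C approaches the hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with one overlapping/noisy point (the last one).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0], [0.2, 0.2]])
y = np.array([1, 1, -1, -1, -1])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: w={clf.coef_[0]}, b={clf.intercept_[0]}")
```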
SUPPOSE WE'RE IN 1 DIMENSION

What would SVMs do with this data?

[1-D datapoints plotted along the x-axis.]
SUPPOSE WE'RE IN 1 DIMENSION

Not a big surprise: the separator sits between the two classes, with the positive plane on one side and the negative plane on the other.
HARDER 1-DIMENSIONAL DATASET

That's wiped the smirk off SVM's face. What can be done about this?

[1-D datapoints in which the two classes are interleaved along the x-axis.]
HARDER 1-DIMENSIONAL DATASET

Permitting non-linear basis functions: map each point as $z_k = (x_k, x_k^2)$. In the new space the data become linearly separable.
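A small sketch of this basis expansion (the datapoint values and labels are made up for illustration): no single threshold on x separates the classes, but after mapping to (x, x²) a horizontal line does.

```python
import numpy as np

x = np.array([-3.0, -2.0, 2.0, 3.0,   # outer points: label +1
              -0.5, 0.0, 0.5])        # inner points: label -1
z = np.column_stack([x, x ** 2])      # z_k = (x_k, x_k^2)

# In (x, x^2) space, the second coordinate alone separates the classes:
# outer points have x^2 >= 4, inner points have x^2 <= 0.25.
print(z)
```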
NONLINEAR KERNEL (I)
NON-LINEAR KERNEL: $\mathbb{R}^2 \to \mathbb{R}^3$

$$(x_1, x_2) \mapsto (z_1, z_2, z_3) = (x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2)$$

If we can linearly separate the mapped data, our decision boundaries will be non-linear in the original space. [http://www.cs.berkeley.edu/~jordan/courses/281b-
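A quick numerical check (an illustration, not from the slides) that this mapping realizes the degree-2 polynomial kernel, i.e. φ(x) · φ(y) = (x · y)², so the 3-D mapping never has to be computed explicitly:

```python
import numpy as np

def phi(x):
    """The mapping above: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
# Inner product in feature space equals the kernel in input space.
assert np.isclose(phi(x) @ phi(y), (x @ y) ** 2)
```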
NONLINEAR KERNEL (II)

A nice video lecture from Caltech on RBF kernels: https://www.youtube.com/watch?v=o8cfrnoptlc
DEMO

http://cs.stanford.edu/people/karpathy/svmjs/demo/
KERNEL TRICKS

Pros:
- Introduces non-linearity into the model.
- Computationally cheap.

Cons:
- Still has potential overfitting problems (see the sketch below).
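A small sketch of the overfitting con (the dataset, library, and gamma values are assumptions of this example): with an RBF kernel, a very large gamma lets the model fit noise, and cross-validated accuracy typically drops.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Noisy two-class data; compare moderate and extreme kernel widths.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
for gamma in (0.1, 1.0, 100.0):
    acc = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5).mean()
    print(f"gamma={gamma}: cross-validated accuracy = {acc:.2f}")
```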
SVM PERFORMANCE

- Anecdotally they work very, very well.
- Example: they are currently the best-known classifier on a well-studied handwritten-character recognition benchmark.
- Reliable people doing practical real-world work claim that SVMs have saved them when their other favorite classifiers did poorly.
- There is a lot of excitement and religious fervor about SVMs.
- Despite this, some practitioners are a little skeptical.
REFERENCES

- An excellent tutorial on VC-dimension and support vector machines: C. J. C. Burges. "A Tutorial on Support Vector Machines for Pattern Recognition." Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
- The VC/SRM/SVM bible: Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
- Software: SVM-light, http://svmlight.joachims.org/ (free download).