COMP 562: Introduction to Machine Learning

Size: px

Start display at page:

Download "COMP 562: Introduction to Machine Learning"

Julie Adelia Manning
5 years ago
Views:

1 COMP 562: Introduction to Machine Learning Lecture 20 : Support Vector Machines, Kernels Mahmoud Mostapha 1 Department of Computer Science University of North Carolina at Chapel Hill mahmoudm@cs.unc.edu October 31, Slides adapted from Eric Eaton

2 COMP562 - Lecture 20 Plan for today: Prediction Why might prediction be wrong? Support vector machines Doing really well with linear models Kernels Making the non-linear linear

3 Why Might be Wrong? True non- determinism Flip a biased coin p(heads) = θ! Es@mate θ! If θ > 0.5 predict heads, else tails Lots of ML research on problems like this: Learn a model Do the best you can in expecta@on

4 Why Might be Wrong? observability Something needed to predict y is missing from observa@on x N- bit parity problem x contains N- 1 bits (hard PO) x contains N bits but learner ignores some of them (sow PO) Noise in the observa@on x Measurement error Instrument limita@ons

5 Why Might be Wrong? True non- determinism observability hard, sow bias Algorithmic bias Bounded resources

6 Bias Having the right features (x) is crucial 0 x! x 2!

7 Support Vector Machines Doing Really Well with Linear Decision Surfaces

8 Strengths of SVMs Good in theory in Works well with few training instances Find globally best model Efficient algorithms Amenable to the kernel trick

9 Linear Separators Training instances x 2 R d+1,x 0 =1 y 2 { 1, 1} Model parameters 2 R d+1 Hyperplane x = h, xi =0 Decision func@on h(x) = sign( x)=sign(h, xi) Recall: Inner (dot) product: hu, vi = u v = u v = X i u i v i

14 A Good Separator

15 Noise in the

16 Ruling Out Some Separators

17 Lots of Noise

18 Only One Separator Remains

19 Maximizing the Margin

20 Fat Separators

21 Fat Separators margin

22 Why Maximize Margin Increasing margin reduces capacity i.e., fewer possible models Lesson from Learning Theory: If the following holds: H is sufficiently constrained in size and/or the size of the training data set n is large, then low training error is likely to be evidence of low generaliza@on error 23

23 View of Regression h (x) = 1 1+e T x h (x) =g(z) z = T x If y =1, we want h, T (x) 1 x 0 If y =0, we want h (x) 0, T x 0 J( ) = min J( ) Based on slide by Andrew Ng nx [y i log h (x i )+(1 y i ) log (1 h (x i ))] i=1 cost 1 ( x i ) cost 0 ( x i ) 24

24 Alternate View of Regression Cost of example: y i log h (x i ) (1 y i ) log (1 h (x i )) h (x) = 1 1+e T x z = T x If y =1(want T x 0 ): If y =0(want T x 0 ): Based on slide by Andrew Ng 25

25 Regression to SVMs Regression: nx dx min [y i log h (x i )+(1 y i ) log (1 h (x i ))] + 2 i=1 j=1 2 j Support Vector Machines: nx min C [y i cost 1 ( x i )+(1 y i ) cost 0 ( x i )] i=1 dx j=1 2 j You can think of C as similar to! 1 26

26 Support Vector Machine min C nx [y i cost 1 ( x i )+(1 y i ) cost 0 ( x i )] i=1 dx j=1 2 j If y =1(want x 1): If y =0(want x apple 1): `hinge (h(x)) = max(0, 1 y h(x)) Based on slide by Andrew Ng 27

27 Support Vector Machine min C nx [y i cost 1 ( x i )+(1 y i ) cost 0 ( x i )] i=1 y = 1 / 0 dx j=1 2 j with C = 1 y = +1 / min 2 dx j=1 2 j s.t. x i 1 x i apple 1 if y i =1 if y i = 1 1 min 2 dx j=1 2 j s.t. y i ( x i ) 1 28

28 Maximum Margin Hyperplane 2 margin = k k 2 x =1 x = 1

29 Support Vectors x =1 x = 1

30 Vector Inner Product v 2 v! u 2 θ p! u! kuk 2 = length(u) 2 R q = u u2 2 v 1 u 1 Based on example by Andrew Ng u v = v u = u 1 v 1 + u 2 v 2 = kuk k k 2 kvk k k 2 cos = pkukk k 2 where p = kvkk 2 cos 32

31 Understanding the Hyperplane 1 min 2 dx j=1 2 j s.t. x i 1 x i apple 1 if y i =1 if y i = 1 Assume θ 0 = 0 so that the hyperplane is centered at the origin, and that d = 2 x! θ x = k k 2 kxk 2 cos {z } p θ p! = pk k 2 Based on example by Andrew Ng 33

32 Based on example by Andrew Ng Maximizing the Margin 1 min 2 dx j=1 2 j s.t. x i 1 x i apple 1 if y i =1 if y i = 1 Assume θ 0 = 0 so that the hyperplane is centered at the origin, and that d = 2 Let p i be the projec@on of x i onto the vector θ θ -θ Since p is small, therefore k k 2 must be large to have pk k 2 1 (or - 1) -θ θ Since p is larger, k k 2 can be smaller in order to have pk k 2 1 (or - 1)

33 Size of the Margin For the support vectors, we have pk k 2 = ±1 p is the length of the projec@on of the SVs onto θ -θ p θ Therefore, p = 1 k k 2 margin = 2p = 2 k k 2 margin 35

34 The SVM Dual Problem The primal SVM problem was given as 1 dx min j 2 2 j=1 s.t. y i ( x i ) 1 8i Can solve it more efficiently by taking the Lagrangian dual Duality is a common idea in op@miza@on It transforms a difficult op@miza@on problem into a simpler one Key idea: introduce slack variables α i for each constraint α i indicates how important a par@cular constraint is to the solu@on 36

35 The SVM Dual Problem The Lagrangian is given by L(, ) = 1 dx j 2 2 j=1 s.t. i 0 8i nx i=1 i (y i x 1) We must minimize over θ and maximize over α At op@mal solu@on, par@als w.r.t θ s are 0 Solve by a bunch of algebra and calculus... and we obtain... 37

36 SVM Dual Maximize J( ) = nx i=1 i 1 2 The decision func@on is given by h(x) = sign where b = 1 SV s.t. i 0 8i X i y i =0 i X i2sv X i2sv nx i=1 nx j=1 i y i hx, x i i + b i X j2sv i j y i y j hx i, x j i! 1 j y j hx i, x j ia 38

37 Maximize Understanding the Dual J( ) = nx i=1 i 1 2 s.t. i 0 8i X i y i =0 Balances between the weight of constraints for different classes! i nx i=1 nx j=1 i j y i y j hx i, x j i Constraint weights (α i s) cannot be nega@ve! 39

Understanding the Dual Maximize J( ) = nx i=1 i 1 2

the sum Measures the similarity between points!

38 Understanding the Dual Maximize J( ) = nx i=1 i 1 2 nx i=1 nx j=1 i j y i y j hx i, x j i s.t. i X 0 8i i y i =0 i Points with different labels increase the sum Points with same label decrease the sum Measures the similarity between points! Intui@vely, we should be more careful around points near the margin 40

39 Understanding the Dual Maximize J( ) = nx i=1 i 1 2 nx i=1 nx j=1 i j y i y j hx i, x j i s.t. i X 0 8i i y i =0 i In the solu@on, either: α i > 0 and the constraint ( y i ( x i )=1) Ø point is a support vector α i = 0 Ø point is not a support vector 41

40 Employing the Given the α*, weights are? = X i? y i x i In this formula@on, have not added x 0 = 1 Therefore, we can solve one of the SV constraints y i (? x i + 0 )=1 to obtain θ 0 i2svs Or, more commonly, take the average solu@on over all support vectors 42

41 What if Data Are Not Linearly Separable? Soft margin SVMs

42 Large Margin Classifier in Presence of Outliers C very large x 2 C not too large x 1 Based on slide by Andrew Ng 31

43 Strengths of SVMs Good in theory Good in Work well with few training instances Find globally best model Efficient algorithms Amenable to the kernel trick

44 What if Surface is Non- Linear? O O O O O O O O X X X X O O X X O O O O O O O O O O Image from h`p://

45 Kernel Methods Making the Non- Linear Linear

46 When Linear Separators Fail 0 x! x 2!

47 Mapping into a New Feature Space For example, with x i 2 R 2 ([x i1,x i2 ]) = [x i1,x i2,x i1 x i2,x 2 i1,x 2 i2] Rather than run SVM on x i, run it on Φ(x i ) Find non- linear separator in input space What if Φ(x i ) is really big? Use kernels to compute it implicitly! Image from h`p://web.engr.oregonstate.edu/ ~afern/classes/cs534/ :X 7! ˆX = (x)

48 Kernels Find kernel K such that K(x i, x j )=h (x i ), (x j )i Compu@ng K(x i, x j ) should be efficient, much more so than compu@ng Φ(x i ) and Φ(x j ) Use K(x i, x j ) in SVM algorithm rather than Remarkably, this is possible! hx i, x j i

49 The Polynomial Kernel Let x i =[x i1,x i2 ] and x j =[x j1,x j2 ] Consider the following func@on: K(x i, x j )=hxh i, x j i 2 =(x i1 x j1 + x i2 x j2 ) 2 =(x 2 i1x 2 j1 + x 2 i2x 2 j2 +2x i1 x i2 x j1 x j2 ) = h (x i ), (x j )ii where h i (x i )=[x 2 i1,x 2 i2, p p 2x i1 x i2 ] (x j )=[x 2 j1,x 2 j2, p 2x j1 x j2 ]

50 The Polynomial Kernel Given by K(x i, x j )=hx i, x j i d Φ(x) contains all monomials of degree d! Useful in visual pa`ern recogni@on Example: 16x16 pixel image monomials of degree 5 Never explicitly compute Φ(x)! Varia@on: K(x i, x j )=(hx i, x j i + 1) d Adds all lower- order monomials (degrees 1,...,d )!

51 The Kernel Trick Given an algorithm which is formulated in terms of a posi@ve definite kernel K 1, one can construct an alterna@ve algorithm by replacing K 1 with another posi@ve definite kernel K 2 Ø SVMs can use the kernel trick

52 Kernels into SVM J( ) = nx i=1 X nx J( ) = i=1 i i i 1 2 s.t. a i 0 8i X i y i =0 nx i=1 nx i=1 nx j=1 nx j=1 i j y i y j hx i, x j i i j y i y j K(x i, x j ) 53

The Gaussian Kernel Also called Radial Basis Func@on (RBF) kernel kxi x j k 2 2 K(x

Value falls off to 0 with increasing distance Note: Need to do feature scaling before

53 The Gaussian Kernel Also called Radial Basis (RBF) kernel kxi x j k 2 2 K(x i, x j )=exp 2 2 Has value 1 when x i = x j! Value falls off to 0 with increasing distance Note: Need to do feature scaling before using Gaussian Kernel lower bias, higher variance higher bias, lower variance

54 Other Kernels Sigmoid Kernel K(x i, x j ) = tanh ( x i x j + c) Neural networks use sigmoid as ac@va@on func@on SVM with a sigmoid kernel is equivalent to 2- layer perceptron Cosine Similarity Kernel K(x i, x j )= x i x j kx i kkx j k Popular choice for measuring similarity of text documents L 2 norm projects vectors onto the unit sphere; their dot product is the cosine of the angle between the vectors 59

55 Chi- squared Kernel K(x i, x j )=exp Other Kernels Widely used in computer vision applica@ons Chi- squared measures distance between probability distribu@ons Data is assumed to be non- nega@ve, owen with L 1 norm of 1 String kernels Tree kernels Graph kernels X k (x ik x jk ) 2 x ik + x jk! 60

56 ! An Aside: The Math Behind Kernels What does it mean to be a kernel? K(x i, x j )=h (x i ), (x j )i for some Φ What does it take to be a kernel? The Gram matrix G ij = K(x i, x j ) Symmetric matrix Posi@ve semi- definite matrix: "z T Gz 0 for every non- zero vector z 2 R n Establishing kernel- hood from first principles is non- trivial

57 A Few Good Kernels... Linear Kernel K(x i, x j )=hx i, x j i Polynomial kernel K(x i, x j )=(hx i, x j i + c) d c 0 trades off influence of lower order terms Gaussian kernel Sigmoid kernel Many more... K(x i, x j )=exp Cosine similarity kernel Chi- squared kernel String/tree/graph/wavelet/etc kernels kxi x j k K(x i, x j ) = tanh ( x i x j + c) 62

58 Photo Retouching (Leyvand et al., 2008)

59 Advice for Applying SVMs Use SVM sowware package to solve for parameters e.g., SVMlight, libsvm, cvx (fast!), etc. Need to specify: Choice of parameter C! Choice of kernel Associated kernel parameters e.g., K(x i, x j )=(hx i, x j i + c) d kxi x j k 2 2 K(x i, x j )=exp

Mul@- Class Classifica@on with SVMs y 2{1,.

60 Class with SVMs y 2{1,...,K} Many SVM packages already have mul@- class classifica@on built in Otherwise, use one- vs- rest Based on slide by Andrew Ng Train K SVMs, each picks out one class from rest, yielding (1),..., (K) Predict class i with largest ( (i) ) x 65

61 SVMs vs Regression (Advice from Andrew Ng) n = # training examples d = # features If d is large (rela@ve to n) (e.g., d > n with d = 10,000, n = 10-1,000) Use logis@c regression or SVM with a linear kernel If d is small (up to 1,000), n is intermediate (up to 10,000) Use SVM with Gaussian kernel If d is small (up to 1,000), n is large (50,000+) Create/add more features, then use logis@c regression or SVM without a kernel Neural networks likely to work well for most of these se~ngs, but may be slower to train Based on slide by Andrew Ng 66

62 Other SVM nu SVM nu parameter controls: of support vectors (lower bound) and rate (upper bound) E.g., =0.05 guarantees that 5% of training points are SVs and training error rate is 5% Harder to than C- SVM and not as scalable SVMs for regression One- class SVMs SVMs for clustering... 67

63 Conclusion SVMs find linear separator The kernel trick makes SVMs learn non- linear decision surfaces Strength of SVMs: Good and empirical performance Supports many types of kernels Disadvantages of SVMs: Slow to train/predict for huge data sets (but fast!) Need to choose the kernel (and tune its parameters)

64 Today Support Vector Machines (SVM) Kernels

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers Computational Methods for Data Analysis Massimo Poesio SUPPORT VECTOR MACHINES Support Vector Machines Linear classifiers 1 Linear Classifiers denotes +1 denotes -1 w x + b>0 f(x,w,b) = sign(w x + b) How