Support Vector Machines

1 Support Vector Machines. Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University, February 22nd, 2006. © 2006 Carlos Guestrin. Two SVM tutorials are linked on the class website (please read both): a high-level presentation with applications (Hearst 1998) and a detailed tutorial (Burges 1998).

2 Announcements: Third homework is out, due March 1st. Final assigned by registrar: May 12, 1-4 p.m., location TBD.

3 Linear classifiers: Which line is better? Data: Example i: $w \cdot x = \sum_j w^{(j)} x^{(j)}$
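
As a minimal sketch of the linear classifier on this slide, the Python below computes $w \cdot x = \sum_j w^{(j)} x^{(j)}$ and predicts the sign of $w \cdot x + b$. The function names `dot` and `predict` and the example weights are illustrative assumptions, not taken from the lecture.

```python
def dot(w, x):
    """Dot product w.x = sum_j w[j] * x[j]."""
    return sum(wj * xj for wj, xj in zip(w, x))


def predict(w, b, x):
    """Linear classifier: +1 if w.x + b >= 0, otherwise -1."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical 2-D weights and bias, chosen only for illustration.
w, b = [2.0, -1.0], -0.5
print(predict(w, b, [1.0, 0.0]))  # here w.x + b = 1.5
print(predict(w, b, [0.0, 1.0]))  # here w.x + b = -1.5
```

Many different `(w, b)` pairs separate the same data, which is why the slide asks which line is better.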

4 Pick the one with the largest margin! Decision boundary: $w \cdot x + b = 0$, where $w \cdot x = \sum_j w^{(j)} x^{(j)}$.

5 Maximize the margin. Decision boundary: $w \cdot x + b = 0$.

6 But there are many planes: for any $c \ne 0$, $(cw) \cdot x + cb = 0$ describes the same plane as $w \cdot x + b = 0$.

7 Review: Normal to a plane. For the plane $w \cdot x + b = 0$, the vector $w$ is normal to the plane.

8 Normalized margin, canonical hyperplanes: $w \cdot x + b = +1$, $w \cdot x + b = 0$, $w \cdot x + b = -1$; points $x^+$ and $x^-$ on the $+1$ and $-1$ hyperplanes; margin $2\gamma$.

10 Margin maximization using canonical hyperplanes: $w \cdot x + b = +1$, $w \cdot x + b = 0$, $w \cdot x + b = -1$; margin $2\gamma$.
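
The step from canonical hyperplanes to a margin formula can be sketched as follows. It assumes $x^+$ and $x^-$ lie on the $+1$ and $-1$ hyperplanes and differ by a step of length $2\gamma$ along the unit normal $w / \lVert w \rVert$ (the usual setup; the slide's figure is not reproduced here).

```latex
\begin{aligned}
x^+ &= x^- + 2\gamma \frac{w}{\lVert w \rVert}, \\
w \cdot x^+ + b &= +1, \qquad w \cdot x^- + b = -1, \\
w \cdot (x^+ - x^-) &= 2
  \;\Longrightarrow\; 2\gamma \frac{w \cdot w}{\lVert w \rVert} = 2
  \;\Longrightarrow\; \gamma = \frac{1}{\lVert w \rVert} = \frac{1}{\sqrt{w \cdot w}}.
\end{aligned}
```

Under these assumptions, maximizing the margin $\gamma$ is the same as minimizing $w \cdot w$.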

11 Support vector machines (SVMs): canonical hyperplanes $w \cdot x + b = +1$, $w \cdot x + b = 0$, $w \cdot x + b = -1$; margin $2\gamma$. Solve efficiently by quadratic programming (QP), which has well-studied solution algorithms. The hyperplane is defined by the support vectors.
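
With the canonical-hyperplane normalization, the hard-margin SVM is commonly written as the quadratic program below. This is the standard textbook form, given as a sketch; indexing the training examples by $j$ is a notational assumption.

```latex
\begin{aligned}
\min_{w,\, b} \quad & w \cdot w \\
\text{s.t.} \quad & y_j \,(w \cdot x_j + b) \ge 1 \quad \text{for all training examples } j.
\end{aligned}
```

The objective is quadratic and the constraints are linear in $(w, b)$, which is what makes this a QP. Examples whose constraint holds with equality are the support vectors.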

12 What if the data is not linearly separable? Use features of features of features of features.

13 What if the data is still not linearly separable? Minimize $w \cdot w$ and the number of training mistakes. How to trade off the two criteria? Trade off #(mistakes) and $w \cdot w$: 0/1 loss with slack penalty $C$. This is not a QP anymore, and it also doesn't distinguish near misses from really bad mistakes.

14 Slack variables, hinge loss: If margin $\ge 1$, don't care. If margin $< 1$, pay a linear penalty.
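
A sketch of the soft-margin problem that slack variables lead to, in its standard form. The symbols $\xi_j$ for the slacks and $j$ for the example index are notational assumptions.

```latex
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & w \cdot w + C \sum_j \xi_j \\
\text{s.t.} \quad & y_j \,(w \cdot x_j + b) \ge 1 - \xi_j, \qquad \xi_j \ge 0 .
\end{aligned}
```

At the optimum each slack equals the hinge loss, $\xi_j = \max\bigl(0,\; 1 - y_j (w \cdot x_j + b)\bigr)$: zero when the margin is at least 1, and growing linearly as the margin falls below 1.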

15 Side note: What's the difference between SVMs and logistic regression? The SVM minimizes the hinge loss, while logistic regression minimizes the log loss.
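
The three losses discussed in this lecture can be compared as functions of the margin $y\,(w \cdot x + b)$. The Python below is a minimal sketch assuming labels in $\{-1, +1\}$ and the usual definitions; the function names are illustrative.

```python
import math


def zero_one_loss(margin):
    """1 for a mistake (margin <= 0), otherwise 0."""
    return 1.0 if margin <= 0 else 0.0


def hinge_loss(margin):
    """Zero once the margin reaches 1, linear penalty below that."""
    return max(0.0, 1.0 - margin)


def log_loss(margin):
    """Logistic regression loss for labels in {-1, +1}."""
    return math.log(1.0 + math.exp(-margin))


# Columns: margin, 0/1 loss, hinge loss, log loss.
for m in (-2.0, 0.0, 0.5, 1.0, 3.0):
    print(f"{m:5.1f}  {zero_one_loss(m):.0f}  {hinge_loss(m):.3f}  {log_loss(m):.3f}")
```

The hinge loss is exactly zero for margins of 1 or more, while the log loss is small but never zero.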

16 What about multiple classes?

17 One against all: Learn 3 classifiers, one per class, each separating that class from the other two.
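
A minimal sketch of one-against-all prediction, assuming each class has its own already-trained linear classifier `(w, b)` and that the predicted class is the one with the highest score. The class labels and weights below are hypothetical.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def predict_one_vs_all(classifiers, x):
    """classifiers maps label -> (w, b); return the label with the highest w.x + b."""
    def score(label):
        w, b = classifiers[label]
        return dot(w, x) + b
    return max(classifiers, key=score)


# Hypothetical weights for three classes in 2-D, for illustration only.
classifiers = {
    "+": ([1.0, 0.0], 0.0),
    "-": ([-1.0, 0.0], 0.0),
    "0": ([0.0, 1.0], -0.5),
}
print(predict_one_vs_all(classifiers, [2.0, 0.5]))
print(predict_one_vs_all(classifiers, [0.1, 3.0]))
```

Because the three classifiers are trained independently, their scores are not necessarily calibrated against each other, which motivates learning all weights jointly.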

18 Learn 1 classifier: Multiclass SVM. Simultaneously learn 3 sets of weights.

19 Learn 1 classifier: Multiclass SVM
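
One common way to write a multiclass SVM that learns all weight vectors jointly is sketched below. This particular form (one slack per example, a margin of 1 between the correct class and every other class) is an assumption; the slide's exact formulation is not shown here.

```latex
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \sum_{y} w^{(y)} \cdot w^{(y)} + C \sum_j \xi_j \\
\text{s.t.} \quad & w^{(y_j)} \cdot x_j + b^{(y_j)} \;\ge\; w^{(y')} \cdot x_j + b^{(y')} + 1 - \xi_j
  \quad \text{for all } y' \ne y_j, \\
& \xi_j \ge 0 .
\end{aligned}
```

Prediction then picks $\arg\max_y \; w^{(y)} \cdot x + b^{(y)}$.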

20 What you need to know: maximizing the margin; derivation of the SVM formulation; slack variables and hinge loss; relationship between SVMs and logistic regression (0/1 loss, hinge loss, log loss); tackling multiple classes (one against all, multiclass SVMs).

21 SVMs, Duality and the Kernel Trick. Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University, February 22nd, 2006.

22 SVMs reminder

23 You will now learn one of the most interesting and exciting recent advancements in machine learning: the kernel trick. High-dimensional feature spaces at no extra cost! But first, a detour: constrained optimization!

24 Constrained optimization

25 Lagrange multipliers: dual variables
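
A small worked example of Lagrange multipliers with an inequality constraint, assuming the toy problem of minimizing $x^2$ subject to $x \ge b$ (chosen here for illustration):

```latex
\begin{aligned}
&\min_x \; x^2 \quad \text{s.t.} \quad x \ge b, \\
&L(x, \alpha) = x^2 - \alpha\,(x - b), \qquad \alpha \ge 0, \\
&\frac{\partial L}{\partial x} = 2x - \alpha = 0 \;\Longrightarrow\; x = \frac{\alpha}{2}, \\
&\max_{\alpha \ge 0} \; L\!\left(\tfrac{\alpha}{2}, \alpha\right)
  = \max_{\alpha \ge 0} \left( -\frac{\alpha^2}{4} + \alpha b \right)
  \;\Longrightarrow\; \alpha^\ast = \max(0,\, 2b), \quad x^\ast = \max(0,\, b).
\end{aligned}
```

The multiplier $\alpha$ is the dual variable. It is positive only when the constraint is active ($b > 0$) and zero when the unconstrained minimum $x = 0$ already satisfies the constraint.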

26 Dual SVM derivation (1): the linearly separable case

27 Dual SVM derivation (2): the linearly separable case
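
The derivation for the separable case can be sketched as below. It assumes the primal objective is scaled as $\tfrac{1}{2}\, w \cdot w$ (a common convention that leaves the minimizer unchanged) and uses $\alpha_j \ge 0$ for the multiplier of example $j$.

```latex
\begin{aligned}
L(w, b, \alpha) &= \tfrac{1}{2}\, w \cdot w - \sum_j \alpha_j \bigl[\, y_j (w \cdot x_j + b) - 1 \,\bigr], \qquad \alpha_j \ge 0, \\
\frac{\partial L}{\partial w} = 0 &\;\Longrightarrow\; w = \sum_j \alpha_j y_j x_j, \\
\frac{\partial L}{\partial b} = 0 &\;\Longrightarrow\; \sum_j \alpha_j y_j = 0 .
\end{aligned}
```

Substituting these two conditions back into $L$ eliminates $w$ and $b$ and leaves a problem in $\alpha$ alone.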

28 Dual SVM interpretation: decision boundary $w \cdot x + b = 0$.

29 Dual SVM formulation: the linearly separable case
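
In its standard form (assuming the $\tfrac{1}{2}\, w \cdot w$ scaling of the primal and multipliers $\alpha_i \ge 0$), the dual for the separable case reads:

```latex
\begin{aligned}
\max_{\alpha} \quad & \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\; x_i \cdot x_j \\
\text{s.t.} \quad & \sum_i \alpha_i y_i = 0, \qquad \alpha_i \ge 0 .
\end{aligned}
```

The primal solution is recovered as $w = \sum_i \alpha_i y_i x_i$ and $b = y_k - w \cdot x_k$ for any $k$ with $\alpha_k > 0$. The data enter only through the dot products $x_i \cdot x_j$.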

30 Dual SVM derivation: the non-separable case

31 Dual SVM formulation: the non-separable case
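
With slack variables and penalty $C$, the standard dual differs from the separable case only in the constraint on $\alpha$. The sketch below assumes the primal is scaled as $\tfrac{1}{2}\, w \cdot w + C \sum_j \xi_j$.

```latex
\begin{aligned}
\max_{\alpha} \quad & \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\; x_i \cdot x_j \\
\text{s.t.} \quad & \sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C .
\end{aligned}
```

Here $w = \sum_i \alpha_i y_i x_i$ as before, and $b = y_k - w \cdot x_k$ for any $k$ with $0 < \alpha_k < C$.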

32 Why did we learn about the dual SVM? There are some quadratic programming algorithms that can solve the dual faster than the primal. But, more importantly, the kernel trick! Another little detour.

33 Reminder from last time: What if the data is not linearly separable? Use features of features of features of features. Feature space can get really large really quickly!

34 Higher-order polynomials. [Figure: number of monomial terms versus number of input dimensions, with curves for d = 2, d = 3, and d = 4.] With m input features and polynomial degree d, the number of terms grows fast: for d = 6, m = 100, there are about 1.6 billion terms.
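
The count quoted on this slide can be checked with a short Python sketch. It assumes the slide counts distinct monomials of degree exactly $d$ in $m$ variables, of which there are $\binom{m + d - 1}{d}$; the function name `num_monomials` is illustrative.

```python
import math


def num_monomials(m, d):
    """Number of distinct monomials of degree exactly d in m variables."""
    return math.comb(m + d - 1, d)


print(num_monomials(100, 6))  # 1609344100, about 1.6 billion
for d in (2, 3, 4):
    print(d, num_monomials(100, d))
```

Representing such a feature vector explicitly for every training example quickly becomes impractical.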

35 Dual formulation only depends on dot products, not on $w$!

36 Dot product of polynomials
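
For degree 2 in two dimensions, the dot product of the feature vectors collapses to a function of $u \cdot v$. A worked example, assuming the feature map $\Phi$ lists all ordered degree-2 monomials:

```latex
\begin{aligned}
\Phi(u) &= \bigl(u_1^2,\; u_1 u_2,\; u_2 u_1,\; u_2^2\bigr), \\
\Phi(u) \cdot \Phi(v) &= u_1^2 v_1^2 + 2\, u_1 u_2 v_1 v_2 + u_2^2 v_2^2 \\
&= (u_1 v_1 + u_2 v_2)^2 = (u \cdot v)^2 .
\end{aligned}
```

The same argument gives $\Phi(u) \cdot \Phi(v) = (u \cdot v)^d$ when $\Phi$ lists all ordered monomials of degree $d$.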

37 Finally: the kernel trick! Never represent features explicitly. Compute dot products in closed form. Constant-time high-dimensional dot products for many classes of features. Very interesting theory: Reproducing Kernel Hilbert Spaces. Not covered in detail in 10701/15781, more in 10702.

38 Polynomial kernels: All monomials of degree d in O(d) operations: How about all monomials of degree up to d? Solution 0: Better solution:
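
A minimal Python sketch that checks the polynomial kernel against an explicit feature map. It assumes the standard forms $K(u, v) = (u \cdot v)^d$ for degree exactly $d$ and $K(u, v) = (u \cdot v + 1)^d$ for degree up to $d$, since the slide's own formulas are not shown here; all function names are illustrative.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def explicit_features(u, d):
    """All ordered monomials of degree d: one product per index tuple."""
    feats = []
    for idx in product(range(len(u)), repeat=d):
        term = 1
        for i in idx:
            term *= u[i]
        feats.append(term)
    return feats


def poly_kernel(u, v, d):
    """K(u, v) = (u.v)^d, computed without building the feature vector."""
    return dot(u, v) ** d


def poly_kernel_up_to(u, v, d):
    """K(u, v) = (u.v + 1)^d, which covers (weighted) monomials of degree 0..d."""
    return (dot(u, v) + 1) ** d


u, v = (1, 2), (3, 4)
for d in (2, 3):
    explicit = dot(explicit_features(u, d), explicit_features(v, d))
    print(d, explicit, poly_kernel(u, v, d))
```

The explicit map has $m^d$ entries for $m$ inputs, while the kernel needs only one dot product and one power.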

39 Common kernels: polynomials of degree d; polynomials of degree up to d; Gaussian kernels; sigmoid.
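
The usual closed forms of these kernels are listed below as a sketch. The parameter names $\sigma$, $\eta$, and $\nu$ are notational assumptions, and other parameterizations are in common use.

```latex
\begin{aligned}
\text{Polynomial of degree } d: \quad & K(u, v) = (u \cdot v)^d \\
\text{Polynomial of degree up to } d: \quad & K(u, v) = (u \cdot v + 1)^d \\
\text{Gaussian:} \quad & K(u, v) = \exp\!\left( -\frac{\lVert u - v \rVert^2}{2 \sigma^2} \right) \\
\text{Sigmoid:} \quad & K(u, v) = \tanh(\eta\, u \cdot v + \nu)
\end{aligned}
```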

40 Overfitting? With a huge feature space from kernels, what about overfitting? Maximizing the margin leads to a sparse set of support vectors. Some interesting theory says that SVMs search for a simple hypothesis with a large margin. Often robust to overfitting.

41 What about classification time? For a new input $x$, if we need to represent $\Phi(x)$, we are in trouble! Recall the classifier: $\mathrm{sign}(w \cdot \Phi(x) + b)$. Using kernels, we are cool!

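42 SVMs with kernels: Choose a set of features and a kernel function. Solve the dual problem to obtain the support vectors (the examples with $\alpha_i > 0$). At classification time, compute $w \cdot \Phi(x) = \sum_i \alpha_i y_i K(x, x_i)$ and classify as $\mathrm{sign}(w \cdot \Phi(x) + b)$.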
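
A minimal sketch of kernelized classification, assuming the dual has already been solved and the support vectors, labels, multipliers, and offset are given. The toy values below are built by hand rather than learned, and the function names are illustrative.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


def decision_value(x, support_x, support_y, alphas, b, kernel):
    """Compute sum_i alpha_i * y_i * K(x_i, x) + b without forming w or Phi(x)."""
    total = sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, support_y, support_x))
    return total + b


def classify(x, support_x, support_y, alphas, b, kernel):
    return 1 if decision_value(x, support_x, support_y, alphas, b, kernel) >= 0 else -1


# Hand-built toy solution (not learned here): two support vectors on the x-axis.
support_x = [(1.0, 0.0), (-1.0, 0.0)]
support_y = [1, -1]
alphas = [0.5, 0.5]
b = 0.0

print(classify((2.0, 3.0), support_x, support_y, alphas, b, linear_kernel))
print(classify((-0.5, 1.0), support_x, support_y, alphas, b, linear_kernel))
```

Swapping `linear_kernel` for any other kernel function changes the feature space without changing the prediction code, and the cost scales with the number of support vectors rather than the feature dimension.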

43 What's the difference between SVMs and logistic regression?
Loss function: SVMs, hinge loss; logistic regression, log loss.
High-dimensional features with kernels: SVMs, yes! Logistic regression, no.

44 Kernels in logistic regression: Define the weights in terms of support vectors, then derive a simple gradient descent rule on $\alpha_i$.
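
A minimal sketch of kernelized logistic regression, assuming the weights are written as $w = \sum_i \alpha_i \Phi(x_i)$ so that $w \cdot \Phi(x) = \sum_i \alpha_i K(x_i, x)$, labels are in $\{0, 1\}$, and training is plain batch gradient descent on the log loss. The learning rate, step count, and toy data are arbitrary choices for illustration, not the lecture's update rule.

```python
import math


def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, train_x, alphas, b, kernel):
    """P(y = 1 | x) using w.Phi(x) = sum_i alpha_i * K(x_i, x)."""
    return sigmoid(sum(a * kernel(xi, x) for a, xi in zip(alphas, train_x)) + b)


def fit(train_x, train_y, kernel, lr=0.05, steps=100):
    """Batch gradient descent on the log loss with respect to alpha and b."""
    n = len(train_x)
    gram = [[kernel(train_x[i], train_x[j]) for j in range(n)] for i in range(n)]
    alphas, b = [0.0] * n, 0.0
    for _ in range(steps):
        probs = [sigmoid(sum(alphas[i] * gram[i][j] for i in range(n)) + b) for j in range(n)]
        errors = [probs[j] - train_y[j] for j in range(n)]
        # d(loss)/d(alpha_k) = sum_j (p_j - y_j) * K(x_k, x_j)
        grads = [sum(errors[j] * gram[k][j] for j in range(n)) for k in range(n)]
        alphas = [alphas[k] - lr * grads[k] for k in range(n)]
        b -= lr * sum(errors)
    return alphas, b


train_x = [(-2.0,), (-1.0,), (1.0,), (2.0,)]
train_y = [0, 0, 1, 1]
alphas, b = fit(train_x, train_y, linear_kernel)
print([round(predict_proba(x, train_x, alphas, b, linear_kernel), 3) for x in train_x])
```

Every training example generally ends up with a nonzero $\alpha_i$ here, in contrast to the sparse support-vector solution of an SVM.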

45 What's the difference between SVMs and logistic regression? (Revisited)
Loss function: SVMs, hinge loss; logistic regression, log loss.
High-dimensional features with kernels: SVMs, yes! Logistic regression, yes!
Solution sparse: SVMs, often yes! Logistic regression, almost always no!

46 What you need to know: dual SVM formulation and how it's derived; the kernel trick; deriving the polynomial kernel; common kernels; kernelized logistic regression; differences between SVMs and logistic regression.

47 Acknowledgment: SVM applet:
