Multi-class SVMs. Lecture 17: Aykut Erdem April 2016 Hacettepe University

1 Multi-class SVMs Lecture 17: Aykut Erdem April 2016 Hacettepe University

2 Administrative
We will have a make-up lecture on Saturday, April 23. Project progress reports are due April 21, days left! See classes/spring2016/bbm406/project.html

3 Recap: Support Vector Machines
Linear function $f(x) = \langle w, x \rangle + b$, with $\langle w, x \rangle + b \le -1$ for one class and $\langle w, x \rangle + b \ge 1$ for the other.

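4 Recap: Support Vector Machines
Margin hyperplanes: $\langle w, x \rangle + b = -1$ and $\langle w, x \rangle + b = 1$, with normal vector $w$.
Optimization problem: $\max_{w,b} \frac{1}{\|w\|}$ subject to $y_i [\langle x_i, w \rangle + b] \ge 1$.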

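5 Recap: Support Vector Machines
Margin hyperplanes: $\langle w, x \rangle + b = -1$ and $\langle w, x \rangle + b = 1$, with normal vector $w$.
Optimization problem: $\min_{w,b} \frac{1}{2} \|w\|^2$ subject to $y_i [\langle x_i, w \rangle + b] \ge 1$.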

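6 Recap: Support Vector Machines
Primal: $\min_{w,b} \frac{1}{2} \|w\|^2$ subject to $y_i [\langle x_i, w \rangle + b] \ge 1$.
Weight vector: $w = \sum_i y_i \alpha_i x_i$.
Dual: $\max_{\alpha} -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$.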

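7 Recap: Large Margin Classifier
Margin hyperplanes: $\langle w, x \rangle + b = -1$ and $\langle w, x \rangle + b = 1$, with normal vector $w$.
Support vectors: $\alpha_i > 0 \implies y_i [\langle w, x_i \rangle + b] = 1$.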

8 Recap: Soft-margin Classifier
$\langle w, x \rangle + b \le -1$ and $\langle w, x \rangle + b \ge 1$.
Finding the minimum error separator is impractical.
Theorem (Minsky & Papert): Finding the minimum error separating hyperplane is NP-hard.

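9 Recap: Adding Slack Variables
Slack variables $\xi_i \ge 0$ relax the margin constraints: $\langle w, x \rangle + b \le -1 + \xi$ and $\langle w, x \rangle + b \ge 1 - \xi$.
Convex optimization problem: minimize the amount of slack.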

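10 Recap: Adding Slack Variables
For $0 < \xi_i \le 1$ the point lies between the margin and the decision boundary and is still correctly classified; for $\xi_i > 1$ the point is misclassified.
Constraints: $\langle w, x \rangle + b \le -1 + \xi$ and $\langle w, x \rangle + b \ge 1 - \xi$.
Convex optimization problem: minimize the amount of slack. (adapted from Andrew Zisserman)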
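The slack interpretation can be checked numerically. The following is a minimal sketch, assuming a fixed linear classifier; the weights, the sample points, and the helper names `slack` and `describe` are illustrative and not taken from the lecture.

```python
def slack(w, b, x, y):
    """Smallest xi >= 0 that satisfies y * (<w, x> + b) >= 1 - xi."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def describe(xi):
    """Map a slack value to the cases named on the slide."""
    if xi == 0.0:
        return "on or outside the margin"
    if xi <= 1.0:
        return "inside the margin"
    return "misclassified"


# Hypothetical classifier and three positive examples.
w, b = (1.0, 0.0), 0.0
for x in [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0)]:
    xi = slack(w, b, x, +1)
    print(x, xi, describe(xi))
```

The three points produce slack 0, 0.5 and 2, covering each of the three cases.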

11 Adding Slack Variables
Hard margin problem: $\min_{w,b} \frac{1}{2} \|w\|^2$ subject to $y_i [\langle w, x_i \rangle + b] \ge 1$.
With slack variables: $\min_{w,b} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i$ subject to $y_i [\langle w, x_i \rangle + b] \ge 1 - \xi_i$ and $\xi_i \ge 0$.
Problem is always feasible. Proof: $w = 0$ and $b = 0$ and $\xi_i = 1$ (also yields upper bound).
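The feasibility argument can be verified directly. The sketch below assumes a small XOR-style dataset (not linearly separable) and an arbitrary value of C; the helper names are hypothetical.

```python
def is_feasible(w, b, xi, data):
    """Check y_i * (<w, x_i> + b) >= 1 - xi_i and xi_i >= 0 for all i."""
    for (x, y), s in zip(data, xi):
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        if s < 0 or y * score < 1 - s:
            return False
    return True


def objective(w, xi, C):
    """Soft-margin objective 0.5 * ||w||^2 + C * sum_i xi_i."""
    return 0.5 * sum(wj * wj for wj in w) + C * sum(xi)


# XOR labels: no hyperplane separates these points without slack.
data = [((0.0, 0.0), -1), ((1.0, 1.0), -1), ((0.0, 1.0), 1), ((1.0, 0.0), 1)]

# The trivial point from the proof: w = 0, b = 0, xi_i = 1.
w0, b0, xi0 = (0.0, 0.0), 0.0, [1.0] * len(data)
print(is_feasible(w0, b0, xi0, data))  # True
print(objective(w0, xi0, C=10.0))      # 40.0, i.e. C times the number of points
```

Because this point is always feasible, its objective value, C times the number of training points, bounds the optimal value from above.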

12 Soft-margin classifier
Optimisation problem: $\min_{w,b} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i$ subject to $y_i [\langle w, x_i \rangle + b] \ge 1 - \xi_i$ and $\xi_i \ge 0$.
C is a regularization parameter:
- small C allows constraints to be easily ignored (large margin)
- large C makes constraints hard to ignore (narrow margin)
- $C = \infty$ enforces all constraints (hard margin)
(adapted from Andrew Zisserman)

13 Demo time

14 This week
- Multi-class classification
- Introduction to kernels

15 Multi-class classification (slide by Eric Xing)

16 Multi-class classification
Real-world problems often have multiple classes: text, speech, image, biological sequences.
Algorithms studied so far were designed for binary classification problems.
How do we design multi-class classification algorithms?
- Can the algorithms used for binary classification be generalized to multi-class classification?
- Can we reduce multi-class classification to binary classification?
(slide by Eric Xing)

17 Multi-class classification (slide by Eric Xing)

18 Multi-class classification (slide by Eric Xing)

19 One versus all classification
Learn 3 classifiers:
- $-$ vs. $\{o, +\}$, weights $w_-$
- $+$ vs. $\{o, -\}$, weights $w_+$
- $o$ vs. $\{+, -\}$, weights $w_o$
Predict the label using the scores of the three classifiers. Any problems? Could we learn this dataset?
(slide by Eric Xing)
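The prediction formula did not survive in this transcript. A common rule for one-versus-all, assumed here rather than taken from the slide, is to pick the class whose classifier gives the highest score $\langle w_k, x \rangle + b_k$. The weights below are hypothetical placeholders, not trained values.

```python
def predict_one_vs_all(classifiers, x):
    """Return the label whose linear classifier scores x highest.

    classifiers maps each label to a (w, b) pair; argmax over scores is an
    assumed prediction rule, since the slide's own formula is missing.
    """
    def score(label):
        w, b = classifiers[label]
        return sum(wj * xj for wj, xj in zip(w, x)) + b

    return max(classifiers, key=score)


# Hypothetical weights for the three classes on the slide.
classifiers = {
    "+": ((1.0, 0.0), 0.0),
    "-": ((-1.0, 0.0), 0.0),
    "o": ((0.0, 1.0), 0.0),
}

print(predict_one_vs_all(classifiers, (2.0, 0.5)))   # +
print(predict_one_vs_all(classifiers, (-1.0, 0.0)))  # -
print(predict_one_vs_all(classifiers, (0.1, 3.0)))   # o
```

One likely answer to "Any problems?" is that the three classifiers are trained independently, so nothing forces their scores to be comparable with each other.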

20 Multi-class SVM
Simultaneously learn 3 sets of weights: $w_+$, $w_-$, $w_o$.
How do we guarantee the correct labels? Need new constraints!
The "score" of the correct class must be better than the "score" of the wrong classes.
(slide by Eric Xing)

21 Multi-class SVM
As for the SVM, we introduce slack variables and maximize the margin. Prediction again uses the class scores. Now can we learn it?
(slide by Eric Xing)
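The formulas on this slide are missing from the transcript. One standard way to write a multi-class SVM with slack variables, given here as an assumption and not as the slide's exact formulation, keeps one weight vector $w_y$ and bias $b_y$ per class and requires the correct class to win by a margin of 1, up to slack $\xi_i$:

```latex
\[
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2} \sum_{y} \|w_y\|^2 + C \sum_i \xi_i \\
\text{subject to} \quad & \langle w_{y_i}, x_i \rangle + b_{y_i}
    \ge \langle w_y, x_i \rangle + b_y + 1 - \xi_i
    \quad \text{for all } i \text{ and all } y \ne y_i, \\
& \xi_i \ge 0 \quad \text{for all } i.
\end{aligned}
\]

\[
\hat{y}(x) = \arg\max_{y} \; \langle w_y, x \rangle + b_y
\]
```

With two classes this has the same structure as the soft-margin problem on slide 12: a squared-norm term plus C times the total slack.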

22 Kernels

23 Solving XOR
Feature map: $(x_1, x_2) \mapsto (x_1, x_2, x_1 x_2)$.
XOR is not linearly separable. Mapping into 3 dimensions makes it easily solvable.
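A minimal sketch of the mapping, assuming inputs encoded as 0/1 and labels as +1/-1. The weight vector and bias are hand-picked for illustration and are not from the lecture.

```python
def phi(x1, x2):
    """Map a 2-D point to (x1, x2, x1 * x2)."""
    return (x1, x2, x1 * x2)


# Hand-picked separating hyperplane in the 3-D feature space (an assumption).
w = (1.0, 1.0, -2.0)
b = -0.5


def f(x1, x2):
    """Linear function in feature space: <w, phi(x)> + b."""
    return sum(wj * zj for wj, zj in zip(w, phi(x1, x2))) + b


# XOR: label +1 when exactly one input is 1.
xor_data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
print(all(y * f(*x) > 0 for x, y in xor_data))  # True
```

The product feature $x_1 x_2$ is what lets a single hyperplane push the point (1, 1) back to the negative side.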

24 Quadratic Features
Quadratic features in $\mathbb{R}^2$: $\Phi(x) := \left( x_1^2, \sqrt{2}\, x_1 x_2, x_2^2 \right)$.
Dot product: $\langle \Phi(x), \Phi(x') \rangle = \left\langle \left( x_1^2, \sqrt{2}\, x_1 x_2, x_2^2 \right), \left( x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2 \right) \right\rangle = \langle x, x' \rangle^2$.
Insight: the trick works for any polynomials of order $d$ via $\langle x, x' \rangle^d$.
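The identity can be checked numerically. This is a small sketch with two arbitrarily chosen points; the helper names are illustrative.

```python
import math


def phi(x):
    """Explicit quadratic feature map for a point in R^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


x, xp = (1.0, 2.0), (3.0, -1.0)

explicit = dot(phi(x), phi(xp))  # dot product in feature space
implicit = dot(x, xp) ** 2       # kernel evaluated in input space

print(math.isclose(explicit, implicit))  # True
```

The comparison uses `math.isclose` because $\sqrt{2}$ is not exactly representable in floating point.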

25 Linear Separation with Quadratic Kernels

26 Computational Efficiency
Problem: Extracting features can sometimes be very costly. Example: second order features in 1000 dimensions. This leads to about $5 \cdot 10^5$ numbers. For higher order polynomial features it is much worse.
Solution: Don't compute the features; try to compute dot products implicitly. For some features this works...
Definition: A kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function in its arguments for which the following property holds: $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ for some feature map $\Phi$.
If $k(x, x')$ is much cheaper to compute than $\Phi(x)$...
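To make the cost concrete, the sketch below counts distinct monomials of exact degree d in n variables, which is the binomial coefficient C(n + d - 1, d). Treating "second order features" as these monomials is an assumption about what the slide counts.

```python
import math


def num_monomials(n, d):
    """Number of distinct monomials of exact degree d in n variables."""
    return math.comb(n + d - 1, d)


print(num_monomials(1000, 2))  # 500500
print(num_monomials(1000, 3))  # 167167000
```

By contrast, evaluating $\langle x, x' \rangle^d$ needs only one 1000-dimensional dot product followed by a power.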

27 Recap: The Perceptron
initialize $w = 0$ and $b = 0$
repeat
  if $y_i [\langle w, x_i \rangle + b] \le 0$ then
    $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
  end if
until all classified correctly
Nothing happens if classified correctly.
Weight vector is a linear combination: $w = \sum_{i \in I} y_i x_i$.
Classifier is a linear combination of inner products: $f(x) = \sum_{i \in I} y_i \langle x_i, x \rangle + b$.
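The update rule can be sketched in a few lines. The epoch cap `max_epochs` and the toy dataset are assumptions added so the loop always terminates; the slide's version simply repeats until every point is classified correctly.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def perceptron(data, max_epochs=100):
    """Train a perceptron on (x, y) pairs with labels y in {-1, +1}."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if y * (dot(w, x) + b) <= 0:  # misclassified or on the boundary
                w = [wj + y * xj for wj, xj in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # all points classified correctly
            break
    return w, b


# Linearly separable toy data (hypothetical).
data = [((2.0, 1.0), 1), ((1.0, 3.0), 1), ((-1.0, -1.0), -1), ((-2.0, 0.0), -1)]
w, b = perceptron(data)
print(w, b)  # [2.0, 1.0] 1.0
```

Only the first point triggers an update here, so the learned weight vector is exactly that point times its label, as the linear-combination remark on the slide predicts.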

28 Recap: The Perceptron on features
initialize $w = 0$ and $b = 0$
repeat
  pick $(x_i, y_i)$ from data
  if $y_i (\langle w, \Phi(x_i) \rangle + b) \le 0$ then
    $w \leftarrow w + y_i \Phi(x_i)$ and $b \leftarrow b + y_i$
until $y_i (\langle w, \Phi(x_i) \rangle + b) > 0$ for all $i$
Nothing happens if classified correctly.
Weight vector is a linear combination: $w = \sum_{i \in I} y_i \Phi(x_i)$.
Classifier is a linear combination of inner products: $f(x) = \sum_{i \in I} y_i \langle \Phi(x_i), \Phi(x) \rangle + b$.

29 The Kernel Perceptron
initialize $f = 0$
repeat
  pick $(x_i, y_i)$ from data
  if $y_i f(x_i) \le 0$ then
    $f(\cdot) \leftarrow f(\cdot) + y_i k(x_i, \cdot) + y_i$
until $y_i f(x_i) > 0$ for all $i$
Nothing happens if classified correctly.
Weight vector is a linear combination: $w = \sum_{i \in I} y_i \Phi(x_i)$.
Classifier is a linear combination of inner products: $f(x) = \sum_{i \in I} y_i \langle \Phi(x_i), \Phi(x) \rangle + b = \sum_{i \in I} y_i k(x_i, x) + b$.
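A minimal sketch of the kernel perceptron, assuming the quadratic kernel $k(x, x') = \langle x, x' \rangle^2$ and XOR data encoded with inputs and labels in {-1, +1}. The mistake counters `alpha`, the epoch cap, and the function names are illustrative choices, not part of the lecture.

```python
def quadratic_kernel(x, xp):
    """k(x, x') = <x, x'>^2."""
    return sum(a * b for a, b in zip(x, xp)) ** 2


def decision(alpha, b, data, kernel, x):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b."""
    return sum(a * yi * kernel(xi, x) for a, (xi, yi) in zip(alpha, data)) + b


def kernel_perceptron(data, kernel, max_epochs=100):
    """alpha[i] counts how often example i triggered an update."""
    alpha = [0] * len(data)
    b = 0
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(data):
            if y * decision(alpha, b, data, kernel, x) <= 0:
                alpha[i] += 1  # adds y_i * k(x_i, .) to f
                b += y         # adds y_i to f
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b


# XOR with +/-1 encoding: not linearly separable in the input space.
data = [((1, 1), -1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]
alpha, b = kernel_perceptron(data, quadratic_kernel)
print(alpha, b)  # [1, 1, 0, 0] 0
```

The weight vector is never formed explicitly; the classifier is stored as one coefficient per training example, which is what allows the feature map to stay implicit.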
