Lecture 17: Multi-class SVMs. Aykut Erdem, April 2016, Hacettepe University
Administrative: We will have a make-up lecture on Saturday, April 23, 2016. Project progress reports are due April 21, 2016 (2 days left!). See http://web.cs.hacettepe.edu.tr/~aykut/classes/spring2016/bbm406/project.html
Recap: Support Vector Machines. Linear function $f(x) = \langle w, x \rangle + b$; the margin constraints are $\langle w, x \rangle + b \le -1$ on one side and $\langle w, x \rangle + b \ge 1$ on the other.
Recap: Support Vector Machines. Margin hyperplanes $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$. Optimization problem: maximize over $w, b$ the margin $\frac{1}{\|w\|}$ subject to $y_i[\langle x_i, w \rangle + b] \ge 1$.
Recap: Support Vector Machines. Equivalently (maximizing $1/\|w\|$ is the same as minimizing $\|w\|$, and squaring gives a convex quadratic objective): minimize over $w, b$ the quantity $\frac{1}{2}\|w\|^2$ subject to $y_i[\langle x_i, w \rangle + b] \ge 1$.
Recap: Support Vector Machines. Primal: minimize over $w, b$ the quantity $\frac{1}{2}\|w\|^2$ subject to $y_i[\langle x_i, w \rangle + b] \ge 1$. Substituting the expansion $w = \sum_i y_i \alpha_i x_i$ yields the dual: maximize over $\alpha$ the objective $-\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$ for all $i$.
Recap: Large Margin Classifier. Margin hyperplanes $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$; the points with $\alpha_i > 0$ are the support vectors.
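The expansion $w = \sum_i y_i \alpha_i x_i$ can be checked numerically. Below is a minimal sketch (my addition, using scikit-learn and numpy, which are not part of the slides): for a linear kernel, scikit-learn's SVC stores $y_i \alpha_i$ for the support vectors in dual_coef_, and combining these with the support vectors recovers the weight vector.

```python
# Minimal sketch (scikit-learn + numpy; toy data made up for illustration).
# Checks w = sum_i y_i * alpha_i * x_i, summing over support vectors only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates hard margin

# dual_coef_ holds y_i * alpha_i for the points with alpha_i > 0
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))    # True: the same hyperplane
```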
Recap: Soft-margin Classifier. A perfect separator is impossible here, so we look for a minimum error separator. Theorem (Minsky & Papert): finding the minimum error separating hyperplane is NP-hard.
Recap: Adding Slack Variables. Introduce $\xi_i \ge 0$ and relax the constraints to $\langle w, x \rangle + b \le -1 + \xi$ and $\langle w, x \rangle + b \ge 1 - \xi$. This gives a convex optimization problem: minimize the amount of slack.
Recap: Adding Slack Variables. For $0 < \xi \le 1$ the point lies inside the margin but is still correctly classified; for $\xi > 1$ the point is misclassified. Constraints: $\langle w, x \rangle + b \le -1 + \xi$ and $\langle w, x \rangle + b \ge 1 - \xi$. Convex optimization problem: minimize the amount of slack. (adopted from Andrew Zisserman)
Adding Slack Variables. Hard margin problem: minimize over $w, b$ the quantity $\frac{1}{2}\|w\|^2$ subject to $y_i[\langle w, x_i \rangle + b] \ge 1$. With slack variables: minimize over $w, b$ the quantity $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i[\langle w, x_i \rangle + b] \ge 1 - \xi_i$ and $\xi_i \ge 0$. The problem is always feasible. Proof: $w = 0$, $b = 0$, and $\xi_i = 1$ satisfy all constraints (and also yield an upper bound on the optimal objective).
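At the optimum each slack variable takes the smallest feasible value, $\xi_i = \max(0, 1 - y_i(\langle w, x_i \rangle + b))$, i.e. the hinge loss. A tiny numpy sketch (my addition; the weight vector and points are made up) shows the three regimes:

```python
# Minimal sketch (numpy; toy w, b, and data chosen for illustration).
# Slack at the optimum equals the hinge loss: xi_i = max(0, 1 - y_i * f(x_i)).
import numpy as np

w, b = np.array([1.0, -1.0]), 0.5
X = np.array([[2.0, 0.0], [0.5, 0.5], [-1.0, 1.0]])
y = np.array([1, 1, 1])

f = X @ w + b                       # decision values f(x_i)
xi = np.maximum(0.0, 1.0 - y * f)   # slack needed by each point
print(xi)  # [0.  0.5 2.5]: outside margin, inside margin, misclassified
```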
Soft-margin classifier. Optimization problem: minimize over $w, b$ the quantity $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i[\langle w, x_i \rangle + b] \ge 1 - \xi_i$ and $\xi_i \ge 0$. Here $C$ is a regularization parameter: a small $C$ allows constraints to be easily ignored (large margin); a large $C$ makes constraints hard to ignore (narrow margin); $C = \infty$ enforces all constraints (hard margin). (adopted from Andrew Zisserman)
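The effect of $C$ can be observed directly. A sketch (my addition, assuming scikit-learn; the data and the particular $C$ values are arbitrary):

```python
# Minimal sketch (scikit-learn): smaller C tolerates slack (wider margin),
# larger C approaches the hard-margin solution.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(30, 2) + [1.5, 1.5], rng.randn(30, 2) - [1.5, 1.5]])
y = np.array([1] * 30 + [-1] * 30)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_)   # geometric margin width
    print(f"C={C:>6}: margin width={width:.3f}, #SV={len(clf.support_)}")
```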
Demo time
This week: multi-class classification, and an introduction to kernels.
Multi-class classification (slide by Eric Xing)
Multi-class classification. Real-world problems often have multiple classes: text, speech, image, biological sequences. The algorithms studied so far are designed for binary classification problems. How do we design multi-class classification algorithms? Can the algorithms used for binary classification be generalized to multi-class classification? Can we reduce multi-class classification to binary classification? (slide by Eric Xing)
Multi-class classification (figures only; slides by Eric Xing)
One versus all classification. Learn 3 classifiers: $-$ vs. $\{o, +\}$ with weights $w_-$; $+$ vs. $\{o, -\}$ with weights $w_+$; $o$ vs. $\{+, -\}$ with weights $w_o$. Predict the label using $\hat{y} = \arg\max_i \left(\langle w_i, x \rangle + b_i\right)$. Any problems? Could we learn this dataset? A code sketch of the reduction follows. (slide by Eric Xing)
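The one-vs-all reduction is a few lines of code. A sketch (my addition; the toy data and the choice of a linear SVM as the binary learner are my own):

```python
# Minimal sketch (numpy + scikit-learn): train one binary SVM per class,
# then predict with the argmax of the per-class scores.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + c for c in ([0, 4], [4, 0], [-4, -4])])
y = np.repeat([0, 1, 2], 20)

classifiers = []
for c in [0, 1, 2]:
    yc = np.where(y == c, 1, -1)                 # class c vs. the rest
    classifiers.append(SVC(kernel="linear").fit(X, yc))

scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
y_hat = np.argmax(scores, axis=1)                # label with the highest score
print((y_hat == y).mean())                       # training accuracy
```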
Multi-class SVM. Simultaneously learn 3 sets of weights ($w_+$, $w_-$, $w_o$). How do we guarantee the correct labels? We need new constraints: the "score" of the correct class must be better than the "score" of every wrong class, i.e. $\langle w_{y_i}, x_i \rangle + b_{y_i} \ge \langle w_{y'}, x_i \rangle + b_{y'} + 1$ for all $y' \ne y_i$. (slide by Eric Xing)
Multi-class SVM. As for the binary SVM, we introduce slack variables and maximize the margin: minimize over $w, b, \xi$ the quantity $\frac{1}{2}\sum_y \|w_y\|^2 + C \sum_i \xi_i$ subject to $\langle w_{y_i}, x_i \rangle + b_{y_i} \ge \langle w_{y'}, x_i \rangle + b_{y'} + 1 - \xi_i$ for all $y' \ne y_i$, and $\xi_i \ge 0$. To predict, we use $\hat{y} = \arg\max_y \left(\langle w_y, x \rangle + b_y\right)$. Now can we learn it? (slide by Eric Xing)
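scikit-learn ships a joint multi-class SVM of this kind. A sketch (my addition; I am assuming the Crammer-Singer variant in LinearSVC is an acceptable stand-in for the slide's formulation):

```python
# Minimal sketch (scikit-learn): the Crammer-Singer multi-class SVM learns all
# class weight vectors jointly, under score-gap constraints like those above.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + c for c in ([0, 4], [4, 0], [-4, -4])])
y = np.repeat([0, 1, 2], 20)

clf = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)
print(clf.coef_.shape)             # (3, 2): one weight vector per class
print(clf.predict([[3.5, 0.5]]))   # argmax over the three class scores
```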
Kernels
Solving XOR. XOR is not linearly separable in the original coordinates $(x_1, x_2)$, but the mapping $(x_1, x_2) \mapsto (x_1, x_2, x_1 x_2)$ into 3 dimensions makes it easily solvable.
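A numpy check (my addition; the choice of separating weight vector is mine) that the third coordinate alone separates XOR:

```python
# Minimal sketch (numpy): the extra coordinate x1*x2 makes XOR linearly separable.
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])                     # XOR-style labels

phi = np.column_stack([X, X[:, 0] * X[:, 1]])    # map to (x1, x2, x1*x2)
w = np.array([0.0, 0.0, 1.0])                    # separating plane: sign(x1*x2)
print(np.sign(phi @ w) == y)                     # all True
```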
Quadratic Features. Quadratic features in $\mathbb{R}^2$: $\phi(x) := \left(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\right)$. Dot product: $\langle \phi(x), \phi(x') \rangle = \left\langle \left(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\right), \left(x_1'^2, \sqrt{2}\,x_1' x_2', x_2'^2\right) \right\rangle = \langle x, x' \rangle^2$. Insight: the trick works for any polynomial of order $d$ via $\langle x, x' \rangle^d$.
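The identity is easy to verify numerically (a sketch, my addition):

```python
# Minimal sketch (numpy): checks <phi(x), phi(x')> == <x, x'>**2
# for phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp), (x @ xp) ** 2)   # both equal 1.0
```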
Linear Separation with Quadratic Kernels
Computational Efficiency. Problem: extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions lead to $5.005 \times 10^5$ numbers; for higher-order polynomial features it is much worse. Solution: don't compute the features, try to compute dot products implicitly. For some features this works... Definition: a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function in its arguments for which the property $k(x, x') = \langle \phi(x), \phi(x') \rangle$ holds for some feature map $\phi$. If $k(x, x')$ is much cheaper to compute than $\phi(x)$...
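The cost gap is easy to see in code. A sketch (my addition; the dimension matches the slide's example):

```python
# Minimal sketch (numpy): the polynomial kernel k(x, x') = <x, x'>**2 costs one
# 1000-dimensional dot product, versus ~5e5 explicit second-order features.
import numpy as np

d = 1000
n_quadratic_features = d * (d + 1) // 2   # 500500, i.e. 5.005e5 numbers
rng = np.random.RandomState(0)
x, xp = rng.randn(d), rng.randn(d)

k = (x @ xp) ** 2    # implicit computation: O(d) work, no feature vector built
print(n_quadratic_features, k)
```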
Recap: The Perceptron. Initialize $w = 0$ and $b = 0$. Repeat: if $y_i[\langle w, x_i \rangle + b] \le 0$ then $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$; until all points are classified correctly. Nothing happens if a point is classified correctly. The weight vector is a linear combination $w = \sum_{i \in I} y_i x_i$, so the classifier is a linear combination of inner products: $f(x) = \sum_{i \in I} y_i \langle x_i, x \rangle + b$.
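The pseudocode above translates directly to Python. A sketch (my addition; the toy data is made up):

```python
# Minimal sketch (numpy) of the perceptron from the slide.
import numpy as np

def perceptron(X, y, max_epochs=100):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:    # misclassified (or on the boundary)
                w, b = w + yi * xi, b + yi
                mistakes += 1
        if mistakes == 0:                 # all classified correctly
            return w, b
    return w, b                           # give up after max_epochs

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```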
Recap: The Perceptron on Features. Initialize $w = 0$ and $b = 0$. Repeat: pick $(x_i, y_i)$ from the data; if $y_i(\langle w, \phi(x_i) \rangle + b) \le 0$ then $w \leftarrow w + y_i \phi(x_i)$ and $b \leftarrow b + y_i$; until $y_i(\langle w, \phi(x_i) \rangle + b) > 0$ for all $i$. Nothing happens if a point is classified correctly. The weight vector is a linear combination $w = \sum_{i \in I} y_i \phi(x_i)$, so the classifier is a linear combination of inner products: $f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b$.
The Kernel Perceptron. Initialize $f = 0$. Repeat: pick $(x_i, y_i)$ from the data; if $y_i f(x_i) \le 0$ then $f(\cdot) \leftarrow f(\cdot) + y_i k(x_i, \cdot) + y_i$; until $y_i f(x_i) > 0$ for all $i$. Nothing happens if a point is classified correctly. The weight vector is a linear combination $w = \sum_{i \in I} y_i \phi(x_i)$, and the classifier is a linear combination of kernel evaluations: $f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b = \sum_{i \in I} y_i k(x_i, x) + b$.
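In the kernel version, $f$ is represented by per-example update counts $\alpha_i$ rather than an explicit $w$. A sketch (my addition; the quadratic kernel and the XOR data are my own choices):

```python
# Minimal sketch (numpy) of the kernel perceptron:
# f(x) = sum_j alpha_j * y_j * k(x_j, x) + b.
import numpy as np

def kernel_perceptron(X, y, k, max_epochs=100):
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    K = np.array([[k(xi, xj) for xj in X] for xi in X])  # Gram matrix
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            f_i = (alpha * y) @ K[:, i] + b
            if y[i] * f_i <= 0:        # update f <- f + y_i * k(x_i, .) + y_i
                alpha[i] += 1.0
                b += y[i]
                mistakes += 1
        if mistakes == 0:              # y_i * f(x_i) > 0 for all i
            break
    return alpha, b

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])
alpha, b = kernel_perceptron(X, y, k=lambda u, v: (u @ v) ** 2)
K = np.array([[(u @ v) ** 2 for v in X] for u in X])
print(np.sign((alpha * y) @ K + b) == y)   # all True: XOR solved via the kernel
```

With the quadratic kernel this converges on XOR, which no perceptron on the raw two-dimensional inputs can do.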