Support Vector Machine II

1 Support Vector Machine II. Jia-Bin Huang, ECE-5424G / CS-5824, Virginia Tech, Spring 2019

2 Administrative: HW 1 due tonight. HW 2 released.

3 Online Scalable Learning Adaptive to Unknown Dynamics and Graphs. Yanning Shen, University of Minnesota, Twin Cities. Tuesday, February 19, 9:00am - 10:15am, 1110 Knowledgeworks II.
Declarative Learning-based Programming for Learning and Reasoning over Spatial Language. Parisa Kordjamshidi, Assistant Professor, Tulane University. Thursday, February 21, 9:00am - 10:15am, 110 McBryde.

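4 Logistic regression (logistic loss):
$$\min_\theta \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \left(-\log h_\theta(x^{(i)})\right) + (1 - y^{(i)}) \left(-\log\left(1 - h_\theta(x^{(i)})\right)\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
Support vector machine (hinge loss):
$$\min_\theta \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \, \text{cost}_1(\theta^\top x^{(i)}) + (1 - y^{(i)}) \, \text{cost}_0(\theta^\top x^{(i)}) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
[Plots: per-example cost as a function of $\theta^\top x$, one panel for y = 1 and one for y = 0.]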

5 Support Vector Machine: Cost function; Large margin classification; Kernels; Using an SVM

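6 Support vector machine
$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \text{cost}_1(\theta^\top x^{(i)}) + (1 - y^{(i)}) \, \text{cost}_0(\theta^\top x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$
[Plots: $\text{cost}_1(\theta^\top x)$ for y = 1 and $\text{cost}_0(\theta^\top x)$ for y = 0, each against $\theta^\top x$.]
If y = 1, we want $\theta^\top x \ge 1$ (not just $\ge 0$).
If y = 0, we want $\theta^\top x \le -1$ (not just $< 0$).
Slide credit: Andrew Ng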
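
The slide names cost_1 and cost_0 without giving formulas. A minimal sketch, assuming the standard hinge forms cost_1(z) = max(0, 1 - z) and cost_0(z) = max(0, 1 + z), which match the "want θᵀx ≥ 1" and "want θᵀx ≤ -1" targets above. The function names are illustrative, not from the lecture.

```python
def cost1(z):
    """Cost for a positive example (y = 1): zero once z = theta^T x >= 1."""
    return max(0.0, 1.0 - z)


def cost0(z):
    """Cost for a negative example (y = 0): zero once z = theta^T x <= -1."""
    return max(0.0, 1.0 + z)


def svm_objective(theta, X, y, C):
    """C * sum of per-example costs + 0.5 * sum_{j>=1} theta_j^2.

    Assumption: each row of X starts with the constant feature 1, so
    theta[0] is the intercept and is left out of the regularizer.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * v for t, v in zip(theta, x_i))
        total += y_i * cost1(z) + (1 - y_i) * cost0(z)
    reg = 0.5 * sum(t * t for t in theta[1:])
    return C * total + reg
```

A point with θᵀx = 0.5 and y = 1 is classified correctly but still pays a cost of 0.5, because it lies inside the margin.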

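7 SVM decision boundary
$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \text{cost}_1(\theta^\top x^{(i)}) + (1 - y^{(i)}) \, \text{cost}_0(\theta^\top x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$
Let's say we have a very large C. Then the first term is driven to zero:
Whenever $y^{(i)} = 1$: $\theta^\top x^{(i)} \ge 1$.
Whenever $y^{(i)} = 0$: $\theta^\top x^{(i)} \le -1$.
The problem becomes
$$\min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1, \quad \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0$$
Slide credit: Andrew Ng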

8 SVM decision boundary: Linearly separable case. [Figure: two classes in the $x_1$-$x_2$ plane.] Slide credit: Andrew Ng

9 SVM decision boundary: Linearly separable case. [Figure: the same data with the margin marked, axes $x_1$ and $x_2$.] Slide credit: Andrew Ng

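10 Why large margin classifiers?
$$\min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1, \quad \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0$$
[Figure: decision boundary with the margin marked, axes $x_1$ and $x_2$.]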

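11 Vector inner product
$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}, \qquad v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$$
$\|u\|$ = length of vector $u$ = $\sqrt{u_1^2 + u_2^2} \in \mathbb{R}$
$p$ = (signed) length of the projection of $v$ onto $u$
$$u^\top v = p \cdot \|u\| = u_1 v_1 + u_2 v_2$$
[Figure: vectors $u$ and $v$ with the projection $p$ of $v$ onto $u$.]
Slide credit: Andrew Ng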
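
A small numeric check of the identity uᵀv = p · ‖u‖, as a sketch in plain Python. The vectors are made-up values chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def signed_projection_length(v, u):
    """Signed length p of the projection of v onto u.

    p is negative when the angle between u and v exceeds 90 degrees.
    """
    return dot(u, v) / norm(u)


u = [4.0, 3.0]    # ||u|| = 5
v = [2.0, -1.0]   # u . v = 8 - 3 = 5
p = signed_projection_length(v, u)
```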

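12 SVM decision boundary
$$\min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1, \quad \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0$$
Simplification: $\theta_0 = 0$, $n = 2$.
What is $\theta^\top x^{(i)}$? With $p^{(i)}$ the projection of $x^{(i)}$ onto $\theta$,
$$\theta^\top x^{(i)} = p^{(i)} \, \|\theta\|_2$$
and
$$\frac{1}{2} \sum_{j=1}^{n} \theta_j^2 = \frac{1}{2} \left( \theta_1^2 + \theta_2^2 \right) = \frac{1}{2} \left( \sqrt{\theta_1^2 + \theta_2^2} \right)^2 = \frac{1}{2} \|\theta\|_2^2$$
[Figure: vector $\theta$, example $x^{(i)}$ and its projection $p^{(i)}$ onto $\theta$, axes $x_1$ and $x_2$.]
Slide credit: Andrew Ng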

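13 SVM decision boundary
$$\min_\theta \frac{1}{2} \|\theta\|_2^2 \quad \text{s.t.} \quad p^{(i)} \|\theta\|_2 \ge 1 \text{ if } y^{(i)} = 1, \quad p^{(i)} \|\theta\|_2 \le -1 \text{ if } y^{(i)} = 0$$
Simplification: $\theta_0 = 0$, $n = 2$.
[Figure, left: a boundary for which $p^{(1)}, p^{(2)}$ are small, so $\|\theta\|_2$ must be large.]
[Figure, right: a boundary for which $p^{(1)}, p^{(2)}$ are large, so $\|\theta\|_2$ can be small.]
Slide credit: Andrew Ng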
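
The trade-off between projection length and ‖θ‖ can be checked numerically. The sketch below assumes θ_0 = 0 as in the slide, uses a made-up two-point data set, and introduces a hypothetical helper that returns the smallest ‖θ‖ satisfying the constraints along a given direction.

```python
import math


def min_norm_for_direction(direction, X, y):
    """Smallest ||theta|| along `direction` that satisfies
    p_i * ||theta|| >= 1 (y = 1) and p_i * ||theta|| <= -1 (y = 0).

    Assumes theta_0 = 0. Returns None if the direction does not
    separate the data.
    """
    length = math.sqrt(sum(d * d for d in direction))
    margins = []
    for x_i, y_i in zip(X, y):
        p_i = sum(d * v for d, v in zip(direction, x_i)) / length  # signed projection
        sign = 1.0 if y_i == 1 else -1.0
        margins.append(sign * p_i)
    smallest = min(margins)
    if smallest <= 0:
        return None
    return 1.0 / smallest


X = [[2.0, 2.0], [-2.0, -2.0]]
y = [1, 0]
norm_diagonal = min_norm_for_direction([1.0, 1.0], X, y)  # projections of length 2*sqrt(2)
norm_axis = min_norm_for_direction([1.0, 0.0], X, y)      # projections of length 2
```

The direction with the longer projections needs the smaller ‖θ‖, which is why minimizing ‖θ‖ favors the large-margin boundary.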

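14 Rewrite the formulation
Let $\theta_j = \theta'_j / \gamma$, with $\|\theta'\|_2 = 1$ and $\gamma > 0$. The objective becomes
$$\min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 = \min_{\theta', \gamma} \frac{1}{2} \sum_{j=1}^{n} \frac{\theta_j'^2}{\gamma^2} = \min_{\gamma} \frac{1}{2\gamma^2},$$
which is equivalent to $\max_\gamma \gamma$. The constraints rescale in the same way:
$\theta^\top x^{(i)} \ge 1$ if $y^{(i)} = 1$ becomes $\theta'^\top x^{(i)} \ge \gamma$ if $y^{(i)} = 1$;
$\theta^\top x^{(i)} \le -1$ if $y^{(i)} = 0$ becomes $\theta'^\top x^{(i)} \le -\gamma$ if $y^{(i)} = 0$.
So the hard-margin problem
$$\min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1, \quad \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0$$
is equivalent to maximizing the margin $\gamma$:
$$\max_{\theta', \gamma} \gamma \quad \text{s.t.} \quad \|\theta'\|_2 = 1, \quad \theta'^\top x^{(i)} \ge \gamma \text{ if } y^{(i)} = 1, \quad \theta'^\top x^{(i)} \le -\gamma \text{ if } y^{(i)} = 0$$
[Figure: decision boundary with the margin marked, axes $x_1$ and $x_2$.]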

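15 Data not linearly separable?
The hard-margin problem
$$\min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1, \quad \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0$$
has no feasible solution. Penalizing the number of misclassifications instead,
$$\min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 + C \cdot (\#\text{misclassifications}) \quad \text{s.t.} \quad \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1, \quad \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0,$$
is NP-hard.
[Figure: non-separable data, axes $x_1$ and $x_2$.]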

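16 Convex relaxation
$$\min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 + C \cdot (\#\text{misclassifications}) \quad \text{s.t.} \quad \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1, \quad \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0 \qquad \text{(NP-hard)}$$
Relax with slack variables $\xi^{(i)}$:
$$\min_{\theta, \xi} \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 + C \sum_i \xi^{(i)}$$
$$\text{s.t.} \quad \theta^\top x^{(i)} \ge 1 - \xi^{(i)} \text{ if } y^{(i)} = 1, \quad \theta^\top x^{(i)} \le -1 + \xi^{(i)} \text{ if } y^{(i)} = 0, \quad \xi^{(i)} \ge 0 \;\; \forall i$$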
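
For a fixed θ, the smallest slack that satisfies each constraint can be written in closed form, which is where the hinge loss comes from. A minimal sketch with no intercept (θ_0 = 0 is an assumption made here for brevity) and made-up data:

```python
def optimal_slacks(theta, X, y):
    """For fixed theta, the smallest feasible slack for each example.

    y = 1: theta^T x >= 1 - xi   ->  xi = max(0, 1 - theta^T x)
    y = 0: theta^T x <= -1 + xi  ->  xi = max(0, 1 + theta^T x)
    """
    slacks = []
    for x_i, y_i in zip(X, y):
        z = sum(t * v for t, v in zip(theta, x_i))
        slacks.append(max(0.0, 1.0 - z) if y_i == 1 else max(0.0, 1.0 + z))
    return slacks


def soft_margin_objective(theta, X, y, C):
    """0.5 * sum_j theta_j^2 + C * sum_i xi_i, with no intercept term."""
    return 0.5 * sum(t * t for t in theta) + C * sum(optimal_slacks(theta, X, y))


theta = [1.0, 0.0]
X = [[2.0, 0.0], [0.5, 1.0], [-3.0, 0.0], [0.5, 0.0]]
y = [1, 1, 0, 0]
slacks = optimal_slacks(theta, X, y)
```

Here the second point is correctly classified but inside the margin (slack 0.5), and the fourth is misclassified (slack 1.5); points beyond the margin get zero slack.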

17 Hinge loss Image credit: https://math.stackexchange.

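18 Hard-margin SVM formulation
$$\min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1, \quad \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0$$
Soft-margin SVM formulation
$$\min_{\theta, \xi} \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 + C \sum_i \xi^{(i)}$$
$$\text{s.t.} \quad \theta^\top x^{(i)} \ge 1 - \xi^{(i)} \text{ if } y^{(i)} = 1, \quad \theta^\top x^{(i)} \le -1 + \xi^{(i)} \text{ if } y^{(i)} = 0, \quad \xi^{(i)} \ge 0 \;\; \forall i$$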
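
One simple way to approximately solve the soft-margin problem is batch subgradient descent on its unconstrained hinge-loss form. This is a swapped-in illustration, not the solver used by the lecture or by packages such as libsvm. The step size, epoch count, and toy data are arbitrary choices.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Batch subgradient descent on 0.5*||w||^2 + C * sum_i hinge_i.

    X: list of feature lists (no constant column); y: labels in {0, 1}.
    Returns (w, b). The intercept b is not regularized.
    """
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0          # gradient of the regularizer
        for x_i, y_i in zip(X, y):
            s = 1.0 if y_i == 1 else -1.0      # recode {0, 1} as {-1, +1}
            z = sum(wj * xj for wj, xj in zip(w, x_i)) + b
            if s * z < 1.0:                    # inside the margin or misclassified
                for j in range(n):
                    grad_w[j] -= C * s * x_i[j]
                grad_b -= C * s
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0


X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_linear_svm(X, y)
```

With a fixed step size the iterates hover around the optimum rather than converging exactly; a decaying step size or a dedicated QP solver would be the usual refinement.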

19 Support Vector Machine: Cost function; Large margin classification; Kernels; Using an SVM

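20 Non-linear classification
How do we separate the two classes using a hyperplane?
$$h_\theta(x) = \begin{cases} 1 & \text{if } \theta^\top x \ge 0 \\ 0 & \text{if } \theta^\top x < 0 \end{cases}$$
[Figure: data plotted along a single axis $x$.]

21 Non-linear classification [Figure: axes $x$ and $x^2$.]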

22 Kernel
$K$ is a kernel (a legal definition of inner product) if there exists $\phi: X \to \mathbb{R}^N$ such that $K(x, z) = \phi(x) \cdot \phi(z)$.

23 Why do kernels matter?
Many algorithms interact with data only via dot products.
Replace $x \cdot z$ with $K(x, z) = \phi(x) \cdot \phi(z)$.
The algorithm then acts implicitly as if the data were in the higher-dimensional $\phi$-space.

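24 Example
$K(x, z) = (x \cdot z)^2$ corresponds to the map $(x_1, x_2) \mapsto \phi(x) = (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2)$:
$$\phi(x) \cdot \phi(z) = (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2) \cdot (z_1^2, \; z_2^2, \; \sqrt{2}\, z_1 z_2)$$
$$= x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1 z_1 + x_2 z_2)^2 = (x \cdot z)^2 = K(x, z)$$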
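
The identity can be checked numerically. A minimal sketch with arbitrary test vectors, taking the third component of φ to be √2·x_1·x_2 as the derivation requires:

```python
import math


def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2 for 2-D inputs, computed without the feature map."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2.0) * x[0] * x[1])


x, z = (1.0, 2.0), (3.0, -1.0)
lhs = quadratic_kernel(x, z)                          # (3 - 2)^2
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))      # 9 + 4 - 12
```

The kernel needs one 2-D dot product and a square, while the explicit route builds two 3-D vectors first; the gap widens quickly for higher degrees and dimensions.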

25 Example
$K(x, z) = (x \cdot z)^2$ corresponds to the map $(x_1, x_2) \mapsto \phi(x) = (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2)$.
Slide credit: Maria-Florina Balcan

26 Example kernels
Linear kernel: $K(x, z) = x^\top z$
Gaussian (radial basis function) kernel: $K(x, z) = \exp\left( -\frac{1}{2} (x - z)^\top \Sigma^{-1} (x - z) \right)$
Sigmoid kernel: $K(x, z) = \tanh(a \, x^\top z + b)$
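
The three kernels as plain Python functions. This is a sketch: the Gaussian kernel is restricted to the isotropic special case Σ = σ²I (an assumption made to avoid a matrix inverse), and the default parameter values are arbitrary.

```python
import math


def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(x, z, sigma2=1.0):
    """Isotropic special case Sigma = sigma2 * I:
    exp(-||x - z||^2 / (2 * sigma2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma2))


def sigmoid_kernel(x, z, a=1.0, b=0.0):
    return math.tanh(a * linear_kernel(x, z) + b)
```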

27 Constructing new kernels
Positive scaling: $K(x, z) = c \, K_1(x, z)$, $c > 0$
Exponentiation: $K(x, z) = \exp(K_1(x, z))$
Addition: $K(x, z) = K_1(x, z) + K_2(x, z)$
Multiplication with function: $K(x, z) = f(x) \, K_1(x, z) \, f(z)$
Multiplication: $K(x, z) = K_1(x, z) \, K_2(x, z)$
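
These closure rules map naturally onto higher-order functions. A minimal sketch; the combinator names are illustrative, and the code only composes kernels, it does not verify that the inputs are valid kernels.

```python
import math


def scale(k1, c):
    assert c > 0, "scaling constant must be positive"
    return lambda x, z: c * k1(x, z)


def exponentiate(k1):
    return lambda x, z: math.exp(k1(x, z))


def add(k1, k2):
    return lambda x, z: k1(x, z) + k2(x, z)


def multiply(k1, k2):
    return lambda x, z: k1(x, z) * k2(x, z)


def multiply_with_function(k1, f):
    return lambda x, z: f(x) * k1(x, z) * f(z)


def linear(x, z):
    return sum(a * b for a, b in zip(x, z))


# (x . z)^2 built as a product of two linear kernels, plus a scaled linear term.
combined = add(multiply(linear, linear), scale(linear, 2.0))
value = combined([1.0, 2.0], [3.0, 4.0])   # 11^2 + 2 * 11
```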

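28 Non-linear decision boundary
[Figure: non-linearly separable data, axes $x_1$ and $x_2$.]
Predict y = 1 if
$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \cdots \ge 0$$
Equivalently, $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \cdots$ with $f_1 = x_1$, $f_2 = x_2$, $f_3 = x_1 x_2$, ...
Is there a different/better choice of the features $f_1, f_2, f_3, \ldots$?
Slide credit: Andrew Ng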

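29 Kernel
Given $x$, compute new features depending on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$:
$f_1 = \text{similarity}(x, l^{(1)})$
$f_2 = \text{similarity}(x, l^{(2)})$
$f_3 = \text{similarity}(x, l^{(3)})$
Gaussian kernel:
$$\text{similarity}(x, l^{(i)}) = \exp\left( -\frac{\|x - l^{(i)}\|^2}{2\sigma^2} \right)$$
[Figure: landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $x_1$-$x_2$ plane.]
Slide credit: Andrew Ng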
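
A short sketch of the landmark features. The landmark positions and the query point are hypothetical values chosen so that x coincides with the first landmark.

```python
import math


def similarity(x, landmark, sigma2=1.0):
    """Gaussian similarity exp(-||x - l||^2 / (2 * sigma2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma2))


landmarks = [[3.0, 5.0], [0.0, 0.0], [-4.0, 2.0]]   # hypothetical l(1), l(2), l(3)
x = [3.0, 5.0]                                       # coincides with l(1)
features = [similarity(x, l) for l in landmarks]
```

The feature is 1 when x sits on a landmark and decays toward 0 as x moves away from it.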

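30 [Figure: landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $x_1$-$x_2$ plane.]
Predict y = 1 if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$, where
$f_1 = \text{similarity}(x, l^{(1)})$
$f_2 = \text{similarity}(x, l^{(2)})$
$f_3 = \text{similarity}(x, l^{(3)})$
Ex: $\theta_0 = -0.5$, $\theta_1 = 1$, $\theta_2 = 1$, $\theta_3 = 0$. Points close to $l^{(1)}$ or $l^{(2)}$ are predicted y = 1; points far from both are predicted y = 0.
Slide credit: Andrew Ng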
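
The decision rule can be traced in code. This sketch takes θ_0 = -0.5, θ_1 = θ_2 = 1, θ_3 = 0 (the negative sign on θ_0 is assumed, since with a positive θ_0 and non-negative features the rule would always predict 1); the landmark positions are hypothetical.

```python
import math


def similarity(x, landmark, sigma2=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma2))


def predict(x, landmarks, theta, sigma2=1.0):
    """Predict 1 if theta_0 + sum_k theta_k * f_k >= 0, else 0."""
    score = theta[0] + sum(
        t * similarity(x, l, sigma2) for t, l in zip(theta[1:], landmarks)
    )
    return 1 if score >= 0 else 0


landmarks = [[1.0, 1.0], [4.0, 4.0], [8.0, 1.0]]   # hypothetical positions
theta = [-0.5, 1.0, 1.0, 0.0]
```

For example, `predict([1.0, 1.0], landmarks, theta)` gives 1 because f_1 = 1, while a point sitting on the third landmark gives 0 because θ_3 = 0 and it is far from the other two.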

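31 Choosing the landmarks
Given $x$:
$$f_i = \text{similarity}(x, l^{(i)}) = \exp\left( -\frac{\|x - l^{(i)}\|^2}{2\sigma^2} \right)$$
Predict y = 1 if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$.
Where to get $l^{(1)}, l^{(2)}, l^{(3)}, \ldots$?
[Figure: landmarks $l^{(1)}, l^{(2)}, l^{(3)}, \ldots, l^{(m)}$ in the $x_1$-$x_2$ plane.]
Slide credit: Andrew Ng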

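32 SVM with kernels
Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$,
choose $l^{(1)} = x^{(1)}, \; l^{(2)} = x^{(2)}, \; l^{(3)} = x^{(3)}, \; \ldots, \; l^{(m)} = x^{(m)}$.
Given example $x$:
$f_1 = \text{similarity}(x, l^{(1)})$
$f_2 = \text{similarity}(x, l^{(2)})$
...
$$f = \begin{bmatrix} f_0 & f_1 & f_2 & \cdots & f_m \end{bmatrix}^\top$$
For training example $(x^{(i)}, y^{(i)})$: $x^{(i)} \to f^{(i)}$.
Slide credit: Andrew Ng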
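
With every training example used as a landmark, each input maps to a vector of length m + 1. A minimal sketch, assuming f_0 = 1 plays the role of the intercept feature and using made-up training points:

```python
import math


def similarity(x, landmark, sigma2=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma2))


def kernel_features(x, X_train, sigma2=1.0):
    """Map x to f = [f_0, f_1, ..., f_m] with f_0 = 1 and
    f_i = similarity(x, x^(i))."""
    return [1.0] + [similarity(x, l, sigma2) for l in X_train]


X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
F = [kernel_features(x_i, X_train) for x_i in X_train]   # m rows, each of length m + 1
```

Each training example has similarity 1 with itself, so the block F[i][1:] has ones on its diagonal and is symmetric.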

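33 SVM with kernels
Hypothesis: given $x$, compute features $f \in \mathbb{R}^{m+1}$. Predict y = 1 if $\theta^\top f \ge 0$.
Training (original):
$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \text{cost}_1(\theta^\top x^{(i)}) + (1 - y^{(i)}) \, \text{cost}_0(\theta^\top x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$
Training (with kernel):
$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \text{cost}_1(\theta^\top f^{(i)}) + (1 - y^{(i)}) \, \text{cost}_0(\theta^\top f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2$$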

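34 Support vector machines (Primal/Dual)
Primal form:
$$\min_{\theta, \xi} \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 + C \sum_i \xi^{(i)}$$
$$\text{s.t.} \quad \theta^\top x^{(i)} \ge 1 - \xi^{(i)} \text{ if } y^{(i)} = 1, \quad \theta^\top x^{(i)} \le -1 + \xi^{(i)} \text{ if } y^{(i)} = 0, \quad \xi^{(i)} \ge 0 \;\; \forall i$$
Lagrangian dual form:
$$\min_\alpha \frac{1}{2} \sum_i \sum_j y^{(i)} y^{(j)} \alpha^{(i)} \alpha^{(j)} \, x^{(i)\top} x^{(j)} - \sum_i \alpha^{(i)}$$
$$\text{s.t.} \quad 0 \le \alpha^{(i)} \le C \;\; \forall i, \qquad \sum_i y^{(i)} \alpha^{(i)} = 0$$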

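35 SVM (Lagrangian dual)
$$\min_\alpha \frac{1}{2} \sum_i \sum_j y^{(i)} y^{(j)} \alpha^{(i)} \alpha^{(j)} \, x^{(i)\top} x^{(j)} - \sum_i \alpha^{(i)}$$
$$\text{s.t.} \quad 0 \le \alpha^{(i)} \le C \;\; \forall i, \qquad \sum_i y^{(i)} \alpha^{(i)} = 0$$
(The dual is conventionally written with labels recoded as $y^{(i)} \in \{-1, +1\}$.)
Classifier: $\theta = \sum_i \alpha^{(i)} y^{(i)} x^{(i)}$
Kernel trick: replace $x^{(i)\top} x^{(j)}$ with $K(x^{(i)}, x^{(j)})$.
The points $x^{(i)}$ for which $\alpha^{(i)} > 0$ are the support vectors.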
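
How the dual variables turn into a classifier can be seen on a two-point toy problem. The sketch below assumes labels in {-1, +1}, no intercept, and C ≥ 0.25. For these two points the constraint Σ y α = 0 forces α_1 = α_2 = α, the dual objective reduces to 4α² - 2α, and its minimizer is α = 0.25. The function names are illustrative.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def primal_weights(alpha, y, X):
    """theta = sum_i alpha_i * y_i * x^(i), with labels y_i in {-1, +1}."""
    n = len(X[0])
    return [sum(a * s * x[j] for a, s, x in zip(alpha, y, X)) for j in range(n)]


def decision_value(x, alpha, y, X, kernel=linear_kernel):
    """sum_i alpha_i * y_i * K(x^(i), x); only support vectors
    (alpha_i > 0) contribute."""
    return sum(a * s * kernel(x_i, x) for a, s, x_i in zip(alpha, y, X) if a > 0)


# Toy problem: two points, dual solution worked out by hand.
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.25, 0.25]
theta = primal_weights(alpha, y, X)
```

Both points are support vectors and sit exactly on the margin (decision values +1 and -1). Passing a different `kernel` to `decision_value` gives the kernelized classifier without ever forming θ.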

36 SVM parameters
$C$ ($= 1/\lambda$):
Large C: lower bias, higher variance.
Small C: higher bias, lower variance.
$\sigma^2$:
Large $\sigma^2$: features $f_i$ vary more smoothly. Higher bias, lower variance.
Small $\sigma^2$: features $f_i$ vary less smoothly. Lower bias, higher variance.
Slide credit: Andrew Ng
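
The effect of σ² on smoothness can be seen by evaluating the Gaussian similarity at a few fixed distances from a landmark. A small sketch; the distances and σ² values are arbitrary.

```python
import math


def similarity_at_distance(distance, sigma2):
    """Gaussian similarity as a function of the distance to the landmark."""
    return math.exp(-distance ** 2 / (2.0 * sigma2))


distances = (0.0, 1.0, 2.0)
table = {
    sigma2: [similarity_at_distance(d, sigma2) for d in distances]
    for sigma2 in (0.1, 1.0, 10.0)
}
```

For small σ² the similarity collapses almost to zero one unit away from the landmark, so each feature reacts only to very close points; for large σ² it falls off slowly.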

37 SVM Demo

38 SVM song Video source:

39 Support Vector Machine: Cost function; Large margin classification; Kernels; Using an SVM

40 Using SVM
Use an SVM software package (e.g., liblinear, libsvm) to solve for θ.
Need to specify:
Choice of parameter C.
Choice of kernel (similarity function):
Linear kernel: predict y = 1 if $\theta^\top x \ge 0$.
Gaussian kernel: $f_i = \exp\left( -\frac{\|x - l^{(i)}\|^2}{2\sigma^2} \right)$, where $l^{(i)} = x^{(i)}$. Need to choose $\sigma^2$. Need proper feature scaling.
Slide credit: Andrew Ng
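
Feature scaling matters for the Gaussian kernel because ‖x - l‖² is dominated by whichever feature has the largest numeric range. One common choice is standardization (zero mean, unit variance per feature); the sketch below uses it as an assumption, with made-up data, and is not necessarily the scaling the lecture had in mind.

```python
import math


def fit_standardizer(X):
    """Per-feature mean and standard deviation computed on the training set."""
    m, n = len(X), len(X[0])
    means = [sum(row[j] for row in X) / m for j in range(n)]
    stds = []
    for j in range(n):
        var = sum((row[j] - means[j]) ** 2 for row in X) / m
        stds.append(math.sqrt(var) or 1.0)   # constant feature: avoid dividing by zero
    return means, stds


def standardize(X, means, stds):
    return [[(v - mu) / s for v, mu, s in zip(row, means, stds)] for row in X]


# Hypothetical data: a feature in the thousands next to a single-digit feature.
X = [[1000.0, 1.0], [2000.0, 3.0], [3000.0, 5.0]]
means, stds = fit_standardizer(X)
X_scaled = standardize(X, means, stds)
```

The same `means` and `stds` fitted on the training set should be reused to scale any new example before computing its kernel features.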

41 Kernel (similarity) functions
Note: not all similarity functions make valid kernels.
Many off-the-shelf kernels are available: polynomial kernel, string kernel, chi-square kernel, histogram intersection kernel.
Slide credit: Andrew Ng

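42 Multi-class classification
[Figure: several classes in the $x_1$-$x_2$ plane.]
Use the one-vs.-all method: train K SVMs, one to distinguish y = i from the rest for each i = 1, ..., K, obtaining $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(K)}$.
Pick the class i with the largest $\theta^{(i)\top} x$.
Slide credit: Andrew Ng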
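
The prediction step of one-vs.-all is an argmax over the K classifier scores. A minimal sketch with hypothetical parameter vectors; classes are indexed from 0 here, whereas the slide counts from 1.

```python
def one_vs_all_predict(thetas, x):
    """Return the index of the class whose classifier gives the largest theta^T x.

    thetas: one parameter vector per class; x: feature vector with x[0] = 1.
    """
    scores = [sum(t * v for t, v in zip(theta, x)) for theta in thetas]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical parameters for K = 3 classes in two dimensions.
thetas = [
    [0.0, 1.0, 0.0],    # class 0 favours large x1
    [0.0, 0.0, 1.0],    # class 1 favours large x2
    [1.0, -1.0, -1.0],  # class 2 favours points near the origin
]
```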

43 Logistic regression vs. SVMs
n = number of features ($x \in \mathbb{R}^{n+1}$), m = number of training examples.
1. If n is large relative to m (n = 10,000, m = …): use logistic regression or an SVM without a kernel ("linear kernel").
2. If n is small and m is intermediate (n = …, m = 10-10,000): use an SVM with a Gaussian kernel.
3. If n is small and m is large (n = …, m = 50,000+): create/add more features, then use logistic regression or a linear SVM.
A neural network is likely to work well in most of these cases, but is slower to train.
Slide credit: Andrew Ng

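44 Things to remember
Cost function:
$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \text{cost}_1(\theta^\top f^{(i)}) + (1 - y^{(i)}) \, \text{cost}_0(\theta^\top f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2$$
Large margin classification [Figure: decision boundary with the margin marked, axes $x_1$ and $x_2$.]
Kernels
Using an SVM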
