Support Vector Machine I

Size: px

Start display at page:

Download "Support Vector Machine I"

Oswald Doyle
5 years ago
Views:

1 Support Vector Machine I Jia-Bin Huang ECE-5424G / CS-5824 Virginia Tech Spring 2019

2 Administrative Please use piazza. No s. HW 0 grades are back. Re-grade request for one week. HW 1 due soon. HW 2 on the way.

3 Regularized logistic regression Age h θ x = g(θ 0 + θ 1 x + θ 2 x 2 + θ 3 x θ 4 x θ 5 x 1 x 2 + θ 6 x 1 3 x 2 + θ 7 x 1 x ) Cost function: m Tumor Size J θ = 1 m y i log h θ x i + (1 y i ) log 1 h θ x i + λ 2 i=1 j=1 n θ j 2

4 Gradient descent (Regularized) Repeat { m θ 0 θ 0 α 1 m i=1 h θ x i y i h θ x = e θ x m θ j θ j α 1 m i=1 h θ x i y i x j i + λθ j } θ j J(θ)

5 θ 1 : Lasso regularization J θ = 1 m 2m i=1 h θ x i y i 2 + λ j=1 n θ j LASSO: Least Absolute Shrinkage and Selection Operator

6 Soft Thresholding operator S λ x = sign x x λ + Single predictor: Soft Thresholding 1 minimize σ θ 2m i=1 m x (i) θ y i 2 + λ θ 1 θ = 1 m < x, y > λ if 1 m 1 0 if 1 m < x, y > +λ if 1 m < x, y > > λ < x, y > λ m < x, y > < λ θ = S λ ( 1 m < x, y >)

7 Multiple predictors: : Cyclic Coordinate Descenet 1 minimize σ θ 2m i=1 m x i j θ j + σ k j x i ij θ k y i 2 + For each j, update θ j with λ k j minimize θ 1 2m i=1 θ k + λ θ j 1 m x j i θ j r j (i) 2 + λ θ j 1 where r j (i) = y i σ k j x ij i θ k

8 L1 and L2 balls Image credit:

9 Terminology Regularization function n θ 2 2 = j=1 n θ 1 = j=1 θ j 2 θ j Name Tikhonov regularization Ridge regression LASSO regression Solver Close form Proximal gradient descent, least angle regression α θ 1 + (1 α) θ 2 2 Elastic net regularization Proximal gradient descent

10 Things to remember Overfitting Complex model: doing well on the training set, but perform poorly on the testing set Cost function Norm penalty: L1 or L2 Regularized linear regression Gradient decent with weight decay (L2 norm) Regularized logistic regression Gradient decent with weight decay (L2 norm)

11 Support Vector Machine Cost function Large margin classification Kernels Using an SVM

12 Support Vector Machine Cost function Large margin classification Kernels Using an SVM

13 Logistic regression h θ x = g θ x 1 g z = 1 + e z g(z) z = θ x Suppose predict y = 1 if h θ x 0.5 predict y = 0 if h θ x < 0.5 z = θ x 0 z = θ x < 0

14 Alternative view h θ x = g θ x 1 g z = 1 + e z If y = 1, we want h θ x 1 If y = 0, we want h θ x 0 g(z) z = θ x z = θ x 0 z = θ x 0

15 Cost function for Logistic Regression Cost(h θ x, y) = log h θ x if y = 1 log 1 h θ x if y = 0 if y = 1 if y = 0 0 h θ x 1 0 h θ x 1

16 Alternative view of logistic regression Cost h θ x, y = y log h θ x (1 y) log 1 h θ x 1 1 = y log + 1 y log e θ x 1 + e θ x log e θ x log e θ x 0 if y = 1 if y = θ x θ x

17 Logistic regression (logistic loss) m 1 min θ m i=1 y (i) log h θ x (i) + (1 y i ) log 1 h θ x (i) + λ 2m j=1 n θ j 2 Support vector machine (hinge loss) m 1 min θ m i=1 y (i) cost 1 θ x i + (1 y i ) cost 0 θ x i + λ 2m j=1 n θ j 2 log e θ x log e θ x 0 if y = 1 if y = θ x θ x

18 Optimization objective for SVM m 1 min θ m i=1 y (i) cost 1 θ x i + (1 y i ) cost 0 θ x i + λ n 2m 2 θ j j=1 1) Multiply 1 m 2) Multiply C = 1 λ min θ m C i=1 y (i) cost 1 θ x i + (1 y i ) cost 0 θ x i + n j=1 θ j 2

19 Hypothesis of SVM min θ m C i=1 y (i) cost 1 θ x i + (1 y i ) cost 0 θ x i + n j=1 θ j 2 Hypothesis h θ x = ቊ 1 if θ x 0 0 if θ x < 0

20 Support Vector Machine Cost function Large margin classification Kernels Using an SVM

21 Support vector machine min θ m C i=1 y (i) cost 1 θ x i + (1 y i ) cost 0 θ x i + n j=1 θ j 2 cost 1 θ x cost 0 θ x if y = 1 if y = θ x If y = 1, we want θ x 1 (not just 0) If y = 0, we want θ x 1 (not just < 0) θ x

22 SVM decision boundary min θ m C i=1 y (i) cost 1 θ x i + (1 y i ) cost 0 θ x i j=1 Let s say we have a very large C n θ j 2 Whenever y i = 1: θ x i 1 Whenever y i = 0: θ x i 1 1 θ 2 σ n j=1 θ 2 j min s. t. θ x i 1 if y i = 1 θ x i 1 if y i = 0

23 SVM decision boundary: Linearly separable case x 2 x 1

24 SVM decision boundary: Linearly separable case x 2 margin x 1

25 Large margin classifier in the presence of outlier C very large x 2 C not too large x 1

26 Vector inner product v 2 u 2 v u u = u 1 u v = v 1 2 v 2 u = length of vector u = u u 2 2 R p = length of projection of v onto u v 1 u 1 u v = p = u 1 v 1 + u 2 v 2 u

27 SVM decision boundary 1 σ n 2 θ 2 j=1 θ j min s. t. θ x i 1 if y i = 1 θ x i 1 if y i = 0 Simplication: θ 0 = 0, n = 2 What s θ x i? θ x i = p (i) θ 2 1 σ n 2 j=1 θ 2 j = 1 θ θ 2 2 = 1 2 x 2 i θ 2 x (i) x 1 i p (i) θ θ = 1 2 θ 2 θ 1 θ

28 SVM decision boundary min 2 θ 2 s. t. p (i) θ 2 1 if y i = 1 p (i) θ 2 1 if y i = 0 θ 1 p 1, p (2) small θ 2 large Simplication: θ 0 = 0, n = 2 p 1, p (2) large θ 2 can be small θ x 2 x 1 x 2 x 1 θ

29 Support Vector Machine Cost function Large margin classification Kernels Using an SVM

30 Non-linear decision boundary x 2 Predict y = 1 if x 1 θ 0 + θ 1 x 1 + θ 2 x 2 + θ 3 x 1 x 2 + θ 4 x θ 5 x θ 0 + θ 1 f 1 + θ 2 f 2 + θ 3 f 3 + f 1 = x 1, f 2 = x 2, f 3 = x 1 x 2, Is there a different/better choice of the features f 1, f 2, f 3,?

31 Kernel x 2 l (2) l (1) Give x, compute new features depending on proximity to landmarks l (1), l (2), l (3) l (3) x 1 f 1 = similarity(x, l (1) ) f 2 = similarity(x, l (2) ) f 3 = similarity(x, l (3) ) Gaussian kernel similarity x, l i = exp( x l i 2 2σ 2 )

32 x 2 l (2) l (1) l (3) x 1 Predict y = 1 if θ 0 + θ 1 f 1 + θ 2 f 2 + θ 3 f 3 0 f 1 = similarity(x, l (1) ) f 2 = similarity(x, l (2) ) f 3 = similarity(x, l (3) ) Ex: θ 0 = 0.5, θ 1 = 1, θ 2 = 1, θ 3 = 0

33 Choosing the landmarks Given x f i = similarity x, l i = exp( x l i 2 2σ 2 ) Predict y = 1 if θ 0 + θ 1 f 1 + θ 2 f 2 + θ 3 f 3 0 l x 1 2 Where to get l (1), l (2), l 3,? x 1 l 2 l 3 l m

34 SVM with kernels Given x 1, y 1, x 2, y 2,, x m, y m Choose l 1 = x 1, l 2 = x 2, l 3 = x 3,, l m = x m Given example x: f 1 = similarity x, l (1) f 2 = similarity x, l (2) For training example x i, y i : x i f (i) f = f 0 f 1 f 2 f m

35 SVM with kernels Hypothesis: Given x, compute features f R m+1 Predict y = 1 if θ f 0 Training (original) min θ m C i=1 Training (with kernel) min θ m C i=1 y (i) cost 1 θ x i + (1 y i ) cost 0 θ x i j=1 y (i) cost 1 θ f i + (1 y i ) cost 0 θ f i j=1 n m θ j 2 θ j 2

36 SVM parameters C = 1 λ Large C: Lower bias, high variance. Small C: Higher bias, low variance. σ 2 Large σ 2 : features f i vary more smoothly. Higher bias, lower variance Small σ 2 : features f i vary less smoothly. Lower bias, higher variance

37 SVM song Video source:

38 SVM Demo

39 Support Vector Machine Cost function Large margin classification Kernels Using an SVM

40 Using SVM SVM software package (e.g., liblinear, libsvm) to solve for θ Need to specify: Choice of parameter C. Choice of kernel (similarity function): Linear kernel: Predict y = 1 if θ x 0 Gaussian kernel: f i = exp( x l i 2 ), where l (i) = x (i) 2σ 2 Need to choose σ 2. Need proper feature scaling

41 Kernel (similarity) functions Note: not all similarity functions make valid kernels. Many off-the-shelf kernsl available: Polynomial kernel String kernel Chi-square kernel Histogram intersection kernel

42 Multi-class classification x 2 x 1 Use one-vs.-all method. Train K SVMs, one to distinguish y = i from the rest, get θ 1, θ (2),, θ (K) Pick class i with the largest θ i x

43 Logistic regression vs. SVMs n = number of features (x R n+1 ), m = number of training examples 1. If n is large (relative to m): (n = 10,000, m = ) Use logistic regression or SVM without a kernel ( linear kernel ) 2. If n is small, m is intermediate: (n = , m = 10 10,000) Use SVM with Gaussian kernel 3. If n is small, m is large: (n = , m = 50,000+) Create/add more features, then use logistic regression of linear SVM Neural network likely to work well for most of these case, but slower to train

44 Things to remember Cost function min θ m C y (i) cost 1 θ f i + (1 y i ) cost 0 θ f i i=1 j=1 Large margin classification m θ j 2 Kernels x 2 Using an SVM margin x 1

Support Vector Machine II

Support Vector Machine II Jia-Bin Huang ECE-5424G / CS-5824 Virginia Tech Spring 2019 Administrative HW 1 due tonight HW 2 released. Online Scalable Learning Adaptive to Unknown Dynamics and Graphs Yanning