Support Vector Machine
by Prof. Seungchul Lee, iSystems Design Lab, UNIST

Table of Contents
1. Classification (Linear)
2. Distance from a Line
3. Illustrative Example
  3.1. LP Formulation
  3.2. Outlier
  3.3. LP Formulation
  3.4. Maximize Margin (Finally, it is Support Vector Machine)
4. Support Vector Machine
5. Nonlinear Support Vector Machine
  5.1. Kernel
  5.2. Classifying non-linear separable data

1. Classification (Linear)

Figure out, autonomously, which category (or class) an unknown item should be categorized into.

Number of categories / classes:
- Binary: 2 different classes
- Multiclass: more than 2 classes

Feature: the measurable parts that make up the unknown item (or the information you have available to categorize).
2. Distance from a Line

$$\omega = \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix}, \qquad x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

$$g(x) = \omega^T x + \omega_0 = \omega_1 x_1 + \omega_2 x_2 + \omega_0$$

If $p$ and $q$ are on the decision line:

$$g(p) = g(q) = 0 \;\Rightarrow\; \omega^T p + \omega_0 = \omega^T q + \omega_0 = 0 \;\Rightarrow\; \omega^T (p - q) = 0$$

$\omega$ is normal (orthogonal) to the line; it tells the direction of the line.

If $x$ is on the line and $x = d\,\frac{\omega}{\|\omega\|}$ (where $d$ is the normal distance from the origin to the line):

$$g(x) = \omega^T x + \omega_0 = d\,\frac{\omega^T \omega}{\|\omega\|} + \omega_0 = d\,\|\omega\| + \omega_0 = 0 \;\Rightarrow\; d = -\frac{\omega_0}{\|\omega\|}$$
For any vector $x$, decompose it as $x = x_{\perp} + r\,\frac{\omega}{\|\omega\|}$, where $x_{\perp}$ is the projection of $x$ onto the line:

$$g(x) = \omega^T x + \omega_0 = r\,\|\omega\| + \omega_0$$

Writing $r = d + h$:

$$g(x) = (d + h)\,\|\omega\| + \omega_0 = \left(-\frac{\omega_0}{\|\omega\|} + h\right)\|\omega\| + \omega_0 = h\,\|\omega\| \;\Rightarrow\; h = \frac{g(x)}{\|\omega\|}$$

$h$ is the orthogonal distance of $x$ from the line.
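As a quick numeric check (not part of the original notes), the signed distance $h = g(x)/\|\omega\|$ can be verified with NumPy; the line and test point below are illustrative:

import numpy as np

# illustrative line g(x) = w'x + w0 with w = [1, 2], w0 = -3
w = np.array([1.0, 2.0])
w0 = -3.0

x = np.array([4.0, 5.0])                 # an arbitrary test point
h = (w @ x + w0) / np.linalg.norm(w)     # signed distance h = g(x)/||w||

# stepping back by h along the unit normal should land exactly on the line
x_proj = x - h * w / np.linalg.norm(w)
print(w @ x_proj + w0)                   # ~0, so x_proj lies on the line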
Another way to find the distance between the lines $g(x) = 1$ and $g(x) = -1$: suppose $g(x_1) = 1$ and $g(x_2) = -1$:

$$\omega^T x_1 + \omega_0 = 1, \qquad \omega^T x_2 + \omega_0 = -1 \;\Rightarrow\; \omega^T (x_1 - x_2) = 2$$

$$s = \frac{\omega^T (x_1 - x_2)}{\|\omega\|} = \frac{2}{\|\omega\|}$$
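The width $s = 2/\|\omega\|$ can be checked the same way (again an illustrative sketch, not from the original): pick one point on each of the lines $g(x) = 1$ and $g(x) = -1$ along the normal direction and measure their distance:

import numpy as np

w = np.array([1.0, 2.0])
w0 = -3.0

# points on g(x) = +1 and g(x) = -1, taken along the normal direction
x_plus  = (1 - w0) * w / (w @ w)         # g(x_plus)  = +1
x_minus = (-1 - w0) * w / (w @ w)        # g(x_minus) = -1

s = np.linalg.norm(x_plus - x_minus)
print(s, 2/np.linalg.norm(w))            # both print 2/||w||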
3. Illustrative Example

Binary classification with classes $C_1$ and $C_2$.

Features: the coordinates of unknown animal $i$ in the zoo, $x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}$.

Is it possible to distinguish between $C_1$ and $C_2$ by the coordinates on a map of the zoo? We need to find a separating hyperplane (or a line in 2D):

$$\omega_1 x_1 + \omega_2 x_2 + \omega_0 = 0 \;\Longleftrightarrow\; \begin{bmatrix} \omega_1 & \omega_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \omega_0 = 0 \;\Longleftrightarrow\; \omega^T x + \omega_0 = 0$$

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# training data generation
x1 = 8*np.random.rand(100, 1)
x2 = 7*np.random.rand(100, 1) - 4

g0 = 0.8*x1 + x2 - 3
g1 = g0 - 1
g2 = g0 + 1

C1 = np.where(g1 >= 0)[0]
C2 = np.where(g2 < 0)[0]

xp = np.linspace(0,8,100).reshape(-1,1)
ypt = -0.8*xp + 3

plt.figure(figsize=(10, 6))
plt.plot(x1[C1], x2[C1], 'ro', label='C1')
plt.plot(x1[C2], x2[C2], 'bo', label='C2')
plt.plot(xp, ypt, '--k', label='True')
plt.title('Linearly and Strictly Separable Classes', fontweight='bold', fontsize=15)
plt.xlabel('$x_1$', fontsize=10)
plt.ylabel('$x_2$', fontsize=10)
plt.legend(loc=4)
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()
Given:
- A hyperplane defined by $\omega$ and $\omega_0$
- An animal's coordinates (features) $x$

Decision making:

$$\omega^T x + \omega_0 > 0 \;\Rightarrow\; x \text{ belongs to } C_1$$
$$\omega^T x + \omega_0 < 0 \;\Rightarrow\; x \text{ belongs to } C_2$$

Find $\omega$ and $\omega_0$ such that, given $x \in C_1$, $\omega^T x + \omega_0 > 1$ and, given $x \in C_2$, $\omega^T x + \omega_0 < -1$.

This is the same problem if the data are strictly separable: for any $b > 0$,

$$\omega^T x + \omega_0 > b \;\Longleftrightarrow\; \frac{\omega^T}{b} x + \frac{\omega_0}{b} > 1 \;\Longleftrightarrow\; \omega'^T x + \omega_0' > 1$$

(A minimal decision rule in code is sketched below.)
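An illustrative sketch of the decision rule; the hyperplane values below are chosen by hand to roughly match the true boundary $0.8 x_1 + x_2 - 3 = 0$, not solver output:

import numpy as np

def classify(x, w, w0):
    # assign a point to C1 or C2 by the sign of g(x) = w'x + w0
    return 'C1' if w @ x + w0 > 0 else 'C2'

w, w0 = np.array([0.8, 1.0]), -3.0
print(classify(np.array([5.0, 2.0]), w, w0))   # 'C1' (above the line)
print(classify(np.array([1.0, 0.0]), w, w0))   # 'C2' (below the line)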
In [3]:
# see how the data are generated
xp = np.linspace(0,8,100).reshape(-1,1)
ypt = -0.8*xp + 3

plt.figure(figsize=(10, 6))
plt.plot(x1[C1], x2[C1], 'ro', label='C1')
plt.plot(x1[C2], x2[C2], 'bo', label='C2')
plt.plot(xp, ypt, '--k', label='True')
plt.plot(xp, ypt-1, '-k')
plt.plot(xp, ypt+1, '-k')
plt.title('Linearly and Strictly Separable Classes', fontweight='bold', fontsize=15)
plt.xlabel('$x_1$', fontsize=10)
plt.ylabel('$x_2$', fontsize=10)
plt.legend(loc=4)
plt.xlim([0, 8])
plt.ylim([-4, 3])
plt.show()
3.1. LP Formulation

$n$ (= 2) features, $m = N + M$ data points in the training set:

$$x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix} \text{ with } \omega = \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix}, \qquad \text{or} \qquad x^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ x_2^{(i)} \end{bmatrix} \text{ with } \omega = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix}$$

- $N$ points belong to $C_1$ in the training set
- $M$ points belong to $C_2$ in the training set
- $\omega$ and $\omega_0$ are the unknown variables

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & \omega^T x^{(1)} + \omega_0 \geq 1, \;\ldots,\; \omega^T x^{(N)} + \omega_0 \geq 1 \\ & \omega^T x^{(N+1)} + \omega_0 \leq -1, \;\ldots,\; \omega^T x^{(N+M)} + \omega_0 \leq -1 \end{align*}$$

or, with the bias absorbed into $\omega$,

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & \omega^T x^{(1)} \geq 1, \;\ldots,\; \omega^T x^{(N)} \geq 1 \\ & \omega^T x^{(N+1)} \leq -1, \;\ldots,\; \omega^T x^{(N+M)} \leq -1 \end{align*}$$

The code (CVXPY) uses the matrix form, with

$$X_1 = \begin{bmatrix} \left(x^{(1)}\right)^T \\ \vdots \\ \left(x^{(N)}\right)^T \end{bmatrix}, \qquad X_2 = \begin{bmatrix} \left(x^{(N+1)}\right)^T \\ \vdots \\ \left(x^{(N+M)}\right)^T \end{bmatrix}$$

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & X_1 \omega + \omega_0 \geq 1 \\ & X_2 \omega + \omega_0 \leq -1 \end{align*} \qquad \text{or} \qquad \begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & X_1 \omega \geq 1 \\ & X_2 \omega \leq -1 \end{align*}$$
Form 1:

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & X_1 \omega + \omega_0 \geq 1 \\ & X_2 \omega + \omega_0 \leq -1 \end{align*}$$

In [4]:
# CVXPY for simple classification
import cvxpy as cvx

X1 = np.hstack([x1[C1], x2[C1]])
X2 = np.hstack([x1[C2], x2[C2]])
X1 = np.asmatrix(X1)
X2 = np.asmatrix(X2)

N = X1.shape[0]
M = X2.shape[0]

In [5]:
w = cvx.Variable(2,1)
w0 = cvx.Variable(1,1)

obj = cvx.Minimize(1)    # feasibility problem: any constant objective works
const = [X1*w + w0 >= 1, X2*w + w0 <= -1]
prob = cvx.Problem(obj, const).solve()

w = w.value
w0 = w0.value
In [6]:
xp = np.linspace(0,8,100).reshape(-1,1)
yp = - w[0,0]/w[1,0]*xp - w0/w[1,0]

plt.figure(figsize=(10, 6))
plt.plot(X1[:,0], X1[:,1], 'ro', label='C1')
plt.plot(X2[:,0], X2[:,1], 'bo', label='C2')
plt.plot(xp, yp, 'k', label='SVM')
plt.plot(xp, ypt, '--k', label='True')
plt.xlim([0,8])
plt.xlabel('$x_1$', fontsize=10)
plt.ylabel('$x_2$', fontsize=10)
plt.legend(loc=4, fontsize=15)
plt.show()

Form 2 (bias absorbed):

$$\begin{align*} \text{minimize} \quad & \text{something} \\ \text{subject to} \quad & X_1 \omega \geq 1 \\ & X_2 \omega \leq -1 \end{align*}$$
In [7]:
N = C1.shape[0]
M = C2.shape[0]

X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X2 = np.hstack([np.ones([M,1]), x1[C2], x2[C2]])
X1 = np.asmatrix(X1)
X2 = np.asmatrix(X2)

w = cvx.Variable(3,1)
obj = cvx.Minimize(1)
const = [X1*w >= 1, X2*w <= -1]
prob = cvx.Problem(obj, const).solve()

w = w.value

xp = np.linspace(0,8,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize=(10, 6))
plt.plot(X1[:,1], X1[:,2], 'ro', label='C1')
plt.plot(X2[:,1], X2[:,2], 'bo', label='C2')
plt.plot(xp, yp, 'k', label='SVM')
plt.plot(xp, ypt, '--k', label='True')
plt.xlim([0,8])
plt.xlabel('$x_1$', fontsize=10)
plt.ylabel('$x_2$', fontsize=10)
plt.legend(loc=4, fontsize=15)
plt.show()
3.2. Outlier

Note that in the real world you may have noise, errors, or outliers that do not accurately represent the actual phenomenon.

Non-separable case:
- No solution (hyperplane) exists
- We will allow some training examples to be misclassified, but we want their number to be minimized

In [8]:
X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X2 = np.hstack([np.ones([M,1]), x1[C2], x2[C2]])

# an outlier labeled C2 but lying on the C1 side of the true boundary
# (coordinates reconstructed; any point on the wrong side works)
outlier = np.array([1, 3, 2]).reshape(-1,1)
X2 = np.vstack([X2, outlier.T])

X1 = np.asmatrix(X1)
X2 = np.asmatrix(X2)

plt.figure(figsize=(10, 6))
plt.plot(X1[:,1], X1[:,2], 'ro', label='C1')
plt.plot(X2[:,1], X2[:,2], 'bo', label='C2')
plt.xlim([0,8])
plt.xlabel('$x_1$', fontsize=10)
plt.ylabel('$x_2$', fontsize=10)
plt.legend(loc=4, fontsize=15)
plt.show()
In [9]:
w = cvx.Variable(3,1)
obj = cvx.Minimize(1)
const = [X1*w >= 1, X2*w <= -1]
prob = cvx.Problem(obj, const).solve()

print(w.value)

None

The problem is infeasible: with the outlier included, no hyperplane satisfies all the constraints.

3.3. LP Formulation

$n$ (= 2) features, $m = N + M$ data points in the training set:

$$x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}$$

- $N$ points belong to $C_1$, $M$ points belong to $C_2$ in the training set
- $\omega$ and $\omega_0$ are the (unknown) variables

For the non-separable case, we relax the above constraints. We need slack variables $u$ and $v$, where all entries are non-negative. The optimization problem for the non-separable case:

$$\begin{align*} \text{minimize} \quad & \sum_{i=1}^{N} u_i + \sum_{i=1}^{M} v_i \\ \text{subject to} \quad & \omega^T x^{(1)} + \omega_0 \geq 1 - u_1, \;\ldots,\; \omega^T x^{(N)} + \omega_0 \geq 1 - u_N \\ & \omega^T x^{(N+1)} + \omega_0 \leq -(1 - v_1), \;\ldots,\; \omega^T x^{(N+M)} + \omega_0 \leq -(1 - v_M) \\ & u \geq 0, \;\; v \geq 0 \end{align*}$$
Expressed in matrix form:

$$X_1 = \begin{bmatrix} \left(x^{(1)}\right)^T \\ \vdots \\ \left(x^{(N)}\right)^T \end{bmatrix}, \qquad X_2 = \begin{bmatrix} \left(x^{(N+1)}\right)^T \\ \vdots \\ \left(x^{(N+M)}\right)^T \end{bmatrix}, \qquad u = \begin{bmatrix} u_1 \\ \vdots \\ u_N \end{bmatrix}, \qquad v = \begin{bmatrix} v_1 \\ \vdots \\ v_M \end{bmatrix}$$

$$\begin{align*} \text{minimize} \quad & \mathbf{1}^T u + \mathbf{1}^T v \\ \text{subject to} \quad & X_1 \omega + \omega_0 \geq 1 - u \\ & X_2 \omega + \omega_0 \leq -(1 - v) \\ & u \geq 0, \;\; v \geq 0 \end{align*} \qquad \text{or, with the bias absorbed,} \qquad \begin{align*} \text{minimize} \quad & \mathbf{1}^T u + \mathbf{1}^T v \\ \text{subject to} \quad & X_1 \omega \geq 1 - u \\ & X_2 \omega \leq -(1 - v) \\ & u \geq 0, \;\; v \geq 0 \end{align*}$$
In [10]:
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X2 = np.hstack([np.ones([C2.shape[0],1]), x1[C2], x2[C2]])

# outlier coordinates reconstructed; any C2-labeled point on the C1 side works
outlier = np.array([1, 2, 2]).reshape(-1,1)
X2 = np.vstack([X2, outlier.T])

X1 = np.asmatrix(X1)
X2 = np.asmatrix(X2)

N = X1.shape[0]
M = X2.shape[0]

w = cvx.Variable(3,1)
u = cvx.Variable(N,1)
v = cvx.Variable(M,1)

obj = cvx.Minimize(np.ones((1,N))*u + np.ones((1,M))*v)
const = [X1*w >= 1-u, X2*w <= -(1-v), u >= 0, v >= 0]
prob = cvx.Problem(obj, const).solve()

w = w.value

xp = np.linspace(0,8,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize=(10, 6))
plt.plot(X1[:,1], X1[:,2], 'ro', label='C1')
plt.plot(X2[:,1], X2[:,2], 'bo', label='C2')
plt.plot(xp, yp, '--k', label='SVM')
plt.plot(xp, yp-1/w[2,0], '-k')
plt.plot(xp, yp+1/w[2,0], '-k')
plt.xlim([0,8])
plt.xlabel('$x_1$', fontsize=10)
plt.ylabel('$x_2$', fontsize=10)
plt.legend(loc=4, fontsize=15)
plt.show()
Further improvement:
- Notice that the hyperplane does not represent the true division as accurately, due to the outlier.
- Can we do better when there are noisy data or outliers? Yes, but we need to look beyond LP.
- Idea: a large margin leads to good generalization on the test data.

3.4. Maximize Margin (Finally, it is Support Vector Machine)

Distance (= margin):

$$\text{margin} = \frac{2}{\|\omega\|_2}$$

Minimize $\|\omega\|_2$ to maximize the margin. This gives multiple objectives; use gamma ($\gamma$) as a weighting between the following:
- A bigger margin, giving robustness to outliers
- A hyperplane that makes few (or no) errors

$$\begin{align*} \text{minimize} \quad & \|\omega\|_2 + \gamma \left( \mathbf{1}^T u + \mathbf{1}^T v \right) \\ \text{subject to} \quad & X_1 \omega + \omega_0 \geq 1 - u \\ & X_2 \omega + \omega_0 \leq -(1 - v) \\ & u \geq 0, \;\; v \geq 0 \end{align*}$$
In [11]:
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X2 = np.hstack([np.ones([C2.shape[0],1]), x1[C2], x2[C2]])

outlier = np.array([1, 2, 2]).reshape(-1,1)   # same reconstructed outlier as above
X2 = np.vstack([X2, outlier.T])

X1 = np.asmatrix(X1)
X2 = np.asmatrix(X2)

N = X1.shape[0]
M = X2.shape[0]

g = 2    # weighting gamma (value reconstructed)

w = cvx.Variable(3,1)
u = cvx.Variable(N,1)
v = cvx.Variable(M,1)

obj = cvx.Minimize(cvx.norm(w,2) + g*(np.ones((1,N))*u + np.ones((1,M))*v))
const = [X1*w >= 1-u, X2*w <= -(1-v), u >= 0, v >= 0]
prob = cvx.Problem(obj, const).solve()

w = w.value

xp = np.linspace(0,8,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize=(10, 6))
plt.plot(X1[:,1], X1[:,2], 'ro', label='C1')
plt.plot(X2[:,1], X2[:,2], 'bo', label='C2')
plt.plot(xp, yp, '--k', label='SVM')
plt.plot(xp, yp-1/w[2,0], '-k')
plt.plot(xp, yp+1/w[2,0], '-k')
plt.xlim([0,8])
plt.xlabel('$x_1$', fontsize=10)
plt.ylabel('$x_2$', fontsize=10)
plt.legend(loc=4, fontsize=15)
plt.show()
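To see the margin/slack trade-off that $\gamma$ controls, the problem can be re-solved for several values (a sketch reusing X1, X2, N, M from the cell above; the $\gamma$ values are illustrative, not from the original notes):

# smaller gamma favors a wide margin; larger gamma penalizes slack more heavily
for g in [0.1, 1, 10]:
    w = cvx.Variable(3,1)
    u = cvx.Variable(N,1)
    v = cvx.Variable(M,1)
    obj = cvx.Minimize(cvx.norm(w,2) + g*(np.ones((1,N))*u + np.ones((1,M))*v))
    const = [X1*w >= 1-u, X2*w <= -(1-v), u >= 0, v >= 0]
    cvx.Problem(obj, const).solve()
    # report margin width 2/||w|| (excluding the bias entry) and total slack
    print(g, 2/np.linalg.norm(w.value[1:]), u.value.sum() + v.value.sum())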
4. Support Vector Machine

Probably the most popular/influential classification algorithm.

- A hyperplane-based classifier (like the perceptron)
- Additionally uses the maximum margin principle:
  - maximize the distance (margin) of the closest samples from the decision line: maximize {minimum distance}
  - note: the perceptron only utilizes the sign of the distance
- Finds the hyperplane with the maximum separation margin on the training data

$$\begin{align*} \text{minimize} \quad & \|\omega\|_2 + \gamma \left( \mathbf{1}^T u + \mathbf{1}^T v \right) \\ \text{subject to} \quad & X_1 \omega + \omega_0 \geq 1 - u \\ & X_2 \omega + \omega_0 \leq -(1 - v) \\ & u \geq 0, \;\; v \geq 0 \end{align*}$$

In a more compact form, with labels $y_n \in \{+1, -1\}$:

$$\omega^T x_n + \omega_0 \geq 1 \;\text{ for } y_n = +1, \qquad \omega^T x_n + \omega_0 \leq -1 \;\text{ for } y_n = -1 \;\;\Longleftrightarrow\;\; y_n \left( \omega^T x_n + \omega_0 \right) \geq 1$$

$$\begin{align*} \text{minimize} \quad & \|\omega\|_2 + \gamma \left( \mathbf{1}^T \xi \right) \\ \text{subject to} \quad & y_n \left( \omega^T x_n + \omega_0 \right) \geq 1 - \xi_n \\ & \xi \geq 0 \end{align*}$$
In [12]:
# SVM in a compact form
X1 = np.hstack([np.ones([C1.shape[0],1]), x1[C1], x2[C1]])
X2 = np.hstack([np.ones([C2.shape[0],1]), x1[C2], x2[C2]])

outlier = np.array([1, 2, 2]).reshape(-1,1)   # same reconstructed outlier as above
X2 = np.vstack([X2, outlier.T])

X1 = np.asmatrix(X1)
X2 = np.asmatrix(X2)

N = X1.shape[0]
M = X2.shape[0]
m = N + M

X = np.vstack([X1, X2])
y = np.vstack([np.ones([N,1]), -np.ones([M,1])])

g = 2    # weighting gamma (value reconstructed)

w = cvx.Variable(3,1)
d = cvx.Variable(m,1)

obj = cvx.Minimize(cvx.norm(w,2) + g*(np.ones([1,m])*d))
const = [cvx.mul_elemwise(y, X*w) >= 1-d, d >= 0]
prob = cvx.Problem(obj, const).solve()

w = w.value

xp = np.linspace(0,8,100).reshape(-1,1)
yp = - w[1,0]/w[2,0]*xp - w[0,0]/w[2,0]

plt.figure(figsize=(10, 6))
plt.plot(X1[:,1], X1[:,2], 'ro', label='C1')
plt.plot(X2[:,1], X2[:,2], 'bo', label='C2')
plt.plot(xp, yp, '--k', label='SVM')
plt.plot(xp, yp-1/w[2,0], '-k')
plt.plot(xp, yp+1/w[2,0], '-k')
plt.xlim([0,8])
plt.xlabel('$x_1$', fontsize=10)
plt.ylabel('$x_2$', fontsize=10)
plt.legend(loc=4, fontsize=15)
plt.show()
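Once $\omega$ is learned, classifying new points only requires the sign of $\omega^T x$ (a sketch reusing the w computed above; the test points are illustrative, and the comments assume the solver found a line near the true boundary):

# build [1, x1, x2] feature rows to match the absorbed bias, then take the sign
def predict(points, w):
    Xnew = np.hstack([np.ones([points.shape[0], 1]), points])
    return np.sign(Xnew @ np.asarray(w))   # +1 -> C1, -1 -> C2

print(predict(np.array([[5.0, 2.0], [1.0, 0.0]]), w))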
5. Nonlinear Support Vector Machine

5.1. Kernel

Often we want to capture nonlinear patterns in the data:
- nonlinear regression: the input-output relationship may not be linear
- nonlinear classification: classes may not be separable by a linear boundary

Linear models (e.g., linear regression, linear SVM) are not rich enough for these cases.

Kernels make linear models work in nonlinear settings:
- map the data to higher dimensions where it exhibits linear patterns
- apply the linear model in the new input (feature) space
- mapping = changing the feature representation

Note: such mappings can be expensive to compute in general. Kernels give such mappings for (almost) free: in most cases, the mappings need not even be computed, using the kernel trick!
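A small numeric sketch of the kernel trick (not in the original notes): for the degree-2 map used in the next subsection, $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, the inner product in feature space equals $(x^T y)^2$, so it can be evaluated without ever computing $\phi$:

import numpy as np

def phi(x):
    # explicit degree-2 feature map
    return np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(phi(x) @ phi(y))   # inner product in the mapped space
print((x @ y)**2)        # same value, computed directly in the original space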
5.2. Classifying non-linear separable data

Consider a binary classification problem where each example is represented by a single feature $x$, and no linear separator exists for the data. Now map each example as

$$x \;\mapsto\; \{x, \; x^2\}$$

The data now become linearly separable in the new representation: a separator that is linear in the new representation is nonlinear in the original one.

Let's look at another example. Each example is defined by two features, $x = \{x_1, x_2\}$, and no linear separator exists for the data. Now map each example as

$$x = \{x_1, \; x_2\} \;\mapsto\; z = \{x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2\}$$

Each example now has three features (derived from the old representation), and the data become linearly separable in the new representation (the 1-D case is sketched in code below).
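A few-line sketch of the 1-D example (illustrative data, not from the original notes): points far from the origin vs. points near it are not separable on the line, but become separable after mapping $x \mapsto \{x, x^2\}$:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([-3, -2, 2, 3, -1, 0, 1])      # outer points, then inner points
y = np.array([ 1,  1, 1, 1, -1, -1, -1])    # labels: outer = +1, inner = -1

z = np.vstack([x, x**2]).T                  # map each x to (x, x^2)

plt.plot(z[y==1,0],  z[y==1,1],  'ro', label='outer')
plt.plot(z[y==-1,0], z[y==-1,1], 'bo', label='inner')
plt.plot([-3, 3], [1.5, 1.5], '--k')        # e.g. x^2 = 1.5 now separates them
plt.legend()
plt.show()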
In [13]:
%%html
<center><iframe width="420" height="315" src=" frameborder="0" allowfullscreen>
</iframe></center>

SVM with polynomial kernel visualization
In [14]:
# nonlinear data (coordinates partly reconstructed from the original)
X1 = np.array([[-1.1, 0], [-0.3, 0.1], [-0.9, 1], [0.8, 0.4], [0.4, 0.9], [0.3, -0.6],
               [-0.5, 0.3], [-0.8, 0.6], [-0.5, -0.5]])
X2 = np.array([[-1, -1.3], [-1.6, 2.2], [0.9, -0.7], [1.6, 0.5], [1.8, -1.1], [1.6, 1.6],
               [-1.6, -1.7], [-1.4, 1.8], [1.6, -0.9], [0, -1.6], [0.3, 1.7], [-1.6, 0], [-2.1, 0.2]])

X1 = np.asmatrix(X1)
X2 = np.asmatrix(X2)

plt.figure(figsize=(10, 6))
plt.plot(X1[:,0], X1[:,1], 'ro', label='C1')
plt.plot(X2[:,0], X2[:,1], 'bo', label='C2')
plt.axis([-3, 3, -3, 3])
plt.legend(loc=4, fontsize=15)
plt.show()

In [15]:
N = X1.shape[0]
M = X2.shape[0]

X = np.vstack([X1, X2])
y = np.vstack([np.ones([N,1]), -np.ones([M,1])])

X = np.asmatrix(X)
y = np.asmatrix(y)

m = N + M

Z = np.hstack([np.ones([m,1]), np.square(X[:,0]),
               np.sqrt(2)*np.multiply(X[:,0], X[:,1]), np.square(X[:,1])])
In [16]:
g = 2    # weighting gamma (value reconstructed)

w = cvx.Variable(4,1)
d = cvx.Variable(m,1)

obj = cvx.Minimize(cvx.norm(w,2) + g*np.ones([1,m])*d)
const = [cvx.mul_elemwise(y, Z*w) >= 1-d, d >= 0]
prob = cvx.Problem(obj, const).solve()

w = w.value
print(w)

In [17]:
# to plot the decision region
[X1gr, X2gr] = np.meshgrid(np.arange(-3,3,0.1), np.arange(-3,3,0.1))

test_X = np.hstack([X1gr.reshape(-1,1), X2gr.reshape(-1,1)])
test_X = np.asmatrix(test_X)

m = test_X.shape[0]
test_Z = np.hstack([np.ones([m,1]), np.square(test_X[:,0]),
                    np.sqrt(2)*np.multiply(test_X[:,0], test_X[:,1]), np.square(test_X[:,1])])

q = test_Z*w

In [18]:
B = []
for i in range(m):
    if q[i,0] > 0:
        B.append(test_X[i,:])

# B = np.array(B).reshape(-1, 2)
B = np.vstack(B)
In [19]:
plt.figure(figsize=(10, 6))
plt.plot(X1[:,0], X1[:,1], 'ro')
plt.plot(X2[:,0], X2[:,1], 'bo')
plt.plot(B[:,0], B[:,1], 'k.')
plt.axis([-3, 3, -3, 3])
plt.show()
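For comparison (not in the original notes), scikit-learn's kernel SVM can fit the same data with a degree-2 polynomial kernel directly, never forming Z explicitly; the kernel parameters and C below are illustrative assumptions:

# sketch: reuses X and y from In [15]; gamma=1, coef0=0 mirror the homogeneous
# degree-2 map above, and C=1 is an arbitrary regularization choice
from sklearn import svm

clf = svm.SVC(kernel='poly', degree=2, gamma=1, coef0=0, C=1)
clf.fit(np.asarray(X), np.asarray(y).ravel())
print(clf.predict([[0, 0], [2, 2]]))   # expected: +1 (inner point), -1 (outer point)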