Supervised Learning. 1. Optimization. without Scikit Learn. by Prof. Seungchul Lee Industrial AI Lab

Size: px

Start display at page:

Download "Supervised Learning. 1. Optimization. without Scikit Learn. by Prof. Seungchul Lee Industrial AI Lab"

Clinton Ball
6 years ago
Views:

1 Supervised Learning without Scikit Learn by Prof. Seungchul Lee Industrial AI Lab POSECH able of Contents I.. Optimization II.. Linear Regression III. 3. Classification (Linear) I. 3.. Distance II. 3.. Illustrative Example I Optimization Formulation II Outlier III Optimization Formulation III Maximize Margin (Finally, it is Support Vector Machine) IV Logistic Regression. Optimization an important tool in ) engineering problem solving and ) decision science peolple optimize nature optimizes

(source: http://nautil.us/blog/to-save-drowning-people-ask-yourself-what-would-light-do (http://nautil.us/blog/tosave-drowning-people-ask-yourself-what-would-light-do)) 3 key components.

Once the model has been formulated, optimization algorithm can be used to find its solutions.

2 (source: ( 3 key components. objective. decision variable or unknown 3. constraints Procedures. he process of identifying objective, variables, and constraints for a given problem is known as "modeling". Once the model has been formulated, optimization algorithm can be used to find its solutions. In mathematical expression min x subject to f(x) g i (x) 0, i =,, m Remarks) equivalent min f(x) x (x) 0 g i h(x) = 0 max f(x) x g i (x) 0 h(x) 0 and { h(x) 0 he good news: for many classes of optimization problems, people have already done all the "hardwork" of developing numerical algorithms

3 max x subject to x + x x + x 9 x + x 5 x x 5 min x subject to x [ ] [ ] x x [ ] [ 9 ] [ ] x 5 x [ ] [ ] [ ] 5 x In []: import numpy as np import matplotlib.pyplot as plt import cvxpy as cvx f = np.array([[, ]]) A = np.array([[, ], [, ]]) b = np.array([[9], [5]]) lb = np.array([[], [5]]) x = cvx.variable(,) obj = cvx.minimize(-f*x) const = [A*x <= b, lb <= x] prob = cvx.problem(obj, const).solve() print (x.value) [[.] [ 7.]]. Linear Regression Begin by considering linear regression (easy to extend to more comlex predictions later on) Given { x : inputs i : outputs, find θ and θ yi y^i : predicted output x x x =, y = = + x m y y y m y^i θ x i θ θ θ = [ ] : Model parameters θ y^i = f( x i, θ) in general

4 In many cases, a linear model to predict y i can be used m θ θ y^i = θx i + θ such that min ( y^i y i ), i=

5 In []: import numpy as np import matplotlib.pyplot as plt # data points in column vector [input, output] x = np.array([0., 0.4, 0.7,.,.3,.7,.,.8, 3.0, 4.0, 4.3, 4.4, 4. 9]).reshape(-, ) y = np.array([0.5, 0.9,.,.5,.5,.0,.,.8,.7, 3.0, 3.5, 3.7, 3. 9]).reshape(-, ) # to plot plt.figure(figsize=(0, 6)) plt.plot(x, y, 'ko', label="data") plt.xlabel('x', fontsize=5) plt.ylabel('y', fontsize=5) plt.axis('scaled') plt.grid(alpha=0.3) plt.xlim([0, 5]) plt.show()

6 Use CVXPY optimization (least squared) For convenience, we define a function that maps inputs to feature vectors, ϕ x i θ y^i = [ ] [ ] θ x i θ = [ ] [ ] = ϕ ( x i )θ θ, x i feature vector ϕ( x i ) = [ ] x ϕ ( x ) y^ x ϕ ( x ) y^ Φ = = = = Φθ y^ x m ϕ ( ) x m y^m Model parameter estimation min θ y^ y = min Φθ y θ In [3]: import cvxpy as cvx m = y.shape[0] #A = np.hstack([x, np.ones([m, ])]) A = np.hstack([x, x**0]) A = np.asmatrix(a) theta = cvx.variable(, ) obj = cvx.minimize(cvx.norm(a*theta-y, )) cvx.problem(obj,[]).solve() print('theta:\n', theta.value) theta: [[ ] [ ]]

7 In [4]: # to plot plt.figure(figsize=(0, 6)) plt.title('$l_$ Regression', fontsize=5) plt.xlabel('x', fontsize=5) plt.ylabel('y', fontsize=5) plt.plot(x, y, 'ko', label="data") # to plot a straight line (fitted line) xp = np.arange(0, 5, 0.0).reshape(-, ) theta = theta.value yp = theta[0,0]*xp + theta[,0] plt.plot(xp, yp, 'r', linewidth=, label="$l_$") plt.legend(fontsize=5) plt.axis('scaled') plt.grid(alpha=0.3) plt.xlim([0, 5]) plt.show() 3. Classification (Linear) Figure out, autonomously, which category (or class) an unknown item should be categorized into Number of categories / classes Binary: different classes Multiclass: more than classes Feature he measurable parts that make up the unknown item (or the information you have available to categorize)

8 Perceptron: make use of sign of data Discuss it later Logistic regression is a classification algorithm don't be confused o find a classification boundary Sign Sign with respect to a line ω ω = [ x ], x = [ ] g(x) = ω x + ω x + ω 0 = ω x + ω 0 ω x ω 0 ω ω = ω, x = x g(x) = ω 0 + ω x + ω x = ω x x

lying on the hyperplane How to find ω All

9 Perceptron Hyperplane Separates a D-dimensional space into two half-spaces Defined by an outward pointing normal vector ω is orthogonal to any vector lying on the hyperplane How to find ω All data in class g( x) > 0 ω All data in class 0 g( ω x) < 0

Perceptron Algorithm he perceptron implements h(x) = sign( ω x) Given the training set (, ), (, ),, (, ) where {, } x y x y x N y N y i ) pick a misclassified point sign( ω x n ) y n ) and update the

10 Perceptron Algorithm he perceptron implements h(x) = sign( ω x) Given the training set (, ), (, ),, (, ) where {, } x y x y x N y N y i ) pick a misclassified point sign( ω x n ) y n ) and update the weight vector ω ω + y n x n Why perceptron updates work? Let's look at a misclassified positive example ( ) perceptron (wrongly) thinks ω < 0 old x n y n = + updates would be ω new ω newx n = + = + ω old y n x n ω old x n = ( + = + ω old x n ) x n ω old x n x nx n hus ω newx n is less negative than ω old x n In [5]: import numpy as np import matplotlib.pyplot as plt % matplotlib inline

11 In [6]: #training data gerneration m = 00 x = 8*np.random.rand(m, ) x = 7*np.random.rand(m, ) - 4 g0 = 0.8*x + x - 3 g = g0 - g = g0 + In [7]: C = np.where(g >= 0) C = np.where(g < 0) print(c) (array([ 0,, 4, 5, 6, 7, 0, 7,, 4, 5, 7, 8, 30, 3, 38, 5, 5, 53, 56, 57, 6, 6, 64, 70, 78, 85, 86, 89, 98], dtype=int64), a rray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)) In [8]: C = np.where(g >= 0)[0] C = np.where(g < 0)[0] print(c.shape) print(c.shape) (30,) (4,) In [9]: plt.figure(figsize=(0, 6)) plt.plot(x[c], x[c], 'ro', label='c') plt.plot(x[c], x[c], 'bo', label='c') plt.title('linearly seperable classes', fontsize=5) plt.legend(loc='upper left', fontsize=5) plt.xlabel(r'$x_$', fontsize=0) plt.ylabel(r'$x_$', fontsize=0) plt.show()

12 x y = = = ( x () ) ( x () ) ( x (3) ) ( x (m) ) y () y () y (3) y (m) x () x () x (3) x (m) x () x () x (3) x (m) In [0]: X = np.hstack([np.ones([c.shape[0],]), x[c], x[c]]) X = np.hstack([np.ones([c.shape[0],]), x[c], x[c]]) X = np.vstack([x, X]) y = np.vstack([np.ones([c.shape[0],]), -np.ones([c.shape[0],])]) X = np.asmatrix(x) y = np.asmatrix(y) where (x, y) is a misclassified training point ω 3 ω = ω ω ω ω + yx In []: w = np.ones([3,]) w = np.asmatrix(w) n_iter = y.shape[0] for k in range(n_iter): for i in range(n_iter): if y[i,0]!= np.sign(x[i,:]*w)[0,0]: w += y[i,0]*x[i,:]. print(w) [[-3. ] [ ] [ ]] Not a unique solution g(x) = ω x + ω 0 = ω x + ω x + ω 0 = 0 ω x = ω 0 x ω ω

13 In []: xp = np.linspace(0,8,00).reshape(-,) xp = - w[,0]/w[,0]*xp - w[0,0]/w[,0] plt.figure(figsize=(0, 6)) plt.scatter(x[c], x[c], c='r', s=50, label='c') plt.scatter(x[c], x[c], c='b', s=50, label='c') plt.plot(xp, xp, c='k', label='perceptron') plt.xlim([0,8]) plt.xlabel('$x_$', fontsize = 0) plt.ylabel('$x_$', fontsize = 0) plt.legend(loc = 4, fontsize = 5) plt.show() Perceptron 3.. Distance ω = [ ], x = [ ] g(x) = x + = + + ω ω x x ω ω 0 ω x ω x ω 0

14 Find a distance between g(x) = and g(x) = suppose g( x ) =, g( x ) = ω x + ω 0 = ω ( x x ) = ω x + ω 0 = s =, = ( ) = ω ω x x ω ω x x ω

15 3.. Illustrative Example Binary classification: C and C Features: the coordinate of ith data x x = [ ] x Is it possible to distinguish between C and C by its coordinates? We need to find a separating hyperplane (or a line in D) ω x + ω x + ω 0 = 0 x [ ] [ ω ω ] + ω 0 = 0 x ω x + ω 0 = 0 In [3]: import numpy as np import matplotlib.pyplot as plt #training data gerneration x = 8*np.random.rand(00, ) x = 7*np.random.rand(00, ) - 4 g0 = 0.8*x + x - 3 g = g0 - g = g0 + C = np.where(g >= 0)[0] C = np.where(g < 0)[0]

16 In [4]: xp = np.linspace(0,8,00).reshape(-,) ypt = -0.8*xp + 3 plt.figure(figsize=(0, 6)) plt.plot(x[c], x[c], 'ro', label='c') plt.plot(x[c], x[c], 'bo', label='c') plt.plot(xp, ypt, '--k', label='rue') plt.title('linearly and strictly separable classes', fontweight = 'bold', fo ntsize = 5) plt.xlabel('$x_$', fontsize = 0) plt.ylabel('$x_$', fontsize = 0) plt.legend(loc = 4) plt.xlim([0, 8]) plt.ylim([-4, 3]) plt.show()

17 Given: Hyperplane defined by ω and ω 0 Animals coordinates (or features) x Decision making: ω x + ω0 > 0 x belongs to C ω x + ω 0 < 0 x belongs to C Find and such that given ω ω 0 or ω ω 0 x x + = 0 Find and such that given and given ω ω 0 ω ω 0 x C ω x + ω 0 > x C x + < Same problem if strictly separable ω x + ω 0 > b ω ω 0 b ω ω 0 x + > b x + >

18 In [5]: # see how data are generated xp = np.linspace(0,8,00).reshape(-,) ypt = -0.8*xp + 3 plt.figure(figsize=(0, 6)) plt.plot(x[c], x[c], 'ro', label='c') plt.plot(x[c], x[c], 'bo', label='c') plt.plot(xp, ypt, '--k', label='true') plt.plot(xp, ypt-, '-k') plt.plot(xp, ypt+, '-k') plt.title('linearly and strictly separable classes', fontweight = 'bold', fo ntsize = 5) plt.xlabel('$x_$', fontsize = 0) plt.ylabel('$x_$', fontsize = 0) plt.legend(loc = 4) plt.xlim([0, 8]) plt.ylim([-4, 3]) plt.show() 3... Optimization Formulation n (= ) m = N + M features data points in training set x (i) x = [ (i) ω ] with ω = [ ] or = with ω = x (i) x (i) ω x (i) x (i) ω 0 ω ω N belongs to C in training set M belongs to C in training set ω and ω 0 are the unknown variables

19 minimize something minimize something subject to + + ω x () ω 0 ω x () ω 0 ω x (N) ω 0 + or subject to ω x () ω x () ω x (N) ω x (N+) ω 0 ω x (N+) ω 0 ω x (N+M) ω ω x (N+) ω x (N+) ω x (N+M) Code (CVXPY) X ( x () ) ( x () ) = = ( x (N) ) x () x () x (N) x () x () x (N) X = = ( x (N+) ) ( x (N+) ) ( x (N+M) ) x (N+) x (N+) x (N+M) x (N+) x (N+) x (N+M) minimize subject to something X ω X ω minimize subject to something X ω X ω In [6]: # CVXPY using simple classification import cvxpy as cvx N = C.shape[0] M = C.shape[0] X = np.hstack([np.ones([n,]), x[c], x[c]]) X = np.hstack([np.ones([m,]), x[c], x[c]]) X = np.asmatrix(x) X = np.asmatrix(x)

20 In [7]: w = cvx.variable(3,) obj = cvx.minimize() const = [X*w >=, X*w <= -] prob = cvx.problem(obj, const).solve() w = w.value In [8]: xp = np.linspace(0,8,00).reshape(-,) yp = - w[,0]/w[,0]*xp - w[0,0]/w[,0] plt.figure(figsize=(0, 6)) plt.plot(x[:,], X[:,], 'ro', label='c') plt.plot(x[:,], X[:,], 'bo', label='c') plt.plot(xp, yp, 'k', label='svm') plt.plot(xp, ypt, '--k', label='true') plt.xlim([0,8]) plt.xlabel('$x_$', fontsize = 0) plt.ylabel('$x_$', fontsize = 0) plt.legend(loc = 4, fontsize = 5) plt.show() 3... Outlier Note that in the real world, you may have noise, errors, or outliers that do not accurately represent the actual phenomena Non-separable case No solutions (hyperplane) exist We will allow some training examples to be misclassified! but we want their number to be minimized

21 In [9]: X = np.hstack([np.ones([n,]), x[c], x[c]]) X = np.hstack([np.ones([m,]), x[c], x[c]]) outlier = np.array([, 3, ]).reshape(-,) X = np.vstack([x, outlier.]) X = np.asmatrix(x) X = np.asmatrix(x) plt.figure(figsize=(0, 6)) plt.plot(x[:,], X[:,], 'ro', label='c') plt.plot(x[:,], X[:,], 'bo', label='c') plt.xlim([0,8]) plt.xlabel('$x_$', fontsize = 0) plt.ylabel('$x_$', fontsize = 0) plt.legend(loc = 4, fontsize = 5) plt.show() minimize subject to something X ω X ω In [0]: w = cvx.variable(3,) obj = cvx.minimize() const = [X*w >=, X*w <= -] prob = cvx.problem(obj, const).solve() print(w.value) None No solutions (hyperplane) exist We will allow some training examples to be misclassified! but we want their number to be minimized

22 3..3. Optimization Formulation n (= ) m = N + M features data points in a training set x i = x (i) with ω = x (i) N belongs to C in training set M belongs to C in training set ω and ω 0 are the variables (unknown) ω 0 ω ω minimize subject to something X ω X ω For the non-separable case, we relex the above constraints Need slack variables u and υ where all are positive he optimization problem for the non-separable case minimize subject to N u i i= + M υ i i= ω x () u ω x () u ω x (N) u N ( ) ( ) ω x (N+) υ ω x (N+) υ ω x (N+M) { u 0 v 0 ( ) υ M

23 Expressed in a matrix form X ( x () ) ( x () ) = = ( x (N) ) x () x () x (N) x () x () x (N) X = = ( x (N+) ) ( x (N+) ) ( x (N+M) ) x (N+) x (N+) x (N+M) x (N+) x (N+) x (N+M) u υ = = u u N υ υ M minimize subject to u + υ X ω u X ω ( υ) u 0 υ 0

24 In []: X = np.hstack([np.ones([c.shape[0],]), x[c], x[c]]) X = np.hstack([np.ones([c.shape[0],]), x[c], x[c]]) outlier = np.array([,, ]).reshape(-,) X = np.vstack([x, outlier.]) X = np.asmatrix(x) X = np.asmatrix(x) N = X.shape[0] M = X.shape[0] w = cvx.variable(3,) u = cvx.variable(n,) v = cvx.variable(m,) obj = cvx.minimize(np.ones((,n))*u + np.ones((,m))*v) const = [X*w >= -u, X*w <= -(-v), u >= 0, v >= 0 ] prob = cvx.problem(obj, const).solve() w = w.value In []: xp = np.linspace(0,8,00).reshape(-,) yp = - w[,0]/w[,0]*xp - w[0,0]/w[,0] plt.figure(figsize=(0, 6)) plt.plot(x[:,], X[:,], 'ro', label='c') plt.plot(x[:,], X[:,], 'bo', label='c') plt.plot(xp, yp, '--k', label='svm') plt.plot(xp, yp-/w[,0], '-k') plt.plot(xp, yp+/w[,0], '-k') plt.xlim([0,8]) plt.xlabel('$x_$', fontsize = 0) plt.ylabel('$x_$', fontsize = 0) plt.legend(loc = 4, fontsize = 5) plt.show()

25 Further improvement Notice that hyperplane is not as accurately represent the division due to the outlier Can we do better when there are noise data or outliers? Yes, but we need to look beyond LP Idea: large margin leads to good generalization on the test data 3.3. Maximize Margin (Finally, it is Support Vector Machine) Distance (= margin) margin = ω Minimize ω to maximize the margin (closest samples from the decision line) maximize {minimum distance} Use gamma ( γ) as a weighting betwwen the followings: Bigger margin given robustness to outliers Hyperplane that has few (or no) errors minimize subject to ω + γ( u + υ) X ω + ω 0 u X ω + ω 0 ( υ) u 0 υ 0

26 In [3]: g = w = cvx.variable(3,) u = cvx.variable(n,) v = cvx.variable(m,) obj = cvx.minimize(cvx.norm(w,) + g*(np.ones((,n))*u + np.ones((,m))*v)) const = [X*w >= -u, X*w <= -(-v), u >= 0, v >= 0 ] prob = cvx.problem(obj, const).solve() w = w.value In [4]: xp = np.linspace(0,8,00).reshape(-,) yp = - w[,0]/w[,0]*xp - w[0,0]/w[,0] plt.figure(figsize=(0, 6)) plt.plot(x[:,], X[:,], 'ro', label='c') plt.plot(x[:,], X[:,], 'bo', label='c') plt.plot(xp, yp, '--k', label='svm') plt.plot(xp, yp-/w[,0], '-k') plt.plot(xp, yp+/w[,0], '-k') plt.xlim([0,8]) plt.xlabel('$x_$', fontsize = 0) plt.ylabel('$x_$', fontsize = 0) plt.legend(loc = 4, fontsize = 5) plt.show()

3.4. Logistic Regression Logistic regression is a classification algorithm don't be confused Perceptron: make use of sign of data SVM: make use of margin (minimum

decision boundary (hyperplane) of optimization i h i g(x) = ω x = 0 such that maximizes Inequality of arithmetic and geometric means x + x + + x m x x x m m m and

27 3.4. Logistic Regression Logistic regression is a classification algorithm don't be confused Perceptron: make use of sign of data SVM: make use of margin (minimum distance) Distance from a single data point We want to use distance information of ALL data points logistic regression Using Distances basic idea: to find the decision boundary (hyperplane) of optimization i h i g(x) = ω x = 0 such that maximizes Inequality of arithmetic and geometric means x + x + + x m x x x m m m and that equality holds if and only if x = x = = x m Roughly speaking, this optimization of two classes i max h i g(x) ω h = = x ω x ω ω tends to position a hyperplane in the middle of

28 Using all Distances with Outliers SVM vs. Logistic Regression

Sigmoid function (, + ) (0, ) We link or squeeze to for several reasons: If σ(z) is the sigmoid function, or the logistic function σ(z) = σ( ω x) = + e z + e ω x logistic function always generates a

29 Sigmoid function (, + ) (0, ) We link or squeeze to for several reasons: If σ(z) is the sigmoid function, or the logistic function σ(z) = σ( ω x) = + e z + e ω x logistic function always generates a value between 0 and Crosses 0.5 at the origin, then flattens out Benefit of mapping via the logistic function monotonic: same or similar optimziation solution continuous and differentiable: good for gradient descent optimization probability or confidence: can be considered as probability Goal: we need to fit ω to our data P (y = + x, ω) = [0, ] + e ω x i max h i

Classified based on probability In [5]: m = 00 X0 = np.random.multivariate_normal([0, 0], np.eye(), m) X = np.random.multivariate_normal([0, 0], np.eye(), m) X = np.vstack([x0, X]) y = np.vstack([np.

30 Classified based on probability In [5]: m = 00 X0 = np.random.multivariate_normal([0, 0], np.eye(), m) X = np.random.multivariate_normal([0, 0], np.eye(), m) X = np.vstack([x0, X]) y = np.vstack([np.zeros([m,]), np.ones([m,])]) In [6]: from sklearn import linear_model clf = linear_model.logisticregression() clf.fit(x, np.ravel(y)) Out[6]: LogisticRegression(C=.0, class_weight=none, dual=false, fit_intercept=ru e, intercept_scaling=, max_iter=00, multi_class='ovr', n_jobs=, penalty='l', random_state=none, solver='liblinear', tol=0.000, verbose=0, warm_start=false) In [7]: X_new = np.array([, 0]).reshape(, -) pred = clf.predict(x_new) print(pred) [0.] In [8]: pred = clf.predict_proba(x_new) print(pred) [[ ]]

31 In [9]: plt.figure(figsize=(0, 6)) plt.plot(x0[:,0], X0[:,], '.b', label='class 0') plt.plot(x[:,0], X[:,], '.k', label='class ') plt.plot(x_new[0,0], X_new[0,], 'o', label='new Data', ms=5, mew=5) plt.title('logistic Regression', fontsize=5) plt.legend(loc='lower right', fontsize=5) plt.xlabel('x', fontsize=5) plt.ylabel('x', fontsize=5) plt.grid(alpha=0.3) plt.axis('equal') plt.show() In [30]: %%javascript $.getscript(' tebook_toc.js')

Support Vector Machine

Support Vector Machine by Prof. Seungchul Lee isystems Design Lab http://isystems.unist.ac.kr/ UNIS able of Contents I.. Classification (Linear) II.. Distance from a Line III. 3. Illustrative Example I.