1 Binary Classification

Size: px

Start display at page:

Download "1 Binary Classification"

Jasper Fleming
5 years ago
Views:

1 CS6501: Advanced Machine Learning 017 Problem Set 1 Handed Out: Feb 11, 017 Due: Feb 18, 017 Feel free to talk to other members of the class in doing the homework. You should, however, write down your solution yourself. Please try to keep the solution brief and clear. Please use Piazza first if you have questions about the homework. Also feel free to come to office hours. Please, no handwritten solutions. You will submit your solution manuscript as a single pdf file. Please use L A TEXto typeset your solutions. The homework is due at 11:59 PM on the due date. We will be using Collab for collecting the homework assignments. Please submit your solution manuscript as a pdf file and your code as a zip file. Please do NOT hand in a hard copy of your write-up. Contact the TAs if you are having technical difficulties in submitting the assignment. 1 Binary Classification A linear program can be stated as follows: Definition 1 Let A be an m n real-valued matrix, b R m, and c R n. Find a t R n, that imizes the linear function subject to z t = c T t A t b In the linear programg terology, c is often referred to as the cost vector and z t is referred to as the objective function. 1 We can use this framework to define the problem of learning a linear discriant function. 1 Note that you don t need to really know how to solve a linear program since you can use Matlab or Scipy to obtain the solution although you may wish to brush up on Linear Algebra. This discussion closely parallels the linear programg representation found in Pattern Classification, by Duda, Hart, and Stork. 1

2 The Learning Problem: 3 Let x 1, x,..., x m represent m samples, where each sample x i R n is an n-dimensional vector, and y { 1, 1} m is an m 1 vector representing the respective labels of each of the m samples. Let w R n be an n 1 vector representing the weights of the linear discriant function, and θ be the threshold value. We predict x i to be a positive example if w T x i + θ 0. On the other hand, we predict x i to be a negative example if w T x i + θ < 0. We hope that the learned linear function can separate the data set. That is, for the true labels we hope { 1 if w T x i + θ 0 y i = 1 1 if w T x i + θ < 0. In order to find a good linear separator, we propose the following linear program: w,θ,δ δ subject to y i w T x i + θ 1 δ, x i, y i D 3 where D is the data set of all training examples. δ 0 4 a. [10 points] A data set D = { x i, y i } m i=1 that satisfies condition 1 above is called linearly separable. Show that the data set D is linearly separable if and only if there exists a hyperplane w, θ that satisfies condition 3 with δ = 0 Need a hint? 4. Conclude that D is linearly separable iff the optimal solution to the linear program [ to 4] attains δ = 0. What can we say about the linear separability of the data set if there exists a hyperplane that satisfies condition 3 with δ > 0? The solution to a. contains three parts. i Linear Separability δ = 0: The separability property implies that there exists a hyperplane v T x + ρ such that y=1 v T x + ρ 0 > y= 1 v T x + ρ 3 Note that the notation used in the Learning Problem is unrelated to the one used in the Linear Program definition. You may want to figure out the correspondence. 4 Hint: Assume that D is linearly separable. D is a finite set, so it has a positive example that is closest to the hyperplane among all positive examples. Similarly, there is a negative example that is closest to the hyperplane among negative examples. Consider their distances and use them to show that condition 3 holds. Then show the other direction.

3 Let x i to be the positive sample that is closest to the hyperplane v T x + ρ, and p + = y=1 v T x + ρ = v T x i + ρ Similarly, let x j to be the negative sample that is closest to the hyperplane v T x + ρ, and p = v T x + ρ = v T x j + ρ y= 1 Note that from the definition of separability, we know that p + and p exist and p + 0 > p. Step 1: Shifting the hyperplane. Since p + 0 > p, there exist η 0 such that p + η 0 > p η. Hence, for some values of η, the hyperplane v T x + ρ η = 0 also separates D. We will find the value of η for which the hyperplane v T x + ρ η = 0 is equi-distant from x i and x j and separates D. The value of η can be obtained as follows: v T x i + ρ η v = vt x j + ρ η, v v T x i + ρ η = v T x j + ρ η, p + η = p + η. 5 This implies that η = p+ +p. We can easily verify that for such η, p + η 0 > p η, and hence the resulting hyperplane v T x + ρ η = 0 separates D. Step : Normalizing the hyperplane. With the new hyperplane v T x + ρ η = 0, and y=1 y= 1 v T x + ρ η = p+ p v T x + ρ η = p p + y v T x + ρ η p+ p x, y D Given that p + > p and η = p+ +p, by setting w = v, θ = ρ η, and δ = 0, η η y w T x + θ 1 δ, x, y D ii Linear Separability δ = 0: If there exists a hyperplane w T x + θ such that y w T x + θ 1, x, y D 3

4 It is trivial to show that iii When δ > 0: w T x + θ 1 0, x, y D, y = 1 and w T x + θ 1 < 0, x, y D, y = 1 Now consider the value of δ. Note that if 1 δ > 0, we can apply similar argument and show that the data set is linear separable. If δ 1, we are not sure if the data set is separable or not. If the imal δ 1, then the data set is not separable. b. [5 points] Show that there is a trivial optimal solution for the following linear program: w,θ,δ δ subject to y i w T x i + θ δ, x i, y i D δ 0 Show the optimal solution and use it to very briefly explain why we chose to formulate the linear program as [ to 4] instead. The trivial solution is w = 0, θ = 0, δ = 0. This solution is optimal because 1 the constraints are satisfied, and zero is the best value we can get for δ. We do not use this formulation because the solver might give us this optimal solution which does not give us a hyperplane at all. Note that there exists other optimal solutions for this formulation. c. [10 points] Let x 1 R n, x 1 T = [ ] and y 1 = 1. Let x R n, x T = [ ] and y = 1. The data set D is defined as D = { x 1, y 1, x, y }. Consider the formulation in [ to 4] applied to D. Show the set of all possible optimal solutions solve this problem by hand. Given that there are only two samples in the data set, the data set is separable. Therefore, the optimal δ is 0. Now our job becomes finding w and θ that satisfy the following constraints: w 1 + w... + w n + θ 1 w 1 w... w n + θ 1 w 1 + w w n 1 + θ Hence, w, θ, δ is the optimal solution if δ = 0 and w 1 + w w n 1 + θ. 4

5 Multiclass Classification In this problem set, we will derive, implement, and test an SGD method for a multi-class SVM classification model, and then we will compare it with the one-vs-all approach. Given a data set D = {x i, y i } N i=1 with K classes, the multi-class SVM can be written as the following unconstrained optimization problem: w={w k } 1 K wk T w k + C k=1 N i=1 k=1...k yi, k + w T k x i w T yi x i, 6 where w k is the weight vector for class k, x i, y i is the i th instance and { 1 if y k, y, k = 0 otherwise. To derive the SGD algorithm, we first need to rewrite the objective function in Eq. 6 into the form of i g iw: w={w k } N 1 N i=1 K wk T w k + C k=1 k=1...k yi, k + w T k x i w T yi x i, so that 1 g i w = N K wk T w k + C k=1 k=1...k yi, k + w T k x i w T yi x i. Now let s derive the sub-gradient of g i w. Consider g i w/ w k, the partial sub-gradient of the first term of g i w is w k /N. Regarding the second term, we have to consider several cases. Let ỹ = arg yi, k + wk T x i k. If ỹ = k and y i = k, the partial sub-gradient of the second term is Cx x = 0. Therefore, g i w/ w k = w k /N when ỹ = k and y i = k. How about other cases? 5

6 a. [15 points] Please complete the following formulation w k /N if ỹ = k and y i = k g i w w k = w k /N Cx i w k /N + Cx i w k /N b. [15 points] The SGD algorithm runs as follows: if ỹ k and y i = k if ỹ = k and y i k if ỹ k and y i k 7 for epoch = 0... T do for x i, y i in D do for k= 1... K do w k w k η g i w k / w k η is the step size. The step size is a user specified parameter. Here, we can set it with a default value Plug Eq. 7 in, we have: for epoch = 0... T do for x i, y i in D do ỹ = arg k yi, k + wk T x i for k= 1... K do w k w k ηw k /N if ỹ y i then w yi w yi η Cx i wỹ wỹ η+cx i end if Please, complete the above algorithm. c. [0points] Let s implement a baseline one-vs-all multiclass algorithm and conduct experiments on MNIST dataset 5. We provide a basic code framework at https: 5 6

7 //github.com/uvanlp/hw1 to facilitate the implementation. However, you re welcomed to write your own code. If you want to use the code we provided, implement the one-vs-all algorithm in the main function of one vs all with data reader.py. Then, conduct cross-validation to find the best C for all underlying binary classifier assume the same C is used for all the binary classifiers. Note: You can use the LinearSVC class at Scikit-Learn, svm.linearsvc.html, to implement the underlying classifiers. Report the best C and the final accuracy. d. [5points] Implement the SGD algorithm for multi-class SVM. We provide multiclass svm.py as the code framework. However, you have to implement the update rules. Conduct cross-validation to find the best C and report the result on the test set. 6 e. [0-10 points] We will give bonus points for additional experiments. 6 To test if your algorithm implementation is correct, you can check if the objective function value of the SGD algorithm matches with the score from liblinear cjlin/liblinear. Use the following command line to train a multi-class SVM model./train -s 4 -B -1 -e C 1 train.data with C=1. There is a function in the sample code allows you to print data in liblinear format. Check if the objective function values of your implementation is the same as that of liblinear on a small subset e.g., run on only the first 100 samples. 7

CS446: Machine Learning Spring Problem Set 4

CS446: Machine Learning Spring 2017 Problem Set 4 Handed Out: February 27 th, 2017 Due: March 11 th, 2017 Feel free to talk to other members of the class in doing the homework. I am more concerned that