Advanced Topics in Machine Learning, Summer Semester 2012

Math.-Naturwiss. Fakultät, Fachbereich Informatik, Kognitive Systeme, Prof. A. Zell
Assignment 3
Handed out:     Due:

Aufgabe 1: Lagrangian Methods [20 Points]

Consider the problem of finding a one-class SVM. That is, given a set of unlabeled data points $\{x_i\}_{i=1}^N$, we want to find the hyperplane that maximally separates the data from the origin; such a hyperplane can be used as a novelty detector that identifies whether new data items are drawn from the same distribution as the training data. One way to formulate this problem is as the following optimization problem:
\[
\min_{w,\rho} \; \tfrac{1}{2}\|w\|^2 - \rho \quad \text{s.t.} \quad w^\top x_i \ge \rho \;\; \forall i.
\]

(a) Construct the Lagrangian function $L(w, \rho, \alpha)$ for this optimization problem; that is, add the constraints to the objective function with dual variables $\alpha_i$.

The problem is a minimization problem, so we can use its objective without negation. Next we need to write the constraints as being less than or equal to 0. This yields constraints of the form
\[
\rho - w^\top x_i \le 0.
\]
Now the Lagrangian function is simply
\[
L(w, \rho, \alpha) = \tfrac{1}{2}\|w\|^2 - \rho + \sum_{i=1}^N \alpha_i \left(\rho - w^\top x_i\right).
\]

(b) Differentiate $L(\cdot)$ with respect to $w$ and set it equal to 0 to solve for $w$ in terms of $\alpha$.

As we saw in class, $\|w\|^2 = w^\top w$ and its derivative is $\frac{\partial}{\partial w} w^\top w = 2w$. Also, $\frac{\partial}{\partial w} w^\top x_i = x_i$. Thus the derivative is
\[
\nabla_w L(w, \rho, \alpha) = w - \sum_{i=1}^N \alpha_i x_i.
\]
Setting this to 0, we obtain
\[
w^* = \sum_{i=1}^N \alpha_i x_i = X^\top \alpha.
\]
Thus, as you've come to expect, the optimal hyperplane is expressed as a linear combination of the data.

(c) Differentiate $L(\cdot)$ with respect to $\rho$ and set it equal to 0 to construct a condition on $\alpha$.

The derivative of the Lagrangian with respect to $\rho$ is
\[
\frac{\partial}{\partial \rho} L(w, \rho, \alpha) = -1 + \sum_{i=1}^N \alpha_i.
\]
Setting this to 0, we obtain the condition that the sum of the alphas must be 1; i.e., $\alpha^\top \mathbf{1} = 1$.

(d) Using the solution $w^*$ from part (b), solve for the dual Lagrangian $\hat{L}(\alpha)$, defined only in terms of $\alpha$. Note that the condition from part (c) should be used to eliminate $\rho$ from the dual Lagrangian. Then construct the dual optimization program as a minimization of the dual Lagrangian with respect to $\alpha \ge 0$ and the condition from part (c).

Plugging in our solution from part (b), we get
\[
\begin{aligned}
\hat{L}(\alpha) &= \tfrac{1}{2}[w^*]^\top w^* - \rho + \sum_{i=1}^N \alpha_i \left(\rho - [w^*]^\top x_i\right) \\
&= \tfrac{1}{2}\alpha^\top X X^\top \alpha - \rho + \rho \underbrace{\sum_{i=1}^N \alpha_i}_{=1} - [X^\top \alpha]^\top \sum_{i=1}^N \alpha_i x_i \\
&= \tfrac{1}{2}\alpha^\top X X^\top \alpha - [X^\top \alpha]^\top [X^\top \alpha] \\
&= \tfrac{1}{2}\alpha^\top X X^\top \alpha - \alpha^\top X X^\top \alpha
 = -\tfrac{1}{2}\alpha^\top \underbrace{X X^\top}_{=K} \alpha
 = -\tfrac{1}{2}\alpha^\top K \alpha.
\end{aligned}
\]
Thus, when we maximize this (notice the mistake in the question above: since the primal was a minimization, we should maximize the dual Lagrangian), it is equivalent to minimizing its negation, and we get
\[
\min_\alpha \; \tfrac{1}{2}\alpha^\top K \alpha \quad \text{s.t.} \quad \alpha_i \ge 0 \;\; \forall i \quad \text{and} \quad \alpha^\top \mathbf{1} = 1.
\]
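As a quick illustration of the resulting program, here is a minimal Octave sketch that feeds this dual to the built-in qp solver. It assumes an N-by-N kernel matrix K has already been computed (e.g. with the kernel routines from Aufgabe 2) and is not part of the official solution code.

  % One-class SVM dual (sketch):  min_a 0.5*a'*K*a  s.t.  sum(a) = 1, a >= 0.
  % K is assumed to be an N-by-N (positive semi-definite) kernel matrix.
  N  = size(K, 1);
  a0 = ones(N, 1) / N;                      % feasible starting point
  alpha = qp(a0, K, zeros(N, 1), ...        % objective: 0.5*a'*K*a + 0'*a
             ones(1, N), 1,      ...        % equality:  sum(a) = 1
             zeros(N, 1), Inf(N, 1));       % bounds:    0 <= a (no upper bound)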

(e) Show that, for any Gaussian kernel with $\sigma > 0$,
\[
\kappa_{\mathrm{rbf}}(x, z) = \exp\!\left(-\sigma \|x - z\|^2\right),
\]
the dual program of the one-class SVM that you derived in part (d) is equivalent to the minimal hypersphere dual program defined on Slide 34 of Lecture 6. Hint: for constants $a$ and $b$ (w.r.t. $x$), maximizing $a + b\,f(x)$ is equivalent to maximizing $f(x)$ if $b > 0$ or minimizing $f(x)$ if $b < 0$.

The program given in lecture was
\[
\max_\alpha \; \sum_{i=1}^N \alpha_i K_{i,i} - \alpha^\top K \alpha \quad \text{s.t.} \quad \alpha_i \ge 0 \;\; \forall i \quad \text{and} \quad \alpha^\top \mathbf{1} = 1.
\]
For any Gaussian kernel, $\kappa_{\mathrm{rbf}}(x, x) = 1$. Thus $K_{i,i} = 1$ and the first sum in the above objective reduces to $\sum_{i=1}^N \alpha_i$. However, from the constraints of the program, this is just 1, and the program becomes
\[
\max_\alpha \; 1 - \alpha^\top K \alpha \quad \text{s.t.} \quad \alpha_i \ge 0 \;\; \forall i \quad \text{and} \quad \alpha^\top \mathbf{1} = 1.
\]
This optimization has the same constraints as the one-class SVM dual from part (d). Further, from the above hint, using $a = 1$ and $b = -2$ (so that $1 - \alpha^\top K \alpha = a + b \cdot \tfrac{1}{2}\alpha^\top K \alpha$), we see that the two optimization problems are indeed equivalent in the sense that they will yield the same optimal $\alpha$.

As a side note, the two problems are equivalent for any kernel with the property that $\kappa(x, x) = Q$ for all $x$, for some constant $Q$. Indeed, for $Q > 0$, any such kernel maps the data onto the surface of a sphere in feature space (the situation in which all points have a constant norm is equivalent to them lying on a sphere centered at the origin). Intersecting any hyperplane with this data sphere creates a spherical cross-section on its surface; this is the same cross-section created by intersecting the data sphere with a second, small sphere chosen to surround the data (as is done for the problem in class). Thus, we can see geometrically why these two problems are equivalent for these kernels.
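A quick numeric illustration of the key fact (a sketch with arbitrary data and bandwidth, not part of the hand-in): the diagonal of a Gaussian kernel matrix is all ones, so the linear term $\sum_i \alpha_i K_{i,i}$ is constant on the feasible set.

  % Sketch: every diagonal entry of a Gaussian kernel matrix is exp(0) = 1
  % (up to rounding), so sum_i alpha_i*K_ii = sum_i alpha_i = 1 on the simplex.
  X     = randn(5, 2);   sigma = 0.7;                    % arbitrary data and sigma
  D2    = sum(X.^2, 2) + sum(X.^2, 2)' - 2 * (X * X');   % squared pairwise distances
  K     = exp(-sigma * D2);
  disp(diag(K)')                                         % approximately [1 1 1 1 1]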

Aufgabe 2: Support Vector Machines [40 Points]

In this exercise, you will use the dataset dataset3.txt to learn a sequence of support vector machine classifiers. You can load this data using Snip A of code-snips.m.

(a) Write 3 functions to construct a kernel matrix for three different kernels:
\[
\kappa_{\mathrm{lin}}(x, z) = x^\top z, \qquad
\kappa_{\mathrm{poly}}(x, z) = (x^\top z + 1)^d, \qquad
\kappa_{\mathrm{rbf}}(x, z) = \exp\!\left(-\sigma \|x - z\|^2\right).
\]
Call them linkern, polykern, and rbfkern. All should take 2 matrix arguments: X, an N x D matrix, and Z, an M x D matrix. They should return an N x M matrix of all kernel evaluations between the rows of the input matrices; i.e., a matrix $K_{i,j} = \kappa(X(i,:), Z(j,:))$.

See the solution code in uebung03-code.zip.

(b) Use a quadratic programming (QP) solver (Octave comes with the solver qp and Matlab has quadprog) to solve for a two-class SVM from the following dual program from Lecture 6:
\[
\max_\alpha \; \mathbf{1}^\top \alpha - \tfrac{1}{2}\alpha^\top G \alpha \quad \text{s.t.} \quad \alpha^\top y = 0 \quad \text{and} \quad 0 \le \alpha \le C.
\]
Both of the QP solvers mentioned above are able to solve the SVM's program if passed the correct arguments. You will need to determine how to fit the SVM program into the solver of your choice (see the documentation of these solvers). Your code should take a kernel matrix K, the labels y, and the parameter C > 0, and it should return the dual variables $\alpha$ along with the SVM displacement b. To solve for b, you will need to find the unbounded support vectors i, for which $0 < \alpha_i < C$. From these, you can use the KKT conditions to compute b; document how you did this.

See the solution code svm.m in uebung03-code.zip. In this solution, b was computed based on the fact that, at an optimal solution, the KKT conditions give $y_i f(x_i) = 1$ for any i that is an unbounded support vector. Thus we can solve for b as follows (using $y_i \in \{-1, +1\}$):
\[
f(x_i) = 1/y_i = y_i \;\;\Rightarrow\;\; w^\top \phi(x_i) + b = y_i \;\;\Rightarrow\;\; b = y_i - w^\top \phi(x_i).
\]
To make the estimate of b more stable, we average these estimates over all unbounded support vectors.
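For concreteness, here is one possible Octave sketch of parts (a) and (b). It is not the hand-in code (linkern, polykern, rbfkern, and svm.m in uebung03-code.zip play that role); here the kernel parameters are passed as an extra argument, and the training routine is given the hypothetical name train_svm.

  % Kernel matrices (sketch): X is N-by-D, Z is M-by-D, result is N-by-M.
  linkern  = @(X, Z)        X * Z';
  polykern = @(X, Z, d)     (X * Z' + 1).^d;
  sqdist   = @(X, Z)        sum(X.^2, 2) + sum(Z.^2, 2)' - 2 * (X * Z');
  rbfkern  = @(X, Z, sigma) exp(-sigma * sqdist(X, Z));

  % Two-class SVM via Octave's qp (sketch; train_svm is a hypothetical name,
  % e.g. placed in its own file train_svm.m).  It solves
  %   min_a 0.5*a'*G*a - 1'*a   s.t.   y'*a = 0,  0 <= a <= C,
  % with G = (y*y').*K, then averages the KKT estimates b = y_i - w'*phi(x_i)
  % over the unbounded support vectors.  y must be an N-by-1 vector of +/-1.
  function [alpha, b] = train_svm(K, y, C)
    N     = numel(y);
    G     = (y * y') .* K;
    alpha = qp(zeros(N, 1), G, -ones(N, 1), ...   % objective
               y', 0,                       ...   % equality: y'*a = 0
               zeros(N, 1), C * ones(N, 1));      % bounds:   0 <= a <= C
    tol   = 1e-6;
    usv   = find(alpha > tol & alpha < C - tol);  % unbounded support vectors
    b     = mean(y(usv) - K(usv, :) * (alpha .* y));
  end

Since Snip C expects a two-argument kernel function, the parameter would be bound before plotting, e.g. kernel = @(X, Z) rbfkern(X, Z, 0.2).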

(c) You will now learn a sequence of support vector machines. To do so, follow these steps:

(i) Use your kernel code to construct the following kernel matrices:

    Name   Specification        Parameters
    K1     κ_lin(x, z)          -
    K2     κ_poly(x, z)         d = 3
    K3     κ_poly(x, z)         d = 4
    K4     κ_rbf(x, z)          σ = 0.2
    K5     κ_rbf(x, z)          σ = 2

(ii) Use your SVM solver to learn an (α, b) for each matrix; that is, for each kernel matrix K, use the code in Snip B of code-snips.m with C = 0.5 and C = 5. Thus, the process will repeat 10 times, once for each kernel/C-value pair (a compact sketch of this loop is given below).

(iii) For each of these learned SVMs, plot the resulting SVM. To do this, run the code in Snip C of code-snips.m to plot the contours of the SVM prediction function. Note that in this code, kernel is the kernel function you are using (Important: you must use the same kernel function you used for training), and alpha and b are the SVM parameters learned by your code. Magenta points/boundaries correspond to the positive class and cyan points/boundaries correspond to the negative class.

(iv) Finally, add additional plotting code to highlight the support vectors by placing a box around them.

See the solution code solution.m in uebung03-code.zip and the plots below.
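Sketch of the training loop referenced in (ii), assuming Snip A has loaded the data into X (N-by-2) and y (N-by-1, labels in {-1, +1}) and reusing the kernel handles and the hypothetical train_svm from the sketch above; the hand-in solution.m is the authoritative version.

  % Training loop sketch: 5 kernels x 2 values of C = 10 SVMs.
  kernel_fns = { @(A, B) linkern(A, B),      ...
                 @(A, B) polykern(A, B, 3),  ...
                 @(A, B) polykern(A, B, 4),  ...
                 @(A, B) rbfkern(A, B, 0.2), ...
                 @(A, B) rbfkern(A, B, 2) };
  for k = 1:numel(kernel_fns)
    K = kernel_fns{k}(X, X);
    for C = [0.5 5]
      [alpha, b] = train_svm(K, y, C);
      % ...plot via Snip C with kernel = kernel_fns{k}, then box the
      % support vectors, e.g.:
      sv = find(alpha > 1e-6);
      % plot(X(sv, 1), X(sv, 2), 'ks', 'markersize', 10);
    end
  end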

[Plots: Linear, C = 0.5; Linear, C = 5.0; Polynomial (d = 3), C = 0.5; Polynomial (d = 3), C = 5.0; Polynomial (d = 4), C = 0.5; Polynomial (d = 4), C = 5.0]

[Plots: RBF (σ = 0.2), C = 0.5; RBF (σ = 0.2), C = 5.0; RBF (σ = 2.0), C = 0.5; RBF (σ = 2.0), C = 5.0]

(d) Analyze the behavior of these different two-class SVMs. Describe how effectively they separate the dataset and which seems to be the best classifier for this dataset.

The QP solver I used (qp in Octave) was unable to find a solution (it reached the maximum number of iterations) for either of the linear-kernel problems or for the RBF-kernel problems with σ = 2.0. As seen in the plots for these SVMs, the support vectors do not appear properly configured for the displayed boundary; this is likely because the solver did not find an optimal solution. Clearly, the linear boundary is incorrect, as this data is not linearly separable. For the polynomial kernels, various degrees of separation are achieved, but the boundaries do not capture the shapes present in this data well. The degree-3 kernel did a better job of this, but is still far from producing a good representation. The boundaries produced by the degree-4 polynomial kernel do not capture a good separation, which may be due to numerical problems caused by the larger exponent. Clearly the best classifier in my experience was the RBF kernel with σ = 0.2: for both values of C it produced a reasonable classifier with a fair amount of sparsity, and its boundary captures the true shape of the data well.

Aufgabe 3: SVM Decomposition [20 Points]

Consider the practical implementation of the feasible direction decomposition algorithm. Assume that before iteration t, the weight vector $\alpha^t$ and the gradient vector $g^t$ are known. At iteration t, a working set with indices $(i_1, i_2, \ldots, i_q)$ has been chosen and their weights have been reoptimized. Show how the gradient vector $g^{t+1}$, which is needed for the selection of a working set for the next iteration, can be computed in O(qn) time. Hint: recall that the gradient of the SVM training problem is computed as
\[
g_i = \sum_{j \in SV} \alpha_j y_j K_{ij} - 1.
\]

Consider the change in $g_i$ from iteration t to t+1. Using the fact that $\alpha_j = 0$ for $j \notin SV$, we can write $g_i$ as a sum over all indices, which gives the following change in $g_i$:
\[
\Delta g_i = g_i^{t+1} - g_i^t
= \sum_j \alpha_j^{t+1} y_j K_{ij} - \sum_j \alpha_j^t y_j K_{ij}
= \sum_j \underbrace{(\alpha_j^{t+1} - \alpha_j^t)}_{\Delta\alpha_j} y_j K_{ij}
= \sum_j \Delta\alpha_j \, y_j K_{ij}.
\]
This $\Delta g_i$ allows us to compute the new $g_i^{t+1}$ by simply adding this change to the previous $g_i^t$. Moreover, $\Delta\alpha_j = 0$ unless $j \in \{i_1, i_2, \ldots, i_q\}$, since these are the only indices for which alpha was changed (the working set). Thus, the change simplifies to
\[
\Delta g_i = \sum_{j \in \{i_1, i_2, \ldots, i_q\}} \Delta\alpha_j \, y_j K_{ij}
\]
and can be computed as a sum of only q terms; thus, we require O(q) time to compute each $\Delta g_i$. Since there are n points that require an update, the total complexity of the update is O(qn).
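In Octave this update is essentially a single matrix-vector product against the working-set columns of K. The sketch below uses hypothetical names: ws for the working-set indices, a_old/a_new for the alpha (column) vectors before and after the working-set reoptimization, and g for the full gradient vector.

  % O(q*n) gradient update (sketch): only the q working-set columns of K enter.
  dalpha = a_new(ws) - a_old(ws);             % the q nonzero changes in alpha
  g      = g + K(:, ws) * (dalpha .* y(ws));  % n-by-q times q-by-1: O(q*n)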

Aufgabe 4: Decremental SVM [20 Points]

Provide a conceptual description of the procedure for removal of a selected point from an SVM solution ("decremental SVM"). The main idea of the method is to force the weight of a selected example c to zero while maintaining optimality for all examples except c. Discuss the implementation details of this procedure as asked below.

(a) What is the sign of the increment $\Delta\alpha_c$?

To remove the point, we need to decrease $\alpha_c$ to 0 (if it is not already 0). Thus the sign of the increment is negative. If the point already has $\alpha_c = 0$, we can simply remove it without updating the SVM solution.

(b) Which of the five bookkeeping conditions of the incremental SVM can be dropped?

As listed in the lecture notes, the 4th condition ($g_c$ becomes 0 and we terminate) must be dropped: since we are removing the c-th point completely, we no longer care whether its margin conditions are satisfied, and in fact it would be incorrect to terminate based on them, since c no longer participates in the solution.

(c) How does the sign of the increment $\Delta\alpha_c$ affect the specific expressions for the remaining four bookkeeping conditions? Provide the revised form for each of these conditions.

Recall from lecture that $\alpha_c$ interacts with the alphas (and thus the structure) of the other data points through the equations
\[
\Delta\alpha_i = \beta_i \, \Delta\alpha_c \;\; (i \in S), \qquad
\Delta g_i = \gamma_i \, \Delta\alpha_c \;\; (i \in O \cup E).
\]
The change in the sign of $\Delta\alpha_c$ thus changes the conditions in the following way (same order as on the lecture slides):

(i) $i \in S$ and $\beta_i > 0$. From above, we see that $\Delta\alpha_c < 0$ and thus $\alpha_i$ is decreasing. The structural change in this case occurs if $\alpha_i$ reaches 0, which happens when $\Delta\alpha_c = -\alpha_i/\beta_i$. Since this quantity is negative, the smallest-magnitude change over these cases is their maximum:
\[
\Delta\alpha_c^1 = \max_{i \in S:\ \beta_i > 0} \; \frac{-\alpha_i}{\beta_i}.
\]

(ii) $i \in S$ and $\beta_i < 0$. From above, we see that $\Delta\alpha_c < 0$ and thus $\alpha_i$ is increasing. The structural change in this case occurs if $\alpha_i$ reaches C, which happens when $\Delta\alpha_c = (C - \alpha_i)/\beta_i$. Since this quantity is negative, the smallest-magnitude change over these cases is their maximum:
\[
\Delta\alpha_c^2 = \max_{i \in S:\ \beta_i < 0} \; \frac{C - \alpha_i}{\beta_i}.
\]

(iii) If $i \in E$, a structural change can only occur if $\Delta g_i > 0$; since $\Delta\alpha_c < 0$, this will only happen if $\gamma_i < 0$. Similarly, if $i \in O$, a structural change can only occur if $\Delta g_i < 0$; since $\Delta\alpha_c < 0$, this will only happen if $\gamma_i > 0$. Thus, this case occurs when $i \in E$ and $\gamma_i < 0$, or when $i \in O$ and $\gamma_i > 0$; these conditions on $\gamma_i$ are opposite to the incremental case! The structural change occurs if $g_i$ reaches 0, which happens when $\Delta\alpha_c = -g_i/\gamma_i$. Since this quantity is negative, the smallest-magnitude change over these cases is their maximum:
\[
\Delta\alpha_c^3 = \max_{\substack{i \in E:\ \gamma_i < 0 \\ i \in O:\ \gamma_i > 0}} \; \frac{-g_i}{\gamma_i}.
\]

(iv) As stated in part (b), this condition is discarded.

(v) Finally, we want to test for the terminating condition that $\alpha_c$ reaches 0. Since $\alpha_c$ is directly decremented by $\Delta\alpha_c$, this occurs when the step is exactly the negative of $\alpha_c$; thus
\[
\Delta\alpha_c^5 = -\alpha_c.
\]

Since all of the above step limits are negative, the minimal-magnitude step before a structural change is given by their maximum:
\[
\Delta\alpha_c = \max\left(\Delta\alpha_c^1, \Delta\alpha_c^2, \Delta\alpha_c^3, \Delta\alpha_c^5\right).
\]
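A short Octave sketch of this step selection, with hypothetical variable names: S, E, O are row vectors of indices, alpha, g, beta, gamma are row vectors of the bookkeeping quantities, and c is the index of the point being removed. It simply collects the (nonpositive) candidate steps and takes their maximum.

  % Decremental step selection (sketch). All candidates are <= 0, so the
  % admissible step (smallest magnitude) is their maximum.
  iS1  = S(beta(S) > 0);                       % alphas in S shrinking toward 0
  iS2  = S(beta(S) < 0);                       % alphas in S growing toward C
  iEO  = [E(gamma(E) < 0), O(gamma(O) > 0)];   % signs opposite to the incremental case
  cand = [-alpha(iS1) ./ beta(iS1), (C - alpha(iS2)) ./ beta(iS2), ...
          -g(iEO) ./ gamma(iEO), -alpha(c)];
  dalpha_c = max(cand);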

(d) Are there any changes needed for the recursive update of the matrix $Q^{-1}$?

The updates to $Q^{-1}$ that were given in class accommodate both the addition and the removal of any point to or from S, the only points involved in the definition of Q. Thus, no new update rule is needed, since we can already handle both addition and removal. However, we do need the following structural change to how we update $Q^{-1}$. Namely, in the decremental SVM we are removing $x_c$, so it should not belong to any of the sets O, S, E. If at the beginning $c \in S$, we should perform a decremental update to $Q^{-1}$ to remove c. Further, throughout the algorithm's execution, if c were about to move into the set S, we can ignore this and not update $Q^{-1}$.

(e) Which condition must be added if we only want to determine whether, after the removal of point c, its classification will be different from its true label ("leave-one-out error")?

From the definition of $g_c$, we have $g_c = y_c f(x_c) - 1$ for any point. For the leave-one-out error, the check we'd like to perform is whether, after removing $x_c$, we have $y_c f(x_c) < 0$; i.e., whether the prediction made by the classifier disagrees with the true label. Thus, if after our decrement completes we have $g_c < -1$, then $x_c$ will be misclassified after removing it from the training set. Moreover, the update can be terminated early once this condition is reached, since $g_c$ will only decrease during the update. In this way, we can add a new termination criterion for counting the leave-one-out errors.
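As a tiny illustration (a sketch with hypothetical names: g is the gradient vector and c the removed index), the leave-one-out test is a one-line check that can double as an early-exit criterion:

  % Point c is a leave-one-out error iff y_c*f(x_c) < 0, i.e. iff g_c < -1.
  loo_error = (g(c) < -1);   % may also be checked after every step for early exit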
