CS 194-10, Fall 2011 Assignment 2 Solutions

1. (18 pts) In this question we briefly review the expressiveness of kernels.

(a) Construct a support vector machine that computes the XOR function. Use values of +1 and -1 (instead of 1 and 0) for both inputs and outputs, so that an example looks like ([-1, 1], 1) or ([-1, -1], -1). Map the input [x_1, x_2] into a space consisting of x_1 and x_1 x_2. Draw the four input points in this space, and the maximal margin separator. What is the margin? Now draw the separating line back in the original Euclidean input space.

The examples map from [x_1, x_2] to [x_1, x_1 x_2] coordinates as follows:

[-1, -1] (negative) maps to [-1, +1]
[-1, +1] (positive) maps to [-1, -1]
[+1, -1] (positive) maps to [+1, -1]
[+1, +1] (negative) maps to [+1, +1]

Thus, the positive examples have x_1 x_2 = -1 and the negative examples have x_1 x_2 = +1. The maximum margin separator is the line x_1 x_2 = 0, with a margin of 1. The separator corresponds to the x_1 = 0 and x_2 = 0 axes in the original space; this can be thought of as the limit of a hyperbolic separator with two branches.

(b) Recall that the equation of a circle in the 2-dimensional plane is (x_1 - a)^2 + (x_2 - b)^2 - r^2 = 0. Expand out the formula and show that every circular region is linearly separable from the rest of the plane in the feature space (x_1, x_2, x_1^2, x_2^2).

The circle equation expands into five terms

0 = x_1^2 + x_2^2 - 2a x_1 - 2b x_2 + (a^2 + b^2 - r^2)

corresponding to weights w = (-2a, -2b, 1, 1) and intercept a^2 + b^2 - r^2. This shows that a circular boundary is linear in this feature space, allowing linear separability. In fact, the three features x_1, x_2, x_1^2 + x_2^2 suffice.

(c) Recall that the equation of an ellipse in the 2-dimensional plane is c(x_1 - a)^2 + d(x_2 - b)^2 - 1 = 0. Show that an SVM using the polynomial kernel of degree 2, K(u, v) = (1 + u·v)^2, is equivalent to a linear SVM in the feature space (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2), and hence that SVMs with this kernel can separate any elliptic region from the rest of the plane.

The (axis-aligned) ellipse equation expands into six terms

0 = c x_1^2 + d x_2^2 - 2ac x_1 - 2bd x_2 + (a^2 c + b^2 d - 1)

corresponding to weights w = (-2ac, -2bd, c, d, 0) and intercept a^2 c + b^2 d - 1. This shows that an elliptical boundary is linear in this feature space, allowing linear separability. In fact, the four features x_1, x_2, x_1^2, x_2^2 suffice for any axis-aligned ellipse.
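The construction in part (a) is easy to check numerically. The sketch below is an illustration only, assuming NumPy and scikit-learn are available; it approximates a hard margin with a large C and should recover a separator close to x_1 x_2 = 0 with margin 1.

```python
# Sketch: verify the XOR construction of part (a) in the [x1, x1*x2] space.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
y = np.array([-1, +1, +1, -1])            # XOR labels in {-1, +1}

Phi = np.c_[X[:, 0], X[:, 0] * X[:, 1]]   # feature map [x1, x1*x2]

clf = SVC(kernel="linear", C=1e6).fit(Phi, y)   # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)                  # expect w close to (0, -1), b close to 0
print("margin =", 1.0 / np.linalg.norm(w)) # expect approximately 1.0
```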

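Part (c) can also be checked numerically. The sketch below (NumPy only, with randomly drawn points) uses the standard scaled map φ(x) = (1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2), whose inner product equals (1 + u·v)^2; it spans the same feature space as (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2), so a separator that is linear in one is linear in the other.

```python
# Sketch: check that K(u, v) = (1 + u.v)**2 is an inner product in a
# 6-dimensional quadratic feature space (a scaled version of the one in part (c)).
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
for _ in range(5):
    u, v = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose((1.0 + u @ v) ** 2, phi(u) @ phi(v))
print("degree-2 polynomial kernel matches the explicit quadratic feature map")
```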
2. (12 pts) Logistic regression is a method of fitting a probabilistic classifier that gives soft linear thresholds (see Russell & Norvig, Section 18.6.4). It is common to use logistic regression with an objective function consisting of the negative log probability of the data plus an L_2 regularizer:

L(w) = Σ_{i=1}^N log(1 + e^{-y_i (w^T x_i + b)}) + λ ‖w‖_2^2

(Here w does not include the extra weight w_0.)

(a) Find the partial derivatives ∂L/∂w_j.

First define the function g(z) = 1 / (1 + e^{-z}). Note that ∂g(z)/∂z = g(z)(1 - g(z)), and also ∂ log g(z)/∂z = 1 - g(z). Then we get

∂L/∂w_j = -Σ_{i=1}^N y_i x_{ij} (1 - g(y_i (w^T x_i + b))) + 2λ w_j

(b) Find the partial second derivatives ∂^2 L / ∂w_j ∂w_k.

∂^2 L / ∂w_j ∂w_k = Σ_{i=1}^N y_i^2 x_{ij} x_{ik} g(y_i (w^T x_i + b)) (1 - g(y_i (w^T x_i + b))) + 2λ δ_{jk}

where δ_{jk} = 1 if j = k and 0 if j ≠ k.

(c) From these results, show that L(w) is a convex function. Hint: a function L is convex if its Hessian (the matrix H of second derivatives, with elements H_{jk} = ∂^2 L / ∂w_j ∂w_k) is positive semi-definite (PSD). A matrix H is PSD if and only if a^T H a = Σ_{j,k} a_j a_k H_{jk} ≥ 0 for all real vectors a.

Applying the definition of a PSD matrix to the Hessian of L(w), we get

a^T H a = Σ_{j,k} a_j a_k ∂^2 L / ∂w_j ∂w_k
        = Σ_{j,k} a_j a_k ( Σ_i y_i^2 x_{ij} x_{ik} g(y_i (w^T x_i + b)) (1 - g(y_i (w^T x_i + b))) + 2λ δ_{jk} )   (1)
        = Σ_{j,k,i} a_j a_k y_i^2 x_{ij} x_{ik} g(y_i (w^T x_i + b)) (1 - g(y_i (w^T x_i + b))) + 2λ Σ_j a_j^2   (2)

Define P_i = g(y_i (w^T x_i + b)) (1 - g(y_i (w^T x_i + b))) and ρ_{ij} = y_i x_{ij} √P_i. Then

a^T H a = Σ_{j,k,i} a_j a_k x_{ij} x_{ik} y_i^2 P_i + 2λ Σ_j a_j^2
        = Σ_i a^T ρ_i ρ_i^T a + 2λ Σ_j a_j^2
        = Σ_i (a^T ρ_i)^2 + 2λ Σ_j a_j^2
        ≥ 0   for λ ≥ 0.
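The formulas in parts (a)-(c) can be sanity-checked numerically. The sketch below (NumPy only, on made-up synthetic data, with b held fixed since the problem differentiates with respect to w only) implements L(w), the gradient, and the Hessian, compares the gradient against finite differences, and confirms the Hessian is PSD.

```python
# Sketch: numerical check of the gradient and Hessian of the regularized logistic loss.
import numpy as np

def g(z):                        # logistic function
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y, b, lam):
    z = y * (X @ w + b)
    return np.sum(np.log1p(np.exp(-z))) + lam * (w @ w)

def grad(w, X, y, b, lam):       # part (a)
    z = y * (X @ w + b)
    return -X.T @ (y * (1.0 - g(z))) + 2.0 * lam * w

def hessian(w, X, y, b, lam):    # part (b); y_i**2 == 1 for labels in {-1, +1}
    z = y * (X @ w + b)
    s = g(z) * (1.0 - g(z))
    return X.T @ (s[:, None] * X) + 2.0 * lam * np.eye(len(w))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.choice([-1.0, 1.0], size=20)
w, b, lam = rng.normal(size=3), 0.5, 0.1

eps = 1e-6                       # finite-difference check of the gradient
num = np.array([(loss(w + eps * e, X, y, b, lam) - loss(w - eps * e, X, y, b, lam)) / (2 * eps)
                for e in np.eye(3)])
print("max gradient error:", np.max(np.abs(num - grad(w, X, y, b, lam))))

# part (c): the Hessian should be positive semi-definite
print("min Hessian eigenvalue:", np.linalg.eigvalsh(hessian(w, X, y, b, lam)).min())
```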

3. (18 pts) Consider the following training data:

class   x_1   x_2
  +      1     1
  +      2     2
  +      2     0
  -      0     0
  -      1     0
  -      0     1

(a) Plot these six training points. Are the classes {+, -} linearly separable?

As seen in the plot, the classes are linearly separable.

Figure 1: Question 3

(b) Construct the weight vector of the maximum margin hyperplane by inspection and identify the support vectors.

The maximum margin hyperplane should have a slope of -1 and should pass through the point x_1 = 3/2, x_2 = 0. Therefore its equation is x_1 + x_2 = 3/2, and the weight vector is (1, 1)^T. The support vectors are the positive points (1, 1) and (2, 0) and the negative points (1, 0) and (0, 1).

(c) If you remove one of the support vectors, does the size of the optimal margin decrease, stay the same, or increase?

In this specific dataset the optimal margin increases when we remove the support vector (1, 0) or (1, 1), and stays the same when we remove either of the other two.

(d) (Extra credit) Is your answer to (c) also true for any dataset? Provide a counterexample or give a short proof.

When we drop some constraints in a constrained maximization problem, we get an optimal value which is at least as good as the previous one. This is because the set of candidates satisfying the original (larger, stronger) set of constraints is a subset of the candidates satisfying the new (smaller, weaker) set of constraints. So, under the weaker constraints, the old optimal solution is still available, and there may be additional solutions that are even better. In mathematical form:

max_{x ∈ A, x ∈ B} f(x) ≤ max_{x ∈ A} f(x)

Finally, note that in SVM problems we are maximizing the margin subject to the constraints given by the training points. When we drop any of the constraints, the margin can increase or stay the same, depending on the dataset. In problems with realistic datasets it is expected that the margin increases when we drop support vectors. The data in this problem is constructed to demonstrate that, when removing some constraints, the margin can stay the same or increase depending on the geometry.
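A small experiment makes parts (b) and (c) concrete. The sketch below is illustrative only, assuming NumPy and scikit-learn; it approximates the hard margin with a large C, so the fitted separator should be proportional to x_1 + x_2 = 3/2, and refitting after deleting one support vector shows the margin growing or staying the same, never shrinking.

```python
# Sketch: verify parts (b) and (c) on the six training points.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0], [0, 0], [1, 0], [0, 1]], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])

def fit_margin(X, y):
    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
    w, b = clf.coef_[0], clf.intercept_[0]
    return clf, 1.0 / np.linalg.norm(w), w, b

clf, margin, w, b = fit_margin(X, y)
print("w =", w, "b =", b)          # expect w proportional to (1, 1), boundary x1 + x2 = 3/2
print("support vectors:", clf.support_vectors_)
print("margin =", margin)          # expect about 1 / (2 * sqrt(2)) ~ 0.354

# part (c): drop each support vector in turn and refit
for i in clf.support_:
    _, m, _, _ = fit_margin(np.delete(X, i, axis=0), np.delete(y, i))
    print("without point", X[i], "margin =", round(m, 3))
```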

4. (12 pts) Consider a dataset with 3 points in 1-D:

Figure 2: Question 4

class   x
  +      0
  -     -1
  -     +1

(a) Are the classes {+, -} linearly separable?

Clearly the classes are not separable in 1 dimension.

(b) Consider mapping each point to 3-D using the new feature vectors φ(x) = [1, √2 x, x^2]^T. Are the classes now linearly separable? If so, find a separating hyperplane.

The points are mapped to (1, 0, 0), (1, -√2, 1), (1, √2, 1), respectively. The points are now separable in 3-dimensional space; a separating hyperplane is given by the weight vector (0, 0, -1) in the new space, as seen in the figure.

(c) Define a class variable y_i ∈ {-1, +1} which denotes the class of x_i, and let w = (w_1, w_2, w_3)^T. The max-margin SVM classifier solves the following problem:

min_{w,b}  (1/2) ‖w‖_2^2   (3)
s.t.  y_i (w^T φ(x_i) + b) ≥ 1,  i = 1, 2, 3   (4)

Using the method of Lagrange multipliers, show that the solution is ŵ = (0, 0, -2)^T, b = 1, and the margin is 1/‖ŵ‖_2.

For optimization problems with inequality constraints such as the above, we should apply the KKT conditions, which are a generalization of Lagrange multipliers. However, this problem can be solved more easily by noting that we have three vectors in the 3-dimensional space and all of them are support vectors; hence all 3 constraints hold with equality. Therefore we can apply the method of Lagrange multipliers to

min_{w,b}  (1/2) ‖w‖_2^2   (5)
s.t.  y_i (w^T φ(x_i) + b) = 1,  i = 1, 2, 3   (6)

We have 3 constraints, so we should have 3 Lagrange multipliers. We first form the Lagrangian function L(w, λ), where λ = (λ_1, λ_2, λ_3), as follows:

L(w, λ) = (1/2) ‖w‖_2^2 + Σ_{i=1}^3 λ_i (y_i (w^T φ(x_i) + b) - 1)   (7)

and differentiate with respect to the optimization variables w and b and equate to zero:

∂L(w, λ)/∂w = w + Σ_{i=1}^3 λ_i y_i φ(x_i) = 0   (8)
∂L(w, λ)/∂b = Σ_{i=1}^3 λ_i y_i = 0   (9)

Using the data points φ(x_i), we get the following equations from the above lines:

w_1 + λ_1 - λ_2 - λ_3 = 0   (10)
w_2 + √2 λ_2 - √2 λ_3 = 0   (11)
w_3 - λ_2 - λ_3 = 0   (12)
λ_1 - λ_2 - λ_3 = 0   (13)

Using (10) and (13) we get w_1 = 0. Then, plugging this into the equality constraints of the optimization problem, we get

b = 1   (14)
-(-√2 w_2 + w_3 + b) = 1   (15)
-(√2 w_2 + w_3 + b) = 1   (16)

Equations (15) and (16) imply that w_2 = 0 and w_3 = -2. Therefore the optimal weights are ŵ = (0, 0, -2)^T and b = 1, and the margin is 1/‖ŵ‖_2 = 1/2.

(d) Show that the solution remains the same if the constraints are changed to y_i (w^T φ(x_i) + b) ≥ ρ, i = 1, 2, 3, for any ρ.

Note that changing the constraints in the solution of part (c) only changes equations (14)-(16), and we get b = ρ and ŵ = (0, 0, -2ρ)^T. However, the hyperplane described by the equation ŵ^T x + b = 0 remains the same as before: {x : -2ρ x_3 + ρ = 0} = {x : -2 x_3 + 1 = 0}. Hence we have the same classifier in the two cases: assign class label + if ŵ^T x + b ≥ 0 and assign class - otherwise.

(e) (Extra credit) Is your answer to (d) also true for any dataset and ρ? Provide a counterexample or give a short proof.

This is true for any dataset, and it follows from the homogeneity of the optimization problem. For constraints y_i (w^T φ(x_i) + b) ≥ ρ, we can define new variables w' = w/ρ and b' = b/ρ, so that the constraints in the new variables are y_i (w'^T φ(x_i) + b') ≥ 1, and equivalently optimize

min_{w',b'}  (1/2) ρ^2 ‖w'‖_2^2   (17)
s.t.  y_i (w'^T φ(x_i) + b') ≥ 1,  i = 1, 2, 3   (18)

Since ρ^2 is a constant multiplying the objective function ‖w'‖_2^2, it does not change the minimizing (w', b'), and the two solutions describe the same classifier: w'^T x + b' ≥ 0 if and only if ρ w'^T x + ρ b' ≥ 0.
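As a final check on part (c), the sketch below (illustrative only, assuming NumPy and scikit-learn, with a large C approximating the hard margin) fits a linear SVM on the mapped points φ(x) = [1, √2 x, x^2] and should recover ŵ ≈ (0, 0, -2), b ≈ 1, a margin of about 1/2, and all three points as support vectors with y_i f(x_i) ≈ 1.

```python
# Sketch: numerically recover the solution of part (c).
import numpy as np
from sklearn.svm import SVC

x = np.array([0.0, -1.0, 1.0])
y = np.array([+1, -1, -1])
Phi = np.c_[np.ones_like(x), np.sqrt(2) * x, x ** 2]   # phi(x) = [1, sqrt(2) x, x^2]

clf = SVC(kernel="linear", C=1e6).fit(Phi, y)          # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w)                              # expect about (0, 0, -2)
print("b =", b)                              # expect about 1
print("margin =", 1 / np.linalg.norm(w))     # expect about 1/2
print("y_i * f(x_i) =", y * (Phi @ w + b))   # all three should be about 1
```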