Machine Learning — Support Vector Machines
Fabio Vandin, November 20, 2017
Classification and Margin

Consider a classification problem with two classes:
- instance set $X = \mathbb{R}^d$
- label set $Y = \{-1, +1\}$
- training data $S = ((x_1, y_1), \dots, (x_m, y_m))$
- hypothesis set $H$ = halfspaces

Assumption: the data is linearly separable, i.e., there exists a halfspace that perfectly classifies the training set. In general there are multiple separating hyperplanes: which one is the best choice?
Classification and Margin

The last one seems the best choice, since it can tolerate more noise. Informally, for a given separating halfspace we define its margin as its minimum distance to an example in the training set $S$. Intuition: the best separating hyperplane is the one with the largest margin. How do we find it?
Linearly Separable Training Set

The training set $S = ((x_1, y_1), \dots, (x_m, y_m))$ is linearly separable if there exists a halfspace $(w, b)$ such that $y_i = \mathrm{sign}(\langle w, x_i \rangle + b)$ for all $i = 1, \dots, m$. Equivalently:

$\forall i = 1, \dots, m : \; y_i (\langle w, x_i \rangle + b) > 0$

Informally: the margin of a separating hyperplane is its minimum distance to an example in the training set $S$.
Separating Hyperplane and Margin

Given the hyperplane $L = \{v : \langle w, v \rangle + b = 0\}$ and a point $x$, the distance of $x$ to $L$ is

$d(x, L) = \min\{\|x - v\| : v \in L\}$

Claim: if $\|w\| = 1$ then $d(x, L) = |\langle w, x \rangle + b|$ (proof: Claim 15.1 [UML]).
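The claim above can be checked numerically. The following sketch (names are mine, not from the slides) computes the point-to-hyperplane distance for a general $w$ by normalizing, which reduces to $|\langle w, x \rangle + b|$ when $\|w\| = 1$:

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance of point x to L = {v : <w, v> + b = 0}.

    For general w this is |<w, x> + b| / ||w||; when ||w|| = 1 it
    reduces to |<w, x> + b|, as in Claim 15.1 [UML].
    """
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Example: the line x1 + x2 = 0 in R^2, written with ||w|| = 1.
w = np.array([1.0, 1.0]) / np.sqrt(2.0)
d = distance_to_hyperplane([1.0, 1.0], w, 0.0)  # sqrt(2)
```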
Margin and Support Vectors

The margin of a separating hyperplane is the distance from it to the closest example in the training set. If $\|w\| = 1$ the margin is

$\min_{i \in \{1, \dots, m\}} |\langle w, x_i \rangle + b|$

The closest examples are called support vectors.
Support Vector Machine (SVM)

Hard-SVM: seek the separating hyperplane with the largest margin (only for linearly separable data). Computational problem:

$\arg\max_{(w, b) : \|w\| = 1} \; \min_{i \in \{1, \dots, m\}} |\langle w, x_i \rangle + b| \quad \text{subject to} \quad \forall i : y_i (\langle w, x_i \rangle + b) > 0$

Equivalent formulation (due to the separability assumption):

$\arg\max_{(w, b) : \|w\| = 1} \; \min_{i \in \{1, \dots, m\}} y_i (\langle w, x_i \rangle + b)$
Hard-SVM: Quadratic Programming Formulation

input: $(x_1, y_1), \dots, (x_m, y_m)$
solve: $(w_0, b_0) = \arg\min_{(w, b)} \|w\|^2$ subject to $\forall i : y_i (\langle w, x_i \rangle + b) \geq 1$
output: $\hat{w} = \frac{w_0}{\|w_0\|}$, $\hat{b} = \frac{b_0}{\|w_0\|}$

How do we get a solution? This is a quadratic optimization problem: the objective is a convex quadratic function and the constraints are linear inequalities, so Quadratic Programming solvers apply!

Proposition
The output of the algorithm above is a solution to the equivalent formulation in the previous slide. Proof: on the board and in the book (Lemma 15.2 [UML]).
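As an illustration (not part of the slides), the quadratic program above can be handed to a general constrained solver. This sketch uses `scipy.optimize.minimize` with SLSQP on a toy separable dataset; variable names and the data are my own choices:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data in R^2: two positives, two negatives.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables p = (w_1, w_2, b); objective is ||w||^2.
def objective(p):
    return p[0] ** 2 + p[1] ** 2

# One inequality constraint per example: y_i(<w, x_i> + b) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": (lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1.0)}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w0, b0 = res.x[:2], res.x[2]
# Normalize as on the slide: w_hat = w0/||w0||, b_hat = b0/||w0||.
w_hat, b_hat = w0 / np.linalg.norm(w0), b0 / np.linalg.norm(w0)
```

A dedicated QP solver would be preferable at scale; SLSQP is used here only because it ships with scipy.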
Equivalent Formulation and Support Vectors

Equivalent formulation (homogeneous halfspaces): assume the first component of every $x \in X$ is 1; then

$w_0 = \arg\min_w \|w\|^2$ subject to $\forall i : y_i \langle w, x_i \rangle \geq 1$

Support vectors = vectors at minimum distance from the hyperplane defined by $w_0$. The support vectors are the only examples that matter for defining $w_0$!

Proposition
Let $w_0$ be as above, and let $I = \{i : |\langle w_0, x_i \rangle| = 1\}$. Then there exist coefficients $\alpha_1, \dots, \alpha_m$ such that $w_0 = \sum_{i \in I} \alpha_i x_i$. Support vectors = $\{x_i : i \in I\}$.

Note: solving Hard-SVM is equivalent to finding $\alpha_i$ for $i = 1, \dots, m$, with $\alpha_i \neq 0$ only for support vectors.
Soft-SVM

Hard-SVM works only if the data is linearly separable. What if the data is not linearly separable? Soft-SVM. Idea: modify the constraints of Hard-SVM to allow for some violations, but account for the violations in the objective function.
Soft-SVM Constraints

Hard-SVM constraints: $y_i (\langle w, x_i \rangle + b) \geq 1$

Soft-SVM constraints: introduce slack variables $\xi_1, \dots, \xi_m \geq 0$ (vector $\xi$); for each $i = 1, \dots, m$:

$y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i$

$\xi_i$ measures how much the constraint $y_i (\langle w, x_i \rangle + b) \geq 1$ is violated. Soft-SVM minimizes a combination of
- the norm of $w$
- the average of the $\xi_i$

The tradeoff between the two terms is controlled by a parameter $\lambda \in \mathbb{R}$, $\lambda > 0$.
Soft-SVM: Optimization Problem

input: $(x_1, y_1), \dots, (x_m, y_m)$, parameter $\lambda > 0$
solve:

$\min_{w, b, \xi} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i \right)$ subject to $\forall i : y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$

output: $w, b$

Equivalent formulation: consider the hinge loss

$\ell^{hinge}((w, b), (x, y)) = \max\{0, 1 - y (\langle w, x \rangle + b)\}$

Given $(w, b)$ and a training set $S$, the empirical risk $L_S^{hinge}((w, b))$ is

$L_S^{hinge}((w, b)) = \frac{1}{m} \sum_{i=1}^{m} \ell^{hinge}((w, b), (x_i, y_i))$
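The hinge loss and the empirical risk above translate directly into code. A minimal sketch (names are mine, chosen to mirror the formulas):

```python
import numpy as np

def hinge_loss(w, b, x, y):
    """l_hinge((w,b),(x,y)) = max{0, 1 - y(<w,x> + b)}."""
    return max(0.0, 1.0 - y * (np.dot(w, x) + b))

def empirical_hinge_risk(w, b, X, Y):
    """L_S^hinge((w,b)) = (1/m) * sum_i l_hinge((w,b),(x_i,y_i))."""
    return float(np.mean([hinge_loss(w, b, x, y) for x, y in zip(X, Y)]))

w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
Y = np.array([1.0, 1.0, -1.0])
# Per-example losses: 0 (margin 2), 0.5 (margin 0.5), 0 (margin 1)
risk = empirical_hinge_risk(w, b, X, Y)  # (0 + 0.5 + 0) / 3 = 1/6
```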
Soft-SVM as Regularized Loss Minimization (RLM)

$\min_{w, b, \xi} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i \right)$ subject to $\forall i : y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$

Equivalent formulation with the hinge loss:

$\min_{w, b} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \ell^{hinge}((w, b), (x_i, y_i)) \right)$

that is,

$\min_{w, b} \left( \lambda \|w\|^2 + L_S^{hinge}((w, b)) \right)$

Note:
- $\lambda \|w\|^2$: $\ell_2$ regularization
- $L_S^{hinge}((w, b))$: empirical risk for the hinge loss
Soft-SVM: Solution

We need to solve:

$\min_{w, b} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \ell^{hinge}((w, b), (x_i, y_i)) \right)$

How?
- standard solvers for optimization problems
- Stochastic Gradient Descent
Gradient Descent

General approach for minimizing a differentiable convex function $f(w)$. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a differentiable function.

Definition
The gradient $\nabla f(w)$ of $f$ at $w = (w_1, \dots, w_d)$ is

$\nabla f(w) = \left( \frac{\partial f(w)}{\partial w_1}, \dots, \frac{\partial f(w)}{\partial w_d} \right)$

Intuition: the gradient points in the direction of the greatest rate of increase of $f$ around $w$.
Let $\eta \in \mathbb{R}$, $\eta > 0$ be a parameter. Gradient Descent (GD) algorithm:

w^(0) ← 0
for t ← 0 to T − 1 do
    w^(t+1) ← w^(t) − η ∇f(w^(t))
return w̄ = (1/T) Σ_{t=1}^{T} w^(t)

Notes:
- the output vector could also be $w^{(T)}$ or $\arg\min_{t \in \{1, \dots, T\}} f(w^{(t)})$
- returning the average $\bar{w}$ is useful for nondifferentiable functions (using subgradients instead of gradients) and for stochastic gradient descent

Guarantees: let $f$ be convex and $\rho$-Lipschitz continuous, and let $w^\star \in \arg\min_{w : \|w\| \leq B} f(w)$. Then for every $\epsilon > 0$, to find $\bar{w}$ such that $f(\bar{w}) - f(w^\star) \leq \epsilon$ it is sufficient to run the GD algorithm for $T \geq B^2 \rho^2 / \epsilon^2$ iterations.
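The GD loop above, including the averaged output $\bar{w}$, can be sketched as follows (the test function $f(w) = \|w - c\|^2$ and all names are my own illustrative choices):

```python
import numpy as np

def gradient_descent(grad, d, eta, T):
    """GD with averaged output: w^(t+1) = w^(t) - eta * grad f(w^(t)),
    returning w_bar = (1/T) * sum_{t=1}^T w^(t)."""
    w = np.zeros(d)          # w^(0) = 0
    avg = np.zeros(d)
    for _ in range(T):
        w = w - eta * grad(w)
        avg += w
    return avg / T

# Minimize f(w) = ||w - c||^2, whose gradient is 2(w - c);
# the minimizer is w = c.
c = np.array([3.0, -1.0])
w_bar = gradient_descent(lambda w: 2.0 * (w - c), d=2, eta=0.1, T=500)
```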
Stochastic Gradient Descent (SGD)

Idea: instead of using exactly the gradient, take a (random) vector whose expected value equals the gradient direction. SGD algorithm:

w^(1) ← 0
for t ← 1 to T do
    choose v_t at random from a distribution such that E[v_t | w^(t)] ∈ ∂f(w^(t))
    w^(t+1) ← w^(t) − η v_t
return w̄ = (1/T) Σ_{t=1}^{T} w^(t)
Guarantees: let $f$ be convex and $\rho$-Lipschitz continuous, and let $w^\star \in \arg\min_{w : \|w\| \leq B} f(w)$. Then for every $\epsilon > 0$, to find $\bar{w}$ such that $\mathbb{E}[f(\bar{w})] - f(w^\star) \leq \epsilon$ it is sufficient to run the SGD algorithm for $T \geq B^2 \rho^2 / \epsilon^2$ iterations.
Why should we use SGD instead of GD?

Question: when do we use GD in the first place? Answer: for example, to find $w$ that minimizes $L_S(w)$, i.e., $f(w) = L_S(w)$. Since $f(w)$ depends on all pairs $(x_i, y_i) \in S$, $i = 1, \dots, m$, computing its gradient may take a long time!

What about SGD? We need to pick $v_t$ such that $\mathbb{E}[v_t \mid w^{(t)}] \in \partial f(w^{(t)})$: how?
- pick a random $(x_i, y_i)$
- pick $v_t \in \partial \ell(w^{(t)}, (x_i, y_i))$: this satisfies the requirement, and requires much less computation than GD

Analogously, we can use SGD for regularized losses, etc.
SGD for Solving Soft-SVM

We want to solve

$\min_w \left( \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max\{0, 1 - y_i \langle w, x_i \rangle\} \right)$

Note: it is standard to add a $\frac{1}{2}$ in the regularization term to simplify some computations.

Algorithm:

θ^(1) ← 0
for t ← 1 to T do
    w^(t) ← (1/(λt)) θ^(t)
    choose i uniformly at random from {1, ..., m}
    if y_i ⟨w^(t), x_i⟩ < 1 then θ^(t+1) ← θ^(t) + y_i x_i
    else θ^(t+1) ← θ^(t)
return w̄ = (1/T) Σ_{t=1}^{T} w^(t)
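The algorithm above translates almost line by line into code. A minimal sketch on a toy separable dataset (the data and function names are my own, not from the slides):

```python
import numpy as np

def sgd_soft_svm(X, y, lam, T, seed=0):
    """SGD for Soft-SVM (homogeneous form, no bias), as on the slide:
    w^(t) = theta^(t)/(lam*t); update theta with y_i*x_i when the
    randomly chosen example violates the margin; return the average w."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)      # theta^(1) = 0
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        w = theta / (lam * t)
        i = rng.integers(m)  # i uniform in {0, ..., m-1}
        if y[i] * np.dot(w, X[i]) < 1:
            theta = theta + y[i] * X[i]
        w_sum += w
    return w_sum / T

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_bar = sgd_soft_svm(X, y, lam=0.1, T=2000)
```

On this separable toy set, $\bar{w}$ classifies every training example correctly (all $y_i \langle \bar{w}, x_i \rangle > 0$).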
Duality

We now present (Hard-)SVM in a different way, which is very useful for kernels. We want to solve

$w_0 = \arg\min_w \frac{1}{2} \|w\|^2$ subject to $\forall i : y_i \langle w, x_i \rangle \geq 1$

Define

$g(w) = \max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \sum_{i=1}^{m} \alpha_i (1 - y_i \langle w, x_i \rangle)$

Note that

$g(w) = \begin{cases} 0 & \text{if } \forall i : y_i \langle w, x_i \rangle \geq 1 \\ +\infty & \text{otherwise} \end{cases}$

Then the problem above can be rewritten as

$\min_w \left( \frac{1}{2} \|w\|^2 + g(w) \right)$

Rearranging the terms a bit, we obtain the equivalent problem

$\min_w \max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \left( \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i (1 - y_i \langle w, x_i \rangle) \right)$

One can prove that

$\min_w \max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \left( \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i (1 - y_i \langle w, x_i \rangle) \right) = \max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \min_w \left( \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i (1 - y_i \langle w, x_i \rangle) \right)$

Note: the result above is called strong duality.
We therefore define the dual problem

$\max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \min_w \left( \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i (1 - y_i \langle w, x_i \rangle) \right)$

Note: once $\alpha$ is fixed, the optimization with respect to $w$ is unconstrained and the objective is differentiable; at the optimum the gradient is equal to 0:

$w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i$

Substituting into the dual problem we get

$\max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \left( \frac{1}{2} \left\| \sum_{i=1}^{m} \alpha_i y_i x_i \right\|^2 + \sum_{i=1}^{m} \alpha_i \left( 1 - y_i \left\langle \sum_{j=1}^{m} \alpha_j y_j x_j, x_i \right\rangle \right) \right)$

Rearranging, we get the problem

$\max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \left( \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_j, x_i \rangle \right)$

Note:
- the solution is the vector $\alpha$, which defines the support vectors $\{x_i : \alpha_i \neq 0\}$
- the dual problem only requires computing the inner products $\langle x_j, x_i \rangle$; it never needs to consider any $x_i$ by itself
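The last observation is what makes kernels possible: the dual objective touches the data only through the Gram matrix of inner products. A small sketch (data and names are my own illustrative choices, including the hypothetical $\alpha$ values):

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Gram matrix G[i, j] = <x_i, x_j>: the only way the data enters the dual.
G = X @ X.T

def dual_objective(alpha):
    """sum_i alpha_i - (1/2) * sum_{i,j} alpha_i alpha_j y_i y_j <x_j, x_i>."""
    a = alpha * y
    return alpha.sum() - 0.5 * a @ G @ a

# Given a dual solution alpha, the primal vector is recovered as
# w = sum_i alpha_i y_i x_i (alpha_i != 0 only for support vectors).
alpha = np.array([0.1, 0.0, 0.1, 0.0])  # hypothetical values, for illustration
w = (alpha * y) @ X                      # = 0.1*(2,2) + 0.1*(2,2) = (0.4, 0.4)
```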