Machine Learning. Support Vector Machines. Manfred Huber

Size: px

Start display at page:

Download "Machine Learning. Support Vector Machines. Manfred Huber"

Brianna Berry
5 years ago
Views:

1 Machine Learning Support Vector Machines Manfred Huber

2 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data elements of two classes Forms a hyperplane that separates the data points h! (x) =! T x h x,b (x) = w T x + b Generally infinite answers Manfred Huber

3 Maximum Margin Classifier Classifiers with larger margins have stronger confidence in their predictions Larger margins represent more tolerance to noise Larger margins are more robust to outliers e x 2! Margin e x 1 Manfred Huber

4 Maximum Margin Classifier Learn a classifier that maximizes the margin Assume classes are labeled 1 and -1 How do we measure the margin of a classifier? " $ y (i) = # %$ Functional margin 1 if x (i) isinclass1!1 otherwise ˆ! (i) = y (i) (w T x (i) + b) Has scaling problems if used with a logistic function Geometric margin!! (i) = y (i) w T x (i) # " w + b w Manfred Huber $ & %

5 Maximum Margin Classifier From the margin of a data point we can formulate the margin of the classifier on a training set Margin provides a new performance function γ! = min i! (i) max γ,w,b s.t. y (i) (w T x (i) + b) γ, i =1,...,m w =1. Constrained optimization problem Manfred Huber

6 Maximum Margin Classifier Non-convex constraints are difficult so we reformulate this to max γ,w,b s.t. ˆγ w y (i) (w T x (i) + b) ˆγ, i =1,...,m Since we can scale functional margins, we can make it 1 and replace maximizing 1/ x with minimizing x 2 min γ,w,b 1 2 w 2 s.t. y (i) (w T x (i) + b) 1, i =1,...,m Solvable using Quadratic Programming Manfred Huber

7 Constrained Optimization Constrained optimization problems are harder to solve than the previous unconstrained ones Could solve the previous problem using Quadratic Programming There is a more efficient solution for this formulation that requires some optimization background Manfred Huber

8 Lagrange Multipliers Lagrange multipliers allow the transfer of some constrained optimization problems into unconstrained ones Consider a constrained optimization problem with equality constraints min w s.t. There is an unconstrained Lagrangian version f(w) h i (w) =0, i =1,...,l. L(w, β) =f(w)+ l β i h i (w) Stationary point of this is necessary for original optimum Manfred Huber

9 Lagrange Duality Lagrange duality extends this under some conditions to solving optimization problems with inequalities min w s.t. h i (w) =0, i =1,...,l. From this we can formulate the generalized Lagrangian k l L(w, α, β) =f(w)+ α i g i (w)+ β i h i (w). f(w) g i (w) 0, i =1,...,k The relation to the original problem is more complex Manfred Huber

10 Lagrange primal Lagrange Duality θ P (w) = max L(w, α, β). α,β : α i 0 The Lagrange primal has interesting properties θ P (w) = { f(w) if w satisfies primal constraints otherwise. Minimizing the primal leads to the same result as the original problem min w θ P (w) = min w max L(w, α, β), α,β : α i 0 Manfred Huber

11 Lagrange Duality Lagrange dual θ D (α, β) = min w Maximizing the dual has an interesting relation to minimizing the primal d = And under certain conditions we get max min α,β : α i 0 w d = p, L(w, α, β). Equality is fulfilled if L(w, α, β) min w n L(w, α, β).= min L(w, α, * β). * = max 0 w max L(w, α, β) =p α,β : α i 0 L(w, α, β) * * * * α,β : α i 0 Manfred Huber

12 Lagrange Duality If g i (w) are convex and strictly feasible and h i (x) are affine we get the Karush-Kuhn- Tucker (KKT) conditions we can pick L Solution w* solves the primal and we get L(w, α, β ) = 0, i =1,...,l β i g i (w ) 0, i =1,...,k L(w, α, β ) w i = 0, i =1,...,n * and β* solve the dual and we get α 0, i =1,...,k α i g i (w ) = 0, i =1,...,k Manfred Huber

13 Lagrange Duality From these we can see that in this case we also get our equality condition d = p, The dual complementarity condition provides additional insights αi g i (w ) = 0, i =1,...,k For all active inequality constraints (i.e. constraints where *>0) i we have Correspond to points that lie on the margin (support vectors) g i (w ) = 0, Number of support vectors is often much smaller than the number of data points Minimum is d+1 for d-dimensional input space Manfred Huber

14 Maximum Margin Classifier Applying this to our maximum margin problem min γ,w,b 1 2 w 2 s.t. y (i) (w T x (i) + b) 1, i =1,...,m We can rewrite the constraints as g i (w) = y (i) (w T x (i) + b)+1 0. And get a Lagrangian L(w, b, α) = 1 2 w 2 [ α i y (i) (w T x (i) + b) 1 ] Manfred Huber

15 Maximum Margin Classifier Using the inner function of the dual we can L derive functions for w* and b* Placing them into the Lagrangian results in This is a function in only in the parameters i on the sum of inner products of the data points W (α) = α i 1 y (i) y (j) α i α j x (i),x (j). 2 Manfred Huber 2015 i,j=1 15 w L(w, b, α) =w α i y (i) x (i) =0 w = α i y (i) x (i) m b L(w, b, α) = α i y (i) =0. L(w, b, α) = α i 1 2 i,j=1 y (i) y (j) α i α j (x (i) ) T x (j)

16 Maximum Margin Classifier From this we get the Lagrangian dual problem as max α W (α) = s.t. α i 1 2 α i 0, i =1,...,m α i y (i) =0, i,j=1 y (i) y (j) α i α j x (i),x (j). This problem needs to be solved for * Manfred Huber

17 Maximum Margin Classifier Once solved for * the remaining parameters can be computed w* can be computed from the inner dual function w = α i y (i) x (i). b* can be computed from the primal problem b = max i:y (i) = 1 w T x (i) + min i:y (i) =1 w T x (i). 2 Once this is solved we can make predictions by evaluating the point against the line w T x + b = ( ) α i y (i) x (i),x + b. Manfred Huber

18 Sequential Minimal Optimization An efficient way to solve for * is SMO Iteratively optimizes over pairs of parameters, keeping the other parameters fixed From the constraints we get for 1 and 2 α 1 y (1) + α 2 y (2) = α i y (i). = ζ. i=3 We can rewrite W( ) as W (α 1, α 2,...,α e this quadrat m )=W((ζ α 2 y (2) )y (1), α 2,...,α m ). α2 new = 2 H Manfred Huber Using α new,unclipped 2 as the unconstrained optimization result and considering onvincethat yours both 1 and 2 have to be at least 0 (limiting the parameter line to the upper right quadrant) H if α new,unclipped 2 >H α new,unclipped 2 if L α new,unclipped L if α new,unclipped 2 <L

19 Support Vector Machines Everything so far required linearly separable date to make the constraints feasible To make it practical we need to make it able to deal non-separable data and with outliers Introduce a regularization term to deal with constraint violations, resulting in a modified version of the original min γ,w,b 1 2 w 2 + C s.t. ξ i y (i) (w T x (i) + b) 1 ξ i, i =1,...,m ξ i 0, i =1,...,m. This added a set of additional inequality constraints Manfred Huber

20 Support Vector Machines This results in a Lagrangian L(w, b, ξ, α,r)= 1 2 wt w + C ξ i [ ] m α i y (i) (x T w + b) 1+ξ i r i ξ i. And slightly modified dual and KKT constraints max α W (α) = s.t. α i 1 2 Has the concept of active constraints (support vectors) Manfred Huber i,j=1 0 α i C, i =1,...,m α i y (i) =0, α i =0 y (i) (w T x (i) + b) 1 α i = C y (i) (w T x (i) + b) 1 0 < α i <C y (i) (w T x (i) + b) =1 y (i) y (j) α i α j x (i),x (j)

21 Support Vector Machines The optimization problem is still solvable using the SMO algorithm Clipping now requires 1 and 2 to lie within a box spanned by 0 and C C H! (1) (2) 1 y +! 2 y ="! 2 L! 1 C Manfred Huber

22 Kernels and the Kernel Trick So far we have strictly dealt with linear Support Vector Machines Can address this using non-linear features Since in the entire algorithm for SVM only uses x in inner products <x (i),x (j) > we can apply the kernel trick A kernel function is any function of the form K(x, z) =φ(x) T φ(z). The kernel trick refers to the fact that in any algorithms that only uses the inner product of the data items we can replace the inner product of the features with the kernel function Under certain conditions the value of a kernel function is significantly cheaper than computing the feature vector allowing higher dimensional feature spaces to be used Manfred Huber

23 Kernels and the Kernel Trick Example Kernels Square kernel Represents a n 2 -dimensional feature vector x 1 x 1 x 1 x 2 x 1 x 3 x 2 x 1 φ(x) = x 2 x 2 x 2 x 3 x 3 x 1 x 3 x 2 x 3 x 3 K(x,z) can be computed in O(n) time Similar applies to general polynomial kernels K(x, z) =(x T z + c) d e space, corresponding Manfred Huber K(x, z) =(x T z) 2. = ( )( n (x i x j )(z i z j ) =φ(x) T φ(z). i,j=1

24 Kernels and the Kernel Trick Gaussian kernel Corresponds to a feature vector of infinite dimension A Kernel matrix is the square matrix of values of the kernel function applied to all pairs of data points K ij = K(x (i),x (j) ) andhencek A function must is a b valid (Mercer) kernel function if and only if for any data set the Kernel matrix is K(x, z) = e! Symmetric: K i, j = K j,i Positive semi-definite z T Kz! 0 Manfred Huber x!y 2 2! 2

25 Kernel-Base Support Vector Machines The kernel trick can be applied to any algorithm that uses the data only in the form of an inner product The kernel trick allows SVMs to operate efficiently in very high-dimensional non-linear feature spaces expressed by appropriate kernel functions Together with appropriate heuristics to make the SMO algorithm efficient, this makes SVMs very efficient non-linear classifiers Manfred Huber

Support Vector Machines

CS229 Lecture notes Andrew Ng Part V Support Vector Machines This set of notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe is indeed the best)