SVM optimization and Kernel methods

Announcements
HW 4 is up, due in a week. Kaggle is up.

Outline
Review SVM optimization; non-linear transformations in SVM; soft-margin SVM.

Goal: Find margins that maximize the separability of the classes.
Margins are defined by the data points closest to the hyperplane.
Optimizing
Maximize the geometric (i.e. normalized) margin γ without misclassifying any point:
    y_i (w^T x_i + b) >= γ for all i,
subject to the normalizing constraint ||w|| = 1.
This was horribly non-convex, so we tried something else instead: substitute in the definition of the functional margin, γ = γ̂ / ||w||, which gets rid of the normalizing constraint.

Optimizing, cont.
Because the functional margin can be scaled arbitrarily, we set it to 1. Our maximization problem is now
    max_{w,b} 1/||w||   s.t.  y_i (w^T x_i + b) >= 1,  i = 1, ..., N,
which is equivalent to minimizing
    min_{w,b} (1/2) w^T w   s.t.  y_i (w^T x_i + b) >= 1,  i = 1, ..., N.
Note that γ is gone from this formulation.

Intuition that the problem is convex
    min_{w,b} (1/2) ||w||^2   s.t.  y_i (w^T x_i + b) >= 1,  i = 1, ..., N
[Figure: contour plots over (w_1, w_2) showing a quadratic objective with linear constraints.]
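As a concrete illustration of why this is a standard convex problem, here is a minimal sketch (my own, not from the slides) that hands the hard-margin primal directly to a convex solver. The toy data, variable names, and the choice of cvxpy are assumptions for illustration only.

```python
# Minimal sketch: hard-margin primal QP, min (1/2)||w||^2 s.t. y_i (w^T x_i + b) >= 1.
# Toy separable data and the cvxpy solver are illustrative choices, not from the slides.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])  # two separable blobs
y = np.hstack([-np.ones(20), np.ones(20)])

w, b = cp.Variable(2), cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]   # one margin constraint per data point
cp.Problem(objective, constraints).solve()
print("w =", w.value, "b =", b.value)
```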
Lagrange formulation
Minimize
    L(w, b, α) = (1/2)||w||^2 - Σ_i α_i [y_i (w^T x_i + b) - 1]
with respect to w and b, and maximize with respect to each α_i >= 0:
    ∂L/∂w = w - Σ_i α_i y_i x_i = 0
    ∂L/∂b = Σ_i α_i y_i = 0

Substituting into the Lagrangian
With w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0,
    L(w, b, α) = (1/2) w^T w - Σ_i α_i [y_i (w^T x_i + b) - 1]
becomes
    L(α) = Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j <x_i, x_j>,
where α_i >= 0, i = 1, ..., N, and Σ_i α_i y_i = 0.

New optimization requirements
    L(α) = Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j <x_i, x_j>,
which we maximize with respect to α, subject to α_i >= 0 for i = 1, ..., N and Σ_i α_i y_i = 0.

Extending this idea to SVMs
    L(α) = Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j <x_i, x_j>
Let's replace <x_i, x_j> with <z_i, z_j>, where z lives in a higher-dimensional space. This is not something we know how to optimize using the statistical/mathematical tools covered in this class, but it is a standard form with a known solution.
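To make the dual objective concrete, here is a short sketch (my own, not from the slides) that evaluates L(α) for a linear kernel; the function name and inputs are assumptions.

```python
# Evaluate L(alpha) = sum_i alpha_i - (1/2) sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>.
import numpy as np

def dual_objective(alpha, X, y):
    K = X @ X.T   # Gram matrix of inner products <x_i, x_j>
    return alpha.sum() - 0.5 * np.sum(np.outer(y, y) * np.outer(alpha, alpha) * K)
```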
Extending this idea to SVMs
    L(α) = Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j <z_i, z_j>
How much more expensive is this?

Increasing hypothesis complexity
    L(α) = Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j <x_i, x_j>
vs.
    L(α) = Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j <z_i, z_j>
The number of parameters has not changed: the number of α_i's depends on the number of data points, with
    w = Σ_i α_i y_i x_i   and   α_i [y_i (w^T x_i + b) - 1] = 0.
If <z_i, z_j> is easy to calculate, this problem isn't any harder. Inner products are cheap, so we can go to higher dimensions easily.

Higher dimensions: Non-linear hyperplane
Here our support vectors live in Z space; in X space we have the pre-images of our support vectors. Margins are maintained in Z space, but to compute the margin we still only need dot products.

Extending this idea to SVMs
Our algorithm can be written entirely in terms of inner products of the input features, i.e. we don't need the input data (or its transformation) except through the inner product:
    L(α) = Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j <x_i, x_j>
Let's replace <x_i, x_j> with <φ(x_i), φ(x_j)>. Computing the inner product is a fairly cheap computation; this allows us to build much more complex models.
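A small sketch (my own illustration, not from the slides) of the point that prediction, like training, only needs inner products: since w = Σ_i α_i y_i z_i, the decision value w^T z + b never requires forming z or w explicitly. The `kernel` argument is an assumed placeholder for whatever <φ(·), φ(·)> you choose.

```python
import numpy as np

def decision_value(x_new, X, y, alpha, b, kernel):
    """Compute w^T phi(x_new) + b as sum_i alpha_i y_i <phi(x_i), phi(x_new)> + b."""
    return sum(a_i * y_i * kernel(x_i, x_new) for a_i, y_i, x_i in zip(alpha, y, X)) + b

# e.g. kernel = lambda u, v: (u @ v + 1.0) ** 2 plays the role of <z_i, z> here
```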
Kernel trick
We can always compute K(x, z) by finding φ(x) and φ(z) and taking their inner product, but it's easy to imagine how this can get computationally expensive. For some K(x, z), though, we don't actually need to compute the individual φ(x) and φ(z); instead we get the inner product directly. This lets us get an SVM result in the high-dimensional feature space given by φ, but without explicitly finding or representing φ(x): effectively free higher-dimensional hypotheses.

Generalizing the kernel
The kernel (x^T z + c)^2 and its feature map φ give some intuition for how kernels work: c controls the relative weighting between the x_i (first-order) terms and the x_i x_j (second-order) terms. More broadly, the kernel K(x, z) = (x^T z + c)^d corresponds to a feature mapping into an (n+d choose d)-dimensional feature space containing all monomials up to order d. Whereas computing φ(x) takes O(n^d) time, the kernel still takes only O(n) time: we never explicitly represent feature vectors in the high-dimensional space.

Kernels as a similarity measure
If φ(x) and φ(z) are close together, what should K(x, z) be? Remember that the kernel is replacing our dot product (which measures the angle between two vectors). K(x, z) acts as a sort of (approximate) similarity between x and z: if φ(x) and φ(z) are close, K(x, z) is larger; if φ(x) and φ(z) are far apart, K(x, z) is small.

Gaussian kernel
    K(x, z) = exp(-||x - z||^2 / (2σ^2))
Captures similarity: it's close to 1 when x and z are close, and close to 0 when x and z are far apart.
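A quick numerical check (my own illustration, with an assumed explicit feature map for n = 2 inputs) that the polynomial kernel really is an inner product in the expanded space, plus the Gaussian kernel's similarity behavior:

```python
import numpy as np

def phi_poly2(x, c=1.0):
    """Explicit feature map whose inner product equals (x^T z + c)^2 for 2-D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

def k_poly2(x, z, c=1.0):
    return (x @ z + c) ** 2

def k_gaussian(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi_poly2(x) @ phi_poly2(z), k_poly2(x, z))   # same value, no explicit phi needed for k_poly2
print(k_gaussian(x, x), k_gaussian(x, x + 10.0))    # ~1 for close points, ~0 for far ones
```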
Non-separable case
By mapping the input features into a high-dimensional space, kernels increase the likelihood that the data is actually separable, either because the data is already linearly separable or because it can be made so by a (potentially non-linear) transformation.

Non-separable case
In some cases, we don't want to find a separating hyperplane at all: it can be susceptible to outliers, and sometimes we want a robust rather than a perfectly accurate hypothesis.

Regularization in optimal margin classifiers
Extend the optimal margin classifier (linear SVM) to non-linearly-separable data, and make the margin less sensitive to outliers. Introduce ℓ1 regularization: as in ridge regression, penalize points for being inside the margin, or for being incorrectly classified:
    min_{w,b,ξ} (1/2)||w||^2 + C Σ_i ξ_i
    s.t.  y_i (w^T x_i + b) >= 1 - ξ_i,  i = 1, ..., n,  and  ξ_i >= 0.
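Extending the earlier cvxpy sketch (still my own illustration, with assumed toy data): the soft-margin primal just adds slack variables ξ_i and the C Σ_i ξ_i penalty.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (30, 2)), rng.normal(1, 1.0, (30, 2))])  # overlapping classes
y = np.hstack([-np.ones(30), np.ones(30)])
n, C = len(y), 1.0

w, b, xi = cp.Variable(2), cp.Variable(), cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]   # slack lets some points violate the margin
cp.Problem(objective, constraints).solve()
print("margin violations:", int(np.sum(xi.value > 1e-6)))
```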
Regularization and SVMs
    min_{w,b,ξ} (1/2)||w||^2 + C Σ_i ξ_i
    s.t.  y_i (w^T x_i + b) >= 1 - ξ_i,  i = 1, ..., n,  and  ξ_i >= 0
Training data points can now have a (functional) margin less than 1. If a data point has functional margin 1 - ξ_i (with ξ_i > 0), there is a cost in our objective function of C ξ_i. C controls the relative weighting between minimizing ||w||^2 and having functional margins of at least 1.

Regularization
As in ridge regression, we can control the penalty C. In ridge we called it λ, and it controlled the size of the βs relative to the ability to fit the training data. Now we're trading off between large margins (making ||w||^2 small) and functional margins of at least 1 (ξ_i = 0). As C grows large enough to force every ξ_i to 0 (on separable data), we recover the optimal margin classifier.

Solving for w and b
We need to construct the Lagrangian again, this time with regularization. We have two sets of Lagrange multipliers: one for the constraints y_i (w^T x_i + b) >= 1 - ξ_i, and another for ξ_i >= 0:
    L(w, b, ξ, α, r) = (1/2) w^T w + C Σ_i ξ_i - Σ_i α_i [y_i (w^T x_i + b) - 1 + ξ_i] - Σ_i r_i ξ_i
α_i and r_i are Lagrange multipliers, so α_i, r_i >= 0. We take the derivatives with respect to w and b, set them to 0, and simplify to obtain
    max_α W(α) = Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j <x_i, x_j>
    s.t.  0 <= α_i <= C,  i = 1, ..., N,  and  Σ_i α_i y_i = 0.
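As a hedged illustration of the C trade-off (my own example, not from the slides), scikit-learn's SVC solves essentially this regularized objective: small C tolerates more slack and gives a larger margin, while large C pushes toward the hard-margin solution.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C -> larger ||w|| (smaller geometric margin) and typically fewer support vectors.
    print(C, np.linalg.norm(clf.coef_[0]), int(clf.n_support_.sum()))
```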
Solving for w and b
    max_α Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j <x_i, x_j>
    s.t.  0 <= α_i <= C,  i = 1, ..., N,  and  Σ_i α_i y_i = 0
Note that we now have an upper bound C on each α_i; w can still be expressed in terms of the α_i's.

Support vectors
Previously, the value of α_i told us whether a point was a support vector: if α_i > 0 it was a support vector; otherwise we knew the data point was not near the decision boundary. Now, though, we can have points inside the margin. How do we determine if something is a support vector? The only change from regularization is the upper bound on α_i:
    α_i = 0       =>  y_i (w^T x_i + b) >= 1   (far from the boundary)
    α_i = C       =>  y_i (w^T x_i + b) <= 1   (in the margin / misclassified)
    0 < α_i < C   =>  y_i (w^T x_i + b) = 1    (support vector on the margin)
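A hedged sketch (my own, using scikit-learn's SVC, which exposes y_i α_i for its support vectors in `dual_coef_`) of reading off these three cases from the learned α's; the toy data and tolerance are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(1, 1.2, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
alpha = np.abs(clf.dual_coef_[0])              # alpha_i for each support vector (0 < alpha_i <= C)

on_margin = int(np.sum(alpha < C - 1e-8))      # 0 < alpha_i < C: exactly on the margin
at_bound = int(np.sum(alpha >= C - 1e-8))      # alpha_i = C: inside the margin or misclassified
far_away = len(X) - len(clf.support_)          # alpha_i = 0: not a support vector
print(on_margin, at_bound, far_away)
```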
Algorithms for solving SVMs
We saw that we needed quadratic programming and Lagrange multipliers for standard SVMs. How about an algorithm for the soft-margin (i.e. non-separable) SVM?

Aside: Coordinate ascent
Let's say we have some function W(α_1, α_2, ..., α_n) that we're trying to maximize. What methods have we seen to optimize such a problem? A new algorithm:
    Loop until convergence: {
        For i = 1, ..., n {
            α_i <- argmax_{α̂_i} W(α_1, ..., α_{i-1}, α̂_i, α_{i+1}, ..., α_n)
        }
    }

Intermediate step: Coordinate ascent
(Same loop as above.) Hold all variables except α_i fixed, and reoptimize W(α) with respect to just the parameter α_i. Here we're visiting the α's in order, but you could imagine other selection rules (e.g. one that maximizes the increase in W(α)).

Sequential minimal optimization (SMO) algorithm
We want to solve
    max_α W(α) = Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j <x_i, x_j>
    s.t.  0 <= α_i <= C,  i = 1, ..., N,  and  Σ_i α_i y_i = 0.
Let's say we have a set of α_i's that satisfy our constraints. What happens if we try coordinate ascent?
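Before seeing why plain coordinate ascent gets stuck on the SVM dual, here is a minimal sketch of the loop above (my own toy example, not from the slides: a concave quadratic where each one-dimensional argmax has a closed form).

```python
import numpy as np

def coordinate_ascent(Q, c, n_sweeps=50):
    """Maximize W(a) = -1/2 a^T Q a + c^T a, one coordinate at a time."""
    a = np.zeros(len(c))
    for _ in range(n_sweeps):
        for i in range(len(c)):
            # Hold every coordinate except a_i fixed; the 1-D argmax solves dW/da_i = 0.
            a[i] = (c[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]
    return a

Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite, so W is concave
c = np.array([1.0, -1.0])
print(coordinate_ascent(Q, c), np.linalg.solve(Q, c))   # both give the maximizer
```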
We're stuck
Because we know that Σ_i α_i y_i = 0, we can see that
    α_1 y_1 = -Σ_{i=2}^n α_i y_i,
or equivalently (multiplying both sides by y_1),
    α_1 = -y_1 Σ_{i=2}^n α_i y_i.
So α_1 is determined by the other α's, and we can't get a better estimate for α_1 by changing it alone. Instead we'll consider two α's at a time.

SMO algorithm
Loop until convergence: {
    1. Select some pair α_i and α_j to update next (usually using a heuristic that picks the two that will allow the largest progress toward the global optimum).
    2. Reoptimize W(α) with respect to α_i and α_j, while holding all other α_k (k ≠ i, j) fixed.
}
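A rough sketch of what step 2 can look like (my own simplification in the spirit of the standard "simplified SMO" update, not necessarily the exact method intended in the lecture); `K` is a precomputed kernel matrix and all names are assumptions.

```python
import numpy as np

def smo_pair_update(i, j, alpha, b, y, C, K):
    """Jointly reoptimize alpha_i, alpha_j keeping sum_k alpha_k y_k = 0 and 0 <= alpha <= C."""
    f = (alpha * y) @ K + b                    # current decision values f(x_k)
    E_i, E_j = f[i] - y[i], f[j] - y[j]        # prediction errors

    # Feasible segment for alpha_j implied by the equality and box constraints.
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = 2 * K[i, j] - K[i, i] - K[j, j]      # curvature of W along the segment
    if L >= H or eta >= 0:
        return alpha, b                        # no progress possible with this pair

    a_i_old, a_j_old = alpha[i], alpha[j]
    alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
    alpha[i] = a_i_old + y[i] * y[j] * (a_j_old - alpha[j])

    # Update the intercept so the KKT conditions hold for the two changed points.
    b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * K[i, i] - y[j] * (alpha[j] - a_j_old) * K[i, j]
    b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * K[i, j] - y[j] * (alpha[j] - a_j_old) * K[j, j]
    if 0 < alpha[i] < C:
        b = b1
    elif 0 < alpha[j] < C:
        b = b2
    else:
        b = (b1 + b2) / 2
    return alpha, b
```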