SVM optimization and Kernel methods

Announcements
HW 4 is up; it is due in a week. Kaggle is up.

Outline
Review
SVM optimization
Non-linear transformations in SVM
Soft-margin SVM

Review
Goal: find the margin that maximizes the separability of the classes. The margin is defined by the data points closest to the hyperplane.

Optimizing
Maximize the geometric (i.e. normalized) margin γ without misclassifying any point, subject to the normalizing constraint ||w|| = 1:

$$\max_{\gamma, w, b} \; \gamma \qquad \text{s.t.} \quad y_i(w^T x_i + b) \ge \gamma \;\; \text{for all } i, \qquad \|w\| = 1$$

This was horribly non-convex, so we tried something else instead: substitute in the definition of the geometric margin in terms of the functional margin \hat{\gamma},

$$\gamma = \frac{\hat{\gamma}}{\|w\|},$$

which gets rid of the normalizing constraint.

Optimizing, cont.
Because the functional margin can be scaled arbitrarily, we fix it to the value 1. Our maximization problem is now

$$\max_{w, b} \; \frac{1}{\|w\|} \qquad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \;\; i = 1, \dots, N$$

which is equivalent to the minimization

$$\min_{w, b} \; \frac{1}{2} w^T w \qquad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \;\; i = 1, \dots, N$$

Note that γ is gone from this formulation.

Intuition that the problem is convex

$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \qquad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \;\; i = 1, \dots, N$$

[Figures over the (w_1, w_2) plane: a quadratic objective with linear constraints, i.e. a convex objective over a convex feasible region.]
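This constrained quadratic program can be handed to any off-the-shelf convex solver. Below is a minimal sketch, not from the lecture, using the cvxpy library on a tiny, hypothetical linearly separable data set; the arrays X and y are assumptions for illustration.

    import cvxpy as cp
    import numpy as np

    # Tiny, linearly separable toy data (assumed for illustration only).
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(2)
    b = cp.Variable()

    # min (1/2) ||w||^2   s.t.   y_i (w^T x_i + b) >= 1 for every i
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()

    print("w =", w.value, "b =", b.value)

cvxpy is only a convenient stand-in here: the objective is quadratic and the constraints are linear, so any QP solver would do.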

Lagrange formulation
Minimize

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$$

with respect to w and b, and maximize with respect to each α_i ≥ 0:

$$\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0, \qquad \frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0$$

Substituting into the Lagrangian
From the derivatives, w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0. Plugging these into

$$L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$$

gives

$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle$$

where α_i ≥ 0, i = 1, ..., N, and Σ_i α_i y_i = 0.

New optimization requirements

$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle$$

which we maximize with respect to α, subject to α_i ≥ 0 for i = 1, ..., N and Σ_i α_i y_i = 0.

Extending this idea to SVMs

$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle$$

Let's replace ⟨x_i, x_j⟩ with ⟨z_i, z_j⟩, where z lives in a higher-dimensional space. This is not something we know how to optimize with the statistical/mathematical tools of this class, but it is a standard form with a known solution.
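Once a QP solver returns the α's, the primal solution follows directly from the conditions above. A minimal numpy sketch; the function name primal_from_dual and its inputs are hypothetical, not from the lecture.

    import numpy as np

    # Hypothetical helper: recover the primal (w, b) from a dual solution.
    # alphas, y, X are assumed to come from a QP solver applied to L(alpha).
    def primal_from_dual(alphas, y, X, tol=1e-8):
        # Stationarity condition: w = sum_i alpha_i y_i x_i
        w = (alphas * y) @ X
        # Any point with alpha_i > 0 lies on the margin, so y_i (w^T x_i + b) = 1
        # and therefore b = y_i - w^T x_i for that point.
        sv = np.where(alphas > tol)[0][0]
        b = y[sv] - X[sv] @ w
        return w, b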

Extending this idea to SVMs

$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle z_i, z_j \rangle$$

How much more expensive is this?

Increasing hypothesis complexity

$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \qquad \text{vs.} \qquad L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle z_i, z_j \rangle$$

The number of parameters has not changed: the number of α_i's is determined by the number of data points, and we still have w = Σ_i α_i y_i x_i and α_i [y_i (w^T x_i + b) - 1] = 0. If ⟨z_i, z_j⟩ is easy to calculate, this problem isn't any harder. Inner products are cheap, so we can go to higher dimensions easily.

Higher dimensions: Non-linear hyperplane
Here our support vectors live in Z space; in X space we have the pre-images of our support vectors. Margins are maintained in Z space, but to compute the margin we still only need dot products.

Extending this idea to SVMs
Our algorithm can be written entirely in terms of inner products of the input features; that is, we don't need the input data (or its transformation) except through the inner product.

$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle$$

Let's replace ⟨x_i, x_j⟩ with ⟨φ(x_i), φ(x_j)⟩. Computing the inner product is a fairly cheap computation, and this allows us to build much more complex models.
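As a concrete illustration of the transformed inner products ⟨z_i, z_j⟩, here is a short numpy sketch using a hypothetical explicit feature map (all monomials of a 2-D input up to degree 2); only the resulting Gram matrix of dot products ever enters the dual objective.

    import numpy as np

    # Hypothetical explicit feature map phi: R^2 -> R^6 (all monomials up to degree 2).
    def phi_explicit(x):
        x1, x2 = x
        return np.array([1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2])

    X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]])

    # Transform every point into Z space and form the Gram matrix of inner
    # products <z_i, z_j>; these dot products are all the dual objective needs.
    Z = np.array([phi_explicit(x) for x in X])
    gram = Z @ Z.T
    print(gram)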

Kernel trick
We can always compute K(x, z) by finding φ(x) and φ(z) and taking their inner product, but it's easy to imagine how this can get computationally expensive. For some K(x, z), though, we don't actually need to compute the individual φ(x) and φ(z); we get the inner product directly. This lets us obtain an SVM result in the high-dimensional feature space given by φ without ever explicitly finding or representing φ(x): effectively, higher-dimensional hypotheses for free.

Generalizing the kernel
The kernel (x^T z + c)^2 and its feature map φ give some intuition for how kernels work: c controls the relative weighting between the x_i (first-order) terms and the x_i x_j (second-order) terms. More broadly, the kernel

$$K(x, z) = (x^T z + c)^d$$

corresponds to a feature mapping into a \binom{n+d}{d}-dimensional feature space containing all monomial terms of x up to degree d. Whereas computing φ(x) takes O(n^d) time, evaluating the kernel still takes only O(n) time; we never explicitly represent feature vectors in the high-dimensional space.

Kernels as a similarity measure
If φ(x) and φ(z) are close together, what should K(x, z) be? Remember that the kernel replaces our dot product (which measures the angle between two vectors), so K(x, z) acts as a rough measure of how similar x and z are: if φ(x) and φ(z) are close, K(x, z) is larger; if they are far apart, K(x, z) is small.

Gaussian kernel

$$K(x, z) = \exp\!\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$$

This captures similarity: it is close to 1 when x and z are close together and close to 0 when they are far apart.
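A quick numerical check of the claim, assuming 2-D inputs and a degree-2 polynomial kernel (the function names are illustrative, not from the lecture): the kernel value (x^T z + c)^2 equals the inner product of a suitably scaled explicit feature map, while the Gaussian kernel returns a similarity in (0, 1].

    import numpy as np

    # Degree-2 polynomial kernel: K(x, z) = (x^T z + c)^2.
    def poly2_kernel(x, z, c=1.0):
        return (x @ z + c) ** 2

    # An explicit feature map whose inner product reproduces that kernel for
    # 2-D inputs; the sqrt(2) and sqrt(2c) factors make phi(x).phi(z) match exactly.
    def phi_poly2(x, c=1.0):
        x1, x2 = x
        return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2,
                         np.sqrt(2.0 * c) * x1, np.sqrt(2.0 * c) * x2, c])

    # Gaussian kernel from the slide: close to 1 for nearby points, near 0 far apart.
    def gaussian_kernel(x, z, sigma=1.0):
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    x = np.array([1.0, 2.0])
    z = np.array([0.5, -1.0])
    print(poly2_kernel(x, z), phi_poly2(x) @ phi_poly2(z))  # same value, no explicit phi needed
    print(gaussian_kernel(x, z))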

Non-separable case
By mapping the input features into a high-dimensional space, kernels increase the likelihood that the data is actually separable, either because the data is already linearly separable or because it can be made so by a (potentially non-linear) transformation.

Non-separable case
In some cases, we don't want to find a separating hyperplane: it can be susceptible to outliers, and sometimes we want a robust rather than a maximally accurate hypothesis.

Regularization in optimal margin classifiers
Extend the optimal margin classifier (linear SVM) to non-linearly-separable data and make the margin less sensitive to outliers by introducing ℓ1 regularization. As in ridge regression, we add a penalty: points pay for being inside the margin, or for being incorrectly classified.

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \qquad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \;\; i = 1, \dots, n, \qquad \xi_i \ge 0$$
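A minimal cvxpy sketch of this soft-margin primal, assuming toy data with one deliberately mislabeled point; the variable and array names are illustrative, not from the course.

    import cvxpy as cp
    import numpy as np

    # Toy data with one deliberately mislabeled point (assumed for illustration).
    X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0], [1.8, 1.9]])
    y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])   # the last point acts as an outlier
    n, d = X.shape
    C = 1.0

    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)   # slack variables, xi_i >= 0

    # min (1/2)||w||^2 + C sum_i xi_i   s.t.   y_i (w^T x_i + b) >= 1 - xi_i
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
    cp.Problem(objective, constraints).solve()

    print("w =", w.value, "b =", b.value, "slacks =", xi.value)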

Regularization and SVMs

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \qquad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \;\; i = 1, \dots, n, \qquad \xi_i \ge 0$$

Training data points can now have a (functional) margin less than 1. If a data point has functional margin 1 − ξ_i (with ξ_i > 0), there is a cost of Cξ_i in our objective function. C controls the relative weighting between minimizing ||w||² and having functional margins of at least 1.

Regularization
As in ridge regression, we can control the penalty C. In ridge we called it λ, and it controlled the size of the βs relative to the ability to fit the training data. Now we are trading off between large margins (making ||w||² small) and functional margins of at least 1 (ξ_i = 0). As C grows, slack becomes more expensive; in the limit we recover the original optimal margin classifier.

Solving for w and b
We need to construct the Lagrangian again, this time with regularization. We have two sets of Lagrange multipliers: one for the constraints y_i(w^T x_i + b) ≥ 1 − ξ_i and another for ξ_i ≥ 0.

$$L(w, b, \xi, \alpha, r) = \frac{1}{2} w^T w + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 + \xi_i \right] - \sum_i r_i \xi_i$$

Since α_i and r_i are Lagrange multipliers, α_i, r_i ≥ 0. We take the derivatives with respect to w and b, set them to 0, and simplify to obtain

$$\max_\alpha \; W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; i = 1, \dots, N, \qquad \sum_i \alpha_i y_i = 0$$
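This box-constrained dual is a standard QP. Here is a hedged sketch, not the course's solver, using scipy.optimize.minimize with SLSQP on toy separable data chosen so that every nonzero α stays strictly below C.

    import numpy as np
    from scipy.optimize import minimize

    # Separable toy data (assumed), so every nonzero alpha stays strictly below C.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    C = 10.0
    n = len(y)

    # Q_ij = y_i y_j <x_i, x_j>, so W(alpha) = sum(alpha) - (1/2) alpha^T Q alpha.
    Q = np.outer(y, y) * (X @ X.T)

    def neg_W(alpha):
        return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

    res = minimize(neg_W, x0=np.zeros(n), method="SLSQP",
                   bounds=[(0.0, C)] * n,                                 # 0 <= alpha_i <= C
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
    alpha = res.x

    # Recover w from stationarity and b from a margin support vector (0 < alpha_i < C).
    w = (alpha * y) @ X
    sv = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0][0]
    b = y[sv] - X[sv] @ w
    print(alpha, w, b)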

Solving for w and b

$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; i = 1, \dots, N, \qquad \sum_i \alpha_i y_i = 0$$

Note that we now have an upper bound C on each α_i, and w can still be expressed in terms of the α_i's.

Support vectors
Previously, the value of α_i told us whether a point was a support vector: if α_i was greater than 0, it was a support vector; otherwise, we knew the data point was not near the decision boundary. Now, though, we can have points inside the margin. How do we determine whether something is a support vector? The only change from regularization is the upper bound on α_i:

α_i = 0  ⇒  y_i(w^T x_i + b) ≥ 1  (far from the boundary)
α_i = C  ⇒  y_i(w^T x_i + b) ≤ 1  (in the margin or misclassified)
0 < α_i < C  ⇒  y_i(w^T x_i + b) = 1  (on the margin: a support vector)
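A small numpy helper, purely illustrative, that applies these three cases to a vector of α's returned by a dual solver; the tolerance tol is an assumption to absorb numerical round-off.

    import numpy as np

    # Classify points by their alpha value from the box-constrained dual.
    def categorize_points(alphas, C, tol=1e-6):
        off_margin = alphas <= tol                        # alpha_i = 0: outside the margin
        on_margin = (alphas > tol) & (alphas < C - tol)   # 0 < alpha_i < C: on the margin
        in_margin = alphas >= C - tol                     # alpha_i = C: inside margin or misclassified
        return off_margin, on_margin, in_margin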

Algorithms for solving SVMs
We saw that standard SVMs need quadratic programming and Lagrange multipliers. How about an algorithm for the soft-margin (i.e. non-separable) SVM?

Aside: Coordinate ascent
Say we have some function W(α_1, α_2, ..., α_n) that we are trying to maximize. What methods have we seen to optimize such a problem? Here is a new algorithm:

Loop until convergence: {
    For i = 1, ..., n {
        α_i := argmax_{\hat{α}_i} W(α_1, ..., α_{i−1}, \hat{α}_i, α_{i+1}, ..., α_n)
    }
}

Intermediate step: Coordinate ascent
Hold all variables except α_i fixed and reoptimize W(α) with respect to just the parameter α_i. Here we visit the α's in order, but you could imagine other selection rules (e.g. one that picks the coordinate giving the largest increase in W(α)).

Sequential minimal optimization (SMO)
We want to solve

$$\max_\alpha \; W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; i = 1, \dots, N, \qquad \sum_i \alpha_i y_i = 0$$

Say we have a set of α_i's that satisfy our constraints. What happens if we try coordinate ascent?
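Before answering that question, here is a minimal sketch, not from the lecture, of plain coordinate ascent on a hypothetical unconstrained concave quadratic, just to make the update pattern concrete; note that the constraint Σ_i α_i y_i = 0 is deliberately absent, which is exactly what the next slide addresses.

    import numpy as np

    # Coordinate ascent on a hypothetical concave quadratic
    # W(a) = -(1/2) a^T A a + b^T a, with A positive definite.
    A = np.array([[2.0, 0.5], [0.5, 1.0]])
    b_vec = np.array([1.0, 1.0])
    a = np.zeros(2)

    for _ in range(50):                       # "loop until convergence"
        for i in range(len(a)):
            # Re-optimize W with respect to a_i alone, holding the rest fixed:
            # setting dW/da_i = b_i - (A a)_i = 0 gives a closed-form update.
            a[i] = (b_vec[i] - A[i] @ a + A[i, i] * a[i]) / A[i, i]

    print(a, np.linalg.solve(A, b_vec))       # matches the exact unconstrained maximizer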

We're stuck
Because we know that Σ_i α_i y_i = 0, we can see that

$$\alpha_1 y_1 = -\sum_{i=2}^{n} \alpha_i y_i, \qquad \text{equivalently} \qquad \alpha_1 = -y_1 \sum_{i=2}^{n} \alpha_i y_i,$$

so we cannot get a better estimate for α_1 on its own. Instead we'll consider two α's at a time.

SMO algorithm
Loop until convergence {
    1. Select some pair α_i and α_j to update next (usually using a heuristic that picks the two that will allow the largest progress toward the global maximum).
    2. Reoptimize W(α) with respect to α_i and α_j, while holding all α_k (k ≠ i, j) fixed.
}
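The lecture stops at this outline. The sketch below fills in the analytic two-variable update in the style of Platt's SMO, as an assumption about what step 2 could look like rather than the course's own code: K (the Gram matrix), alphas, b, y, and C are assumed inputs, and the pair-selection heuristic and outer convergence loop are omitted.

    import numpy as np

    # One SMO-style update for a chosen pair (i, j); updates alphas in place.
    def smo_pair_update(i, j, alphas, b, K, y, C):
        # Prediction errors E_m = f(x_m) - y_m for the two chosen points.
        f = (alphas * y) @ K + b
        E_i, E_j = f[i] - y[i], f[j] - y[j]

        # The box constraints 0 <= alpha <= C and sum_k alpha_k y_k = 0 confine
        # the pair to a line segment; [L, H] are the feasible values of alpha_j.
        if y[i] != y[j]:
            L, H = max(0.0, alphas[j] - alphas[i]), min(C, C + alphas[j] - alphas[i])
        else:
            L, H = max(0.0, alphas[i] + alphas[j] - C), min(C, alphas[i] + alphas[j])
        if L == H:
            return alphas, b

        eta = 2.0 * K[i, j] - K[i, i] - K[j, j]   # curvature along the segment
        if eta >= 0:
            return alphas, b

        alpha_i_old, alpha_j_old = alphas[i], alphas[j]
        alphas[j] = np.clip(alpha_j_old - y[j] * (E_i - E_j) / eta, L, H)
        alphas[i] = alpha_i_old + y[i] * y[j] * (alpha_j_old - alphas[j])

        # Bias update so the KKT conditions hold for the updated pair.
        b1 = b - E_i - y[i] * (alphas[i] - alpha_i_old) * K[i, i] \
                     - y[j] * (alphas[j] - alpha_j_old) * K[i, j]
        b2 = b - E_j - y[i] * (alphas[i] - alpha_i_old) * K[i, j] \
                     - y[j] * (alphas[j] - alpha_j_old) * K[j, j]
        if 0 < alphas[i] < C:
            b = b1
        elif 0 < alphas[j] < C:
            b = b2
        else:
            b = (b1 + b2) / 2.0
        return alphas, b

Production solvers such as LIBSVM implement a far more careful version of this idea, with working-set selection heuristics and kernel caching.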