Support Vector Machines MIT Course Notes Cynthia Rudin


Support Vector Machines MIT 15.097 Course Notes, Cynthia Rudin. Credit: Ng, Hastie, Tibshirani, Friedman. Thanks: Şeyda Ertekin

Let's start with some intuition about margins. The margin of an example $x_i$ is its distance from the decision boundary:

$$\text{margin of } x_i = y_i f(x_i).$$

The margin is positive if the example is on the correct side of the decision boundary; otherwise it's negative.

Here's the intuition for SVMs: we want all examples to have large margins, i.e., to be as far from the decision boundary as possible. That way, the decision boundary is more stable, and we are confident in all decisions.

Most other algorithms (logistic regression, decision trees, perceptron) don't generally produce large margins. (AdaBoost generally produces large margins.)

As in logistic regression and AdaBoost, the function $f$ is linear:

$$f(x) = \sum_{j=1}^{n} \lambda^{(j)} x^{(j)} + \lambda_0.$$

Note that the intercept term can get swept into $x$ by adding a 1 as the last component of each $x$; then $f(x)$ would be just $\lambda^T x$. For this lecture we'll keep the intercept term separate, because SVM handles that term differently than if you put the intercept in as a separate feature.

We classify $x$ using $\operatorname{sign}(f(x))$. If $x_i$ has a large margin, we are confident that we classified it correctly. So we're essentially suggesting to use the margin $y_i f(x_i)$ to measure the confidence in our prediction.

But there is a problem with using $y_i f(x_i)$ to measure confidence in prediction: there is some arbitrariness about it. How? What should we do about it?
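To make the arbitrariness concrete, here is a small sketch (the weights and points below are made up for illustration): it computes $f(x)$ and the functional margin $y_i f(x_i)$, and then shows that scaling $(\lambda, \lambda_0)$ by any $c > 0$ inflates every margin by $c$ without moving the decision boundary.

```python
import numpy as np

# Hypothetical linear scoring function f(x) = lam . x + lam0 (values assumed).
lam = np.array([2.0, -1.0])   # coefficient vector lambda
lam0 = 0.5                    # intercept lambda_0

def f(x):
    return lam @ x + lam0

def margin(x, y):
    # Functional margin y_i * f(x_i): positive iff sign(f(x)) matches the label y.
    return y * f(x)

x_pos = np.array([1.0, 0.0])   # f = 2.5, label +1 -> margin 2.5
x_neg = np.array([0.0, 1.0])   # f = -0.5, label -1 -> margin 0.5
print(margin(x_pos, +1))       # 2.5
print(margin(x_neg, -1))       # 0.5

# The arbitrariness: scaling (lam, lam0) by c = 10 leaves sign(f) unchanged
# but inflates the "confidence" tenfold.
c = 10.0
print(x_pos @ (c * lam) + c * lam0)   # 25.0, same sign, bigger margin
```

This is exactly why the margin has to be normalized before it means anything geometric, which is what the next section does.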

SVMs maximize the distance from the decision boundary to the nearest training example: they maximize the minimum margin. There is a geometric perspective too. I set the intercept to zero for this picture (so the decision boundary passes through the origin): the decision boundary is the set of $x$'s where $\lambda^T x = 0$. That means the unit vector for $\lambda$ must be perpendicular to those $x$'s that lie on the decision boundary.

Now that you have the intuition, we'll put the intercept back, and we have to translate the decision boundary: it's really the set of $x$'s where $\lambda^T x = -\lambda_0$.

The margin of example $i$ is denoted $\gamma_i$. Let $B$ be the point on the decision boundary closest to the positive example $x_i$. Then

$$B = x_i - \gamma_i \frac{\lambda}{\|\lambda\|}$$

since we moved $\gamma_i$ units along the unit vector to get from the example to $B$. Since $B$ lies on the decision boundary, it obeys $\lambda^T x + \lambda_0 = 0$, where $x$ is $B$ (I wrote the intercept there explicitly). So

$$\lambda^T \left( x_i - \gamma_i \frac{\lambda}{\|\lambda\|} \right) + \lambda_0 = 0.$$

Simplifying,

$$\lambda^T x_i - \gamma_i \frac{\lambda^T \lambda}{\|\lambda\|} + \lambda_0 = 0$$

$$\gamma_i = \frac{\lambda^T}{\|\lambda\|} x_i + \frac{\lambda_0}{\|\lambda\|} =: \tilde f(x_i) \quad \text{(this is the normalized version of } f\text{)}$$

$$= y_i \tilde f(x_i) \quad \text{since } y_i = 1.$$

Note that here we normalized so we wouldn't have the arbitrariness in the meaning of the margin. If the example is negative, the same calculation works, with a few sign flips (we'd need to move $-\gamma_i$ units rather than $\gamma_i$ units). So the geometric margin from the picture is the same as the functional margin $y_i \tilde f(x_i)$.

Maximize the minimum margin

Support vector machines maximize the minimum margin. They would like to have all examples be far from the decision boundary. So they'll choose $f$ this way:

$$\max_{\gamma, \lambda, \lambda_0} \gamma \quad \text{s.t.} \quad y_i \tilde f(x_i) \ge \gamma, \quad i = 1, \dots, m$$

$$\max_{\gamma, \lambda, \lambda_0} \gamma \quad \text{s.t.} \quad y_i \frac{\lambda^T x_i + \lambda_0}{\|\lambda\|} \ge \gamma, \quad i = 1, \dots, m$$

$$\max_{\gamma, \lambda, \lambda_0} \gamma \quad \text{s.t.} \quad y_i (\lambda^T x_i + \lambda_0) \ge \gamma \|\lambda\|, \quad i = 1, \dots, m.$$
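A quick numerical check of the picture, with values assumed for illustration: compute the geometric margin $\gamma_i$ from the normalized score, then verify that $B = x_i - \gamma_i\,\lambda/\|\lambda\|$ really lands on the decision boundary.

```python
import numpy as np

lam = np.array([3.0, 4.0])    # assumed weights, ||lam|| = 5
lam0 = -5.0
x_i, y_i = np.array([3.0, 4.0]), +1

# Geometric margin: normalized functional margin y_i * f(x_i) / ||lam||.
gamma_i = y_i * (lam @ x_i + lam0) / np.linalg.norm(lam)

# Closest boundary point: walk gamma_i units back along the unit normal.
B = x_i - gamma_i * lam / np.linalg.norm(lam)

print(gamma_i)          # 4.0
print(lam @ B + lam0)   # 0.0: B lies on the decision boundary
```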

For any $\lambda$ and $\lambda_0$ that satisfy this, any positively scaled multiple satisfies them too, so we can arbitrarily set $\|\lambda\| = 1/\gamma$ so that the right side is 1. Now when we maximize $\gamma$, we're maximizing $\gamma = 1/\|\lambda\|$. So we have

$$\max_{\lambda, \lambda_0} \frac{1}{\|\lambda\|} \quad \text{s.t.} \quad y_i(\lambda^T x_i + \lambda_0) \ge 1, \quad i = 1, \dots, m.$$

Equivalently,

$$\min_{\lambda, \lambda_0} \frac{1}{2}\|\lambda\|^2 \quad \text{s.t.} \quad y_i(\lambda^T x_i + \lambda_0) \ge 1, \quad i = 1, \dots, m \qquad (1)$$

(the 1/2 and the square are just for convenience), which is the same as:

$$\min_{\lambda, \lambda_0} \frac{1}{2}\|\lambda\|^2 \quad \text{s.t.} \quad -y_i(\lambda^T x_i + \lambda_0) + 1 \le 0, \quad i = 1, \dots, m,$$

leading to the Lagrangian

$$L([\lambda, \lambda_0], \alpha) = \frac{1}{2} \sum_{j=1}^{n} \left(\lambda^{(j)}\right)^2 + \sum_{i=1}^{m} \alpha_i \left[ -y_i(\lambda^T x_i + \lambda_0) + 1 \right].$$

Writing the KKT conditions, starting with Lagrangian stationarity, where we need to find the gradient w.r.t. $\lambda$ and the derivative w.r.t. $\lambda_0$:

$$\nabla_\lambda L([\lambda, \lambda_0], \alpha) = \lambda - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \;\Rightarrow\; \lambda = \sum_{i=1}^{m} \alpha_i y_i x_i.$$

$$\frac{\partial}{\partial \lambda_0} L([\lambda, \lambda_0], \alpha) = -\sum_{i=1}^{m} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0.$$

$$\alpha_i \ge 0 \;\; \forall i \qquad \text{(dual feasibility)}$$

$$\alpha_i \left[ -y_i(\lambda^T x_i + \lambda_0) + 1 \right] = 0 \;\; \forall i \qquad \text{(complementary slackness)}$$

$$-y_i(\lambda^T x_i + \lambda_0) + 1 \le 0 \;\; \forall i \qquad \text{(primal feasibility)}$$

Using the KKT conditions, we can simplify the Lagrangian:

$$L([\lambda, \lambda_0], \alpha) = \frac{1}{2}\|\lambda\|^2 - \lambda^T \left( \sum_{i=1}^{m} \alpha_i y_i x_i \right) - \left( \sum_{i=1}^{m} \alpha_i y_i \right) \lambda_0 + \sum_{i=1}^{m} \alpha_i$$

(We just expanded terms. Now we'll plug in the first KKT condition.)

$$= \frac{1}{2}\|\lambda\|^2 - \|\lambda\|^2 - \lambda_0 \sum_{i=1}^{m} \alpha_i y_i + \sum_{i=1}^{m} \alpha_i$$

(Plug in the second KKT condition.)

$$= -\frac{1}{2} \sum_{j=1}^{n} \left(\lambda^{(j)}\right)^2 + 0 + \sum_{i=1}^{m} \alpha_i. \qquad (2)$$

Again using the first KKT condition, we can rewrite the first term:

$$\sum_{j=1}^{n} \left(\lambda^{(j)}\right)^2 = \sum_{j=1}^{n} \left( \sum_{i=1}^{m} \alpha_i y_i x_i^{(j)} \right)^2 = \sum_{j=1}^{n} \sum_{i=1}^{m} \sum_{k=1}^{m} \alpha_i \alpha_k y_i y_k x_i^{(j)} x_k^{(j)} = \sum_{i=1}^{m} \sum_{k=1}^{m} \alpha_i \alpha_k y_i y_k x_i^T x_k.$$

Plugging back into the Lagrangian (2), which now only depends on $\alpha$, and putting in the second and third KKT conditions gives us the dual problem:

$$\max_\alpha L^*(\alpha), \quad \text{where} \quad L^*(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^T x_k \quad \text{s.t.} \quad \begin{cases} \alpha_i \ge 0, & i = 1, \dots, m \\ \sum_{i=1}^{m} \alpha_i y_i = 0 \end{cases} \qquad (3)$$

We'll use the last two KKT conditions in what follows, for instance to get conditions on $\lambda_0^*$, but what we've already done is enough to define the dual problem for $\alpha$.

We can solve this dual problem. Either (i) we'd use a generic quadratic programming solver, or (ii) use another algorithm, like SMO, which I will discuss

later. For now, assume we solved it, so we have $\alpha_1^*, \dots, \alpha_m^*$. We can use the solution of the dual problem to get the solution of the primal problem. We can plug $\alpha^*$ into the first KKT condition to get

$$\lambda^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i. \qquad (4)$$

We still need to get $\lambda_0^*$, but we can see something cool in the process.

Support Vectors

Look at the complementary slackness KKT condition and the primal and dual feasibility conditions:

$$\alpha_i^* \left[ -y_i(\lambda^{*T} x_i + \lambda_0^*) + 1 \right] = 0 \;\Rightarrow\; \begin{cases} \alpha_i^* > 0 \;\Rightarrow\; y_i(\lambda^{*T} x_i + \lambda_0^*) = 1 \\ \alpha_i^* < 0 \quad \text{(Can't happen)} \\ -y_i(\lambda^{*T} x_i + \lambda_0^*) + 1 < 0 \;\Rightarrow\; \alpha_i^* = 0 \\ -y_i(\lambda^{*T} x_i + \lambda_0^*) + 1 > 0 \quad \text{(Can't happen)} \end{cases}$$

Define the optimal (scaled) scoring function $f^*(x_i) = \lambda^{*T} x_i + \lambda_0^*$; then

$$\begin{cases} \alpha_i^* > 0 \;\Rightarrow\; y_i f^*(x_i) = 1 & \text{(scaled margin is 1)} \\ 1 < y_i f^*(x_i) \;\Rightarrow\; \alpha_i^* = 0 \end{cases}$$

The examples in the first category, for which the scaled margin is 1 and the constraints are active, are called support vectors. They are the closest to the decision boundary.
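Because $\alpha_i^* = 0$ everywhere except on the support vectors, $f^*$ can be evaluated in the dual form using the support vectors alone. A small sketch, with $\alpha^*$, labels, and points assumed for illustration (they are chosen to be mutually consistent: the two points with $\alpha_i^* > 0$ sit exactly on the margin):

```python
import numpy as np

# Assumed toy solution: only the first and third examples are support vectors.
X = np.array([[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-4.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])   # alpha_i = 0 off the support vectors
lam0 = 0.0

sv = alpha > 0   # boolean mask of support vectors

def predict(x):
    # f(x) = sum_{i in SV} alpha_i y_i x_i^T x + lam0: the sum runs only over
    # support vectors, which is why prediction is cheap when alpha is sparse.
    return np.sign((alpha[sv] * y[sv]) @ (X[sv] @ x) + lam0)

print(predict(np.array([2.0, 2.0])))     # 1.0
print(predict(np.array([-1.0, -1.0])))   # -1.0
```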

[Figure: support vectors lying on the margin hyperplanes on either side of the decision boundary. Image by MIT OpenCourseWare.]

Finish What We Were Doing Earlier

To get $\lambda_0^*$, use the complementarity condition for any of the support vectors (in other words, use the fact that the unnormalized margin of the support vectors is one):

$$1 = y_i(\lambda^{*T} x_i + \lambda_0^*).$$

If you take a positive support vector, $y_i = 1$, then

$$\lambda_0^* = 1 - \lambda^{*T} x_i.$$

Written another way, since the support vectors have the smallest margins,

$$\lambda_0^* = 1 - \min_{i : y_i = 1} \lambda^{*T} x_i.$$

So that's the solution! Just to recap: to get the scoring function $f^*$ for SVM, you'd compute $\alpha^*$ from the dual problem (3), plug it into (4) to get $\lambda^*$, plug that into the equation above to get $\lambda_0^*$, and that's the solution to the primal problem, and the coefficients for $f^*$.
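The recap above can be sketched end to end. This is a minimal illustration rather than the course's own code: it uses scipy's general-purpose SLSQP routine as the "generic quadratic programming solver" for (3), on a tiny made-up separable dataset, then recovers $\lambda^*$ via (4) and $\lambda_0^*$ from the support vectors.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny made-up linearly separable dataset: two positives, two negatives.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# Q[i,k] = y_i y_k x_i^T x_k, so L*(alpha) = sum(alpha) - 0.5 alpha^T Q alpha.
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(a):
    # Negative of the dual objective (3); scipy minimizes, so we negate.
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=[(0.0, None)] * m,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum alpha_i y_i = 0
alpha = res.x

lam = (alpha * y) @ X                  # equation (4): lambda* = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                      # support vectors have alpha_i > 0
lam0 = np.mean(y[sv] - X[sv] @ lam)    # from y_i (lam^T x_i + lam0) = 1 on the SVs

print(np.sign(X @ lam + lam0))         # recovers the training labels
```

Averaging $y_i - \lambda^{*T} x_i$ over the support vectors is numerically friendlier than using a single one, but it computes the same $\lambda_0^*$ at the exact optimum.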

Because of the form of the solution

$$\lambda^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i,$$

it is possible that $\lambda^*$ is very fast to calculate. Why is that? Think support vectors.

The Nonseparable Case

If there is no separating hyperplane, there is no feasible solution to the problem we wrote above. Most real problems are nonseparable. Let's fix our SVM so it can accommodate the nonseparable case. The new formulation will penalize mistakes the farther they are from the decision boundary. So we are allowed to make mistakes now, but we pay a price.

Let's change our primal problem (1) to this new primal problem:

$$\min_{\lambda, \lambda_0, \xi} \frac{1}{2}\|\lambda\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad \begin{cases} y_i(\lambda^T x_i + \lambda_0) \ge 1 - \xi_i & \forall i \\ \xi_i \ge 0 & \forall i \end{cases} \qquad (5)$$

So the constraints allow some slack of size $\xi_i$, but we pay a price for it in the objective. That is, if $y_i f(x_i) \ge 1$ then $\xi_i$ gets set to 0, and the penalty is 0. Otherwise, if $y_i f(x_i) = 1 - \xi_i$, we pay price $\xi_i$.

The parameter $C$ trades off between the twin goals of making $\|\lambda\|^2$ small (making what-was-the-minimum-margin $1/\|\lambda\|$ large) and ensuring that most examples have margin at least $1/\|\lambda\|$.

Going on a Little Tangent

Rewrite the penalty another way: if $y_i f(x_i) \ge 1$, zero penalty; else, pay price $\xi_i = 1 - y_i f(x_i)$. Third time's the charm: pay price

$$\xi_i = \left[ 1 - y_i f(x_i) \right]_+,$$

where the notation $[z]_+$ means take the maximum of $z$ and 0. Equation (5) becomes:

$$\min_{\lambda, \lambda_0} \frac{1}{2}\|\lambda\|^2 + C \sum_{i=1}^{m} \left[ 1 - y_i f(x_i) \right]_+$$

Does that look familiar?
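To see the equivalence concretely, here is a small check (weights and points are made up for illustration) that the optimal slacks in (5) are exactly the hinge losses $[1 - y_i f(x_i)]_+$, and that the rewritten objective matches.

```python
import numpy as np

# Assumed classifier and data: one confident example, one inside the margin,
# one misclassified.
lam, lam0, C = np.array([1.0, -1.0]), 0.0, 10.0
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, 1.0])

scores = X @ lam + lam0                      # f(x_i)
xi = np.maximum(0.0, 1.0 - y * scores)       # optimal slacks = hinge losses
objective = 0.5 * lam @ lam + C * xi.sum()   # regularized hinge-loss objective

print(xi)          # [0.  0.5 2. ]
print(objective)   # 26.0
```

The confident example (margin 2) pays nothing, the one inside the margin pays its shortfall 0.5, and the misclassified one pays 2.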

The Dual for the Nonseparable Case

The Lagrangian of (5) is:

$$L(\lambda, \lambda_0, \xi, \alpha, r) = \frac{1}{2}\|\lambda\|^2 + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y_i(\lambda^T x_i + \lambda_0) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i,$$

where the $\alpha_i$'s and $r_i$'s are Lagrange multipliers (constrained to be $\ge 0$). The dual turns out to be (after some work)

$$\max_\alpha \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,k=1}^{m} \alpha_i \alpha_k y_i y_k x_i^T x_k \quad \text{s.t.} \quad \begin{cases} 0 \le \alpha_i \le C, & i = 1, \dots, m \\ \sum_{i=1}^{m} \alpha_i y_i = 0 \end{cases} \qquad (6)$$

So the only difference from the original problem's dual (3) is that $0 \le \alpha_i$ was changed to $0 \le \alpha_i \le C$. Neat!

Solving the dual problem with SMO

SMO (Sequential Minimal Optimization) is a type of coordinate ascent algorithm, but adapted to SVM so that the solution always stays within the feasible region. Start with (6). Let's say you want to hold $\alpha_2, \dots, \alpha_m$ fixed and take a coordinate step in the first direction. That is, change $\alpha_1$ to maximize the objective in (6). Can we make any progress? Can we get a better feasible solution by doing this?

Turns out, no. Look at the constraint in (6), $\sum_{i=1}^{m} \alpha_i y_i = 0$. This means:

$$\alpha_1 y_1 = -\sum_{i=2}^{m} \alpha_i y_i, \quad \text{or, multiplying by } y_1, \quad \alpha_1 = -y_1 \sum_{i=2}^{m} \alpha_i y_i.$$

So, since $\alpha_2, \dots, \alpha_m$ are fixed, $\alpha_1$ is also fixed.

So, if we want to update any of the $\alpha_i$'s, we need to update at least 2 of them simultaneously to keep the solution feasible (i.e., to keep the constraints satisfied).

Start with a feasible vector $\alpha$. Let's update $\alpha_1$ and $\alpha_2$, holding $\alpha_3, \dots, \alpha_m$ fixed. What values of $\alpha_1$ and $\alpha_2$ are we allowed to choose? Again, the constraint is:

$$\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^{m} \alpha_i y_i =: \zeta \quad \text{(fixed constant)}.$$

We are only allowed to choose $\alpha_1, \alpha_2$ on the line, so when we pick $\alpha_2$, we get $\alpha_1$ automatically, from

$$\alpha_1 = \frac{1}{y_1}(\zeta - \alpha_2 y_2) = y_1(\zeta - \alpha_2 y_2)$$

($1/y_1 = y_1$ since $y_1 \in \{+1, -1\}$). Also, the other constraints in (6) say $0 \le \alpha_1, \alpha_2 \le C$. So, $\alpha_2$ needs to be within $[L, H]$ on the figure (in order for $\alpha_1$ to stay within $[0, C]$), where we will always have $0 \le L, H \le C$. To do the coordinate ascent step, we will optimize the objective over $\alpha_2$, keeping it within $[L, H]$. Intuitively, (6) becomes:

$$\max_{\alpha_2 \in [L, H]} \; \alpha_1 + \alpha_2 - \frac{1}{2} \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^T x_k + \text{constants}, \quad \text{where } \alpha_1 = y_1(\zeta - \alpha_2 y_2). \qquad (7)$$

The objective is quadratic in $\alpha_2$. This means we can just set its derivative to 0 to optimize it and get $\alpha_2$ for the next iteration of SMO. If the optimal value is
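The feasibility bookkeeping of one pairwise update can be sketched as follows. The values are assumed for illustration; the $[L, H]$ endpoints are the standard box for $\alpha_2$ that keeps $\alpha_1 = y_1(\zeta - \alpha_2 y_2)$ inside $[0, C]$, and the unclipped optimum of the quadratic (7) is just pretended here.

```python
import numpy as np

C = 1.0
y = np.array([1.0, -1.0, 1.0, -1.0])       # labels (assumed)
alpha = np.array([0.3, 0.2, 0.6, 0.7])     # current feasible alpha (assumed)

# alpha_1 y_1 + alpha_2 y_2 must stay equal to zeta while alpha_3.. are fixed.
zeta = -(alpha[2:] @ y[2:])

# Box [L, H] for alpha_2 so that alpha_1 = y_1(zeta - alpha_2 y_2) stays in [0, C].
if y[0] != y[1]:      # alpha_1 - alpha_2 is constant along the feasible line
    L = max(0.0, alpha[1] - alpha[0])
    H = min(C, C + alpha[1] - alpha[0])
else:                 # alpha_1 + alpha_2 is constant along the feasible line
    L = max(0.0, alpha[0] + alpha[1] - C)
    H = min(C, alpha[0] + alpha[1])

def clip(a2_unclipped):
    # If the quadratic's optimum falls outside [L, H], take the nearer endpoint.
    return min(H, max(L, a2_unclipped))

a2_new = clip(1.4)                         # pretend (7)'s unclipped optimum was 1.4
a1_new = y[0] * (zeta - a2_new * y[1])     # alpha_1 follows automatically

print(L, H, a2_new, a1_new)                # 0.0 0.9 0.9 1.0
```

After the step, the equality constraint $\sum_i \alpha_i y_i = 0$ and the box $[0, C]$ both still hold, which is exactly what SMO guarantees at every iteration.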

outside of $[L, H]$, just choose $\alpha_2$ to be either $L$ or $H$ for the next iteration. For instance, if this is a plot of (7)'s objective (sometimes it doesn't look like this, sometimes it's upside-down), then we'll choose the endpoint of $[L, H]$ nearer the unconstrained maximizer:

[Figure: a downward-opening quadratic in $\alpha_2$ whose unconstrained maximizer lies outside $[L, H]$, so the constrained maximizer is an endpoint.]

Note: there are heuristics to choose the order of the $\alpha_i$'s chosen to update.

MIT OpenCourseWare http://ocw.mit.edu

15.097 Prediction: Machine Learning and Statistics, Spring 2012

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.