Linear Support Vector Machines

David S. Rosenberg

The Support Vector Machine

For a linear support vector machine (SVM), we use the hypothesis space of affine functions
\[
\mathcal{F} = \left\{ f(x) = w^T x + b \;\middle|\; w \in \mathbf{R}^d,\ b \in \mathbf{R} \right\}
\]
and evaluate them with respect to the SVM loss function, also known as the hinge loss. The hinge loss is a margin loss defined as $\ell(m) = (1 - m)_+$, where $m = y f(x)$ is the margin of the prediction function $f$ on the example $(x, y)$, and $(x)_+ = x \, \mathbf{1}(x \ge 0)$ denotes the positive part of $x$. The SVM traditionally uses an $\ell_2$ regularization term, and the objective function is written as
\[
J(w, b) = \frac{1}{2} \|w\|^2 + \frac{c}{n} \sum_{i=1}^n \left( 1 - y_i \left[ w^T x_i + b \right] \right)_+ .
\]
Note that the $w$ parameter is regularized, while the bias term $b$ is not. An alternative approach (which saves some writing) is to drop $b$ and add a constant feature, say with the value $1$, to the representation of $x$. With this approach, the bias term is regularized along with the rest of the parameters. Rather than the typical $\lambda$ regularization parameter attached to the $\ell_2$ penalty, for SVMs it is traditional to have a $c$ parameter attached to the empirical risk component. The larger $c$ is, the more relative importance we attach to minimizing the empirical risk compared to finding a simple hypothesis with small $\ell_2$-norm.
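To make the objective concrete, here is a minimal NumPy sketch of the hinge loss and $J(w, b)$. The function name and array-based calling convention are my own choices for illustration, not part of the original notes.

```python
import numpy as np

def svm_objective(w, b, X, y, c):
    """J(w, b) = 0.5 * ||w||^2 + (c / n) * sum_i (1 - y_i (w^T x_i + b))_+ .
    X is an (n, d) array of inputs; y is an (n,) array with entries in {-1, +1}."""
    n = X.shape[0]
    margins = y * (X @ w + b)                 # m_i = y_i f(x_i)
    hinge = np.maximum(0.0, 1.0 - margins)    # hinge loss (1 - m_i)_+
    return 0.5 * np.dot(w, w) + (c / n) * hinge.sum()
```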

2 Formulating SVM as a QP

The SVM optimization problem is
\[
\min_{w \in \mathbf{R}^d,\, b \in \mathbf{R}} \; \frac{1}{2}\|w\|^2 + \frac{c}{n} \sum_{i=1}^n \left( 1 - y_i \left[ w^T x_i + b \right] \right)_+ . \qquad (2.1)
\]
This is an unconstrained optimization problem (which is nice), but the objective function is not differentiable, which makes it difficult to work with. We can formulate an equivalent problem with a differentiable objective, but we'll have to add new constraints to do so. Note that (2.1) is equivalent to
\[
\begin{array}{ll}
\text{minimize} & \frac{1}{2}\|w\|^2 + \frac{c}{n} \sum_{i=1}^n \xi_i \\[4pt]
\text{subject to} & \xi_i \ge \left( 1 - y_i \left[ w^T x_i + b \right] \right)_+ \quad \text{for } i = 1, \ldots, n,
\end{array}
\]
since the minimization will always drive down $\xi_i$ until $\xi_i = \left( 1 - y_i \left[ w^T x_i + b \right] \right)_+$. We can now break up the inequality into two parts:
\[
\begin{array}{ll}
\text{minimize} & \frac{1}{2}\|w\|^2 + \frac{c}{n} \sum_{i=1}^n \xi_i \\[4pt]
\text{subject to} & \xi_i \ge 0 \quad \text{for } i = 1, \ldots, n \\[4pt]
& \xi_i \ge 1 - y_i \left[ w^T x_i + b \right] \quad \text{for } i = 1, \ldots, n.
\end{array}
\]
We now have a differentiable objective function in $d + n + 1$ variables with $2n$ affine constraints. This is a quadratic program that can be solved by any off-the-shelf QP solver.
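As one concrete way to hand this QP to an off-the-shelf solver, here is a sketch using cvxpy as the modeling layer. The function name and interface are my own choices, not from the notes.

```python
import cvxpy as cp
import numpy as np

def solve_primal_svm_qp(X, y, c):
    """Solve the slack-variable QP formulation of the linear SVM.
    X: (n, d) array of inputs; y: (n,) array of +/-1 labels; c: trade-off parameter."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + (c / n) * cp.sum(xi))
    constraints = [xi >= 0,                                # xi_i >= 0
                   xi >= 1 - cp.multiply(y, X @ w + b)]    # xi_i >= 1 - y_i (w^T x_i + b)
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```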

3 Compute the Lagrangian Dual

The Lagrangian for this formulation is
\[
\begin{aligned}
L(w, b, \xi, \alpha, \lambda)
&= \frac{1}{2}\|w\|^2 + \frac{c}{n} \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \left( 1 - y_i \left[ w^T x_i + b \right] - \xi_i \right) - \sum_{i=1}^n \lambda_i \xi_i \\
&= \frac{1}{2} w^T w + \sum_{i=1}^n \xi_i \left( \frac{c}{n} - \alpha_i - \lambda_i \right) + \sum_{i=1}^n \alpha_i \left( 1 - y_i \left[ w^T x_i + b \right] \right).
\end{aligned}
\]
From our study of Lagrangian duality, we know that the original problem can now be expressed as
\[
\inf_{w, b, \xi} \; \sup_{\alpha, \lambda \succeq 0} \; L(w, b, \xi, \alpha, \lambda).
\]
Since our constraints are affine, by Slater's condition we have strong duality so long as the problem is feasible (i.e. so long as there is at least one point in the feasible set). The constraints are satisfied by $w = 0$, $b = 0$, and $\xi_i = 1$ for $i = 1, \ldots, n$, so we have strong duality. Thus we get the same result if we solve the following dual problem:
\[
\sup_{\alpha, \lambda \succeq 0} \; \inf_{w, b, \xi} \; L(w, b, \xi, \alpha, \lambda).
\]
As usual, we capture the inner optimization in the Lagrange dual objective:
\[
g(\alpha, \lambda) = \inf_{w, b, \xi} L(w, b, \xi, \alpha, \lambda).
\]
Note that if $\frac{c}{n} - \alpha_i - \lambda_i \neq 0$, then the Lagrangian is unbounded below (by taking $\xi_i \to \pm\infty$) and thus the infimum is $-\infty$. For any given $(\alpha, \lambda)$, the function $(w, b, \xi) \mapsto L(w, b, \xi, \alpha, \lambda)$ is convex and differentiable, thus we have an optimal point if and only if all partial derivatives of $L$ with respect to $w$, $b$, and $\xi$ are $0$:
\[
\begin{aligned}
\partial_w L = 0 &\iff w - \sum_{i=1}^n \alpha_i y_i x_i = 0 \iff w = \sum_{i=1}^n \alpha_i y_i x_i & (3.1)\\
\partial_b L = 0 &\iff -\sum_{i=1}^n \alpha_i y_i = 0 \iff \sum_{i=1}^n \alpha_i y_i = 0 &\\
\partial_{\xi_i} L = 0 &\iff \frac{c}{n} - \alpha_i - \lambda_i = 0 \iff \alpha_i + \lambda_i = \frac{c}{n} & (3.2)
\end{aligned}
\]
Note that one of the conditions is $\alpha_i + \lambda_i = \frac{c}{n}$, which agrees with our previous observation that if $\alpha_i + \lambda_i \neq \frac{c}{n}$ then $L$ is unbounded below. Substituting these conditions back into $L$, the second term disappears, while the first and third terms become
\[
\frac{1}{2} w^T w = \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j
\]
\[
\sum_{i=1}^n \alpha_i \left( 1 - y_i \left[ w^T x_i + b \right] \right) = \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_j^T x_i - b \underbrace{\sum_{i=1}^n \alpha_i y_i}_{=0}.
\]
Putting it together, the dual function is
\[
g(\alpha, \lambda) =
\begin{cases}
\displaystyle \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_j^T x_i & \text{if } \displaystyle\sum_{i=1}^n \alpha_i y_i = 0 \text{ and } \alpha_i + \lambda_i = \frac{c}{n} \text{ for all } i, \\
-\infty & \text{otherwise.}
\end{cases}
\]
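If your QP solver exposes the multipliers for the margin constraints (many do, e.g. as a dual_value attribute on constraints in cvxpy, though sign and scaling conventions vary by solver), the stationarity conditions above give a cheap numerical sanity check. This helper is a hypothetical sketch, not part of the notes.

```python
import numpy as np

def check_stationarity(w, X, y, alpha, tol=1e-6):
    """Numerically verify the first-order conditions from Section 3:
    (3.1) w = sum_i alpha_i y_i x_i, and the b-condition sum_i alpha_i y_i = 0."""
    w_from_alpha = (alpha * y) @ X          # sum_i alpha_i y_i x_i
    cond_w = np.allclose(w, w_from_alpha, atol=tol)
    cond_b = abs(np.dot(alpha, y)) < tol    # sum_i alpha_i y_i = 0
    return cond_w and cond_b
```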

Thus we can write the dual problem as
\[
\begin{array}{ll}
\displaystyle \sup_{\alpha, \lambda} & \displaystyle \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_j^T x_i \\[8pt]
\text{s.t.} & \displaystyle \sum_{i=1}^n \alpha_i y_i = 0 \\[4pt]
& \alpha_i + \lambda_i = \frac{c}{n}, \quad i = 1, \ldots, n \\[4pt]
& \alpha_i, \lambda_i \ge 0, \quad i = 1, \ldots, n.
\end{array}
\]
We can actually eliminate the $\lambda$ variables, replacing the last three constraints by $\alpha_i \in \left[ 0, \frac{c}{n} \right]$:
\[
\begin{array}{ll}
\displaystyle \sup_{\alpha} & \displaystyle \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_j^T x_i \\[8pt]
\text{s.t.} & \displaystyle \sum_{i=1}^n \alpha_i y_i = 0 \\[4pt]
& \alpha_i \in \left[ 0, \tfrac{c}{n} \right], \quad i = 1, \ldots, n.
\end{array}
\]
When written in standard form, this has a quadratic objective in $n$ unknowns and $2n + 1$ constraints. Note that the inequality constraints have a particularly simple form: they are called box constraints. If $\alpha^*$ is a solution to the dual problem, then by strong duality and (3.1), the optimal solution to the primal problem is given by
\[
w^* = \sum_{i=1}^n \alpha_i^* y_i x_i.
\]
Note that $w^* = \sum_{i=1}^n \alpha_i^* y_i x_i$ only depends on those examples for which $\alpha_i^* > 0$ (recall that $\alpha_i^* \ge 0$ by constraint). These examples are called support vectors.
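The dual with box constraints is also easy to hand to a generic solver. The sketch below again uses cvxpy (names and tolerances are mine). It writes the quadratic term as $\|\sum_i \alpha_i y_i x_i\|^2$, which equals $\sum_{i,j} \alpha_i \alpha_j y_i y_j x_j^T x_i$, so the solver can directly recognize the problem as convex.

```python
import cvxpy as cp
import numpy as np

def solve_dual_svm_qp(X, y, c):
    """Solve the dual SVM QP with box constraints 0 <= alpha_i <= c/n
    and the equality constraint sum_i alpha_i y_i = 0."""
    n = X.shape[0]
    M = y[:, None] * X                         # row i is y_i x_i
    alpha = cp.Variable(n)
    # alpha^T Q alpha with Q_ij = y_i y_j x_j^T x_i equals ||M^T alpha||^2
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(M.T @ alpha))
    constraints = [alpha >= 0, alpha <= c / n, y @ alpha == 0]
    cp.Problem(objective, constraints).solve()
    w = (alpha.value * y) @ X                  # w* = sum_i alpha_i* y_i x_i
    support = np.where(alpha.value > 1e-8)[0]  # support vectors: alpha_i* > 0
    return alpha.value, w, support
```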

Since $\alpha_i^* \in \left[ 0, \frac{c}{n} \right]$, we see that $c$ controls the amount of weight we can put on any single example. Note that we still don't have an expression for the optimal bias term $b^*$. We'll derive this below using the complementary slackness conditions.

4 Consequences of Complementary Slackness

Let $(w^*, b^*, \xi^*)$ and $(\alpha^*, \lambda^*)$ be optimal solutions to the primal and dual problems, respectively. For notational convenience, let's define $f^*(x) = x^T w^* + b^*$. By strong duality, we have the following complementary slackness conditions:
\[
\begin{aligned}
\alpha_i^* \left( 1 - y_i f^*(x_i) - \xi_i^* \right) &= 0 & (4.1) \\
\lambda_i^* \xi_i^* = \left( \tfrac{c}{n} - \alpha_i^* \right) \xi_i^* &= 0 & (4.2)
\end{aligned}
\]
We now draw many straightforward conclusions:

- As we noted above, $\xi_i^*$ is the hinge loss on example $i$. When $\xi_i^* = 0$, we're either at the margin (i.e. $y_i f^*(x_i) = 1$) or on the good side of the margin ($y_i f^*(x_i) > 1$). That is,
\[
\xi_i^* = 0 \implies y_i f^*(x_i) \ge 1. \qquad (4.3)
\]
- By (4.2), $\alpha_i^* = 0$ implies $\xi_i^* = 0$, which by (4.3) implies $y_i f^*(x_i) \ge 1$.
- $\alpha_i^* \in \left( 0, \frac{c}{n} \right)$ implies $\xi_i^* = 0$, by (4.2). Then by (4.1) we get $y_i f^*(x_i) = 1$. So the prediction is right on the margin.
- If $y_i f^*(x_i) < 1$ then the margin loss is $\xi_i^* > 0$, and (4.2) implies that $\alpha_i^* = \frac{c}{n}$.
- If $y_i f^*(x_i) > 1$ then the margin loss is $\xi_i^* = 0$, and (4.1) implies $\alpha_i^* = 0$.
- The contrapositive of the previous result is that $\alpha_i^* > 0$ implies $y_i f^*(x_i) \le 1$.
- This seems to be all we can say for the specific case $\alpha_i^* = \frac{c}{n}$. We also can't draw any extra information about $\alpha_i^*$ for points exactly on the margin ($y_i f^*(x_i) = 1$).
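These implications suggest a simple diagnostic: bucket the training points by where $\alpha_i^*$ falls in $\left[0, \frac{c}{n}\right]$. The sketch below is hypothetical (function name and tolerance are mine), assuming a dual solution such as the one returned by the solver sketch above.

```python
import numpy as np

def bucket_by_alpha(alpha, c, n, tol=1e-8):
    """Group training indices by the complementary-slackness cases:
    alpha_i = 0 (on or outside the margin), alpha_i in (0, c/n) (exactly on
    the margin), alpha_i = c/n (on or inside the margin, possibly misclassified)."""
    zero = alpha <= tol
    at_bound = alpha >= c / n - tol
    on_margin = ~zero & ~at_bound
    return {"alpha = 0": np.where(zero)[0],
            "0 < alpha < c/n": np.where(on_margin)[0],
            "alpha = c/n": np.where(at_bound)[0]}
```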

We summarize these results below:
\[
\begin{aligned}
\alpha_i^* = 0 &\implies y_i f^*(x_i) \ge 1 \\
\alpha_i^* \in \left( 0, \tfrac{c}{n} \right) &\implies y_i f^*(x_i) = 1 \\
\alpha_i^* = \tfrac{c}{n} &\implies y_i f^*(x_i) \le 1
\end{aligned}
\qquad
\begin{aligned}
y_i f^*(x_i) < 1 &\implies \alpha_i^* = \tfrac{c}{n} \\
y_i f^*(x_i) = 1 &\implies \alpha_i^* \in \left[ 0, \tfrac{c}{n} \right] \\
y_i f^*(x_i) > 1 &\implies \alpha_i^* = 0
\end{aligned}
\]

4.1 Determining $b^*$

Finally, let's determine $b^*$. Suppose there exists an $i$ such that $\alpha_i^* \in \left( 0, \frac{c}{n} \right)$. Then $\xi_i^* = 0$ by (4.2), and by (4.1) we get $y_i \left[ x_i^T w^* + b^* \right] = 1$. Since $y_i \in \{-1, 1\}$, we can multiply both sides by $y_i$ (using $y_i^2 = 1$) and conclude that
\[
b^* = y_i - x_i^T w^*.
\]
With exact calculations, we would get the same $b^*$ for any choice of $i$ with $\alpha_i^* \in \left( 0, \frac{c}{n} \right)$. With numerical error, however, it will be more robust to average over all eligible $i$'s:
\[
b^* = \operatorname{mean}\left\{ y_i - x_i^T w^* \;\middle|\; \alpha_i^* \in \left( 0, \tfrac{c}{n} \right) \right\}.
\]
If there are no $\alpha_i^* \in \left( 0, \frac{c}{n} \right)$, then we have a degenerate SVM training problem, for which $w^* = 0$, and we always predict the majority class. This is shown in Rifkin et al.'s "A Note on Support Vector Machine Degeneracy", an MIT AI Lab Technical Report.
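Following the averaging formula above, here is a small sketch for recovering $b^*$ from a dual solution. The function name, the tolerance, and returning None in the degenerate case are my own choices for illustration.

```python
import numpy as np

def intercept_from_dual(alpha, X, y, w, c, tol=1e-8):
    """Average y_i - x_i^T w* over all i with alpha_i in (0, c/n),
    i.e. over the points that sit exactly on the margin."""
    n = X.shape[0]
    on_margin = (alpha > tol) & (alpha < c / n - tol)
    if not np.any(on_margin):
        # Degenerate case: no margin support vectors, w* = 0, and the SVM
        # predicts the majority class (see Rifkin et al.).
        return None
    return float(np.mean(y[on_margin] - X[on_margin] @ w))
```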