Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017


Classification and Margin
Consider a classification problem with two classes: instance set $X = \mathbb{R}^d$, label set $Y = \{-1, +1\}$. Training data: $S = ((x_1, y_1), \dots, (x_m, y_m))$. Hypothesis set $H$ = halfspaces. Assumption: the data is linearly separable, i.e., there exists a halfspace that perfectly classifies the training set. In general there are multiple separating hyperplanes: which one is the best choice?

Classification and Margin
[Figure in the slides: the same training set with several separating hyperplanes.] The last one seems the best choice, since it can tolerate more noise. Informally, for a given separating halfspace we define its margin as its minimum distance to an example in the training set $S$. Intuition: the best separating hyperplane is the one with the largest margin. How do we find it?

Linearly Separable Training Set
The training set $S = ((x_1, y_1), \dots, (x_m, y_m))$ is linearly separable if there exists a halfspace $(w, b)$ such that $y_i = \mathrm{sign}(\langle w, x_i \rangle + b)$ for all $i = 1, \dots, m$. Equivalently:
$$\forall i = 1, \dots, m : \quad y_i (\langle w, x_i \rangle + b) > 0$$
Informally: the margin of a separating hyperplane is its minimum distance to an example in the training set $S$.

Separating Hyperplane and Margin
Given the hyperplane defined by $L = \{v : \langle w, v \rangle + b = 0\}$ and given $x$, the distance of $x$ to $L$ is
$$d(x, L) = \min\{\|x - v\| : v \in L\}$$
Claim: if $\|w\| = 1$ then $d(x, L) = |\langle w, x \rangle + b|$ (Proof: Claim 15.1 [UML]).
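
As a sanity check of the claim (not part of the slides), here is a minimal NumPy sketch on a random hyperplane and point; the dimension, seed, and point are arbitrary choices of mine.
```python
import numpy as np

rng = np.random.default_rng(0)

# Random hyperplane L = {v : <w, v> + b = 0} with ||w|| = 1, and a random point x.
w = rng.normal(size=3)
w /= np.linalg.norm(w)            # normalize so that ||w|| = 1
b = rng.normal()
x = rng.normal(size=3)

# Distance from the claim: d(x, L) = |<w, x> + b|.
d_claim = abs(w @ x + b)

# Direct computation: the closest point of L to x is the orthogonal projection
# x - (<w, x> + b) w, so the distance is the norm of the difference.
v_closest = x - (w @ x + b) * w
d_direct = np.linalg.norm(x - v_closest)

print(d_claim, d_direct)          # the two values coincide (up to float precision)
```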

Margin and Support Vectors
The margin of a separating hyperplane is the distance of the closest training example to it. If $\|w\| = 1$ the margin is:
$$\min_{i \in \{1, \dots, m\}} |\langle w, x_i \rangle + b|$$
The closest examples are called support vectors.

Support Vector Machine (SVM)
Hard-SVM: seek the separating hyperplane with the largest margin (only for linearly separable data). Computational problem:
$$\arg\max_{(w,b) : \|w\| = 1} \; \min_{i \in \{1, \dots, m\}} |\langle w, x_i \rangle + b| \quad \text{subject to } \forall i : y_i (\langle w, x_i \rangle + b) > 0$$
Equivalent formulation (due to the separability assumption):
$$\arg\max_{(w,b) : \|w\| = 1} \; \min_{i \in \{1, \dots, m\}} y_i (\langle w, x_i \rangle + b)$$

Hard-SVM: Quadratic Programming Formulation
input: $(x_1, y_1), \dots, (x_m, y_m)$
solve:
$$(w_0, b_0) = \arg\min_{(w,b)} \|w\|^2 \quad \text{subject to } \forall i : y_i (\langle w, x_i \rangle + b) \geq 1$$
output: $\hat{w} = \frac{w_0}{\|w_0\|}$, $\hat{b} = \frac{b_0}{\|w_0\|}$
How do we get a solution? This is a quadratic optimization problem: the objective is a convex quadratic function and the constraints are linear inequalities. Quadratic Programming solvers!
Proposition: the output of the algorithm above is a solution to the equivalent formulation in the previous slide. Proof: on the board and in the book (Lemma 15.2 [UML]).
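
The slides leave the solver generic; as an illustration only, here is a small sketch that hands the program above to scipy.optimize.minimize (SLSQP) on a tiny separable toy set that I made up.
```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data (not from the lecture).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

# Decision variables packed as z = (w_1, ..., w_d, b).
def objective(z):
    w = z[:d]
    return w @ w                      # minimize ||w||^2

def constraints(z):
    w, b = z[:d], z[d]
    return y * (X @ w + b) - 1.0      # y_i (<w, x_i> + b) - 1 >= 0 for all i

res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraints}])

w0, b0 = res.x[:d], res.x[d]
w_hat, b_hat = w0 / np.linalg.norm(w0), b0 / np.linalg.norm(w0)
print("w_hat =", w_hat, "b_hat =", b_hat)
print("distances:", y * (X @ w_hat + b_hat))   # all at least the (maximum) margin
```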

Equivalent Formulation and Support Vectors
Equivalent formulation (homogeneous halfspaces): assume the first component of every $x \in X$ is 1; then
$$w_0 = \arg\min_{w} \|w\|^2 \quad \text{subject to } \forall i : y_i \langle w, x_i \rangle \geq 1$$
Support vectors = vectors at minimum distance from the hyperplane defined by $w_0$. The support vectors are the only ones that matter for defining $w_0$!
Proposition: let $w_0$ be as above and let $I = \{i : |\langle w_0, x_i \rangle| = 1\}$. Then there exist coefficients $\alpha_1, \dots, \alpha_m$ such that
$$w_0 = \sum_{i \in I} \alpha_i x_i$$
Support vectors = $\{x_i : i \in I\}$.
Note: solving Hard-SVM is equivalent to finding $\alpha_i$ for $i = 1, \dots, m$, with $\alpha_i \neq 0$ only for support vectors.

Soft-SVM
Hard-SVM works if the data is linearly separable. What if the data is not linearly separable? Soft-SVM. Idea: modify the constraints of Hard-SVM to allow for some violations, but take the violations into account in the objective function.

Soft-SVM Constraints
Hard-SVM constraints: $y_i (\langle w, x_i \rangle + b) \geq 1$.
Soft-SVM constraints: introduce slack variables $\xi_1, \dots, \xi_m \geq 0$ (vector $\xi$); for each $i = 1, \dots, m$:
$$y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i$$
$\xi_i$: how much the constraint $y_i (\langle w, x_i \rangle + b) \geq 1$ is violated.
Soft-SVM minimizes a combination of the norm of $w$ and the average of the $\xi_i$. The tradeoff between the two terms is controlled by a parameter $\lambda \in \mathbb{R}$, $\lambda > 0$.

Soft-SVM: Optimization Problem
input: $(x_1, y_1), \dots, (x_m, y_m)$, parameter $\lambda > 0$
solve:
$$\min_{w, b, \xi} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i \right) \quad \text{subject to } \forall i : y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i \text{ and } \xi_i \geq 0$$
output: $w, b$
Equivalent formulation: consider the hinge loss
$$\ell^{\mathrm{hinge}}((w, b), (x, y)) = \max\{0, 1 - y (\langle w, x \rangle + b)\}$$
Given $(w, b)$ and a training set $S$, the empirical risk $L_S^{\mathrm{hinge}}((w, b))$ is
$$L_S^{\mathrm{hinge}}((w, b)) = \frac{1}{m} \sum_{i=1}^{m} \ell^{\mathrm{hinge}}((w, b), (x_i, y_i))$$
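
For concreteness, a short NumPy sketch (mine, not from the slides) of the hinge loss and the corresponding empirical risk; the toy data at the end is hypothetical.
```python
import numpy as np

def hinge_loss(w, b, x, y):
    """Hinge loss of the halfspace (w, b) on a single example (x, y)."""
    return max(0.0, 1.0 - y * (np.dot(w, x) + b))

def empirical_hinge_risk(w, b, X, y):
    """Average hinge loss L_S^hinge((w, b)) over the training set S = (X, y)."""
    margins = y * (X @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - margins))

# Tiny usage example.
X = np.array([[1.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
print(empirical_hinge_risk(np.array([0.5, 0.5]), 0.0, X, y))   # 0.0: both margins >= 1
```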

Soft-SVM: Solve Soft-SVM as RLM
$$\min_{w, b, \xi} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i \right) \quad \text{subject to } \forall i : y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i \text{ and } \xi_i \geq 0$$
Equivalent formulation with the hinge loss:
$$\min_{w, b} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \ell^{\mathrm{hinge}}((w, b), (x_i, y_i)) \right)$$
that is
$$\min_{w, b} \left( \lambda \|w\|^2 + L_S^{\mathrm{hinge}}((w, b)) \right)$$
Note: $\lambda \|w\|^2$ is the $\ell_2$ regularization term; $L_S^{\mathrm{hinge}}((w, b))$ is the empirical risk for the hinge loss.

Soft-SVM: Solution
We need to solve:
$$\min_{w, b} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \ell^{\mathrm{hinge}}((w, b), (x_i, y_i)) \right)$$
How? Standard solvers for optimization problems, or Stochastic Gradient Descent.

Gradient Descent
General approach for minimizing a differentiable convex function $f(w)$. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a differentiable function.
Definition: the gradient $\nabla f(w)$ of $f$ at $w = (w_1, \dots, w_d)$ is
$$\nabla f(w) = \left( \frac{\partial f(w)}{\partial w_1}, \dots, \frac{\partial f(w)}{\partial w_d} \right)$$
Intuition: the gradient points in the direction of the greatest rate of increase of $f$ around $w$.

Let $\eta \in \mathbb{R}$, $\eta > 0$ be a parameter. Gradient Descent (GD) algorithm:
$w^{(0)} \leftarrow 0$;
for $t \leftarrow 0$ to $T - 1$ do: $w^{(t+1)} \leftarrow w^{(t)} - \eta \nabla f(w^{(t)})$;
return $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$;
Notes: the output vector could also be $w^{(T)}$ or $\arg\min_{t \in \{1, \dots, T\}} f(w^{(t)})$; returning the average $\bar{w}$ is useful for nondifferentiable functions (using subgradients instead of gradients...) and for stochastic gradient descent...
Guarantees: let $f$ be convex and $\rho$-Lipschitz continuous; let $w^* \in \arg\min_{w : \|w\| \leq B} f(w)$. Then for every $\epsilon > 0$, to find $\bar{w}$ such that $f(\bar{w}) - f(w^*) \leq \epsilon$ it is sufficient to run the GD algorithm (with a suitable choice of $\eta$) for $T \geq B^2 \rho^2 / \epsilon^2$ iterations.
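
A minimal Python sketch of the GD loop above (my example: a simple quadratic whose gradient is known in closed form, with arbitrary choices of $\eta$ and $T$).
```python
import numpy as np

def gradient_descent(grad_f, d, eta, T):
    """GD as on the slide: start at 0, repeat w <- w - eta * grad f(w) for T steps,
    and return the average of the iterates w^(1), ..., w^(T)."""
    w = np.zeros(d)
    iterates = []
    for _ in range(T):
        w = w - eta * grad_f(w)
        iterates.append(w)
    return np.mean(iterates, axis=0)

# Example: f(w) = ||w - c||^2 has gradient 2 (w - c) and minimizer c.
c = np.array([1.0, -2.0, 0.5])
w_bar = gradient_descent(lambda w: 2.0 * (w - c), d=3, eta=0.1, T=1000)
print(w_bar)   # close to c
```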

Stochastic Gradient Descent (SGD)
Idea: instead of using exactly the gradient, we take a (random) vector whose expected value equals the gradient.
SGD algorithm:
$w^{(0)} \leftarrow 0$;
for $t \leftarrow 1$ to $T$ do: choose $v_t$ at random from a distribution such that $\mathbb{E}[v_t \mid w^{(t)}] = \nabla f(w^{(t)})$; $w^{(t+1)} \leftarrow w^{(t)} - \eta v_t$;
return $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$;
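
A Python sketch of this loop (again mine, not the lecture's): the gradient estimate is supplied as a function, and the example perturbs a known gradient with zero-mean noise so that the unbiasedness condition holds by construction.
```python
import numpy as np

def sgd(grad_estimate, d, eta, T, rng):
    """SGD as on the slide: each step uses a random vector v_t whose
    conditional expectation equals the gradient at the current point."""
    w = np.zeros(d)
    iterates = []
    for _ in range(T):
        v = grad_estimate(w, rng)     # E[v | w] = grad f(w)
        w = w - eta * v
        iterates.append(w)
    return np.mean(iterates, axis=0)

# Example: f(w) = ||w - c||^2 with gradient 2 (w - c); the estimate adds
# zero-mean Gaussian noise, so its expectation is still the true gradient.
c = np.array([1.0, -2.0])
rng = np.random.default_rng(0)
w_bar = sgd(lambda w, rng: 2.0 * (w - c) + rng.normal(size=2),
            d=2, eta=0.05, T=5000, rng=rng)
print(w_bar)   # close to c
```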

Guarantees: let $f$ be convex and $\rho$-Lipschitz continuous; let $w^* \in \arg\min_{w : \|w\| \leq B} f(w)$. Then for every $\epsilon > 0$, to find $\bar{w}$ such that $\mathbb{E}[f(\bar{w})] - f(w^*) \leq \epsilon$ it is sufficient to run the SGD algorithm for $T \geq B^2 \rho^2 / \epsilon^2$ iterations.

Why should we use SGD instead of GD?
Question: when do we use GD in the first place? Answer: for example, to find $w$ that minimizes $L_S(w)$, that is, $f(w) = L_S(w)$. Then $\nabla f(w)$ depends on all pairs $(x_i, y_i) \in S$, $i = 1, \dots, m$: it may require a long time to compute!
What about SGD? We need to pick $v_t$ such that $\mathbb{E}[v_t \mid w^{(t)}] = \nabla f(w^{(t)})$: how? Pick a random $(x_i, y_i)$ and take $v_t$ to be a (sub)gradient of $\ell(w^{(t)}, (x_i, y_i))$: since $i$ is uniform over $\{1, \dots, m\}$, the expectation of $v_t$ is exactly the (sub)gradient of $L_S(w^{(t)})$, so the requirement is satisfied, and $v_t$ requires much less computation than the full gradient used by GD. Analogously we can use SGD for regularized losses, etc.
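
To make the unbiasedness concrete, a small NumPy check (my own example with random data): averaging many single-example hinge subgradients, with the example drawn uniformly at random, approaches the subgradient of the full empirical risk.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                     # hypothetical data
y = np.where(rng.random(50) < 0.5, -1.0, 1.0)
w = rng.normal(size=3)

def hinge_subgrad(w, x, yi):
    """A subgradient of max{0, 1 - y <w, x>} with respect to w."""
    return -yi * x if yi * np.dot(w, x) < 1 else np.zeros_like(w)

# Subgradient of L_S(w): average of the per-example subgradients.
full = np.mean([hinge_subgrad(w, X[i], y[i]) for i in range(len(y))], axis=0)

# Monte Carlo estimate of E[v_t | w] with i drawn uniformly at random.
samples = [hinge_subgrad(w, X[i], y[i]) for i in rng.integers(len(y), size=20000)]
print(full)
print(np.mean(samples, axis=0))                  # close to the line above
```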

SGD for Solving Soft-SVM
We want to solve
$$\min_{w} \left( \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max\{0, 1 - y_i \langle w, x_i \rangle\} \right)$$
Note: it is standard to add a $\frac{1}{2}$ in the regularization term to simplify some computations.
Algorithm:
$\theta^{(1)} \leftarrow 0$;
for $t \leftarrow 1$ to $T$ do: let $w^{(t)} \leftarrow \frac{1}{\lambda t} \theta^{(t)}$; choose $i$ uniformly at random from $\{1, \dots, m\}$; if $y_i \langle w^{(t)}, x_i \rangle < 1$ then $\theta^{(t+1)} \leftarrow \theta^{(t)} + y_i x_i$, else $\theta^{(t+1)} \leftarrow \theta^{(t)}$;
return $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$;
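
A direct Python transcription of this algorithm, as a sketch; the toy data, $\lambda$, and $T$ at the bottom are made up for illustration (the first coordinate is fixed to 1 to encode the bias, as in the homogeneous formulation).
```python
import numpy as np

def sgd_soft_svm(X, y, lam, T, rng=None):
    """SGD for Soft-SVM (homogeneous halfspaces), following the slide:
    keep theta, set w^(t) = theta / (lambda * t), and add y_i x_i to theta
    whenever the sampled example violates the margin."""
    rng = rng if rng is not None else np.random.default_rng()
    m, d = X.shape
    theta = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        w = theta / (lam * t)
        i = rng.integers(m)                 # i uniform in {0, ..., m-1}
        if y[i] * np.dot(w, X[i]) < 1:      # margin violated: subgradient step
            theta = theta + y[i] * X[i]
        w_sum += w
    return w_sum / T                        # averaged iterate w_bar

# Hypothetical toy data.
X = np.array([[1.0, 2.0, 2.0], [1.0, 3.0, 1.5], [1.0, -2.0, -1.0], [1.0, -1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_bar = sgd_soft_svm(X, y, lam=0.1, T=10000, rng=np.random.default_rng(0))
print(w_bar, y * (X @ w_bar))               # w_bar and the margins y_i <w_bar, x_i>
```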

Duality
We now present (Hard-)SVM in a different way, which is very useful for kernels. We want to solve
$$w_0 = \arg\min_{w} \frac{1}{2} \|w\|^2 \quad \text{subject to } \forall i : y_i \langle w, x_i \rangle \geq 1$$
Define
$$g(w) = \max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \sum_{i=1}^{m} \alpha_i (1 - y_i \langle w, x_i \rangle)$$
Note that
$$g(w) = \begin{cases} 0 & \text{if } \forall i : y_i \langle w, x_i \rangle \geq 1 \\ +\infty & \text{otherwise} \end{cases}$$
Then the problem above can be rewritten as:
$$\min_{w} \left( \frac{1}{2} \|w\|^2 + g(w) \right)$$

$$\min_{w} \left( \frac{1}{2} \|w\|^2 + g(w) \right)$$
Rearranging the terms a bit, we obtain the equivalent problem:
$$\min_{w} \max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \left( \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i (1 - y_i \langle w, x_i \rangle) \right)$$
One can prove that
$$\min_{w} \max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \left( \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i (1 - y_i \langle w, x_i \rangle) \right) = \max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \min_{w} \left( \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i (1 - y_i \langle w, x_i \rangle) \right)$$
Note: the result above is called strong duality.

We therefore define the dual problem
$$\max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \min_{w} \left( \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i (1 - y_i \langle w, x_i \rangle) \right)$$
Note: once $\alpha$ is fixed, the optimization with respect to $w$ is unconstrained and the objective is differentiable; at the optimum the gradient is equal to 0:
$$w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \quad \Longrightarrow \quad w = \sum_{i=1}^{m} \alpha_i y_i x_i$$
Replacing $w$ in the dual problem we get:
$$\max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \left( \frac{1}{2} \left\| \sum_{i=1}^{m} \alpha_i y_i x_i \right\|^2 + \sum_{i=1}^{m} \alpha_i \left( 1 - y_i \left\langle \sum_{j=1}^{m} \alpha_j y_j x_j, x_i \right\rangle \right) \right)$$

$$\max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \left( \frac{1}{2} \left\| \sum_{i=1}^{m} \alpha_i y_i x_i \right\|^2 + \sum_{i=1}^{m} \alpha_i \left( 1 - y_i \left\langle \sum_{j=1}^{m} \alpha_j y_j x_j, x_i \right\rangle \right) \right)$$
Rearranging, we get the problem:
$$\max_{\alpha \in \mathbb{R}^m : \alpha \geq 0} \left( \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_j, x_i \rangle \right)$$
Note: the solution is the vector $\alpha$, which defines the support vectors $\{x_i : \alpha_i \neq 0\}$; the dual problem only requires computing the inner products $\langle x_j, x_i \rangle$, it never needs to consider an $x_i$ by itself.
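
Since the dual only touches the data through inner products, it can be written entirely in terms of the Gram matrix. A small sketch (mine, with made-up separable data) that builds the Gram matrix and maximizes the dual objective with scipy.optimize.minimize under the constraint $\alpha \geq 0$:
```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical separable toy data; only G[i, j] = <x_i, x_j> is used below.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
G = X @ X.T                          # Gram matrix of inner products

def neg_dual(alpha):
    # Negative dual objective: -( sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_j, x_i> )
    u = alpha * y
    return -(alpha.sum() - 0.5 * u @ G @ u)

res = minimize(neg_dual, x0=np.zeros(m), method="L-BFGS-B",
               bounds=[(0.0, None)] * m)          # enforce alpha >= 0
alpha = res.x
w = (alpha * y) @ X                  # recover w = sum_i alpha_i y_i x_i
print("alpha =", np.round(alpha, 3))              # nonzero only for support vectors
print("w =", w, "margins:", y * (X @ w))
```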