Support vector machines

Guillaume Obozinski
Ecole des Ponts - ParisTech, SOCN course 2014

Outline
1 Constrained optimization, Lagrangian duality and KKT
2 Support vector machines

Constrained optimization, Lagrangian duality and KKT

Review: Constrained optimization

Optimization problem in canonical form:

    \min_{x \in X} f(x)
    s.t.  h_i(x) = 0,  i \in [[1, n]]
          g_j(x) \le 0,  j \in [[1, m]]

with X \subset R^p, f and g_j arbitrary functions, h_i affine functions.

The problem is convex if f, the g_j and X are convex (w.l.o.g. X = R^p).

Lagrangian:

    L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{n} \lambda_i h_i(x) + \sum_{j=1}^{m} \mu_j g_j(x)

Lagrangian duality

Lagrangian:

    L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{n} \lambda_i h_i(x) + \sum_{j=1}^{m} \mu_j g_j(x)

Primal vs dual problem:

    p^* = \min_{x \in X} \max_{\lambda \in R^n, \mu \in R^m_+} L(x, \lambda, \mu)    (P)
    d^* = \max_{\lambda \in R^n, \mu \in R^m_+} \min_{x \in X} L(x, \lambda, \mu)    (D)

Max-min inequalities

For all x and y,

    \min_{x} f(x, y) \le f(x, y) \le \max_{y} f(x, y),

and taking \max_y on the left and \min_x on the right gives

    \max_{y} \min_{x} f(x, y) \le \min_{x} \max_{y} f(x, y).

Weak duality
In general we have d^* \le p^*. This is called weak duality.

Strong duality
In some cases strong duality holds: d^* = p^*, i.e. problems (P) and (D) have the same optimal value.
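A small added example (not on the original slide) showing the inequality can be strict, with f(x, y) = (x - y)^2 on [0, 1]^2:

```latex
% Hypothetical example: f(x,y) = (x-y)^2 on [0,1]^2.
\max_{y \in [0,1]} \min_{x \in [0,1]} (x-y)^2 = 0
\quad\text{(the inner player matches } x = y\text{)}
\\
\min_{x \in [0,1]} \max_{y \in [0,1]} (x-y)^2 = \tfrac{1}{4}
\quad\text{(best } x = \tfrac12\text{, worst } y \in \{0,1\}\text{)}
\\
\Rightarrow\ \max_y \min_x f \;<\; \min_x \max_y f .
```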

Slater's qualification condition

Slater's qualification condition is a condition on the constraints that guarantees that strong duality holds. Consider an optimization problem in canonical form.

Definition: Slater's condition (strong form)
There exists x \in X such that h(x) = 0 and g(x) < 0 entrywise.

Definition: Slater's condition (weak form)
There exists x \in X such that h(x) = 0 and g(x) \le 0 entrywise, with g_i(x) < 0 whenever g_i is not affine.

In words, Slater's condition requires that there exists a feasible point which is strictly feasible for all non-affine constraints.

Karush-Kuhn-Tucker conditions

Theorem. For a convex problem defined by differentiable functions f, h_i, g_j (and assuming a constraint qualification such as Slater's condition holds), x is an optimal solution if and only if there exists (\lambda, \mu) such that the KKT conditions are satisfied.

KKT conditions:

    \nabla f(x) + \sum_{i=1}^{n} \lambda_i \nabla h_i(x) + \sum_{j=1}^{m} \mu_j \nabla g_j(x) = 0    (Lagrangian stationarity)
    h(x) = 0,  g(x) \le 0                                                                           (primal feasibility)
    \mu_j \ge 0,  j \in [[1, m]]                                                                    (dual feasibility)
    \mu_j g_j(x) = 0,  j \in [[1, m]]                                                               (complementary slackness)
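A quick added illustration (not on the original slide): applying the KKT conditions to the one-dimensional problem min x^2 subject to x >= 1, i.e. 1 - x <= 0.

```latex
% Hypothetical worked example: \min_x x^2 \ \text{s.t.}\ 1 - x \le 0.
L(x, \mu) = x^2 + \mu (1 - x), \qquad \mu \ge 0
\\
\text{Stationarity: } 2x - \mu = 0, \qquad
\text{Complementary slackness: } \mu (1 - x) = 0
\\
\mu = 0 \Rightarrow x = 0 \ \text{(infeasible)}, \qquad
\mu > 0 \Rightarrow x = 1,\ \mu = 2
\\
\Rightarrow\ x^* = 1 \ \text{with multiplier } \mu^* = 2 .
```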

Support vector machines

Hard margin SVM

Binary classification problem with y_i \in \{-1, 1\}.

Margin: 1 / \|w\|

Constraints:
  for y_i = +1 require w^\top x_i + b \ge +1
  for y_i = -1 require w^\top x_i + b \le -1

This leads to

    \min_{w, b} \frac{1}{2} \|w\|^2    s.t.    \forall i,\ y_i (w^\top x_i + b) \ge 1

  a quadratic program (not such a useful property nowadays)
  infeasible if the data is not separable
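A minimal sketch of this hard-margin primal, solved here with cvxpy on toy data (the library choice and the data are my own, not from the slides):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (hypothetical).
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [-1.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# Hard-margin primal: minimize 1/2 ||w||^2  s.t.  y_i (w.x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin =", 1.0 / np.linalg.norm(w.value))
```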

Hard-margin SVM (illustration slide)

Soft margin SVM

  Allow some points to be on the wrong side of the margin
  Penalize by a cost proportional to the distance to the margin
  Introduce slack variables \xi_i measuring the violation for each data point.

    \min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i
    s.t.  \forall i,\  y_i (w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0
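For instance, a scikit-learn sketch of the soft-margin linear SVM (illustrative only; the dataset and the value of C are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two slightly overlapping Gaussian blobs (hypothetical data).
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), -np.ones(50)])

# C controls the trade-off between margin width and slack penalty.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("number of support vectors:", len(clf.support_))
```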

Lagrangian of the SVM

    L(w, b, \xi, \alpha, \nu) = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i + \sum_i \alpha_i \big(1 - \xi_i - y_i (w^\top x_i + b)\big) - \sum_i \nu_i \xi_i

                              = \frac{1}{2}\|w\|^2 - w^\top \sum_i \alpha_i y_i x_i + \sum_i \xi_i (C - \alpha_i - \nu_i) - b \sum_i \alpha_i y_i + \sum_i \alpha_i

Stationarity of the Lagrangian:

    \nabla_w L = w - \sum_i \alpha_i y_i x_i,    \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \nu_i,    \frac{\partial L}{\partial b} = - \sum_i \alpha_i y_i.

So \nabla L = 0 leads to

    w = \sum_i \alpha_i y_i x_i,    0 \le \alpha_i \le C,    \sum_i \alpha_i y_i = 0.

Dual of the SVM

    \max_{\alpha} \; -\frac{1}{2} \Big\| \sum_i \alpha_i y_i x_i \Big\|^2 + \sum_i \alpha_i
    s.t.  \sum_i \alpha_i y_i = 0,    \forall i,\ 0 \le \alpha_i \le C.

Equivalently,

    \max_{\alpha} \; -\frac{1}{2} \sum_{1 \le i, j \le n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j + \sum_i \alpha_i
    s.t.  \sum_i \alpha_i y_i = 0,    \forall i,\ 0 \le \alpha_i \le C,

or in matrix form

    \max_{\alpha} \; -\frac{1}{2} \alpha^\top D_y K D_y \alpha + \alpha^\top 1
    s.t.  \alpha^\top y = 0,    0 \le \alpha \le C,

with
  y = (y_1, ..., y_n) the vector of labels
  D_y = Diag(y) the diagonal matrix with the labels on the diagonal
  K the Gram matrix with K_{ij} = x_i^\top x_j
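A sketch of this dual QP in cvxpy (again an illustration with made-up data, not code from the course; the small ridge on Q is only there to keep the solver's PSD check happy):

```python
import numpy as np
import cvxpy as cp

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])
n, C = len(y), 1.0

K = X @ X.T                                   # Gram matrix, K_ij = x_i . x_j
Q = np.outer(y, y) * K + 1e-6 * np.eye(n)     # D_y K D_y, plus a tiny ridge for numerical PSD-ness

alpha = cp.Variable(n)
# Dual: maximize -1/2 a' D_y K D_y a + 1'a   s.t.   y'a = 0, 0 <= a <= C.
objective = cp.Maximize(-0.5 * cp.quad_form(alpha, Q) + cp.sum(alpha))
constraints = [y @ alpha == 0, alpha >= 0, alpha <= C]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                               # w = sum_i alpha_i y_i x_i
support = np.where(a > 1e-5)[0]               # support vectors: alpha_i > 0
print("w =", w, "support vectors:", support)
```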

KKT conditions for the SVM

    w = \sum_i \alpha_i y_i x_i                      (LS)
    \alpha_i + \nu_i = C                             (LS)
    \sum_i \alpha_i y_i = 0                          (LS)
    1 - \xi_i - y_i f(x_i) \le 0                     (PF)
    \xi_i \ge 0                                      (PF)
    \alpha_i \ge 0                                   (DF)
    \nu_i \ge 0                                      (DF)
    \alpha_i \big(1 - \xi_i - y_i f(x_i)\big) = 0    (CS)
    \nu_i \xi_i = 0                                  (CS)

with f(x_i) = w^\top x_i + b.

Let
  I = \{ i : \xi_i > 0 \}           (margin violators)
  M = \{ i : y_i f(x_i) = 1 \}      (points on the margin)
  S = \{ i : \alpha_i \ne 0 \}      (support vectors)
  W = (I \cup M)^c                  (strictly well-classified points)

Then
  i \in I  =>  \nu_i = 0  =>  \alpha_i = C  =>  i \in S
  i \in W  =>  \alpha_i = 0  =>  i \notin S

We have 0 \le \alpha_i \le C (since \alpha_i + \nu_i = C with \alpha_i, \nu_i \ge 0). The set S of support vectors is therefore composed of some points on the margin and all incorrectly placed points.
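To see these sets on a fitted model, one can inspect a scikit-learn SVC (a sketch; the thresholds and data are arbitrary choices of mine):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(2)
X = np.vstack([rng.randn(50, 2) + [1.5, 1.5], rng.randn(50, 2) - [1.5, 1.5]])
y = np.hstack([np.ones(50), -np.ones(50)])
C = 1.0

clf = SVC(kernel="linear", C=C).fit(X, y)

# sklearn stores y_i * alpha_i for the support vectors in dual_coef_.
alpha = np.zeros(len(y))
alpha[clf.support_] = np.abs(clf.dual_coef_[0])

S = np.where(alpha > 1e-8)[0]                                   # support vectors (alpha_i > 0)
on_margin = np.where((alpha > 1e-8) & (alpha < C - 1e-8))[0]    # 0 < alpha_i < C
at_bound = np.where(alpha > C - 1e-8)[0]                        # alpha_i = C (xi_i may be > 0)
print(len(S), "support vectors,", len(on_margin), "on the margin,", len(at_bound), "at the bound")
```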

SVM summary so far

  Optimization problem formulated as a strongly convex QP
  whose dual is also a QP
  The support vectors are the points that have a non-zero optimal weight \alpha_i
  The optimal solution is w = \sum_{i \in S} \alpha_i y_i x_i, i.e. a weighted combination of the support vectors
  The solution does not depend on the well-classified points
    Leads to working set strategies and computational gains

Remarks:
  1. The dual solution \alpha is not necessarily unique: there might be several possible sets of support vectors.
  2. How do we determine b? (see the note below)
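One standard answer to the last question (added here for completeness, not on this slide): for any margin support vector, i.e. any i with 0 < \alpha_i < C, complementary slackness gives \xi_i = 0 and y_i f(x_i) = 1, hence

```latex
% For any i with 0 < \alpha_i < C (so \nu_i > 0, hence \xi_i = 0 and y_i f(x_i) = 1):
b = y_i - w^\top x_i = y_i - \sum_{j \in S} \alpha_j y_j \, x_j^\top x_i ,
% in practice averaged over all such i for numerical stability.
```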

Representer property for the SVM

    f(x) = w^\top x + b = \sum_{i \in S} \alpha_i y_i \, x_i^\top x + b = \sum_{i \in S} \alpha_i y_i \, k(x_i, x) + b

with k(x, x') = x^\top x'.

  Eventually, this whole formulation depends only on the dot products between points.
  Can we use another dot product than the one associated with the usual Euclidean distance in R^p?
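The representer form can be checked directly on a fitted kernel SVC (a sketch with an RBF kernel; data and parameters are purely illustrative):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(3)
X = rng.randn(80, 2)
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)   # non-linearly separable labels

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# f(x) = sum_{i in S} alpha_i y_i k(x_i, x) + b, with alpha_i y_i stored in dual_coef_.
X_test = rng.randn(5, 2)
K = rbf_kernel(clf.support_vectors_, X_test, gamma=gamma)   # k(x_i, x)
f_manual = clf.dual_coef_[0] @ K + clf.intercept_[0]
f_sklearn = clf.decision_function(X_test)
print(np.allclose(f_manual, f_sklearn))                     # expect True
```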

Hinge loss interpretation of the SVM

    \min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i
    s.t.  \forall i,\  \xi_i \ge 1 - y_i (w^\top x_i + b),\ \ \xi_i \ge 0

is equivalent to

    \min_{w, b} \frac{1}{2}\|w\|^2 + C \sum_i \max\big(1 - y_i (w^\top x_i + b),\ 0\big).

Define the hinge loss \ell(a, y) = (1 - ya)_+ with (u)_+ = \max(u, 0). Our problem is now of the form

    \min_{w, b} \sum_i \ell(f(x_i), y_i) + \frac{1}{2C} \|w\|^2

with f(x) = w^\top x + b.
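A minimal subgradient-descent sketch for this unconstrained form (a toy implementation, not the optimizer used in practice; the step size, iteration count and data are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(4)
X = np.vstack([rng.randn(50, 2) + [1.5, 1.5], rng.randn(50, 2) - [1.5, 1.5]])
y = np.hstack([np.ones(50), -np.ones(50)])
n, d = X.shape
C, n_iters, lr = 1.0, 2000, 0.01

w, b = np.zeros(d), 0.0
for t in range(n_iters):
    margins = y * (X @ w + b)
    active = margins < 1                      # points with non-zero hinge loss
    # Subgradient of 1/2 ||w||^2 + C sum_i max(1 - y_i (w.x_i + b), 0)
    grad_w = w - C * (y[active] @ X[active])
    grad_b = -C * y[active].sum()
    w -= lr * grad_w
    b -= lr * grad_b

obj = 0.5 * w @ w + C * np.maximum(1 - y * (X @ w + b), 0).sum()
print("objective:", obj, "w:", w, "b:", b)
```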

Hinge loss vs other losses

The hinge loss is the smallest convex loss that upper bounds the 0-1 loss and equals 0 for large scores.
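To visualize this comparison, a small plotting sketch (the set of comparison losses and the base-2 scaling of the logistic loss are my choices, not taken from the slide):

```python
import numpy as np
import matplotlib.pyplot as plt

# Losses as functions of the score margin u = y * f(x).
u = np.linspace(-2, 3, 400)
zero_one = (u <= 0).astype(float)
hinge = np.maximum(1 - u, 0)
sq_hinge = np.maximum(1 - u, 0) ** 2
logistic = np.log2(1 + np.exp(-u))          # scaled so it passes through (0, 1)

for curve, label in [(zero_one, "0-1"), (hinge, "hinge"),
                     (sq_hinge, "squared hinge"), (logistic, "logistic (log2)")]:
    plt.plot(u, curve, label=label)
plt.xlabel("margin  y f(x)")
plt.ylabel("loss")
plt.legend()
plt.show()
```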

SVM with the quadratic hinge loss

Quadratic hinge loss: \ell(a, y) = (1 - ya)_+^2.

Quadratic SVM:

    \min_{w, b} \frac{1}{2}\|w\|^2 + C \sum_i \max\big(1 - y_i (w^\top x_i + b),\ 0\big)^2

equivalently

    \min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i^2
    s.t.  \forall i,\  y_i (w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

  Penalizes misclassified points more strongly
  Less robust to outliers
  Tends to be less sparse
  Score in [0, 1] for n large, interpretable as a probability.
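In scikit-learn, LinearSVC exposes both variants through its loss parameter (a sketch; note that LinearSVC also regularizes the intercept, a slight departure from the formulation above):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(5)
X = np.vstack([rng.randn(100, 2) + [1.0, 1.0], rng.randn(100, 2) - [1.0, 1.0]])
y = np.hstack([np.ones(100), -np.ones(100)])

# loss="hinge" ~ standard soft-margin SVM, loss="squared_hinge" ~ quadratic SVM.
for loss in ["hinge", "squared_hinge"]:
    clf = LinearSVC(loss=loss, C=1.0, dual=True, max_iter=10000).fit(X, y)
    print(loss, "train accuracy:", clf.score(X, y))
```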

Imbalanced classification

Learn a binary classifier from (x_i, y_i) pairs with

  P = \{ i : y_i = 1 \},  N = \{ i : y_i = -1 \},  n_+ = |P|,  n_- = |N|,  and  n_+ \ll n_-.

Problem: to minimize the number of mistakes, the learnt classifier might classify all points as negatives.

Some ways to address the issue:
  Subsample the negatives and learn an ensemble of classifiers.
  Introduce different costs for the positives and the negatives (see the sketch below):

    \min_{w, b, \xi} \frac{1}{2}\|w\|_2^2 + C_+ \sum_{i \in P} \xi_i + C_- \sum_{i \in N} \xi_i
    s.t.  \forall i,\  y_i (w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

  Naive choice: C_+ = C / n_+ and C_- = C / n_-.
  This is suboptimal in theory and in practice! Better to search for the optimal hyperparameter pair (C_+, C_-).
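A hedged scikit-learn sketch of the class-dependent costs, using SVC's class_weight to scale C per class (the class proportions and weights below are arbitrary, not recommendations):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

rng = np.random.RandomState(6)
# Imbalanced toy data: 20 positives, 500 negatives (hypothetical).
X = np.vstack([rng.randn(20, 2) + [1.0, 1.0], rng.randn(500, 2) - [1.0, 1.0]])
y = np.hstack([np.ones(20), -np.ones(500)])

# class_weight multiplies C per class: effective C_+ = C * w_+, C_- = C * w_-.
for weights in [None, {1: 25.0, -1: 1.0}]:
    clf = SVC(kernel="linear", C=1.0, class_weight=weights).fit(X, y)
    print(weights, confusion_matrix(y, clf.predict(X), labels=[1, -1]))
```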