Classification and Support Vector Machine

Classification and Support Vector Machine. Yiyong Feng and Daniel P. Palomar, The Hong Kong University of Science and Technology (HKUST). ELEC 5470 - Convex Optimization, Fall 2017-18, HKUST, Hong Kong.

Outline of Lecture: 1. Problem Statement. 2. Linear/Logistic Regression: Linear Regression; Regression with Huberized Loss; Logistic Regression. 3. Support Vector Machine (SVM): Linearly Separable; Linearly Nonseparable; Nonlinear. 4. Application: Multiclass Image Classification.


Classification Problem. A set of input points with binary labels: $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$, $i = 1, \ldots, N$. How to classify the $x_i$'s? [Figure: scatter plot of the Pos Class and Neg Class points.]

Classification Problem. Separate the two classes using a linear model: $\hat{y} = \beta^T x + \beta_0$. We can classify as follows: predict Pos Class if $\mathrm{sign}(\hat{y}) = +1$, and predict Neg Class if $\mathrm{sign}(\hat{y}) = -1$. Thus, a misclassification happens if $\mathrm{sign}(\hat{y}) \neq y$! Note that the decision boundary is then the hyperplane $\beta^T x + \beta_0 = 0$.

Classification Problem. Our goal is then to minimize the number of misclassifications:
$$\begin{array}{ll} \underset{\beta_0,\,\beta,\,\{\hat{y}_i\}}{\text{minimize}} & \sum_{i=1}^{N} \mathbf{1}\{\mathrm{sign}(\hat{y}_i) \neq y_i\} \\ \text{subject to} & \hat{y}_i = \beta^T x_i + \beta_0, \ \forall i. \end{array}$$
However, the objective loss function is nonconvex and nondifferentiable.

What could we do?


Classification as Linear Regression. A straightforward idea is to do linear regression, i.e., to replace the misclassification loss with the residual-sum-of-squares loss:
$$\begin{array}{ll} \underset{\beta_0,\,\beta,\,\{\hat{y}_i\}}{\text{minimize}} & \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 \\ \text{subject to} & \hat{y}_i = \beta^T x_i + \beta_0, \ \forall i. \end{array}$$
Note that $(\hat{y}_i - y_i)^2 = (\hat{y}_i - y_i)^2 y_i^2 = (1 - y_i \hat{y}_i)^2$, since $y_i^2 = 1$.
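As a concrete illustration, the following is a minimal cvxpy sketch of this least-squares classifier on synthetic data; the toy data and the names X, y, beta, beta0 are assumptions for illustration, not part of the lecture.

```python
import numpy as np
import cvxpy as cp

# Toy data: N points in R^n with labels in {-1, +1} (assumed for illustration)
rng = np.random.default_rng(0)
N, n = 100, 2
X = np.vstack([rng.normal(+2.0, 0.6, (N // 2, n)),
               rng.normal(-2.0, 0.6, (N // 2, n))])
y = np.concatenate([np.ones(N // 2), -np.ones(N // 2)])

beta = cp.Variable(n)
beta0 = cp.Variable()
y_hat = X @ beta + beta0                      # linear model for every sample

# Residual-sum-of-squares surrogate for the misclassification loss
cp.Problem(cp.Minimize(cp.sum_squares(y_hat - y))).solve()

pred = np.sign(X @ beta.value + beta0.value)  # classification rule sign(y_hat)
print("training error rate:", np.mean(pred != y))
```

The later snippets in this transcript reuse this assumed toy X, y, N, n.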

From Loss Function Point of View. Misclassification loss $\mathbf{1}\{\mathrm{sign}(\hat{y}) \neq y\}$; squared error loss $(\hat{y} - y)^2 = (1 - y\hat{y})^2$. [Figure: both losses plotted against $y\hat{y}$.]

Decision Boundary. [Figure: scatter plot of Pos Class and Neg Class with the linear-regression decision boundary.]

Can we do better?

Some idea? [Figure: misclassification and squared error losses plotted against $y\hat{y}$.] Can we find a better loss function approximation?

The Famous Huber Loss. The Huber loss (with parameter $M$) is
$$\phi_{\mathrm{hub}}(x) = \begin{cases} x^2, & |x| < M, \\ M(2|x| - M), & |x| \ge M. \end{cases}$$
Select $M = 2$ and define the loss as $\phi_{\mathrm{hub}}(1 - y\hat{y})$. [Figure: misclassification, squared error, and Huber losses plotted against $y\hat{y}$.]

The Famous Huber Loss. Actually, we don't need to penalize when $y\hat{y} > 0$! Define
$$\phi_{\mathrm{hub\_pos}}(x) = \begin{cases} \phi_{\mathrm{hub}}(x), & x \ge 0, \\ 0, & x < 0. \end{cases}$$
Select $M = 2$ and define the loss as $\phi_{\mathrm{hub\_pos}}(1 - y\hat{y})$. [Figure: misclassification, squared error, and class-Huber losses plotted against $y\hat{y}$.]

Minimize the Huberized Loss. Then, we take the Huberized loss as the approximation and intend to minimize it:
$$\begin{array}{ll} \underset{\beta_0,\,\beta,\,\{\hat{y}_i\}}{\text{minimize}} & \sum_{i=1}^{N} \phi_{\mathrm{hub\_pos}}(1 - y_i \hat{y}_i) \\ \text{subject to} & \hat{y}_i = \beta^T x_i + \beta_0, \ \forall i, \end{array}$$
where $\phi_{\mathrm{hub\_pos}}(x)$ with $M = 2$ is shown in the figure. [Figure: plot of $\phi_{\mathrm{hub\_pos}}(x)$ with $M = 2$.]
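A minimal cvxpy sketch of this huberized-loss problem, reusing the assumed toy X, y, n from the earlier snippet; it relies on cvxpy's huber atom and its sign-aware composition rule (so that cp.huber(cp.pos(.), 2) is accepted as convex), which is an assumption about the modeling tool, not part of the lecture.

```python
import cvxpy as cp

beta = cp.Variable(n)
beta0 = cp.Variable()
margin = cp.multiply(y, X @ beta + beta0)     # y_i * y_hat_i for every sample

# phi_hub_pos(1 - y*y_hat) with M = 2: zero once 1 - y*y_hat is negative,
# quadratic for small violations, linear for large ones
loss = cp.sum(cp.huber(cp.pos(1 - margin), M=2))

cp.Problem(cp.Minimize(loss)).solve()
print(beta.value, beta0.value)
```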

Decision Boundary. [Figure: scatter plot of Pos Class and Neg Class with the linear-regression and Huberized-regression decision boundaries.]

Can we find more approximations?

Logistic Model. Given a data point $x \in \mathbb{R}^n$, model the probability of the label $y \in \{-1, +1\}$ as (recall that $\hat{y} = \beta^T x + \beta_0$):
$$P(Y = y \mid x) = \frac{1}{1 + e^{-y\hat{y}}}.$$
Here, the function $g(z) = \frac{1}{1 + e^{-z}}$ is called the logistic function. [Figure: plot of the logistic function $g(z)$.]

Logistic Model. The above probability formula amounts to modeling the log-odds ratio as the linear model $\hat{y}$:
$$\log \frac{P(Y = +1 \mid x)}{P(Y = -1 \mid x)} = \log \frac{1 + e^{\hat{y}}}{1 + e^{-\hat{y}}} = \hat{y} = \beta^T x + \beta_0.$$
Classification rule: predict Pos Class if $\mathrm{sign}(\hat{y}) = +1$, and predict Neg Class if $\mathrm{sign}(\hat{y}) = -1$. This classification rule is exactly the same as the one we wanted before!

Negative Log-likelihood as Loss. Armed with the logistic model, we can write the likelihood function:
$$\mathrm{LH}(\beta, \beta_0) = \prod_{i=1}^{N} P(Y = y_i \mid x_i) = \prod_{i=1}^{N} \frac{1}{1 + e^{-y_i \hat{y}_i}}.$$
The loss function can be defined as the negative log-likelihood:
$$\mathrm{Loss}(\beta, \beta_0) = -\log \mathrm{LH}(\beta, \beta_0) = \sum_{i=1}^{N} \log\left(1 + e^{-y_i \hat{y}_i}\right).$$

Logistic Regression. Now, minimize the loss function (i.e., maximize the likelihood):
$$\begin{array}{ll} \underset{\beta_0,\,\beta,\,\{\hat{y}_i\}}{\text{minimize}} & \sum_{i=1}^{N} \log\left(1 + e^{-y_i \hat{y}_i}\right) \\ \text{subject to} & \hat{y}_i = \beta^T x_i + \beta_0, \ \forall i. \end{array}$$
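For completeness, a minimal cvxpy sketch of this logistic-regression problem on the assumed toy X, y, n from above; cp.logistic(z) computes log(1 + e^z).

```python
import cvxpy as cp

beta = cp.Variable(n)
beta0 = cp.Variable()
y_hat = X @ beta + beta0

# Negative log-likelihood: sum_i log(1 + exp(-y_i * y_hat_i))
nll = cp.sum(cp.logistic(-cp.multiply(y, y_hat)))

cp.Problem(cp.Minimize(nll)).solve()
print(beta.value, beta0.value)
```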

From Loss Function Point of View. Misclassification loss $\mathbf{1}\{\mathrm{sign}(\hat{y}) \neq y\}$; squared error loss $(\hat{y} - y)^2$; class-Huber loss $\phi_{\mathrm{hub\_pos}}(1 - y\hat{y})$; negative log-likelihood $\log\left(1 + e^{-y\hat{y}}\right)$. [Figure: the four losses plotted against $y\hat{y}$.]

Decision Boundary. [Figure: scatter plot of Pos Class and Neg Class with the linear-regression, Huberized-regression, and logistic-regression decision boundaries.]


Optimal Separating Hyperplane. So many hyperplanes can separate the two classes. BUT, what is the optimal separating hyperplane? The optimal separating hyperplane should: separate the two classes, and maximize the distance to the closest point from either class.

Signed Distance. For a point $x$ and any point $x_0$ on the decision boundary $\beta^T x + \beta_0 = 0$ (so that $\beta^T x_0 = -\beta_0$), the signed distance from $x$ to the boundary is
$$\frac{1}{\|\beta\|}\,\beta^T (x - x_0) = \frac{1}{\|\beta\|}\left(\beta^T x + \beta_0\right).$$
[Figure: geometric illustration of the signed distance to the hyperplane, with normal direction $\beta / \|\beta\|$.]

Maximize the Margin. The margin is then defined as $\frac{1}{\|\beta\|}\, y\left(\beta^T x + \beta_0\right)$. Now, maximize the distance to the closest point from either class:
$$\begin{array}{ll} \underset{\beta_0,\,\beta,\,M}{\text{maximize}} & M \\ \text{subject to} & \frac{1}{\|\beta\|}\, y_i\left(\beta^T x_i + \beta_0\right) \ge M, \ \forall i. \end{array}$$

Support Vector Machine. For any $\beta$ and $\beta_0$ satisfying these inequalities, any positively scaled multiple satisfies them too, so we can arbitrarily set $\|\beta\| = 1/M$. The above problem then amounts to
$$\begin{array}{ll} \underset{\beta_0,\,\beta}{\text{minimize}} & \|\beta\| \\ \text{subject to} & y_i\left(\beta^T x_i + \beta_0\right) \ge 1, \ \forall i. \end{array}$$
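A minimal cvxpy sketch of this hard-margin SVM on the assumed toy X, y, n from above; the problem is infeasible if the sampled points happen not to be linearly separable.

```python
import numpy as np
import cvxpy as cp

beta = cp.Variable(n)
beta0 = cp.Variable()

constraints = [cp.multiply(y, X @ beta + beta0) >= 1]   # margin constraints
cp.Problem(cp.Minimize(cp.norm(beta, 2)), constraints).solve()

print("margin width:", 2 / np.linalg.norm(beta.value))  # distance between support lines
```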

Boundary and Support Vectors. [Figure: scatter plot of Pos Class and Neg Class with the linear-regression, Huberized-regression, and logistic-regression boundaries and the support lines.]

The case we considered before is linearly separable. What if it is linearly nonseparable?

Linearly Nonseparable Case. [Figure: scatter plot of overlapping Pos Class and Neg Class points.]

Linearly Nonseparable Case. Relax the constraints by introducing positive slack variables $\xi_i$:
$$y_i\left(\beta^T x_i + \beta_0\right) \ge 1 - \xi_i,$$
and then penalize $\sum_i \xi_i$ in the objective function:
$$\begin{array}{ll} \underset{\beta_0,\,\beta,\,\{\xi_i\}}{\text{minimize}} & \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{N} \xi_i \\ \text{subject to} & \xi_i \ge 0, \ \ y_i\left(\beta^T x_i + \beta_0\right) \ge 1 - \xi_i, \ \forall i. \end{array}$$
The linearly separable case corresponds to $C = \infty$.
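A minimal cvxpy sketch of this soft-margin primal on the assumed toy X, y, N, n from above; the value C = 1 is an arbitrary illustrative choice.

```python
import cvxpy as cp

C = 1.0
beta = cp.Variable(n)
beta0 = cp.Variable()
xi = cp.Variable(N, nonneg=True)              # slack variables

constraints = [cp.multiply(y, X @ beta + beta0) >= 1 - xi]
objective = 0.5 * cp.sum_squares(beta) + C * cp.sum(xi)

cp.Problem(cp.Minimize(objective), constraints).solve()
print(beta.value, beta0.value)
```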

The SVM as a Penalization Method. Revisit the primal problem:
$$\begin{array}{ll} \underset{\beta_0,\,\beta,\,\{\xi_i\}}{\text{minimize}} & \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{N} \xi_i \\ \text{subject to} & \xi_i \ge 0, \ \ y_i\left(\beta^T x_i + \beta_0\right) \ge 1 - \xi_i, \ \forall i. \end{array}$$
Consider the optimization problem
$$\begin{array}{ll} \underset{\beta_0,\,\beta,\,\{\hat{y}_i\}}{\text{minimize}} & \sum_{i=1}^{N} \left[1 - y_i \hat{y}_i\right]_+ + \frac{\lambda}{2}\|\beta\|^2 \\ \text{subject to} & \hat{y}_i = \beta^T x_i + \beta_0, \ \forall i, \end{array}$$
where the loss $\sum_{i=1}^{N} \left[1 - y_i \hat{y}_i\right]_+$ is called the hinge loss. The two problems are equivalent with $\lambda = 1/C$ (the linearly separable case corresponds to $\lambda = 0$).

From Loss Function Point of View. Misclassification loss $\mathbf{1}\{\mathrm{sign}(\hat{y}) \neq y\}$; squared error loss $(\hat{y} - y)^2$; class-Huber loss $\phi_{\mathrm{hub\_pos}}(1 - y\hat{y})$; negative log-likelihood loss $\log\left(1 + e^{-y\hat{y}}\right)$; hinge loss $\left[1 - y\hat{y}\right]_+$. [Figure: the five losses plotted against $y\hat{y}$.]

Solving the SVM I. The Lagrangian is
$$L\left(\beta, \beta_0, \{\xi_i\}, \{\alpha_i\}, \{\mu_i\}\right) = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\mu_i \xi_i - \sum_{i=1}^{N}\alpha_i\left[ y_i\left(\beta^T x_i + \beta_0\right) - (1 - \xi_i) \right],$$
where $\alpha_i \ge 0$ and $\mu_i \ge 0$, $\forall i$, are dual variables. Setting the derivatives w.r.t. $\beta$, $\beta_0$, $\{\xi_i\}$ to zero gives
$$\beta = \sum_{i=1}^{N}\alpha_i y_i x_i, \qquad 0 = \sum_{i=1}^{N}\alpha_i y_i, \qquad \alpha_i = C - \mu_i, \ \forall i.$$

Solving the SVM II. Substituting these stationarity conditions into the Lagrangian yields the dual function:
$$g\left(\{\alpha_i\}, \{\mu_i\}\right) = \inf_{\beta,\,\beta_0,\,\{\xi_i\}} L\left(\beta, \beta_0, \{\xi_i\}, \{\alpha_i\}, \{\mu_i\}\right) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j x_i^T x_j.$$

Solving the SVM III. Dual problem:
$$\begin{array}{ll} \underset{\{\alpha_i\}}{\text{maximize}} & \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j x_i^T x_j \\ \text{subject to} & 0 \le \alpha_i \le C, \ \forall i, \quad \sum_{i=1}^{N}\alpha_i y_i = 0. \end{array}$$
The dual problem is a QP! Vector-matrix representation of the objective:
$$\mathbf{1}^T \alpha - \frac{1}{2}\,\alpha^T \left(\mathrm{Diag}(y)\, X^T X\, \mathrm{Diag}(y)\right) \alpha.$$
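A minimal cvxpy sketch of this dual QP on the assumed toy data from above (with samples stored as rows of X, so X @ X.T is the Gram matrix of inner products); writing the quadratic term as a sum of squares so its convexity is explicit is a coding choice, not part of the lecture.

```python
import cvxpy as cp

C = 1.0
alpha = cp.Variable(N)
ya = cp.multiply(y, alpha)                    # elementwise y_i * alpha_i

# 1^T alpha - 0.5 * alpha^T Diag(y) X X^T Diag(y) alpha
objective = cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ ya)
constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]

cp.Problem(cp.Maximize(objective), constraints).solve()
alpha_opt = alpha.value                       # optimal dual variables
```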

KKT Conditions. The KKT conditions characterize the solutions of the primal and dual problems:
$$\xi_i \ge 0, \quad y_i\left(\beta^T x_i + \beta_0\right) \ge 1 - \xi_i, \ \forall i,$$
$$0 = \sum_{i=1}^{N}\alpha_i y_i, \quad 0 \le \alpha_i \le C, \ \forall i,$$
$$\beta = \sum_{i=1}^{N}\alpha_i y_i x_i, \quad \alpha_i = C - \mu_i, \ \forall i,$$
$$\mu_i \xi_i = 0, \quad \alpha_i\left[ y_i\left(\beta^T x_i + \beta_0\right) - (1 - \xi_i) \right] = 0, \ \forall i.$$

Finding the Decision Boundary. Given the optimal dual variables $\alpha_i^\star$, we have $\beta^\star = \sum_{i=1}^{N}\alpha_i^\star y_i x_i$. Support vectors: those observations $i$ with nonzero $\alpha_i^\star$. For any $0 < \alpha_j^\star < C$, we have $\xi_j = 0$ and $y_j\left(\sum_{i=1}^{N}\alpha_i^\star y_i x_i^T x_j + \beta_0^\star\right) = 1$, hence we can solve for $\beta_0^\star$. The decision function becomes
$$D(x) = \mathrm{sign}\left(\hat{y}\right) = \mathrm{sign}\left(\beta^{\star T} x + \beta_0^\star\right) = \mathrm{sign}\left(\sum_{i=1}^{N}\alpha_i^\star y_i x_i^T x + \beta_0^\star\right).$$
Observation: in the dual problem and in the above decision function, the only operation on the $x_i$'s is the inner product!
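A small numpy sketch of this recovery step, using alpha_opt and C from the dual snippet above; the tolerance 1e-5 is an arbitrary numerical threshold.

```python
import numpy as np

tol = 1e-5
beta_star = (alpha_opt * y) @ X               # beta = sum_i alpha_i y_i x_i

# A support vector with 0 < alpha_j < C lies exactly on the margin,
# so y_j (beta^T x_j + beta0) = 1, i.e., beta0 = y_j - beta^T x_j
j = np.flatnonzero((alpha_opt > tol) & (alpha_opt < C - tol))[0]
beta0_star = y[j] - X[j] @ beta_star

def decide(x_new):
    """Linear SVM decision function sign(beta^T x + beta0)."""
    return np.sign(x_new @ beta_star + beta0_star)
```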

Decision Boundary. [Figure: SVM decision boundary and support lines for C = 1.] Large C: focuses attention more on points near the boundary, giving a small margin. Small C: involves data further away, giving a large margin.

[Figure: SVM decision boundaries for C = 0.01, C = 0.1, C = 10, and C = 10000.]

Decision Boundary: Revisit the Linearly Separable Case. [Figure: scatter plot of Pos Class and Neg Class with the linear-regression, Huberized-regression, and logistic-regression boundaries and the SVM support lines.]

Decision Boundary: Linearly Nonseparable Case. [Figure: scatter plot of Pos Class and Neg Class with the linear-regression, Huberized-regression, and logistic-regression boundaries and the SVM boundary (C = 10) with its support lines.]

For the linearly separable/nonseparable cases, the SVM so far works well! But what if the linear decision boundary doesn't work any more?

Nonlinear Decision Surface. [Figure: data requiring a nonlinear decision surface.]

Feature Transformation. Naive method: use a curve instead of a line (not efficient). Feature transformation: pre-process the data with $h : \mathbb{R}^n \to \mathcal{H}$, $x \mapsto h(x)$, where $\mathbb{R}^n$ is the input space and $\mathcal{H}$ is the feature space. Example: $h(x) = \begin{bmatrix} x_1^2 & x_2^2 \end{bmatrix}^T$. [Figure: the data before and after the transformation.]

After Feature Transformation. Using the features $h(x)$ as inputs, the dual problem becomes
$$\begin{array}{ll} \underset{\{\alpha_i\}}{\text{maximize}} & \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, h(x_i)^T h(x_j) \\ \text{subject to} & \sum_{i=1}^{N}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \ \forall i. \end{array}$$
The decision function (with $\hat{y} = \beta^T h(x) + \beta_0$) is
$$D(x) = \mathrm{sign}\left(\beta^T h(x) + \beta_0\right) = \mathrm{sign}\left(\sum_{i=1}^{N}\alpha_i y_i\, h(x_i)^T h(x) + \beta_0\right),$$
where $\beta_0$ can be determined, as before, by solving $y_i \hat{y}_i = 1$ for any $x_i$ for which $0 < \alpha_i < C$.

Kernelized SVM. In fact, we need not specify the transformation $h(x)$ at all; we require only knowledge of the kernel function that computes inner products in the transformed space:
$$K(x_i, x_j) \triangleq h(x_i)^T h(x_j).$$
Dual problem:
$$\begin{array}{ll} \underset{\{\alpha_i\}}{\text{maximize}} & \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \\ \text{subject to} & \sum_{i=1}^{N}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \ \forall i. \end{array}$$
The decision function:
$$D(x) = \mathrm{sign}\left(\sum_{i=1}^{N}\alpha_i y_i K(x_i, x) + \beta_0\right),$$
where $\beta_0$ can be determined, as before, by solving $y_i \hat{y}_i = 1$ for any $x_i$ for which $0 < \alpha_i < C$.

Kernel. $K$ should be a symmetric positive (semi-)definite function. The previous linear SVM corresponds to $K(x_i, x_j) \triangleq x_i^T x_j$. Popular kernels: $p$th-degree polynomial, $K(x_i, x_j) = \left(1 + x_i^T x_j\right)^p$; Gaussian radial kernel, $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / \sigma^2}$; neural network, $K(x_i, x_j) = \tanh\left(\kappa_1 x_i^T x_j + \kappa_2\right)$.
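Illustrative numpy sketches of two of these kernels; the resulting Gram matrix can replace X @ X.T in the dual QP snippet above, and the parameter values are arbitrary assumptions.

```python
import numpy as np

def poly_kernel(A, B, p=2):
    """(1 + x_i^T x_j)^p for all pairs of rows of A and B."""
    return (1.0 + A @ B.T) ** p

def gaussian_kernel(A, B, sigma=1.0):
    """exp(-||x_i - x_j||^2 / sigma^2) for all pairs of rows of A and B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / sigma**2)

K_train = gaussian_kernel(X, X, sigma=1.0)    # Gram matrix for the kernelized dual
```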

Steps for Applying the SVM. 1. In-sample training: (a) select the parameter $C$; (b) select the kernel function $K(x_i, x_j)$ and its related parameters; (c) solve the dual problem to obtain the $\alpha_i^\star$'s; (d) compute $\beta_0^\star$, and then classify according to $\mathrm{sign}\left(\sum_{i=1}^{N}\alpha_i^\star y_i K(x_i, x) + \beta_0^\star\right)$. 2. Out-of-sample testing: (a) use the trained model to test on out-of-sample data; (b) if the out-of-sample performance is not good, adjust the parameter $C$ or the kernel and re-train the model until the out-of-sample result is good enough.
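In practice this train/tune/test loop is typically run with an off-the-shelf solver; the following is a minimal sketch with scikit-learn's SVC on the assumed toy X, y from above, where the candidate parameter grid and the train/test split are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hold out part of the data for out-of-sample testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

best_acc, best_params = 0.0, None
for C in [0.1, 1.0, 10.0]:                    # step 1(a): candidate values of C
    for gamma in [0.1, 1.0]:                  # step 1(b): RBF kernel parameter
        clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X_tr, y_tr)  # steps 1(c)-(d)
        acc = clf.score(X_te, y_te)           # step 2: out-of-sample check
        if acc > best_acc:
            best_acc, best_params = acc, (C, gamma)

print("best out-of-sample accuracy:", best_acc, "with (C, gamma) =", best_params)
```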

Decision Boundary by the Gaussian Radial Kernel. [Figure: nonlinear decision boundary obtained with C = 1 and sigma = 0.1.]

More than two classes? What if there are more than TWO classes?

Multiclass Classification Setup. Labels: $\{-1, +1\} \rightarrow \{1, 2, \ldots, K\}$. Classification decision rule: $f : x \in \mathbb{R}^n \rightarrow \{1, 2, \ldots, K\}$. [Figure: scatter plot of Class 1, Class 2, and Class 3.]

Multiclass Classification Methods. Main ideas: decompose the multiclass classification problem into multiple binary classification problems; use the majority voting principle, or a combined decision from a committee, to predict the label. Common approaches: the One-vs-Rest (One-vs-All) approach and the One-vs-One (Pairwise, All-vs-All) approach.

One-vs-Rest Approach. Steps: 1. Solve $K$ different binary problems: classify class $i$ as $+1$ versus the rest of the classes $j \in \{1, \ldots, K\} \setminus \{i\}$ as $-1$. 2. Assign a test sample to the class $\arg\max_i f_i(x)$, where $f_i(x)$ is the solution of the $i$-th problem. Properties: simple to implement and performs well in practice; not optimal (asymptotically).
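A schematic one-vs-rest sketch built on scikit-learn's SVC; the multiclass labels y_multi and the linear kernel are assumptions for illustration (scikit-learn also ships ready-made one-vs-rest wrappers).

```python
import numpy as np
from sklearn.svm import SVC

def one_vs_rest_fit(X, y_multi, classes, C=1.0):
    """Train one binary SVM per class: class k as +1 versus the rest as -1."""
    models = {}
    for k in classes:
        y_bin = np.where(y_multi == k, 1.0, -1.0)
        models[k] = SVC(C=C, kernel="linear").fit(X, y_bin)
    return models

def one_vs_rest_predict(models, X_new):
    """Assign each sample to the class whose decision value f_k(x) is largest."""
    keys = list(models)
    scores = np.column_stack([models[k].decision_function(X_new) for k in keys])
    return np.array(keys)[np.argmax(scores, axis=1)]
```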

One-vs-Rest Example: Step 1, training. [Figure: Class 1 vs. Classes 2 & 3, with Class 1 labeled +1 and Classes 2 and 3 labeled -1.]

One-vs-Rest Example: Step 1, training (continued). [Figure: Class 2 vs. Classes 1 & 3 (Class 2 labeled +1, Classes 1 and 3 labeled -1) and Class 3 vs. Classes 1 & 2 (Class 3 labeled +1, Classes 1 and 2 labeled -1).]

One-vs-Rest Example: Step 2, test. For the test point $x = [0\ \ 2]^T$, using $f_i(x) = x^T \beta_i^\star + \beta_0^{i\star}$, we have $f_1(x) = -1.0783$, $f_2(x) = 0.5545$, and $f_3(x) = -2.2560$. Assign $[0\ \ 2]^T$ to class 2! [Figure: the three one-vs-rest boundaries and the test sample.]

One-vs-One Approach. Steps: 1. Solve $\binom{K}{2}$ different binary problems: classify class $i$ as $+1$ versus each class $j \neq i$ as $-1$; each classifier is called $f_{ij}$. 2. For prediction at a point, each classifier is queried once and issues a vote; the class with the maximum number of votes is the winner. Properties: the training process is efficient (small binary problems); there are many problems when $K$ is large (if $K = 10$, we need to train 45 binary classifiers); simple to implement and performs competitively in practice.
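A schematic one-vs-one sketch in the same style, again with assumed multiclass labels y_multi; each pairwise classifier votes once and the majority wins.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def one_vs_one_fit(X, y_multi, classes, C=1.0):
    """Train one binary SVM f_ij per unordered pair of classes (i, j)."""
    models = {}
    for i, j in combinations(classes, 2):
        mask = np.isin(y_multi, [i, j])
        y_bin = np.where(y_multi[mask] == i, 1.0, -1.0)  # class i -> +1, class j -> -1
        models[(i, j)] = SVC(C=C, kernel="linear").fit(X[mask], y_bin)
    return models

def one_vs_one_predict(models, X_new, classes):
    """Each pairwise classifier votes once; the class with the most votes wins."""
    votes = {k: np.zeros(len(X_new)) for k in classes}
    for (i, j), clf in models.items():
        pred = clf.predict(X_new)
        votes[i] += (pred == 1.0)
        votes[j] += (pred == -1.0)
    tally = np.column_stack([votes[k] for k in classes])
    return np.array(classes)[np.argmax(tally, axis=1)]
```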

One-vs-One Example: Step 1, training. [Figure: Class 1 vs. Class 2, with Class 1 labeled +1, Class 2 labeled -1, and Class 3 left out.]

One-vs-One Example: Step 1, training (continued). [Figure: Class 1 vs. Class 3 (Class 1 labeled +1, Class 3 labeled -1) and Class 2 vs. Class 3 (Class 2 labeled +1, Class 3 labeled -1).]

One-vs-One Example: Step 2, test. For the same test point $x = [0\ \ 2]^T$: $f_{12}(x) = -0.7710 < 0$, vote for class 2; $f_{23}(x) = 1.0336 > 0$, vote for class 2; $f_{13}(x) = 0.7957 > 0$, vote for class 1. Conclusion: class 2 wins! [Figure: the three pairwise boundaries and the test sample.]


Application: Histogram-based Image Classification. Multiclass problem: air shows, bears, Arabian horses, night scenes, and several more classes not shown here. [Figure: example images from the classes, reproduced from IEEE Transactions on Neural Networks, vol. 10, no. 5, September 1999.]

App: Histogram-based Image Multi-Classification, Modeling. Why not put the image pixels directly into a vector? Drawbacks: large size, and lack of invariance with respect to translations. Histogram-based image representation: color space is Hue-Saturation-Value (HSV); number of bins per color component = 16; $x \in \mathbb{R}^{4096}$ is the histogram of the picture, of dimension $16^3 = 4096$; $y \in \{\text{airshows, bears}, \ldots\}$ is the class label.
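An illustrative numpy sketch of this histogram representation (16 bins per HSV channel, a $16^3 = 4096$-dimensional feature); the helper name, the assumed [0, 1] value range, and the normalization are assumptions, not the exact pipeline of the cited paper.

```python
import numpy as np

def hsv_histogram(hsv_image, bins_per_channel=16):
    """Turn an HSV image (H x W x 3, values in [0, 1]) into a joint color histogram."""
    pixels = hsv_image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels,
                             bins=(bins_per_channel,) * 3,
                             range=((0, 1), (0, 1), (0, 1)))
    hist = hist.ravel()                       # 16^3 = 4096-dimensional vector
    return hist / hist.sum()                  # normalize so the bins sum to one

# Example on a random "image", just to show the output dimension
x = hsv_histogram(np.random.default_rng(0).random((64, 64, 3)))
print(x.shape)                                # (4096,)
```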

App: Histogram-based Image Multi-Classification, Applied SVMs. SVMs used: linear; Poly 2, $K(x_i, x_j) = \left(1 + x_i^T x_j\right)^2$; radial basis function, $K(x_i, x_j) = e^{-\rho\, d(x_i, x_j)}$, where the distance measure $d(x_i, x_j)$ can be Gaussian, $d(x_i, x_j) = \|x_i - x_j\|^2$; Laplacian ($\ell_1$ distance), $d(x_i, x_j) = \sum_k \left|x_i^{(k)} - x_j^{(k)}\right|$; or $\chi^2$, $d(x_i, x_j) = \sum_k \frac{\left(x_i^{(k)} - x_j^{(k)}\right)^2}{x_i^{(k)} + x_j^{(k)}}$.
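Sketches of these generalized RBF kernels in numpy; the value of rho is an assumption, and a small epsilon is added to guard against empty histogram bins in the chi-squared distance.

```python
import numpy as np

def laplacian_rbf_kernel(A, B, rho=1.0):
    """K(x_i, x_j) = exp(-rho * ||x_i - x_j||_1) for all pairs of rows of A and B."""
    d = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-rho * d)

def chi2_rbf_kernel(A, B, rho=1.0, eps=1e-10):
    """K(x_i, x_j) = exp(-rho * sum_k (a_k - b_k)^2 / (a_k + b_k))."""
    num = (A[:, None, :] - B[None, :, :]) ** 2
    den = A[:, None, :] + B[None, :, :] + eps
    return np.exp(-rho * (num / den).sum(axis=2))
```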

App: Histogram-based Image Multi-Classification, Results. Criterion: error rate in percent. Benchmark: the K-nearest-neighbor (KNN) method.

Database    KNN (l2)    KNN (chi^2)
Corel14     47.7        26.5
Corel7      51.4        35.4

Database    Linear SVM    Poly 2    RBF (Gaussian)    RBF (Laplacian)    RBF (chi^2)
Corel14     36.3          33.6      28.8              14.7               14.7
Corel7      42.7          38.9      32.2              20.5               21.6

For Further Reading. Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning, Springer New York, 2009. Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik, "Support vector machines for histogram-based image classification," IEEE Transactions on Neural Networks, 10(5):1055-1064, 1999.

Thanks! For more information visit: http://www.danielppalomar.com