Support Vector Machines for Classification: A Statistical Portrait
Yoonkyung Lee, Department of Statistics, The Ohio State University
May 27, 2011, The Spring Conference of the Korean Statistical Society, KAIST, Daejeon, Korea
Handwritten digit recognition
Figure: 16×16 grayscale images scanned from postal envelopes, courtesy of Hastie, Tibshirani, & Friedman (2001). Cortes & Vapnik (1995) applied the SVM to these data and demonstrated its improved accuracy over decision trees and neural networks.
Classification
Training data {(x_i, y_i), i = 1, ..., n}, where x = (x_1, ..., x_p) ∈ R^p and y ∈ Y = {1, ..., k}.
Learn a rule φ : R^p → Y from the training data which can be generalized to novel cases.
Figure: two-class training data in the (x1, x2) plane.
The Bayes decision rule
The 0-1 loss function: L(y, φ(x)) = I(y ≠ φ(x))
(X, Y): a random sample from P(x, y), and p_j(x) = P(Y = j | X = x)
The rule that minimizes the risk R(φ) = E L(Y, φ(X)) = P(Y ≠ φ(X)):
φ_B(x) = arg max_{j ∈ Y} p_j(x)
The Bayes error rate: R* = R(φ_B) = 1 − E(max_j p_j(X))
Two approaches to classification
Probability-based plug-in rules (soft classification): φ̂(x) = arg max_{j ∈ Y} p̂_j(x)
e.g. logistic regression, density estimation (LDA, QDA), ...
R(φ̂) − R* ≤ 2 E max_{j ∈ Y} |p_j(X) − p̂_j(X)|
Error minimization (hard classification): find φ ∈ F minimizing R_n(φ) = (1/n) Σ_{i=1}^n L(y_i, φ(x_i)).
e.g. large margin classifiers (support vector machine, boosting, ...)
Discriminant function
It is often much easier to find a real-valued discriminant function f(x) first and obtain a classification rule φ(x) through f. For instance, in the binary setting Y = {−1, +1} (symmetric labels):
Classification rule: φ(x) = sign(f(x)) for a discriminant function f
Classification boundary: {x : f(x) = 0}
yf(x) > 0 indicates a correct decision for (x, y).
Linearly separable case
Figure: linearly separable two-class data in the (x1, x2) plane.
Perceptron algorithm
Rosenblatt (1958), The perceptron: A probabilistic model for information storage and organization in the brain.
Find a separating hyperplane by sequentially updating β and β_0 of a linear classifier φ(x) = sign(β·x + β_0).
Step 1. Initialize β^(0) = 0 and β_0^(0) = 0.
Step 2. For m = 1, 2, ..., while there is a misclassified point with y_i(β^(m−1)·x_i + β_0^(m−1)) ≤ 0, repeat:
- Choose a misclassified point (x_i, y_i).
- Update β^(m) = β^(m−1) + y_i x_i and β_0^(m) = β_0^(m−1) + y_i.
(Novikoff) The algorithm terminates within (R² + 1)(b² + 1)/δ² iterations, where R = max_i ‖x_i‖ and δ = min_i y_i(w·x_i + b) > 0 for some w ∈ R^p with ‖w‖ = 1 and b ∈ R.
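The update rule above fits in a few lines of Python. A minimal sketch, where the toy data set and the iteration cap are illustrative choices, not part of the slides:

```python
def perceptron(X, y, max_iter=1000):
    """Return (beta, beta0) with sign(beta.x + beta0) separating the data."""
    p = len(X[0])
    beta, beta0 = [0.0] * p, 0.0
    for _ in range(max_iter):
        updated = False
        for xi, yi in zip(X, y):
            # misclassified (or on the boundary): y_i (beta.x_i + beta0) <= 0
            if yi * (sum(b * v for b, v in zip(beta, xi)) + beta0) <= 0:
                beta = [b + yi * v for b, v in zip(beta, xi)]  # beta += y_i x_i
                beta0 += yi                                    # beta0 += y_i
                updated = True
        if not updated:  # a full pass with no mistakes: hyperplane found
            return beta, beta0
    raise RuntimeError("no separating hyperplane found within max_iter passes")

# linearly separable toy data in R^2
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
y = [-1, -1, -1, +1, +1, +1]
beta, beta0 = perceptron(X, y)
```

By Novikoff's bound, the loop is guaranteed to terminate on separable data such as this.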
Optimal separating hyperplane
Figure: the optimal separating hyperplane β·x + β_0 = 0 with margin boundaries β·x + β_0 = ±1; the margin is 2/‖β‖.
Support Vector Machines
Boser, Guyon, & Vapnik (1992), A training algorithm for optimal margin classifiers.
Vapnik (1995), The Nature of Statistical Learning Theory.
Find the separating hyperplane with the maximum margin: f(x) = β·x + β_0 minimizing ‖β‖² subject to y_i f(x_i) ≥ 1 for all i = 1, ..., n.
Classification rule: φ(x) = sign(f(x))
Why large margin?
Vapnik's justification for the large margin:
- The complexity of separating hyperplanes is inversely related to the margin.
- Algorithms that maximize the margin can be expected to produce lower test error rates.
A form of regularization: e.g. ridge regression, LASSO, smoothing splines, Tikhonov regularization.
Non-separable case
Relax the separability condition to y_i f(x_i) ≥ 1 − ξ_i by introducing slack variables ξ_i ≥ 0 (a common technique in constrained optimization).
Take ξ_i (proportional to the distance of x_i from yf(x) = 1) as a loss.
Find f(x) = β·x + β_0 minimizing (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + (λ/2)‖β‖².
Hinge loss: L(y, f(x)) = (1 − yf(x))_+ where (t)_+ = max(t, 0).
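Since the objective above is convex, it can be minimized directly by subgradient descent. A rough sketch (not the quadratic program of the later slides); the step size, λ, toy data, and iteration count are illustrative choices:

```python
def svm_subgradient(X, y, lam=0.01, eta=0.1, n_iter=2000):
    """Minimize (1/n) sum_i (1 - y_i f(x_i))_+ + (lam/2)||beta||^2."""
    n, p = len(X), len(X[0])
    beta, beta0 = [0.0] * p, 0.0
    for _ in range(n_iter):
        gb = [lam * b for b in beta]          # gradient of (lam/2)||beta||^2
        gb0 = 0.0
        for xi, yi in zip(X, y):
            f = sum(b * v for b, v in zip(beta, xi)) + beta0
            if yi * f < 1:                    # hinge loss is active here
                for j in range(p):
                    gb[j] -= yi * xi[j] / n   # subgradient of (1 - y f)_+
                gb0 -= yi / n
        beta = [b - eta * g for b, g in zip(beta, gb)]
        beta0 -= eta * gb0
    return beta, beta0

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
y = [-1, -1, -1, +1, +1, +1]
beta, beta0 = svm_subgradient(X, y)
```

The singularity of the hinge at yf(x) = 1 is handled by taking any subgradient; points with yf(x) ≥ 1 contribute nothing to the update.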
Hinge loss
Figure: (1 − yf(x))_+ is a convex upper bound of the misclassification loss I(y ≠ φ(x)) = [−yf(x)]_* , where [t]_* = I(t ≥ 0) and (t)_+ = max{t, 0}.
Remarks on hinge loss
- Originates from the separability condition stated as an inequality and its relaxation.
- Taking it as a negative log likelihood would imply a very unusual probability model.
- Yields a method that is robust compared to logistic regression and boosting.
- The singularity at 1 leads to a sparse solution.
Computation: quadratic programming
Primal problem: minimize w.r.t. β_0, β, and ξ_i
(1/n) Σ_{i=1}^n ξ_i + (λ/2)‖β‖²
subject to y_i(β·x_i + β_0) ≥ 1 − ξ_i and ξ_i ≥ 0 for i = 1, ..., n.
Dual problem: maximize w.r.t. α_i (Lagrange multipliers)
Σ_{i=1}^n α_i − (1/(2nλ)) Σ_{i,j} α_i α_j y_i y_j x_i·x_j
subject to 0 ≤ α_i ≤ 1 for i = 1, ..., n and Σ_{i=1}^n α_i y_i = 0.
β̂ = (1/(nλ)) Σ_{i=1}^n α̂_i y_i x_i (from the KKT conditions)
Support vectors: data points with α̂_i > 0
Operational properties
- The SVM classification rule depends on the support vectors only (sparsity).
- The sparsity leads to efficient data reduction and fast evaluation at the testing phase.
- Can handle high-dimensional data, even when p ≫ n, as the solution depends on x only through the inner products x_i·x_j in the dual formulation.
- Requires solving a quadratic programming problem of size n.
Nonlinear SVM
Linear SVM solution: f(x) = Σ_{i=1}^n c_i (x_i·x) + b
Replace the Euclidean inner product x·t with K(x, t) = Φ(x)·Φ(t) for a mapping Φ from R^p to a higher-dimensional feature space.
Nonlinear kernels: K(x, t) = (1 + x·t)^d, exp(−‖x − t‖²/(2σ²)), ...
e.g. For p = 2 and x = (x_1, x_2), Φ(x) = (1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1x_2) gives K(x, t) = (1 + x·t)².
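The p = 2 example above can be checked numerically: the explicit feature map Φ reproduces the polynomial kernel without ever forming the six-dimensional vectors in the kernelized algorithm. The test points below are arbitrary illustrative values:

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel, p = 2."""
    x1, x2 = x
    return (1.0, math.sqrt(2) * x1, math.sqrt(2) * x2,
            x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def poly_kernel(x, t):
    """K(x, t) = (1 + x.t)^2 computed directly in the input space."""
    return (1.0 + x[0] * t[0] + x[1] * t[1]) ** 2

x, t = (0.3, -1.2), (2.0, 0.5)
lhs = sum(a * b for a, b in zip(phi(x), phi(t)))  # Phi(x).Phi(t)
rhs = poly_kernel(x, t)
# lhs and rhs agree up to floating-point rounding
```

This is the point of the kernel trick: the d-dimensional inner product on the right costs O(p) regardless of the feature-space dimension.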
Kernels
Aizerman, Braverman, and Rozonoer (1964), Theoretical foundations of the potential function method in pattern recognition learning.
Kernel trick: replace the dot product in linear methods with a kernel ("kernelize"): kernel LDA, kernel PCA, the kernel k-means algorithm, ...
K(x, t) = Φ(x)·Φ(t): non-negative definite
Closely connected to reproducing kernels. This revelation came at the AMS-IMS-SIAM Summer Conference, Adaptive Selection of Statistical Models and Procedures, Mount Holyoke College, MA, June 1996 (G. Wahba's recollection).
Regularization in RKHS
Wahba (1990), Spline Models for Observational Data.
Find f(x) = Σ_{ν=1}^M d_ν φ_ν(x) + h(x) with h ∈ H_K minimizing
(1/n) Σ_{i=1}^n L(y_i, f(x_i)) + λ‖h‖²_{H_K}.
H_K: a reproducing kernel Hilbert space of functions defined on a domain, which can be arbitrary.
K(x, t) is a reproducing kernel if
i) K(x, ·) ∈ H_K for each x, and
ii) f(x) = ⟨K(x, ·), f(·)⟩_{H_K} for all f ∈ H_K (the reproducing property).
The null space is spanned by {φ_ν}_{ν=1}^M. J(f) = ‖h‖²_{H_K}: penalty.
SVM in general
Find f(x) = b + h(x) with h ∈ H_K minimizing
(1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ‖h‖²_{H_K}.
The null space: M = 1 and φ_1(x) = 1.
Linear SVM: H_K = {h(x) = β·x : β ∈ R^p} with K(x, t) = x·t and ‖h‖²_{H_K} = ‖β·x‖²_{H_K} = ‖β‖².
Representer Theorem
Kimeldorf and Wahba (1971), Some results on Tchebycheffian spline functions.
The minimizer f = Σ_{ν=1}^M d_ν φ_ν + h with h ∈ H_K of
(1/n) Σ_{i=1}^n L(y_i, f(x_i)) + λ‖h‖²_{H_K}
has a representation of the form
f̂(x) = Σ_{ν=1}^M d̂_ν φ_ν(x) + Σ_{i=1}^n ĉ_i K(x_i, x),
where the second sum is ĥ(x) and ‖ĥ‖²_{H_K} = Σ_{i,j} ĉ_i ĉ_j K(x_i, x_j).
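The finite representation above is what makes the RKHS problem computable: evaluating f̂ only requires the kernel at the training points. A sketch with a Gaussian kernel, where the coefficients c_i and b are placeholders (in practice they come from solving the regularized problem, e.g. the SVM dual):

```python
import math

def rbf_kernel(x, t, sigma=1.0):
    """Gaussian kernel K(x, t) = exp(-||x - t||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, t))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def evaluate(x, X_train, c, b, kernel=rbf_kernel):
    """f(x) = b + sum_i c_i K(x_i, x), the representer-theorem form."""
    return b + sum(ci * kernel(xi, x) for ci, xi in zip(c, X_train))

X_train = [(0.0, 0.0), (1.0, 1.0)]
c, b = [1.0, -1.0], 0.0          # illustrative coefficients only
fx = evaluate((0.0, 0.0), X_train, c, b)   # 1*K(x1,x) - 1*K(x2,x) = 1 - exp(-1)
```

Note that nothing here depends on the dimension of the feature space induced by the kernel, only on the n training points.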
Implications of the general treatment
- The kernelized SVM is a special case of the RKHS method.
- The K(x_i, ·) form basis functions for f.
- There is no restriction on the input domain or the form of the kernel function as long as the kernel is non-negative definite (by the Moore-Aronszajn theorem).
- Kernels can be defined on non-numerical domains such as strings of DNA bases, text, and graphs, expanding the realm of applications well beyond the Euclidean vector space.
Statistical properties
- Bayes risk consistent when the space generated by the kernel is sufficiently rich: Lin (2000), Zhang (AOS 2004), Bartlett et al. (JASA 2006).
- Population minimizer f* (limiting discriminant function):
For the binomial deviance L(y, f(x)) = log(1 + exp(−yf(x))): f*(x) = log[p_1(x)/(1 − p_1(x))]
For the hinge loss L(y, f(x)) = (1 − yf(x))_+: f*(x) = sign{p_1(x) − 1/2}
- Designed for prediction only; in general no probability estimates are available from f̂.
- Can be less efficient than probability modeling in reducing the error rate.
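The hinge population minimizer can be illustrated numerically: for a fixed x with p_1 = P(Y = 1 | X = x), the conditional hinge risk is p_1(1 − f)_+ + (1 − p_1)(1 + f)_+, and minimizing it over a grid recovers sign(p_1 − 1/2). The choice p_1 = 0.7 and the grid are illustrative:

```python
def hinge_risk(f, p1):
    """Conditional hinge risk E[(1 - Yf)_+ | X = x] for P(Y=1|x) = p1."""
    return p1 * max(1.0 - f, 0.0) + (1.0 - p1) * max(1.0 + f, 0.0)

p1 = 0.7
grid = [i / 100.0 for i in range(-300, 301)]        # f in [-3, 3]
f_star = min(grid, key=lambda f: hinge_risk(f, p1))  # minimized at f = +1
```

The minimizer sits exactly at +1 = sign(p_1 − 1/2): the risk decreases linearly on [−1, 1] (slope p_1 < 1/2 would reverse it) and increases for f > 1, which is why the hinge loss targets the Bayes rule itself rather than the conditional probability.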
SVM vs logistic regression
Figure: solid: the true 2p(x) − 1; dotted: the logistic-regression fit 2p̂_LR(x) − 1; dashed: the SVM fit f̂_SVM(x).
Extensions and further developments
- Extensions to the multiclass case
- Feature selection: make the embedding through the kernel explicit
- Kernel learning
- Efficient algorithms for large data sets when the penalty parameter λ is fixed
- Characterization of the entire solution path
- Beyond classification: regression, novelty detection, clustering, semi-supervised learning, ...
Reference
This talk is based on a book chapter: Lee (2010), Support Vector Machines for Classification: A Statistical Portrait, in Statistical Methods in Molecular Biology. See the references therein. A preliminary version of the manuscript is available on my webpage: http://www.stat.osu.edu/~yklee.