Support Vector Machines for Classification: A Statistical Portrait

Support Vector Machines for Classification: A Statistical Portrait
Yoonkyung Lee, Department of Statistics, The Ohio State University
May 27, 2011, The Spring Conference of the Korean Statistical Society, KAIST, Daejeon, Korea

Handwritten digit recognition
Figure: 16 × 16 grayscale images scanned from postal envelopes, courtesy of Hastie, Tibshirani, & Friedman (2001). Cortes & Vapnik (1995) applied the SVM to these data and demonstrated improved accuracy over decision trees and neural networks.

Classification
Training data $\{(x_i, y_i),\ i = 1, \ldots, n\}$, with $x = (x_1, \ldots, x_p) \in \mathbb{R}^p$ and $y \in \mathcal{Y} = \{1, \ldots, k\}$.
Learn a rule $\phi: \mathbb{R}^p \to \mathcal{Y}$ from the training data, which can be generalized to novel cases.
[Figure: scatter plot of two classes of training points in the $(x_1, x_2)$ plane.]

The Bayes decision rule
The 0-1 loss function: $L(y, \phi(x)) = I(y \neq \phi(x))$
$(X, Y)$: a random sample from $P(x, y)$, and $p_j(x) = P(Y = j \mid X = x)$
The rule that minimizes the risk $R(\phi) = E\,L(Y, \phi(X)) = P(Y \neq \phi(X))$:
$$\phi_B(x) = \arg\max_{j \in \mathcal{Y}} p_j(x)$$
The Bayes error rate: $R^* = R(\phi_B) = 1 - E\big(\max_j p_j(X)\big)$
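For concreteness, here is a minimal numerical sketch of the Bayes rule and Bayes error (my own illustration in Python, not part of the talk), assuming a hypothetical two-class problem with equal priors and class-conditional densities $N(-1, 1)$ and $N(+1, 1)$:

```python
# Minimal sketch (assumed toy setup, not from the talk): Bayes rule and
# Bayes error for two classes with known class-conditional densities.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
prior = np.array([0.5, 0.5])            # P(Y = 0), P(Y = 1) (labels indexed 0, 1 here)
means, sd = np.array([-1.0, 1.0]), 1.0  # class-conditional N(means[j], sd^2)

def posterior(x):
    """p_j(x) = P(Y = j | X = x) via Bayes' theorem."""
    lik = np.stack([norm.pdf(x, m, sd) for m in means], axis=-1)
    joint = lik * prior
    return joint / joint.sum(axis=-1, keepdims=True)

def bayes_rule(x):
    """phi_B(x) = argmax_j p_j(x)."""
    return posterior(x).argmax(axis=-1)

# R* = 1 - E[max_j p_j(X)], estimated by Monte Carlo; it matches the
# empirical error rate of phi_B on the same draws.
y = rng.choice(2, size=200_000, p=prior)
x = rng.normal(means[y], sd)
print("1 - E max_j p_j(X):", 1 - posterior(x).max(axis=-1).mean())
print("error rate of phi_B:", np.mean(bayes_rule(x) != y))
```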

Two approaches to classification
Probability-based plug-in rules (soft classification): $\hat\phi(x) = \arg\max_{j \in \mathcal{Y}} \hat p_j(x)$, e.g. logistic regression, density estimation (LDA, QDA), ...
$$R(\hat\phi) - R^* \leq 2 E \max_{j \in \mathcal{Y}} |p_j(X) - \hat p_j(X)|$$
Error minimization (hard classification): find $\phi \in \mathcal{F}$ minimizing
$$R_n(\phi) = \frac{1}{n} \sum_{i=1}^n L(y_i, \phi(x_i)),$$
e.g. large margin classifiers (support vector machine, boosting, ...)

Discriminant function
It is often much easier to find a real-valued discriminant function $f(x)$ first and obtain a classification rule $\phi(x)$ through $f$. For instance, in the binary setting $\mathcal{Y} = \{-1, +1\}$ (symmetric labels):
Classification rule: $\phi(x) = \mathrm{sign}(f(x))$ for a discriminant function $f$
Classification boundary: $\{x : f(x) = 0\}$
$y f(x) > 0$ indicates a correct decision for $(x, y)$.

Linearly separable case
[Figure: two linearly separable classes of points in the $(x_1, x_2)$ plane.]

Perceptron algorithm
Rosenblatt (1958), The perceptron: A probabilistic model for information storage and organization in the brain.
Find a separating hyperplane by sequentially updating $\beta$ and $\beta_0$ of a linear classifier $\phi(x) = \mathrm{sign}(\beta^\top x + \beta_0)$.
Step 1. Initialize $\beta^{(0)} = 0$ and $\beta_0^{(0)} = 0$.
Step 2. While there is a misclassified point, i.e. $y_i(\beta^{(m-1)\top} x_i + \beta_0^{(m-1)}) \leq 0$, for $m = 1, 2, \ldots$, repeat:
- Choose a misclassified point $(x_i, y_i)$.
- Update $\beta^{(m)} = \beta^{(m-1)} + y_i x_i$ and $\beta_0^{(m)} = \beta_0^{(m-1)} + y_i$.
(Novikoff) The algorithm terminates within $(R^2 + 1)(b^2 + 1)/\delta^2$ iterations, where $R = \max_i \|x_i\|$ and $\delta = \min_i y_i(w^\top x_i + b) > 0$ for some $w \in \mathbb{R}^p$ with $\|w\| = 1$ and $b \in \mathbb{R}$.
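The updates above translate directly into code. The following is a minimal sketch in Python/NumPy (my own illustration; the toy data, the max_iter cap, and the choice of the first misclassified point are assumptions), for labels in $\{-1, +1\}$ and linearly separable data:

```python
# Minimal perceptron sketch following the slide's updates.  The max_iter
# cap is added because the loop only terminates on separable data.
import numpy as np

def perceptron(X, y, max_iter=10_000):
    """X: (n, p) inputs, y: labels in {-1, +1}. Returns (beta, beta0)."""
    beta, beta0 = np.zeros(X.shape[1]), 0.0          # Step 1: initialize at 0
    for _ in range(max_iter):
        margins = y * (X @ beta + beta0)
        misclassified = np.where(margins <= 0)[0]    # y_i (beta'x_i + beta0) <= 0
        if misclassified.size == 0:                  # separating hyperplane found
            return beta, beta0
        i = misclassified[0]                         # pick a misclassified point
        beta = beta + y[i] * X[i]                    # beta^(m) = beta^(m-1) + y_i x_i
        beta0 = beta0 + y[i]                         # beta0^(m) = beta0^(m-1) + y_i
    raise RuntimeError("no separating hyperplane found within max_iter updates")

# Toy separable data: class +1 around (1, 1), class -1 around (-1, -1).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1, 0.3, (20, 2)), rng.normal(-1, 0.3, (20, 2))])
y = np.repeat([1, -1], 20)
beta, beta0 = perceptron(X, y)
print("training errors:", np.sum(y * (X @ beta + beta0) <= 0))
```

On non-separable data the loop never finds a separating hyperplane, which is one motivation for the margin-based formulation on the next slides.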

Optimal separating hyperplane
[Figure: a separating hyperplane $\beta^\top x + \beta_0 = 0$ with margin boundaries $\beta^\top x + \beta_0 = \pm 1$; the margin between them is $2/\|\beta\|$.]

Support Vector Machines
Boser, Guyon, & Vapnik (1992), A training algorithm for optimal margin classifiers.
Vapnik (1995), The Nature of Statistical Learning Theory.
Find the separating hyperplane with the maximum margin: $f(x) = \beta^\top x + \beta_0$ minimizing $\|\beta\|^2$ subject to $y_i f(x_i) \geq 1$ for all $i = 1, \ldots, n$.
Classification rule: $\phi(x) = \mathrm{sign}(f(x))$

Why large margin?
Vapnik's justification for a large margin:
- The complexity of separating hyperplanes is inversely related to the margin.
- Algorithms that maximize the margin can be expected to produce lower test error rates.
A form of regularization: e.g. ridge regression, LASSO, smoothing splines, Tikhonov regularization.

Non-separable case
Relax the separability condition to $y_i f(x_i) \geq 1 - \xi_i$ by introducing slack variables $\xi_i \geq 0$ (a common technique in constrained optimization).
Take $\xi_i$ (proportional to the distance of $x_i$ from $y f(x) = 1$) as a loss.
Find $f(x) = \beta^\top x + \beta_0$ minimizing
$$\frac{1}{n} \sum_{i=1}^n (1 - y_i f(x_i))_+ + \frac{\lambda}{2} \|\beta\|^2$$
Hinge loss: $L(y, f(x)) = (1 - y f(x))_+$, where $(t)_+ = \max(t, 0)$.
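The slide after next gives the quadratic-programming computation actually used for the SVM; purely to make the objective above concrete, here is a rough subgradient-descent sketch (my own illustration in Python/NumPy; the toy data, step size, and iteration count are assumptions) that minimizes the averaged hinge loss plus the ridge penalty directly:

```python
# Rough subgradient descent on
#   (1/n) sum_i (1 - y_i (beta'x_i + beta0))_+ + (lambda/2) ||beta||^2
# -- an illustration of the objective, not the QP solver used in practice.
import numpy as np

def linear_svm_subgradient(X, y, lam=0.1, lr=0.05, n_iter=2000):
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = y * (X @ beta + beta0)
        active = margins < 1                      # points with positive hinge loss
        # Subgradient of the objective with respect to (beta, beta0)
        g_beta = -(y[active][:, None] * X[active]).sum(axis=0) / n + lam * beta
        g_beta0 = -y[active].sum() / n
        beta -= lr * g_beta
        beta0 -= lr * g_beta0
    return beta, beta0

# Usage on toy data with labels in {-1, +1}:
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(1, 0.7, (50, 2)), rng.normal(-1, 0.7, (50, 2))])
y = np.repeat([1.0, -1.0], 50)
beta, beta0 = linear_svm_subgradient(X, y)
print("training error:", np.mean(y * (X @ beta + beta0) <= 0))
```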

Hinge loss
Figure: plotted against $t = y f(x)$, the hinge loss $(1 - y f(x))_+$ is a convex upper bound of the misclassification loss $I(y \neq \phi(x)) = [-y f(x)]_* \leq (1 - y f(x))_+$, where $[t]_* = I(t \geq 0)$ and $(t)_+ = \max\{t, 0\}$.
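A quick numerical check of the bound in the caption (my own illustration; the grid of $t$ values is arbitrary):

```python
# The hinge loss (1 - t)_+ dominates the misclassification loss
# I(t <= 0) at every t = y f(x).
import numpy as np

t = np.linspace(-2, 2, 401)                 # t = y f(x)
hinge = np.maximum(1 - t, 0.0)              # (1 - t)_+
zero_one = (t <= 0).astype(float)           # [-t]_* = I(t <= 0)
print(np.all(hinge >= zero_one))            # True: convex upper bound
```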

Remarks on hinge loss
Originates from the separability condition as an inequality and its relaxation.
Taking it as a negative log-likelihood would imply a very unusual probability model.
Yields a robust method compared to logistic regression and boosting.
The singularity at 1 leads to a sparse solution.

Computation: quadratic programming
Primal problem: minimize with respect to $\beta_0$, $\beta$, and $\xi_i$
$$\frac{1}{n} \sum_{i=1}^n \xi_i + \frac{\lambda}{2} \|\beta\|^2$$
subject to $y_i(\beta^\top x_i + \beta_0) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for $i = 1, \ldots, n$.
Dual problem: maximize with respect to $\alpha_i$ (Lagrange multipliers)
$$\sum_{i=1}^n \alpha_i - \frac{1}{2 n \lambda} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j$$
subject to $0 \leq \alpha_i \leq 1$ and $\sum_{i=1}^n \alpha_i y_i = 0$ for $i = 1, \ldots, n$.
$\hat\beta = \frac{1}{n \lambda} \sum_{i=1}^n \hat\alpha_i y_i x_i$ (from the KKT conditions)
Support vectors: data points with $\hat\alpha_i > 0$
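As a usage sketch (an assumption on my part; the talk does not tie itself to any software), scikit-learn's SVC solves an equivalent dual problem, with its cost parameter C playing the role of $1/(n\lambda)$ in the formulation above, and the fitted object exposes the support vectors and dual coefficients:

```python
# Fitting a linear SVM and inspecting the support vectors via scikit-learn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(1, 0.8, (100, 2)), rng.normal(-1, 0.8, (100, 2))])
y = np.repeat([1, -1], 100)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors per class:", clf.n_support_)
print("beta_hat:", clf.coef_.ravel(), " beta0_hat:", clf.intercept_[0])
# dual_coef_ stores alpha_i * y_i for the support vectors only; all other
# training points have alpha_i = 0 and drop out of the fitted rule.
print("dual coefficients (alpha_i * y_i):", clf.dual_coef_.ravel()[:5], "...")
```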

Operational properties
The SVM classification rule depends on the support vectors only (sparsity).
The sparsity leads to efficient data reduction and fast evaluation at the testing phase.
It can handle high-dimensional data, even when $p \gg n$, as the solution depends on $x$ only through the inner products $x_i^\top x_j$ in the dual formulation.
One needs to solve a quadratic programming problem of size $n$.

Nonlinear SVM
Linear SVM solution: $f(x) = \sum_{i=1}^n c_i (x_i^\top x) + b$
Replace the Euclidean inner product $x^\top t$ with $K(x, t) = \Phi(x)^\top \Phi(t)$ for a mapping $\Phi$ from $\mathbb{R}^p$ to a higher-dimensional feature space.
Nonlinear kernels: $K(x, t) = (1 + x^\top t)^d$, $\exp(-\|x - t\|^2 / 2\sigma^2)$, ...
e.g. For $p = 2$ and $x = (x_1, x_2)$, $\Phi(x) = (1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, x_2^2, \sqrt{2} x_1 x_2)$ gives $K(x, t) = (1 + x^\top t)^2$.
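The degree-2 example on this slide can be checked numerically; a short sketch (the particular vectors x and t are arbitrary):

```python
# Verify that the explicit feature map Phi gives the same value as the
# degree-2 polynomial kernel K(x, t) = (1 + x.t)^2 for p = 2.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, t = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(phi(x) @ phi(t))          # inner product in the feature space
print((1 + x @ t) ** 2)         # polynomial kernel: same value
```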

Kernels
Aizerman, Braverman, and Rozonoer (1964), Theoretical foundations of the potential function method in pattern recognition learning.
Kernel trick: replace the dot product in linear methods with a kernel ("kernelize"): kernel LDA, kernel PCA, kernel k-means algorithm, ...
$K(x, t) = \Phi(x)^\top \Phi(t)$: non-negative definite
Closely connected to reproducing kernels. This revelation came at the AMS-IMS-SIAM Summer Conference, Adaptive Selection of Statistical Models and Procedures, Mount Holyoke College, MA, June 1996 (G. Wahba's recollection).

Regularization in RKHS
Wahba (1990), Spline Models for Observational Data.
Find $f(x) = \sum_{\nu=1}^M d_\nu \phi_\nu(x) + h(x)$ with $h \in \mathcal{H}_K$ minimizing
$$\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda \|h\|^2_{\mathcal{H}_K}.$$
$\mathcal{H}_K$: a reproducing kernel Hilbert space of functions defined on a domain, which can be arbitrary.
$K(x, t)$ is a reproducing kernel if
i) $K(x, \cdot) \in \mathcal{H}_K$ for each $x$
ii) $f(x) = \langle K(x, \cdot), f(\cdot) \rangle_{\mathcal{H}_K}$ for all $f \in \mathcal{H}_K$ (the reproducing property)
The null space is spanned by $\{\phi_\nu\}_{\nu=1}^M$.
$J(f) = \|h\|^2_{\mathcal{H}_K}$: penalty

SVM in general
Find $f(x) = b + h(x)$ with $h \in \mathcal{H}_K$ minimizing
$$\frac{1}{n} \sum_{i=1}^n (1 - y_i f(x_i))_+ + \lambda \|h\|^2_{\mathcal{H}_K}.$$
The null space: $M = 1$ and $\phi_1(x) = 1$
Linear SVM: $\mathcal{H}_K = \{h(x) = \beta^\top x \mid \beta \in \mathbb{R}^p\}$ with $K(x, t) = x^\top t$ and $\|h\|^2_{\mathcal{H}_K} = \|\beta^\top x\|^2_{\mathcal{H}_K} = \|\beta\|^2$

Representer Theorem
Kimeldorf and Wahba (1971), Some results on Tchebycheffian spline functions.
The minimizer $f = \sum_{\nu=1}^M d_\nu \phi_\nu + h$ with $h \in \mathcal{H}_K$ of
$$\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda \|h\|^2_{\mathcal{H}_K}$$
has a representation of the form
$$\hat f(x) = \sum_{\nu=1}^M \hat d_\nu \phi_\nu(x) + \underbrace{\sum_{i=1}^n \hat c_i K(x_i, x)}_{\hat h(x)},$$
with $\|\hat h\|^2_{\mathcal{H}_K} = \sum_{i,j} \hat c_i \hat c_j K(x_i, x_j)$.
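A small sketch of what this representation looks like in code with a Gaussian kernel (my own illustration; the coefficients c and offset b are arbitrary placeholders, not a fitted SVM): the fitted function is a finite kernel expansion over the training inputs, and the penalty is a quadratic form in the coefficients.

```python
# Representer-theorem form: f(x) = b + sum_i c_i K(x_i, x) with
# ||h||^2 = c' K c, using a Gaussian kernel and placeholder coefficients.
import numpy as np

def gauss_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(4)
X_train = rng.normal(size=(10, 2))
c, b = rng.normal(size=10), 0.5          # placeholder c_i and offset b (not fitted)

def f(X_new):
    """Evaluate f(x) = b + sum_i c_i K(x_i, x) at new inputs."""
    return b + gauss_kernel(X_new, X_train) @ c

K = gauss_kernel(X_train, X_train)
penalty = c @ K @ c                      # ||h||^2_{H_K} = sum_{i,j} c_i c_j K(x_i, x_j)
print(f(rng.normal(size=(3, 2))), penalty)
```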

Implications of the general treatment
The kernelized SVM is a special case of the RKHS method.
The $K(x_i, \cdot)$ form basis functions for $f$.
There is no restriction on the input domain or the form of a kernel function as long as the kernel is non-negative definite (by the Moore-Aronszajn theorem).
Kernels can be defined on non-numerical domains such as strings of DNA bases, text, and graphs, expanding the realm of applications well beyond the Euclidean vector space.

Statistical properties
Bayes risk consistent when the function space generated by a kernel is sufficiently rich: Lin (2000), Zhang (AOS 2004), Bartlett et al. (JASA 2006).
Population minimizer $f^*$ (limiting discriminant function):
- Binomial deviance $L(y, f(x)) = \log(1 + \exp(-y f(x)))$: $f^*(x) = \log \frac{p_1(x)}{1 - p_1(x)}$
- Hinge loss $L(y, f(x)) = (1 - y f(x))_+$: $f^*(x) = \mathrm{sign}\{p_1(x) - 1/2\}$
The SVM is designed for prediction only, and no probability estimates are available from $\hat f$ in general.
It can be less efficient than probability modeling in reducing the error rate.
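To see the contrast concretely, a short sketch (assuming a hypothetical true probability $p_1(x) = 1/(1 + e^{-2x})$, chosen only for illustration) evaluates the two population minimizers: the deviance recovers the full logit, while the hinge loss retains only the sign of $p_1(x) - 1/2$.

```python
# Population minimizers for a hypothetical p1(x): logit vs. sign only.
import numpy as np

def p1(x):                                  # assumed true P(Y = +1 | X = x), illustrative
    return 1 / (1 + np.exp(-2 * x))

x = np.linspace(-2, 2, 4)
f_deviance = np.log(p1(x) / (1 - p1(x)))    # f*(x) = log{p1/(1 - p1)} = 2x here
f_hinge = np.sign(p1(x) - 0.5)              # f*(x) = sign{p1(x) - 1/2}: +/-1 only
print(np.round(f_deviance, 2))              # approximately -4, -1.33, 1.33, 4
print(f_hinge)                              # -1, -1, 1, 1
```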

SVM vs logistic regression
[Figure: comparison over $x$. Solid: true probability $2p(x) - 1$; dotted: logistic regression $2\hat p_{LR}(x) - 1$; dashed: SVM $\hat f_{SVM}(x)$.]

Extensions and further developments
Extensions to the multiclass case
Feature selection: make the embedding through the kernel explicit
Kernel learning
Efficient algorithms for large data sets when the penalty parameter $\lambda$ is fixed
Characterization of the entire solution path
Beyond classification: regression, novelty detection, clustering, semi-supervised learning, ...

Reference
This talk is based on a book chapter: Lee (2010), Support Vector Machines for Classification: A Statistical Portrait, in Statistical Methods in Molecular Biology. See the references therein. A preliminary version of the manuscript is available on my webpage: http://www.stat.osu.edu/~yklee.