A Bahadur Representation of the Linear Support Vector Machine

A Bahadur Representation of the Linear Support Vector Machine
Yoonkyung Lee, Department of Statistics, The Ohio State University
October 7, 2008, Data Mining and Statistical Learning Study Group

Outline
- Support Vector Machines
- Statistical Properties of SVM
- Main Results (asymptotic analysis of the linear SVM)
- An Illustrative Example
- Discussion

Applications
- Handwritten digit recognition
- Cancer diagnosis with microarray data
- Text categorization

Classification
x = (x_1, \ldots, x_d) \in \mathbb{R}^d, \quad y \in \mathcal{Y} = \{1, \ldots, k\}
Learn a rule \phi : \mathbb{R}^d \to \mathcal{Y} from the training data \{(x_i, y_i), i = 1, \ldots, n\}, where the (x_i, y_i) are i.i.d. from P(X, Y).
The 0-1 loss function: L(y, \phi(x)) = I(y \neq \phi(x))

Methods of Regularization (Penalization)
Find f(x) \in \mathcal{F} minimizing
\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda J(f)
(empirical risk + penalty)
- \mathcal{F}: a class of candidate functions
- J(f): complexity of the model f
- \lambda > 0: a regularization parameter
Without the penalty J(f), the problem is ill-posed.

Maximum Margin Hyperplane
[Figure: a separating hyperplane wx + b = 0 with margin boundaries wx + b = 1 and wx + b = -1; the margin is 2/\|w\|.]

Support Vector Machines
Boser, Guyon, & Vapnik (1992); Vapnik (1995), The Nature of Statistical Learning Theory; a discussion paper (2006) in Statistical Science
y_i \in \{-1, 1\}: class labels in the binary case
Find f \in \mathcal{F} = \{f(x) = w^\top x + b \mid w \in \mathbb{R}^d, b \in \mathbb{R}\} minimizing
\frac{1}{n} \sum_{i=1}^n (1 - y_i f(x_i))_+ + \lambda \|w\|^2,
where \lambda is a regularization parameter.
Classification rule: \phi(x) = \mathrm{sign}(f(x))
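
To make the objective concrete, here is a minimal sketch (not the authors' code) that fits the linear SVM by subgradient descent on the regularized hinge-loss objective above; the toy data, step-size schedule, and value of lambda are illustrative assumptions.

```python
# Subgradient descent on (1/n) * sum_i (1 - y_i (w'x_i + b))_+ + lambda * ||w||^2.
import numpy as np

def fit_linear_svm(X, y, lam=0.01, n_iter=2000, lr=0.5):
    """Hinge-loss SVM by subgradient descent; y must take values in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(1, n_iter + 1):
        margin = y * (X @ w + b)
        active = margin < 1                               # points inside the margin or misclassified
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
        grad_b = -y[active].sum() / n
        step = lr / np.sqrt(t)                            # diminishing step size
        w -= step * grad_w
        b -= step * grad_b
    return w, b

# Toy example: two overlapping Gaussian classes in R^2 with means (+-1, +-1)
rng = np.random.default_rng(0)
n = 200
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * np.ones((n, 2)) + rng.standard_normal((n, 2))
w_hat, b_hat = fit_linear_svm(X, y)
print("w:", w_hat, "b:", b_hat)
```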

Hinge Loss
[Figure: the 0-1 loss [-t]_* and the hinge loss (1 - t)_+ plotted against the margin t = yf(x).]
(1 - yf(x))_+ is an upper bound of the misclassification loss function:
I(y \neq \phi(x)) = [-yf(x)]_* \le (1 - yf(x))_+,
where [t]_* = I(t \ge 0) and (t)_+ = \max\{t, 0\}.
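
A small numerical check (illustrative, not from the slides) that the hinge loss dominates the 0-1 loss over a grid of margin values t = yf(x):

```python
import numpy as np

t = np.linspace(-2, 2, 401)                  # margin values t = y f(x)
hinge = np.maximum(1 - t, 0.0)               # (1 - t)_+
zero_one = (t <= 0).astype(float)            # [-t]_* = I(-t >= 0) = I(t <= 0)
print("hinge >= 0-1 loss everywhere on the grid:", bool(np.all(hinge >= zero_one)))
```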

SVM in General
Find f(x) = b + h(x) with h \in \mathcal{H}_K (an RKHS) minimizing
\frac{1}{n} \sum_{i=1}^n (1 - y_i f(x_i))_+ + \lambda \|h\|^2_{\mathcal{H}_K}.
Linear SVM: \mathcal{H}_K = \{h(x) = w^\top x \mid w \in \mathbb{R}^d\} with
i) K(x, x') = x^\top x'
ii) \|h\|^2_{\mathcal{H}_K} = \|w^\top x\|^2_{\mathcal{H}_K} = \|w\|^2
Nonlinear SVM: K(x, x') = (1 + x^\top x')^d, \exp(-\|x - x'\|^2 / 2\sigma^2), \ldots

Statistical Properties of SVM
Fisher consistency (Lin, DM&KD 2002):
\arg\min_f E[(1 - Yf(X))_+ \mid X = x] = \mathrm{sign}(p(x) - 1/2),
where p(x) = P(Y = 1 \mid X = x), so the SVM approximates the Bayes decision rule \phi_B(x) = \mathrm{sign}(p(x) - 1/2).
Bayes risk consistency (Zhang, AOS 2004; Steinwart, IEEE IT 2005; Bartlett et al., JASA 2006):
R(\hat{f}_{SVM}) \to R(\phi_B) in probability, under a universal approximation condition on \mathcal{H}_K.
Rate of convergence (Steinwart et al., AOS 2007)
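
The Fisher-consistency claim can be checked numerically for a few assumed values of p(x) (an illustrative sketch, not from the slides) by minimizing the conditional hinge risk p(1 - f)_+ + (1 - p)(1 + f)_+ over a grid of f:

```python
import numpy as np

f_grid = np.linspace(-3, 3, 6001)
for p in (0.1, 0.3, 0.7, 0.9):
    # conditional hinge risk E[(1 - Yf)_+ | X = x] for P(Y = 1 | X = x) = p
    risk = p * np.maximum(1 - f_grid, 0) + (1 - p) * np.maximum(1 + f_grid, 0)
    f_star = f_grid[np.argmin(risk)]
    print(f"p = {p}: conditional-risk minimizer = {f_star:+.2f}, sign(p - 1/2) = {np.sign(p - 0.5):+.0f}")
```

For each p the grid minimizer lands at +1 or -1, matching sign(p - 1/2).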

Main Questions
Recursive Feature Elimination (Guyon et al., ML 2002): backward elimination of variables based on the fitted coefficients of the linear SVM.
- What is the statistical behavior of the coefficients?
- What determines their variances?
Study the asymptotic properties of the coefficients of the linear SVM.

Something New, Old, and Borrowed
- The hinge loss: not everywhere differentiable; no closed-form expression for the solution.
- Useful link: \mathrm{sign}(p(x) - 1/2), the population minimizer w.r.t. the hinge loss, is the median of Y at x.
- Asymptotics for least absolute deviation (LAD) estimators (Pollard, ET 1991)
- Convexity of the loss

Preliminaries
(X, Y): a pair of random variables with X \in \mathcal{X} \subset \mathbb{R}^d and Y \in \{1, -1\}
P(Y = 1) = \pi_+ and P(Y = -1) = \pi_-
Let f and g be the densities of X given Y = 1 and Y = -1.
With \tilde{x} = (1, x_1, \ldots, x_d)^\top and \beta = (b, w^\top)^\top, h(x; \beta) = w^\top x + b = \tilde{x}^\top \beta.
\hat{\beta}_{\lambda,n} = \arg\min_\beta \frac{1}{n} \sum_{i=1}^n (1 - y_i h(x_i; \beta))_+ + \lambda \|w\|^2

Population Version
L(\beta) = E[(1 - Yh(X; \beta))_+], \quad \beta^* = \arg\min_\beta L(\beta)
The gradient of L(\beta):
S(\beta) = -E(\psi(1 - Yh(X; \beta)) Y \tilde{X}), where \psi(t) = I(t \ge 0)
The Hessian matrix of L(\beta):
H(\beta) = E(\delta(1 - Yh(X; \beta)) \tilde{X} \tilde{X}^\top), where \delta is the Dirac delta function.

More on H(\beta)
The Hessian matrix of L(\beta): H(\beta) = E(\delta(1 - Yh(X; \beta)) \tilde{X} \tilde{X}^\top)
H_{j,k}(\beta) = E(\delta(1 - Yh(X; \beta)) X_j X_k) for 0 \le j, k \le d
  = \pi_+ \int_{\mathcal{X}} \delta(1 - b - w^\top x) x_j x_k f(x)\,dx + \pi_- \int_{\mathcal{X}} \delta(1 + b + w^\top x) x_j x_k g(x)\,dx.
For a function s on \mathcal{X}, define the Radon transform Rs of s for p \in \mathbb{R} and \xi \in \mathbb{R}^d as
(Rs)(p, \xi) = \int \delta(p - \xi^\top x) s(x)\,dx
(the integral of s over the hyperplane \xi^\top x = p). Then
H_{j,k}(\beta) = \pi_+ (Rf_{j,k})(1 - b, w) + \pi_- (Rg_{j,k})(1 + b, w),
where f_{j,k}(x) = x_j x_k f(x) and g_{j,k}(x) = x_j x_k g(x).

Regularity Conditions
(A1) The densities f and g are continuous and have finite second moments.
(A2) There exists a ball B(x_0, \delta_0) such that f(x) > C_1 and g(x) > C_1 for every x \in B(x_0, \delta_0).
(A3) For some 1 \le i \le d,
\int_{\mathcal{X}} I\{x_i \in G_i^-\}\, x_i g(x)\,dx < \int_{\mathcal{X}} I\{x_i \in F_i^-\}\, x_i f(x)\,dx, \quad \text{or} \quad \int_{\mathcal{X}} I\{x_i \in G_i^+\}\, x_i g(x)\,dx > \int_{\mathcal{X}} I\{x_i \in F_i^+\}\, x_i f(x)\,dx.
(When \pi_+ = \pi_-, this says that the class means are different.)
(A4) Let M_+ = \{x \in \mathcal{X} \mid \tilde{x}^\top \beta^* = 1\} and M_- = \{x \in \mathcal{X} \mid \tilde{x}^\top \beta^* = -1\}. There exist two subsets of M_+ and M_- on which the class densities f and g are bounded away from zero.

Bahadur Representation
Bahadur (1966), "A Note on Quantiles in Large Samples": a statistical estimator is approximated by a sum of independent variables plus a higher-order remainder.
Let \xi = F^{-1}(p) be the p-th quantile of the distribution F. For X_1, \ldots, X_n i.i.d. F, the sample p-th quantile is
\xi + \left[ \sum_{i=1}^n I(X_i > \xi) - n(1 - p) \right] / (n f(\xi)) + R_n,
where f(x) = F'(x).
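
A numerical illustration of this representation (an assumed example: standard-normal F and p = 0.75, not from the talk), showing how closely the linear approximation tracks the sample quantile:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, n = 0.75, 100_000
xi = norm.ppf(p)                                    # true quantile xi = F^{-1}(p)
x = rng.standard_normal(n)
sample_q = np.quantile(x, p)
# Bahadur approximation: xi + [ sum_i I(X_i > xi) - n(1 - p) ] / (n f(xi))
bahadur = xi + (np.sum(x > xi) - n * (1 - p)) / (n * norm.pdf(xi))
print("sample quantile :", sample_q)
print("Bahadur approx. :", bahadur)
print("remainder R_n   :", sample_q - bahadur)
```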

Bahadur-type Representation of the Linear SVM
Theorem. Suppose that (A1)-(A4) are met. For \lambda = o(n^{-1/2}), we have
\sqrt{n}(\hat{\beta}_{\lambda,n} - \beta^*) = \frac{1}{\sqrt{n}} H(\beta^*)^{-1} \sum_{i=1}^n \psi(1 - Y_i h(X_i; \beta^*)) Y_i \tilde{X}_i + o_P(1).
Recall that H(\beta^*) = E(\delta(1 - Yh(X; \beta^*)) \tilde{X} \tilde{X}^\top) and \psi(t) = I(t \ge 0).

Asymptotic Normality of \hat{\beta}_{\lambda,n}
Theorem. Suppose (A1)-(A4) are satisfied. For \lambda = o(n^{-1/2}),
\sqrt{n}(\hat{\beta}_{\lambda,n} - \beta^*) \to N(0, H(\beta^*)^{-1} G(\beta^*) H(\beta^*)^{-1})
in distribution, where G(\beta) = E(\psi(1 - Yh(X; \beta)) \tilde{X} \tilde{X}^\top).
Corollary. Under the same conditions as in the Theorem,
\sqrt{n}(h(x; \hat{\beta}_{\lambda,n}) - h(x; \beta^*)) \to N(0, \tilde{x}^\top H(\beta^*)^{-1} G(\beta^*) H(\beta^*)^{-1} \tilde{x})
in distribution.
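
One way to use the sandwich covariance in practice is to plug in empirical estimates of G and H. The sketch below is illustrative and not the paper's procedure: it assumes toy Gaussian data, uses scikit-learn's LinearSVC as a stand-in hinge-loss solver, and replaces the Dirac delta in H by a narrow Gaussian kernel of bandwidth h (an assumed smoothing device).

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 5000
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * np.ones((n, 2)) + rng.standard_normal((n, 2))    # means (+-1, +-1)

svc = LinearSVC(loss="hinge", C=10.0, dual=True, max_iter=100_000).fit(X, y)
beta = np.r_[svc.intercept_, svc.coef_.ravel()]                    # (b, w1, w2)

Xt = np.c_[np.ones(n), X]                                          # augmented x-tilde
u = 1.0 - y * (Xt @ beta)                                          # 1 - y h(x; beta)

# G(beta) = E[ I(1 - Yh >= 0) x~ x~' ], estimated by a sample average
G = (Xt[u >= 0].T @ Xt[u >= 0]) / n

# H(beta) = E[ delta(1 - Yh) x~ x~' ], with delta smoothed by a Gaussian kernel
h = 0.1
kern = np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
H = (Xt * kern[:, None]).T @ Xt / n

Hinv = np.linalg.inv(H)
cov = Hinv @ G @ Hinv
print("beta-hat                :", beta)
print("approx. asymptotic s.e. :", np.sqrt(np.diag(cov) / n))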

An Illustrative Example
Two multivariate normal distributions in \mathbb{R}^d with mean vectors \mu_f and \mu_g and a common covariance matrix \Sigma; \pi_+ = \pi_- = 1/2.
What is the relation between the Bayes decision boundary and the optimal hyperplane given by the SVM, h(x; \beta^*) = 0?
[Figure: two normal densities with means \mu_f and \mu_g.]

Example
The Bayes decision boundary (Fisher's LDA):
\{x : (\Sigma^{-1}(\mu_f - \mu_g))^\top (x - \tfrac{1}{2}(\mu_f + \mu_g)) = 0\}
The hyperplane determined by the SVM: \tilde{x}^\top \beta^* = 0 with S(\beta^*) = 0.
\beta^* balances the two classes within the margin:
E(\psi(1 - Yh(X; \beta^*)) Y \tilde{X}) = 0, i.e.,
P(h(X; \beta^*) \le 1 \mid Y = 1) = P(h(X; \beta^*) \ge -1 \mid Y = -1)
E(I\{h(X; \beta^*) \le 1\} X_j \mid Y = 1) = E(I\{h(X; \beta^*) \ge -1\} X_j \mid Y = -1)

Example
Direct calculation shows that
\beta^* = C(d_\Sigma(\mu_f, \mu_g)) \begin{bmatrix} -\tfrac{1}{2}(\mu_f + \mu_g)^\top \\ I_d \end{bmatrix} \Sigma^{-1}(\mu_f - \mu_g),
where d_\Sigma(\mu_f, \mu_g) is the Mahalanobis distance between the two distributions.
- The linear SVM is equivalent to Fisher's LDA.
- The assumptions (A1)-(A4) are satisfied, so the main theorem applies.
- Consider d = 1, \mu_f + \mu_g = 0, \sigma = 1, and d_\Sigma(\mu_f, \mu_g) = \mu_f - \mu_g.
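
The equivalence with Fisher's LDA can be checked by simulation. The sketch below uses assumed means, an assumed non-identity covariance, and scikit-learn's LinearSVC as a stand-in hinge-loss solver, and compares the fitted SVM slope direction with \Sigma^{-1}(\mu_f - \mu_g):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
mu_f, mu_g = np.array([1.0, 0.5]), np.array([-1.0, -0.5])
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
L = np.linalg.cholesky(Sigma)

n = 20_000
y = rng.choice([-1.0, 1.0], size=n)
means = np.where(y[:, None] > 0, mu_f, mu_g)
X = means + rng.standard_normal((n, 2)) @ L.T          # common covariance Sigma

w_svm = LinearSVC(loss="hinge", C=1.0, dual=True, max_iter=100_000).fit(X, y).coef_.ravel()
w_lda = np.linalg.solve(Sigma, mu_f - mu_g)

# Compare directions (normalized to unit length)
print("SVM slope direction:", w_svm / np.linalg.norm(w_svm))
print("LDA direction      :", w_lda / np.linalg.norm(w_lda))
```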

Distance and Margins
[Figure: class-conditional densities and margins for Mahalanobis distances d = 0.5, 1, 3, and 6.]

Asymptotic Variance
[Figure: panels (a) Intercept and (b) Slope. The asymptotic variabilities of the intercept and the slope for the optimal hyperplane as a function of the Mahalanobis distance.]

Bivariate Normal Example
\mu_f = (1, 1), \mu_g = (-1, -1), and \Sigma = \mathrm{diag}(1, 1)
d_\Sigma(\mu_f, \mu_g) = 2\sqrt{2}, and the Bayes error rate is 0.07865.
Find (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2) = \arg\min_\beta \frac{1}{n} \sum_{i=1}^n (1 - y_i \tilde{x}_i^\top \beta)_+

Table: Averages of estimated optimal coefficients over 1000 replicates.
Estimate          n = 100    n = 200    n = 500    Optimal coefficients
\hat{\beta}_0      0.0006    -0.0013     0.0022    0
\hat{\beta}_1      0.7709     0.7450     0.7254    0.7169
\hat{\beta}_2      0.7749     0.7459     0.7283    0.7169
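
A simulation along the lines of the table can be sketched as follows. This is illustrative only: it uses 200 replicates instead of 1000 and scikit-learn's LinearSVC with a large C as a stand-in for the unpenalized hinge-loss minimizer; the replicate averages should be comparable to the reported values, with the optimal coefficients being (0, 0.7169, 0.7169).

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
mu_f, mu_g = np.array([1.0, 1.0]), np.array([-1.0, -1.0])

def one_fit(n):
    y = rng.choice([-1.0, 1.0], size=n)
    X = np.where(y[:, None] > 0, mu_f, mu_g) + rng.standard_normal((n, 2))
    svc = LinearSVC(loss="hinge", C=100.0, dual=True, max_iter=200_000).fit(X, y)
    return np.r_[svc.intercept_, svc.coef_.ravel()]        # (beta0, beta1, beta2)

for n in (100, 200, 500):
    betas = np.array([one_fit(n) for _ in range(200)])
    print(f"n = {n}: average (beta0, beta1, beta2) =", betas.mean(axis=0).round(4))
```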

Sampling Distributions of \hat{\beta}_0 and \hat{\beta}_1
[Figure: panels (a) \hat{\beta}_0 and (b) \hat{\beta}_1. Estimated sampling distributions of \hat{\beta}_0 and \hat{\beta}_1, overlaid with the asymptotic densities.]

Type I Error Rates
[Figure: curves for d = 6, 12, 18, 24 against the sample size n. The median values of the type I error rates in variable selection when \mu_f = (1_{d/2}, 0_{d/2}), \mu_g = 0_d, and \Sigma = I_d.]

Concluding Remarks
- Examined the asymptotic properties of the coefficients of variables in the linear SVM.
- Established a Bahadur-type representation of the coefficients.
- The margins of the optimal hyperplane and the underlying probability distribution characterize the statistical behavior of the coefficients.
- Variable selection for the SVM in the framework of hypothesis testing.
- For practical applications, consistent estimators of G(\beta^*) and H(\beta^*) are needed.
- Extension of the SVM asymptotics to the nonlinear case.
- Explore a different scenario where d also grows with n.

Reference
Koo, J.-Y., Lee, Y., Kim, Y., and Park, C. (2008). A Bahadur Representation of the Linear Support Vector Machine. Journal of Machine Learning Research. Available at www.stat.osu.edu/~yklee or http://www.jmlr.org/.