A Magic CV Theory for Large-Margin Classifiers

A Magic CV Theory for Large-Margin Classifiers. Hui Zou, School of Statistics, University of Minnesota. June 30, 2018. Joint work with Boxiang Wang.

Outline: 1. Background; 2. Magic CV formula; 3. Magic support vector machines; 4. Magic CV applications in kernel learning theory; 5. Numerical studies.

Binary classification. Observations: a collection of i.i.d. training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Predictors (covariates, input vector): $x_i = (x_{i1}, \ldots, x_{ip})$. Response (class label, output variable): $y_i \in \{-1, 1\}$. Build a model $\hat{f}$ using the training data; given any new input $x$, we predict the class label $\hat{y} = \hat{f}(x)$.

Classification toolbox: linear discriminant analysis, logistic regression, kernel density classifier, naïve Bayes classifier, neural network, boosting ensembles, random forest, support vector machine (SVM), ... Experiment: Fernández et al. (2014, JMLR) compared 179 commonly used classifiers on 121 UCI data sets. Conclusion: the best classifiers are random forest, kernel SVM, neural networks, and boosting ensembles.

Support vector machine, the non-separable case. The distance $y_i(\omega_0 + x_i^T \omega)$ may be negative, so we introduce slack variables $\eta_i \ge 0$ and redefine the distance as $d_i = y_i(\omega_0 + x_i^T \omega) + \eta_i$ such that $d_i \ge 0$ for all $i$. [Figure: the two classes ($y = +1$, $y = -1$) in the $(x_1, x_2)$ plane with the separating hyperplane, its normal direction $\omega$, and the slack $\eta_i$ of a point $x_i$ on the wrong side of its margin.]

Support vector machine. SVM (Vapnik, 1995):
$$\max_{\omega_0, \omega} \ \min_i d_i \quad \text{subject to} \quad d_i = y_i(\omega_0 + x_i^T \omega) + \eta_i \ge 0 \ \forall i, \quad \eta_i \ge 0 \ \forall i, \quad \omega^T \omega = 1, \quad \sum_i \eta_i \le t.$$
The tuning parameter $t$ controls the extent of the slack variables.

Computing the SVM in the dual space. Lagrange dual function:
$$\max_\alpha L_D = \max_\alpha \left[ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{i'=1}^n \alpha_i \alpha_{i'} y_i y_{i'} \langle x_i, x_{i'} \rangle \right]$$
subject to $\sum_{i=1}^n \alpha_i y_i = 0$ and $0 \le \alpha_i \le \gamma$ for all $i$. The solution has the form
$$\hat{f}(x) = \hat{\beta}_0 + \sum_{i=1}^n \hat{\alpha}_i y_i \langle x_i, x \rangle.$$
The coefficients $\hat{\alpha}_i$ are nonzero only for the observations that are support vectors.

Kernel trick in the dual space. Lagrange dual and solution with a kernel function:
$$\max_\alpha L_D = \max_\alpha \left[ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{i'=1}^n \alpha_i \alpha_{i'} y_i y_{i'} K(x_i, x_{i'}) \right]$$
subject to $\sum_{i=1}^n \alpha_i y_i = 0$ and $0 \le \alpha_i \le \gamma$ for all $i$, with solution
$$\hat{f}(x) = \hat{\beta}_0 + \sum_{i=1}^n \hat{\alpha}_i y_i K(x_i, x).$$
Gaussian kernel: $K(x, x') = \exp(-\sigma \|x - x'\|_2^2)$.
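
As a small illustration of these formulas, the sketch below computes a Gaussian kernel matrix and evaluates the dual-form decision function $\hat{f}(x) = \hat{\beta}_0 + \sum_i \hat{\alpha}_i y_i K(x_i, x)$ in NumPy. The function names are ours, and `alpha_hat`, `beta0_hat` stand for coefficients produced by any dual solver; this is not code from the talk.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gaussian kernel matrix: K[i, j] = exp(-sigma * ||X[i] - Z[j]||^2)."""
    sq_dist = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Z ** 2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-sigma * np.maximum(sq_dist, 0.0))

def dual_decision_function(x_new, X_train, y_train, alpha_hat, beta0_hat, sigma=1.0):
    """Evaluate f_hat(x) = beta0_hat + sum_i alpha_hat[i] * y_i * K(x_i, x)."""
    k = gaussian_kernel(X_train, np.atleast_2d(x_new), sigma).ravel()
    return beta0_hat + np.sum(alpha_hat * y_train * k)
```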

State-of-the-art SVM solvers: the interior point method (example: R package kernlab) and sequential minimal optimization (Platt, 1999; Osuna et al., 1997; Keerthi et al., 2001; Fan et al., 2005), as in the LIBSVM library and the R package e1071.

Tuning the SVM. The kernel SVM has high prediction accuracy, but its generalization error depends on the choice of the tuning parameter. Two tasks: (1) model comparison/selection, e.g., choosing the tuning parameter for a procedure; (2) model assessment, estimating the generalization error of the final model. Cross-validation is perhaps the simplest and most widely used tool.

Cross-validation

$V$-fold cross-validation. [Diagram: the data are split into $V$ folds; in turn each fold serves as the validation set while the remaining folds are used for training.] Cross-validation error: $\mathrm{CV}(\lambda) = \frac{1}{V} \sum_{v=1}^V L\big(Y_v, \hat{f}_\lambda^{[-v]}(X_v)\big)$. Tuning parameter: $\hat{\lambda} = \arg\min_\lambda \mathrm{CV}(\lambda)$. Generalization error: $\mathrm{Err}(\hat{f}_\lambda) \approx \mathrm{CV}(\lambda)$. Leave-one-out CV (LOOCV): $V = n$. Ten-fold CV: $V = 10$.
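
A generic $V$-fold CV loop written out as a short NumPy sketch; `fit` and `loss` are placeholder callables for whatever learner and loss are being tuned (they are not part of the talk), and the random fold allocation plays the role of $\tau$.

```python
import numpy as np

def v_fold_cv_error(X, y, fit, loss, lam, V=10, seed=0):
    """CV(lambda): average validation loss over V folds.

    fit(X_train, y_train, lam) must return a prediction function;
    loss(y_true, y_pred) must return the average loss on a fold.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    fold = rng.permutation(n) % V            # random fold allocation tau(i)
    fold_errors = []
    for v in range(V):
        val = fold == v
        f_hat = fit(X[~val], y[~val], lam)   # f_hat^{[-v]}: trained without fold v
        fold_errors.append(loss(y[val], f_hat(X[val])))
    return float(np.mean(fold_errors))
```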

LOOCV or ten-fold CV? LOOCV is an almost unbiased estimator of the true generalization error, i.e., it has small bias. LOOCV is deterministic, while ten-fold CV is random with respect to the training/validation splits. LOOCV is computationally expensive because it requires $n$ model fits. LOOCV is often claimed to have larger variance than ten-fold CV; this last statement is quite popular, but it is not generally true.

[Figure: mean and variance of the classification error as functions of $\log(\lambda)$ for LOOCV, ten-fold, five-fold, and two-fold CV, each compared with the true error.] (1) LOOCV has almost no bias in estimating the generalization error. (2) LOOCV has variance no larger than the other $V$-fold CVs.

Magic CV Formula

LOOCV for regression. Model: $y = f(x) + \epsilon$. Estimate $f$ using regularization:
$$\hat{f}_\lambda = \arg\min_f \left[ \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda P(f) \right].$$
Examples: ridge regression, $f(x) = x^T \beta$, $P(f) = \|\beta\|_2^2$; smoothing spline, $P(f) = \int f''(u)^2 \, du$. LOOCV estimate:
$$\hat{f}_\lambda^{[-v]} = \arg\min_f \frac{1}{n} \sum_{i=1, i \ne v}^n (y_i - f(x_i))^2 + \lambda P(f).$$
LOOCV error: $\mathrm{LOOCV}(\lambda) = \frac{1}{n} \sum_{i=1}^n \big(y_i - \hat{f}_\lambda^{[-i]}(x_i)\big)^2$.

Magic leave-one-out lemma for regression (Craven and Wahba, 1979). Suppose that $\hat{f}_\lambda(x_i) = H_i y$ is a linear smoother and is self-stable. Then
$$\mathrm{LOOCV}(\lambda) = \frac{1}{n} \sum_{i=1}^n \big(y_i - \hat{f}_\lambda^{[-i]}(x_i)\big)^2 = \frac{1}{n} \sum_{i=1}^n \frac{\big(y_i - \hat{f}_\lambda(x_i)\big)^2}{(1 - h_{ii})^2}.$$
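
The lemma can be checked numerically for ridge regression, a standard self-stable linear smoother. The sketch below (our own illustration, not from the talk) compares brute-force LOOCV, which refits $n$ times, with the one-fit shortcut based on the diagonal of the hat matrix; the two numbers agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 50, 5, 0.1
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.3 * rng.standard_normal(n)

# Ridge: minimize (1/n)||y - Xb||^2 + lam*||b||^2, so b = (X'X + n*lam*I)^{-1} X'y
H = X @ np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T)   # hat matrix, y_hat = H y
y_hat, h = H @ y, np.diag(H)

# Brute-force LOOCV: refit without observation i, n times
brute = []
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.solve(X[keep].T @ X[keep] + n * lam * np.eye(p), X[keep].T @ y[keep])
    brute.append((y[i] - X[i] @ b) ** 2)

# Magic shortcut: a single fit on the full data
loocv_magic = np.mean(((y - y_hat) / (1.0 - h)) ** 2)
print(np.mean(brute), loocv_magic)   # the two estimates coincide
```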

Self-stability property. [Figure: the delete-one fit $f(x^{[-5]}, y^{[-5]})$, the full-data fit $f(x, \tilde{y})$ computed with the fifth response modified, and the full-data fit $f(x, y)$, plotted against $x$.]

The question. We must consider the computational cost of the SVM combined with CV. Can we compute the exact LOOCV of the SVM and related classifiers without repeating the algorithm $n$ times?

Our contributions: (1) Propose a magic CV formula that computes the exact cross-validation error in the context of large-margin classification. (2) Develop the magic SVM by designing a very efficient algorithm for solving the kernel SVM and applying the magic CV formula for tuning the parameters. (3) Obtain theoretical bounds on the expectation and variance of the cross-validation error based on the CV formula.

Performance demo (time in seconds; error in %, with standard errors in parentheses):

dataset (n, p)           method     time (sec)   error (%)
arrhythmia (452, 191)    magicsvm      46.199    24.093 (0.602)
                         kernlab      755.316    24.447 (0.602)
                         e1071       1544.779    24.137 (0.602)
musk (476, 166)          magicsvm      97.780    10.798 (0.415)
                         kernlab      708.587    11.113 (0.433)
                         e1071       2078.586    10.777 (0.420)
australian (690, 14)     magicsvm      63.798    14.058 (0.372)
                         kernlab      440.839    14.058 (0.388)
                         e1071        887.315    14.203 (0.371)
SAfrica (462, 65)        magicsvm      53.558    28.939 (0.616)
                         kernlab      342.820    29.589 (0.750)
                         e1071        791.532    29.307 (0.692)
hepatitis (112, 18)      magicsvm       1.664    14.464 (1.044)
                         kernlab       22.837    14.643 (0.966)
                         e1071         20.781    14.732 (0.933)
sonar (208, 6)           magicsvm      14.151    17.885 (0.749)
                         kernlab       80.996    18.798 (0.973)
                         e1071        134.681    18.942 (1.351)
LSVT (126, 309)          magicsvm       3.185    15.873 (0.620)
                         kernlab      111.767    15.635 (0.766)
                         e1071        232.434    16.270 (0.763)
valley (606, 100)        magicsvm     129.080     2.096 (0.186)
                         kernlab      899.347     2.195 (0.232)
                         e1071       2314.652     2.096 (0.209)

Magic CV formula for SVM

Cross-validation estimates: $A(X^{[-v]}, y^{[-v]}) = A(X, \tilde{y})$. [Figure: two panels, regression and large-margin classification, illustrating that the delete-$v$ fit coincides with a full-data fit computed with modified responses $\tilde{y}$.]

Kernel SVM in the primal space. The SVM
$$\max_{\omega_0, \omega} \ \min_i d_i \quad \text{subject to} \quad d_i = y_i(\omega_0 + x_i^T \omega) + \eta_i \ge 0, \quad \eta_i \ge 0 \ \forall i, \quad \omega^T \omega = 1, \quad \sum_i \eta_i \le t,$$
is equivalent to the loss + penalty formulation
$$\arg\min_{\beta_0, \beta} \left[ \frac{1}{n} \sum_{i=1}^n \big[1 - y_i(\beta_0 + x_i^T \beta)\big]_+ + \lambda \beta^T \beta \right].$$
[Figure: the SVM hinge loss $[1 - u]_+$.]

Kernel SVM in the primal space.
1. $\min_{f \in \mathcal{H}_K} \left[ \frac{1}{n} \sum_{i=1}^n [1 - y_i f(x_i)]_+ + \lambda \|f\|_{\mathcal{H}_K}^2 \right]$.
2. Mercer's theorem: a kernel function $K$ has an eigen-expansion $K(x, x') = \sum_{t=1}^\infty \gamma_t \phi_t(x) \phi_t(x')$.
3. The Hilbert space $\mathcal{H}_K$ is defined as the collection of functions $f(x) = \sum_{t=1}^\infty \theta_t \phi_t(x)$, with the inner product defined as $\left\langle \sum_{t=1}^\infty \theta_t \phi_t(x), \sum_{t'=1}^\infty \delta_{t'} \phi_{t'}(x) \right\rangle_{\mathcal{H}_K} = \sum_{t=1}^\infty \theta_t \delta_t / \gamma_t$.

4. The representer theorem (Kimeldorf and Wahba, 1971): the solution of $\min_{f \in \mathcal{H}_K} \left[ \frac{1}{n} \sum_{i=1}^n [1 - y_i f(x_i)]_+ + \lambda \|f\|_{\mathcal{H}_K}^2 \right]$ has the finite form $\hat{f}(x) = \sum_{i=1}^n \hat{\alpha}_i K(x, x_i)$, i.e., $\hat{f}(x_i) = K_i^T \alpha$, so that
$$\hat{\alpha} = \arg\min_\alpha \left[ \frac{1}{n} \sum_{i=1}^n \big[1 - y_i K_i^T \alpha\big]_+ + \lambda \alpha^T K \alpha \right].$$
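
The representer theorem thus reduces the infinite-dimensional problem to an $n$-dimensional one in $\alpha$. For reference, here is the resulting finite-dimensional objective written as a NumPy function (an illustrative helper of ours, with $K_i^T \alpha$ computed as the $i$th entry of $K\alpha$):

```python
import numpy as np

def kernel_svm_objective(alpha, K, y, lam):
    """(1/n) * sum_i [1 - y_i * (K alpha)_i]_+  +  lam * alpha' K alpha."""
    margins = y * (K @ alpha)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + lam * alpha @ K @ alpha
```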

Magic leave-one-out formula for the SVM. If we let $\tilde{y}_v = 0$ and $\tilde{y}_i = y_i$ for $i \ne v$, then we have
$$\hat{f}_\lambda^{[-v]} = \arg\min_{f \in \mathcal{H}_K} \left[ \frac{1}{n} \sum_{i=1}^n L\big(\tilde{y}_i f(x_i)\big) + \lambda \|f\|_{\mathcal{H}_K}^2 \right].$$

Magic cross-validation formula ($V$-fold CV). The $i$th data point is allocated to fold $\tau(i)$ by randomization, $\tau: \{1, \ldots, n\} \to \{1, \ldots, V\}$. The $V$-fold CV estimate is
$$\hat{f}_\lambda^{[-v]} = \arg\min_{f \in \mathcal{H}_K} \frac{1}{n} \sum_{\{i: \tau(i) \ne v\}} L\big(y_i f(x_i)\big) + \lambda \|f\|_{\mathcal{H}_K}^2.$$
If we define $\tilde{y}_i = 0$ when $\tau(i) = v$ and $\tilde{y}_i = y_i$ when $\tau(i) \ne v$, then we have
$$\hat{f}_\lambda^{[-v]} = \arg\min_{f \in \mathcal{H}_K} \left[ \frac{1}{n} \sum_{i=1}^n L\big(\tilde{y}_i f(x_i)\big) + \lambda \|f\|_{\mathcal{H}_K}^2 \right].$$
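
The reason the formula works is that for any margin loss $L(yf(x))$, a zeroed label contributes the constant $L(0)$ to the objective, which does not affect the minimizer; fitting with $\tilde{y}$ is therefore the same as fitting without the held-out points. The sketch below checks this numerically with the logistic margin loss $L(u) = \log(1 + e^{-u})$ as an illustrative smooth stand-in for the hinge loss (the data, kernel, and optimizer choices are ours, not the talk's).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, lam, v = 40, 0.05, 7                      # v: index of the held-out observation
X = rng.standard_normal((n, 3))
y = np.where(X[:, 0] + 0.5 * rng.standard_normal(n) > 0, 1.0, -1.0)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # Gaussian kernel matrix

def margin_loss(u):
    return np.logaddexp(0.0, -u)             # logistic margin loss; L(0) = log 2 is constant

def objective(alpha, labels, weights):
    f = K @ alpha
    return np.mean(weights * margin_loss(labels * f)) + lam * alpha @ K @ alpha

# (a) drop observation v explicitly
w_drop = np.ones(n); w_drop[v] = 0.0
fit_drop = minimize(objective, np.zeros(n), args=(y, w_drop), method="BFGS")

# (b) keep all n points but set the v-th label to zero
y_tilde = y.copy(); y_tilde[v] = 0.0
fit_zero = minimize(objective, np.zeros(n), args=(y_tilde, np.ones(n)), method="BFGS")

# the two objectives differ only by the constant L(0)/n, so the minimizers
# coincide (up to optimizer tolerance)
print(np.max(np.abs(fit_drop.x - fit_zero.x)))
```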

Magic SVM. Efficient algorithm for training the SVM: exact smoothing principle; accelerated proximal gradient descent. Tune the SVM using leave-one-out cross-validation: the magic CV formula.

Smoothed SVM loss:
$$L_\delta(u) = \begin{cases} 0 & u \ge 1 + \delta, \\ \frac{1}{4\delta} \big[u - (1 + \delta)\big]^2 & 1 - \delta < u < 1 + \delta, \\ 1 - u & u \le 1 - \delta. \end{cases}$$
Lipschitz gradient: $|L_\delta'(u_1) - L_\delta'(u_2)| \le \frac{1}{2\delta} |u_1 - u_2|$. [Figure: $L_\delta(u)$ for $\delta = 0.5, 0.25, 0.1, 0.01$, compared with the SVM hinge loss.]
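
A direct NumPy transcription of the smoothed loss and its derivative (our helper functions; the derivative is Lipschitz with constant $1/(2\delta)$, matching the display above):

```python
import numpy as np

def smoothed_hinge(u, delta=0.1):
    """L_delta(u): 0 for u >= 1+delta, quadratic on (1-delta, 1+delta), 1-u for u <= 1-delta."""
    u = np.asarray(u, dtype=float)
    out = np.where(u <= 1.0 - delta, 1.0 - u, 0.0)
    mid = (u > 1.0 - delta) & (u < 1.0 + delta)
    return np.where(mid, (u - (1.0 + delta)) ** 2 / (4.0 * delta), out)

def smoothed_hinge_grad(u, delta=0.1):
    """L'_delta(u), which takes values in [-1, 0]; Lipschitz with constant 1/(2*delta)."""
    u = np.asarray(u, dtype=float)
    return np.clip((u - (1.0 + delta)) / (2.0 * delta), -1.0, 0.0)
```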

Theorem (finite exact smoothing of the SVM). With the training data $K$ and $y$ given, suppose $\alpha_{\mathrm{SVM}}$ and $\alpha_\delta$ are the unique solutions of
$$\alpha_{\mathrm{SVM}} = \arg\min_{\alpha \in \mathbb{R}^n} \left[ \frac{1}{n} \sum_{i=1}^n L(y_i K_i \alpha) + \lambda \alpha^\top K \alpha \right], \qquad \alpha_\delta = \arg\min_{\alpha \in \mathbb{R}^n} \left[ \frac{1}{n} \sum_{i=1}^n L_\delta(y_i K_i \alpha) + \lambda \alpha^\top K \alpha \right].$$
Then there exists a small $\delta_0 > 0$ such that $\alpha_\delta = \alpha_{\mathrm{SVM}}$ whenever $\delta < \delta_0$. The exact SVM solution is thus attained before $\delta$ reaches $0$. Define a sequence $\delta^{(d+1)} = r \delta^{(d)}$ with $0 < r < 1$; we solve for $\hat{\alpha}_{\delta^{(d)}}$ sequentially and terminate the algorithm when some $\hat{\alpha}_{\delta^{(d)}}$ satisfies the KKT conditions of the SVM problem.

Smoothed SVM:
$$\min_{\alpha \in \mathbb{R}^n} Q_\delta(\alpha) = \min_{\alpha \in \mathbb{R}^n} \left[ \frac{1}{n} \sum_{i=1}^n L_\delta(y_i K_i \alpha) + \lambda \alpha^\top K \alpha \right].$$
Accelerated proximal gradient descent update (writing $\ell_\delta(\alpha)$ for the smoothed loss term):
$$\alpha^{(t+1)} = \arg\min_{\alpha \in \mathbb{R}^n} \left[ \lambda \alpha^\top K \alpha + \nabla \ell_\delta(\bar{\alpha}^{(t)})^\top (\alpha - \bar{\alpha}^{(t)}) + \frac{1}{4n\delta} \big\| K (\alpha - \bar{\alpha}^{(t)}) \big\|_2^2 \right] = \bar{\alpha}^{(t)} - \left( 2\lambda K + \frac{1}{2n\delta} K K \right)^{-1} \big( \nabla \ell_\delta(\bar{\alpha}^{(t)}) + 2\lambda K \bar{\alpha}^{(t)} \big),$$
with momentum
$$\bar{\alpha}^{(t)} = \alpha^{(t)} + \frac{r_t - 1}{r_{t+1}} \big( \alpha^{(t)} - \alpha^{(t-1)} \big), \qquad r_1 = 1, \quad r_{t+1} = \Big( 1 + \sqrt{1 + 4 r_t^2} \Big) \big/ 2.$$

Algorithm 1 Magic SVM
Require: $y$, $K$, $\lambda$, and $r$ (e.g., $r = 2/3$).
1: Initialize $\delta$. Define $L_\delta$. Initialize each $\alpha^{[-v]}$.
2: repeat
3:   Compute $P_\delta^{-1}(K) = \big(2\lambda K + \frac{1}{2n\delta} K K\big)^{-1}$.
4:   for $v = 1, \ldots, n$ do
5:     Let $\tilde{y}_i = y_i$ if $i \ne v$, and $\tilde{y}_v = 0$.
6:     repeat
7:       Compute $z$, with $z_i = \tilde{y}_i L_\delta'(\tilde{y}_i K_i \alpha^{[-v]}) / n$.
8:       $\alpha^{[-v]} \leftarrow \alpha^{[-v]} - P_\delta^{-1}(K)\big(K z + 2\lambda K \alpha^{[-v]}\big)$.
9:     until the convergence condition is met.
10:  end for
11:  Update $\delta \leftarrow r\delta$.
12: until the KKT condition check of the SVM is passed.
The complexity of regular LOOCV for the SVM is $O(n^4)$; the complexity of the entire magic LOOCV SVM is $O(n^3)$.
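
Below is a simplified NumPy sketch of the magic LOOCV loop for a single $\lambda$: all $n$ leave-one-out fits are updated in parallel as the columns of a matrix, and they share one factorization of $P_\delta(K)$ per smoothing level. For brevity it uses fixed iteration counts and plain (unaccelerated) majorization steps instead of the KKT-based stopping rule and momentum step of Algorithm 1, so it should be read as an illustration of the structure, not as the paper's implementation.

```python
import numpy as np

def magic_loocv_svm(K, y, lam, delta=0.05, r=2/3, n_outer=8, n_inner=200):
    """All n leave-one-out kernel-SVM fits, sharing one preconditioner per delta level."""
    n = len(y)
    dL = lambda u, d: np.clip((u - 1.0 - d) / (2.0 * d), -1.0, 0.0)   # L'_delta(u)
    A = np.zeros((n, n))                     # column v stores alpha^{[-v]}
    Y = np.tile(y[:, None], (1, n))
    np.fill_diagonal(Y, 0.0)                 # y_tilde: zero out the held-out label
    for _ in range(n_outer):
        # one factorization of P_delta(K) = 2*lam*K + K K / (2 n delta), shared by all v
        P = 2.0 * lam * K + (K @ K) / (2.0 * n * delta)
        L_chol = np.linalg.cholesky(P + 1e-10 * np.eye(n))
        for _ in range(n_inner):
            F = K @ A                        # F[i, v] = K_i alpha^{[-v]}
            Z = Y * dL(Y * F, delta) / n     # z_i = y_tilde_i L'_delta(y_tilde_i K_i alpha)/n
            G = K @ Z + 2.0 * lam * F        # gradient of each fold's smoothed objective
            A -= np.linalg.solve(L_chol.T, np.linalg.solve(L_chol, G))
        delta *= r                           # continuation: shrink the smoothing parameter
    loo_pred = np.diag(K @ A)                # f^{[-v]}(x_v) for v = 1, ..., n
    return A, float(np.mean(y * loo_pred <= 0))   # coefficients and LOOCV error
```

For tuning, one would call a routine like this for each candidate $\lambda$ (and kernel parameter) and keep the pair with the smallest returned LOOCV error.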

The algorithm can be generalized to other variants of CV, such as $V$-fold CV, delete-$v$ CV, etc. It can also be generalized to other kernel machines, e.g., kernel logistic regression, the squared SVM, and the Huber SVM.

Simulation. Define $\mu_+ = (1, \ldots, 1, 0, \ldots, 0)$ and $\mu_- = (0, \ldots, 0, 1, \ldots, 1)$. Positive class: $\sum_{k=1}^{10} 0.1\, N(\mu_{k+}, 4I)$, where $\mu_{k+} \sim N(\mu_+, I)$. Negative class: $\sum_{k=1}^{10} 0.1\, N(\mu_{k-}, 4I)$, where $\mu_{k-} \sim N(\mu_-, I)$. Ratios of the run time without the magic CV to the run time with the magic CV: [Figure: run-time ratio versus sample size (100 to 500) for $p = 0.2n$ and $p = 0.5n$.]
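
A data-generating sketch for this simulation design. Our reading of the slide is assumed here: the first half of the coordinates of $\mu_+$ are ones and the rest zeros (and conversely for $\mu_-$), the classes are balanced, and each of the ten mixture components is drawn with probability 0.1; the function and variable names are ours.

```python
import numpy as np

def simulate_two_class_mixture(n, p, seed=0):
    """Each class is a 10-component normal mixture N(mu_k, 4I), centers mu_k ~ N(mu_+/-, I)."""
    rng = np.random.default_rng(seed)
    mu_pos = np.r_[np.ones(p // 2), np.zeros(p - p // 2)]
    mu_neg = np.r_[np.zeros(p // 2), np.ones(p - p // 2)]
    centers_pos = rng.normal(mu_pos, 1.0, size=(10, p))    # mu_{k+} ~ N(mu_+, I)
    centers_neg = rng.normal(mu_neg, 1.0, size=(10, p))    # mu_{k-} ~ N(mu_-, I)

    def draw(centers, m):
        k = rng.integers(0, 10, size=m)                    # mixture weights of 0.1 each
        return rng.normal(centers[k], 2.0)                 # N(mu_k, 4I): std dev = 2

    X = np.vstack([draw(centers_pos, n // 2), draw(centers_neg, n - n // 2)])
    y = np.r_[np.ones(n // 2), -np.ones(n - n // 2)]
    return X, y
```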

Timing and objective-value comparison (standard errors in parentheses):

n     p      time (seconds)                      objective value
             magicsvm   kernlab     e1071        magicsvm        kernlab         e1071
200   40       45.059    319.026    617.379      0.640 (0.011)   0.640 (0.011)   0.640 (0.011)
200   100      39.461    504.113    900.292      0.599 (0.014)   0.599 (0.014)   0.675 (0.073)
300   60      122.958    967.405   2305.041      0.631 (0.010)   0.631 (0.010)   0.631 (0.010)
300   150     107.102   1680.196   3443.444      0.611 (0.018)   0.611 (0.017)   0.611 (0.017)
400   80      373.881   2279.869   6232.798      0.629 (0.012)   0.629 (0.012)   0.629 (0.012)
400   200     343.342   4506.013   9473.093      0.608 (0.015)   0.608 (0.015)   0.608 (0.015)

Magic CV in kernel learning theory

Theorem (bounding the expectation of the $V$-fold CV error). Suppose each training point in $T_n = \{(x_i, y_i)\}_{i=1}^n$ is sampled from the same distribution, and the $i$th data point is allocated to the $\tau(i)$th fold. Let $B = \sup_x K(x, x)$ and $\Lambda = \sup_u |L'(u)|$. Then the expectation of the $V$-fold CV error, $2 \le V \le n$, satisfies
$$\mathrm{E}_{T_n}\, \mathrm{err}_{V\text{-}CV} \le \mathrm{E}_{T_n}\, \mathrm{err}(\hat{f}_\lambda) + \frac{B \Lambda^2}{2 \lambda V}.$$
Here the $V$-fold CV error is $\mathrm{err}_{V\text{-}CV} = \frac{1}{n} \sum_{v=1}^V \sum_{\{i: \tau(i) = v\}} L\big(y_i \hat{f}_\lambda^{[-v]}(x_i)\big)$ and the training error is $\mathrm{err}(\hat{f}_\lambda) = \frac{1}{n} \sum_{i=1}^n L\big(y_i \hat{f}_\lambda(x_i)\big)$.

Corollary (generalization error bound for the kernel SVM). Suppose each training point in $T_n = \{(x_i, y_i)\}_{i=1}^n$ is sampled from the same distribution. Define $\tilde{f} = \arg\min_{f \in \mathcal{H}_K} \mathrm{Err}(f)$. Then
$$\mathrm{E}_{T_n}\, \mathrm{Err}(\hat{f}_\lambda) \le \mathrm{Err}(\tilde{f}) + \lambda \|\tilde{f}\|_{\mathcal{H}_K}^2 + \frac{1}{2 \lambda n}.$$
Gaussian kernel: $B = \sup_x K(x, x) = 1$; SVM: $\Lambda = \sup_u |L'(u)| = 1$.
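
To see where the $\frac{1}{2\lambda n}$ slack term comes from, here is the expectation bound of the previous theorem written out for the Gaussian-kernel SVM with LOOCV; this is only an instantiation of the stated bound, not the full proof of the corollary.

```latex
% LOOCV instance of the expectation bound: V = n, B = sup_x K(x,x) = 1, Lambda = sup_u |L'(u)| = 1
\mathrm{E}_{T_n}\, \mathrm{err}_{\mathrm{LOOCV}}
  \;\le\; \mathrm{E}_{T_n}\, \mathrm{err}(\hat{f}_\lambda) + \frac{B \Lambda^2}{2 \lambda V}
  \;=\;   \mathrm{E}_{T_n}\, \mathrm{err}(\hat{f}_\lambda) + \frac{1}{2 \lambda n}
```

which matches the $\frac{1}{2\lambda n}$ term appearing in the corollary's bound.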

Theorem (bounding the variance of the $V$-fold CV error). Suppose each training point in $T_n = \{(x_i, y_i)\}_{i=1}^n$ is sampled from the same distribution, and the $i$th data point is allocated to the $\tau(i)$th fold. Let $B = \sup_x K(x, x)$ and $\Lambda = \sup_u |L'(u)|$. Then
$$\mathrm{Var}_{T_n}(\mathrm{err}_{\mathrm{LOOCV}}) \le \frac{1}{n} \left( 1 + \frac{4}{\lambda} \right).$$

Thank You