A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang

Outline 1 Background 2 Magic CV formula 3 Magic support vector machines 4 Magic CV applications in kernel learning theory 5 Numerical studies

Binary classification Observations: a collection of i.i.d. training data (x 1, y 1 ), (x 2, y 2 ),..., (x n, y n ). Predictors (covariates, input vector): x i = (x i1,..., x ip ). Response (class label, output variable): y i = { 1, 1}. Build a model ˆf using the training data. Given any new input x, we predict the class label ŷ = ˆf(x). 2/37

Classification toolbox linear discriminant analysis logistic regression kernel density classifier naïve Bayes classifier neural network boosting ensembles random forest support vector machine (SVM)... Experiment: Fernández et al (2014, JMLR) compared 179 commonly used classifiers on 121 UCI data sets. Conclusion: best classifiers are random forest, kernel SVM, neural networks, and boosting ensembles. 3/37

Support vector machine The Non-separable Case Distance y i (ω 0 + x T i ω) y = +1 may be negative. x2 η i x i Introduce slack variables η i 0. Redefine the distance d i : y = 1 ω d i = y i (ω 0 + x T i ω) + η i such that d i 0 for all i. x 1 4/37

Support vector machine SVM (Vapnik, 1995) argmax ω 0,ω subject to min i d i, d i = y i (ω 0 + x T i ω) + η i 0, i, η i 0, i, ω T ω = 1, η i t. i The tuning parameter t controls the extent of the slack variables. 5/37

Computing SVM in the dual space Lagrange dual function: [ n max L D = max α i 1 α α 2 subject to i=1 n i=1 i =1 n α i y i = 0 and 0 α i γ, i. i=1 The solution has the form: ˆf(x) = ˆβ 0 + ] n α i α i y i y i x i, x i, n ˆα i y i x i, x. The coefficients, ˆα i, are nonzero only when the observations are the support vectors. i=1 6/37

Kernel trick in the dual space Lagrange dual and solution with kernel function: [ n max L D = max α i 1 α α 2 subject to i=1 n i=1 i =1 n α i y i = 0 and 0 α i γ, i. i=1 ˆf(x) = β 0 + Gaussian kernel: n ˆα i y i K(x i, x). i=1 K(x, x ) = exp( σ x x 2 2). ] n α i α i y i y i K(x i, x i ), 7/37

State-of-the-art SVM solvers interior point method. Example: R package kernlab. sequential minimal optimization. (Platt, 1999; Osuna et al, 1997; Keerthi et al., 2001; Fan et al., 2005) LIBSVM library, R package e1071. 8/37

Tuning the SVM The kernel SVM has high prediction accuracy. The generalization error of SVM depends on the choice of the tuning parameter. Two tasks: 1 Model comparison/selection: e.g., choose the tuning parameter for a procedure. 2 Model assessment: estimate the generalization error for the final model. Cross-validation: perhaps the simplest and the most widely used tool. 9/37

Cross-validation

V -fold cross-validation Fold 1 Fold 2 Fold 3 Validation Train Train Train Train Validation Train Train Train Train Validation Train Fold V Train Train Train Validation Cross-validation error: CV(λ) = V 1 V v=1 L(Y v, ˆf [ v] λ (X v )). Tuning parameter: ˆλ = argmin CV(λ). λ Generalization error: Err( ˆf λ ) CV (λ). Leave-one-out CV (LOOCV): V = n. Ten-fold CV: V = 10. 10/37

LOOCV or ten-fold CV? LOOCV is an almost unbiased estimator for the true generalization error, i.e., small bias. LOOCV is deterministic while ten-fold CV is random in terms of different training/validation splits. LOOCV is computationally expensive due to n times model fits.? LOOCV is claimed to have larger variance than ten-fold CV. The last statement is quite popular but it is not generally true. 11/37

Mean of classification error 0.5 0.6 0.7 0.8 0.9 1.0 LOOCV cv error true error 10 fold cv error true error 5 fold cv error true error 2 fold cv error true error Variance of classification error 0.02 0.04 0.06 0.08 10 8 6 4 10 8 6 4 10 8 6 4 10 8 6 4 log(lambda) log(lambda) log(lambda) log(lambda) 1 LOOCV has almost no bias in estimating generalization error. 2 LOOCV has the variance no larger than other V -fold CV. 12/37

Magic CV Formula

LOOCV for regression Model: y = f(x) + ɛ. Estimate f using the regularization: ˆf λ = argmin f [ 1 n ] n (y i f(x i )) 2 + λp (f). i=1 Examples: ridge regression, f(x) = x β, P (f) = β 2 2 ; smoothing spline, P (f) = f (u) 2 du. LOOCV estimate: ˆf [ v] λ = argmin f 1 n n i=1,i v (y i f(x i )) 2 + λp (f). LOOCV error: LOOCV(λ) = 1 n i=1 (y i n ˆf [ i] λ (x i )) 2. 13/37

Magic leave-one-out lemma for regression Craven and Wahba (1979) Suppose that ˆf λ (x i ) = H i y, is self-stable, then LOOCV(λ) = 1 n = 1 n n i=1 n i=1 ( y i ) [ i] 2 ˆf λ (x i ) (y i ˆf λ (x i )) 2 (1 h ii ) 2. 14/37

Self-stability property f(x [ 5], y [ 5] ) f(x, ỹ) y f(x, y) y x x 15/37

The Question We must consider the computation cost of the SVM with CV. Can we compute the exact LOOCV of SVM and related classifiers without repeating the algorithm n times? 16/37

Our contributions Propose a magic CV formula to compute the exact cross-validation error in the context of large-margin classification. Develop a magic SVM by designing a very efficient algorithm to solve kernel SVM and applying the magic CV formula for tuning the parameters. Obtain the theoretic bounds of the expectation and variance of cross-validation error based on the CV formula. 17/37

Performance demo method time (sec) error (%) time (sec) error (%) arrhythmia n = 452 p = 191 musk n = 476 p = 166 magicsvm 46.199 24.093 (0.602) 97.780 10.798 (0.415) kernlab 755.316 24.447 (0.602) 708.587 11.113 (0.433) e1071 1544.779 24.137 (0.602) 2078.586 10.777 (0.420) australian n = 690 p = 14 SAfrica n = 462 p = 65 magicsvm 63.798 14.058 (0.372) 53.558 28.939 (0.616) kernlab 440.839 14.058 (0.388) 342.820 29.589 (0.750) e1071 887.315 14.203 (0.371) 791.532 29.307 (0.692) hepatitis n = 112 p = 18 sonar n = 208 p = 6 magicsvm 1.664 14.464 (1.044) 14.151 17.885 (0.749) kernlab 22.837 14.643 (0.966) 80.996 18.798 (0.973) e1071 20.781 14.732 (0.933) 134.681 18.942 (1.351) LSVT n = 126 p = 309 valley n = 606 p = 100 magicsvm 3.185 15.873 (0.620) 129.080 2.096 (0.186) kernlab 111.767 15.635 (0.766) 899.347 2.195 (0.232) e1071 232.434 16.270 (0.763) 2314.652 2.096 (0.209) 18/37

Magic CV formula for SVM

Cross-validation estimates: A(X [ v], y [ v] ) = A(X, ỹ). Regression Large-margin Classification! "! "! "! "! #! #! #! #! % '( % (* % )! % -! &! &! &! &!,!!,! 19/37

Cross-validation estimates: A(X [ v], y [ v] ) = A(X, ỹ). Regression Large-margin Classification! "! "! "! "! "! "! "! "! #! #! #! #! #! #! #! #! %! % '( % '( (* % (* % ) % )! %! % - -! &! &! &! &! &! &! &! &!!,!,!!!,!,! 19/37

Kernel SVM in the primal space SVM: argmax ω 0,ω subject to loss + penalty: min i d i, d i = y i (ω 0 + x T i ω) + η i 0, η i 0, i, ω T ω = 1, η i t. i [ 1 n [ argmin 1 yi (β 0 + x T i β) ] ] β 0,β n + + λβt β. i=1 0.0 0.5 1.0 1.5 SVM -0.5 0.0 0.5 1.0 1.5 2.0 2.5 20/37

Kernel SVM in the primal space 1 min f H K [ 1 n ] n [1 y i f(x i )] + + λ f 2 H K, i=1 2 Mercer Theorem: a kernel function K has an eigen-expansion, K(x, x ) = γ t φ t (x)φ t (x ). t=1 3 An Hilbert space H K is defined as the collection of functions f(x) = θ t φ t (x). t=1 with the inner product defined as θ t φ t (x), δ t φ t (x) θ t δ t /γ t. t=1 t =1 H K = t=1 21/37

4 The representer theorem (Kimeldorf and Wahba, 1971): min f H K [ 1 n ˆα = arg min α ] n [1 y i f(x i )] + + λ f 2 H K, i=1 ˆf(x) = [ 1 n n ˆα i K(x, x i ) = K T i α, i=1 n [ 1 yi K T i α ] ] + + λαt Kα. i=1 22/37

Magic leave-one-out formula for SVM If we let ỹ v = 0 and ỹ i = y i if i v, then we have ˆf [ v] λ = argmin f H K [ 1 n ] n L (ỹ i f(x i )) + λ f 2 H K. i=1 23/37

Magic cross-validation formula (V -fold CV) The ith data point is allocated to the fold τ(i) by randomization: τ : {1,..., n} {1,..., V }. The V -fold CV estimate: ˆf [ v] λ = argmin 1 f H K n {i:τ(i) v} L (y i f(x i )) + λ f 2 H K. We define ỹ i = 0 if τ(i) = v and ỹ i = y i if τ(i) v, then we have ˆf [ v] λ = argmin f H K [ 1 n ] n L (ỹ i f(x i )) + λ f 2 H K. i=1 24/37

Magic SVM Efficient algorithm for training SVM Exact smoothing principle Accelerated proximal gradient descent Tune SVM using leave-one-out cross-validation Magic CV formula 25/37

Smoothed SVM loss: 0 u 1 + δ, L δ 1 (u) = 4δ [u (1 + δ)]2 1 δ < u < 1 + δ, 1 u u 1 δ. Lipschitz gradient: L δ (u 1 ) L δ (u 2 ) 1 2δ u 1 u 2. Loss Lδ(u) 0.0 0.4 0.8 δ = 0.5 δ = 0.25 δ = 0.1 δ = 0.01 SVM 0.0 0.5 1.0 1.5 2.0 u 26/37

Theorem (finite exact smoothing of SVM) With training data K and y given, suppose α SVM and α δ are the unique solution of the following problems, then there exists a small δ such that α δ = α SVM when δ < δ. α SVM = argmin α R n α δ = argmin α R n [ 1 n [ 1 n ] n L(y i K i α) + λα Kα. i=1 ] n L δ (y i K i α) + λα Kα. i=1 The exact SVM solution is virtually attained before δ = 0. Define a sequence δ (d+1) = rδ (d) and 0 < r < 1. We solve ˆα δ (d) sequentially and terminate the algorithm when some ˆα δ (d) satisfies the KKT condition of the SVM problem. 27/37

Smoothed SVM: min α R Qδ (α) min n α R n [ 1 n ] n L δ (y i K i α) + λα Kα. i=1 Accelerated proximal gradient descent update: [ α (t+1) = argmin λα Kα + 1 ( ) ] α ᾱ (t) 2δ l δ (ᾱ (t) 2 ) α R n 4nδ 2 ( = ᾱ (t) 2λK + 1 ) 1 ( 2nδ KK l δ (ᾱ (t) ) + 2λKᾱ (t)). ( ) ᾱ (t) = α (t) rt 1 + (α (t) α (t 1) ), r 1 = 1, r t+1 = r t+1 ( 1 + 1 + 4r 2 t ) /2. 28/37

Algorithm 1 Magic SVM Require: y, K, λ, and r, (e.g., r = 2 3 ). 1: Initialize δ. Define L δ. Initialize each α [ v]. 2: repeat 3: Compute P 1 δ (K) = (2λK + 1 2nδ KK) 1. 4: for v = 1,..., n do 5: Let ỹ i = y i if i v, and ỹ v = 0. 6: repeat 7: Compute z, with z i = ỹ i L δ ((ỹ i K i α [ v] )/n. 8: α [ v] α [ v] P 1 δ (K) K z + 2λK α [ v]). 9: until the convergence condition is met. 10: end for 11: Update δ rδ. 12: until the KKT condition check of SVM is passed. The complexity of regular LOOCV SVM O(n 4 ). The complexity of the entire magic LOOCV SVM is O(n 3 ), 29/37

The algorithm can be generalized to other variants of CV, such as V -fold CV, delete-v CV, etc. The algorithm can be generalized to other kernel machines, e.g., logistic regression, squared SVM, huber SVM. 30/37

Simulation Define µ + = (1,..., 1, 0,..., 0) and µ = (0,..., 0, 1,..., 1). Positive class: 10 k=1 0.1N(µ k+, 4I) where µ k+ N(µ +, I). Negative class: 10 k=1 0.1N(µ k, 4I) where µ k N(µ, I). Ratios of the run time without the magic CV to the magic CV: ratio of run time 0 10 20 30 40 50 60 p=0.2n p=0.5n 100 200 300 400 500 sample size 31/37

n p time (seconds) objective value magicsvm kernlab e1071 magicsvm kernlab e1071 200 40 45.059 319.026 617.379 0.640 0.640 0.640 (0.011) (0.011) (0.011) 100 39.461 504.113 900.292 0.599 0.599 0.675 (0.014) (0.014) (0.073) 300 60 122.958 967.405 2305.041 0.631 0.631 0.631 (0.010) (0.010) (0.010) 150 107.102 1680.196 3443.444 0.611 0.611 0.611 (0.018) (0.017) (0.017) 400 80 373.881 2279.869 6232.798 0.629 0.629 0.629 (0.012) (0.012) (0.012) 200 343.342 4506.013 9473.093 0.608 0.608 0.608 (0.015) (0.015) (0.015) 32/37

Magic CV in kernel learning theory

Theorem (Bounding expectation of V -fold CV error) Suppose each training data in T n = {(x i, y i )} n i=1 is sampled from the same distribution. Suppose the ith data point is allocated to the τ(i)th set. Let B = sup x K(x, x) and Λ = sup u L (u). Then the expectation of V -fold CV error, 2 < V n, satisfies E T n err V CV E T n err( ˆf λ ) + BΛ2 2λV. V -fold CV error bound: err V CV = 1 V n Training error: v=1 L(y i {i:τ(i)=v} ˆf [ v] λ (x i )). err( ˆf λ ) = 1 n n i=1 ( L y i ˆf ) λ (x i ). 33/37

Corollary (Generalization error bound for kernel SVM) Suppose each training data in T n = {(x i, y i )} n i=1 is sampled from the same distribution. Define f = argmin f HK Err(f). Then E T n Err( ˆf λ ) Err( f) + λ f 2 H K + 1 2λn. Gaussian kernel: B = sup x K(x, x) = 1; SVM: Λ = sup u L (u) = 1. 34/37

Theorem (Bounding the variance of the V -fold CV) Suppose each training data in T n = {(x i, y i )} n i=1 is sampled from the same distribution. Suppose the ith data point is allocated to the τ(i)th set. Let B = sup x K(x, x) and Λ = sup u L (u). Var T n (err LOOCV ) 1 n ( 1 + ) 4 1. λ 35/37

Thank You