A Magic CV Theory for Large-Margin Classifiers


1 A Magic CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang

2 Outline 1 Background 2 Magic CV formula 3 Magic support vector machines 4 Magic CV applications in kernel learning theory 5 Numerical studies

3 Binary classification Observations: a collection of i.i.d. training data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). Predictors (covariates, input vector): x_i = (x_{i1}, ..., x_{ip}). Response (class label, output variable): y_i ∈ {−1, +1}. Build a model f̂ using the training data. Given any new input x, we predict the class label ŷ = f̂(x). 2/37

4 Classification toolbox: linear discriminant analysis, logistic regression, kernel density classifier, naïve Bayes classifier, neural network, boosting ensembles, random forest, support vector machine (SVM), ... Experiment: Fernández et al. (2014, JMLR) compared 179 commonly used classifiers on 121 UCI data sets. Conclusion: the best classifiers are random forest, kernel SVM, neural networks, and boosting ensembles. 3/37

5 Support vector machine The non-separable case: the distance y_i(ω_0 + x_i^T ω) may be negative. Introduce slack variables η_i ≥ 0 and redefine the distance d_i as
d_i = y_i(ω_0 + x_i^T ω) + η_i, such that d_i ≥ 0 for all i.
[Figure: two classes (y = +1 and y = −1) separated by a hyperplane with normal vector ω in the (x_1, x_2) plane; a point x_i on the wrong side is shown with its slack η_i.] 4/37

6 Support vector machine SVM (Vapnik, 1995):
argmax_{ω_0, ω} min_i d_i
subject to d_i = y_i(ω_0 + x_i^T ω) + η_i ≥ 0 for all i, η_i ≥ 0 for all i, ω^T ω = 1, and Σ_i η_i ≤ t.
The tuning parameter t controls the extent of the slack variables. 5/37

7 Computing SVM in the dual space Lagrange dual function:
max_α L_D = max_α [ Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} y_i y_{i'} ⟨x_i, x_{i'}⟩ ]
subject to Σ_{i=1}^n α_i y_i = 0 and 0 ≤ α_i ≤ γ for all i.
The solution has the form
f̂(x) = β̂_0 + Σ_{i=1}^n α̂_i y_i ⟨x_i, x⟩.
The coefficients α̂_i are nonzero only for the observations that are the support vectors. 6/37
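As a quick illustration of how the dual solution is used for prediction, here is a minimal Python sketch; the variable names are illustrative, and the dual coefficients and intercept are assumed to come from whatever solver was used.

```python
import numpy as np

def svm_decision(X_train, y_train, alpha_hat, beta0_hat, x_new):
    """f(x) = beta0 + sum_i alpha_i * y_i * <x_i, x>.

    Only the support vectors (observations with alpha_i > 0) contribute to the sum.
    """
    return beta0_hat + np.sum(alpha_hat * y_train * (X_train @ x_new))
```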

8 Kernel trick in the dual space Lagrange dual and solution with a kernel function:
max_α L_D = max_α [ Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} y_i y_{i'} K(x_i, x_{i'}) ]
subject to Σ_{i=1}^n α_i y_i = 0 and 0 ≤ α_i ≤ γ for all i.
f̂(x) = β̂_0 + Σ_{i=1}^n α̂_i y_i K(x_i, x).
Gaussian kernel: K(x, x') = exp(−σ ‖x − x'‖²_2). 7/37
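A minimal sketch of the Gaussian kernel matrix in the slide's parameterization (function and variable names are illustrative):

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    """Pairwise kernel matrix with entries exp(-sigma * ||x_i - z_j||^2)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-sigma * np.maximum(sq, 0.0))  # clip tiny negative values from round-off
```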

9 State-of-the-art SVM solvers: the interior point method (example: R package kernlab) and sequential minimal optimization (Platt, 1999; Osuna et al., 1997; Keerthi et al., 2001; Fan et al., 2005), e.g., the LIBSVM library and R package e1071. 8/37

10 Tuning the SVM The kernel SVM has high prediction accuracy. The generalization error of SVM depends on the choice of the tuning parameter. Two tasks: 1 Model comparison/selection: e.g., choose the tuning parameter for a procedure. 2 Model assessment: estimate the generalization error for the final model. Cross-validation: perhaps the simplest and the most widely used tool. 9/37

11 Cross-validation

12 V-fold cross-validation
[Figure: the data are split into V folds; in turn, each fold serves as the validation set while the remaining folds are used for training.]
Cross-validation error: CV(λ) = V^{-1} Σ_{v=1}^V L(Y_v, f̂^{[-v]}_λ(X_v)).
Tuning parameter: λ̂ = argmin_λ CV(λ).
Generalization error: Err(f̂_λ) ≈ CV(λ).
Leave-one-out CV (LOOCV): V = n. Ten-fold CV: V = 10. 10/37
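For reference, a brute-force V-fold CV loop of the kind the magic formula is designed to avoid; `fit` and `loss` are placeholders for whatever classifier and loss are being tuned, and all names are illustrative.

```python
import numpy as np

def v_fold_cv(X, y, lam_grid, fit, loss, V=10, seed=0):
    """Brute-force V-fold CV: refit the model on each training split for each lambda.

    fit(X_tr, y_tr, lam) must return a predict(X) function; loss(y, yhat) a scalar.
    """
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % V            # fold assignment tau(i)
    cv = np.zeros(len(lam_grid))
    for j, lam in enumerate(lam_grid):
        for v in range(V):
            tr, va = folds != v, folds == v
            predict = fit(X[tr], y[tr], lam)
            cv[j] += loss(y[va], predict(X[va])) / V
    return cv                                      # tune with lam_grid[np.argmin(cv)]
```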

13 LOOCV or ten-fold CV? LOOCV is an almost unbiased estimator of the true generalization error, i.e., it has small bias. LOOCV is deterministic, while ten-fold CV is random across different training/validation splits. LOOCV is computationally expensive because it requires n model fits. LOOCV is often claimed to have larger variance than ten-fold CV. The last claim is quite popular, but it is not generally true. 11/37

14 [Figure: mean and variance of the classification error versus log(lambda) for LOOCV, 10-fold, 5-fold, and 2-fold CV, each compared with the true error.]
1 LOOCV has almost no bias in estimating the generalization error.
2 The variance of LOOCV is no larger than that of other V-fold CV. 12/37

15 Magic CV Formula

16 LOOCV for regression Model: y = f(x) + ε. Estimate f using regularization:
f̂_λ = argmin_f [ (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ P(f) ].
Examples: ridge regression, f(x) = x^T β, P(f) = ‖β‖²_2; smoothing spline, P(f) = ∫ f''(u)² du.
LOOCV estimate: f̂^{[-v]}_λ = argmin_f (1/n) Σ_{i=1, i≠v}^n (y_i − f(x_i))² + λ P(f).
LOOCV error: LOOCV(λ) = (1/n) Σ_{i=1}^n (y_i − f̂^{[-i]}_λ(x_i))². 13/37

17 Magic leave-one-out lemma for regression (Craven and Wahba, 1979) Suppose that f̂_λ(x_i) = H_i y, where H_i is the ith row of the hat matrix H, and that the fit is self-stable. Then
LOOCV(λ) = (1/n) Σ_{i=1}^n (y_i − f̂^{[-i]}_λ(x_i))² = (1/n) Σ_{i=1}^n (y_i − f̂_λ(x_i))² / (1 − h_{ii})². 14/37
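For a concrete instance of the lemma, the sketch below (illustrative, not from the talk) checks the shortcut for ridge regression, whose hat matrix under the slide's (1/n)-scaled loss is H = X(XᵀX + nλI)⁻¹Xᵀ. The two functions should return the same value up to floating-point error.

```python
import numpy as np

def ridge_loocv_shortcut(X, y, lam):
    """LOOCV error via the hat matrix: (1/n) * sum_i ((y_i - yhat_i) / (1 - h_ii))**2."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T)
    resid = y - H @ y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def ridge_loocv_bruteforce(X, y, lam):
    """LOOCV error by refitting the ridge model n times."""
    n, p = X.shape
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        beta = np.linalg.solve(X[keep].T @ X[keep] + n * lam * np.eye(p),
                               X[keep].T @ y[keep])
        errs.append((y[i] - X[i] @ beta) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((50, 5)), rng.standard_normal(50)
print(ridge_loocv_shortcut(X, y, 0.1), ridge_loocv_bruteforce(X, y, 0.1))
```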

18 Self-stability property [Figure: two panels plotting y against x, comparing the full-data fit f(x, y) with the leave-one-out fit f(x^{[-5]}, y^{[-5]}) and with the fit f(x, ỹ) obtained after replacing the held-out response by a pseudo response ỹ.] 15/37

19 The Question We must consider the computational cost of the SVM with CV. Can we compute the exact LOOCV of the SVM and related classifiers without repeating the algorithm n times? 16/37

20 Our contributions Propose a magic CV formula to compute the exact cross-validation error in the context of large-margin classification. Develop a magic SVM by designing a very efficient algorithm to solve the kernel SVM and by applying the magic CV formula to tune the parameters. Obtain theoretical bounds on the expectation and variance of the cross-validation error based on the CV formula. 17/37

21 Performance demo [Table: run time (seconds) and classification error (%), with standard errors in parentheses, for magicsvm, kernlab, and e1071 on eight benchmark data sets: arrhythmia (n = 452, p = 191), musk (n = 476, p = 166), australian (n = 690, p = 14), SAfrica (n = 462, p = 65), hepatitis (n = 112, p = 18), sonar (n = 208, p = 6), LSVT (n = 126, p = 309), and valley (n = 606, p = 100).] 18/37

22 Magic CV formula for SVM

23 Cross-validation estimates: A(X^{[-v]}, y^{[-v]}) = A(X, ỹ). [Figure: schematic contrasting the regression case with the large-margin classification case; deleting the fold-v observations is represented by replacing their responses with pseudo responses ỹ.] 19/37

24 Cross-validation estimates: A(X^{[-v]}, y^{[-v]}) = A(X, ỹ). [Figure: the same schematic with the pseudo responses ỹ filled in for both the regression and the large-margin classification cases.] 19/37

25 Kernel SVM in the primal space SVM:
argmax_{ω_0, ω} min_i d_i, subject to d_i = y_i(ω_0 + x_i^T ω) + η_i ≥ 0, η_i ≥ 0 for all i, ω^T ω = 1, Σ_i η_i ≤ t.
Equivalent loss + penalty form:
argmin_{β_0, β} [ (1/n) Σ_{i=1}^n [1 − y_i(β_0 + x_i^T β)]_+ + λ β^T β ].
[Figure: the SVM hinge loss as a function of the margin.] 20/37

26 Kernel SVM in the primal space
1 min_{f ∈ H_K} [ (1/n) Σ_{i=1}^n [1 − y_i f(x_i)]_+ + λ ‖f‖²_{H_K} ].
2 Mercer theorem: a kernel function K has an eigen-expansion, K(x, x') = Σ_{t=1}^∞ γ_t φ_t(x) φ_t(x').
3 The Hilbert space H_K is defined as the collection of functions f(x) = Σ_{t=1}^∞ θ_t φ_t(x), with the inner product defined as ⟨ Σ_{t=1}^∞ θ_t φ_t(x), Σ_{t'=1}^∞ δ_{t'} φ_{t'}(x) ⟩_{H_K} = Σ_{t=1}^∞ θ_t δ_t / γ_t. 21/37

27 4 The representer theorem (Kimeldorf and Wahba, 1971): the solution of
min_{f ∈ H_K} [ (1/n) Σ_{i=1}^n [1 − y_i f(x_i)]_+ + λ ‖f‖²_{H_K} ]
has the finite-dimensional form f̂(x) = Σ_{i=1}^n α̂_i K(x, x_i), so that f̂(x_i) = K_i^T α̂, where
α̂ = argmin_α [ (1/n) Σ_{i=1}^n [1 − y_i K_i^T α]_+ + λ α^T K α ],
with K the n × n kernel matrix and K_i its ith column. 22/37
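In this finite-dimensional form the objective and the fitted function are straightforward to evaluate; a minimal sketch with illustrative names:

```python
import numpy as np

def ksvm_objective(alpha, K, y, lam):
    """(1/n) * sum_i [1 - y_i * K_i' alpha]_+  +  lam * alpha' K alpha."""
    margins = y * (K @ alpha)                   # y_i * f(x_i), with f(x_i) = K_i' alpha
    return np.maximum(0.0, 1.0 - margins).mean() + lam * alpha @ K @ alpha

def ksvm_predict(alpha, K_new):
    """f(x_new_j) = sum_i alpha_i K(x_new_j, x_i), where K_new[j, i] = K(x_new_j, x_i)."""
    return K_new @ alpha
```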

28 Magic leave-one-out formula for SVM If we let ỹ_v = 0 and ỹ_i = y_i for i ≠ v, then we have
f̂^{[-v]}_λ = argmin_{f ∈ H_K} [ (1/n) Σ_{i=1}^n L(ỹ_i f(x_i)) + λ ‖f‖²_{H_K} ].
The formula holds because L(ỹ_v f(x_v)) = L(0) is a constant that does not depend on f, so zeroing the label of observation v and deleting it yield the same minimizer. 23/37
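Here is a small numerical check of the identity (an illustrative sketch, not the authors' code): deleting observation v and keeping it with label ỹ_v = 0 should produce the same held-out prediction. For differentiability the sketch uses the squared hinge loss; the argument only requires L(0) to be a constant, so it applies to the hinge as well. The kernel, λ, and sample size are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

def fit(K_sub, y_sub, n, lam):
    """Minimize (1/n) * sum_i L(y_i * K_i' alpha) + lam * alpha' K alpha, L = squared hinge."""
    def obj(a):
        loss = np.maximum(0.0, 1.0 - y_sub * (K_sub @ a)) ** 2
        return loss.sum() / n + lam * a @ K_sub @ a
    return minimize(obj, np.zeros(len(y_sub)), method="BFGS").x

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 3))
y = np.sign(rng.standard_normal(30))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
n, lam, v = len(y), 0.1, 0

keep = np.arange(n) != v
a_del = fit(K[np.ix_(keep, keep)], y[keep], n, lam)   # (a) delete observation v
y_tilde = y.copy()
y_tilde[v] = 0.0
a_zero = fit(K, y_tilde, n, lam)                       # (b) keep it, zero its label

# The held-out predictions should agree up to optimizer tolerance.
print(K[v, keep] @ a_del, K[v, :] @ a_zero)
```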

29 Magic cross-validation formula (V-fold CV) The ith data point is allocated to fold τ(i) by randomization: τ: {1, ..., n} → {1, ..., V}. The V-fold CV estimate is
f̂^{[-v]}_λ = argmin_{f ∈ H_K} (1/n) Σ_{i: τ(i) ≠ v} L(y_i f(x_i)) + λ ‖f‖²_{H_K}.
If we define ỹ_i = 0 when τ(i) = v and ỹ_i = y_i when τ(i) ≠ v, then we have
f̂^{[-v]}_λ = argmin_{f ∈ H_K} [ (1/n) Σ_{i=1}^n L(ỹ_i f(x_i)) + λ ‖f‖²_{H_K} ]. 24/37
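The bookkeeping behind the V-fold version is simple; a minimal sketch with illustrative names:

```python
import numpy as np

def assign_folds(n, V, seed=0):
    """Random fold labels tau(i) in {0, ..., V-1}, balanced as evenly as possible."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n) % V

def pseudo_responses(y, tau, v):
    """ytilde_i = 0 if tau(i) = v, and ytilde_i = y_i otherwise."""
    y_tilde = y.astype(float).copy()
    y_tilde[tau == v] = 0.0
    return y_tilde
```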

30 Magic SVM: an efficient algorithm for training the SVM (exact smoothing principle, accelerated proximal gradient descent), and tuning the SVM by leave-one-out cross-validation (magic CV formula). 25/37

31 Smoothed SVM loss:
L_δ(u) = 0 if u ≥ 1 + δ; = (1/(4δ)) [u − (1 + δ)]² if 1 − δ < u < 1 + δ; = 1 − u if u ≤ 1 − δ.
Lipschitz gradient: |L'_δ(u_1) − L'_δ(u_2)| ≤ (1/(2δ)) |u_1 − u_2|.
[Figure: L_δ(u) for δ = 0.5, 0.25, 0.1, 0.01, approaching the SVM hinge loss as δ shrinks.] 26/37
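A direct implementation of the smoothed loss, with a quick check that it tracks the hinge loss for small δ (an illustrative sketch):

```python
import numpy as np

def smoothed_svm_loss(u, delta):
    """The piecewise loss L_delta(u) from the slide."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 1 + delta, 0.0,
           np.where(u <= 1 - delta, 1.0 - u,
                    (u - (1 + delta)) ** 2 / (4 * delta)))

u = np.linspace(-1.0, 2.0, 7)
hinge = np.maximum(0.0, 1.0 - u)
print(np.max(np.abs(smoothed_svm_loss(u, 0.01) - hinge)))  # at most delta/4
```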

32 Theorem (finite exact smoothing of SVM) With the training data K and y given, suppose α^SVM and α^δ are the unique solutions of the following problems:
α^SVM = argmin_{α ∈ R^n} [ (1/n) Σ_{i=1}^n L(y_i K_i^T α) + λ α^T K α ],
α^δ = argmin_{α ∈ R^n} [ (1/n) Σ_{i=1}^n L_δ(y_i K_i^T α) + λ α^T K α ].
Then there exists a small δ₀ > 0 such that α^δ = α^SVM whenever δ < δ₀.
The exact SVM solution is virtually attained before δ reaches 0. Define a sequence δ^{(d+1)} = r δ^{(d)} with 0 < r < 1. We solve α̂_{δ^{(d)}} sequentially and terminate the algorithm when some α̂_{δ^{(d)}} satisfies the KKT conditions of the SVM problem. 27/37

33 Smoothed SVM:
min_{α ∈ R^n} Q^δ(α) = min_{α ∈ R^n} [ (1/n) Σ_{i=1}^n L_δ(y_i K_i^T α) + λ α^T K α ].
Accelerated proximal gradient descent update:
α^{(t+1)} = argmin_{α ∈ R^n} [ λ α^T K α + (1/(4nδ)) ‖ α − ᾱ^{(t)} + 2δ ∇l_δ(ᾱ^{(t)}) ‖²_2 ]
= ᾱ^{(t)} − ( 2λK + (1/(2nδ)) KK )^{-1} ( ∇l_δ(ᾱ^{(t)}) + 2λ K ᾱ^{(t)} ),
ᾱ^{(t)} = α^{(t)} + ((r_t − 1)/r_{t+1}) (α^{(t)} − α^{(t−1)}), with r_1 = 1 and r_{t+1} = (1 + √(1 + 4 r_t²))/2. 28/37

34 Algorithm 1 Magic SVM
Require: y, K, λ, and r (e.g., r = 2/3).
1: Initialize δ. Define L_δ. Initialize each α^{[-v]}.
2: repeat
3: Compute P_δ^{-1}(K) = (2λK + (1/(2nδ)) KK)^{-1}.
4: for v = 1, ..., n do
5: Let ỹ_i = y_i if i ≠ v, and ỹ_v = 0.
6: repeat
7: Compute z, with z_i = ỹ_i L'_δ(ỹ_i K_i^T α^{[-v]})/n.
8: α^{[-v]} ← α^{[-v]} − P_δ^{-1}(K) (K z + 2λK α^{[-v]}).
9: until the convergence condition is met.
10: end for
11: Update δ ← rδ.
12: until the KKT condition check of the SVM is passed.
The complexity of regular LOOCV for the SVM is O(n^4); the complexity of the entire magic LOOCV SVM is O(n^3). 29/37
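A condensed Python sketch of Algorithm 1 (illustrative only, not the released magicsvm implementation): it follows the slide's update with P_δ(K) = 2λK + KK/(2nδ), but uses a fixed number of inner iterations instead of a convergence check, no acceleration, and a fixed number of δ reductions instead of the KKT stopping rule.

```python
import numpy as np

def smoothed_svm_grad(u, delta):
    """Derivative of the smoothed SVM loss L_delta."""
    return np.where(u >= 1 + delta, 0.0,
           np.where(u <= 1 - delta, -1.0, (u - (1 + delta)) / (2 * delta)))

def magic_loocv_svm(K, y, lam, delta0=0.5, r=2/3, n_delta=10, inner_iters=50):
    """Compute all n leave-one-out kernel SVM fits via the magic CV formula (sketch)."""
    n = len(y)
    alphas = np.zeros((n, n))                     # one alpha^{[-v]} per held-out point
    delta = delta0
    for _ in range(n_delta):                      # delta <- r * delta continuation
        P_inv = np.linalg.inv(2 * lam * K + K @ K / (2 * n * delta))
        for v in range(n):
            y_t = y.astype(float).copy()
            y_t[v] = 0.0                          # magic CV: zero out the held-out label
            a = alphas[v]
            for _ in range(inner_iters):
                z = y_t * smoothed_svm_grad(y_t * (K @ a), delta) / n
                a = a - P_inv @ (K @ z + 2 * lam * K @ a)
            alphas[v] = a
        delta *= r
    f_loo = np.array([K[v] @ alphas[v] for v in range(n)])   # held-out decision values
    return np.mean(np.sign(f_loo) != y), alphas               # LOOCV misclassification error
```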

35 The algorithm can be generalized to other variants of CV, such as V-fold CV, delete-v CV, etc. The algorithm can also be generalized to other kernel machines, e.g., logistic regression, the squared SVM, and the Huber SVM. 30/37

36 Simulation Define µ_+ = (1, ..., 1, 0, ..., 0) and µ_− = (0, ..., 0, 1, ..., 1).
Positive class: Σ_{k=1}^{10} 0.1 N(µ_{k+}, 4I), where µ_{k+} ~ N(µ_+, I).
Negative class: Σ_{k=1}^{10} 0.1 N(µ_{k−}, 4I), where µ_{k−} ~ N(µ_−, I).
[Figure: ratios of the run time without the magic CV to the run time with the magic CV, plotted against the sample size, for p = 0.2n and p = 0.5n.] 31/37
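A sketch of this data-generating process in Python; the dimension p and the half/half split of ones and zeros in µ_+ and µ_− are illustrative guesses, since the slide does not give them.

```python
import numpy as np

def simulate(n_per_class, p, seed=0):
    """Each class is a 10-component Gaussian mixture with equal weights 0.1 and covariance 4I."""
    rng = np.random.default_rng(seed)
    mu_pos = np.r_[np.ones(p // 2), np.zeros(p - p // 2)]
    mu_neg = np.r_[np.zeros(p // 2), np.ones(p - p // 2)]
    X, y = [], []
    for mu, label in [(mu_pos, +1), (mu_neg, -1)]:
        centers = rng.multivariate_normal(mu, np.eye(p), size=10)   # mu_k ~ N(mu, I)
        comp = rng.integers(0, 10, size=n_per_class)                # pick a component per point
        X.append(centers[comp] + 2.0 * rng.standard_normal((n_per_class, p)))  # N(mu_k, 4I)
        y.append(np.full(n_per_class, label))
    return np.vstack(X), np.concatenate(y)

X, y = simulate(n_per_class=100, p=20)
```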

37 [Table: run time (seconds) and objective value, with standard errors in parentheses, for magicsvm, kernlab, and e1071 over several (n, p) settings of the simulation.] 32/37

38 Magic CV in kernel learning theory

39 Theorem (Bounding the expectation of the V-fold CV error) Suppose each observation in T_n = {(x_i, y_i)}_{i=1}^n is sampled from the same distribution, and the ith data point is allocated to the τ(i)th set. Let B = sup_x K(x, x) and Λ = sup_u |L'(u)|. Then the expectation of the V-fold CV error, 2 < V ≤ n, satisfies
E_{T_n} err_{V-CV} ≤ E_{T_n} err(f̂_λ) + B Λ² / (2λV).
V-fold CV error: err_{V-CV} = (1/n) Σ_{v=1}^V Σ_{i: τ(i)=v} L(y_i f̂^{[-v]}_λ(x_i)).
Training error: err(f̂_λ) = (1/n) Σ_{i=1}^n L(y_i f̂_λ(x_i)). 33/37

40 Corollary (Generalization error bound for the kernel SVM) Suppose each observation in T_n = {(x_i, y_i)}_{i=1}^n is sampled from the same distribution. Define f̃ = argmin_{f ∈ H_K} Err(f). Then
E_{T_n} Err(f̂_λ) ≤ Err(f̃) + λ ‖f̃‖²_{H_K} + 1/(2λn).
Gaussian kernel: B = sup_x K(x, x) = 1; SVM: Λ = sup_u |L'(u)| = 1. 34/37

41 Theorem (Bounding the variance of the V-fold CV) Suppose each observation in T_n = {(x_i, y_i)}_{i=1}^n is sampled from the same distribution, and the ith data point is allocated to the τ(i)th set. Let B = sup_x K(x, x) and Λ = sup_u |L'(u)|. Then
Var_{T_n}(err_LOOCV) ≤ (1/n) (1 + 4/λ). 35/37

42 Thank You
