Trading Convexity for Scalability

1 Trading Convexity for Scalability
Léon Bottou
Ronan Collobert, Fabian Sinz, Jason Weston
NEC Laboratories of America

2 A Word About Convexity
Advantages of Convexity:
- There is a unique solution
- We know plenty of good algorithms for convex programming
- Makes theory easier
Drawback:
- Who said convex problems are the only interesting ones? (except for lazy mathematicians)

3 Summary
I. Non-Convex SVMs: faster and sparser than convex SVMs!
II. Fast Transductive SVMs: faster than any TSVM implementation, especially convex TSVM approaches in noisy conditions

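4 Support Vector Machines
Decision function: $\hat{y}(x) = w \cdot \Phi(x) + b$
Primal formulation: $\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_i H_1[\, y_i \hat{y}(x_i) \,]$, with the Hinge Loss $H_s(z) = (s - z)_+ = \max(0, s - z)$
Dual formulation: $\min_\alpha \; G(\alpha) = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, \Phi(x_i) \cdot \Phi(x_j) - \sum_i y_i \alpha_i$ s.t. $\sum_i \alpha_i = 0$ and $0 \le y_i \alpha_i \le C$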
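As a small illustration of the primal formulation, the sketch below evaluates the hinge loss and the primal objective for a linear model. It assumes Φ is the identity map, and the function names and toy data are hypothetical, not taken from the slides.

```python
def hinge(z, s=1.0):
    """Hinge loss H_s(z) = max(0, s - z)."""
    return max(0.0, s - z)


def decision(w, b, x):
    """Decision function y_hat(x) = w . x + b (assumption: Phi is the identity)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def primal_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum_i H_1(y_i * y_hat(x_i))."""
    reg = 0.5 * sum(wi * wi for wi in w)
    loss = sum(hinge(yi * decision(w, b, xi)) for xi, yi in zip(X, y))
    return reg + C * loss


# Toy data: the third point lies on the wrong side and pays a hinge cost of 1.5.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, -1]
value = primal_objective([1.0, 0.0], 0.0, X, y, C=1.0)
print(value)
```

With these numbers the regularizer contributes 0.5 and only the misclassified point contributes to the loss term.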

5 Part I Non-Convex SVMs

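6 SVM Known Problem
- The number of SVs increases linearly with L (Steinwart, 2004)
- The cost attributed to one example (x, y) is $C\,H_1[\, y\,\hat{y}(x) \,]$
[Figure: hinge loss $H_1(z)$ plotted against $z$.]
- Given $z = y\,\hat{y}(x)$, every example with $z < 1$ pays a cost and becomes an SV
- Outliers are SVs
- Our decision function is expressed with garbage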

7 Non-Convex SVMs
Ramp Loss $R_s(z)$ (Neural Networks) (Mason, 2000) (Shen, 2003)
[Figure: ramp loss $R_s(z)$ plotted against $z$, with the breakpoint $s$ marked.]
Examples lying in the flat areas of the loss cannot be SVs

8 The Concave-Convex Procedure (CCCP)
- Consider a cost function $J(\theta)$
- Decompose into a convex part and a concave part: $J(\theta) = J_{vex}(\theta) + J_{cav}(\theta)$
- Iterative algorithm (Le Thi, 1994) (Yuille, 2003): $\theta^{t+1} = \arg\min_\theta \{\, J_{vex}(\theta) + J'_{cav}(\theta^t) \cdot \theta \,\}$
- $J(\theta^t)$ is guaranteed to decrease at each iteration
- Converges to a local minimum
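A one-dimensional toy problem makes the CCCP iteration concrete. The sketch below assumes the hypothetical cost J(θ) = θ⁴ − 2θ², split as J_vex = θ⁴ and J_cav = −2θ²; this example is not from the slides. Linearizing the concave part at θ_t gives the convex subproblem θ⁴ − 4θ_t·θ, whose minimizer is the cube root of θ_t.

```python
import math


def J(theta):
    """Toy cost: convex part theta^4 plus concave part -2 * theta^2."""
    return theta ** 4 - 2.0 * theta ** 2


def cccp_step(theta_t):
    """One CCCP step: minimize theta^4 + J_cav'(theta_t) * theta.

    J_cav'(theta_t) = -4 * theta_t, so setting the derivative to zero gives
    4 * theta^3 = 4 * theta_t, i.e. theta = cbrt(theta_t).
    """
    return math.copysign(abs(theta_t) ** (1.0 / 3.0), theta_t)


def cccp(theta0, iters=50):
    """Run CCCP from theta0 and return the whole trajectory."""
    trace = [theta0]
    for _ in range(iters):
        trace.append(cccp_step(trace[-1]))
    return trace


trace = cccp(0.1)
costs = [J(t) for t in trace]
print(trace[-1], costs[-1])
```

Starting from a positive value the iterates approach the local minimum at θ = 1, and the cost never increases along the way. A negative start would reach θ = −1 instead, which shows how the result depends on initialization.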

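9 Ramp Algebra
$R_s(z) = H_1(z) - H_s(z)$
[Figure: the ramp loss shown as the difference of two hinge losses, each plotted against $z$.]
$J^s(w, b) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{L} R_s[\, y_i \hat{y}(x_i) \,]$
$\phantom{J^s(w, b)} = \underbrace{\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{L} H_1[\, y_i \hat{y}(x_i) \,]}_{\text{Convex}} \; \underbrace{- \, C \sum_{i=1}^{L} H_s[\, y_i \hat{y}(x_i) \,]}_{\text{Concave}}$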
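The decomposition of the ramp into two hinges can be checked numerically. The sketch below builds the ramp loss as H_1 − H_s and compares it with the equivalent clipped form min(1 − s, max(0, 1 − z)), valid for s < 1; the value s = −1 is only an illustrative choice.

```python
def hinge(z, s=1.0):
    """Hinge loss H_s(z) = max(0, s - z)."""
    return max(0.0, s - z)


def ramp(z, s=-1.0):
    """Ramp loss as a difference of hinges: R_s(z) = H_1(z) - H_s(z)."""
    return hinge(z, 1.0) - hinge(z, s)


def ramp_clipped(z, s=-1.0):
    """Same loss written as a hinge clipped at height 1 - s (assumes s < 1)."""
    return min(1.0 - s, max(0.0, 1.0 - z))


for z in [-5.0, -1.0, 0.0, 0.5, 1.0, 2.0]:
    print(z, ramp(z), ramp_clipped(z))
```

The loss is flat at 1 − s for z ≤ s and flat at 0 for z ≥ 1, which is where examples stop being support vectors.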

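10 The Algorithm
1. Initialize $\beta = 0$ and choose $s$
2. Minimize $G(\alpha) = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, \Phi(x_i) \cdot \Phi(x_j) - \sum_i y_i \alpha_i$ with $\sum_i \alpha_i = 0$ and $-\beta_i \le y_i \alpha_i \le C - \beta_i$
3. Update $w$ and $b$
4. Update $\beta$: $\beta_i = C$ if $y_i \hat{y}(x_i) < s$, $0$ otherwise
5. Go back to step 2 until convergence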
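The loop can be sketched on a tiny problem. The code below is not the dual solver of the slide: it assumes a one-dimensional linear model without bias, works in the primal, and solves each convex subproblem by ternary search. In the primal, linearizing the concave part adds the term Σ β_i y_i ŷ(x_i) to the hinge objective. All names and data are hypothetical.

```python
def hinge(z, s=1.0):
    return max(0.0, s - z)


def minimize_convex_1d(g, lo=-10.0, hi=10.0, iters=200):
    """Ternary search for the minimizer of a strictly convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) <= g(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def cccp_ramp_svm_1d(xs, ys, C=1.0, s=-1.0, max_rounds=20):
    """CCCP for the ramp-loss SVM with y_hat(x) = w * x (assumption: no bias)."""
    beta = [0.0] * len(xs)          # step 1: beta = 0, so round one is a hinge SVM
    w = 0.0
    for _ in range(max_rounds):
        def g(w_):
            # Convex subproblem: hinge objective plus linearized concave part.
            reg = 0.5 * w_ * w_
            convex = C * sum(hinge(y * w_ * x) for x, y in zip(xs, ys))
            linear = sum(b * y * w_ * x for b, x, y in zip(beta, xs, ys))
            return reg + convex + linear
        w = minimize_convex_1d(g)   # steps 2 and 3
        # Step 4: beta_i = C for examples whose margin fell below s.
        new_beta = [C if y * w * x < s else 0.0 for x, y in zip(xs, ys)]
        if new_beta == beta:        # step 5: stop once beta is stable
            break
        beta = new_beta
    return w, beta


# Four clean points plus one mislabeled outlier at x = 3.
xs = [1.0, 2.0, -1.0, -2.0, 3.0]
ys = [1, 1, -1, -1, -1]
w, beta = cccp_ramp_svm_1d(xs, ys)
print(w, beta)
```

On this data the first round is the ordinary hinge SVM, pulled toward w = 0.5 by the outlier. The outlier's margin then falls below s, its β becomes C, and the second round returns w = 1, the solution the clean points alone would give.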

11 Raw Results
Dataset    Train  Test  Notes
Waveform                Artificial data, 21 dims.
Banana                  Artificial data, 2 dims.
USPS+N                  Class vs. rest, with 1% training label noise.
Adult                   As in (Platt, 1999).
1 raetsch/data/index.html
2 ftp://ftp.ics.uci.edu/pub/machine-learning-databases

           SVM H_1          SVM R_s
Dataset    Error   SV       Error   SV
Waveform   8.8%                     865
Banana     9.5%                     891
USPS+N     .5%                      61
Adult      15.1%                    4588
All results are averaged using 1 random train-test splits

12 Speedup: Test vs Initial Training Set Size
We do not need to initialize CCCP using the full dataset.
[Figure: test error (%) and number of SVs vs. initial set size (% of training set size), with the SVM H_1 test error and number of SVs as reference; panels USPS+N and Adult.]

13 Convex vs Non-Convex SVMs
[Figure: time (s) and number of SVs for SVM H_1 and SVM R_s on USPS+N and Adult.]

14 Details on USPS+N
[Figure: number of support vectors vs. number of training examples, and testing error (%) vs. number of support vectors, for SVM H_1 and SVM R_s.]

15 Details on Adult
[Figure: number of support vectors vs. number of training examples, and testing error (%) vs. number of support vectors, for SVM H_1 and SVM R_s.]

16 Objective Function vs Iterations
[Figure: objective function vs. iterations for SVM R_s; panels USPS+N and Adult.]
Fast convergence of the CCCP procedure

17 Part II Transductive SVMs

18 Transductive SVMs

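19 Losses for Transduction
- $(x_i, y_i)$, $1 \le i \le L$: labeled examples; $(x_i)$, $L+1 \le i \le L+U$: unlabeled examples
- Cost to be minimized: $J(\theta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{L} H_1[\, y_i \hat{y}(x_i) \,] + C^* \sum_{i=L+1}^{L+U} J_U[\, \hat{y}(x_i) \,]$
- Possible losses $J_U$ for unlabeled examples
[Figure: three candidate losses $J_U$, each plotted against $z$.]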

20 Ramp Algebra for Transduction
- Loss considered given an unlabeled $x$ and $z = \hat{y}(x)$: $J_U(z) = R_s(z) + R_s(-z)$
- Ramp Loss on unlabeled examples, each appearing twice with both possible labels:
$J(\theta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{L} H_1[\, y_i \hat{y}(x_i) \,] + C^* \sum_{i=L+1}^{L+U} J_U[\, \hat{y}(x_i) \,]$
$\phantom{J(\theta)} = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{L} H_1[\, y_i \hat{y}(x_i) \,] + C^* \sum_{i=L+1}^{L+2U} R_s[\, y_i \hat{y}(x_i) \,]$
- Decompose again the Ramp into two Hinges and apply CCCP
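The equivalence between the symmetric loss on unlabeled points and the ramp loss on a doubled set can be checked directly. The sketch below assumes s = −0.3 and some made-up decision values; `duplicate_with_both_labels` is a hypothetical helper name.

```python
def hinge(z, s=1.0):
    return max(0.0, s - z)


def ramp(z, s):
    """R_s(z) = H_1(z) - H_s(z)."""
    return hinge(z, 1.0) - hinge(z, s)


def unlabeled_loss(z, s):
    """J_U(z) = R_s(z) + R_s(-z): symmetric in z, largest near the boundary."""
    return ramp(z, s) + ramp(-z, s)


def duplicate_with_both_labels(outputs):
    """Pair every unlabeled output with label +1, then again with label -1."""
    return [(+1, f) for f in outputs] + [(-1, f) for f in outputs]


s = -0.3
outputs = [2.0, -0.1, 0.0, -1.5]   # illustrative values of y_hat(x) on unlabeled points

total_symmetric = sum(unlabeled_loss(f, s) for f in outputs)
total_duplicated = sum(ramp(y * f, s) for y, f in duplicate_with_both_labels(outputs))
print(total_symmetric, total_duplicated)
```

Both sums agree, so the transductive cost can reuse the ramp machinery on L + 2U examples. The loss is highest at z = 0 and flat for |z| ≥ 1, which pushes the decision boundary away from unlabeled points.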

21 Balancing Constraint
- Transductive SVMs fail without balancing constraint (Joachims, 1999)
- Constraint (Chapelle & Zien, 2005): $\frac{1}{U} \sum_{i=L+1}^{L+U} \hat{y}(x_i) = \frac{1}{L} \sum_{i=1}^{L} y_i$
- CCCP remains valid
- Extra example: $\Phi(x_0) = \frac{1}{U} \sum_{i=L+1}^{L+U} \Phi(x_i)$
- Efficiency: compute the kernel column only once, $\Phi(x_0) \cdot \Phi(x_j) = \frac{1}{U} \sum_{i=L+1}^{L+U} \Phi(x_i) \cdot \Phi(x_j) \;\; \forall j$
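The kernel column of the extra example is the average of the unlabeled kernel columns. The sketch below computes it for an arbitrary kernel function and, for a linear kernel only, checks it against the kernel evaluated at the mean point. Function names and data are hypothetical.

```python
def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def extra_example_column(unlabeled, points, kernel):
    """K(x_0, x_j) = (1/U) * sum_i K(x_i, x_j), i over the unlabeled set, for each x_j."""
    U = len(unlabeled)
    return [sum(kernel(xi, xj) for xi in unlabeled) / U for xj in points]


unlabeled = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
points = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

column = extra_example_column(unlabeled, points, linear_kernel)

# With a linear kernel, Phi is the identity and x_0 is simply the mean point.
mean_point = [sum(coords) / len(unlabeled) for coords in zip(*unlabeled)]
direct = [linear_kernel(mean_point, xj) for xj in points]
print(column, direct)
```

Because ŷ is linear in Φ(x), the average output over the unlabeled points equals ŷ(x_0), so the balancing constraint becomes a condition on a single extra example. The column costs one pass over the unlabeled kernel values and can be cached.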

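22 The Algorithm
1. Initialize $w$ and $b$ with the SVM solution
2. Choose $s$, initialize $\beta$ as in (1), set $\xi_i = y_i$ ($i \ne 0$) and $\xi_0 = \frac{1}{L} \sum_{j=1}^{L} y_j$
3. Minimize $G(\alpha) = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, \Phi(x_i) \cdot \Phi(x_j) - \sum_i \xi_i \alpha_i$ with $\sum_i \alpha_i = 0$ and $-\beta_i \le y_i \alpha_i \le C - \beta_i$
4. Update $w$ and $b$
5. Update $\beta$: $\beta_i = C$ if $y_i \hat{y}(x_i) < s$ and $i \ge L+1$, $0$ otherwise   (1)
6. Go back to step 3 until convergence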
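The bookkeeping in steps 2 and 5 can be sketched as follows. The indexing is an assumption made for illustration: index 0 is the extra balancing example, 1..L are labeled, L+1..L+U are the unlabeled points with label +1, and L+U+1..L+2U are the same points with label −1. The helper names and the decision values are hypothetical.

```python
def build_xi(labels, U):
    """xi_0 = mean label; xi_i = y_i for i >= 1 (unlabeled points get +1, then -1)."""
    L = len(labels)
    xi_0 = sum(labels) / L
    return [xi_0] + list(labels) + [1] * U + [-1] * U


def update_beta(y, y_hat, L, C, s):
    """Rule (1): beta_i = C if y_i * y_hat(x_i) < s and i >= L + 1, else 0.

    y and y_hat hold entries for i = 1 .. L + 2U (list position k is i = k + 1).
    """
    beta = []
    for k, (yi, fi) in enumerate(zip(y, y_hat)):
        i = k + 1
        beta.append(C if (yi * fi < s and i >= L + 1) else 0.0)
    return beta


labels = [1, -1, 1]
U = 2
xi = build_xi(labels, U)
y = xi[1:]                                  # labels for i >= 1

outputs_labeled = [0.5, -2.0, -3.0]         # illustrative y_hat values
outputs_unlabeled = [2.0, -0.1]
y_hat = outputs_labeled + outputs_unlabeled + outputs_unlabeled

beta = update_beta(y, y_hat, L=len(labels), C=1.0, s=-0.3)
print(xi, beta)
```

Only the unlabeled copy whose assumed label strongly disagrees with the current output receives β = C. The third labeled point also has a margin below s but keeps β = 0, since labeled examples stay under the plain hinge loss.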

23 Raw Results
[Table: data sets g50c, Coil20, Text, Uspst, described by classes, dims, points, labeled.]
[Table: results on Coil20, g50c, Text, Uspst for SVM, SVMLight-TSVM, ∇TSVM, CCCP-TSVM (with $s = 0$ and $U C^* = L C$), and CCCP-TSVM.]
As in (Chapelle & Zien, 2005)

24 CCCP-TSVM vs SVMLight vs ∇TSVM
[Figure: time (secs) vs. number of unlabeled examples for SVMLight TSVM, ∇TSVM, and CCCP TSVM; panels g50c and Text.]

25 Large Scale Datasets: Reuters and MNIST
[Figure: test error vs. number of unlabeled examples; panels Reuters RCV1 and MNIST.]

26 Large Scale Datasets: Scaling
[Figure: optimization time vs. number of unlabeled examples for CCCP TSVM with a quadratic fit; panels Reuters RCV1 (time in seconds, unlabeled examples in thousands) and MNIST (time in hours).]
Quadratic Tendency

27 Conclusion
I. Two non-convex algorithms with advantages over convex alternatives
II. CCCP is one good way to handle non-convex problems
III. Why limit ourselves to convex algorithms?
