CTJLSVM: Componentwise Triple Jump Acceleration for Training Linear SVM
Han-Shen Huang, Porter Chang (Ker2), and Chun-Nan Hsu
AI for Investigating Anti-cancer solutions (AIIA Lab), Institute of Information Science, Academia Sinica, Taipei, Taiwan
ICML 2008 Large Scale Learning Workshop, July 9, 2008
Outline
1. Triple Jump Extrapolation
2. Application to Linear SVM
3. Implementation
4. Discussion

Section 1: Triple Jump Extrapolation
Aitken's Acceleration

Fixed-point iteration: solve w^* = M(w^*), w \in R^d, by iteratively substituting the output of M in the previous iteration as its input:

  w^{(t+1)} = M(w^{(t)}), \quad w^{(t+2)} = M(w^{(t+1)}), \ldots

until w^* = M(w^*). Examples: EM, Generalized Iterative Scaling, gradient descent, etc.

Aitken's acceleration: apply a linear Taylor expansion of M around w^*:

  w^{(t+1)} = M(w^{(t)}) \approx w^* + M'(w^*)(w^{(t)} - w^*) = w^* + J(w^{(t)} - w^*).    (1)

Applying M to w^{(t)} consecutively h times gives w^{(t+h)} \approx w^* + J^h (w^{(t)} - w^*). Supposing eig(J) \subset (-1, 1), summing the geometric series yields the multivariate Aitken's acceleration:

  w^* \approx w^{(t)} + \sum_{h=0}^{\infty} J^h (w^{(t+1)} - w^{(t)}) = w^{(t)} + (I - J)^{-1} (M(w^{(t)}) - w^{(t)}).
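To make the geometric-series identity concrete, the following minimal numpy sketch (ours, not from the slides) applies the multivariate Aitken formula to a toy linear map M(w) = Jw + b whose Jacobian is known exactly; because the map is linear, one Aitken step lands on the fixed point that plain iteration only approaches slowly. The particular J and b are made up for illustration.

    import numpy as np

    # Toy linear fixed-point map M(w) = J w + b with spectral radius < 1.
    J = np.array([[0.9, 0.0],
                  [0.0, 0.5]])
    b = np.array([1.0, 1.0])

    def M(w):
        return J @ w + b

    w_star = np.linalg.solve(np.eye(2) - J, b)       # exact fixed point of M

    w = np.zeros(2)
    for _ in range(10):                              # plain fixed-point iteration
        w = M(w)
    print("error after 10 plain iterations:", np.linalg.norm(w - w_star))

    # One multivariate Aitken step from the same starting point:
    w0 = np.zeros(2)
    w_aitken = w0 + np.linalg.solve(np.eye(2) - J, M(w0) - w0)
    print("error after one Aitken step:   ", np.linalg.norm(w_aitken - w_star))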
Triple Jump Extrapolation

Approximating the Jacobian: since it is prohibitively expensive to compute J, starting from J(w^{(t)} - w^{(t-1)}) \approx M(w^{(t)}) - w^{(t)}, replace J with a scalar

  \gamma^{(t)} := \frac{\|M(w^{(t)}) - w^{(t)}\|}{\|w^{(t)} - w^{(t-1)}\|}    (2)

to obtain

  w^{(t+1)} = w^{(t)} + (1 - \gamma^{(t)})^{-1} (M(w^{(t)}) - w^{(t)}).    (3)

Algorithm Triple Jump
 1: Initialize w^{(0)}, t <- 0
 2: repeat in iteration t
 3:   if mod(t, 2) == 0 then
 4:     hop: w^{(t+1)} <- M(w^{(t)})
 5:   else
 6:     step: apply M again to obtain M(w^{(t)})
 7:     jump: apply Equation (3) to obtain w^{(t+1)}
 8:   end if
 9:   t <- t + 1
10: until convergence
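The hop/step/jump schedule can be written as a short driver around any fixed-point mapping. The sketch below is our illustration, not the authors' code: mapping is any function w -> M(w), the tolerance test is a simple convergence check we added, and the scalar gamma follows the norm-ratio estimate of Equation (2).

    import numpy as np

    def triple_jump(mapping, w0, max_iter=100, tol=1e-10):
        """Triple-jump acceleration of a fixed-point iteration with a scalar gamma (Eqs. 2-3)."""
        w_prev, w = w0, mapping(w0)                           # iteration t = 0: hop
        for t in range(1, max_iter):
            if np.linalg.norm(w - w_prev) < tol:              # simple convergence check
                break
            if t % 2 == 0:                                    # hop
                w_prev, w = w, mapping(w)
            else:                                             # step, then jump
                Mw = mapping(w)                               # step: apply M once more
                gamma = np.linalg.norm(Mw - w) / np.linalg.norm(w - w_prev)
                w_prev, w = w, w + (Mw - w) / (1.0 - gamma)   # jump, Equation (3)
        return w

    # Example on the toy linear map from the previous sketch:
    # w_hat = triple_jump(M, np.zeros(2))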
Triple Jump GIS

[Figure: penalized log-likelihood versus number of forward-backward evaluations on an independent data set, comparing CTJPGIS, TJPGIS, and PGIS.]
Global and Componentwise Extrapolation

Eigenvalue approximation and global rate of convergence: another way to derive the triple jump method is to approximate J with its eigenvalue. The global rate of convergence of M is

  R = \lim_{t \to \infty} R^{(t)} := \lim_{t \to \infty} \frac{\|w^{(t+1)} - w^*\|}{\|w^{(t)} - w^*\|} = \lambda_{\max} \approx \frac{\|w^{(t+1)} - w^{(t)}\|}{\|w^{(t)} - w^{(t-1)}\|}.

That R equals \lambda_{\max} of J was established by Dempster et al. (1977) when M is EM. In practice w^* is unknown; replacing it with empirical values leads to Equation (2).

Componentwise rate of convergence: the i-th componentwise rate of convergence of M is defined as

  R_i = \lim_{t \to \infty} R_i^{(t)} := \lim_{t \to \infty} \frac{|w_i^{(t+1)} - w_i^*|}{|w_i^{(t)} - w_i^*|} = \lambda_i \approx \frac{|w_i^{(t+1)} - w_i^{(t)}|}{|w_i^{(t)} - w_i^{(t-1)}|}.
Componentwise Triple Jump

Componentwise extrapolation: we can estimate the i-th eigenvalue \lambda_i similarly by

  \gamma_i^{(t)} := \frac{M(w^{(t)})_i - w_i^{(t)}}{w_i^{(t)} - w_i^{(t-1)}}, \quad \forall i,    (4)

and extrapolate at each dimension i by

  w_i^{(t+1)} = w_i^{(t)} + (1 - \gamma_i^{(t)})^{-1} (M(w^{(t)})_i - w_i^{(t)}).    (5)

Global vs. componentwise: when R_i \neq R, i.e., the componentwise rates differ from the global rate, componentwise extrapolation should be preferred (Schafer 1997; Hsu et al. 2006).
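A minimal numpy sketch of the componentwise jump of Equations (4)-(5), written by us for illustration; the epsilon guard on the denominator and the optional clip argument (anticipating the kappa bound introduced later) are our additions, not something stated on this slide.

    import numpy as np

    def componentwise_jump(w_prev, w, Mw, clip=None, eps=1e-12):
        """One componentwise triple-jump extrapolation (Equations 4-5).

        w_prev, w : previous and current iterates
        Mw        : the mapping applied once more, i.e. M(w)
        clip      : optional upper bound on each gamma_i (the kappa of later slides)
        eps       : guards against zero denominators (our addition)
        """
        d = np.where(np.abs(w - w_prev) > eps, w - w_prev, eps)
        gamma = (Mw - w) / d                        # Equation (4)
        if clip is not None:
            gamma = np.minimum(clip, gamma)         # prevents erratic extrapolation
        return w + (Mw - w) / (1.0 - gamma)         # Equation (5)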
Section 2: Application to Linear SVM
SGD for Linear SVM

Decomposing the primal objective function for linear SVM:

  P(w; X) = \sum_{i=1}^{m} P(w; x_i) = \sum_{i=1}^{m} \left[ C \max(0, 1 - y_i w \cdot x_i) + \frac{1}{2m} \|w\|^2 \right]

Gradient Descent (off-line)
1: Initialize w^{(0)}
2: repeat in iteration t
3:   w^{(t+1)} <- w^{(t)} - \eta \nabla P(w^{(t)}; X)
4:   update \eta
5: until convergence

SGD (on-line)
1: Initialize w^{(0)}
2: repeat in iteration t
3:   w^{(t+1)} <- w^{(t)} - \eta \nabla P(w^{(t)}; x_i)
4:   update \eta
5: until convergence
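To fix ideas for the next slide, here is a sketch of one SGD epoch over the decomposed objective above, treated as the mapping M. It is written from the formula on this slide rather than taken from the authors' implementation; the random shuffling and the dense update are our simplifications, and the sparsity trick mentioned later is not shown.

    import numpy as np

    def sgd_epoch(w, X, y, C, eta, rng=None):
        """One SGD pass over P(w; x_i) = C * max(0, 1 - y_i w.x_i) + ||w||^2 / (2m)."""
        if rng is None:
            rng = np.random.default_rng(0)
        m = X.shape[0]
        for i in rng.permutation(m):
            grad = w / m                            # gradient of the ||w||^2 / (2m) term
            if y[i] * X[i].dot(w) < 1.0:            # hinge loss is active
                grad = grad - C * y[i] * X[i]
            w = w - eta * grad
        return w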
Accelerating SGD with CTJ

By treating one SGD epoch as the mapping M, we can apply CTJ (componentwise triple jump) to accelerate SGD for training linear SVM. CTJ takes only O(d) more time per iteration than SGD, and the sparsity trick for SGD can still be applied here.

Algorithm CTJLSVM
 1: Initialize w^{(0)}, t <- 0, \kappa
 2: repeat in iteration t
 3:   if mod(t, 2) == 0 then
 4:     hop: w^{(t+1)} <- M(w^{(t)})
 5:   else
 6:     step: apply M again to obtain M(w^{(t)})
 7:     compute \gamma^{(t)} by Equation (4)
 8:     \gamma_i^{(t)} <- min(\kappa, \gamma_i^{(t)}) for all i, to prevent erratic extrapolation
 9:     jump: apply the componentwise extrapolation of Equation (5) to obtain w^{(t+1)}
10:   end if
11:   update \eta
12:   t <- t + 1
13: until convergence
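Putting the pieces together, the sketch below wires the sgd_epoch and componentwise_jump helpers from the earlier sketches into the CTJLSVM loop. It is our reconstruction of the pseudocode, not the released code: the fixed number of epochs and the untouched eta are placeholders, since the actual step-size schedule and stopping rule are described on the Implementation slides.

    import numpy as np

    def ctjlsvm(X, y, C, eta0=0.1, kappa=0.7, max_epochs=40):
        """Componentwise triple-jump acceleration of SGD for linear SVM (sketch)."""
        rng = np.random.default_rng(0)
        eta = eta0
        w_prev = np.zeros(X.shape[1])
        w = sgd_epoch(w_prev, X, y, C, eta, rng)              # iteration t = 0: hop
        for t in range(1, max_epochs):
            if t % 2 == 0:                                    # hop
                w_prev, w = w, sgd_epoch(w, X, y, C, eta, rng)
            else:                                             # step, then componentwise jump
                Mw = sgd_epoch(w, X, y, C, eta, rng)
                w_prev, w = w, componentwise_jump(w_prev, w, Mw, clip=kappa)
            # "Update eta" and the convergence test are placeholders here; the
            # settings actually used appear on the Implementation slides.
        return w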
Section 3: Implementation
Parameter Settings

- \kappa: upper bound for the eigenvalue estimates. Since we did not implement convergence checking, we set \kappa = 0.7, a conservative setting considering that eigenvalues can be close to 1.
- \eta^{(0)}: initial step size = 0.1.
- Mini-batch size for SGD is always 1.
Stopping Condition

Since this is not a dual method, we have to design our own stopping condition. Initial condition:

  \frac{P(w^{(t-1)}; X) - P(w^{(t)}; X)}{P(w^{(t-1)}; X)} < \delta

If this condition is satisfied 2 times successively, then decrease \eta by a factor of 0.1. Repeat until \eta has been decreased \tau = 7 times. A sketch of this rule follows.
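As a rough illustration of how the rule above could wrap a training loop: step stands for one training iteration (e.g., an SGD epoch or a CTJ hop/step/jump pair), objective evaluates P(w; X), and the delta value below is a placeholder of ours because the threshold on the slide did not survive transcription.

    def train_with_annealing(step, objective, w, eta0=0.1, delta=1e-4, tau=7):
        """Shrink eta by 0.1 whenever the relative improvement stays below delta
        twice in a row; stop after eta has been shrunk tau times (delta is illustrative)."""
        eta, shrinks, stalls = eta0, 0, 0
        prev_obj = objective(w)
        while shrinks < tau:
            w = step(w, eta)
            obj = objective(w)
            stalls = stalls + 1 if (prev_obj - obj) / prev_obj < delta else 0
            if stalls >= 2:                       # condition satisfied 2 times successively
                eta, shrinks, stalls = 0.1 * eta, shrinks + 1, 0
            prev_obj = obj
        return w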
Section 4: Discussion
Key Tricks

CTJLSVM vs. SGD:
- CTJLSVM effectively accelerates SGD.
- CTJLSVM usually achieves a lower objective.
- CTJLSVM's convergence is less sensitive to C.

Comparison:
- CTJLSVM02: CTJLSVM with the same settings as our final Linear-track submissions.
- lsvm04: SGD LSVM with a constant \eta.
- linearSVM03: SGD LSVM with \eta decreased per example by the textbook update rule \eta <- \eta^{(0)} / (1 + t/\lambda).

Results shown are for the Alpha data set.
CTJ accelerates SGD. [Figure]

CTJ takes no extra time. [Figure]

CTJ is less sensitive to C. [Figure]
Performance Tuning

Tuning SGD: the textbook update \eta <- \eta^{(0)} / (1 + t/\lambda) is slow! A fixed \eta makes great progress in the beginning.

Tuning CTJ: how to guarantee convergence? One option is to skip the CTJ jump when it fails to improve the objective, but this check takes time. Instead, we use a smaller \kappa = 0.7; we usually use \kappa = 0.9 for EM and CRF. The upper bound on the extrapolation rate is (1 - \gamma)^{-1} = 3.33 for \kappa = 0.7 and (1 - \gamma)^{-1} = 10.0 for \kappa = 0.9.
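For reference, a tiny sketch of the two knobs discussed above: the textbook step-size schedule and the extrapolation-factor bound implied by kappa. The lambda value below is a made-up example, not a setting used by the authors.

    def eta_textbook(t, eta0=0.1, lam=1e4):
        """Textbook per-example decay eta = eta0 / (1 + t / lambda); lam is illustrative."""
        return eta0 / (1.0 + t / lam)

    def max_jump(kappa):
        """Upper bound (1 - gamma)^{-1} on the extrapolation factor when gamma <= kappa."""
        return 1.0 / (1.0 - kappa)

    print(max_jump(0.7))   # ~3.33: the conservative setting used for CTJLSVM
    print(max_jump(0.9))   # 10.0: the setting usually used for EM and CRF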
Wild vs. Linear Track

- Running convert.py to produce SVMlight format may hurt aoPRC?
- C = 0.5 yields much lower aoPRC than the required C.
- The data sets may not be linearly separable, so a lower objective leads to a higher aoPRC!
- \tau (the number of \eta decreases) in the stopping condition controls the trade-off between objective and aoPRC: a large \tau leads to a lower objective but higher aoPRC; a small \tau leads to lower aoPRC but a higher objective.
- We forgot to reduce \tau for the wild track... OUCH!!
Tricks that we didn't do

We found several tricks that participants could use to take advantage of the scoring rules:
- Compute the objective with a smaller data set to minimize obj.
- Scale C from 10^5 instead of 10^4 to minimize avgtime for C.
- Report worse results for small data sizes to boost Effort.
- Carefully choose T_1...T_10 to minimize autime vs. obj.
- Report results on huge real-world data sets.
CTJLSVM can handle large data

Table: Post-Challenge Test Result of CTJLSVM for OCR — rows for CTJLSVM (T1) and CTJLSVM (converged), with columns Data, Obj., and CPU Time, plus the test rank, the test rank when converged, and the overall test rank; the numeric entries did not survive transcription.

CPU time in seconds, calibrated; T1: first reporting time point. No parameter re-tuning for CTJLSVM.
Final Words

- CTJ is a generic method that can be applied to accelerate fixed-point iteration methods in machine learning, including EM, GIS, SGD, etc.
- Accelerating SGD by CTJ for training linear SVM performs reasonably well.
- A cleverer way to adjust the step size in SGD and the stopping condition would improve CTJLSVM.
- We thank the organizers and hope there will be a next year.
Thank you for your attention!
chunnan

This work is supported in part by the Advanced Bioinformatics Core (ABC), National Research Program for Genomic Medicine (NRPGM), National Science Council, Taiwan. We also wish to thank Yuh-Jye Lee for his useful advice on SVM implementation.