Incremental and Decremental Training for Linear Classification

Incremental and Decremental Training for Linear Classification
Authors: Cheng-Hao Tsai, Chieh-Yen Lin, and Chih-Jen Lin
Department of Computer Science, National Taiwan University
Presenter: Ching-Pei Lee
Aug. 5, 2014

Outline
1 Introduction
2 Analysis on Our Setting
  Primal or Dual Problem
  Optimization Methods
  Implementation Issue
3 Experiments
4 Conclusions

Incremental and Decremental Training
In classification, if a few data points are added or removed, incremental and decremental techniques can be applied to quickly update the model.
Issues of incremental/decremental learning:
- It is not guaranteed to be faster than direct training.
- The framework is complicated.
- Scenarios are application dependent.

Why Linear Classification for Incremental and Decremental Learning?
We focus on linear classification.
Kernel methods:
- The model is a linear combination of training instances, so few optimization choices are available.
- Updating cached kernel elements is complicated.
Linear methods:
- The model is a simple vector.
- Accuracy is comparable on some applications.

Linear Classification
Training instances {(x_i, y_i)}_{i=1}^{l}, x_i \in R^n, y_i \in \{-1, 1\}.
Linear classification:
  \min_w \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi(w; x_i, y_i),
where
  \xi(w; x_i, y_i) = \max(0, 1 - y_i w^T x_i)     for L1-loss SVM,
  \xi(w; x_i, y_i) = \max(0, 1 - y_i w^T x_i)^2   for L2-loss SVM,
  \xi(w; x_i, y_i) = \log(1 + e^{-y_i w^T x_i})   for logistic regression.
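For concreteness, here is a minimal NumPy sketch (not from the paper) that evaluates this primal objective for the three losses; the synthetic data at the end is purely illustrative.

```python
import numpy as np

def primal_objective(w, X, y, C, loss="l2"):
    """Evaluate (1/2) w^T w + C * sum_i xi(w; x_i, y_i) for one of the three losses."""
    margins = y * (X @ w)                 # y_i * w^T x_i for every instance
    if loss == "l1":                      # L1-loss SVM: max(0, 1 - y_i w^T x_i)
        xi = np.maximum(0.0, 1.0 - margins)
    elif loss == "l2":                    # L2-loss SVM: max(0, 1 - y_i w^T x_i)^2
        xi = np.maximum(0.0, 1.0 - margins) ** 2
    elif loss == "lr":                    # logistic regression: log(1 + exp(-y_i w^T x_i))
        xi = np.logaddexp(0.0, -margins)  # numerically stable log(1 + e^{-m})
    else:
        raise ValueError("loss must be 'l1', 'l2', or 'lr'")
    return 0.5 * w @ w + C * xi.sum()

# Tiny synthetic example (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(100))
print(primal_objective(np.zeros(5), X, y, C=1.0, loss="lr"))
```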

Dual Problems
  \max_\alpha \sum_{i=1}^{l} h(\alpha_i, C) - \frac{1}{2} \alpha^T \bar{Q} \alpha
  subject to 0 \le \alpha_i \le U, i = 1, ..., l,
where \bar{Q} = Q + dI, Q_{i,j} = y_i y_j x_i^T x_j, and
  U = C,      d = 0        for L1-loss SVM and LR,
  U = \infty, d = 1/(2C)   for L2-loss SVM.
Further,
  h(\alpha_i, C) = \alpha_i                                                                  for L1-loss and L2-loss SVM,
  h(\alpha_i, C) = C \log C - \alpha_i \log \alpha_i - (C - \alpha_i) \log(C - \alpha_i)     for LR.
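Similarly, a small sketch (an illustration, not LIBLINEAR code) of evaluating this dual objective; only the L1-loss SVM case of h, U, and d is spelled out here.

```python
import numpy as np

def dual_objective_l1svm(alpha, X, y, C):
    """L1-loss SVM dual objective: sum_i alpha_i - 0.5 * alpha^T Qbar alpha,
    where Qbar = Q + d*I, Q_ij = y_i y_j x_i^T x_j, and d = 0 for the L1 loss
    (d = 1/(2C) would be added for the L2 loss); feasibility is 0 <= alpha_i <= U = C."""
    Yx = y[:, None] * X            # rows are y_i * x_i
    Qbar = Yx @ Yx.T               # d = 0 here; use Yx @ Yx.T + np.eye(len(y))/(2*C) for L2-loss
    assert np.all((alpha >= 0) & (alpha <= C)), "alpha must lie in [0, C]"
    return alpha.sum() - 0.5 * alpha @ Qbar @ alpha

# Illustrative call with the feasible point alpha = 0 (dual objective value 0).
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
print(dual_objective_l1svm(np.zeros(3), X, y, C=1.0))
```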

Incremental and Decremental Learning with Warm Start (1/2)
In incremental learning, we assume (x_i, y_i), i = l+1, ..., l+k are the instances added.
In decremental learning, we assume (x_i, y_i), i = 1, ..., k are the instances removed.
We consider a warm-start setting by utilizing the optimal solution of the original problem.

Incremental and Decremental Learning with Warm Start (2/2)
Denote the optimal solutions of the original primal and dual problems by w^* and \alpha^*, respectively.
Primal: we choose w^* as the initial solution for both incremental and decremental learning. We can use the same w^* because the features are unchanged.
Dual:
  [\alpha_1^*, ..., \alpha_l^*, 0, ..., 0]^T   for incremental,
  [\alpha_{k+1}^*, ..., \alpha_l^*]^T          for decremental.
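A minimal sketch of the warm-start construction just described (NumPy-based; function and variable names are illustrative assumptions).

```python
import numpy as np

def warm_start_dual(alpha_star, k, mode):
    """Build the dual initial point from the old optimum alpha_star."""
    if mode == "incremental":            # k new instances appended: start them at 0
        return np.concatenate([alpha_star, np.zeros(k)])
    if mode == "decremental":            # first k instances removed: drop their alphas
        return alpha_star[k:].copy()
    raise ValueError("mode must be 'incremental' or 'decremental'")

def warm_start_primal(w_star):
    """The primal initial point is simply the old optimum w*; features are unchanged."""
    return w_star.copy()
```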

Analysis on Our Setting
Though our warm-start setting is simple, several issues arise when applying warm start to incremental and decremental learning. We investigate the setting from three aspects:
- Solving the primal or the dual problem.
- Choosing optimization methods.
- Implementation issues.

Primal or Dual Problem: Which is Better?
An initial point closer to the optimum is more likely to reduce the running time.
Main finding: the primal initial point is closer to the optimal point than the dual one, so the primal problem is preferred.

Primal Initial Objective Value for Incremental Learning
The new optimization problem is
  \min_w \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi(w; x_i, y_i) + C \sum_{i=l+1}^{l+k} \xi(w; x_i, y_i).
The primal initial objective value is
  \frac{1}{2} {w^*}^T w^* + C \sum_{i=1}^{l} \xi(w^*; x_i, y_i) + C \sum_{i=l+1}^{l+k} \xi(w^*; x_i, y_i).

Dual Initial Objective Value for Incremental Learning
The dual initial objective value is
  \sum_{i=1}^{l} h(\alpha_i^*, C) + \sum_{i=l+1}^{l+k} h(0, C) - \frac{1}{2} \begin{bmatrix} {\alpha^*}^T & 0^T \end{bmatrix} \bar{Q}_{\mathrm{new}} \begin{bmatrix} \alpha^* \\ 0 \end{bmatrix},
where \bar{Q}_{\mathrm{new}} is \bar{Q} enlarged to include the k new instances. Because the terms involving the zero entries vanish, this equals
  \sum_{i=1}^{l} h(\alpha_i^*, C) - \frac{1}{2} {\alpha^*}^T \bar{Q} \alpha^*
  = \frac{1}{2} {w^*}^T w^* + C \sum_{i=1}^{l} \xi(w^*; x_i, y_i).

Comparison of Primal and Dual Initial Objective Values (1/2)
A scaled form of the original problem is
  \min_w \frac{1}{2Cl} w^T w + \frac{1}{l} \sum_{i=1}^{l} \xi(w; x_i, y_i).
If
- k is not large,
- the original and new data points follow a similar distribution, and
- w^* describes the average loss well,
then the primal initial objective value should be close to the new optimal value.

Comparison of Primal and Dual Initial Objective Values (2/2)
  primal: \frac{1}{2} {w^*}^T w^* + C \sum_{i=1}^{l} \xi(w^*; x_i, y_i) + C \sum_{i=l+1}^{l+k} \xi(w^*; x_i, y_i).
  dual:   \frac{1}{2} {w^*}^T w^* + C \sum_{i=1}^{l} \xi(w^*; x_i, y_i).
The dual initial objective value may be farther from the new optimal value because it lacks the last term.

Decremental Learning Initial Objective Values
The primal situation is similar for decremental learning.
Dual initial objective value:
  \sum_{i=k+1}^{l} h(\alpha_i^*, C) - \frac{1}{2} \Big( \sum_{k+1 \le i,j \le l} \alpha_i^* \alpha_j^* K(i,j) + \sum_{i=k+1}^{l} (\alpha_i^*)^2 d \Big)
  = \frac{1}{2} {w^*}^T w^* + C \sum_{i=k+1}^{l} \xi(w^*; x_i, y_i) - \frac{1}{2} \sum_{1 \le i,j \le k} \alpha_i^* \alpha_j^* K(i,j),
where K(i,j) = y_i y_j x_i^T x_j.
This value is too small in comparison with the primal one because of the last term.
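For readers who want to see where the last term comes from, here is a short derivation sketch (not on the slides), assuming the standard optimality relations for the three losses above: w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i and h(\alpha_i^*, C) - \frac{d}{2}(\alpha_i^*)^2 = C\,\xi(w^*; x_i, y_i) + \alpha_i^* y_i {w^*}^T x_i.

Let $u = \sum_{i=k+1}^{l} \alpha_i^* y_i x_i$ and $v = \sum_{i=1}^{k} \alpha_i^* y_i x_i$, so $w^* = u + v$ and $\sum_{k+1 \le i,j \le l} \alpha_i^* \alpha_j^* K(i,j) = \|u\|^2$. Substituting the per-instance relation into the dual initial value gives
\begin{align*}
& C \sum_{i=k+1}^{l} \xi(w^*; x_i, y_i) + u^T (u + v) - \tfrac{1}{2}\|u\|^2 \\
&= C \sum_{i=k+1}^{l} \xi(w^*; x_i, y_i) + \tfrac{1}{2}\|u + v\|^2 - \tfrac{1}{2}\|v\|^2,
\end{align*}
and $\tfrac{1}{2}\|v\|^2 = \tfrac{1}{2}\sum_{1 \le i,j \le k} \alpha_i^* \alpha_j^* K(i,j)$ is exactly the subtracted last term.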

Optimization Methods with Warm Start
If an initial solution is close to the optimum, a high-order optimization method may be advantageous because of its fast final convergence.
[Figure: distance to optimum versus time under (a) linear convergence and (b) superlinear or quadratic convergence.]
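As a toy numeric illustration of the figure (the rates here are made-up examples, not measurements): suppose the distance to the optimum shrinks as $d_{t+1} = \tfrac{1}{2} d_t$ for a linearly convergent method and as $d_{t+1} = d_t^2$ for a quadratically convergent one. Starting from a warm-start distance of $10^{-3}$ and targeting $10^{-6}$, the linear method still needs about $\log_2 10^3 \approx 10$ iterations, while the quadratic method needs a single iteration ($10^{-3} \to 10^{-6}$). The closer the warm start, the more a high-order method benefits.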

Optimization Methods in Experiments
We consider TRON (Lin et al., 2008), PCD (Chang et al., 2008), and DCD (Hsieh et al., 2008; Yu et al., 2011) in our experiments. The details of these methods are listed below.
  Name | Form   | Optimization        | Order
  TRON | primal | trust-region Newton | high-order
  PCD  | primal | coordinate descent  | low-order
  DCD  | dual   | coordinate descent  | low-order
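Below is a simplified sketch of dual coordinate descent (DCD) for the L1-loss SVM in the spirit of Hsieh et al. (2008), written as the equivalent minimization of the dual and extended to accept a warm-start α. The shuffling and stopping rules are simplified assumptions, so this illustrates the warm-start mechanics rather than the LIBLINEAR implementation.

```python
import numpy as np

def dcd_l1svm(X, y, C, alpha0=None, max_iter=100, tol=1e-6, seed=0):
    """Dual coordinate descent for min_a 0.5*a^T Q a - e^T a, 0 <= a_i <= C,
    with Q_ij = y_i y_j x_i^T x_j.  Passing alpha0 enables warm starting."""
    l, n = X.shape
    rng = np.random.default_rng(seed)
    alpha = np.zeros(l) if alpha0 is None else alpha0.astype(float).copy()
    w = X.T @ (alpha * y)                      # maintain w = sum_i alpha_i y_i x_i
    Qii = np.einsum("ij,ij->i", X, X)          # diagonal entries Q_ii = x_i^T x_i
    for _ in range(max_iter):
        max_pg = 0.0
        for i in rng.permutation(l):
            G = y[i] * (w @ X[i]) - 1.0        # gradient of the dual objective in a_i
            if alpha[i] == 0.0:
                PG = min(G, 0.0)               # projected gradient at the lower bound
            elif alpha[i] == C:
                PG = max(G, 0.0)               # projected gradient at the upper bound
            else:
                PG = G
            max_pg = max(max_pg, abs(PG))
            if PG != 0.0 and Qii[i] > 0.0:
                new_ai = min(max(alpha[i] - G / Qii[i], 0.0), C)
                w += (new_ai - alpha[i]) * y[i] * X[i]
                alpha[i] = new_ai
        if max_pg < tol:                       # crude stopping condition
            break
    return w, alpha

# Warm start after adding new instances (illustrative usage, names are placeholders):
# w0, a0 = dcd_l1svm(X_old, y_old, C=1.0)
# a_init = np.concatenate([a0, np.zeros(len(y_new))])
# w1, a1 = dcd_l1svm(np.vstack([X_old, X_new]),
#                    np.concatenate([y_old, y_new]), C=1.0, alpha0=a_init)
```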

Additional Information Requirement
We check whether additional information must be maintained in the model after training. Originally the model contains only w.
Primal problem: only w is needed; no additional information is required.
Dual problem: \alpha is additional information. Maintaining \alpha is complicated because we need to track which coordinate corresponds to which instance.
Thus, primal solvers are easier to implement than dual solvers.
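A tiny sketch of the bookkeeping a dual-based approach would need (the dictionary-based layout here is an illustrative assumption, not the paper's data structure): besides w, the model must keep a mapping from instance identifiers to their α values so that the right coordinates can be dropped or appended.

```python
# Dual-based warm start has to remember which alpha belongs to which instance.
model = {
    "w": None,       # the weight vector (all a primal solver needs to keep)
    "alpha": {},     # instance id -> alpha value (extra state for a dual solver)
}

def remove_instances(model, ids_to_remove):
    """Decremental step: drop the alphas of removed instances, keep the rest."""
    for i in ids_to_remove:
        model["alpha"].pop(i, None)

def add_instances(model, new_ids):
    """Incremental step: new instances start from alpha = 0."""
    for i in new_ids:
        model["alpha"][i] = 0.0
```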

Experiment Setting
Incremental learning: we randomly divide each data set into r parts, using r - 1 parts as the original data and 1 part as the new data. Decremental learning follows a similar setting.
A smaller r means that the data change is larger.
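A possible way to produce such a split, as a sketch (function and variable names are assumptions, not the authors' scripts).

```python
import numpy as np

def split_for_incremental(X, y, r, seed=0):
    """Shuffle, cut into r parts, use r-1 parts as the original data and 1 as new data."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    parts = np.array_split(perm, r)
    new_idx = parts[-1]                    # the held-out part plays the role of "new data"
    old_idx = np.concatenate(parts[:-1])
    return (X[old_idx], y[old_idx]), (X[new_idx], y[new_idx])
```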

Data Sets
  Data set  | l: #instances | n: #features | density
  ijcnn     | 49,990        | 22           | 41.37%
  webspam   | 350,000       | 254          | 33.51%
  news20    | 19,996        | 1,355,191    | 0.03%
  yahoo-jp  | 176,203       | 832,026      | 0.02%
We evaluate the relative primal and dual differences to the optimal objective value for primal and dual solvers, respectively:
  \frac{f(w) - f(w^*)}{f(w^*)}   and   \frac{f_D(\alpha) - f_D(\alpha^*)}{f_D(\alpha^*)}.
We show results of logistic regression with C = 1.
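The measure as a small helper (a sketch; f_opt would come from running a solver to high accuracy, and the same formula is applied to the dual objective f_D).

```python
def relative_difference(f_value, f_opt):
    """Relative difference (f - f*) / f* of an objective value to the optimal one."""
    return (f_value - f_opt) / f_opt
```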

Relative Difference to Optimal Objective Value for Incremental Learning
[Table: relative differences of the initial objective value to the optimum for the primal and dual forms on ijcnn, webspam, news20, and yahoo-jp, under no warm start and two sizes of data change r. Without warm start the relative differences are of order one; with warm start the primal differences are one or more orders of magnitude smaller than the dual ones.]

Incremental Learning
[Figure: relative difference to the optimal objective value versus training time for TRON, PCD, and DCD on ijcnn, webspam, news20, and yahoo-jp.]

Decremental Learning
[Figure: relative difference to the optimal objective value versus training time for TRON, PCD, and DCD on ijcnn, webspam, news20, and yahoo-jp.]

Conclusions
Warm start for a primal-based high-order optimization method is preferred because:
- The warm-start setting generally gives a better primal initial solution than a dual one.
- For implementation, a primal-based method is more straightforward than a dual-based method.
- The warm-start setting more effectively speeds up a high-order optimization method such as TRON.
We implement TRON as an incremental/decremental learning extension of LIBLINEAR (Fan et al., 2008), available at http://www.csie.ntu.edu.tw/~cjlin/papers/ws.