Incremental and Decremental Training for Linear Classification

Incremental and Decremental Training for Linear Classification
Authors: Cheng-Hao Tsai, Chieh-Yen Lin, and Chih-Jen Lin
Department of Computer Science, National Taiwan University
Presenter: Ching-Pei Lee
Aug. 5, 2014

Outline
1 Introduction
2 Analysis on Our Setting
  Primal or Dual Problem
  Optimization Methods
  Implementation Issue
3 Experiments
4 Conclusions

Incremental and Decremental Training
In classification, if a few data points are added or removed, incremental and decremental techniques can be applied to quickly update the model.
Issues of incremental/decremental learning:
- It is not guaranteed to be faster than direct training.
- The framework is complicated.
- Scenarios are application dependent.

Why Linear Classification for Incremental and Decremental Learning?
We focus on linear classification.
Kernel methods:
- The model is a linear combination of training instances, so few optimization choices are available.
- Updating cached kernel elements is complicated.
Linear methods:
- The model is a simple vector.
- Accuracy is comparable on some applications.

Linear Classification
Training instances {(x_i, y_i)}_{i=1}^{l}, x_i \in R^n, y_i \in \{-1, 1\}.
Linear classification:
  \min_w \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi(w; x_i, y_i),
where
  \xi(w; x_i, y_i) = \max(0, 1 - y_i w^T x_i)     for L1-loss SVM,
  \xi(w; x_i, y_i) = \max(0, 1 - y_i w^T x_i)^2   for L2-loss SVM,
  \xi(w; x_i, y_i) = \log(1 + e^{-y_i w^T x_i})   for logistic regression.
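For concreteness, here is a minimal NumPy sketch (not from the paper) that evaluates this primal objective for the three losses; the synthetic data at the end is purely illustrative.

```python
import numpy as np

def primal_objective(w, X, y, C, loss="l2"):
    """Evaluate (1/2) w^T w + C * sum_i xi(w; x_i, y_i) for one of the three losses."""
    margins = y * (X @ w)                 # y_i * w^T x_i for every instance
    if loss == "l1":                      # L1-loss SVM: max(0, 1 - y_i w^T x_i)
        xi = np.maximum(0.0, 1.0 - margins)
    elif loss == "l2":                    # L2-loss SVM: max(0, 1 - y_i w^T x_i)^2
        xi = np.maximum(0.0, 1.0 - margins) ** 2
    elif loss == "lr":                    # logistic regression: log(1 + exp(-y_i w^T x_i))
        xi = np.logaddexp(0.0, -margins)  # numerically stable log(1 + e^{-m})
    else:
        raise ValueError("loss must be 'l1', 'l2', or 'lr'")
    return 0.5 * w @ w + C * xi.sum()

# Tiny synthetic example (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(100))
print(primal_objective(np.zeros(5), X, y, C=1.0, loss="lr"))
```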

Dual Problems
  \max_\alpha \sum_{i=1}^{l} h(\alpha_i, C) - \frac{1}{2} \alpha^T \bar{Q} \alpha
  subject to 0 \le \alpha_i \le U, i = 1, ..., l,
where \bar{Q} = Q + dI, Q_{i,j} = y_i y_j x_i^T x_j, and
  U = C,      d = 0        for L1-loss SVM and LR,
  U = \infty, d = 1/(2C)   for L2-loss SVM.
Further,
  h(\alpha_i, C) = \alpha_i                                                                  for L1-loss and L2-loss SVM,
  h(\alpha_i, C) = C \log C - \alpha_i \log \alpha_i - (C - \alpha_i) \log(C - \alpha_i)     for LR.
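Similarly, a small sketch (an illustration, not LIBLINEAR code) of evaluating this dual objective; only the L1-loss SVM case of h, U, and d is spelled out here.

```python
import numpy as np

def dual_objective_l1svm(alpha, X, y, C):
    """L1-loss SVM dual objective: sum_i alpha_i - 0.5 * alpha^T Qbar alpha,
    where Qbar = Q + d*I, Q_ij = y_i y_j x_i^T x_j, and d = 0 for the L1 loss
    (d = 1/(2C) would be added for the L2 loss); feasibility is 0 <= alpha_i <= U = C."""
    Yx = y[:, None] * X            # rows are y_i * x_i
    Qbar = Yx @ Yx.T               # d = 0 here; use Yx @ Yx.T + np.eye(len(y))/(2*C) for L2-loss
    assert np.all((alpha >= 0) & (alpha <= C)), "alpha must lie in [0, C]"
    return alpha.sum() - 0.5 * alpha @ Qbar @ alpha

# Illustrative call with the feasible point alpha = 0 (dual objective value 0).
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
print(dual_objective_l1svm(np.zeros(3), X, y, C=1.0))
```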

Incremental and Decremental Learning with Warm Start (1/2)
In incremental learning, we assume (x_i, y_i), i = l+1, ..., l+k are the instances added.
In decremental learning, we assume (x_i, y_i), i = 1, ..., k are the instances removed.
We consider a warm-start setting by utilizing the optimal solution of the original problem.

Incremental and Decremental Learning with Warm Start (2/2)
Denote the optimal solutions of the original primal and dual problems by w^* and \alpha^*, respectively.
Primal: we choose w^* as the initial solution for both incremental and decremental learning. We can use the same w^* because the features are unchanged.
Dual:
  [\alpha_1^*, ..., \alpha_l^*, 0, ..., 0]^T   for incremental,
  [\alpha_{k+1}^*, ..., \alpha_l^*]^T          for decremental.
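A minimal sketch of the warm-start construction just described (NumPy-based; function and variable names are illustrative assumptions).

```python
import numpy as np

def warm_start_dual(alpha_star, k, mode):
    """Build the dual initial point from the old optimum alpha_star."""
    if mode == "incremental":            # k new instances appended: start them at 0
        return np.concatenate([alpha_star, np.zeros(k)])
    if mode == "decremental":            # first k instances removed: drop their alphas
        return alpha_star[k:].copy()
    raise ValueError("mode must be 'incremental' or 'decremental'")

def warm_start_primal(w_star):
    """The primal initial point is simply the old optimum w*; features are unchanged."""
    return w_star.copy()
```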

Analysis on Our Setting
Though our warm-start setting is simple, several issues arise when applying warm start to incremental and decremental learning. We investigate the setting from three aspects:
- Solving the primal or the dual problem.
- Choosing optimization methods.
- Implementation issues.

Primal or Dual Problem: Which is Better?
An initial point closer to the optimum is more likely to reduce the running time.
Main finding: the primal initial point is closer to the optimal point than the dual one, so the primal problem is preferred.

Primal Initial Objective Value for Incremental Learning
The new optimization problem is
  \min_w \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi(w; x_i, y_i) + C \sum_{i=l+1}^{l+k} \xi(w; x_i, y_i).
The primal initial objective value is
  \frac{1}{2} {w^*}^T w^* + C \sum_{i=1}^{l} \xi(w^*; x_i, y_i) + C \sum_{i=l+1}^{l+k} \xi(w^*; x_i, y_i).

Dual Initial Objective Value for Incremental Learning
The dual initial objective value is
  \sum_{i=1}^{l} h(\alpha_i^*, C) + \sum_{i=l+1}^{l+k} h(0, C) - \frac{1}{2} \begin{bmatrix} {\alpha^*}^T & 0^T \end{bmatrix} \bar{Q}_{\mathrm{new}} \begin{bmatrix} \alpha^* \\ 0 \end{bmatrix},
where \bar{Q}_{\mathrm{new}} is \bar{Q} enlarged to include the k new instances. Because the terms involving the zero entries vanish, this equals
  \sum_{i=1}^{l} h(\alpha_i^*, C) - \frac{1}{2} {\alpha^*}^T \bar{Q} \alpha^*
  = \frac{1}{2} {w^*}^T w^* + C \sum_{i=1}^{l} \xi(w^*; x_i, y_i).

Comparison of Primal and Dual Initial Objective Values (1/2)
A scaled form of the original problem is
  \min_w \frac{1}{2Cl} w^T w + \frac{1}{l} \sum_{i=1}^{l} \xi(w; x_i, y_i).
If
- k is not large,
- the original and new data points follow a similar distribution, and
- w^* describes the average loss well,
then the primal initial objective value should be close to the new optimal value.

Comparison of Primal and Dual Initial Objective Values (2/2)
  primal: \frac{1}{2} {w^*}^T w^* + C \sum_{i=1}^{l} \xi(w^*; x_i, y_i) + C \sum_{i=l+1}^{l+k} \xi(w^*; x_i, y_i).
  dual:   \frac{1}{2} {w^*}^T w^* + C \sum_{i=1}^{l} \xi(w^*; x_i, y_i).
The dual initial objective value may be farther from the new optimal value because it lacks the last term.

Decremental Learning Initial Objective Values
The primal situation is similar for decremental learning.
Dual initial objective value:
  \sum_{i=k+1}^{l} h(\alpha_i^*, C) - \frac{1}{2} \Big( \sum_{k+1 \le i,j \le l} \alpha_i^* \alpha_j^* K(i,j) + \sum_{i=k+1}^{l} (\alpha_i^*)^2 d \Big)
  = \frac{1}{2} {w^*}^T w^* + C \sum_{i=k+1}^{l} \xi(w^*; x_i, y_i) - \frac{1}{2} \sum_{1 \le i,j \le k} \alpha_i^* \alpha_j^* K(i,j),
where K(i,j) = y_i y_j x_i^T x_j.
This value is too small in comparison with the primal one because of the last term.
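For readers who want to see where the last term comes from, here is a short derivation sketch (not on the slides), assuming the standard optimality relations for the three losses above: w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i and h(\alpha_i^*, C) - \frac{d}{2}(\alpha_i^*)^2 = C\,\xi(w^*; x_i, y_i) + \alpha_i^* y_i {w^*}^T x_i.

Let $u = \sum_{i=k+1}^{l} \alpha_i^* y_i x_i$ and $v = \sum_{i=1}^{k} \alpha_i^* y_i x_i$, so $w^* = u + v$ and $\sum_{k+1 \le i,j \le l} \alpha_i^* \alpha_j^* K(i,j) = \|u\|^2$. Substituting the per-instance relation into the dual initial value gives
\begin{align*}
& C \sum_{i=k+1}^{l} \xi(w^*; x_i, y_i) + u^T (u + v) - \tfrac{1}{2}\|u\|^2 \\
&= C \sum_{i=k+1}^{l} \xi(w^*; x_i, y_i) + \tfrac{1}{2}\|u + v\|^2 - \tfrac{1}{2}\|v\|^2,
\end{align*}
and $\tfrac{1}{2}\|v\|^2 = \tfrac{1}{2}\sum_{1 \le i,j \le k} \alpha_i^* \alpha_j^* K(i,j)$ is exactly the subtracted last term.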

Optimization Methods with Warm Start
If an initial solution is close to the optimum, a high-order optimization method may be advantageous because of its fast final convergence.
[Figure: distance to optimum versus time under (a) linear convergence and (b) superlinear or quadratic convergence.]
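As a toy numeric illustration of the figure (the rates here are made-up examples, not measurements): suppose the distance to the optimum shrinks as $d_{t+1} = \tfrac{1}{2} d_t$ for a linearly convergent method and as $d_{t+1} = d_t^2$ for a quadratically convergent one. Starting from a warm-start distance of $10^{-3}$ and targeting $10^{-6}$, the linear method still needs about $\log_2 10^3 \approx 10$ iterations, while the quadratic method needs a single iteration ($10^{-3} \to 10^{-6}$). The closer the warm start, the more a high-order method benefits.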

Optimization Methods in Experiments
We consider TRON (Lin et al., 2008), PCD (Chang et al., 2008), and DCD (Hsieh et al., 2008; Yu et al., 2011) in our experiments. The details of these methods are listed below.
  Name | Form   | Optimization        | Order
  TRON | primal | trust-region Newton | high-order
  PCD  | primal | coordinate descent  | low-order
  DCD  | dual   | coordinate descent  | low-order
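Below is a simplified sketch of dual coordinate descent (DCD) for the L1-loss SVM in the spirit of Hsieh et al. (2008), written as the equivalent minimization of the dual and extended to accept a warm-start α. The shuffling and stopping rules are simplified assumptions, so this illustrates the warm-start mechanics rather than the LIBLINEAR implementation.

```python
import numpy as np

def dcd_l1svm(X, y, C, alpha0=None, max_iter=100, tol=1e-6, seed=0):
    """Dual coordinate descent for min_a 0.5*a^T Q a - e^T a, 0 <= a_i <= C,
    with Q_ij = y_i y_j x_i^T x_j.  Passing alpha0 enables warm starting."""
    l, n = X.shape
    rng = np.random.default_rng(seed)
    alpha = np.zeros(l) if alpha0 is None else alpha0.astype(float).copy()
    w = X.T @ (alpha * y)                      # maintain w = sum_i alpha_i y_i x_i
    Qii = np.einsum("ij,ij->i", X, X)          # diagonal entries Q_ii = x_i^T x_i
    for _ in range(max_iter):
        max_pg = 0.0
        for i in rng.permutation(l):
            G = y[i] * (w @ X[i]) - 1.0        # gradient of the dual objective in a_i
            if alpha[i] == 0.0:
                PG = min(G, 0.0)               # projected gradient at the lower bound
            elif alpha[i] == C:
                PG = max(G, 0.0)               # projected gradient at the upper bound
            else:
                PG = G
            max_pg = max(max_pg, abs(PG))
            if PG != 0.0 and Qii[i] > 0.0:
                new_ai = min(max(alpha[i] - G / Qii[i], 0.0), C)
                w += (new_ai - alpha[i]) * y[i] * X[i]
                alpha[i] = new_ai
        if max_pg < tol:                       # crude stopping condition
            break
    return w, alpha

# Warm start after adding new instances (illustrative usage, names are placeholders):
# w0, a0 = dcd_l1svm(X_old, y_old, C=1.0)
# a_init = np.concatenate([a0, np.zeros(len(y_new))])
# w1, a1 = dcd_l1svm(np.vstack([X_old, X_new]),
#                    np.concatenate([y_old, y_new]), C=1.0, alpha0=a_init)
```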

Additional Information Requirement
We check whether additional information must be maintained in the model after training. Originally the model contains only w.
Primal problem: only w is needed; no additional information is required.
Dual problem: \alpha is additional information. Maintaining \alpha is complicated because we need to track which coordinate corresponds to which instance.
Thus, primal solvers are easier to implement than dual solvers.
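A tiny sketch of the bookkeeping a dual-based approach would need (the dictionary-based layout here is an illustrative assumption, not the paper's data structure): besides w, the model must keep a mapping from instance identifiers to their α values so that the right coordinates can be dropped or appended.

```python
# Dual-based warm start has to remember which alpha belongs to which instance.
model = {
    "w": None,       # the weight vector (all a primal solver needs to keep)
    "alpha": {},     # instance id -> alpha value (extra state for a dual solver)
}

def remove_instances(model, ids_to_remove):
    """Decremental step: drop the alphas of removed instances, keep the rest."""
    for i in ids_to_remove:
        model["alpha"].pop(i, None)

def add_instances(model, new_ids):
    """Incremental step: new instances start from alpha = 0."""
    for i in new_ids:
        model["alpha"][i] = 0.0
```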

Experiment Setting
Incremental learning: we randomly divide each data set into r parts, using r - 1 parts as the original data and 1 part as the new data. Decremental learning follows a similar setting.
A smaller r means that the data change is larger.
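A possible way to produce such a split, as a sketch (function and variable names are assumptions, not the authors' scripts).

```python
import numpy as np

def split_for_incremental(X, y, r, seed=0):
    """Shuffle, cut into r parts, use r-1 parts as the original data and 1 as new data."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    parts = np.array_split(perm, r)
    new_idx = parts[-1]                    # the held-out part plays the role of "new data"
    old_idx = np.concatenate(parts[:-1])
    return (X[old_idx], y[old_idx]), (X[new_idx], y[new_idx])
```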

Data Sets
  Data set  | l: #instances | n: #features | density
  ijcnn     | 49,990        | 22           | 41.37%
  webspam   | 350,000       | 254          | 33.51%
  news20    | 19,996        | 1,355,191    | 0.03%
  yahoo-jp  | 176,203       | 832,026      | 0.02%
We evaluate the relative primal and dual differences to the optimal objective value for primal and dual solvers, respectively:
  \frac{f(w) - f(w^*)}{f(w^*)}   and   \frac{f_D(\alpha) - f_D(\alpha^*)}{f_D(\alpha^*)}.
We show results of logistic regression with C = 1.
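The measure as a small helper (a sketch; f_opt would come from running a solver to high accuracy, and the same formula is applied to the dual objective f_D).

```python
def relative_difference(f_value, f_opt):
    """Relative difference (f - f*) / f* of an objective value to the optimal one."""
    return (f_value - f_opt) / f_opt
```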

Relative Difference to Optimal Objective Value for Incremental Learning
[Table: relative differences of the initial objective value to the optimum for the primal and dual forms on ijcnn, webspam, news20, and yahoo-jp, under no warm start and two sizes of data change r. Without warm start the relative differences are of order one; with warm start the primal differences are one or more orders of magnitude smaller than the dual ones.]

Incremental Learning
[Figure: relative difference to the optimal objective value versus training time for TRON, PCD, and DCD on ijcnn, webspam, news20, and yahoo-jp.]

Decremental Learning
[Figure: relative difference to the optimal objective value versus training time for TRON, PCD, and DCD on ijcnn, webspam, news20, and yahoo-jp.]

Conclusions
Warm start for a primal-based high-order optimization method is preferred because:
- The warm-start setting generally gives a better primal initial solution than a dual one.
- For implementation, a primal-based method is more straightforward than a dual-based method.
- The warm-start setting more effectively speeds up a high-order optimization method such as TRON.
We implement TRON as an incremental/decremental learning extension of LIBLINEAR (Fan et al., 2008), available at http://www.csie.ntu.edu.tw/~cjlin/papers/ws.