Training Support Vector Machines: Status and Challenges

Chih-Jen Lin
Department of Computer Science, National Taiwan University
ICML Workshop on Large Scale Learning Challenge, July 9, 2008

Outline

- SVM is popular, but its training isn't easy
- We survey existing techniques
- For large data sets, we show several approaches and discuss various considerations
- We will try to partially answer why there are controversial comparisons

Outline

- Introduction to SVM
- Solving SVM Quadratic Programming Problem
- Training large-scale data
- Linear SVM
- Discussion and Conclusions

Support Vector Classification

Training data: (x_i, y_i), i = 1, ..., l, with x_i \in R^n and y_i = \pm 1.

Maximizing the margin [Boser et al., 1992; Cortes and Vapnik, 1995]:

\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i
subject to \; y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; i = 1, ..., l.

\phi maps data to a high-dimensional (maybe infinite-dimensional) feature space:
\phi(x) = (\phi_1(x), \phi_2(x), ...).
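
The primal objective is easy to evaluate numerically once w, b, and the data are fixed, since for fixed (w, b) the optimal slacks are \xi_i = \max(0, 1 - y_i(w^T x_i + b)). A minimal numpy sketch with the identity map as \phi; all data and names here are illustrative:

```python
import numpy as np

# Toy data: 100 points in R^5 with random labels (illustrative only).
rng = np.random.RandomState(0)
X, y = rng.randn(100, 5), np.sign(rng.randn(100))
w, b, C = rng.randn(5), 0.0, 1.0

# Optimal slacks given (w, b): xi_i = max(0, 1 - y_i (w^T x_i + b)).
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))

# Primal objective: (1/2) w^T w + C * sum_i xi_i.
primal = 0.5 * w @ w + C * xi.sum()
print(primal)
```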

Support Vector Classification (Cont'd)

w: possibly infinitely many variables. The dual problem has a finite number of variables:

\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
subject to \; 0 \le \alpha_i \le C, \; i = 1, ..., l, \quad y^T \alpha = 0,

where Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j) and e = [1, ..., 1]^T.

At the optimum: w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i).

Kernel: K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j).
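
To make Q concrete, here is a small numpy sketch that builds it for the RBF kernel K(x_i, x_j) = exp(-\gamma ||x_i - x_j||^2); this is a toy illustration, not LIBSVM's internal code:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    # ||a - b||^2 = a.a + b.b - 2 a.b, computed for all pairs at once.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clip tiny negative rounding

rng = np.random.RandomState(0)
X = rng.randn(6, 3)                      # 6 points, 3 features
y = np.array([1, 1, 1, -1, -1, -1])
K = rbf_kernel_matrix(X, gamma=0.5)      # K_ij = phi(x_i)^T phi(x_j)
Q = (y[:, None] * y[None, :]) * K        # Q_ij = y_i y_j K(x_i, x_j)
```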

Outline

- Introduction to SVM
- Solving SVM Quadratic Programming Problem (current section)
- Training large-scale data
- Linear SVM
- Discussion and Conclusions

Large Dense Quadratic Programming

\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
subject to \; 0 \le \alpha_i \le C, \; i = 1, ..., l, \quad y^T \alpha = 0.

Q_{ij} is generally nonzero, so Q is an l-by-l fully dense matrix. With 50,000 training points there are 50,000 variables, and storing (half of) Q takes 50,000^2 \times 8 / 2 bytes = 10 GB of RAM.

Traditional methods such as Newton and quasi-Newton cannot be directly applied.
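
A quick back-of-the-envelope check of the 10 GB figure (double precision, storing only one triangle of the symmetric Q):

```python
l = 50_000
bytes_needed = l ** 2 * 8 / 2      # 8 bytes per double, half the matrix
print(bytes_needed / 1e9)          # -> 10.0 (GB)
```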

Decomposition Methods

Work on a few variables at a time (e.g., [Osuna et al., 1997; Joachims, 1998; Platt, 1998]); similar to coordinate-wise minimization. Working set B; N = {1, ..., l} \ B is fixed.

Sub-problem at the k-th iteration:

\min_{\alpha_B} \; \frac{1}{2}
\begin{bmatrix} \alpha_B^T & (\alpha_N^k)^T \end{bmatrix}
\begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN} \end{bmatrix}
\begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix}
-
\begin{bmatrix} e_B^T & e_N^T \end{bmatrix}
\begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix}

subject to \; 0 \le \alpha_t \le C, \; t \in B, \quad y_B^T \alpha_B = -y_N^T \alpha_N^k.

Avoiding Memory Problems

The new objective function (dropping terms that do not involve \alpha_B):

\frac{1}{2} \alpha_B^T Q_{BB} \alpha_B + (-e_B + Q_{BN} \alpha_N^k)^T \alpha_B + \text{constant}

Only |B| columns of Q are needed (|B| \ge 2), calculated when used: trade time for space. Popular software such as SVM^light, LIBSVM, and SVMTorch is of this type.
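
That only |B| columns of Q are required can be checked directly: Q_{BB} is a block of those columns, and the linear term -e_B + Q_{BN} \alpha_N^k uses their remaining rows. A hedged numpy sketch with a toy RBF kernel (the working set and data are made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
l = 200
X, y = rng.randn(l, 5), np.sign(rng.randn(l))
alpha = rng.uniform(0.0, 1.0, l)           # current iterate alpha^k
B = np.array([3, 17])                      # working set
N = np.setdiff1d(np.arange(l), B)          # fixed variables

# Only the |B| kernel columns are evaluated (an l x |B| array), never all of Q.
d2 = ((X[:, None, :] - X[None, B, :]) ** 2).sum(axis=-1)
Q_cols = (y[:, None] * y[B][None, :]) * np.exp(-0.5 * d2)

Q_BB = Q_cols[B]                           # |B| x |B| block of the sub-problem
lin = -1.0 + Q_cols[N].T @ alpha[N]        # -e_B + Q_BN alpha_N^k
```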

How Do Decomposition Methods Perform?

- Convergence is not very fast, but a highly accurate \alpha is unnecessary: prediction is not affected much
- In some situations, # support vectors \ll # training points
- With the initial \alpha^1 = 0, some instances are never used

An Example: Training 50,000 Instances Using LIBSVM

$ svm-train -c 16 -g 4 -m 400 22features
Total nSV = 3370
Time: 79.524 s

On a Xeon 2.0 GHz machine; calculating Q likely dominated the time. #SVs = 3,370 \ll 50,000: a good case where many \alpha_i remain at zero all the time.

Issues of Decomposition Methods

Techniques for faster decomposition methods:
- storing recently used kernel elements (see the sketch below)
- working set size/selection
- theoretical issues: convergence and others (details not discussed here)

But training large data sets is still difficult: the kernel matrix grows with the square of the number of data points, and training millions of points is time-consuming. We will discuss some possible approaches.
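
The first technique, storing recently used kernel elements, can be sketched as a bounded column cache; this toy version uses functools.lru_cache and is not LIBSVM's actual cache:

```python
import numpy as np
from functools import lru_cache

rng = np.random.RandomState(0)
X, y = rng.randn(10_000, 20), np.sign(rng.randn(10_000))

@lru_cache(maxsize=100)          # keep at most 100 columns in memory
def q_column(i):
    # i-th column of Q, computed on demand (RBF kernel, gamma = 0.5).
    k = np.exp(-0.5 * ((X - X[i]) ** 2).sum(axis=1))
    return y * y[i] * k
```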

Outline

- Introduction to SVM
- Solving SVM Quadratic Programming Problem
- Training large-scale data (current section)
- Linear SVM
- Discussion and Conclusions

Parallel: Multi-core/Shared Memory

Most of the computation in decomposition methods goes to kernel evaluations, which are easily parallelized via OpenMP: a one-line change to LIBSVM lets each core/CPU calculate part of a kernel column.

Training time in seconds on the same 50,000-instance set (kernel evaluations: 80% of the time):

#cores  Multi-core    #cores  Shared-memory
1       80            1       100
2       48            2       57
4       32            4       36
8       27            8       28

GPUs can also be used [Catanzaro et al., 2008].
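
In LIBSVM itself the one-line change is an OpenMP pragma on the C++ kernel-evaluation loop. Purely to illustrate the idea that each worker computes a slice of one kernel column, here is a Python sketch using multiprocessing; the data are seeded so that every worker process reconstructs the same X:

```python
import numpy as np
from multiprocessing import Pool

# Seeded so that spawned workers re-create identical data on import.
X = np.random.RandomState(0).randn(20_000, 50)
GAMMA = 0.5

def column_slice(args):
    i, lo, hi = args                      # compute rows lo:hi of column i
    d2 = ((X[lo:hi] - X[i]) ** 2).sum(axis=1)
    return np.exp(-GAMMA * d2)

if __name__ == "__main__":
    i, workers = 7, 4
    bounds = np.linspace(0, len(X), workers + 1, dtype=int)
    tasks = [(i, lo, hi) for lo, hi in zip(bounds[:-1], bounds[1:])]
    with Pool(workers) as pool:
        column = np.concatenate(pool.map(column_slice, tasks))
    print(column.shape)                   # (20000,): the full i-th column
```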

Parallel: Distributed Environments

What if the data cannot fit into memory? Use distributed environments:
- PSVM [Chang et al., 2007]: http://code.google.com/p/psvm/
- π-SVM: http://pisvm.sourceforge.net
- Parallel GPDT [Zanni et al., 2006]

All use MPI and report good speed-ups, but in certain environments the communication cost is a concern.

Approximations

Subsampling is simple and often effective. Many more advanced techniques exist:
- Incremental training (e.g., [Syed et al., 1999]): split the data into, say, 10 parts; train on the 1st part, keep the support vectors, then train on SVs + 2nd part, and so on (a sketch follows below)
- Selecting and training on good points, via k-nearest neighbors or heuristics (e.g., [Bakır et al., 2005])
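
Here is the promised sketch of the incremental scheme of [Syed et al., 1999], written with scikit-learn's SVC (a LIBSVM wrapper) for brevity; the dataset, chunk count, and parameters are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(5000, 10)
y = np.where(X[:, 0] + 0.3 * rng.randn(5000) > 0, 1, -1)

chunks = np.array_split(rng.permutation(len(X)), 10)  # 10 parts
keep = np.array([], dtype=int)                        # indices of current SVs
for chunk in chunks:
    idx = np.concatenate([keep, chunk])
    clf = SVC(kernel="rbf", C=1.0).fit(X[idx], y[idx])
    keep = idx[clf.support_]      # carry only the support vectors forward

print(len(keep), "support vectors after the final pass")
```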

Approximations (Cont'd)

- Approximate the kernel, e.g., [Fine and Scheinberg, 2001; Williams and Seeger, 2001]
- Use part of the kernel, e.g., [Lee and Mangasarian, 2001; Keerthi et al., 2006]
- Many others, some simple and some sophisticated

Parallelization or Approximation?

Difficult to say. Parallelization is general; approximation is simpler in some cases. We can also do both. For certain problems, approximation doesn't easily work.

Parallelization or Approximation? (Cont'd)

covtype: 500k training and 80k testing instances. rcv1: 550k training and 14k testing instances.

covtype                        rcv1
Training size   Accuracy       Training size   Accuracy
50k             92.5%          50k             97.2%
100k            95.3%          100k            97.4%
500k            98.2%          550k            97.8%

For large sets, selecting the right approach is essential. We illustrate this point using linear SVM for document classification.

Outline

- Introduction to SVM
- Solving SVM Quadratic Programming Problem
- Training large-scale data
- Linear SVM (current section)
- Discussion and Conclusions

Linear Support Vector Machines

Data are not mapped to another space. In theory, an RBF kernel with suitable parameters gives at least as good generalization performance as the linear kernel [Keerthi and Lin, 2003], but we can sometimes easily solve much larger linear SVMs. Training of linear and nonlinear SVMs should therefore be considered separately.

Linear Support Vector Machines (Cont'd)

Linear SVM is useful when its accuracy is similar to that of nonlinear SVM. We will discuss an example: linear SVM for document classification.

Linear SVM for Large Document Sets

Document classification with the bag-of-words model (TF-IDF or other weightings) produces a large number of features. Linear SVM can solve much larger problems than kernelized cases, and it has recently been an active research topic:
- SVM^perf [Joachims, 2006]
- Pegasos [Shalev-Shwartz et al., 2007]
- LIBLINEAR [Lin et al., 2007; Hsieh et al., 2008]
- and others

Linear SVM

Primal (without the bias term b):

\min_{w} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max(0, 1 - y_i w^T x_i)

Dual:

\min_{\alpha} \; f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
subject to \; 0 \le \alpha_i \le C \; \forall i,

with Q_{ij} = y_i y_j x_i^T x_j. Without b, there is no linear constraint y^T \alpha = 0.

A Comparison: LIBSVM and LIBLINEAR

rcv1: # data > 600k, # features > 40k, TF-IDF representation.
- LIBSVM (linear kernel): > 10 hours
- LIBLINEAR: computation < 5 seconds; I/O: 60 seconds

Same stopping condition; accuracy similar to the nonlinear case.
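
The same contrast can be reproduced with scikit-learn's wrappers: SVC(kernel="linear") uses LIBSVM's decomposition solver, while LinearSVC uses LIBLINEAR. A small timing sketch on sparse text features (sizes reduced so the LIBSVM side finishes quickly; the two solve slightly different formulations, so accuracies can differ marginally):

```python
import time
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.svm import SVC, LinearSVC

bunch = fetch_20newsgroups_vectorized(subset="train")
X, y = bunch.data[:5000], bunch.target[:5000] % 2   # binary task, sparse TF-IDF

for clf in (SVC(kernel="linear", C=1.0), LinearSVC(C=1.0)):
    t0 = time.time()
    clf.fit(X, y)
    print(type(clf).__name__, f"{time.time() - t0:.1f}s",
          f"train acc {clf.score(X, y):.3f}")
```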

Revisiting Decomposition Methods

The extreme case: update one variable at a time. The sub-problem reduces to

\alpha_i \leftarrow \min\left( \max\left( \alpha_i - \frac{\nabla_i f(\alpha)}{Q_{ii}}, \, 0 \right), \, C \right),

where

\nabla_i f(\alpha) = (Q\alpha)_i - 1 = \sum_{j=1}^{l} Q_{ij} \alpha_j - 1.

Calculating the i-th row of Q costs O(nl), where n = # features and l = # data.

Revisiting Decomposition Methods (Cont'd)

For linear SVM, define

w = \sum_{j=1}^{l} y_j \alpha_j x_j.

Then the gradient is much cheaper, O(n):

\nabla_i f(\alpha) = y_i w^T x_i - 1.

All we need is to maintain w: if \alpha_i changes from \bar{\alpha}_i to \alpha_i, then

w \leftarrow w + (\alpha_i - \bar{\alpha}_i) y_i x_i,

still O(n).
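
Putting the last two slides together yields the dual coordinate descent method of [Hsieh et al., 2008] in a few lines. This is a minimal sketch (L1 loss, no bias, no shrinking, sequential index order rather than random permutation), not LIBLINEAR itself:

```python
import numpy as np

def dcd_linear_svm(X, y, C=1.0, epochs=10):
    l, n = X.shape
    alpha = np.zeros(l)
    w = np.zeros(n)                        # maintained as sum_j y_j alpha_j x_j
    Qii = (X ** 2).sum(axis=1)             # Q_ii = x_i^T x_i (since y_i^2 = 1)
    for _ in range(epochs):
        for i in range(l):
            if Qii[i] == 0.0:              # skip all-zero instances
                continue
            g = y[i] * (w @ X[i]) - 1.0    # grad_i f(alpha), O(n)
            a_new = min(max(alpha[i] - g / Qii[i], 0.0), C)
            w += (a_new - alpha[i]) * y[i] * X[i]   # O(n) update of w
            alpha[i] = a_new
    return w

rng = np.random.RandomState(0)
X = rng.randn(2000, 50)
y = np.where(X @ rng.randn(50) > 0, 1.0, -1.0)
w = dcd_linear_svm(X, y)
print("training accuracy:", (np.sign(X @ w) == y).mean())
```

Each inner update touches only x_i and the maintained w, which is exactly where the O(n) per-iteration cost comes from.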

Testing Accuracy versus Training Time (Seconds)

[Four plots, omitted: L1-SVM on news20, L2-SVM on news20, L1-SVM on rcv1, L2-SVM on rcv1]

Analysis

Other implementation details are in [Hsieh et al., 2008]. A decomposition method for general (linear or nonlinear) kernels costs O(nl) per iteration; the new linear-only method costs O(n) per iteration, so it is faster as long as it does not need l times more iterations. It takes a few seconds for millions of data points. Any limitations?
- Less effective if # features is small: one should solve the primal instead
- Less effective with a large penalty parameter C

Analysis (Cont'd)

One must be careful in comparisons. We now have two decomposition methods (nonlinear and linear) with similar theoretical convergence rates but very different practical behaviors on certain problems. This partially explains the controversial comparisons in some recent work.

Analysis (Cont'd)

A lesson: different SVMs may need different training strategies to handle large data. Even just for linear SVM, the cases
- # data \gg # features,
- # data \ll # features,
- # data and # features both large
should use different methods. For example, when # data \gg # features, a primal-based method is suitable (but then why not use a nonlinear SVM?).

Outline

- Introduction to SVM
- Solving SVM Quadratic Programming Problem
- Training large-scale data
- Linear SVM
- Discussion and Conclusions (current section)

Discussion and Conclusions

Linear versus nonlinear: in this competition, most participants used linear SVM (wild track), even though its accuracy may be worse. Recall the earlier discussion of parallelization and approximation: linear is essentially an approximation of nonlinear. For large data, selecting the right approach seems to be essential, but finding a suitable one is difficult.

Discussion and Conclusions (Cont'd)

Having too many approaches is indeed bad from the viewpoint of designing machine learning software; the success of LIBSVM and SVM^light comes from being simple and general. Developments in both directions, general and specific, will help to advance SVM training.