Working Set Selection Using Second Order Information for Training SVM

Working Set Selection Using Second Order Information for Training SVM
Chih-Jen Lin, Department of Computer Science, National Taiwan University
Joint work with Rong-En Fan and Pai-Hsuen Chen
Talk at the NIPS 2005 Workshop on Large Scale Kernel Machines

Outline
- Large dense quadratic programming in SVM
- Decomposition methods and working set selections
- A new selection based on second order information
- Results and analysis
This work appears in JMLR 2005.

SVM Dual Optimization Problem

A large dense quadratic problem:

\min_{\alpha} \; \frac{1}{2}\alpha^T Q \alpha - e^T \alpha
\text{subject to} \; 0 \le \alpha_i \le C, \; i = 1, \dots, l, \qquad y^T \alpha = 0,

where l is the number of training data, Q is an l-by-l fully dense matrix, y_i = ±1, and e = [1, ..., 1]^T. The problem is difficult because Q is fully dense in general.
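To see why this is hard, here is a minimal NumPy sketch of how Q could be formed explicitly, assuming the usual definition Q_ij = y_i y_j K(x_i, x_j) with an RBF kernel (the kernel choice and the helper name are illustrative, not from the slides):

```python
import numpy as np

def dense_Q(X, y, gamma=1.0):
    """Form Q with Q_ij = y_i y_j K(x_i, x_j) for an RBF kernel
    K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    return (y[:, None] * y[None, :]) * K

# For l training points Q is l-by-l and dense: l = 50,000 already needs
# roughly 20 GB in double precision, so Q cannot be stored explicitly.
```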

Do we really need to solve the dual?

Maybe not: sometimes the data are too large to do so, and one can approximate from either the primal or the dual side. However, in certain situations we still hope to solve it. This talk: a faster algorithm and implementation.

Decomposition Methods

Work on a few variables at a time, similar to coordinate-wise minimization. Choose a working set B and fix N = {1, ..., l} \ B; the size of B is usually ≤ 100. The sub-problem at each iteration is

\min_{\alpha_B} \; \frac{1}{2}
\begin{bmatrix} \alpha_B^T & (\alpha_N^k)^T \end{bmatrix}
\begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN} \end{bmatrix}
\begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix}
-
\begin{bmatrix} e_B^T & e_N^T \end{bmatrix}
\begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix}

\text{subject to} \; 0 \le (\alpha_B)_t \le C, \; t = 1, \dots, q \; (q = |B|), \qquad y_B^T \alpha_B = -y_N^T \alpha_N^k.

Sequential Minimal Optimization (SMO)

Consider B = {i, j}; that is, |B| = 2 (Platt, 1998). The extreme case of decomposition methods: the sub-problem is solved analytically, so there is no need to use optimization software.

\min_{\alpha_i, \alpha_j} \; \frac{1}{2}
\begin{bmatrix} \alpha_i & \alpha_j \end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj} \end{bmatrix}
\begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix}
+ (Q_{BN} \alpha_N^k - e_B)^T
\begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix}

\text{s.t.} \; 0 \le \alpha_i, \alpha_j \le C, \qquad y_i \alpha_i + y_j \alpha_j = -y_N^T \alpha_N^k.

This work focuses on how to select the two elements i and j.
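For intuition, the two-variable sub-problem can be solved by moving along the feasible direction (d_i, d_j) = (y_i, -y_j)·s with s ≥ 0, which keeps y_i α_i + y_j α_j fixed, and then clipping s so the iterate stays in the box. The NumPy sketch below (written against a precomputed kernel matrix K) illustrates that analytic step; it is not the LIBSVM code, and it assumes {i, j} was chosen, as in the selections discussed next, so that the unconstrained step is non-negative:

```python
import numpy as np

def smo_step(alpha, grad, y, K, i, j, C, tau=1e-12):
    """One analytic update on the pair B = {i, j} (sketch).

    Moves along (d_i, d_j) = (y_i, -y_j) * s with s >= 0, which keeps
    y_i*alpha_i + y_j*alpha_j fixed, then clips s to respect 0 <= alpha <= C.
    """
    a = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if a <= 0:
        a = tau                                  # guard for non-PD kernels
    b = -y[i] * grad[i] + y[j] * grad[j]         # > 0 away from optimality
    s = b / a                                    # unconstrained minimizer along the ray

    # Largest feasible step keeping both variables inside [0, C]
    s_max_i = C - alpha[i] if y[i] == 1 else alpha[i]
    s_max_j = alpha[j] if y[j] == 1 else C - alpha[j]
    s = max(0.0, min(s, s_max_i, s_max_j))

    d_i, d_j = y[i] * s, -y[j] * s
    alpha[i] += d_i
    alpha[j] += d_j
    # Maintain grad = Q*alpha - e cheaply: add Q[:, i]*d_i + Q[:, j]*d_j,
    # using Q_st = y_s * y_t * K_st.
    grad += y * (y[i] * K[:, i] * d_i + y[j] * K[:, j] * d_j)
    return alpha, grad
```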

Existing Selection by Gradient

Let d ≡ [d_B, 0_N]. Minimize the first-order approximation

f(\alpha^k + d) \approx f(\alpha^k) + \nabla f(\alpha^k)^T d = f(\alpha^k) + \nabla f(\alpha^k)_B^T d_B.

That is, solve

\min_{d_B} \; \nabla f(\alpha^k)_B^T d_B
\text{subject to} \; y_B^T d_B = 0,
\quad d_t \ge 0 \ \text{if} \ \alpha_t^k = 0, \; t \in B, \quad (1a)
\quad d_t \le 0 \ \text{if} \ \alpha_t^k = C, \; t \in B, \quad (1b)
\quad -1 \le d_t \le 1, \; t \in B,

with |B| = 2.

First considered in (Joachims, 1998). The bounds 0 ≤ α_t ≤ C lead to (1a) and (1b):

0 \le \alpha_t^k + d_t \;\Rightarrow\; d_t \ge 0 \ \text{if} \ \alpha_t^k = 0,
\qquad
\alpha_t^k + d_t \le C \;\Rightarrow\; d_t \le 0 \ \text{if} \ \alpha_t^k = C.

α + d may not be feasible, but that is fine for finding working sets. The box -1 ≤ d_t ≤ 1, t ∈ B, keeps the linear objective from being unbounded below.

Rewritten as checking the first-order approximation on all size-two sub-problems:

\{i, j\} = \arg\min_{B : |B| = 2} \text{Sub}(B),

where

\text{Sub}(B) \equiv \min_{d_B} \; \nabla f(\alpha^k)_B^T d_B
\text{subject to} \; y_B^T d_B = 0,
\quad d_t \ge 0 \ \text{if} \ \alpha_t^k = 0, \; t \in B,
\quad d_t \le 0 \ \text{if} \ \alpha_t^k = C, \; t \in B,
\quad -1 \le d_t \le 1, \; t \in B.

Checking all \binom{l}{2} possible B's?

Solution Using Gradient Information

An O(l) procedure:

i \in \arg\max_{t \in I_{up}(\alpha^k)} -y_t \nabla f(\alpha^k)_t,
\qquad
j \in \arg\min_{t \in I_{low}(\alpha^k)} -y_t \nabla f(\alpha^k)_t,

where

I_{up}(\alpha) \equiv \{t \mid \alpha_t < C, y_t = 1 \ \text{or} \ \alpha_t > 0, y_t = -1\},
\qquad
I_{low}(\alpha) \equiv \{t \mid \alpha_t < C, y_t = -1 \ \text{or} \ \alpha_t > 0, y_t = 1\}.

This {i, j} is usually called the maximal violating pair.
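For concreteness, the maximal violating pair rule might be coded as below. This is an illustrative NumPy sketch, not the LIBSVM source; alpha, grad = ∇f(α), and y are length-l arrays, C is the box bound, and the tolerance check (our addition) stops when the largest violation is small:

```python
import numpy as np

def maximal_violating_pair(alpha, grad, y, C, tol=1e-3):
    """Select the working set {i, j} by the maximal violating pair rule (sketch)."""
    # I_up / I_low as defined on the slide
    I_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    I_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))

    score = -y * grad                       # -y_t * grad f(alpha)_t
    i = np.where(I_up)[0][np.argmax(score[I_up])]
    j = np.where(I_low)[0][np.argmin(score[I_low])]

    # If the maximal violation is below the tolerance, alpha is (near-)optimal
    if score[i] - score[j] < tol:
        return None
    return i, j
```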

Better Working Set Selection

Difficult: the number of iterations may go down while the cost per iteration goes up, so a better selection need not imply shorter training time. A selection by second order information (Fan et al., 2005): since f is quadratic,

f(\alpha^k + d) = f(\alpha^k) + \nabla f(\alpha^k)^T d + \frac{1}{2} d^T \nabla^2 f(\alpha^k) d
= f(\alpha^k) + \nabla f(\alpha^k)_B^T d_B + \frac{1}{2} d_B^T \nabla^2 f(\alpha^k)_{BB} d_B.

Selection by Second-Order Information

Using second order information:

\min_{B : |B| = 2} \text{Sub}(B),

where

\text{Sub}(B) \equiv \min_{d_B} \; \frac{1}{2} d_B^T \nabla^2 f(\alpha^k)_{BB} d_B + \nabla f(\alpha^k)_B^T d_B
\text{subject to} \; y_B^T d_B = 0,
\quad d_t \ge 0 \ \text{if} \ \alpha_t^k = 0, \; t \in B,
\quad d_t \le 0 \ \text{if} \ \alpha_t^k = C, \; t \in B.

The box -1 ≤ d_t ≤ 1, t ∈ B is not needed if Q_{BB} is positive definite. Still too expensive to check all \binom{l}{2} sets.
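The closed form on the later slide follows from a short reduction (not spelled out on the slides; it uses Q_{st} = y_s y_t K_{st} and the shorthand a_{it}, b_{it} introduced here). For B = {i, t}, the equality y_i d_i + y_t d_t = 0 forces d_i = y_i s and d_t = -y_t s for a scalar s, and for i ∈ I_up(α^k), t ∈ I_low(α^k) the sign constraints allow any s ≥ 0. Substituting,

\frac{1}{2} d_B^T \nabla^2 f(\alpha^k)_{BB} d_B + \nabla f(\alpha^k)_B^T d_B
= \frac{1}{2} a_{it} s^2 - b_{it} s,
\qquad a_{it} = K_{ii} + K_{tt} - 2K_{it},
\quad b_{it} = -y_i \nabla f(\alpha^k)_i + y_t \nabla f(\alpha^k)_t,

a one-variable quadratic whose minimum over s ≥ 0 (when a_{it} > 0 and b_{it} > 0) is attained at s = b_{it}/a_{it} with value -b_{it}^2 / (2 a_{it}).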

A heuristic

1. Select i \in \arg\max_t \{ -y_t \nabla f(\alpha^k)_t \mid t \in I_{up}(\alpha^k) \}.
2. Select j \in \arg\min_t \{ \text{Sub}(\{i, t\}) \mid t \in I_{low}(\alpha^k), \; -y_t \nabla f(\alpha^k)_t < -y_i \nabla f(\alpha^k)_i \}.
3. Return B = {i, j}.

This picks the same i as when using the gradient information, and checks only O(l) candidate sets B to decide j.

Sub({i, t}) can be easily solved

If K_{ii} + K_{tt} - 2K_{it} > 0,

\text{Sub}(\{i, t\}) = -\frac{\bigl(-y_i \nabla f(\alpha^k)_i + y_t \nabla f(\alpha^k)_t\bigr)^2}{2\bigl(K_{ii} + K_{tt} - 2K_{it}\bigr)}.

Convergence is established in (Fan et al., 2005); details not shown here.
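Combining the heuristic with this closed form, a working-set selector might look as follows. This is an illustrative NumPy sketch under the same assumptions as the earlier snippets (arrays alpha, grad = ∇f(α), y, a precomputed kernel matrix K, bound C); the fallback to a small τ when the denominator is not positive mirrors LIBSVM's handling, but the function itself is not the LIBSVM source:

```python
import numpy as np

def second_order_working_set(alpha, grad, y, K, C, tau=1e-12, tol=1e-3):
    """Select B = {i, j} using second order information (sketch)."""
    I_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    I_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    score = -y * grad                          # -y_t * grad f(alpha)_t

    # Step 1: the same i as the maximal violating pair rule
    i = np.where(I_up)[0][np.argmax(score[I_up])]

    # Stop when no pair violates optimality by more than tol
    if score[i] - np.min(score[I_low]) < tol:
        return None

    # Step 2: among eligible t, minimize Sub({i, t}) = -b_it^2 / (2 a_it)
    best_j, best_val = -1, 0.0
    for t in np.where(I_low & (score < score[i]))[0]:
        a_it = K[i, i] + K[t, t] - 2.0 * K[i, t]
        if a_it <= 0:
            a_it = tau                         # guard for non-PD kernel matrices
        b_it = score[i] - score[t]             # = -y_i grad_i + y_t grad_t > 0
        val = -(b_it ** 2) / (2.0 * a_it)
        if val < best_val:
            best_j, best_val = t, val
    return i, best_j
```

A toy solver is then a loop that calls second_order_working_set (or maximal_violating_pair) and, while a pair is returned, applies the smo_step sketch above to update α and the gradient.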

Comparison of Two Selections

[Figure: iteration and time ratios between using second-order information and the maximal violating pair; curves for time (40M cache), time (100K cache), and total #iter; x-axis: data sets (image, splice, tree, a1a, australian, breast-cancer, diabetes, fourclass, german.numer, w1a, abalone, cadata, cpusmall, space_ga); y-axis: ratio.]

A complete comparison is not easy: one must try enough data sets and consider parameter selection. Details not shown here.

More about Second-Order Selection

What if we check all \binom{l}{2} sets?

[Figure: iteration ratio between checking all sets and checking only O(l) sets, during parameter selection and final training; x-axis: data sets (image, splice, tree, a1a, australian, breast-cancer, diabetes, fourclass, german.numer, w1a); y-axis: ratio.]

Checking everything gives fewer iterations, but the ratio (0.7 to 0.8) is not enough to justify the higher cost per iteration.

Why not keep feasibility?

\min_{d_B} \; \frac{1}{2} d_B^T \nabla^2 f(\alpha^k)_{BB} d_B + \nabla f(\alpha^k)_B^T d_B

Two types of constraints:

(i) y_B^T d_B = 0, \quad d_t \ge 0 \ \text{if} \ \alpha_t^k = 0, \quad d_t \le 0 \ \text{if} \ \alpha_t^k = C, \; t \in B;
(ii) y_B^T d_B = 0, \quad 0 \le \alpha_t^k + d_t \le C, \; t \in B.

Related work (Lai et al., 2003a,b): heuristically select some pairs and check the function reduction while keeping feasibility. This has a higher cost in selecting working sets. We proved that at the final iterations the two are indeed the same.

Conclusions

Finding better working sets for SVM decomposition methods is difficult. We proposed one based on second order information, with results better than the commonly used selection based on first order information. The implementation is in LIBSVM (version 2.8 and later), http://www.csie.ntu.edu.tw/~cjlin/libsvm, where it replaces the maximal violating pair selection.