StreamSVM: Linear SVMs and Logistic Regression When Data Does Not Fit In Memory

S.V.N. (Vishy) Vishwanathan, Purdue University and Microsoft
vishy@purdue.edu
October 9, 2012

Outline: 1 Linear Support Vector Machines; 2 StreamSVM; 3 Experiments - SVM; 4 Logistic Regression; 5 Experiments - Logistic Regression; 6 Conclusion

1 Linear Support Vector Machines

Binary Classification
[Figure, shown over two slides: points labeled y_i = +1 and y_i = -1; the second slide adds the separating hyperplane {x : ⟨w, x⟩ = 0}]

Linear Support Vector Machines
[Figure: maximum-margin separator with hyperplanes {x : ⟨w, x⟩ = 1}, {x : ⟨w, x⟩ = -1} around {x : ⟨w, x⟩ = 0}; for points x_1, x_2 on the two margins, ⟨w, x_1 - x_2⟩ = 2, i.e. the margin width is 2 / ||w||]

Linear Support Vector Machines: Optimization Problem
min_w  (1/2) ||w||^2 + C Σ_{i=1}^{m} max(0, 1 - y_i ⟨w, x_i⟩)
[Figure: the hinge loss max(0, 1 - margin) plotted against the margin]

The Dual Objective
min_α  D(α) := (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ - Σ_i α_i
s.t.  0 ≤ α_i ≤ C
Convex and quadratic objective; linear (box) constraints
w = Σ_i α_i y_i x_i
Each α_i corresponds to an x_i
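For readers who want the step the slide skips, here is a standard derivation sketch (not from the slides) of where the dual and the relation w = Σ_i α_i y_i x_i come from, via the Lagrangian of the hinge-loss primal with slack variables ξ_i:

\min_{w,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i \langle w, x_i \rangle \ge 1 - \xi_i,\ \ \xi_i \ge 0

L(w, \xi, \alpha, \beta) = \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
  - \sum_i \alpha_i \bigl( y_i \langle w, x_i \rangle - 1 + \xi_i \bigr)
  - \sum_i \beta_i \xi_i, \qquad \alpha_i, \beta_i \ge 0

\partial_w L = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
\qquad
\partial_{\xi_i} L = 0 \;\Rightarrow\; C - \alpha_i - \beta_i = 0 \;\Rightarrow\; 0 \le \alpha_i \le C

Substituting w back into L and maximizing over α yields the dual objective D(α) above.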

Coordinate Descent
[Figures, shown over several slides: a 3-D surface plot of a two-variable function together with one-dimensional slices along x and along y, illustrating that coordinate descent repeatedly minimizes the objective along one coordinate at a time]

Coordinate Descent in the Dual
One-dimensional function:
D̂(α_t) = (α_t^2 / 2) ⟨x_t, x_t⟩ + α_t Σ_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩ - α_t + const.
s.t.  0 ≤ α_t ≤ C
Solve:
D̂'(α_t) = α_t ⟨x_t, x_t⟩ + Σ_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩ - 1 = 0
α_t = median(0, C, (1 - Σ_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩) / ⟨x_t, x_t⟩)
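To make the update concrete, here is a minimal NumPy sketch (not the authors' implementation) of one epoch of dual coordinate descent for this problem. It assumes dense data and maintains w = Σ_i α_i y_i x_i, so the per-coordinate gradient y_t ⟨w, x_t⟩ - 1 costs O(d); the clipped update below is the median expression from the slide rewritten in terms of w.

import numpy as np

def dual_cd_epoch(X, y, alpha, w, C):
    """One pass of dual coordinate descent; X is (n, d) dense, y in {-1, +1}."""
    for i in np.random.permutation(X.shape[0]):
        Q_ii = X[i] @ X[i]
        if Q_ii == 0.0:                      # skip all-zero examples
            continue
        grad = y[i] * (w @ X[i]) - 1.0       # derivative of D along coordinate i
        new_alpha = np.clip(alpha[i] - grad / Q_ii, 0.0, C)
        w += (new_alpha - alpha[i]) * y[i] * X[i]   # keep w = sum_i alpha_i y_i x_i
        alpha[i] = new_alpha
    return alpha, w

Starting from alpha = np.zeros(n) and w = np.zeros(d) and repeating epochs until the largest gradient magnitude seen in a pass drops below a tolerance recovers the usual dual coordinate descent solver of Hsieh et al. (ICML 2008).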

2 StreamSVM

We are Collecting Lots of Data...
[Figure, repeated over four slides: labeled points with y_i = +1 and y_i = -1]

What if Data Does not Fit in Memory?
Idea 1: Block Minimization [Yu et al., KDD 2010]
  Split data into blocks B_1, B_2, ... such that each B_j fits in memory
  Compress and store each block separately
  Load one block of data at a time and optimize only those α_i's
Idea 2: Selective Block Minimization [Chang and Roth, KDD 2011]
  Split data into blocks B_1, B_2, ... such that each B_j fits in memory
  Compress and store each block separately
  Load one block of data at a time and optimize only those α_i's
  Retain informative samples from each block in main memory

What are Informative Samples?

Some Observations
SBM and BM are wasteful:
  Both split the data into blocks and compress the blocks
  This requires reading the entire data set at least once (expensive)
  Both pause optimization while a block is loaded into memory
Hardware 101:
  Disk I/O is slower than the CPU (sometimes by a factor of 100)
  Random access on HDD is terrible; sequential access is reasonably fast (factor of 10)
  Multi-core processors are becoming commonplace
How can we exploit this?

Our Architecture
[Diagram: a Reader thread streams data from the HDD into a working set cached in RAM; a Trainer thread updates the weight vector, also held in RAM, from the working set]

Our Philosophy
Iterate over the data in main memory while streaming data from disk.
Primarily evict those examples from main memory that are uninformative.

Reader
for k = 1, ..., max_iter do
  for i = 1, ..., n do
    if |A| = Ω then
      randomly select i' ∈ A
      A = A \ {i'}
      delete y_{i'}, Q_{i'i'}, x_{i'} from RAM
    end if
    read y_i, x_i from disk
    calculate Q_ii = ⟨x_i, x_i⟩
    store y_i, Q_ii, x_i in RAM
    A = A ∪ {i}
  end for
  if stopping criterion is met then exit
end for
(A is the index set of cached examples; Ω is the capacity of the working set.)

Trainer
α^1 = 0, w^1 = 0, ε = ∞, ε_new = 0, β = 0.9
while stopping criterion is not met do
  for t = 1, ..., n do
    if |A| > 0.9 Ω then ε = β ε
    randomly select i ∈ A and read y_i, Q_ii, x_i from RAM
    compute ∇_i D := y_i ⟨w^t, x_i⟩ - 1
    if (α_i^t = 0 and ∇_i D > ε) or (α_i^t = C and ∇_i D < -ε) then
      A = A \ {i} and delete y_i, Q_ii, x_i from RAM
      continue
    end if
    α_i^{t+1} = median(0, C, α_i^t - ∇_i D / Q_ii)
    w^{t+1} = w^t + (α_i^{t+1} - α_i^t) y_i x_i
    ε_new = max(ε_new, |∇_i D|)
  end for
  update stopping criterion; ε = ε_new
end while
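To show how the reader and trainer slides fit together, here is a minimal Python sketch (not the released StreamSVM code). It assumes an in-memory array standing in for the disk, a dict-based working set guarded by one coarse lock, a fixed eviction tolerance eps in place of the adaptive ε / ε_new schedule, and a simple update budget instead of a real stopping criterion.

import threading
import numpy as np

class WorkingSet:
    """RAM cache shared by reader and trainer: i -> (y_i, Q_ii, x_i)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}
        self.lock = threading.Lock()

    def admit(self, i, y_i, x_i):
        with self.lock:
            if len(self.cache) >= self.capacity:        # cache full: random eviction
                j = int(np.random.choice(list(self.cache)))
                del self.cache[j]
            self.cache[i] = (y_i, float(x_i @ x_i), x_i)

    def evict(self, i):
        with self.lock:
            self.cache.pop(i, None)

    def sample(self):
        with self.lock:
            if not self.cache:
                return None
            i = int(np.random.choice(list(self.cache)))
            return i, self.cache[i]

def reader(X, y, ws, stop):
    """Stream (y_i, x_i) sequentially from 'disk' (here: an in-memory array)."""
    while not stop.is_set():
        for i in range(X.shape[0]):
            if stop.is_set():
                return
            ws.admit(i, y[i], X[i])

def trainer(ws, alpha, w, C, eps, n_updates, stop):
    """Dual coordinate descent over cached points, evicting shrinkable ones."""
    for _ in range(n_updates):
        item = ws.sample()
        if item is None:
            continue
        i, (y_i, Q_ii, x_i) = item
        if Q_ii == 0.0:
            continue
        grad = y_i * (w @ x_i) - 1.0
        if (alpha[i] == 0.0 and grad > eps) or (alpha[i] == C and grad < -eps):
            ws.evict(i)                                  # uninformative right now
            continue
        new_alpha = float(np.clip(alpha[i] - grad / Q_ii, 0.0, C))
        w += (new_alpha - alpha[i]) * y_i * x_i
        alpha[i] = new_alpha
    stop.set()

A driver would start reader and trainer as threading.Thread targets sharing the same WorkingSet, alpha, and w, and join both once the trainer's budget (or a proper stopping criterion) is exhausted.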

Proof of Convergence (Sketch)
Definition (Luo and Tseng Problem)
minimize_α  g(Eα) + b^T α   subject to  L_i ≤ α_i ≤ U_i
where α, b ∈ R^n, E ∈ R^{d×n}, L_i ∈ [-∞, ∞), U_i ∈ (-∞, ∞].
The above optimization problem is a Luo and Tseng problem if the following conditions hold:
1 E has no zero columns
2 the set of optimal solutions A* is non-empty
3 the function g is strictly convex and twice continuously differentiable
4 for all optimal solutions α* ∈ A*, ∇²g(Eα*) is positive definite

Proof of Convergence (Sketch)
Lemma (see Theorem 1 of Hsieh et al., ICML 2008)
If no data point x_i = 0, then the SVM dual is a Luo and Tseng problem.
Definition (Almost cyclic rule)
There exists an integer B ≥ n such that every coordinate is iterated upon at least once every B successive iterations.
Theorem (Theorem 2.1 of Luo and Tseng, JOTA 1992)
Let {α^t} be a sequence of iterates generated by a coordinate descent method using the almost cyclic rule. Then α^t converges linearly to an element of A*.

Proof of Convergence (Sketch)
Lemma
If the trainer accesses the cached training data sequentially, and the following conditions hold:
  The trainer is at most κ ≥ 1 times faster than the reading thread, that is, the trainer performs at most κ coordinate updates in the time it takes the reader to read one training point from disk.
  A point is never evicted from RAM unless the α_i corresponding to that point has been updated.
then the update order conforms to the almost cyclic rule.

3 Experiments - SVM

Experiments
dataset     n        d        s(%)    n+:n-   Datasize
ocr         3.5 M    1156     100     0.96    45.28 GB
dna         50 M     800      25      3e-3    63.04 GB
webspam-t   0.35 M   16.61 M  0.022   1.54    20.03 GB
kddb        20.01 M  29.89 M  1e-4    6.18    4.75 GB

Does Active Eviction Work?
[Plot: relative function value difference vs. wall clock time (sec), linear SVM on webspam-t with C = 1.0; curves compare Random and Active eviction]

Comparison with Block Minimization
[Plots: relative function value difference (or relative objective function value) vs. wall clock time (sec) for StreamSVM, SBM, and BM with C = 1.0, one panel each for ocr, webspam-t, kddb, and dna]

Effect of Varying C
[Plots: relative function value difference vs. wall clock time (sec) for StreamSVM, SBM, and BM on dna with C = 10.0, C = 100.0, and C = 1000.0]

Varying Cache Size
[Plot: relative function value difference vs. wall clock time (sec) on kddb with C = 1.0 for cache sizes of 256 MB, 1 GB, 4 GB, and 16 GB]

Expanding Features
[Plot: relative function value difference vs. wall clock time (sec) on dna with expanded features, C = 1.0, comparing 16 GB and 32 GB caches]

4 Logistic Regression

Logistic Regression: Optimization Problem
min_w  (1/2) ||w||^2 + C Σ_{i=1}^{m} log(1 + exp(-y_i ⟨w, x_i⟩))
[Figure: the logistic loss plotted against the margin, alongside the hinge loss]

The Dual Objective
min_α  D(α) := (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i [ α_i log α_i + (C - α_i) log(C - α_i) ]
s.t.  0 ≤ α_i ≤ C
w = Σ_i α_i y_i x_i
Convex objective (but not quadratic); box constraints
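As a side note (not spelled out on the slide), the entropy-like terms are the convex conjugate of the logistic loss; a short sketch, dropping additive constants that do not depend on α:

\ell(z) = \log(1 + e^{-z}),
\qquad
\ell^{*}(-s) = s \log s + (1 - s)\log(1 - s), \quad s \in [0, 1]

C\,\ell^{*}\!\left(-\tfrac{\alpha_i}{C}\right)
  = \alpha_i \log \alpha_i + (C - \alpha_i)\log(C - \alpha_i) - C \log C

so Fenchel duality applied to \min_w \tfrac{1}{2}\|w\|^2 + C\sum_i \ell(y_i \langle w, x_i\rangle) gives, up to constants,

D(\alpha) = \tfrac{1}{2}\Bigl\|\sum_i \alpha_i y_i x_i\Bigr\|^2
  + \sum_i \bigl[\alpha_i \log \alpha_i + (C - \alpha_i)\log(C - \alpha_i)\bigr],
\qquad 0 \le \alpha_i \le C.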

Coordinate Descent
One-dimensional function:
D̂(α_t) := (α_t^2 / 2) ⟨x_t, x_t⟩ + α_t Σ_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩ + α_t log α_t + (C - α_t) log(C - α_t)
s.t.  0 ≤ α_t ≤ C
Solve: no closed-form solution; a modified Newton method does the job. See Yu, Huang, and Lin 2011 for details.
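For intuition, here is a sketch of solving this one-variable problem with a plain safeguarded (damped) Newton iteration. It is for illustration only and is not the modified Newton method of Yu, Huang, and Lin (2011), which handles the boundary more carefully; b stands for Σ_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩ and Q_tt for ⟨x_t, x_t⟩.

import math

def solve_1d_lr_dual(Q_tt, b, C, a0, iters=50, tol=1e-12):
    """Minimize  a^2/2 * Q_tt + a*b + a*log(a) + (C-a)*log(C-a)  over (0, C)."""
    pad = 1e-12 * C
    a = min(max(a0, pad), C - pad)               # start strictly inside (0, C)
    for _ in range(iters):
        g = Q_tt * a + b + math.log(a / (C - a))     # first derivative
        h = Q_tt + C / (a * (C - a))                 # second derivative (> 0)
        step = -g / h
        while a + step <= 0.0 or a + step >= C:      # damp until feasible
            step *= 0.5
        a += step
        if abs(step) < tol:
            break
    return a

Because the entropy terms blow up at 0 and C, the iterate is kept strictly inside the box; the damping loop always terminates since the current point is interior.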

Determining Importance: SVM Case
Shrinking:
  α_i^t = 0 and ∇_i D(α) > ε,  or
  α_i^t = C and ∇_i D(α) < -ε

Determining Importance: SVM Case
[Plot: the gradient of the hinge loss as a function of the margin]
|∇ loss(w, x_i, y_i)| = α_i  (at optimality, the loss-gradient magnitude matches the dual variable)
π_i D(α) = 0  ⟺  y_i ⟨w, x_i⟩ ≥ 1 - ε   (KKT condition)
Computing ε:  ε = max_{i ∈ A} |π_i D(α)|
(π_i D denotes the projected gradient of the dual along coordinate i.)

Determining Importance: Logistic Regression
[Plot: the gradient of the logistic loss as a function of the margin]
|∇ loss(w, x_i, y_i)| ≈ α_i
∇_i D(α) ≈ 0  ⟺  y_i ⟨w, x_i⟩ ≫ ε   (KKT condition)
Computing ε:  ε = max_{i ∈ A} |∇_i D(α)|

5 Experiments - Logistic Regression

Does Active Eviction Work?
[Plot: relative function value difference vs. wall clock time (sec) for logistic regression on webspam-t, C = 1.0; curves compare Random and Active eviction]

Comparison with Block Minimization
[Plot: relative function value difference vs. wall clock time (sec) for logistic regression on webspam-t, C = 1.0, comparing StreamSVM and BM]

6 Conclusion

Things I did not talk about / work in progress
StreamSVM:
  Amortizing the cost of training different models
  Other storage hierarchies (e.g. solid state drives)
Extensions to other problems:
  Multiclass problems
  Structured output prediction
  Distributed version of the algorithm

References
Shin Matsushima, S. V. N. Vishwanathan, and Alexander J. Smola. Linear Support Vector Machines via Dual Cached Loops. KDD 2012.
Code for StreamSVM available from http://www.r.dl.itc.u-tokyo.ac.jp/~masin/streamsvm.html

Joint work with Shin Matsushima and Alexander J. Smola
[Diagram: the reading thread reads the dataset sequentially from disk; the training thread loads and reads the cached data (working set) in RAM with random access and updates the weight vector in RAM]