StreamSVM: Linear SVMs and Logistic Regression When Data Does Not Fit In Memory

S.V.N. (Vishy) Vishwanathan, Purdue University and Microsoft
vishy@purdue.edu
October 9, 2012

Outline: 1 Linear Support Vector Machines; 2 StreamSVM; 3 Experiments - SVM; 4 Logistic Regression; 5 Experiments - Logistic Regression; 6 Conclusion

1 Linear Support Vector Machines

Binary Classification
[Figure, shown over two slides: points labeled y_i = +1 and y_i = -1; the second slide adds the separating hyperplane {x : ⟨w, x⟩ = 0}]

Linear Support Vector Machines
[Figure: maximum-margin separator with hyperplanes {x : ⟨w, x⟩ = 1}, {x : ⟨w, x⟩ = -1} around {x : ⟨w, x⟩ = 0}; for points x_1, x_2 on the two margins, ⟨w, x_1 - x_2⟩ = 2, i.e. the margin width is 2 / ||w||]

Linear Support Vector Machines: Optimization Problem
min_w  (1/2) ||w||^2 + C Σ_{i=1}^{m} max(0, 1 - y_i ⟨w, x_i⟩)
[Figure: the hinge loss max(0, 1 - margin) plotted against the margin]

The Dual Objective
min_α  D(α) := (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ - Σ_i α_i
s.t.  0 ≤ α_i ≤ C
Convex and quadratic objective; linear (box) constraints
w = Σ_i α_i y_i x_i
Each α_i corresponds to an x_i
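For readers who want the step the slide skips, here is a standard derivation sketch (not from the slides) of where the dual and the relation w = Σ_i α_i y_i x_i come from, via the Lagrangian of the hinge-loss primal with slack variables ξ_i:

\min_{w,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i \langle w, x_i \rangle \ge 1 - \xi_i,\ \ \xi_i \ge 0

L(w, \xi, \alpha, \beta) = \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
  - \sum_i \alpha_i \bigl( y_i \langle w, x_i \rangle - 1 + \xi_i \bigr)
  - \sum_i \beta_i \xi_i, \qquad \alpha_i, \beta_i \ge 0

\partial_w L = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
\qquad
\partial_{\xi_i} L = 0 \;\Rightarrow\; C - \alpha_i - \beta_i = 0 \;\Rightarrow\; 0 \le \alpha_i \le C

Substituting w back into L and maximizing over α yields the dual objective D(α) above.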

Coordinate Descent
[Figures, shown over several slides: a 3-D surface plot of a two-variable function together with one-dimensional slices along x and along y, illustrating that coordinate descent repeatedly minimizes the objective along one coordinate at a time]

Coordinate Descent in the Dual
One-dimensional function:
D̂(α_t) = (α_t^2 / 2) ⟨x_t, x_t⟩ + α_t Σ_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩ - α_t + const.
s.t.  0 ≤ α_t ≤ C
Solve:
D̂'(α_t) = α_t ⟨x_t, x_t⟩ + Σ_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩ - 1 = 0
α_t = median(0, C, (1 - Σ_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩) / ⟨x_t, x_t⟩)
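To make the update concrete, here is a minimal NumPy sketch (not the authors' implementation) of one epoch of dual coordinate descent for this problem. It assumes dense data and maintains w = Σ_i α_i y_i x_i, so the per-coordinate gradient y_t ⟨w, x_t⟩ - 1 costs O(d); the clipped update below is the median expression from the slide rewritten in terms of w.

import numpy as np

def dual_cd_epoch(X, y, alpha, w, C):
    """One pass of dual coordinate descent; X is (n, d) dense, y in {-1, +1}."""
    for i in np.random.permutation(X.shape[0]):
        Q_ii = X[i] @ X[i]
        if Q_ii == 0.0:                      # skip all-zero examples
            continue
        grad = y[i] * (w @ X[i]) - 1.0       # derivative of D along coordinate i
        new_alpha = np.clip(alpha[i] - grad / Q_ii, 0.0, C)
        w += (new_alpha - alpha[i]) * y[i] * X[i]   # keep w = sum_i alpha_i y_i x_i
        alpha[i] = new_alpha
    return alpha, w

Starting from alpha = np.zeros(n) and w = np.zeros(d) and repeating epochs until the largest gradient magnitude seen in a pass drops below a tolerance recovers the usual dual coordinate descent solver of Hsieh et al. (ICML 2008).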

2 StreamSVM

We are Collecting Lots of Data...
[Figure, repeated over four slides: labeled points with y_i = +1 and y_i = -1]

What if Data Does not Fit in Memory?
Idea 1: Block Minimization [Yu et al., KDD 2010]
  Split data into blocks B_1, B_2, ... such that each B_j fits in memory
  Compress and store each block separately
  Load one block of data at a time and optimize only those α_i's
Idea 2: Selective Block Minimization [Chang and Roth, KDD 2011]
  Split data into blocks B_1, B_2, ... such that each B_j fits in memory
  Compress and store each block separately
  Load one block of data at a time and optimize only those α_i's
  Retain informative samples from each block in main memory

What are Informative Samples?

Some Observations
SBM and BM are wasteful:
  Both split the data into blocks and compress the blocks
  This requires reading the entire data set at least once (expensive)
  Both pause optimization while a block is loaded into memory
Hardware 101:
  Disk I/O is slower than the CPU (sometimes by a factor of 100)
  Random access on HDD is terrible; sequential access is reasonably fast (factor of 10)
  Multi-core processors are becoming commonplace
How can we exploit this?

Our Architecture
[Diagram: a Reader thread streams data from the HDD into a working set cached in RAM; a Trainer thread updates the weight vector, also held in RAM, from the working set]

Our Philosophy
Iterate over the data in main memory while streaming data from disk.
Primarily evict those examples from main memory that are uninformative.

Reader
for k = 1, ..., max_iter do
  for i = 1, ..., n do
    if |A| = Ω then
      randomly select i' ∈ A
      A = A \ {i'}
      delete y_{i'}, Q_{i'i'}, x_{i'} from RAM
    end if
    read y_i, x_i from disk
    calculate Q_ii = ⟨x_i, x_i⟩
    store y_i, Q_ii, x_i in RAM
    A = A ∪ {i}
  end for
  if stopping criterion is met then exit
end for
(A is the index set of cached examples; Ω is the capacity of the working set.)

Trainer
α^1 = 0, w^1 = 0, ε = ∞, ε_new = 0, β = 0.9
while stopping criterion is not met do
  for t = 1, ..., n do
    if |A| > 0.9 Ω then ε = β ε
    randomly select i ∈ A and read y_i, Q_ii, x_i from RAM
    compute ∇_i D := y_i ⟨w^t, x_i⟩ - 1
    if (α_i^t = 0 and ∇_i D > ε) or (α_i^t = C and ∇_i D < -ε) then
      A = A \ {i} and delete y_i, Q_ii, x_i from RAM
      continue
    end if
    α_i^{t+1} = median(0, C, α_i^t - ∇_i D / Q_ii)
    w^{t+1} = w^t + (α_i^{t+1} - α_i^t) y_i x_i
    ε_new = max(ε_new, |∇_i D|)
  end for
  update stopping criterion; ε = ε_new
end while
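To show how the reader and trainer slides fit together, here is a minimal Python sketch (not the released StreamSVM code). It assumes an in-memory array standing in for the disk, a dict-based working set guarded by one coarse lock, a fixed eviction tolerance eps in place of the adaptive ε / ε_new schedule, and a simple update budget instead of a real stopping criterion.

import threading
import numpy as np

class WorkingSet:
    """RAM cache shared by reader and trainer: i -> (y_i, Q_ii, x_i)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}
        self.lock = threading.Lock()

    def admit(self, i, y_i, x_i):
        with self.lock:
            if len(self.cache) >= self.capacity:        # cache full: random eviction
                j = int(np.random.choice(list(self.cache)))
                del self.cache[j]
            self.cache[i] = (y_i, float(x_i @ x_i), x_i)

    def evict(self, i):
        with self.lock:
            self.cache.pop(i, None)

    def sample(self):
        with self.lock:
            if not self.cache:
                return None
            i = int(np.random.choice(list(self.cache)))
            return i, self.cache[i]

def reader(X, y, ws, stop):
    """Stream (y_i, x_i) sequentially from 'disk' (here: an in-memory array)."""
    while not stop.is_set():
        for i in range(X.shape[0]):
            if stop.is_set():
                return
            ws.admit(i, y[i], X[i])

def trainer(ws, alpha, w, C, eps, n_updates, stop):
    """Dual coordinate descent over cached points, evicting shrinkable ones."""
    for _ in range(n_updates):
        item = ws.sample()
        if item is None:
            continue
        i, (y_i, Q_ii, x_i) = item
        if Q_ii == 0.0:
            continue
        grad = y_i * (w @ x_i) - 1.0
        if (alpha[i] == 0.0 and grad > eps) or (alpha[i] == C and grad < -eps):
            ws.evict(i)                                  # uninformative right now
            continue
        new_alpha = float(np.clip(alpha[i] - grad / Q_ii, 0.0, C))
        w += (new_alpha - alpha[i]) * y_i * x_i
        alpha[i] = new_alpha
    stop.set()

A driver would start reader and trainer as threading.Thread targets sharing the same WorkingSet, alpha, and w, and join both once the trainer's budget (or a proper stopping criterion) is exhausted.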

Proof of Convergence (Sketch)
Definition (Luo and Tseng Problem)
minimize_α  g(Eα) + b^T α   subject to  L_i ≤ α_i ≤ U_i
where α, b ∈ R^n, E ∈ R^{d×n}, L_i ∈ [-∞, ∞), U_i ∈ (-∞, ∞].
The above optimization problem is a Luo and Tseng problem if the following conditions hold:
1 E has no zero columns
2 the set of optimal solutions A* is non-empty
3 the function g is strictly convex and twice continuously differentiable
4 for all optimal solutions α* ∈ A*, ∇²g(Eα*) is positive definite

Proof of Convergence (Sketch)
Lemma (see Theorem 1 of Hsieh et al., ICML 2008)
If no data point x_i = 0, then the SVM dual is a Luo and Tseng problem.
Definition (Almost cyclic rule)
There exists an integer B ≥ n such that every coordinate is iterated upon at least once every B successive iterations.
Theorem (Theorem 2.1 of Luo and Tseng, JOTA 1992)
Let {α^t} be a sequence of iterates generated by a coordinate descent method using the almost cyclic rule. Then α^t converges linearly to an element of A*.

Proof of Convergence (Sketch)
Lemma
If the trainer accesses the cached training data sequentially, and the following conditions hold:
  The trainer is at most κ ≥ 1 times faster than the reading thread, that is, the trainer performs at most κ coordinate updates in the time it takes the reader to read one training point from disk.
  A point is never evicted from RAM unless the α_i corresponding to that point has been updated.
then the update order conforms to the almost cyclic rule.

3 Experiments - SVM

Experiments
dataset     n        d        s(%)    n+:n-   Datasize
ocr         3.5 M    1156     100     0.96    45.28 GB
dna         50 M     800      25      3e-3    63.04 GB
webspam-t   0.35 M   16.61 M  0.022   1.54    20.03 GB
kddb        20.01 M  29.89 M  1e-4    6.18    4.75 GB

Does Active Eviction Work?
[Plot: relative function value difference vs. wall clock time (sec), linear SVM on webspam-t with C = 1.0; curves compare Random and Active eviction]

Comparison with Block Minimization
[Plots: relative function value difference (or relative objective function value) vs. wall clock time (sec) for StreamSVM, SBM, and BM with C = 1.0, one panel each for ocr, webspam-t, kddb, and dna]

Effect of Varying C
[Plots: relative function value difference vs. wall clock time (sec) for StreamSVM, SBM, and BM on dna with C = 10.0, C = 100.0, and C = 1000.0]

Varying Cache Size
[Plot: relative function value difference vs. wall clock time (sec) on kddb with C = 1.0 for cache sizes of 256 MB, 1 GB, 4 GB, and 16 GB]

Expanding Features
[Plot: relative function value difference vs. wall clock time (sec) on dna with expanded features, C = 1.0, comparing 16 GB and 32 GB caches]

4 Logistic Regression

Logistic Regression: Optimization Problem
min_w  (1/2) ||w||^2 + C Σ_{i=1}^{m} log(1 + exp(-y_i ⟨w, x_i⟩))
[Figure: the logistic loss plotted against the margin, alongside the hinge loss]

The Dual Objective
min_α  D(α) := (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i [ α_i log α_i + (C - α_i) log(C - α_i) ]
s.t.  0 ≤ α_i ≤ C
w = Σ_i α_i y_i x_i
Convex objective (but not quadratic); box constraints
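As a side note (not spelled out on the slide), the entropy-like terms are the convex conjugate of the logistic loss; a short sketch, dropping additive constants that do not depend on α:

\ell(z) = \log(1 + e^{-z}),
\qquad
\ell^{*}(-s) = s \log s + (1 - s)\log(1 - s), \quad s \in [0, 1]

C\,\ell^{*}\!\left(-\tfrac{\alpha_i}{C}\right)
  = \alpha_i \log \alpha_i + (C - \alpha_i)\log(C - \alpha_i) - C \log C

so Fenchel duality applied to \min_w \tfrac{1}{2}\|w\|^2 + C\sum_i \ell(y_i \langle w, x_i\rangle) gives, up to constants,

D(\alpha) = \tfrac{1}{2}\Bigl\|\sum_i \alpha_i y_i x_i\Bigr\|^2
  + \sum_i \bigl[\alpha_i \log \alpha_i + (C - \alpha_i)\log(C - \alpha_i)\bigr],
\qquad 0 \le \alpha_i \le C.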

Coordinate Descent
One-dimensional function:
D̂(α_t) := (α_t^2 / 2) ⟨x_t, x_t⟩ + α_t Σ_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩ + α_t log α_t + (C - α_t) log(C - α_t)
s.t.  0 ≤ α_t ≤ C
Solve: no closed-form solution; a modified Newton method does the job. See Yu, Huang, and Lin 2011 for details.
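For intuition, here is a sketch of solving this one-variable problem with a plain safeguarded (damped) Newton iteration. It is for illustration only and is not the modified Newton method of Yu, Huang, and Lin (2011), which handles the boundary more carefully; b stands for Σ_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩ and Q_tt for ⟨x_t, x_t⟩.

import math

def solve_1d_lr_dual(Q_tt, b, C, a0, iters=50, tol=1e-12):
    """Minimize  a^2/2 * Q_tt + a*b + a*log(a) + (C-a)*log(C-a)  over (0, C)."""
    pad = 1e-12 * C
    a = min(max(a0, pad), C - pad)               # start strictly inside (0, C)
    for _ in range(iters):
        g = Q_tt * a + b + math.log(a / (C - a))     # first derivative
        h = Q_tt + C / (a * (C - a))                 # second derivative (> 0)
        step = -g / h
        while a + step <= 0.0 or a + step >= C:      # damp until feasible
            step *= 0.5
        a += step
        if abs(step) < tol:
            break
    return a

Because the entropy terms blow up at 0 and C, the iterate is kept strictly inside the box; the damping loop always terminates since the current point is interior.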

Determining Importance: SVM Case
Shrinking:
  α_i^t = 0 and ∇_i D(α) > ε,  or
  α_i^t = C and ∇_i D(α) < -ε

Determining Importance: SVM Case
[Plot: the gradient of the hinge loss as a function of the margin]
|∇ loss(w, x_i, y_i)| = α_i  (at optimality, the loss-gradient magnitude matches the dual variable)
π_i D(α) = 0  ⟺  y_i ⟨w, x_i⟩ ≥ 1 - ε   (KKT condition)
Computing ε:  ε = max_{i ∈ A} |π_i D(α)|
(π_i D denotes the projected gradient of the dual along coordinate i.)

Determining Importance: Logistic Regression
[Plot: the gradient of the logistic loss as a function of the margin]
|∇ loss(w, x_i, y_i)| ≈ α_i
∇_i D(α) ≈ 0  ⟺  y_i ⟨w, x_i⟩ ≫ ε   (KKT condition)
Computing ε:  ε = max_{i ∈ A} |∇_i D(α)|

5 Experiments - Logistic Regression

Does Active Eviction Work?
[Plot: relative function value difference vs. wall clock time (sec) for logistic regression on webspam-t, C = 1.0; curves compare Random and Active eviction]

Comparison with Block Minimization
[Plot: relative function value difference vs. wall clock time (sec) for logistic regression on webspam-t, C = 1.0, comparing StreamSVM and BM]

6 Conclusion

Things I did not talk about / work in progress
StreamSVM:
  Amortizing the cost of training different models
  Other storage hierarchies (e.g. solid state drives)
Extensions to other problems:
  Multiclass problems
  Structured output prediction
  Distributed version of the algorithm

References
Shin Matsushima, S. V. N. Vishwanathan, and Alexander J. Smola. Linear Support Vector Machines via Dual Cached Loops. KDD 2012.
Code for StreamSVM available from http://www.r.dl.itc.u-tokyo.ac.jp/~masin/streamsvm.html

Joint work with Shin Matsushima and Alexander J. Smola
[Diagram: the reading thread reads the dataset sequentially from disk; the training thread loads and reads the cached data (working set) in RAM with random access and updates the weight vector in RAM]