Mini-Batch Primal and Dual Methods for SVMs
Mini-Batch Primal and Dual Methods for SVMs

Peter Richtárik
School of Mathematics, The University of Edinburgh

Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago)

arXiv preprint

Fête Parisienne in Computation, Inference and Optimization, March 20, 2013
References

- Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. arXiv, 2012.
- Shai Shalev-Shwartz, Yoram Singer and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. ICML 2007.
- Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv, 2012.
- Martin Takáč, Avleen Bijral, Peter Richtárik and Nathan Srebro. Mini-batch primal and dual methods for SVMs. arXiv:1303, 2013.
Support Vector Machine

[Figure: separating hyperplane $\langle w, x\rangle - b = 0$ with margin hyperplanes $\langle w, x\rangle - b = 1$ and $\langle w, x\rangle - b = -1$.]
Family Support Machine

[Figure.]
PART I: Stochastic Gradient Descent (SGD)
SVM: Primal Problem

Data: $\{(x_i, y_i) \in \mathbb{R}^d \times \{+1,-1\} : i \in S \stackrel{\text{def}}{=} \{1,2,\dots,n\}\}$
- Examples: $x_1, \dots, x_n$ (assumption: $\max_i \|x_i\|^2 \le 1$)
- Labels: $y_i \in \{+1, -1\}$

Optimization formulation of SVM:
$$\min_{w \in \mathbb{R}^d} \Big\{ P_S(w) \stackrel{\text{def}}{=} \frac{\lambda}{2}\|w\|^2 + \hat{L}_S(w) \Big\}, \qquad (P)$$
where
$$\hat{L}_A(w) \stackrel{\text{def}}{=} \frac{1}{|A|}\sum_{i\in A} \ell(y_i \langle w, x_i\rangle) \quad \text{(average hinge loss on the examples in } A\text{)},$$
$$\ell(\zeta) \stackrel{\text{def}}{=} \max\{0, 1-\zeta\} \quad \text{(hinge loss)}.$$
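As a concrete reference point, the primal objective above takes a few lines to evaluate. A minimal NumPy sketch (the function name and array layout are our own):

```python
import numpy as np

def primal_obj(w, X, y, lam):
    """P_S(w) = (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i * <w, x_i>).

    X holds one example x_i per row; y holds labels in {-1, +1}."""
    margins = y * (X @ w)                      # y_i * <w, x_i> for every i
    hinge = np.maximum(0.0, 1.0 - margins)     # hinge loss per example
    return 0.5 * lam * (w @ w) + hinge.mean()
```

Note that at $w = 0$ every hinge term equals 1, so $P_S(0) = 1$ regardless of the data.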
Pegasos (SGD)

Algorithm:
1. Choose $w_1 = 0 \in \mathbb{R}^d$
2. Iterate for $t = 1, 2, \dots, T$:
   2.1 Choose $A_t \subseteq S = \{1,2,\dots,n\}$, $|A_t| = b$, uniformly at random
   2.2 Set stepsize $\eta_t \leftarrow \frac{1}{\lambda t}$
   2.3 Update $w_{t+1} \leftarrow w_t - \eta_t \nabla P_{A_t}(w_t)$, i.e.,
   $$w_{t+1} \leftarrow (1 - \eta_t \lambda) w_t + \frac{\eta_t}{b} \sum_{i \in A_t :\; y_i \langle w_t, x_i\rangle < 1} y_i x_i$$

Theorem. For $\bar{w} = \frac{1}{T}\sum_{t=1}^T w_t$ we have
$$\mathbf{E}[P(\bar{w})] \le P(w^*) + c\,\frac{\log(T)}{\lambda T}, \qquad \text{where } c = (\sqrt{\lambda} + 1)^2.$$

Shai Shalev-Shwartz, Yoram Singer and Nathan Srebro. Pegasos: Primal Estimated sub-gradient SOlver for SVM, ICML 2007.
Pegasos (SGD): Mini-Batch Analysis

For the same algorithm, the new mini-batch analysis gives:

Theorem 1. For $\bar{w} = \frac{2}{T}\sum_{t=\lfloor T/2\rfloor + 1}^{T} w_t$ we have
$$\mathbf{E}[P(\bar{w})] \le P(w^*) + 30\,\frac{\beta_b}{b}\cdot\frac{1}{\lambda T},$$
where
$$\beta_b = 1 + \frac{(b-1)(n\sigma^2 - 1)}{n-1}, \qquad \sigma^2 \stackrel{\text{def}}{=} \frac{1}{n}\|Q\|, \qquad Q_{ij} = \langle y_i x_i, y_j x_j \rangle.$$

Martin Takáč, Avleen Bijral, P. R. and Nathan Srebro. Mini-batch primal and dual methods for SVMs, 2013.
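The update in step 2.3 is easy to state in code. A minimal NumPy sketch of the mini-batch Pegasos loop (function and variable names are our own; for simplicity it returns the average of all iterates, as in the classical theorem, rather than the tail average of Theorem 1):

```python
import numpy as np

def pegasos_minibatch(X, y, lam, b, T, seed=0):
    """Mini-batch Pegasos sketch: X is (n, d) with rows x_i, y in {-1,+1}^n."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        A = rng.choice(n, size=b, replace=False)   # mini-batch A_t, uniform
        eta = 1.0 / (lam * t)                      # stepsize eta_t = 1/(lambda t)
        margins = y[A] * (X[A] @ w)
        viol = A[margins < 1]                      # examples with active hinge loss
        # w_{t+1} = (1 - eta*lam) w_t + (eta/b) sum_{i in viol} y_i x_i
        w = (1 - eta * lam) * w + (eta / b) * (y[viol] @ X[viol])
        w_sum += w
    return w_sum / T                               # averaged iterate
```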
Insight into $\beta_b / b$

$$\frac{\beta_b}{b} = \frac{1}{b}\left(1 + \frac{(b-1)(n\sigma^2 - 1)}{n-1}\right)$$

Letting $X = [x_1, x_2, \dots, x_n]$, $Z = [y_1 x_1, \dots, y_n x_n]$ and assuming $\|x_i\| = 1$ for all $i$, we have
$$n\sigma^2 \stackrel{\text{def}}{=} \|Q\| = \|ZZ^T\| = \|Z^T Z\| = \lambda_{\max}(Z^T Z) \in \left[\frac{\operatorname{tr}(Z^T Z)}{n},\; \operatorname{tr}(Z^T Z)\right] = \Big[\underbrace{\tfrac{\operatorname{tr}(X^T X)}{n}}_{=1},\; \underbrace{\operatorname{tr}(X^T X)}_{=n}\Big].$$

- $n\sigma^2 = n \;\Rightarrow\; \beta_b/b = 1$ (no parallelization speedup; mini-batching does not help)
- $n\sigma^2 = 1 \;\Rightarrow\; \beta_b/b = 1/b$ (speedup equal to the batch size!)

A similar expression appears in P. R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, 2012, with $n\sigma^2$ replaced by $\omega$ (the degree of partial separability of the loss function).
Computing $\beta_b$: SVM Datasets

To run SDCA with safe mini-batching, we need to compute
$$\beta_b = 1 + \frac{(b-1)(n\sigma^2 - 1)}{n-1}, \qquad \text{where } n\sigma^2 = \lambda_{\max}(Z^T Z).$$
Two options:
- Compute the largest eigenvalue (e.g., by the power method)
- Replace $n\sigma^2$ by an upper bound: the degree of partial separability $\omega$

General SDCA methods based on $\omega$ are described in: P. R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, 2012.

[Table: # examples, # features, $\|A\|_0$, $n\sigma^2 \in [1, n]$ and $\omega$ for the datasets a1a, a9a, rcv1, real-sim, news20, url, webspam, kdda2010 and kddb2010 (e.g. rcv1: 20,242 examples, 47,236 features; news20: 19,996 examples, 1,355,191 features); the remaining numeric entries are not recoverable from this transcription.]
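For the first option, $n\sigma^2$ can be estimated without ever forming $Q$, by running power iteration through the data matrix. A sketch (names are our own; dense NumPy arrays are assumed, though the same matrix-free trick works for sparse data):

```python
import numpy as np

def n_sigma_sq(X, y, iters=100, seed=0):
    """Estimate n*sigma^2 = lambda_max(Z^T Z), Z = [y_1 x_1, ..., y_n x_n],
    by power iteration applied through Z (Q = Z Z^T is never formed)."""
    rng = np.random.default_rng(seed)
    Z = X * y[:, None]                 # row i is y_i * x_i
    v = rng.normal(size=X.shape[1])
    for _ in range(iters):
        v = Z.T @ (Z @ v)              # apply Z^T Z as two matrix-vector products
        v /= np.linalg.norm(v)
    return v @ (Z.T @ (Z @ v))        # Rayleigh quotient ~ lambda_max

def beta(b, n, nss):
    """beta_b = 1 + (b-1)(n*sigma^2 - 1)/(n - 1)."""
    return 1.0 + (b - 1) * (nss - 1.0) / (n - 1)
```

Since $y_i^2 = 1$, $Z^T Z = X^T X$, so the labels do not change the eigenvalue; they are kept only to mirror the slide's notation.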
Where Does $\beta_b$ Come From?

Lemma 1. Consider any symmetric $Q \in \mathbb{R}^{n\times n}$, a random subset $A \subseteq \{1,2,\dots,n\}$ with $|A| = b$ chosen uniformly at random, and $v \in \mathbb{R}^n$. Then
$$\mathbf{E}\big[v_{[A]}^T Q v_{[A]}\big] = \frac{b}{n}\left[\left(1 - \frac{b-1}{n-1}\right)\sum_{i=1}^n Q_{ii} v_i^2 + \frac{b-1}{n-1}\, v^T Q v\right].$$
Moreover, if $Q_{ii} \le 1$ for all $i$, then we get the following ESO (Expected Separable Overapproximation):
$$\mathbf{E}\big[v_{[A]}^T Q v_{[A]}\big] \le \frac{b}{n}\,\beta_b\,\|v\|^2.$$

Remark: ESO inequalities are systematically developed in P. R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, 2012.
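Because the first statement of Lemma 1 is an exact identity, it can be checked numerically by enumerating every subset of size $b$. A small self-contained check (all names are our own):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, b = 6, 3
M = rng.normal(size=(n, n))
Q = M @ M.T                        # an arbitrary symmetric matrix
v = rng.normal(size=n)

# left-hand side: exact expectation over all uniform subsets A of size b
subsets = list(combinations(range(n), b))
lhs = 0.0
for A in subsets:
    vA = np.zeros(n)
    vA[list(A)] = v[list(A)]       # v_[A]: keep coordinates in A, zero the rest
    lhs += vA @ Q @ vA
lhs /= len(subsets)

# right-hand side of Lemma 1
c = (b - 1) / (n - 1)
rhs = (b / n) * ((1 - c) * np.sum(np.diag(Q) * v**2) + c * (v @ Q @ v))
```

The two quantities agree to machine precision.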
Insight into the Analysis

The classical Pegasos analysis uses the inequality
$$\|\nabla \hat{L}_{A_t}(w)\|^2 \le 1,$$
which holds for any $A_t \subseteq S = \{1,2,\dots,n\}$.

The new analysis uses the inequality
$$\mathbf{E}\,\|\nabla \hat{L}_{A_t}(w)\|^2 \le \frac{\beta_b}{b},$$
which holds for $A_t$ with $|A_t| = b$ chosen uniformly at random (established by the previous lemma).
PART II: Stochastic Dual Coordinate Ascent (SDCA)
Stochastic Dual Coordinate Ascent (SDCA)

Problem:
$$\max_{\alpha \in \mathbb{R}^n,\; 0 \le \alpha_i \le 1} \Big\{ D(\alpha) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \alpha_i - \frac{1}{2\lambda n^2}\, \alpha^T Q \alpha \Big\} \qquad (D)$$

Algorithm:
1. Choose $\alpha_0 = 0 \in \mathbb{R}^n$
2. For $t = 0, 1, 2, \dots$ iterate:
   2.1 Choose $i \in \{1,\dots,n\}$ uniformly at random
   2.2 Set $\delta^* \leftarrow \arg\max_{\delta} \{D(\alpha_t + \delta e_i) : 0 \le \alpha_i + \delta \le 1\}$
   2.3 $\alpha_{t+1} \leftarrow \alpha_t + \delta^* e_i$

First proposed for SVM by: C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan, A dual coordinate descent method for large-scale linear SVM, ICML 2008.

General analysis in: P. R. and M. Takáč, Iteration complexity of randomized block-coordinate descent methods..., MAPR 2012. [INFORMS Computing Society Best Student Paper Prize (runner-up), 2012]
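For the SVM dual, the one-dimensional subproblem in step 2.2 has a closed form: $D(\alpha + \delta e_i)$ is a concave quadratic in $\delta$, so its box-constrained maximizer is the unconstrained one clipped to the box. A sketch (our own naming; assumes $Q_{ii} > 0$):

```python
import numpy as np

def sdca_coordinate_step(alpha, Q, lam, i):
    """max_delta D(alpha + delta*e_i) subject to 0 <= alpha_i + delta <= 1.

    Setting the derivative 1/n - ((Q alpha)_i + delta*Q_ii)/(lam*n^2) to zero
    gives delta* = (lam*n - (Q alpha)_i)/Q_ii; for a concave 1-d quadratic,
    clipping the unconstrained maximizer to [0, 1] is exact."""
    n = len(alpha)
    delta = (lam * n - Q[i] @ alpha) / Q[i, i]
    out = alpha.copy()
    out[i] = np.clip(alpha[i] + delta, 0.0, 1.0)
    return out
```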
Naive Mini-Batching / Parallelization

Problem: the dual (D) above.

Algorithm:
1. Choose $\alpha_0 = 0 \in \mathbb{R}^n$
2. For $t = 0, 1, 2, \dots$ iterate:
   2.1 Choose $A_t \subseteq \{1,\dots,n\}$, $|A_t| = b$, uniformly at random
   2.2 Set $\alpha_{t+1} \leftarrow \alpha_t$
   2.3 For each $i \in A_t$:
       - Set $\delta^* \leftarrow \arg\max_{\delta} \{D(\alpha_t + \delta e_i) : 0 \le \alpha_i + \delta \le 1\}$
       - $\alpha_{t+1} \leftarrow \alpha_{t+1} + \delta^* e_i$

Analyzed in: Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin, Parallel coordinate descent for L1-regularized loss minimization, ICML 2011.
- Convergence guaranteed only for small $b$ ($\beta_b \le 2$)
- Analysis does not cover the SVM dual (D)
Example: Failure of Naive Parallelization

[Figure.]
Example: Details

Problem: $Q = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, $\lambda = \tfrac{1}{2}$, $n = b = 2$, so
$$D(\alpha) = \tfrac{1}{2}\, e^T \alpha - \tfrac{1}{4}\, \alpha^T Q \alpha.$$

The naive approach produces the sequence:
- $\alpha_0 = (0, 0)^T$ with $D(\alpha_0) = 0$
- $\alpha_1 = (1, 1)^T$ with $D(\alpha_1) = 0$
- $\alpha_2 = (0, 0)^T$ with $D(\alpha_2) = 0$
- $\alpha_3 = (1, 1)^T$ with $D(\alpha_3) = 0$
- ...

Optimal solution: $D(\alpha^*) = D\big((\tfrac12, \tfrac12)^T\big) = \tfrac14$.
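The oscillation on this example can be reproduced in a few lines (a self-contained sketch; variable names are our own):

```python
import numpy as np

# Naive parallel SDCA on the 2x2 example from the slide:
# Q = [[1,1],[1,1]], n = 2, lambda = 1/2, so D(alpha) = 0.5*sum(alpha) - 0.25*alpha'Q alpha.
Q = np.ones((2, 2))
lam, n = 0.5, 2

def D(a):
    return np.sum(a) / n - a @ Q @ a / (2 * lam * n**2)

def naive_step(a):
    """Each coordinate maximizes D independently; both are applied at once (b = n = 2)."""
    new = a.copy()
    for i in range(2):
        # concave 1-d quadratic in delta; clip the unconstrained maximizer to the box
        delta = (lam * n - Q[i] @ a) / Q[i, i]
        new[i] = np.clip(a[i] + delta, 0.0, 1.0)
    return new

a = np.zeros(2)
traj = [a]
for _ in range(4):
    a = naive_step(a)
    traj.append(a)
# alpha oscillates (0,0) -> (1,1) -> (0,0) -> ... with D = 0 throughout,
# while the optimum D((1/2, 1/2)) = 1/4 is never approached.
```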
Safe Mini-Batching

Instead of choosing $\delta$ naively via maximizing the original function
$$D(\alpha + \delta) = -\frac{\alpha^T Q \alpha + 2\alpha^T Q \delta + \delta^T Q \delta}{2\lambda n^2} + \frac{1}{n}\sum_{i=1}^n (\alpha_i + \delta_i),$$
work with its Expected Separable Underapproximation:
$$H_{\beta_b}(\delta, \alpha) := -\frac{\alpha^T Q \alpha + 2\alpha^T Q \delta + \beta_b \|\delta\|^2}{2\lambda n^2} + \frac{1}{n}\sum_{i=1}^n (\alpha_i + \delta_i).$$
That is, instead of
$$\delta^* \leftarrow \arg\max_{\delta} \{D(\alpha + \delta e_i) : 0 \le \alpha_i + \delta \le 1\}, \qquad i \in A_t,$$
do
$$\delta^* \leftarrow \arg\max_{\delta} \{H_{\beta_b}(\delta e_i, \alpha) : 0 \le \alpha_i + \delta \le 1\}, \qquad i \in A_t.$$
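Maximizing $H_{\beta_b}$ coordinate-wise changes only one thing relative to the naive update: the curvature $Q_{ii}$ is replaced by $\beta_b$, which shrinks every step. A sketch (our own naming; assumes the normalization $Q_{ii} \le 1$ from Lemma 1):

```python
import numpy as np

def safe_minibatch_step(alpha, Q, lam, A, beta_b):
    """One safe mini-batch SDCA step: each i in A maximizes H_{beta_b}(delta*e_i, alpha),
    giving delta_i = (lam*n - (Q alpha)_i)/beta_b, clipped to the box [0, 1]."""
    n = len(alpha)
    Qa = Q @ alpha                      # computed once, at the old point
    out = alpha.copy()
    for i in A:
        delta = (lam * n - Qa[i]) / beta_b
        out[i] = np.clip(alpha[i] + delta, 0.0, 1.0)
    return out
```

On the 2x2 failure example, $n\sigma^2 = \lambda_{\max}(Q) = 2$, so $\beta_2 = 2$ and the safe step from $(0,0)$ lands exactly on the optimum $(\tfrac12, \tfrac12)$.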
Safe Mini-Batching: General Theory

Developed in: P. R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, 2012. Based on the idea: if you can't guarantee descent/ascent, guarantee it in expectation.

Definition (Expected Separable Overapproximation). Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex and smooth. Let $\hat{S}$ be a random subset of $\{1,2,\dots,n\}$ such that $\mathbf{P}(i \in \hat{S}) = \text{const}$ for all $i$ (uniform sampling), and for $w \in \mathbb{R}^n_{++}$ define $\|x\|_w \stackrel{\text{def}}{=} \big(\sum_{i=1}^n w_i x_i^2\big)^{1/2}$. Then we say that $f$ admits a $(\beta, w)$-ESO w.r.t. $\hat{S}$ if for all $x, h \in \mathbb{R}^n$:
$$\mathbf{E}\big[f(x + h_{[\hat{S}]})\big] \le f(x) + \frac{\mathbf{E}[|\hat{S}|]}{n}\left(\langle \nabla f(x), h\rangle + \frac{\beta}{2}\|h\|_w^2\right).$$
ESO for SVM Dual

The ESO can also be written as
$$\mathbf{E}\big[f(x + h_{[\hat{S}]})\big] \le \left(1 - \frac{\mathbf{E}[|\hat{S}|]}{n}\right) f(x) + \frac{\mathbf{E}[|\hat{S}|]}{n} \underbrace{\left(f(x) + \langle \nabla f(x), h\rangle + \frac{\beta}{2}\|h\|_w^2\right)}_{\stackrel{\text{def}}{=}\, H_\beta(h;\, x)}.$$

SVM setting: $f(x) = D(\alpha)$, $h = \delta$, $\hat{S} = A_t$, $|\hat{S}| = b$, $w = (1,\dots,1)^T$.

Lemma 3. For the SVM dual we have, for all $\alpha, \delta \in \mathbb{R}^n$, the following ESO:
$$\mathbf{E}\big[D(\alpha + \delta_{[A_t]})\big] \ge \Big(1 - \frac{b}{n}\Big) D(\alpha) + \frac{b}{n}\, H_{\beta_b}(\delta;\, \alpha),$$
where
$$H_{\beta_b}(\delta, \alpha) = -\frac{\alpha^T Q \alpha + 2\alpha^T Q \delta + \beta_b \|\delta\|^2}{2\lambda n^2} + \frac{1}{n}\sum_{i=1}^n (\alpha_i + \delta_i)
\qquad\text{and}\qquad
\beta_b \stackrel{\text{def}}{=} 1 + \frac{(b-1)(n\sigma^2 - 1)}{n-1}.$$
Primal Suboptimality for SDCA with Safe Mini-Batching

Theorem 2. Run SDCA with safe mini-batching, with $\alpha_0 = 0 \in \mathbb{R}^n$, and let $\epsilon > 0$. If
$$t_0 \ge \max\left\{0,\; \frac{n}{b}\log\Big(\frac{2\lambda n}{\beta_b}\Big)\right\}, \qquad
T_0 \ge t_0 + \left[\frac{4\beta_b}{b\lambda\epsilon} - \frac{2n}{\beta_b}\right]_+, \qquad
T \ge T_0 + \max\left\{\frac{n}{b},\; \frac{\beta_b}{b\lambda\epsilon}\right\},$$
then
$$\bar{\alpha} \stackrel{\text{def}}{=} \frac{1}{T - T_0}\sum_{t=T_0}^{T-1} \alpha_t, \qquad
w(\bar{\alpha}) \stackrel{\text{def}}{=} \frac{1}{\lambda n}\sum_{i=1}^n \bar{\alpha}_i y_i x_i$$
is an $\epsilon$-approximate solution to the PRIMAL problem, i.e.,
$$\mathbf{E}\big[P(w(\bar{\alpha}))\big] - P(w^*) \le \mathbf{E}\big[\underbrace{P(w(\bar{\alpha})) - D(\bar{\alpha})}_{\text{duality gap}}\big] \le \epsilon.$$
Primal Suboptimality: Simple Expression

$$T \approx \frac{\beta_b}{b}\cdot\frac{5}{\lambda\epsilon} + \frac{n}{b}\left(1 + \log\Big(\frac{2\lambda n}{\beta_b}\Big)\right)$$
PART III: SGD vs SDCA: Theory and Numerics
SGD vs. SDCA: Theory

- Stochastic Gradient Descent (SGD) needs
$$T = \frac{\beta_b}{b}\cdot\frac{30}{\lambda\epsilon}.$$
- Stochastic Dual Coordinate Ascent (SDCA, with safe mini-batching) needs
$$T = \frac{\beta_b}{b}\cdot\frac{5}{\lambda\epsilon} + \frac{n}{b}\left(1 + \log\Big(\frac{2\lambda n}{\beta_b}\Big)\right).$$
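The two bounds are easy to compare numerically. A sketch plugging in representative values (the constants follow the slide; the specific $\lambda$, $\epsilon$, $n$ below are illustrative choices of ours):

```python
import math

def t_sgd(beta_b, b, lam, eps):
    """SGD bound: T = (beta_b/b) * 30/(lam*eps)."""
    return (beta_b / b) * 30.0 / (lam * eps)

def t_sdca(beta_b, b, lam, eps, n):
    """Safe mini-batch SDCA bound:
    T = (beta_b/b) * 5/(lam*eps) + (n/b) * (1 + log(2*lam*n/beta_b))."""
    return (beta_b / b) * 5.0 / (lam * eps) \
        + (n / b) * (1.0 + math.log(2.0 * lam * n / beta_b))

# e.g. n = 10^5, lam = 1e-4, eps = 1e-3, b = 100, beta_b = 10:
# the SGD bound is ~3e7 iterations; the SDCA bound is ~5e6 plus a small log term.
```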
Numerical Experiments: Datasets

Data       # train    # test     # features (n)
cov        522,911    58,101     54
rcv1       20,242     677,399    47,236
astro-ph   29,882     32,487     99,757
news20     15,020     4,976      1,355,191

(The slide also lists the sparsity and the regularization parameter $\lambda$ for each dataset; those entries are not recoverable from this transcription.)
Batch Size vs Iterations (covertype)

[Figure: iterations needed to reach accuracy $\epsilon$ on covertype, as a function of batch size, for Pegasos (SGD), naive SDCA, safe SDCA and aggressive SDCA; $\beta_b/b$ shown on the right axis.]
Batch Size vs Iterations (astro-ph)

[Figure: iterations vs. batch size on astro-ph.]
Numerical Experiments (astro-ph)

[Figure: left panel: test error vs. iterations on astro-ph; right panel: primal/dual suboptimality vs. iterations on astro-ph with b = 8192.]
Test Error and Primal/Dual Suboptimality (news20)

[Figure: test error (left, b = 256) and primal/dual suboptimality (right) vs. iterations on news20, for Pegasos (SGD), naive SDCA, safe SDCA and aggressive SDCA.]
Summary 1

Shai Shalev-Shwartz, Yoram Singer and Nathan Srebro. Pegasos: Primal Estimated sub-gradient SOlver for SVM, ICML 2007.
+ Analysis of SGD for b = 1
- Weak analysis for b > 1 (no speedup)

P. R. and Martin Takáč. Parallel coordinate descent methods for big data optimization, 2012.
+ General analysis of SDCA for b > 1 (even variable b)
+ ESO: Expected Separable Overapproximation
- Dual suboptimality only

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization, 2012.
+ Primal suboptimality for SDCA with b = 1
- No analysis for b > 1
Summary 2

Martin Takáč, Avleen Bijral, P. R. and Nathan Srebro. Mini-batch primal and dual methods for SVMs, 2013.
- First analysis of mini-batched SGD for the SVM primal which works
- New mini-batch SDCA method for the SVM dual, with safe mini-batching and with aggressive mini-batching
- Both SGD and SDCA: have guarantees in terms of primal suboptimality; the spectral norm of the data controls the parallelization speedup; have essentially identical iterations
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationSimultaneous Model Selection and Optimization through Parameter-free Stochastic Learning
Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Francesco Orabona Yahoo! Labs New York, USA francesco@orabona.com Abstract Stochastic gradient descent algorithms
More informationLearning in a Distributed and Heterogeneous Environment
Learning in a Distributed and Heterogeneous Environment Martin Jaggi EPFL Machine Learning and Optimization Laboratory mlo.epfl.ch Inria - EPFL joint Workshop - Paris - Feb 15 th Machine Learning Methods
More informationSADAGRAD: Strongly Adaptive Stochastic Gradient Methods
Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,
More informationBeating SGD: Learning SVMs in Sublinear Time
Beating SGD: Learning SVMs in Sublinear Time Elad Hazan Tomer Koren Technion, Israel Institute of Technology Haifa, Israel 32000 {ehazan@ie,tomerk@cs}.technion.ac.il Nathan Srebro Toyota Technological
More informationMachine Learning in the Data Revolution Era
Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,
More informationarxiv: v1 [math.oc] 18 Mar 2016
Katyusha: Accelerated Variance Reduction for Faster SGD Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University arxiv:1603.05953v1 [math.oc] 18 Mar 016 March 18, 016 Abstract We consider minimizing
More informationStreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory
StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory S.V. N. (vishy) Vishwanathan Purdue University and Microsoft vishy@purdue.edu October 9, 2012 S.V. N. Vishwanathan (Purdue,
More informationCase Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!
Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:
More informationA Dual Coordinate Descent Method for Large-scale Linear SVM
Cho-Jui Hsieh b92085@csie.ntu.edu.tw Kai-Wei Chang b92084@csie.ntu.edu.tw Chih-Jen Lin cjlin@csie.ntu.edu.tw Department of Computer Science, National Taiwan University, Taipei 06, Taiwan S. Sathiya Keerthi
More informationDoubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization
Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Tomoya Murata Taiji Suzuki More recently, several authors have proposed accelerarxiv:70300439v4
More informationPrimal-dual coordinate descent
Primal-dual coordinate descent Olivier Fercoq Joint work with P. Bianchi & W. Hachem 15 July 2015 1/28 Minimize the convex function f, g, h convex f is differentiable Problem min f (x) + g(x) + h(mx) x
More informationRandomized Smoothing Techniques in Optimization
Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter Bartlett, Michael Jordan, Martin Wainwright, Andre Wibisono Stanford University Information Systems Laboratory
More informationDistributed Box-Constrained Quadratic Optimization for Dual Linear SVM
ICML 15 Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Ching-pei Lee LEECHINGPEI@GMAIL.COM Dan Roth DANR@ILLINOIS.EDU University of Illinois at Urbana-Champaign, 201 N. Goodwin
More informationStochastic Subgradient Method
Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)
More informationA Greedy Framework for First-Order Optimization
A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts
More informationOnline Sparse Passive Aggressive Learning with Kernels
Online Sparse Passive Aggressive Learning with Kernels Jing Lu Peilin Zhao Steven C.H. Hoi Abstract Conventional online kernel methods often yield an unbounded large number of support vectors, making them
More informationOptimization Methods for Machine Learning
Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction
More informationarxiv: v4 [math.oc] 5 Jan 2016
Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The
More informationKernelized Perceptron Support Vector Machines
Kernelized Perceptron Support Vector Machines Emily Fox University of Washington February 13, 2017 What is the perceptron optimizing? 1 The perceptron algorithm [Rosenblatt 58, 62] Classification setting:
More informationBeyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization
JMLR: Workshop and Conference Proceedings vol (2010) 1 16 24th Annual Conference on Learning heory Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization
More informationarxiv: v6 [math.oc] 24 Sep 2018
: The First Direct Acceleration of Stochastic Gradient Methods version 6 arxiv:63.5953v6 [math.oc] 4 Sep 8 Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University / Institute for Advanced Study March
More informationThe Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited
The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited Constantinos Panagiotaopoulos and Petroula Tsampoua School of Technology, Aristotle University of Thessalonii, Greece costapan@eng.auth.gr,
More information