Mini-Batch Primal and Dual Methods for SVMs


1 Mini-Batch Primal and Dual Methods for SVMs. Peter Richtárik, School of Mathematics, The University of Edinburgh. Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago). arXiv preprint. Fête Parisienne in Computation, Inference and Optimization, March 20.

2 Shai Shalev-Shwartz, Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. 2012, arXiv preprint.
Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. ICML 2007.
Peter Richtárik, Martin Takáč. Parallel coordinate descent methods for big data optimization. 2012, arXiv preprint.
Martin Takáč, Avleen Bijral, Peter Richtárik, Nathan Srebro. Mini-batch primal and dual methods for SVMs. 2013, arXiv:1303.

3 Support Vector Machine
[Figure: maximum-margin separating hyperplane $\langle w, x \rangle - b = 0$ with margin boundaries $\langle w, x \rangle - b = \pm 1$.]

4 Family Support Machine

5 PART I: Stochastic Gradient Descent (SGD)

6 SVM: Primal Problem
Data: $\{(x_i, y_i) \in \mathbb{R}^d \times \{+1, -1\} : i \in S \stackrel{\text{def}}{=} \{1, 2, \dots, n\}\}$
Examples: $x_1, \dots, x_n$ (assumption: $\max_i \|x_i\|_2 \le 1$)
Labels: $y_i \in \{+1, -1\}$
Optimization formulation of SVM:
$$\min_{w \in \mathbb{R}^d} \Big\{ P_S(w) \stackrel{\text{def}}{=} \tfrac{\lambda}{2} \|w\|^2 + \hat{L}_S(w) \Big\}, \qquad (P)$$
where $\hat{L}_A(w) \stackrel{\text{def}}{=} \frac{1}{|A|} \sum_{i \in A} \ell(y_i \langle w, x_i \rangle)$ is the average hinge loss on the examples in $A$, and $\ell(\zeta) \stackrel{\text{def}}{=} \max\{0, 1 - \zeta\}$ is the hinge loss.
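
To make the notation concrete, here is a minimal NumPy sketch (not from the slides) that evaluates the primal objective $P_S(w)$; the function name and the convention that `X` stores the examples $x_i$ as rows are my own choices.

```python
import numpy as np

def primal_objective(w, X, y, lam):
    """P_S(w) = (lam/2) * ||w||^2 + average hinge loss, as in (P).

    X: (n, d) array with examples x_i as rows; y: labels in {+1, -1}; lam: lambda.
    """
    margins = y * (X @ w)                    # y_i <w, x_i>
    hinge = np.maximum(0.0, 1.0 - margins)   # l(zeta) = max{0, 1 - zeta}
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```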

7 Pegasos (SGD)
Algorithm:
1. Choose $w_1 = 0 \in \mathbb{R}^d$
2. Iterate for $t = 1, 2, \dots, T$:
2.1 Choose $A_t \subseteq S = \{1, 2, \dots, n\}$ with $|A_t| = b$, uniformly at random
2.2 Set the stepsize $\eta_t \leftarrow \frac{1}{\lambda t}$
2.3 Update $w_{t+1} \leftarrow w_t - \eta_t \nabla P_{A_t}(w_t)$ (a subgradient step on the mini-batch objective)
Theorem. For $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w_t$ we have
$$E[P(\bar{w})] \le P(w^*) + c \, \frac{\log T}{\lambda T}, \qquad \text{where } c = (\sqrt{\lambda} + 1)^2.$$
Shai Shalev-Shwartz, Yoram Singer and Nathan Srebro. Pegasos: Primal Estimated sub-gradient SOlver for SVM, ICML 2007.

8 Pegasos (SGD), with the update in step 2.3 written out explicitly:
$$w_{t+1} \leftarrow (1 - \eta_t \lambda) w_t + \frac{\eta_t}{b} \sum_{i \in A_t : \, y_i \langle w_t, x_i \rangle < 1} y_i x_i.$$
(The algorithm and the theorem are the same as on the previous slide.)
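
The mini-batch Pegasos update above is straightforward to implement; a minimal NumPy sketch follows, assuming `X` holds the examples as rows and `y` the ±1 labels. The function name and the choice to return the plain iterate average (the point bounded by the ICML 2007 theorem) are illustrative, not code from the paper.

```python
import numpy as np

def pegasos_minibatch(X, y, lam, b, T, rng=None):
    """Mini-batch Pegasos: w <- (1 - eta*lam) w + (eta/b) * sum of y_i x_i over
    the examples in A_t that violate the margin, with eta_t = 1/(lam*t)."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        A = rng.choice(n, size=b, replace=False)   # A_t, |A_t| = b, uniform
        eta = 1.0 / (lam * t)
        margins = y[A] * (X[A] @ w)
        viol = A[margins < 1.0]                    # indices with y_i <w, x_i> < 1
        step = (y[viol][:, None] * X[viol]).sum(axis=0) if viol.size else 0.0
        w = (1.0 - eta * lam) * w + (eta / b) * step
        w_sum += w
    return w_sum / T                               # average of the iterates
```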

9 Pegasos (SGD): same algorithm, new mini-batch guarantee.
Theorem 1. For $\bar{w} = \frac{2}{T} \sum_{t=\lfloor T/2 \rfloor + 1}^{T} w_t$ we have
$$E[P(\bar{w})] \le P(w^*) + \frac{30 \, \beta_b}{b} \cdot \frac{1}{\lambda T},$$
where
$$\beta_b = 1 + \frac{(b-1)(n\sigma^2 - 1)}{n-1}, \qquad \sigma^2 \stackrel{\text{def}}{=} \frac{1}{n} \|Q\|, \qquad Q_{ij} = \langle y_i x_i, y_j x_j \rangle.$$
Martin Takáč, Avleen Bijral, P. R. and Nathan Srebro. Mini-batch primal and dual methods for SVMs, 2013.

10-13 Insight into $\beta_b / b$
$$\frac{\beta_b}{b} = \frac{1}{b} \left( 1 + \frac{(b-1)(n\sigma^2 - 1)}{n-1} \right)$$
Letting $X = [x_1, x_2, \dots, x_n]$, $Z = [y_1 x_1, \dots, y_n x_n]$ and assuming $\|x_i\| = 1$ for all $i$, we have
$$n\sigma^2 \stackrel{\text{def}}{=} \|Q\| = \|ZZ^T\| = \|Z^T Z\| = \lambda_{\max}(Z^T Z) \in \Big[ \tfrac{\mathrm{tr}(Z^T Z)}{n}, \, \mathrm{tr}(Z^T Z) \Big] = \Big[ \underbrace{\tfrac{\mathrm{tr}(X^T X)}{n}}_{=1}, \; \underbrace{\mathrm{tr}(X^T X)}_{=n} \Big].$$
Extreme cases:
$n\sigma^2 = n \;\Rightarrow\; \beta_b / b = 1$ (no parallelization speedup; mini-batching does not help)
$n\sigma^2 = 1 \;\Rightarrow\; \beta_b / b = 1/b$ (speedup equal to the batch size!)
A similar expression appears in P. R. and Martin Takáč, Parallel coordinate descent methods for big data problems, 2012, with $n\sigma^2$ replaced by $\omega$, the degree of partial separability of the loss function.

14 Computing $\beta_b$: SVM Datasets
To run SDCA with safe mini-batching, we need to compute $\beta_b = 1 + \frac{(b-1)(n\sigma^2 - 1)}{n-1}$, where $n\sigma^2 = \lambda_{\max}(Z^T Z)$. Two options: compute the largest eigenvalue directly (e.g., by the power method), or replace $n\sigma^2$ by an upper bound, the degree of partial separability $\omega$.
General SDCA methods based on $\omega$ are described in: P. R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, 2012.
[Table: nine benchmark datasets (a1a, a9a, rcv1, real-sim, news20, url, webspam, kdda2010, kddb2010) with their number of examples, number of features, number of nonzeros, $n\sigma^2 \in [1, n]$, and $\omega$; e.g. rcv1 has 20,242 examples and 47,236 features, and news20 has 19,996 examples and 1,355,191 features.]
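
A rough sketch of the first option: estimate $n\sigma^2 = \lambda_{\max}(Z^T Z)$ by plain power iteration and plug it into the formula for $\beta_b$. The function and variable names are mine; a production implementation would use a sparse eigensolver instead of the dense products below.

```python
import numpy as np

def compute_beta_b(X, y, b, iters=100, rng=None):
    """beta_b = 1 + (b - 1)(n*sigma^2 - 1)/(n - 1), with n*sigma^2 = lambda_max(Z^T Z),
    where Z has rows y_i x_i (this assumes the normalization ||x_i|| <= 1)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape
    Z = y[:, None] * X
    v = rng.standard_normal(d)
    for _ in range(iters):                   # power iteration on Z^T Z
        v = Z.T @ (Z @ v)
        v /= np.linalg.norm(v)
    n_sigma2 = v @ (Z.T @ (Z @ v))           # Rayleigh quotient ~= lambda_max(Z^T Z)
    return 1.0 + (b - 1) * (n_sigma2 - 1.0) / (n - 1)
```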

15-16 Where does $\beta_b$ Come from?
Lemma 1. Consider any symmetric $Q \in \mathbb{R}^{n \times n}$, a random subset $A \subseteq \{1, 2, \dots, n\}$ with $|A| = b$ chosen uniformly at random, and $v \in \mathbb{R}^n$. Then
$$E\big[ v_{[A]}^T Q v_{[A]} \big] = \frac{b}{n} \left[ \left( 1 - \frac{b-1}{n-1} \right) \sum_{i=1}^{n} Q_{ii} v_i^2 + \frac{b-1}{n-1} \, v^T Q v \right].$$
Moreover, if $Q_{ii} \le 1$ for all $i$, then we get the following ESO (Expected Separable Overapproximation):
$$E\big[ v_{[A]}^T Q v_{[A]} \big] \le \frac{b}{n} \, \beta_b \, \|v\|^2.$$
Remark: ESO inequalities are systematically developed in P. R. and Martin Takáč, Parallel coordinate descent methods for big data problems, 2012.
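
Lemma 1 is easy to sanity-check numerically. In the sketch below (illustrative, not from the paper), $v_{[A]}$ denotes the vector that agrees with $v$ on the coordinates in $A$ and is zero elsewhere; the Monte Carlo estimate of the left-hand side should match the closed form up to sampling noise.

```python
import numpy as np

def lemma1_check(Q, v, b, samples=100_000, seed=0):
    """Return (Monte Carlo estimate of E[v_[A]^T Q v_[A]], closed form from Lemma 1)."""
    rng = np.random.default_rng(seed)
    n = len(v)
    mc = 0.0
    for _ in range(samples):
        A = rng.choice(n, size=b, replace=False)   # uniform subset with |A| = b
        vA = np.zeros(n)
        vA[A] = v[A]
        mc += vA @ Q @ vA
    mc /= samples
    t = (b - 1) / (n - 1)
    exact = (b / n) * ((1 - t) * np.sum(np.diag(Q) * v**2) + t * (v @ Q @ v))
    return mc, exact

# Example: a random PSD Q with Q_ii <= 1, n = 10, b = 4.
# rng = np.random.default_rng(1); Z = rng.standard_normal((10, 3))
# Q = Z @ Z.T; Q /= Q.diagonal().max(); v = rng.standard_normal(10)
# print(lemma1_check(Q, v, 4))
```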

17-18 Insight into the Analysis
The classical Pegasos analysis uses the bound on the (sub)gradient of the mini-batch loss
$$\|\nabla \hat{L}_{A_t}(w)\|^2 \le 1,$$
which holds for any $A_t \subseteq S = \{1, 2, \dots, n\}$.
The new analysis uses the inequality
$$E \|\nabla \hat{L}_{A_t}(w)\|^2 \le \frac{\beta_b}{b},$$
which holds when $A_t$ with $|A_t| = b$ is chosen uniformly at random (established by the previous lemma).

19 PART II: Stochastic Dual Coordinate Ascent (SDCA)

20-21 Stochastic Dual Coordinate Ascent (SDCA)
Problem:
$$\max_{\alpha \in \mathbb{R}^n, \; 0 \le \alpha_i \le 1} \left\{ D(\alpha) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2 \lambda n^2} \alpha^T Q \alpha \right\} \qquad (D)$$
Algorithm:
1. Choose $\alpha_0 = 0 \in \mathbb{R}^n$
2. For $t = 0, 1, 2, \dots$ iterate:
2.1 Choose $i \in \{1, \dots, n\}$ uniformly at random
2.2 Set $\delta^* \leftarrow \arg\max_\delta \{ D(\alpha_t + \delta e_i) : 0 \le (\alpha_t)_i + \delta \le 1 \}$
2.3 $\alpha_{t+1} \leftarrow \alpha_t + \delta^* e_i$
First proposed for SVM by C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM, ICML 2008.
General analysis in P. R. and M. Takáč. Iteration complexity of randomized block-coordinate descent methods..., MAPR 2012. [INFORMS Computing Society Best Student Paper Prize (runner-up), 2012]
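
For the quadratic dual (D), step 2.2 has a closed form: the unconstrained maximizer in coordinate $i$ is $\delta = (\lambda n - (Q\alpha)_i)/Q_{ii}$, then clipped so that $0 \le \alpha_i + \delta \le 1$. Below is a minimal serial sketch that maintains $w(\alpha) = \frac{1}{\lambda n} \sum_i \alpha_i y_i x_i$, so that $(Q\alpha)_i = \lambda n \, y_i \langle w, x_i \rangle$; names are illustrative.

```python
import numpy as np

def sdca_serial(X, y, lam, T, rng=None):
    """Serial SDCA for (D): exact single-coordinate maximization with box clipping."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                                   # w(alpha)
    sqnorm = np.einsum('ij,ij->i', X, X)              # Q_ii = ||x_i||^2
    for _ in range(T):
        i = rng.integers(n)
        num = lam * n * (1.0 - y[i] * (X[i] @ w))     # lam*n - (Q alpha)_i
        delta = num / max(sqnorm[i], 1e-12)           # unconstrained maximizer
        delta = np.clip(alpha[i] + delta, 0.0, 1.0) - alpha[i]   # keep 0 <= alpha_i <= 1
        alpha[i] += delta
        w += (delta / (lam * n)) * y[i] * X[i]        # keep w(alpha) in sync
    return alpha, w
```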

22 Naive Mini-Batching / Parallelization
Problem: the dual (D) above.
Algorithm:
1. Choose $\alpha_0 = 0 \in \mathbb{R}^n$
2. For $t = 0, 1, 2, \dots$ iterate:
2.1 Choose $A_t \subseteq \{1, \dots, n\}$ with $|A_t| = b$, uniformly at random
2.2 Set $\alpha_{t+1} \leftarrow \alpha_t$
2.3 For each $i \in A_t$: set $\delta^* \leftarrow \arg\max_\delta \{ D(\alpha_t + \delta e_i) : 0 \le (\alpha_t)_i + \delta \le 1 \}$ and update $\alpha_{t+1} \leftarrow \alpha_{t+1} + \delta^* e_i$
Analyzed in Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization, ICML 2011.
Convergence is guaranteed only for small $b$ ($\beta_b \le 2$), and the analysis does not cover the SVM dual (D).

23 Example: Failure of Naive Parallelization

24 Example: Details
Problem: $Q = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, $\lambda = \frac{1}{n} = \frac{1}{2}$, $b = 2$, so that
$$D(\alpha) = \tfrac{1}{2} e^T \alpha - \tfrac{1}{4} \alpha^T Q \alpha.$$
The naive approach produces the sequence
$\alpha_0 = (0, 0)^T$ with $D(\alpha_0) = 0$,
$\alpha_1 = (1, 1)^T$ with $D(\alpha_1) = 0$,
$\alpha_2 = (0, 0)^T$ with $D(\alpha_2) = 0$,
$\alpha_3 = (1, 1)^T$ with $D(\alpha_3) = 0$, ...
Optimal solution: $D(\alpha^*) = D\big( (\tfrac{1}{2}, \tfrac{1}{2})^T \big) = \tfrac{1}{4}$.
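
The oscillation is easy to reproduce; here is a small sketch of the naive parallel step on this $2 \times 2$ example (structure and names are mine):

```python
import numpy as np

Q = np.array([[1.0, 1.0], [1.0, 1.0]])
n, lam = 2, 0.5                                       # lambda = 1/n = 1/2, b = n = 2

def D(alpha):
    return alpha.sum() / n - alpha @ Q @ alpha / (2 * lam * n**2)

def naive_parallel_step(alpha):
    """Each coordinate is maximized independently at the *old* alpha."""
    new = alpha.copy()
    for i in range(n):
        delta = (lam * n - Q[i] @ alpha) / Q[i, i]    # unconstrained coordinate maximizer
        new[i] = np.clip(alpha[i] + delta, 0.0, 1.0)
    return new

alpha = np.zeros(n)
for t in range(4):
    print(t, alpha, D(alpha))                         # alternates (0,0) <-> (1,1), D stays 0
    alpha = naive_parallel_step(alpha)
print("optimum:", D(np.array([0.5, 0.5])))            # 0.25
```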

25 Safe Mini-Batching
Instead of choosing $\delta$ naively by maximizing the original function
$$D(\alpha + \delta) = -\frac{\alpha^T Q \alpha + 2 \alpha^T Q \delta + \delta^T Q \delta}{2 \lambda n^2} + \sum_{i=1}^{n} \frac{\alpha_i + \delta_i}{n},$$
work with its Expected Separable Underapproximation
$$H_{\beta_b}(\delta; \alpha) := -\frac{\alpha^T Q \alpha + 2 \alpha^T Q \delta + \beta_b \|\delta\|^2}{2 \lambda n^2} + \sum_{i=1}^{n} \frac{\alpha_i + \delta_i}{n}.$$
That is, for $i \in A_t$, instead of
$$\delta^* \leftarrow \arg\max_\delta \{ D(\alpha + \delta e_i) : 0 \le \alpha_i + \delta \le 1 \},$$
do
$$\delta^* \leftarrow \arg\max_\delta \{ H_{\beta_b}(\delta e_i; \alpha) : 0 \le \alpha_i + \delta \le 1 \}.$$
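
Since $H_{\beta_b}$ replaces the coupling term $\delta^T Q \delta$ by $\beta_b \|\delta\|^2$, each coordinate in the batch gets the same closed-form step with curvature $\beta_b$: $\delta_i = (\lambda n - (Q\alpha)_i)/\beta_b$, clipped to the box. Below is a sketch of the resulting safe mini-batch SDCA iteration, under the assumption $\|x_i\| \le 1$ (so $Q_{ii} \le 1 \le \beta_b$); names are illustrative, not code from the paper.

```python
import numpy as np

def sdca_safe_minibatch(X, y, lam, b, beta, T, rng=None):
    """Mini-batch SDCA with the safe step: each i in A_t maximizes H_beta, i.e.
    the curvature Q_ii is replaced by beta = beta_b (see the earlier sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                                   # w(alpha) = (1/(lam*n)) sum alpha_i y_i x_i
    for _ in range(T):
        A = rng.choice(n, size=b, replace=False)      # A_t; all updates use the same old w
        num = lam * n * (1.0 - y[A] * (X[A] @ w))     # lam*n - (Q alpha)_i for i in A_t
        deltas = np.clip(alpha[A] + num / beta, 0.0, 1.0) - alpha[A]
        alpha[A] += deltas
        w += (y[A] * deltas) @ X[A] / (lam * n)       # apply all coordinate updates at once
    return alpha, w
```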

26 Safe Mini-Batching: General Theory
Developed in P. R. and Martin Takáč. Parallel coordinate descent methods for big data optimization, 2012.
Based on the idea: if you can't guarantee descent/ascent, guarantee it in expectation.
Definition (Expected Separable Overapproximation). Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex and smooth, let $\hat{S}$ be a random subset of $\{1, 2, \dots, n\}$ such that $P(i \in \hat{S}) = \text{const}$ for all $i$ (uniform sampling), and for $w \in \mathbb{R}^n_{++}$ define $\|x\|_w \stackrel{\text{def}}{=} \big( \sum_{i=1}^{n} w_i x_i^2 \big)^{1/2}$. We say that $f$ admits a $(\beta, w)$-ESO with respect to $\hat{S}$ if for all $x, h \in \mathbb{R}^n$:
$$E\big[ f(x + h_{[\hat{S}]}) \big] \le f(x) + \frac{E[|\hat{S}|]}{n} \left( \langle \nabla f(x), h \rangle + \frac{\beta}{2} \|h\|_w^2 \right).$$

27-28 ESO for SVM Dual
The ESO can also be written as
$$E\big[ f(x + h_{[\hat{S}]}) \big] \le \Big( 1 - \frac{E[|\hat{S}|]}{n} \Big) f(x) + \frac{E[|\hat{S}|]}{n} \underbrace{\Big( f(x) + \langle \nabla f(x), h \rangle + \frac{\beta}{2} \|h\|_w^2 \Big)}_{\stackrel{\text{def}}{=} H_\beta(h; x)}.$$
SVM setting: $f(x) = D(\alpha)$, $h = \delta$, $\hat{S} = A_t$, $|\hat{S}| = b$, $w = (1, \dots, 1)^T$ (with the inequality reversed, since $D$ is concave and maximized).
Lemma 3. For the SVM dual we have, for all $\alpha, \delta \in \mathbb{R}^n$, the following ESO:
$$E\big[ D(\alpha + \delta_{[A_t]}) \big] \ge \Big( 1 - \frac{b}{n} \Big) D(\alpha) + \frac{b}{n} H_{\beta_b}(\delta; \alpha),$$
where $H_{\beta_b}(\delta; \alpha)$ is as on the previous slide and $\beta_b \stackrel{\text{def}}{=} 1 + \frac{(b-1)(n\sigma^2 - 1)}{n-1}$.

29 Primal Suboptimality for SDCA with Safe Mini-Batching
Theorem 2. Run SDCA with safe mini-batching, starting from $\alpha_0 = 0 \in \mathbb{R}^n$, and let $\epsilon > 0$. If
$$t_0 \ge \max\Big\{ 0, \; \tfrac{n}{b} \log\big( \tfrac{2 \lambda n}{\beta_b} \big) \Big\}, \qquad
T_0 \ge t_0 + \tfrac{\beta_b}{b} \Big[ \tfrac{4}{\lambda \epsilon} - \tfrac{2n}{\beta_b} \Big]_+, \qquad
T \ge T_0 + \max\Big\{ \tfrac{n}{b}, \; \tfrac{\beta_b}{b} \cdot \tfrac{1}{\lambda \epsilon} \Big\},$$
then, with
$$\bar{\alpha} \stackrel{\text{def}}{=} \frac{1}{T - T_0} \sum_{t=T_0}^{T-1} \alpha_t,$$
the vector
$$w(\bar{\alpha}) \stackrel{\text{def}}{=} \frac{1}{\lambda n} \sum_{i=1}^{n} \bar{\alpha}_i y_i x_i$$
is an $\epsilon$-approximate solution to the PRIMAL problem, i.e.,
$$E[P(w(\bar{\alpha}))] - P(w^*) \le E\big[ \underbrace{P(w(\bar{\alpha})) - D(\bar{\alpha})}_{\text{duality gap}} \big] \le \epsilon.$$

30 Primal Suboptimality: Simple Expression
The iteration bound simplifies to an expression of the order of
$$\frac{\beta_b}{b} \cdot \frac{5}{\lambda \epsilon} + \frac{n}{b} \left( 1 + \log\Big( \frac{2 \lambda n}{\beta_b} \Big) \right).$$

31 PART III: SGD vs SDCA: Theory and Numerics


33 SGD vs. SDCA: Theory
Mini-batch Stochastic Gradient Descent (SGD) needs
$$T = \frac{\beta_b}{b} \cdot \frac{30}{\lambda \epsilon}.$$
Stochastic Dual Coordinate Ascent (SDCA) with safe mini-batching needs
$$T = \frac{\beta_b}{b} \cdot \frac{5}{\lambda \epsilon} + \frac{n}{b} \left( 1 + \log\Big( \frac{2 \lambda n}{\beta_b} \Big) \right).$$
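
To compare the two bounds for a concrete dataset one can simply evaluate them as functions of the batch size; a small sketch mirroring the expressions above (function names are mine):

```python
import numpy as np

def sgd_bound(beta_b, b, lam, eps):
    """Mini-batch SGD iteration bound: 30 * (beta_b / b) / (lam * eps)."""
    return 30.0 * beta_b / b / (lam * eps)

def sdca_bound(beta_b, b, lam, eps, n):
    """Safe mini-batch SDCA bound: 5*(beta_b/b)/(lam*eps) + (n/b)*(1 + log(2*lam*n/beta_b))."""
    return 5.0 * beta_b / b / (lam * eps) + (n / b) * (1.0 + np.log(2.0 * lam * n / beta_b))

# Hypothetical example: n = 20000 examples, n*sigma^2 = 100, lam = 1e-4, eps = 1e-3.
# for b in (1, 16, 256, 4096):
#     beta_b = 1 + (b - 1) * (100 - 1) / (20000 - 1)
#     print(b, sgd_bound(beta_b, b, 1e-4, 1e-3), sdca_bound(beta_b, b, 1e-4, 1e-3, 20000))
```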

34 Numerical Experiments: Datasets
[Table: the four datasets used in the experiments, with number of training examples, number of test examples, number of features (n), sparsity (%), and λ: cov (522,911 training examples), rcv1 (20,242 training examples, 47,236 features), astro-ph (29,882 training and 32,487 test examples), news20 (15,020 training and 4,976 test examples, 1,355,191 features).]

35 Batch Size vs Iterations
[Plot: covertype, fixed accuracy ε — number of iterations versus batch size for Pegasos (SGD), Naive SDCA, Safe SDCA and Aggressive SDCA, with β_b/b shown on the right axis.]

36 Batch Size vs Iterations
[Plot: the same comparison on astro-ph.]

37 Numerical Experiments
[Plots: astro-ph with b = 8192 — test error versus iterations (left) and primal/dual suboptimality versus iterations (right).]

38 Test Error and Primal/Dual Suboptimality
[Plots: news20 with b = 256 — test error (left) and primal/dual suboptimality (right) versus iterations, for Pegasos (SGD), Naive SDCA, Safe SDCA and Aggressive SDCA.]

39 Summary 1
Shai Shalev-Shwartz, Yoram Singer and Nathan Srebro. Pegasos: Primal Estimated sub-gradient SOlver for SVM, ICML 2007.
+ Analysis of SGD for b = 1
- Weak analysis for b > 1 (no speedup)
P. R. and Martin Takáč. Parallel coordinate descent methods for big data optimization, 2012.
+ General analysis of SDCA for b > 1 (even variable b)
+ ESO: Expected Separable Overapproximation
- Dual suboptimality only
Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization, 2012.
+ Primal suboptimality for SDCA with b = 1
- No analysis for b > 1

40 Summary 2
Martin Takáč, Avleen Bijral, P. R. and Nathan Srebro. Mini-batch primal and dual methods for SVMs, 2013.
First analysis of mini-batched SGD for the SVM primal which works.
New mini-batch SDCA method for the SVM dual: with safe mini-batching, and with aggressive mini-batching.
Both SGD and SDCA:
- have guarantees in terms of primal suboptimality,
- have parallelization speedup controlled by the spectral norm of the data,
- have essentially identical iterations.
