Trade-Offs in Distributed Learning and Optimization


1 Trade-Offs in Distributed Learning and Optimization
Ohad Shamir, Weizmann Institute of Science
Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang
IHES Workshop, March 2016

2 Distributed Learning and Optimization
Why?
- Big data: too large to fit into a single machine (distributed learning as a constraint)
- Computational speed-ups: ideally, k machines = 1/k training time (distributed learning as an opportunity)

3 Challenges
Challenge 1: Communication. Very slow compared to local processing:
  L1 cache reference                          0.5 ns
  Main memory reference                       100 ns
  Round-trip within same datacenter           500,000 ns
  Packet California -> Netherlands & back     150,000,000 ns

4 Challenges
Challenge 2: Parallelization. How to parallelize algorithms of the following form:
  example -> update -> example -> update -> example -> ...

5 Challenges
Challenge 3: Accuracy. Given the same data, output quality should resemble that of a non-distributed algorithm.

6 Setting
Convex distributed empirical risk minimization:
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   where F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
f_{i,j}(·): loss on the j-th example in the i-th machine
Strongly-convex / smooth problems: each F_i is strongly convex / smooth
Machines can broadcast simultaneously in communication rounds, Õ(d) bits per machine/round
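
To make the setting concrete, here is a minimal sketch of the objective structure in Python; the quadratic losses, data shapes, and helper names (f_ij, F_i, F) are illustrative assumptions of mine, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 4, 50, 10                       # machines, examples per machine, dimension
lam = 0.1                                 # strong-convexity parameter of each loss

# Illustrative per-example losses: f_{i,j}(w) = 0.5*(x_{i,j}'w - y_{i,j})^2 + 0.5*lam*||w||^2
X = rng.standard_normal((m, n, d))
Y = rng.standard_normal((m, n))

def f_ij(w, i, j):
    return 0.5 * (X[i, j] @ w - Y[i, j]) ** 2 + 0.5 * lam * w @ w

def F_i(w, i):
    # Local objective of machine i: average of its n losses
    return np.mean([f_ij(w, i, j) for j in range(n)])

def F(w):
    # Global objective: average of the m machine-local objectives
    return np.mean([F_i(w, i) for i in range(m)])

print("F(0) =", F(np.zeros(d)))
```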

7 Main Question
How to optimally trade off:
1. Accuracy: F(w) − inf_{w ∈ W} F(w) ≤ ɛ
2. Communication
3. Runtime
Note: in this talk, accuracy is in terms of optimization error, not risk.

8 3 Data Partition Scenarios
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)

9 3 Data Partition Scenarios
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Arbitrary partition: no relationship between the F_i's

10 3 Data Partition Scenarios
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Arbitrary partition: no relationship between the F_i's
Random partition: the mn individual losses f_{i,j} are assigned to machines at random; strong concentration of measure effects (e.g. the F_i(·)'s have similar values/gradients/Hessians)

11 3 Data Partition Scenarios
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Arbitrary partition: no relationship between the F_i's
Random partition: the mn individual losses f_{i,j} are assigned to machines at random; strong concentration of measure effects (e.g. the F_i(·)'s have similar values/gradients/Hessians)
δ-related setting: generalization of the previous scenarios; informally, values/gradients/Hessians at any w are δ-distant across machines

12 Talk Outline
Algorithms + worst-case upper and lower bounds* for:
1. Arbitrary partition
2. δ-related
3. Random partition
Relies on new results for without-replacement sampling in stochastic gradient methods.
* In terms of # communication rounds

13 Arbitrary Partition
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Baseline: reduce to non-distributed first-order optimization.
Example (Gradient Descent):
- Initialize w_0; for t = 1, 2, ...
  - Machine i computes ∇F_i(w_t)
  - Communication round: ∇F(w_t) = (1/m) Σ_{i=1}^m ∇F_i(w_t)
  - Update w_{t+1} = w_t − η ∇F(w_t)

14 Arbitrary Partition
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Baseline: reduce to non-distributed first-order optimization.
Example (Gradient Descent):
- Initialize w_0; for t = 1, 2, ...
  - Machine i computes ∇F_i(w_t)
  - Communication round: ∇F(w_t) = (1/m) Σ_{i=1}^m ∇F_i(w_t)
  - Update w_{t+1} = w_t − η ∇F(w_t)
Can also accelerate, use proximal smoothing, etc. Then # communication rounds = iteration complexity:
- λ-strongly convex + 1-smooth: O(√(1/λ) log(1/ɛ))
- λ-strongly convex: O(√(1/(λɛ)));  Convex: O(1/ɛ)  (with smoothing)
Almost full parallelization, but many communication rounds.
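
A minimal simulation of this gradient-descent baseline, reusing the toy quadratic losses sketched earlier (the data, step size, and helper names are my own choices, not the talk's): each machine computes its local gradient in parallel, one communication round averages them, and every machine applies the same update.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 4, 50, 10
lam = 0.1
X = rng.standard_normal((m, n, d))
Y = rng.standard_normal((m, n))

def local_grad(w, i):
    # Gradient of F_i(w) = (1/n) sum_j [0.5*(x'w - y)^2 + 0.5*lam*||w||^2]
    return X[i].T @ (X[i] @ w - Y[i]) / n + lam * w

eta = 0.1
w = np.zeros(d)
for t in range(100):
    grads = [local_grad(w, i) for i in range(m)]   # done in parallel on the m machines
    g = np.mean(grads, axis=0)                     # one communication round per iteration
    w = w - eta * g                                # identical step on every machine

full_grad = np.mean([local_grad(w, i) for i in range(m)], axis=0)
print("||grad F|| after 100 rounds:", np.linalg.norm(full_grad))
```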

15 Arbitrary Partition
Can we improve on this simple baseline?

16 Arbitrary Partition
Can we improve on this simple baseline?
For a very large family of algorithms, the baseline is (worst-case) optimal!

17 Arbitrary Partition
Can we improve on this simple baseline?
For a very large family of algorithms, the baseline is (worst-case) optimal!
Assumed Algorithmic Template:
- For each machine j, initialize W_j = {0}
- Between communication rounds: each machine j sequentially computes and adds to W_j an arbitrary number of vectors w such that γw + ν∇F_j(w) (for some γ, ν ≥ 0, γ + ν > 0) lies in
    span{ w', ∇F_j(w'), (∇²F_j(w') + D) w'', (∇²F_j(w') + D)^{-1} w'' : w', w'' ∈ W_j, D diagonal }
- During a communication round: broadcast some vectors in W_j

18 Proof Sketch (λ-strongly convex; 1-smooth; 2 machines)
  F_1(w) = w^T ( ((1−λ)/4) A_1 + (λ/2) I ) w − ((1−λ)/2) e_1^T w
  F_2(w) = w^T ( ((1−λ)/4) A_2 + (λ/2) I ) w

19 Proof Sketch (λ-strongly convex; 1-smooth; 2 machines)
  F_1(w) = w^T ( ((1−λ)/4) A_1 + (λ/2) I ) w − ((1−λ)/2) e_1^T w
  F_2(w) = w^T ( ((1−λ)/4) A_2 + (λ/2) I ) w
  A_1, A_2: matrices splitting the adjacent-coordinate couplings of the classic hard instance between the two machines
⇒ exp(−Ω(T/√(1/λ))) error after T iterations
(1/2)(F_1(·) + F_2(·)) is essentially the hard function for non-distributed optimization [Nemirovski and Yudin 1983]

20 Proof Sketch (λ-strongly convex; 2 machines)
Ω(1/(λT²)) error after T iterations (matched by AGD with proximal smoothing)
  F_1(w) = (1/2) b^T w + (1/(2(T+2))) ( |w_2 − w_3| + |w_4 − w_5| + ... + |w_T − w_{T+1}| ) + (λ/2)||w||²
  F_2(w) = (1/(2(T+2))) ( |w_1 − w_2| + |w_3 − w_4| + ... + |w_{T+1} − w_{T+2}| ) + (λ/2)||w||²

21 δ-related Setting
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Assuming the F_i are smooth, values/gradients/Hessians at any w ∈ W are δ-close.

22 δ-related Setting
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Assuming the F_i are smooth, values/gradients/Hessians at any w ∈ W are δ-close.
Can show similar lower bounds, but weakened by δ factors: e.g. √(δ/λ) log(1/ɛ) for λ-strongly convex and smooth.
Can we build algorithms which utilize δ-relatedness? More data ⇒ less communication??

23 DANE Algorithm
Distributed Approximate NEwton-type method
Parameters: learning rate η > 0; regularizer µ > 0
Initialize w_1 = 0
For t = 1, 2, ...
- Compute ∇F(w_t) = (1/m) Σ_i ∇F_i(w_t)
- For each machine i, solve the local optimization problem:
    w^i_{t+1} = argmin_w  F_i(w) − (∇F_i(w_t) − η∇F(w_t))^T w + (µ/2)||w − w_t||²
- Compute the average w_{t+1} = (1/m) Σ_i w^i_{t+1}
Structure similar to ADMM.
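
A sketch of the DANE iteration on toy quadratic losses (my own construction; the talk gives no code, and the parameter values here are arbitrary). For quadratic F_i, the local argmin has the closed form w_t − η(∇²F_i + µI)^{-1}∇F(w_t), so no inner solver is needed.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, d = 4, 500, 10
lam, eta, mu = 0.1, 1.0, 0.1
X = rng.standard_normal((m, n, d))
Y = rng.standard_normal((m, n))

# Local quadratics: F_i(w) = 0.5*w'A_i w - b_i'w + const, with A_i the local Hessian
A = [X[i].T @ X[i] / n + lam * np.eye(d) for i in range(m)]
b = [X[i].T @ Y[i] / n for i in range(m)]

def grad_F(w):
    return np.mean([A[i] @ w - b[i] for i in range(m)], axis=0)

w = np.zeros(d)
for t in range(15):
    g = grad_F(w)                                   # communication round: average local gradients
    # Each machine solves its local problem
    #   argmin_v F_i(v) - (grad F_i(w) - eta*g)'v + (mu/2)||v - w||^2,
    # whose minimizer for quadratics is w - eta*(A_i + mu*I)^{-1} g.
    local_solutions = [w - eta * np.linalg.solve(A[i] + mu * np.eye(d), g) for i in range(m)]
    w = np.mean(local_solutions, axis=0)            # communication round: average local solutions

print("||grad F|| after 15 DANE iterations:", np.linalg.norm(grad_F(w)))
```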

24 Crucial Property
Equivalent to an approximate Newton step.
Newton step: w_{t+1} = w_t − (∇²F(w_t))^{-1} ∇F(w_t)
Extremely fast convergence, but expensive.

25 Crucial Property
Equivalent to an approximate Newton step.
Newton step: w_{t+1} = w_t − ( (1/m) Σ_i ∇²F_i(w_t) )^{-1} ∇F(w_t)
Extremely fast convergence, but expensive.

26 Crucial Property
Equivalent to an approximate Newton step.
Newton step: w_{t+1} = w_t − ( (1/m) Σ_i ∇²F_i(w_t) )^{-1} ∇F(w_t)
Extremely fast convergence, but expensive.
DANE (quadratic functions) is equivalent to
  w_{t+1} = w_t − η ( (1/m) Σ_i ( ∇²F_i(w_t) + µI )^{-1} ) ∇F(w_t)
even though the Hessians ∇²F_i are never computed explicitly!
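
For quadratics this equivalence is easy to check numerically. The following sketch (same style of toy quadratics as above; entirely my own sanity check, not material from the talk) compares one averaged DANE step against w_t − η((1/m)Σ_i(∇²F_i + µI)^{-1})∇F(w_t).

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, d = 4, 50, 10
lam, eta, mu = 0.1, 1.0, 0.1
X = rng.standard_normal((m, n, d))
Y = rng.standard_normal((m, n))

A = [X[i].T @ X[i] / n + lam * np.eye(d) for i in range(m)]   # local Hessians (constant for quadratics)
b = [X[i].T @ Y[i] / n for i in range(m)]
w_t = rng.standard_normal(d)
g = np.mean([A[i] @ w_t - b[i] for i in range(m)], axis=0)    # grad F(w_t)

# One DANE step: average of the closed-form local minimizers
dane_step = np.mean([w_t - eta * np.linalg.solve(A[i] + mu * np.eye(d), g) for i in range(m)], axis=0)

# Approximate Newton step with the averaged damped inverse Hessian
H_tilde_inv = np.mean([np.linalg.inv(A[i] + mu * np.eye(d)) for i in range(m)], axis=0)
newton_like_step = w_t - eta * H_tilde_inv @ g

print("max difference:", np.max(np.abs(dane_step - newton_like_step)))   # numerically zero
```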

27 Theoretical Guarantees: Quadratic Functions
Theorem: ||w_{t+1} − w_opt|| ≤ ||I − η H̃^{-1} H||^t · ||w_1 − w_opt||,
  where H = (1/m) Σ_i ∇²F_i and H̃^{-1} = (1/m) Σ_i ( ∇²F_i + µI )^{-1}
In a δ-related setting, ||∇²F_i − H|| ≤ δ for all i.
Setting η, µ appropriately, ||I − η H̃^{-1} H|| ≤ max{ 1/2, 1 − (δ/λ)^{-2} }
Conclusion: need O((δ/λ)² log(1/ɛ)) communication rounds.

28 Some Experiments
Regularized least squares in R^500.
Figure: log_10(optimization error) vs. # iterations, comparing DANE and ADMM for several values of m and N (m = 4; N = 6·10³, 10·10³, 14·10³).

29 Extensions
- Can provide some guarantees for non-quadratic losses as well.
- Using a more sophisticated algorithm, can improve the guarantee to O(√(δ/λ) log(1/ɛ)) rounds for quadratic / self-concordant losses [Zhang and Xiao 2015].
- Disadvantage: each iteration requires solving a local optimization problem, which is not necessarily cheap in terms of runtime.

30 Random Partition
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Special case of the δ-related setting, with δ = O(1/√n) in general.

31 Random Partition
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Special case of the δ-related setting, with δ = O(1/√n) in general.
Previous algorithms: O(log(1/ɛ)) communication rounds as long as λ ≥ Ω(1/√n).

32 Random Partition
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Special case of the δ-related setting, with δ = O(1/√n) in general.
Previous algorithms: O(log(1/ɛ)) communication rounds as long as λ ≥ Ω(1/√n).
Next: a much simpler approach for randomly partitioned data. Can get O(log(1/ɛ)) communication rounds as long as λ ≥ Ω(1/n).
Detour: without-replacement sampling for stochastic gradient methods...

33 Stochastic Gradient Methods
  min_{w ∈ W ⊆ R^d} F(w) = (1/N) Σ_{i=1}^N f_i(w)
Stochastic Gradient Descent / Subgradient Method:
- w_1 = 0
- For t = 1, 2, ...
  - Sample i_t
  - Let g_t ∈ ∂f_{i_t}(w_t)
  - w_{t+1} = Π_W(w_t − η_t g_t)

34 Stochastic Gradient Methods
  min_{w ∈ W ⊆ R^d} F(w) = (1/N) Σ_{i=1}^N f_i(w)
Stochastic Gradient Descent / Subgradient Method:
- w_1 = 0
- For t = 1, 2, ...
  - Sample i_t
  - Let g_t ∈ ∂f_{i_t}(w_t)
  - w_{t+1} = Π_W(w_t − η_t g_t)
Standard analysis: each i_t is sampled uniformly from {1, ..., N}. Works because E[g_t] ∈ ∂F(w_t).
But: one often samples i_t without replacement; equivalently, go over a random shuffle of the data.
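
A minimal sketch contrasting the two sampling schemes for the projected subgradient method above (the toy least-squares losses, domain radius, and step sizes are my own choices): with replacement draws a fresh uniform index every step; without replacement walks through one random shuffle.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 200, 10
X = rng.standard_normal((N, d))
Y = rng.standard_normal(N)
R = 5.0                                    # domain W = {w : ||w|| <= R}

def grad_fi(w, i):                         # f_i(w) = 0.5*(x_i'w - y_i)^2
    return (X[i] @ w - Y[i]) * X[i]

def project(w):
    nrm = np.linalg.norm(w)
    return w if nrm <= R else w * (R / nrm)

def sgd(index_sequence):
    w = np.zeros(d)
    for t, i in enumerate(index_sequence, start=1):
        w = project(w - (1.0 / np.sqrt(t)) * grad_fi(w, i))
    return w

w_with = sgd(rng.integers(0, N, size=N))   # i_t uniform, with replacement
w_without = sgd(rng.permutation(N))        # one pass over a random shuffle of the data
print("F(w_with)    =", 0.5 * np.mean((X @ w_with - Y) ** 2))
print("F(w_without) =", 0.5 * np.mean((X @ w_without - Y) ** 2))
```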

35 Why Without-Replacement Sampling?
- Often works better (all data equally processed)
- Requires sequential rather than random data access
- However, much harder to analyze

36 Why Without-Replacement Sampling?
- Often works better (all data equally processed)
- Requires sequential rather than random data access
- However, much harder to analyze
Incremental gradient bounds (e.g. Nedić and Bertsekas 2001): hold for any order, but can be exponentially slower.

37 Why Without-Replacement Sampling?
- Often works better (all data equally processed)
- Requires sequential rather than random data access
- However, much harder to analyze
Incremental gradient bounds (e.g. Nedić and Bertsekas 2001): hold for any order, but can be exponentially slower.
(Gürbüzbalaban, Ozdaglar, Parrilo 2015): for strongly convex and smooth problems, after k passes over N data points:
  Without-replacement: O((N/k)²) error
  With-replacement: O(1/(Nk)) error

38 New Results
Without-replacement sampling is not worse than with-replacement sampling (in a worst-case sense) under various scenarios, e.g.:
- O(1/√T) for convex linear prediction
- O(1/(λT)) for λ-strongly-convex and smooth linear prediction
- exp(−Ω(T/(N + 1/λ))) for least squares on N data points, using a no-replacement SVRG algorithm
Analysis uses ideas from adversarial online learning and transductive learning theory.

39 Derivation: Convex Linear Prediction
  min_{w ∈ W ⊆ R^d} F(w) = (1/N) Σ_{i=1}^N f_i(w)
Algorithm: sequentially processes f_{σ(1)}, f_{σ(2)}, ..., f_{σ(T)}, where σ is a random permutation on {1, ..., N}; produces iterates w_1, ..., w_T.
Goal: prove
  E[ (1/T) Σ_{t=1}^T F(w_t) ] − F(w*) ≤ O(1/√T),   where w* = argmin_{w ∈ W} F(w)

40 Derivation: Convex Linear Prediction
Assumption: the algorithm has bounded regret in the adversarial online setting: for any sequence of convex Lipschitz f_1, f_2, ... and any w ∈ W,
  (1/T) Σ_{t=1}^T f_t(w_t) − (1/T) Σ_{t=1}^T f_t(w) ≤ O(1/√T)
Satisfied by e.g. stochastic gradient descent.

41 Derivation: Convex Linear Prediction (1/2)
  E_σ[ (1/T) Σ_{t=1}^T F(w_t) ] − F(w*)

42 Derivation: Convex Linear Prediction (1/2)
  E_σ[ (1/T) Σ_{t=1}^T F(w_t) ] − F(w*)
  = E_σ[ (1/T) Σ_{t=1}^T ( F(w_t) − f_{σ(t)}(w*) ) ]

43 Derivation: Convex Linear Prediction (1/2)
  E_σ[ (1/T) Σ_{t=1}^T F(w_t) ] − F(w*)
  = E_σ[ (1/T) Σ_{t=1}^T ( F(w_t) − f_{σ(t)}(w*) ) ]
  = E_σ[ (1/T) Σ_{t=1}^T ( f_{σ(t)}(w_t) − f_{σ(t)}(w*) ) ] + E_σ[ (1/T) Σ_{t=1}^T ( F(w_t) − f_{σ(t)}(w_t) ) ]

44 Derivation: Convex Linear Prediction (1/2)
  E_σ[ (1/T) Σ_{t=1}^T F(w_t) ] − F(w*)
  = E_σ[ (1/T) Σ_{t=1}^T ( F(w_t) − f_{σ(t)}(w*) ) ]
  = E_σ[ (1/T) Σ_{t=1}^T ( f_{σ(t)}(w_t) − f_{σ(t)}(w*) ) ] + E_σ[ (1/T) Σ_{t=1}^T ( F(w_t) − f_{σ(t)}(w_t) ) ]
  ≤ O(1/√T) + E_σ[ (1/T) Σ_{t=1}^T ( (1/N) Σ_{i=1}^N f_{σ(i)}(w_t) − (1/(N−t+1)) Σ_{i=t}^N f_{σ(i)}(w_t) ) ]

45 Derivation: Convex Linear Prediction (2/2)
Following algebraic manipulations,
  = O(1/√T) + (1/T) Σ_{t=1}^T ((t−1)/N) · E[ F_{1:t−1}(w_t) − F_{t:N}(w_t) ],
where F_{a:b} is the average over f_{σ(a)}, ..., f_{σ(b)}.

46 Derivation: Convex Linear Prediction (2/2)
Following algebraic manipulations,
  = O(1/√T) + (1/T) Σ_{t=1}^T ((t−1)/N) · E[ F_{1:t−1}(w_t) − F_{t:N}(w_t) ],
where F_{a:b} is the average over f_{σ(a)}, ..., f_{σ(b)}.
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · E[ sup_{w ∈ W} ( F_{1:t−1}(w) − F_{t:N}(w) ) ]

47 Derivation: Convex Linear Prediction (2/2)
Following algebraic manipulations,
  = O(1/√T) + (1/T) Σ_{t=1}^T ((t−1)/N) · E[ F_{1:t−1}(w_t) − F_{t:N}(w_t) ],
where F_{a:b} is the average over f_{σ(a)}, ..., f_{σ(b)}.
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · E[ sup_{w ∈ W} ( F_{1:t−1}(w) − F_{t:N}(w) ) ]
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · R_{t−1:N−t+1}(W) + O(1/√N),
where R_{t−1:N−t+1}(W) is the transductive Rademacher complexity of W w.r.t. f_1, f_2, ..., f_N.

48 Derivation: Convex Linear Prediction (2/2)
Following algebraic manipulations,
  = O(1/√T) + (1/T) Σ_{t=1}^T ((t−1)/N) · E[ F_{1:t−1}(w_t) − F_{t:N}(w_t) ],
where F_{a:b} is the average over f_{σ(a)}, ..., f_{σ(b)}.
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · E[ sup_{w ∈ W} ( F_{1:t−1}(w) − F_{t:N}(w) ) ]
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · R_{t−1:N−t+1}(W) + O(1/√N),
where R_{t−1:N−t+1}(W) is the transductive Rademacher complexity of W w.r.t. f_1, f_2, ..., f_N.
Specializing to linear predictors and convex Lipschitz losses, we get the familiar O(1/√T) rate.

49 Derivation: Convex Linear Prediction (2/2)
Following algebraic manipulations,
  = O(1/√T) + (1/T) Σ_{t=1}^T ((t−1)/N) · E[ F_{1:t−1}(w_t) − F_{t:N}(w_t) ],
where F_{a:b} is the average over f_{σ(a)}, ..., f_{σ(b)}.
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · E[ sup_{w ∈ W} ( F_{1:t−1}(w) − F_{t:N}(w) ) ]
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · R_{t−1:N−t+1}(W) + O(1/√N),
where R_{t−1:N−t+1}(W) is the transductive Rademacher complexity of W w.r.t. f_1, f_2, ..., f_N.
Specializing to linear predictors and convex Lipschitz losses, we get the familiar O(1/√T) rate.
With more effort, get O(1/(λT)) with λ-strong convexity.

50 SVRG
  min_{w ∈ W ⊆ R^d} F(w) = (1/N) Σ_{i=1}^N f_i(w)
In recent years, fast stochastic algorithms to solve finite-sum problems:
SAG [Le Roux et al. 2012], SDCA (as analyzed in [Shalev-Shwartz et al. 2013, 2016]), SVRG [Johnson and Zhang 2013], SAGA [Defazio et al. 2014], Finito [Defazio et al. 2014], S2GD [Konečný and Richtárik 2013], ...
Cheap stochastic iterations + linear convergence rate (assuming strong convexity and smoothness).
All existing analyses are based on with-replacement sampling.
We consider without-replacement SVRG.

51 With-Replacement SVRG
Initialize w̃_1 = 0
for s = 1, 2, ..., S do
  Compute ∇F(w̃_s) = (1/N) Σ_{i=1}^N ∇f_i(w̃_s)
  w_1 := w̃_s
  for t = 1, 2, ..., T do
    Draw i_t uniformly at random from {1, ..., N}
    w_{t+1} := w_t − η ( ∇f_{i_t}(w_t) − ∇f_{i_t}(w̃_s) + ∇F(w̃_s) )
  end for
  Pick w̃_{s+1} at random from w_1, ..., w_T
end for
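
A compact Python sketch of this pseudocode on a toy regularized least-squares problem (the data, step size η, and epoch lengths are my own illustrative choices, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 500, 10
lam = 0.01
X = rng.standard_normal((N, d))
Y = rng.standard_normal(N)

def grad_fi(w, i):                          # f_i(w) = 0.5*(x_i'w - y_i)^2 + 0.5*lam*||w||^2
    return (X[i] @ w - Y[i]) * X[i] + lam * w

def full_grad(w):
    return X.T @ (X @ w - Y) / N + lam * w

eta, S, T = 0.01, 10, 2 * N
w_tilde = np.zeros(d)
for s in range(S):
    mu = full_grad(w_tilde)                 # full gradient at the reference point
    w = w_tilde.copy()
    iterates = []
    for t in range(T):
        i = rng.integers(N)                 # with-replacement draw
        w = w - eta * (grad_fi(w, i) - grad_fi(w_tilde, i) + mu)
        iterates.append(w.copy())
    w_tilde = iterates[rng.integers(T)]     # pick the next reference point at random
    print("epoch", s, "||grad F|| =", np.linalg.norm(full_grad(w_tilde)))
```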

52 With-Replacement SVRG
Initialize w̃_1 = 0
for s = 1, 2, ..., S do
  Compute ∇F(w̃_s) = (1/N) Σ_{i=1}^N ∇f_i(w̃_s)
  w_1 := w̃_s
  for t = 1, 2, ..., T do
    Draw i_t uniformly at random from {1, ..., N}
    w_{t+1} := w_t − η ( ∇f_{i_t}(w_t) − ∇f_{i_t}(w̃_s) + ∇F(w̃_s) )
  end for
  Pick w̃_{s+1} at random from w_1, ..., w_T
end for
Analysis for 1-smooth, λ-strongly convex functions: S = O(log(1/ɛ)), T ≈ 1/λ.

53 With-Replacement SVRG
Initialize w̃_1 = 0
for s = 1, 2, ..., S do
  Compute ∇F(w̃_s) = (1/N) Σ_{i=1}^N ∇f_i(w̃_s)
  w_1 := w̃_s
  for t = 1, 2, ..., T do
    Draw i_t uniformly at random from {1, ..., N}
    w_{t+1} := w_t − η ( ∇f_{i_t}(w_t) − ∇f_{i_t}(w̃_s) + ∇F(w̃_s) )
  end for
  Pick w̃_{s+1} at random from w_1, ..., w_T
end for
Analysis for 1-smooth, λ-strongly convex functions: S = O(log(1/ɛ)), T ≈ 1/λ.
Observation [Lee, Lin, Ma 2015]: can simulate it in a distributed setting with randomly partitioned data, as long as T ≤ O(√N).

54 With-Replacement SVRG
Initialize w̃_1 = 0
for s = 1, 2, ..., S do
  Compute ∇F(w̃_s) = (1/N) Σ_{i=1}^N ∇f_i(w̃_s)
  w_1 := w̃_s
  for t = 1, 2, ..., T do
    Draw i_t uniformly at random from {1, ..., N}
    w_{t+1} := w_t − η ( ∇f_{i_t}(w_t) − ∇f_{i_t}(w̃_s) + ∇F(w̃_s) )
  end for
  Pick w̃_{s+1} at random from w_1, ..., w_T
end for
Analysis for 1-smooth, λ-strongly convex functions: S = O(log(1/ɛ)), T ≈ 1/λ.
Observation [Lee, Lin, Ma 2015]: can simulate it in a distributed setting with randomly partitioned data, as long as T ≤ O(√N).
By the analysis, this is only applicable when λ ≥ Ω(1/√N).

55 Without-Replacement SVRG
Given: a permutation σ on {1, ..., N}
Initialize w̃_1 = 0
for s = 1, 2, ..., S do
  Compute ∇F(w̃_s) = (1/N) Σ_{i=1}^N ∇f_i(w̃_s)
  w_1 := w̃_s
  for t = 1, 2, ..., T do
    w_{t+1} := w_t − η ( ∇f_{σ(t)}(w_t) − ∇f_{σ(t)}(w̃_s) + ∇F(w̃_s) )
  end for
  Pick w̃_{s+1} at random from w_1, ..., w_T
end for
Observation: can simulate it in a distributed setting with randomly partitioned data, all the way up to T = Ω(N).
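
Relative to the with-replacement sketch above, only the index selection changes: the stochastic steps read off a single fixed shuffle instead of fresh uniform draws. A minimal variant (toy data and parameters are again my own; here successive epochs advance through the permutation, which is the behaviour the distributed simulation relies on):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 500, 10
lam = 0.01
X = rng.standard_normal((N, d))
Y = rng.standard_normal(N)

def grad_fi(w, i):
    return (X[i] @ w - Y[i]) * X[i] + lam * w

def full_grad(w):
    return X.T @ (X @ w - Y) / N + lam * w

sigma = rng.permutation(N)                  # one random shuffle, fixed up front
eta, S, T = 0.01, 5, 100                    # T on the order of 1/lam
w_tilde = np.zeros(d)
pos = 0                                     # current position in the shuffle
for s in range(S):
    mu = full_grad(w_tilde)
    w = w_tilde.copy()
    iterates = []
    for t in range(T):
        i = sigma[(pos + t) % N]            # next index of the shuffle; nothing is re-sampled
        w = w - eta * (grad_fi(w, i) - grad_fi(w_tilde, i) + mu)
        iterates.append(w.copy())
    pos = (pos + T) % N
    w_tilde = iterates[rng.integers(T)]
    print("epoch", s, "||grad F|| =", np.linalg.norm(full_grad(w_tilde)))
```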

56 Bound for Regularized Least Squares
  f_i(w) = (1/2)(⟨w, x_i⟩ − y_i)² + (λ̂/2)||w||²;   F(·) is λ-strongly convex
Theorem: if we perform S = O(log(1/ɛ)) epochs of T = Θ(1/λ) without-replacement stochastic iterations each, then
  E[ F(w̃_{S+1}) − F(w*) ] ≤ ɛ

57 Bound for Regularized Least Squares
  f_i(w) = (1/2)(⟨w, x_i⟩ − y_i)² + (λ̂/2)||w||²;   F(·) is λ-strongly convex
Theorem: if we perform S = O(log(1/ɛ)) epochs of T = Θ(1/λ) without-replacement stochastic iterations each, then
  E[ F(w̃_{S+1}) − F(w*) ] ≤ ɛ
Implications:

58 Bound for Regularized Least Squares
  f_i(w) = (1/2)(⟨w, x_i⟩ − y_i)² + (λ̂/2)||w||²;   F(·) is λ-strongly convex
Theorem: if we perform S = O(log(1/ɛ)) epochs of T = Θ(1/λ) without-replacement stochastic iterations each, then
  E[ F(w̃_{S+1}) − F(w*) ] ≤ ɛ
Implications:
- Non-distributed optimization: if λ ≥ Ω(1/N), we get an ɛ-optimal solution without data reshuffling.

59 Bound for Regularized Least Squares
  f_i(w) = (1/2)(⟨w, x_i⟩ − y_i)² + (λ̂/2)||w||²;   F(·) is λ-strongly convex
Theorem: if we perform S = O(log(1/ɛ)) epochs of T = Θ(1/λ) without-replacement stochastic iterations each, then
  E[ F(w̃_{S+1}) − F(w*) ] ≤ ɛ
Implications:
- Non-distributed optimization: if λ ≥ Ω(1/N), we get an ɛ-optimal solution without data reshuffling.
- Distributed optimization: if λ ≥ Ω(1/n), we get an ɛ-optimal solution with randomly-partitioned data, O(log(1/ɛ)) communication rounds, and runtime dominated by fully-parallelizable gradient computation.
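
A sketch of why this works in the distributed setting (toy data; the partition-by-reshaping trick, parameter values, and the choice to keep the last iterate are my own simplifications): a random partition is the same as one global shuffle, so when λ ≥ Ω(1/n) the T = Θ(1/λ) ≤ n stochastic steps of an epoch all touch data held by a single machine, and the only communication per epoch is averaging the local gradients to form the full gradient.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, d = 4, 250, 10                        # m machines with n examples each, N = m*n total
N = m * n
lam = 0.02                                  # lam >= Omega(1/n), so T = Theta(1/lam) fits on one machine
X = rng.standard_normal((N, d))
Y = rng.standard_normal(N)

perm = rng.permutation(N)                   # random partition = one global shuffle
chunks = perm.reshape(m, n)                 # machine i holds the examples indexed by chunks[i]

def grad_fi(w, i):
    return (X[i] @ w - Y[i]) * X[i] + lam * w

def local_grad(w, i):                       # machine i's average gradient over its own chunk
    idx = chunks[i]
    return X[idx].T @ (X[idx] @ w - Y[idx]) / n + lam * w

eta, S = 0.01, 8
T = int(1.0 / lam)                          # 50 stochastic steps per epoch; divides n here
w_tilde = np.zeros(d)
for s in range(S):
    # Communication round: average the machine-local gradients to get the full gradient.
    mu = np.mean([local_grad(w_tilde, i) for i in range(m)], axis=0)
    # The next T indices of the shuffle all sit on one machine (T divides n), so the
    # whole epoch runs locally there with no further communication.
    block = perm[s * T:(s + 1) * T]
    w = w_tilde.copy()
    for i in block:
        w = w - eta * (grad_fi(w, int(i)) - grad_fi(w_tilde, int(i)) + mu)
    w_tilde = w                             # keep the last iterate for simplicity

final_grad = np.mean([local_grad(w_tilde, i) for i in range(m)], axis=0)
print("||grad F|| after", S, "epochs:", np.linalg.norm(final_grad))
```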

60 Summary and Open Questions
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Arbitrary partition: the baseline is also (worst-case) optimal...
δ-related:
  For quadratics ✓; more generally ✗; self-concordant losses ✓
  Algorithms are communication-efficient, but not necessarily runtime-efficient
Random partition:
  Least squares ✓; more generally ✗ (λ-strongly-convex and smooth, but only when λ ≥ Ω(1/√(data size)))
