Trade-Offs in Distributed Learning and Optimization


1 Trade-Offs in Distributed Learning and Optimization
Ohad Shamir, Weizmann Institute of Science
Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang
IHES Workshop, March 2016

2 Distributed Learning and Optimization
Why?
- Big data: too large to fit into a single machine (distributed learning as a constraint)
- Computational speed-ups: ideally, k machines = 1/k training time (distributed learning as an opportunity)

3 Challenges
Challenge 1: Communication. Very slow compared to local processing:
  L1 cache reference                          0.5 ns
  Main memory reference                       100 ns
  Round-trip within same datacenter           500,000 ns
  Packet California -> Netherlands & back     150,000,000 ns

4 Challenges
Challenge 2: Parallelization. How to parallelize algorithms of the following form:
  example -> update -> example -> update -> example -> ...

5 Challenges
Challenge 3: Accuracy. Given the same data, output quality should resemble that of a non-distributed algorithm.

6 Setting
Convex distributed empirical risk minimization:
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   where F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
f_{i,j}(·): loss on the j-th example in the i-th machine
Strongly-convex / smooth problems: each F_i is strongly convex / smooth
Machines can broadcast simultaneously in communication rounds, Õ(d) bits per machine/round
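
To make the setting concrete, here is a minimal sketch of the objective structure in Python; the quadratic losses, data shapes, and helper names (f_ij, F_i, F) are illustrative assumptions of mine, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 4, 50, 10                       # machines, examples per machine, dimension
lam = 0.1                                 # strong-convexity parameter of each loss

# Illustrative per-example losses: f_{i,j}(w) = 0.5*(x_{i,j}'w - y_{i,j})^2 + 0.5*lam*||w||^2
X = rng.standard_normal((m, n, d))
Y = rng.standard_normal((m, n))

def f_ij(w, i, j):
    return 0.5 * (X[i, j] @ w - Y[i, j]) ** 2 + 0.5 * lam * w @ w

def F_i(w, i):
    # Local objective of machine i: average of its n losses
    return np.mean([f_ij(w, i, j) for j in range(n)])

def F(w):
    # Global objective: average of the m machine-local objectives
    return np.mean([F_i(w, i) for i in range(m)])

print("F(0) =", F(np.zeros(d)))
```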

7 Main Question
How to optimally trade off:
1. Accuracy: F(w) − inf_{w ∈ W} F(w) ≤ ɛ
2. Communication
3. Runtime
Note: in this talk, accuracy is in terms of optimization error, not risk.

8 3 Data Partition Scenarios
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)

9 3 Data Partition Scenarios
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Arbitrary partition: no relationship between the F_i's

10 3 Data Partition Scenarios
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Arbitrary partition: no relationship between the F_i's
Random partition: the mn individual losses f_{i,j} are assigned to machines at random; strong concentration of measure effects (e.g. the F_i(·)'s have similar values/gradients/Hessians)

11 3 Data Partition Scenarios
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Arbitrary partition: no relationship between the F_i's
Random partition: the mn individual losses f_{i,j} are assigned to machines at random; strong concentration of measure effects (e.g. the F_i(·)'s have similar values/gradients/Hessians)
δ-related setting: generalization of the previous scenarios; informally, values/gradients/Hessians at any w are δ-distant across machines

12 Talk Outline
Algorithms + worst-case upper and lower bounds* for:
1. Arbitrary partition
2. δ-related
3. Random partition
Relies on new results for without-replacement sampling in stochastic gradient methods.
* In terms of # communication rounds

13 Arbitrary Partition
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Baseline: reduce to non-distributed first-order optimization.
Example (Gradient Descent):
- Initialize w_0; for t = 1, 2, ...
  - Machine i computes ∇F_i(w_t)
  - Communication round: ∇F(w_t) = (1/m) Σ_{i=1}^m ∇F_i(w_t)
  - Update w_{t+1} = w_t − η ∇F(w_t)

14 Arbitrary Partition
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Baseline: reduce to non-distributed first-order optimization.
Example (Gradient Descent):
- Initialize w_0; for t = 1, 2, ...
  - Machine i computes ∇F_i(w_t)
  - Communication round: ∇F(w_t) = (1/m) Σ_{i=1}^m ∇F_i(w_t)
  - Update w_{t+1} = w_t − η ∇F(w_t)
Can also accelerate, use proximal smoothing, etc. Then # communication rounds = iteration complexity:
- λ-strongly convex + 1-smooth: O(√(1/λ) log(1/ɛ))
- λ-strongly convex: O(√(1/(λɛ)));  Convex: O(1/ɛ)  (with smoothing)
Almost full parallelization, but many communication rounds.
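
A minimal simulation of this gradient-descent baseline, reusing the toy quadratic losses sketched earlier (the data, step size, and helper names are my own choices, not the talk's): each machine computes its local gradient in parallel, one communication round averages them, and every machine applies the same update.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 4, 50, 10
lam = 0.1
X = rng.standard_normal((m, n, d))
Y = rng.standard_normal((m, n))

def local_grad(w, i):
    # Gradient of F_i(w) = (1/n) sum_j [0.5*(x'w - y)^2 + 0.5*lam*||w||^2]
    return X[i].T @ (X[i] @ w - Y[i]) / n + lam * w

eta = 0.1
w = np.zeros(d)
for t in range(100):
    grads = [local_grad(w, i) for i in range(m)]   # done in parallel on the m machines
    g = np.mean(grads, axis=0)                     # one communication round per iteration
    w = w - eta * g                                # identical step on every machine

full_grad = np.mean([local_grad(w, i) for i in range(m)], axis=0)
print("||grad F|| after 100 rounds:", np.linalg.norm(full_grad))
```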

15 Arbitrary Partition
Can we improve on this simple baseline?

16 Arbitrary Partition
Can we improve on this simple baseline?
For a very large family of algorithms, the baseline is (worst-case) optimal!

17 Arbitrary Partition
Can we improve on this simple baseline?
For a very large family of algorithms, the baseline is (worst-case) optimal!
Assumed Algorithmic Template:
- For each machine j, initialize W_j = {0}
- Between communication rounds: each machine j sequentially computes and adds to W_j an arbitrary number of vectors w such that γw + ν∇F_j(w) (for some γ, ν ≥ 0, γ + ν > 0) lies in
    span{ w', ∇F_j(w'), (∇²F_j(w') + D) w'', (∇²F_j(w') + D)^{-1} w'' : w', w'' ∈ W_j, D diagonal }
- During a communication round: broadcast some vectors in W_j

18 Proof Sketch (λ-strongly convex; 1-smooth; 2 machines)
  F_1(w) = w^T ( ((1−λ)/4) A_1 + (λ/2) I ) w − ((1−λ)/2) e_1^T w
  F_2(w) = w^T ( ((1−λ)/4) A_2 + (λ/2) I ) w

19 Proof Sketch (λ-strongly convex; 1-smooth; 2 machines)
  F_1(w) = w^T ( ((1−λ)/4) A_1 + (λ/2) I ) w − ((1−λ)/2) e_1^T w
  F_2(w) = w^T ( ((1−λ)/4) A_2 + (λ/2) I ) w
  A_1, A_2: matrices splitting the adjacent-coordinate couplings of the classic hard instance between the two machines
⇒ exp(−Ω(T/√(1/λ))) error after T iterations
(1/2)(F_1(·) + F_2(·)) is essentially the hard function for non-distributed optimization [Nemirovski and Yudin 1983]

20 Proof Sketch (λ-strongly convex; 2 machines)
Ω(1/(λT²)) error after T iterations (matched by AGD with proximal smoothing)
  F_1(w) = (1/2) b^T w + (1/(2(T+2))) ( |w_2 − w_3| + |w_4 − w_5| + ... + |w_T − w_{T+1}| ) + (λ/2)||w||²
  F_2(w) = (1/(2(T+2))) ( |w_1 − w_2| + |w_3 − w_4| + ... + |w_{T+1} − w_{T+2}| ) + (λ/2)||w||²

21 δ-related Setting
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Assuming the F_i are smooth, values/gradients/Hessians at any w ∈ W are δ-close.

22 δ-related Setting
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Assuming the F_i are smooth, values/gradients/Hessians at any w ∈ W are δ-close.
Can show similar lower bounds, but weakened by δ factors: e.g. √(δ/λ) log(1/ɛ) for λ-strongly convex and smooth.
Can we build algorithms which utilize δ-relatedness? More data ⇒ less communication??

23 DANE Algorithm
Distributed Approximate NEwton-type method
Parameters: learning rate η > 0; regularizer µ > 0
Initialize w_1 = 0
For t = 1, 2, ...
- Compute ∇F(w_t) = (1/m) Σ_i ∇F_i(w_t)
- For each machine i, solve the local optimization problem:
    w^i_{t+1} = argmin_w  F_i(w) − (∇F_i(w_t) − η∇F(w_t))^T w + (µ/2)||w − w_t||²
- Compute the average w_{t+1} = (1/m) Σ_i w^i_{t+1}
Structure similar to ADMM.
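
A sketch of the DANE iteration on toy quadratic losses (my own construction; the talk gives no code, and the parameter values here are arbitrary). For quadratic F_i, the local argmin has the closed form w_t − η(∇²F_i + µI)^{-1}∇F(w_t), so no inner solver is needed.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, d = 4, 500, 10
lam, eta, mu = 0.1, 1.0, 0.1
X = rng.standard_normal((m, n, d))
Y = rng.standard_normal((m, n))

# Local quadratics: F_i(w) = 0.5*w'A_i w - b_i'w + const, with A_i the local Hessian
A = [X[i].T @ X[i] / n + lam * np.eye(d) for i in range(m)]
b = [X[i].T @ Y[i] / n for i in range(m)]

def grad_F(w):
    return np.mean([A[i] @ w - b[i] for i in range(m)], axis=0)

w = np.zeros(d)
for t in range(15):
    g = grad_F(w)                                   # communication round: average local gradients
    # Each machine solves its local problem
    #   argmin_v F_i(v) - (grad F_i(w) - eta*g)'v + (mu/2)||v - w||^2,
    # whose minimizer for quadratics is w - eta*(A_i + mu*I)^{-1} g.
    local_solutions = [w - eta * np.linalg.solve(A[i] + mu * np.eye(d), g) for i in range(m)]
    w = np.mean(local_solutions, axis=0)            # communication round: average local solutions

print("||grad F|| after 15 DANE iterations:", np.linalg.norm(grad_F(w)))
```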

24 Crucial Property
Equivalent to an approximate Newton step.
Newton step: w_{t+1} = w_t − (∇²F(w_t))^{-1} ∇F(w_t)
Extremely fast convergence, but expensive.

25 Crucial Property
Equivalent to an approximate Newton step.
Newton step: w_{t+1} = w_t − ( (1/m) Σ_i ∇²F_i(w_t) )^{-1} ∇F(w_t)
Extremely fast convergence, but expensive.

26 Crucial Property
Equivalent to an approximate Newton step.
Newton step: w_{t+1} = w_t − ( (1/m) Σ_i ∇²F_i(w_t) )^{-1} ∇F(w_t)
Extremely fast convergence, but expensive.
DANE (quadratic functions) is equivalent to
  w_{t+1} = w_t − η ( (1/m) Σ_i ( ∇²F_i(w_t) + µI )^{-1} ) ∇F(w_t)
even though the Hessians ∇²F_i are never computed explicitly!
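
For quadratics this equivalence is easy to check numerically. The following sketch (same style of toy quadratics as above; entirely my own sanity check, not material from the talk) compares one averaged DANE step against w_t − η((1/m)Σ_i(∇²F_i + µI)^{-1})∇F(w_t).

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, d = 4, 50, 10
lam, eta, mu = 0.1, 1.0, 0.1
X = rng.standard_normal((m, n, d))
Y = rng.standard_normal((m, n))

A = [X[i].T @ X[i] / n + lam * np.eye(d) for i in range(m)]   # local Hessians (constant for quadratics)
b = [X[i].T @ Y[i] / n for i in range(m)]
w_t = rng.standard_normal(d)
g = np.mean([A[i] @ w_t - b[i] for i in range(m)], axis=0)    # grad F(w_t)

# One DANE step: average of the closed-form local minimizers
dane_step = np.mean([w_t - eta * np.linalg.solve(A[i] + mu * np.eye(d), g) for i in range(m)], axis=0)

# Approximate Newton step with the averaged damped inverse Hessian
H_tilde_inv = np.mean([np.linalg.inv(A[i] + mu * np.eye(d)) for i in range(m)], axis=0)
newton_like_step = w_t - eta * H_tilde_inv @ g

print("max difference:", np.max(np.abs(dane_step - newton_like_step)))   # numerically zero
```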

27 Theoretical Guarantees: Quadratic Functions
Theorem: ||w_{t+1} − w_opt|| ≤ ||I − η H̃^{-1} H||^t · ||w_1 − w_opt||,
  where H = (1/m) Σ_i ∇²F_i and H̃^{-1} = (1/m) Σ_i ( ∇²F_i + µI )^{-1}
In a δ-related setting, ||∇²F_i − H|| ≤ δ for all i.
Setting η, µ appropriately, ||I − η H̃^{-1} H|| ≤ max{ 1/2, 1 − (δ/λ)^{-2} }
Conclusion: need O((δ/λ)² log(1/ɛ)) communication rounds.

28 Some Experiments
Regularized least squares in R^500.
Figure: log_10(optimization error) vs. # iterations, comparing DANE and ADMM for several values of m and N (m = 4; N = 6·10³, 10·10³, 14·10³).

29 Extensions
- Can provide some guarantees for non-quadratic losses as well.
- Using a more sophisticated algorithm, can improve the guarantee to O(√(δ/λ) log(1/ɛ)) rounds for quadratic / self-concordant losses [Zhang and Xiao 2015].
- Disadvantage: each iteration requires solving a local optimization problem, which is not necessarily cheap in terms of runtime.

30 Random Partition
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Special case of the δ-related setting, with δ = O(1/√n) in general.

31 Random Partition
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Special case of the δ-related setting, with δ = O(1/√n) in general.
Previous algorithms: O(log(1/ɛ)) communication rounds as long as λ ≥ Ω(1/√n).

32 Random Partition
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Special case of the δ-related setting, with δ = O(1/√n) in general.
Previous algorithms: O(log(1/ɛ)) communication rounds as long as λ ≥ Ω(1/√n).
Next: a much simpler approach for randomly partitioned data. Can get O(log(1/ɛ)) communication rounds as long as λ ≥ Ω(1/n).
Detour: without-replacement sampling for stochastic gradient methods...

33 Stochastic Gradient Methods
  min_{w ∈ W ⊆ R^d} F(w) = (1/N) Σ_{i=1}^N f_i(w)
Stochastic Gradient Descent / Subgradient Method:
- w_1 = 0
- For t = 1, 2, ...
  - Sample i_t
  - Let g_t ∈ ∂f_{i_t}(w_t)
  - w_{t+1} = Π_W(w_t − η_t g_t)

34 Stochastic Gradient Methods
  min_{w ∈ W ⊆ R^d} F(w) = (1/N) Σ_{i=1}^N f_i(w)
Stochastic Gradient Descent / Subgradient Method:
- w_1 = 0
- For t = 1, 2, ...
  - Sample i_t
  - Let g_t ∈ ∂f_{i_t}(w_t)
  - w_{t+1} = Π_W(w_t − η_t g_t)
Standard analysis: each i_t is sampled uniformly from {1, ..., N}. Works because E[g_t] ∈ ∂F(w_t).
But: one often samples i_t without replacement; equivalently, go over a random shuffle of the data.
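
A minimal sketch contrasting the two sampling schemes for the projected subgradient method above (the toy least-squares losses, domain radius, and step sizes are my own choices): with replacement draws a fresh uniform index every step; without replacement walks through one random shuffle.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 200, 10
X = rng.standard_normal((N, d))
Y = rng.standard_normal(N)
R = 5.0                                    # domain W = {w : ||w|| <= R}

def grad_fi(w, i):                         # f_i(w) = 0.5*(x_i'w - y_i)^2
    return (X[i] @ w - Y[i]) * X[i]

def project(w):
    nrm = np.linalg.norm(w)
    return w if nrm <= R else w * (R / nrm)

def sgd(index_sequence):
    w = np.zeros(d)
    for t, i in enumerate(index_sequence, start=1):
        w = project(w - (1.0 / np.sqrt(t)) * grad_fi(w, i))
    return w

w_with = sgd(rng.integers(0, N, size=N))   # i_t uniform, with replacement
w_without = sgd(rng.permutation(N))        # one pass over a random shuffle of the data
print("F(w_with)    =", 0.5 * np.mean((X @ w_with - Y) ** 2))
print("F(w_without) =", 0.5 * np.mean((X @ w_without - Y) ** 2))
```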

35 Why Without-Replacement Sampling?
- Often works better (all data equally processed)
- Requires sequential rather than random data access
- However, much harder to analyze

36 Why Without-Replacement Sampling?
- Often works better (all data equally processed)
- Requires sequential rather than random data access
- However, much harder to analyze
Incremental gradient bounds (e.g. Nedić and Bertsekas 2001): hold for any order, but can be exponentially slower.

37 Why Without-Replacement Sampling?
- Often works better (all data equally processed)
- Requires sequential rather than random data access
- However, much harder to analyze
Incremental gradient bounds (e.g. Nedić and Bertsekas 2001): hold for any order, but can be exponentially slower.
(Gürbüzbalaban, Ozdaglar, Parrilo 2015): for strongly convex and smooth problems, after k passes over N data points:
  Without-replacement: O((N/k)²) error
  With-replacement: O(1/(Nk)) error

38 New Results
Without-replacement sampling is not worse than with-replacement sampling (in a worst-case sense) under various scenarios, e.g.:
- O(1/√T) for convex linear prediction
- O(1/(λT)) for λ-strongly-convex and smooth linear prediction
- exp(−Ω(T/(N + 1/λ))) for least squares on N data points, using a no-replacement SVRG algorithm
Analysis uses ideas from adversarial online learning and transductive learning theory.

39 Derivation: Convex Linear Prediction
  min_{w ∈ W ⊆ R^d} F(w) = (1/N) Σ_{i=1}^N f_i(w)
Algorithm: sequentially processes f_{σ(1)}, f_{σ(2)}, ..., f_{σ(T)}, where σ is a random permutation on {1, ..., N}; produces iterates w_1, ..., w_T.
Goal: prove
  E[ (1/T) Σ_{t=1}^T F(w_t) ] − F(w*) ≤ O(1/√T),   where w* = argmin_{w ∈ W} F(w)

40 Derivation: Convex Linear Prediction
Assumption: the algorithm has bounded regret in the adversarial online setting: for any sequence of convex Lipschitz f_1, f_2, ... and any w ∈ W,
  (1/T) Σ_{t=1}^T f_t(w_t) − (1/T) Σ_{t=1}^T f_t(w) ≤ O(1/√T)
Satisfied by e.g. stochastic gradient descent.

41 Derivation: Convex Linear Prediction (1/2)
  E_σ[ (1/T) Σ_{t=1}^T F(w_t) ] − F(w*)

42 Derivation: Convex Linear Prediction (1/2)
  E_σ[ (1/T) Σ_{t=1}^T F(w_t) ] − F(w*)
  = E_σ[ (1/T) Σ_{t=1}^T ( F(w_t) − f_{σ(t)}(w*) ) ]

43 Derivation: Convex Linear Prediction (1/2)
  E_σ[ (1/T) Σ_{t=1}^T F(w_t) ] − F(w*)
  = E_σ[ (1/T) Σ_{t=1}^T ( F(w_t) − f_{σ(t)}(w*) ) ]
  = E_σ[ (1/T) Σ_{t=1}^T ( f_{σ(t)}(w_t) − f_{σ(t)}(w*) ) ] + E_σ[ (1/T) Σ_{t=1}^T ( F(w_t) − f_{σ(t)}(w_t) ) ]

44 Derivation: Convex Linear Prediction (1/2)
  E_σ[ (1/T) Σ_{t=1}^T F(w_t) ] − F(w*)
  = E_σ[ (1/T) Σ_{t=1}^T ( F(w_t) − f_{σ(t)}(w*) ) ]
  = E_σ[ (1/T) Σ_{t=1}^T ( f_{σ(t)}(w_t) − f_{σ(t)}(w*) ) ] + E_σ[ (1/T) Σ_{t=1}^T ( F(w_t) − f_{σ(t)}(w_t) ) ]
  ≤ O(1/√T) + E_σ[ (1/T) Σ_{t=1}^T ( (1/N) Σ_{i=1}^N f_{σ(i)}(w_t) − (1/(N−t+1)) Σ_{i=t}^N f_{σ(i)}(w_t) ) ]

45 Derivation: Convex Linear Prediction (2/2)
Following algebraic manipulations,
  = O(1/√T) + (1/T) Σ_{t=1}^T ((t−1)/N) · E[ F_{1:t−1}(w_t) − F_{t:N}(w_t) ],
where F_{a:b} is the average over f_{σ(a)}, ..., f_{σ(b)}.

46 Derivation: Convex Linear Prediction (2/2)
Following algebraic manipulations,
  = O(1/√T) + (1/T) Σ_{t=1}^T ((t−1)/N) · E[ F_{1:t−1}(w_t) − F_{t:N}(w_t) ],
where F_{a:b} is the average over f_{σ(a)}, ..., f_{σ(b)}.
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · E[ sup_{w ∈ W} ( F_{1:t−1}(w) − F_{t:N}(w) ) ]

47 Derivation: Convex Linear Prediction (2/2)
Following algebraic manipulations,
  = O(1/√T) + (1/T) Σ_{t=1}^T ((t−1)/N) · E[ F_{1:t−1}(w_t) − F_{t:N}(w_t) ],
where F_{a:b} is the average over f_{σ(a)}, ..., f_{σ(b)}.
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · E[ sup_{w ∈ W} ( F_{1:t−1}(w) − F_{t:N}(w) ) ]
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · R_{t−1:N−t+1}(W) + O(1/√N),
where R_{t−1:N−t+1}(W) is the transductive Rademacher complexity of W w.r.t. f_1, f_2, ..., f_N.

48 Derivation: Convex Linear Prediction (2/2)
Following algebraic manipulations,
  = O(1/√T) + (1/T) Σ_{t=1}^T ((t−1)/N) · E[ F_{1:t−1}(w_t) − F_{t:N}(w_t) ],
where F_{a:b} is the average over f_{σ(a)}, ..., f_{σ(b)}.
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · E[ sup_{w ∈ W} ( F_{1:t−1}(w) − F_{t:N}(w) ) ]
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · R_{t−1:N−t+1}(W) + O(1/√N),
where R_{t−1:N−t+1}(W) is the transductive Rademacher complexity of W w.r.t. f_1, f_2, ..., f_N.
Specializing to linear predictors and convex Lipschitz losses, we get the familiar O(1/√T) rate.

49 Derivation: Convex Linear Prediction (2/2)
Following algebraic manipulations,
  = O(1/√T) + (1/T) Σ_{t=1}^T ((t−1)/N) · E[ F_{1:t−1}(w_t) − F_{t:N}(w_t) ],
where F_{a:b} is the average over f_{σ(a)}, ..., f_{σ(b)}.
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · E[ sup_{w ∈ W} ( F_{1:t−1}(w) − F_{t:N}(w) ) ]
  ≤ O(1/√T) + (1/T) Σ_{t=2}^T ((t−1)/N) · R_{t−1:N−t+1}(W) + O(1/√N),
where R_{t−1:N−t+1}(W) is the transductive Rademacher complexity of W w.r.t. f_1, f_2, ..., f_N.
Specializing to linear predictors and convex Lipschitz losses, we get the familiar O(1/√T) rate.
With more effort, get O(1/(λT)) with λ-strong convexity.

50 SVRG
  min_{w ∈ W ⊆ R^d} F(w) = (1/N) Σ_{i=1}^N f_i(w)
In recent years, fast stochastic algorithms to solve finite-sum problems:
SAG [Le Roux et al. 2012], SDCA (as analyzed in [Shalev-Shwartz et al. 2013, 2016]), SVRG [Johnson and Zhang 2013], SAGA [Defazio et al. 2014], Finito [Defazio et al. 2014], S2GD [Konečný and Richtárik 2013], ...
Cheap stochastic iterations + linear convergence rate (assuming strong convexity and smoothness).
All existing analyses are based on with-replacement sampling.
We consider without-replacement SVRG.

51 With-Replacement SVRG
Initialize w̃_1 = 0
for s = 1, 2, ..., S do
  Compute ∇F(w̃_s) = (1/N) Σ_{i=1}^N ∇f_i(w̃_s)
  w_1 := w̃_s
  for t = 1, 2, ..., T do
    Draw i_t uniformly at random from {1, ..., N}
    w_{t+1} := w_t − η ( ∇f_{i_t}(w_t) − ∇f_{i_t}(w̃_s) + ∇F(w̃_s) )
  end for
  Pick w̃_{s+1} at random from w_1, ..., w_T
end for
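
A compact Python sketch of this pseudocode on a toy regularized least-squares problem (the data, step size η, and epoch lengths are my own illustrative choices, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 500, 10
lam = 0.01
X = rng.standard_normal((N, d))
Y = rng.standard_normal(N)

def grad_fi(w, i):                          # f_i(w) = 0.5*(x_i'w - y_i)^2 + 0.5*lam*||w||^2
    return (X[i] @ w - Y[i]) * X[i] + lam * w

def full_grad(w):
    return X.T @ (X @ w - Y) / N + lam * w

eta, S, T = 0.01, 10, 2 * N
w_tilde = np.zeros(d)
for s in range(S):
    mu = full_grad(w_tilde)                 # full gradient at the reference point
    w = w_tilde.copy()
    iterates = []
    for t in range(T):
        i = rng.integers(N)                 # with-replacement draw
        w = w - eta * (grad_fi(w, i) - grad_fi(w_tilde, i) + mu)
        iterates.append(w.copy())
    w_tilde = iterates[rng.integers(T)]     # pick the next reference point at random
    print("epoch", s, "||grad F|| =", np.linalg.norm(full_grad(w_tilde)))
```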

52 With-Replacement SVRG
Initialize w̃_1 = 0
for s = 1, 2, ..., S do
  Compute ∇F(w̃_s) = (1/N) Σ_{i=1}^N ∇f_i(w̃_s)
  w_1 := w̃_s
  for t = 1, 2, ..., T do
    Draw i_t uniformly at random from {1, ..., N}
    w_{t+1} := w_t − η ( ∇f_{i_t}(w_t) − ∇f_{i_t}(w̃_s) + ∇F(w̃_s) )
  end for
  Pick w̃_{s+1} at random from w_1, ..., w_T
end for
Analysis for 1-smooth, λ-strongly convex functions: S = O(log(1/ɛ)), T ≈ 1/λ.

53 With-Replacement SVRG
Initialize w̃_1 = 0
for s = 1, 2, ..., S do
  Compute ∇F(w̃_s) = (1/N) Σ_{i=1}^N ∇f_i(w̃_s)
  w_1 := w̃_s
  for t = 1, 2, ..., T do
    Draw i_t uniformly at random from {1, ..., N}
    w_{t+1} := w_t − η ( ∇f_{i_t}(w_t) − ∇f_{i_t}(w̃_s) + ∇F(w̃_s) )
  end for
  Pick w̃_{s+1} at random from w_1, ..., w_T
end for
Analysis for 1-smooth, λ-strongly convex functions: S = O(log(1/ɛ)), T ≈ 1/λ.
Observation [Lee, Lin, Ma 2015]: can simulate it in a distributed setting with randomly partitioned data, as long as T ≤ O(√N).

54 With-Replacement SVRG
Initialize w̃_1 = 0
for s = 1, 2, ..., S do
  Compute ∇F(w̃_s) = (1/N) Σ_{i=1}^N ∇f_i(w̃_s)
  w_1 := w̃_s
  for t = 1, 2, ..., T do
    Draw i_t uniformly at random from {1, ..., N}
    w_{t+1} := w_t − η ( ∇f_{i_t}(w_t) − ∇f_{i_t}(w̃_s) + ∇F(w̃_s) )
  end for
  Pick w̃_{s+1} at random from w_1, ..., w_T
end for
Analysis for 1-smooth, λ-strongly convex functions: S = O(log(1/ɛ)), T ≈ 1/λ.
Observation [Lee, Lin, Ma 2015]: can simulate it in a distributed setting with randomly partitioned data, as long as T ≤ O(√N).
By the analysis, this is only applicable when λ ≥ Ω(1/√N).

55 Without-Replacement SVRG
Given: a permutation σ on {1, ..., N}
Initialize w̃_1 = 0
for s = 1, 2, ..., S do
  Compute ∇F(w̃_s) = (1/N) Σ_{i=1}^N ∇f_i(w̃_s)
  w_1 := w̃_s
  for t = 1, 2, ..., T do
    w_{t+1} := w_t − η ( ∇f_{σ(t)}(w_t) − ∇f_{σ(t)}(w̃_s) + ∇F(w̃_s) )
  end for
  Pick w̃_{s+1} at random from w_1, ..., w_T
end for
Observation: can simulate it in a distributed setting with randomly partitioned data, all the way up to T = Ω(N).
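
Relative to the with-replacement sketch above, only the index selection changes: the stochastic steps read off a single fixed shuffle instead of fresh uniform draws. A minimal variant (toy data and parameters are again my own; here successive epochs advance through the permutation, which is the behaviour the distributed simulation relies on):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 500, 10
lam = 0.01
X = rng.standard_normal((N, d))
Y = rng.standard_normal(N)

def grad_fi(w, i):
    return (X[i] @ w - Y[i]) * X[i] + lam * w

def full_grad(w):
    return X.T @ (X @ w - Y) / N + lam * w

sigma = rng.permutation(N)                  # one random shuffle, fixed up front
eta, S, T = 0.01, 5, 100                    # T on the order of 1/lam
w_tilde = np.zeros(d)
pos = 0                                     # current position in the shuffle
for s in range(S):
    mu = full_grad(w_tilde)
    w = w_tilde.copy()
    iterates = []
    for t in range(T):
        i = sigma[(pos + t) % N]            # next index of the shuffle; nothing is re-sampled
        w = w - eta * (grad_fi(w, i) - grad_fi(w_tilde, i) + mu)
        iterates.append(w.copy())
    pos = (pos + T) % N
    w_tilde = iterates[rng.integers(T)]
    print("epoch", s, "||grad F|| =", np.linalg.norm(full_grad(w_tilde)))
```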

56 Bound for Regularized Least Squares
  f_i(w) = (1/2)(⟨w, x_i⟩ − y_i)² + (λ̂/2)||w||²;   F(·) is λ-strongly convex
Theorem: if we perform S = O(log(1/ɛ)) epochs of T = Θ(1/λ) without-replacement stochastic iterations each, then
  E[ F(w̃_{S+1}) − F(w*) ] ≤ ɛ

57 Bound for Regularized Least Squares
  f_i(w) = (1/2)(⟨w, x_i⟩ − y_i)² + (λ̂/2)||w||²;   F(·) is λ-strongly convex
Theorem: if we perform S = O(log(1/ɛ)) epochs of T = Θ(1/λ) without-replacement stochastic iterations each, then
  E[ F(w̃_{S+1}) − F(w*) ] ≤ ɛ
Implications:

58 Bound for Regularized Least Squares
  f_i(w) = (1/2)(⟨w, x_i⟩ − y_i)² + (λ̂/2)||w||²;   F(·) is λ-strongly convex
Theorem: if we perform S = O(log(1/ɛ)) epochs of T = Θ(1/λ) without-replacement stochastic iterations each, then
  E[ F(w̃_{S+1}) − F(w*) ] ≤ ɛ
Implications:
- Non-distributed optimization: if λ ≥ Ω(1/N), we get an ɛ-optimal solution without data reshuffling.

59 Bound for Regularized Least Squares
  f_i(w) = (1/2)(⟨w, x_i⟩ − y_i)² + (λ̂/2)||w||²;   F(·) is λ-strongly convex
Theorem: if we perform S = O(log(1/ɛ)) epochs of T = Θ(1/λ) without-replacement stochastic iterations each, then
  E[ F(w̃_{S+1}) − F(w*) ] ≤ ɛ
Implications:
- Non-distributed optimization: if λ ≥ Ω(1/N), we get an ɛ-optimal solution without data reshuffling.
- Distributed optimization: if λ ≥ Ω(1/n), we get an ɛ-optimal solution with randomly-partitioned data, O(log(1/ɛ)) communication rounds, and runtime dominated by fully-parallelizable gradient computation.
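
A sketch of why this works in the distributed setting (toy data; the partition-by-reshaping trick, parameter values, and the choice to keep the last iterate are my own simplifications): a random partition is the same as one global shuffle, so when λ ≥ Ω(1/n) the T = Θ(1/λ) ≤ n stochastic steps of an epoch all touch data held by a single machine, and the only communication per epoch is averaging the local gradients to form the full gradient.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, d = 4, 250, 10                        # m machines with n examples each, N = m*n total
N = m * n
lam = 0.02                                  # lam >= Omega(1/n), so T = Theta(1/lam) fits on one machine
X = rng.standard_normal((N, d))
Y = rng.standard_normal(N)

perm = rng.permutation(N)                   # random partition = one global shuffle
chunks = perm.reshape(m, n)                 # machine i holds the examples indexed by chunks[i]

def grad_fi(w, i):
    return (X[i] @ w - Y[i]) * X[i] + lam * w

def local_grad(w, i):                       # machine i's average gradient over its own chunk
    idx = chunks[i]
    return X[idx].T @ (X[idx] @ w - Y[idx]) / n + lam * w

eta, S = 0.01, 8
T = int(1.0 / lam)                          # 50 stochastic steps per epoch; divides n here
w_tilde = np.zeros(d)
for s in range(S):
    # Communication round: average the machine-local gradients to get the full gradient.
    mu = np.mean([local_grad(w_tilde, i) for i in range(m)], axis=0)
    # The next T indices of the shuffle all sit on one machine (T divides n), so the
    # whole epoch runs locally there with no further communication.
    block = perm[s * T:(s + 1) * T]
    w = w_tilde.copy()
    for i in block:
        w = w - eta * (grad_fi(w, int(i)) - grad_fi(w_tilde, int(i)) + mu)
    w_tilde = w                             # keep the last iterate for simplicity

final_grad = np.mean([local_grad(w_tilde, i) for i in range(m)], axis=0)
print("||grad F|| after", S, "epochs:", np.linalg.norm(final_grad))
```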

60 Summary and Open Questions
  min_{w ∈ W ⊆ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),   F_i(w) = (1/n) Σ_{j=1}^n f_{i,j}(w)
Arbitrary partition: the baseline is also (worst-case) optimal...
δ-related:
  For quadratics ✓; more generally ✗; self-concordant losses ✓
  Algorithms are communication-efficient, but not necessarily runtime-efficient
Random partition:
  Least squares ✓; more generally ✗ (λ-strongly-convex and smooth, but only when λ ≥ Ω(1/√(data size)))
