Importance Sampling for Minibatches


Dominik Csiba, School of Mathematics, University of Edinburgh
Birmingham

Acknowledgements
This talk is based on [Csiba and Richtárik, 2016]. Co-author: Peter Richtárik, University of Edinburgh.

Introduction
Many supervised learning tasks can be written as optimization problems.

Regularized Loss Minimization (RLM)
Let f_i : R^d -> R be a loss function associated with the i-th example and R : R^d -> R be a regularizer. Then the RLM problem is defined as

    w^* = \arg\min_{w \in R^d} \Big\{ F(w) := \frac{1}{n} \sum_{i=1}^n f_i(w) + R(w) \Big\}.

Ridge regression: training set \{(x_i, y_i)\}_{i=1}^n, linear regression with the square loss f_i(w) = \frac{1}{2} (\langle x_i, w \rangle - y_i)^2 and the squared \ell_2 regularizer R(w) = \frac{\lambda}{2} \|w\|_2^2.

Direct solution: usually not available => iterative methods
Batch methods: very expensive iterations => stochastic methods
Most popular methods: stochastic iterative methods
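To make the ridge regression instance concrete, here is a minimal Python sketch (not from the talk) of the RLM objective F(w) and its full gradient; the data matrix X with rows x_i, the labels y and the regularization parameter lam are assumed inputs.

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """F(w) = (1/n) * sum_i 0.5 * (<x_i, w> - y_i)^2 + (lam/2) * ||w||_2^2."""
    residuals = X @ w - y
    return 0.5 * np.mean(residuals ** 2) + 0.5 * lam * np.dot(w, w)

def ridge_gradient(w, X, y, lam):
    """Full gradient of F: (1/n) * X^T (X w - y) + lam * w."""
    n = X.shape[0]
    return X.T @ (X @ w - y) / n + lam * w
```

A batch method would evaluate this full gradient at every iteration, which is exactly the cost that the stochastic methods below avoid.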

Randomized Coordinate Descent
NSync [Richtárik and Takáč, 2015]
Sampling \hat{S}: a random set-valued mapping whose values are subsets of \{1, ..., d\}, such that P(j \in \hat{S}) > 0 for every j \in \{1, ..., d\}.
Stepsize parameters v_1, ..., v_d > 0 computable from (F, \hat{S}).
Algorithm: at each iteration t > 0,
  1. sample a random set S_t using \hat{S};
  2. for each j \in S_t, update w_j^t = w_j^{t-1} - \frac{1}{v_j} \nabla_j F(w^{t-1}).

Serial sampling: P(|\hat{S}| = 1) = 1.
Ridge regression using serial sampling: v_j = \frac{1}{n} \sum_{i=1}^n (x_i^j)^2 + \lambda.
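A minimal sketch of the serial NSync iteration specialized to ridge regression, with the stepsizes v_j from the slide; the probability vector p and the iteration budget are assumed inputs, and the residual X w - y is maintained so that each coordinate step only needs one column pass.

```python
import numpy as np

def nsync_serial_ridge(X, y, lam, p, num_iters, seed=0):
    """Serial NSync on F(w) = (1/(2n)) * ||X w - y||^2 + (lam/2) * ||w||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    v = (X ** 2).mean(axis=0) + lam                    # v_j = (1/n) * sum_i (x_i^j)^2 + lam
    w = np.zeros(d)
    residual = X @ w - y                               # kept in sync with w
    for _ in range(num_iters):
        j = rng.choice(d, p=p)                         # sample S_t = {j} with probability p_j
        grad_j = X[:, j] @ residual / n + lam * w[j]   # nabla_j F(w)
        step = -grad_j / v[j]                          # w_j <- w_j - (1/v_j) * nabla_j F(w)
        w[j] += step
        residual += step * X[:, j]
    return w
```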

Expected Separable Overapproximation
Assumption 1: ESO [Qu and Richtárik, 2014]
Let the objective F and the sampling \hat{S} be given and let p_j := P(j \in \hat{S}). Let h_{[\hat{S}]} be the vector with entries

    (h_{[\hat{S}]})_j = h_j if j \in \hat{S}, and 0 if j \notin \hat{S}.

The parameters v_1, ..., v_d satisfy the ESO if for all w, h \in R^d it holds that

    E[F(w + h_{[\hat{S}]})] \le F(w) + \sum_{j=1}^d p_j h_j \nabla_j F(w) + \frac{1}{2} \sum_{j=1}^d p_j v_j h_j^2,

and we write (v_1, ..., v_d) \in ESO(F, \hat{S}).

Interpretation: the expectation of F at the update behaves smoothly.
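As a sanity check in the serial case (a sketch, assuming F is coordinate-wise smooth with constants L_j and using \sum_j p_j = 1), the ESO reduces to the usual coordinate-wise upper bound:

    E[F(w + h_{[\hat{S}]})]
      = \sum_{j=1}^d p_j F(w + h_j e_j)
      \le \sum_{j=1}^d p_j \Big( F(w) + h_j \nabla_j F(w) + \frac{L_j}{2} h_j^2 \Big)
      = F(w) + \sum_{j=1}^d p_j h_j \nabla_j F(w) + \frac{1}{2} \sum_{j=1}^d p_j L_j h_j^2.

So v_j = L_j works; for ridge regression \nabla^2_{jj} F(w) = \frac{1}{n} \sum_{i=1}^n (x_i^j)^2 + \lambda, which recovers the v_j stated earlier.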

Strong Convexity
Assumption 2: Strong Convexity
A function F is \lambda-strongly convex if for all w, h \in R^d it holds that

    F(w + h) \ge F(w) + \langle \nabla F(w), h \rangle + \frac{\lambda}{2} \|h\|_2^2.

Example: R(w) = \frac{\lambda}{2} \|w\|_2^2 is \lambda-strongly convex.
Lemma: convex + \lambda-strongly convex => \lambda-strongly convex.
Ridge logistic regression is \lambda-strongly convex:

    F(w) = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-y_i \langle w, x_i \rangle)) + \frac{\lambda}{2} \|w\|_2^2.

Ridge regression is \lambda-strongly convex.
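The lemma follows by adding the two defining inequalities (a one-line worked step, with f convex and R being \lambda-strongly convex): for all w, h,

    f(w + h) \ge f(w) + \langle \nabla f(w), h \rangle,
    R(w + h) \ge R(w) + \langle \nabla R(w), h \rangle + \frac{\lambda}{2} \|h\|_2^2,

and summing the two gives the \lambda-strong convexity of F = f + R. With f(w) = \frac{1}{n} \sum_i f_i(w) this covers both ridge regression and ridge logistic regression.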

Convergence Rate
Theorem [Richtárik and Takáč, 2015]
Let the objective F and the sampling \hat{S} be such that F is \lambda-strongly convex and (v_1, ..., v_d) \in ESO(F, \hat{S}), and let v := [v_1, ..., v_d]. Let p_j := P(j \in \hat{S}) and p := [p_1, ..., p_d]. Define

    C(p, v) := \max_{j \in \{1,...,d\}} \frac{v_j}{p_j \lambda},

and let \{w^t\}_{t \ge 1} be the sequence generated by NSync. Then

    T \ge C(p, v) \log\Big(\frac{C}{\epsilon}\Big)  =>  E[F(w^T) - F(w^*)] \le \epsilon,

where C is a constant depending on w^0 and w^*.

Key quantity: C(p, v).
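A small Python helper (a sketch; the theorem's constant C is problem-dependent, so it is passed in as an assumed input C0) that evaluates the key quantity C(p, v) and the corresponding iteration bound:

```python
import math
import numpy as np

def complexity(p, v, lam):
    """C(p, v) = max_j v_j / (p_j * lam)."""
    p, v = np.asarray(p, dtype=float), np.asarray(v, dtype=float)
    return float(np.max(v / (p * lam)))

def iteration_bound(p, v, lam, C0, eps):
    """Smallest integer T with T >= C(p, v) * log(C0 / eps)."""
    return math.ceil(complexity(p, v, lam) * math.log(C0 / eps))
```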

Serial Importance Sampling
Let \hat{S} be a serial sampling defined by P(\hat{S} = \{j\}) = p_j, where \sum_j p_j = 1.
Ridge regression using serial sampling: v_j = \frac{1}{n} \sum_{i=1}^n (x_i^j)^2 + \lambda.
Question: how should we choose p so that the complexity

    C(p, v) = \max_{j \in \{1,...,d\}} \frac{v_j}{p_j \lambda}

is minimized?

    sampling   | p_j              | C(p, v)
    uniform    | 1/d              | (d \max_j v_j) / \lambda
    importance | v_j / \sum_k v_k | (\sum_k v_k) / \lambda

The speedup of importance over uniform sampling is the ratio \max_j v_j / \mathrm{mean}(v).
Note: importance sampling is the optimal fixed serial sampling.
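The table above can be checked numerically; a minimal sketch comparing C(p, v) under the uniform and importance serial samplings for a ridge regression problem (the data matrix X is an assumed input):

```python
import numpy as np

def compare_serial_samplings(X, lam):
    """Return C(p, v) for uniform and importance sampling, and the resulting speedup."""
    d = X.shape[1]
    v = (X ** 2).mean(axis=0) + lam                  # v_j = (1/n) * sum_i (x_i^j)^2 + lam

    p_uniform = np.full(d, 1.0 / d)
    p_importance = v / v.sum()                       # p_j proportional to v_j

    C_uniform = np.max(v / (p_uniform * lam))        # = d * max_j v_j / lam
    C_importance = np.max(v / (p_importance * lam))  # = sum_k v_k / lam
    return C_uniform, C_importance, C_uniform / C_importance

# Columns with heterogeneous scales make the speedup max(v) / mean(v) large:
# X = np.random.randn(1000, 50) * np.random.exponential(size=50)
# print(compare_serial_samplings(X, lam=1e-3))
```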

Minibatch Coordinate Descent
Assume we have \tau cores and we want to make use of all of them.
Parallel sampling: P(|\hat{S}| = \tau) = 1, each core processes one coordinate.
Question: how to design an importance sampling for minibatches?
Complications:
  - For serial sampling we had P(\hat{S} = \{j\}) = P(j \in \hat{S}), where the left-hand side is the parameter we choose; this is not the case for parallel samplings.
  - We need to choose \binom{d}{\tau} set-probabilities, instead of just \binom{d}{1} = d.
  - All of them influence the values P(j \in \hat{S}).
Goal: find a parallel sampling with direct control over p_j = P(j \in \hat{S}).
Solution: we introduce the bucket sampling.

Bucket Sampling
Idea: divide the coordinates \{1, ..., d\} into \tau buckets and choose one coordinate non-uniformly from each bucket.

Bucket sampling
Let C_1, ..., C_\tau be an arbitrary partition of \{1, ..., d\}. Choose probabilities p_j such that \sum_{j \in C_k} p_j = 1 for each k \in \{1, ..., \tau\}. The bucket sampling is the procedure which outputs a single coordinate from each C_k, picking j \in C_k with probability p_j.

Example: let \tau = 4 and d = 12,

    p = [ 0.1, 0.4, 0.2, 0.3 | 1.0 | 0.1, 0.1, 0.1, 0.1, 0.6 | 0.5, 0.5 ],

with the four groups forming C_1, C_2, C_3 and C_4 (each summing to 1).

Observe: it holds that P(j \in \hat{S}) = p_j.
Note: both p_1, ..., p_d and C_1, ..., C_\tau are parameters of the sampling.
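A minimal Python sketch of the bucket sampling as defined above; the partition into buckets and the probability vector p (summing to one within each bucket) are the caller's parameters:

```python
import numpy as np

def bucket_sampling(buckets, p, seed=None):
    """Draw one coordinate from each bucket C_k, picking j in C_k with probability p[j].

    buckets: list of index collections C_1, ..., C_tau partitioning {0, ..., d-1}
    p:       array with p[C_k].sum() == 1 for every bucket C_k
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    return np.array([rng.choice(np.asarray(C_k), p=p[np.asarray(C_k)]) for C_k in buckets])

# Example matching the slide (tau = 4, d = 12):
# buckets = [range(0, 4), range(4, 5), range(5, 10), range(10, 12)]
# p = np.array([0.1, 0.4, 0.2, 0.3, 1.0, 0.1, 0.1, 0.1, 0.1, 0.6, 0.5, 0.5])
# print(bucket_sampling(buckets, p, seed=0))
```

By construction the marginals satisfy P(j \in \hat{S}) = p_j, which is exactly the direct control asked for on the previous slide.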

Importance Sampling for Minibatches
Goal: we aim to minimize

    C(p^{bucket}, v^{bucket}) = \max_{j \in \{1,...,d\}} \frac{v_j^{bucket}}{\lambda p_j^{bucket}}.

Issue: the values v^{bucket} depend on p^{bucket}, i.e. v_j^{bucket} = v_j^{bucket}(p^{bucket}).
Optimal probabilities: the joint optimization is difficult.
Strategy: choose a reasonable v^{bucket}, then pick p_j^{bucket} \propto v_j^{bucket}.
In theory: we cannot guarantee a speedup, just convergence.
In practice: we will see.
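A hedged sketch of this strategy. The concrete formula for v^{bucket} comes from the ESO for the bucket sampling in [Csiba and Richtárik, 2016] and is not reproduced on the slide, so it is left here as a placeholder callback compute_v; the code only illustrates the step "fix a reasonable v^{bucket}, then set p_j proportional to v_j within each bucket".

```python
import numpy as np

def within_bucket_importance(buckets, v):
    """Set p_j proportional to v_j inside each bucket (so each bucket sums to 1)."""
    v = np.asarray(v, dtype=float)
    p = np.zeros_like(v)
    for C_k in buckets:
        idx = np.asarray(C_k)
        p[idx] = v[idx] / v[idx].sum()
    return p

def bucket_importance_probs(buckets, d, compute_v, num_rounds=1):
    """compute_v(p) -> v is a placeholder for the bucket-sampling ESO parameters.

    Start from uniform-within-bucket probabilities; optionally refine a few times,
    since v^bucket itself depends on p^bucket.
    """
    p = within_bucket_importance(buckets, np.ones(d))
    for _ in range(num_rounds):
        v = compute_v(p)
        p = within_bucket_importance(buckets, v)
    return p
```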

Experiments: Convergence
Ridge logistic regression using Quartz [Qu et al., 2015].
Randomly chosen, equally sized buckets C_1, ..., C_\tau.

[Figure: P(w^t) - P(w^*) versus the number of effective passes divided by \tau, for the \tau-nice (uniform) and \tau-imp (importance) samplings with \tau \in \{1, 2, 4, 8, 16, 32\}.]

url dataset: d = 2,396,130, n = 3,231,962, sparsity = 0.04%.

Experiments: Theory vs. Practice
Ridge logistic regression using Quartz [Qu et al., 2015].
Randomly chosen, equally sized buckets C_1, ..., C_\tau.

    Data    | tau = 1   | tau = 2 | tau = 4 | tau = 8 | tau = 16 | tau = 32
    ijcnn1  | 1.2 : ... | ...     | ...     | ...     | ...      | ... : 1.8
    protein | 1.3 : ... | ...     | ...     | ...     | ...      | ... : 1.5
    w8a     | 2.8 : ... | ...     | ...     | ...     | ...      | ... : 1.8
    url     | 3.0 : ... | ...     | ...     | ...     | ...      | ... : 1.7
    aloi    | 13 : ...  | ...     | ...     | ...     | ...      | ... : 6.7

Table: the theoretical : empirical ratios \theta^{(\tau-imp)} / \theta^{(\tau-nice)}.

Conclusion
We introduced a flexible parallel sampling: the bucket sampling.
We used it to obtain an importance sampling for minibatches.
It can be used with various stochastic iterative methods.
Empirically, we match the improvement gained by serial importance sampling.
In theory, we cannot in general guarantee that importance sampling for minibatches will be better than the uniform parallel sampling.

References
Csiba, Dominik and Richtárik, Peter. Importance Sampling for Minibatches. arXiv e-prints, 2016.
Richtárik, Peter and Takáč, Martin. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 2015.
Qu, Zheng and Richtárik, Peter. Coordinate descent with arbitrary sampling II: Expected separable overapproximation. arXiv e-prints, 2014.
Qu, Zheng, Richtárik, Peter and Zhang, Tong. Quartz: Randomized dual coordinate ascent with arbitrary sampling. Advances in Neural Information Processing Systems, 2015.
