Importance Sampling for Minibatches


Dominik Csiba, School of Mathematics, University of Edinburgh
Birmingham

Acknowledgements
This talk is based on [Csiba and Richtárik, 2016]. Co-author: Peter Richtárik, University of Edinburgh.

Introduction
Many supervised learning tasks can be written as optimization problems.

Regularized Loss Minimization (RLM)
Let f_i : R^d -> R be a loss function associated with the i-th example and R : R^d -> R be a regularizer. Then the RLM problem is defined as

    w^* = \arg\min_{w \in R^d} \Big\{ F(w) := \frac{1}{n} \sum_{i=1}^n f_i(w) + R(w) \Big\}.

Ridge regression: training set \{(x_i, y_i)\}_{i=1}^n, linear regression with the square loss f_i(w) = \frac{1}{2} (\langle x_i, w \rangle - y_i)^2 and the squared \ell_2 regularizer R(w) = \frac{\lambda}{2} \|w\|_2^2.

Direct solution: usually not available => iterative methods
Batch methods: very expensive iterations => stochastic methods
Most popular methods: stochastic iterative methods
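To make the ridge regression instance concrete, here is a minimal Python sketch (not from the talk) of the RLM objective F(w) and its full gradient; the data matrix X with rows x_i, the labels y and the regularization parameter lam are assumed inputs.

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """F(w) = (1/n) * sum_i 0.5 * (<x_i, w> - y_i)^2 + (lam/2) * ||w||_2^2."""
    residuals = X @ w - y
    return 0.5 * np.mean(residuals ** 2) + 0.5 * lam * np.dot(w, w)

def ridge_gradient(w, X, y, lam):
    """Full gradient of F: (1/n) * X^T (X w - y) + lam * w."""
    n = X.shape[0]
    return X.T @ (X @ w - y) / n + lam * w
```

A batch method would evaluate this full gradient at every iteration, which is exactly the cost that the stochastic methods below avoid.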

Randomized Coordinate Descent
NSync [Richtárik and Takáč, 2015]
Sampling \hat{S}: a random set-valued mapping whose values are subsets of \{1, ..., d\}, such that P(j \in \hat{S}) > 0 for every j \in \{1, ..., d\}.
Stepsize parameters v_1, ..., v_d > 0 computable from (F, \hat{S}).
Algorithm: at each iteration t > 0,
  1. sample a random set S_t using \hat{S};
  2. for each j \in S_t, update w_j^t = w_j^{t-1} - \frac{1}{v_j} \nabla_j F(w^{t-1}).

Serial sampling: P(|\hat{S}| = 1) = 1.
Ridge regression using serial sampling: v_j = \frac{1}{n} \sum_{i=1}^n (x_i^j)^2 + \lambda.
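A minimal sketch of the serial NSync iteration specialized to ridge regression, with the stepsizes v_j from the slide; the probability vector p and the iteration budget are assumed inputs, and the residual X w - y is maintained so that each coordinate step only needs one column pass.

```python
import numpy as np

def nsync_serial_ridge(X, y, lam, p, num_iters, seed=0):
    """Serial NSync on F(w) = (1/(2n)) * ||X w - y||^2 + (lam/2) * ||w||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    v = (X ** 2).mean(axis=0) + lam                    # v_j = (1/n) * sum_i (x_i^j)^2 + lam
    w = np.zeros(d)
    residual = X @ w - y                               # kept in sync with w
    for _ in range(num_iters):
        j = rng.choice(d, p=p)                         # sample S_t = {j} with probability p_j
        grad_j = X[:, j] @ residual / n + lam * w[j]   # nabla_j F(w)
        step = -grad_j / v[j]                          # w_j <- w_j - (1/v_j) * nabla_j F(w)
        w[j] += step
        residual += step * X[:, j]
    return w
```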

Expected Separable Overapproximation
Assumption 1: ESO [Qu and Richtárik, 2014]
Let the objective F and the sampling \hat{S} be given and let p_j := P(j \in \hat{S}). Let h_{[\hat{S}]} be the vector with entries

    (h_{[\hat{S}]})_j = h_j if j \in \hat{S}, and 0 if j \notin \hat{S}.

The parameters v_1, ..., v_d satisfy the ESO if for all w, h \in R^d it holds that

    E[F(w + h_{[\hat{S}]})] \le F(w) + \sum_{j=1}^d p_j h_j \nabla_j F(w) + \frac{1}{2} \sum_{j=1}^d p_j v_j h_j^2,

and we write (v_1, ..., v_d) \in ESO(F, \hat{S}).

Interpretation: the expectation of F at the update behaves smoothly.
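As a sanity check in the serial case (a sketch, assuming F is coordinate-wise smooth with constants L_j and using \sum_j p_j = 1), the ESO reduces to the usual coordinate-wise upper bound:

    E[F(w + h_{[\hat{S}]})]
      = \sum_{j=1}^d p_j F(w + h_j e_j)
      \le \sum_{j=1}^d p_j \Big( F(w) + h_j \nabla_j F(w) + \frac{L_j}{2} h_j^2 \Big)
      = F(w) + \sum_{j=1}^d p_j h_j \nabla_j F(w) + \frac{1}{2} \sum_{j=1}^d p_j L_j h_j^2.

So v_j = L_j works; for ridge regression \nabla^2_{jj} F(w) = \frac{1}{n} \sum_{i=1}^n (x_i^j)^2 + \lambda, which recovers the v_j stated earlier.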

Strong Convexity
Assumption 2: Strong Convexity
A function F is \lambda-strongly convex if for all w, h \in R^d it holds that

    F(w + h) \ge F(w) + \langle \nabla F(w), h \rangle + \frac{\lambda}{2} \|h\|_2^2.

Example: R(w) = \frac{\lambda}{2} \|w\|_2^2 is \lambda-strongly convex.
Lemma: convex + \lambda-strongly convex => \lambda-strongly convex.
Ridge logistic regression is \lambda-strongly convex:

    F(w) = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-y_i \langle w, x_i \rangle)) + \frac{\lambda}{2} \|w\|_2^2.

Ridge regression is \lambda-strongly convex.
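The lemma follows by adding the two defining inequalities (a one-line worked step, with f convex and R being \lambda-strongly convex): for all w, h,

    f(w + h) \ge f(w) + \langle \nabla f(w), h \rangle,
    R(w + h) \ge R(w) + \langle \nabla R(w), h \rangle + \frac{\lambda}{2} \|h\|_2^2,

and summing the two gives the \lambda-strong convexity of F = f + R. With f(w) = \frac{1}{n} \sum_i f_i(w) this covers both ridge regression and ridge logistic regression.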

Convergence Rate
Theorem [Richtárik and Takáč, 2015]
Let the objective F and the sampling \hat{S} be such that F is \lambda-strongly convex and (v_1, ..., v_d) \in ESO(F, \hat{S}), and let v := [v_1, ..., v_d]. Let p_j := P(j \in \hat{S}) and p := [p_1, ..., p_d]. Define

    C(p, v) := \max_{j \in \{1,...,d\}} \frac{v_j}{p_j \lambda},

and let \{w^t\}_{t \ge 1} be the sequence generated by NSync. Then

    T \ge C(p, v) \log\Big(\frac{C}{\epsilon}\Big)  =>  E[F(w^T) - F(w^*)] \le \epsilon,

where C is a constant depending on w^0 and w^*.

Key quantity: C(p, v).
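A small Python helper (a sketch; the theorem's constant C is problem-dependent, so it is passed in as an assumed input C0) that evaluates the key quantity C(p, v) and the corresponding iteration bound:

```python
import math
import numpy as np

def complexity(p, v, lam):
    """C(p, v) = max_j v_j / (p_j * lam)."""
    p, v = np.asarray(p, dtype=float), np.asarray(v, dtype=float)
    return float(np.max(v / (p * lam)))

def iteration_bound(p, v, lam, C0, eps):
    """Smallest integer T with T >= C(p, v) * log(C0 / eps)."""
    return math.ceil(complexity(p, v, lam) * math.log(C0 / eps))
```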

Serial Importance Sampling
Let \hat{S} be a serial sampling defined by P(\hat{S} = \{j\}) = p_j, where \sum_j p_j = 1.
Ridge regression using serial sampling: v_j = \frac{1}{n} \sum_{i=1}^n (x_i^j)^2 + \lambda.
Question: how should we choose p so that the complexity

    C(p, v) = \max_{j \in \{1,...,d\}} \frac{v_j}{p_j \lambda}

is minimized?

    sampling   | p_j              | C(p, v)
    uniform    | 1/d              | (d \max_j v_j) / \lambda
    importance | v_j / \sum_k v_k | (\sum_k v_k) / \lambda

The speedup of importance over uniform sampling is the ratio \max_j v_j / \mathrm{mean}(v).
Note: importance sampling is the optimal fixed serial sampling.
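The table above can be checked numerically; a minimal sketch comparing C(p, v) under the uniform and importance serial samplings for a ridge regression problem (the data matrix X is an assumed input):

```python
import numpy as np

def compare_serial_samplings(X, lam):
    """Return C(p, v) for uniform and importance sampling, and the resulting speedup."""
    d = X.shape[1]
    v = (X ** 2).mean(axis=0) + lam                  # v_j = (1/n) * sum_i (x_i^j)^2 + lam

    p_uniform = np.full(d, 1.0 / d)
    p_importance = v / v.sum()                       # p_j proportional to v_j

    C_uniform = np.max(v / (p_uniform * lam))        # = d * max_j v_j / lam
    C_importance = np.max(v / (p_importance * lam))  # = sum_k v_k / lam
    return C_uniform, C_importance, C_uniform / C_importance

# Columns with heterogeneous scales make the speedup max(v) / mean(v) large:
# X = np.random.randn(1000, 50) * np.random.exponential(size=50)
# print(compare_serial_samplings(X, lam=1e-3))
```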

Minibatch Coordinate Descent
Assume we have \tau cores and we want to make use of all of them.
Parallel sampling: P(|\hat{S}| = \tau) = 1, each core processes one coordinate.
Question: how to design an importance sampling for minibatches?
Complications:
  - For serial sampling we had P(\hat{S} = \{j\}) = P(j \in \hat{S}), where the left-hand side is the parameter we choose; this is not the case for parallel samplings.
  - We need to choose \binom{d}{\tau} set-probabilities, instead of just \binom{d}{1} = d.
  - All of them influence the values P(j \in \hat{S}).
Goal: find a parallel sampling with direct control over p_j = P(j \in \hat{S}).
Solution: we introduce the bucket sampling.

Bucket Sampling
Idea: divide the coordinates \{1, ..., d\} into \tau buckets and choose one coordinate non-uniformly from each bucket.

Bucket sampling
Let C_1, ..., C_\tau be an arbitrary partition of \{1, ..., d\}. Choose probabilities p_j such that \sum_{j \in C_k} p_j = 1 for each k \in \{1, ..., \tau\}. The bucket sampling is the procedure which outputs a single coordinate from each C_k, picking j \in C_k with probability p_j.

Example: let \tau = 4 and d = 12,

    p = [ 0.1, 0.4, 0.2, 0.3 | 1.0 | 0.1, 0.1, 0.1, 0.1, 0.6 | 0.5, 0.5 ],

with the four groups forming C_1, C_2, C_3 and C_4 (each summing to 1).

Observe: it holds that P(j \in \hat{S}) = p_j.
Note: both p_1, ..., p_d and C_1, ..., C_\tau are parameters of the sampling.
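A minimal Python sketch of the bucket sampling as defined above; the partition into buckets and the probability vector p (summing to one within each bucket) are the caller's parameters:

```python
import numpy as np

def bucket_sampling(buckets, p, seed=None):
    """Draw one coordinate from each bucket C_k, picking j in C_k with probability p[j].

    buckets: list of index collections C_1, ..., C_tau partitioning {0, ..., d-1}
    p:       array with p[C_k].sum() == 1 for every bucket C_k
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    return np.array([rng.choice(np.asarray(C_k), p=p[np.asarray(C_k)]) for C_k in buckets])

# Example matching the slide (tau = 4, d = 12):
# buckets = [range(0, 4), range(4, 5), range(5, 10), range(10, 12)]
# p = np.array([0.1, 0.4, 0.2, 0.3, 1.0, 0.1, 0.1, 0.1, 0.1, 0.6, 0.5, 0.5])
# print(bucket_sampling(buckets, p, seed=0))
```

By construction the marginals satisfy P(j \in \hat{S}) = p_j, which is exactly the direct control asked for on the previous slide.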

Importance Sampling for Minibatches
Goal: we aim to minimize

    C(p^{bucket}, v^{bucket}) = \max_{j \in \{1,...,d\}} \frac{v_j^{bucket}}{\lambda p_j^{bucket}}.

Issue: the values v^{bucket} depend on p^{bucket}, i.e. v_j^{bucket} = v_j^{bucket}(p^{bucket}).
Optimal probabilities: the joint optimization is difficult.
Strategy: choose a reasonable v^{bucket}, then pick p_j^{bucket} \propto v_j^{bucket}.
In theory: we cannot guarantee a speedup, just convergence.
In practice: we will see.
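A hedged sketch of this strategy. The concrete formula for v^{bucket} comes from the ESO for the bucket sampling in [Csiba and Richtárik, 2016] and is not reproduced on the slide, so it is left here as a placeholder callback compute_v; the code only illustrates the step "fix a reasonable v^{bucket}, then set p_j proportional to v_j within each bucket".

```python
import numpy as np

def within_bucket_importance(buckets, v):
    """Set p_j proportional to v_j inside each bucket (so each bucket sums to 1)."""
    v = np.asarray(v, dtype=float)
    p = np.zeros_like(v)
    for C_k in buckets:
        idx = np.asarray(C_k)
        p[idx] = v[idx] / v[idx].sum()
    return p

def bucket_importance_probs(buckets, d, compute_v, num_rounds=1):
    """compute_v(p) -> v is a placeholder for the bucket-sampling ESO parameters.

    Start from uniform-within-bucket probabilities; optionally refine a few times,
    since v^bucket itself depends on p^bucket.
    """
    p = within_bucket_importance(buckets, np.ones(d))
    for _ in range(num_rounds):
        v = compute_v(p)
        p = within_bucket_importance(buckets, v)
    return p
```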

Experiments: Convergence
Ridge logistic regression using Quartz [Qu et al., 2015].
Randomly chosen, equally sized buckets C_1, ..., C_\tau.

[Figure: P(w^t) - P(w^*) versus the number of effective passes divided by \tau, for the \tau-nice (uniform) and \tau-imp (importance) samplings with \tau \in \{1, 2, 4, 8, 16, 32\}.]

url dataset: d = 2,396,130, n = 3,231,962, sparsity = 0.04%.

Experiments: Theory vs. Practice
Ridge logistic regression using Quartz [Qu et al., 2015].
Randomly chosen, equally sized buckets C_1, ..., C_\tau.

    Data    | tau = 1   | tau = 2 | tau = 4 | tau = 8 | tau = 16 | tau = 32
    ijcnn1  | 1.2 : ... | ...     | ...     | ...     | ...      | ... : 1.8
    protein | 1.3 : ... | ...     | ...     | ...     | ...      | ... : 1.5
    w8a     | 2.8 : ... | ...     | ...     | ...     | ...      | ... : 1.8
    url     | 3.0 : ... | ...     | ...     | ...     | ...      | ... : 1.7
    aloi    | 13 : ...  | ...     | ...     | ...     | ...      | ... : 6.7

Table: the theoretical : empirical ratios \theta^{(\tau-imp)} / \theta^{(\tau-nice)}.

Conclusion
We introduced a flexible parallel sampling: the bucket sampling.
We used it to obtain an importance sampling for minibatches.
It can be used with various stochastic iterative methods.
Empirically, we match the improvement gained by serial importance sampling.
In theory, we cannot in general guarantee that importance sampling for minibatches will be better than the uniform parallel sampling.

References
Csiba, Dominik and Richtárik, Peter. Importance Sampling for Minibatches. arXiv e-prints, 2016.
Richtárik, Peter and Takáč, Martin. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 2015.
Qu, Zheng and Richtárik, Peter. Coordinate descent with arbitrary sampling II: Expected separable overapproximation. arXiv e-prints, 2014.
Qu, Zheng, Richtárik, Peter and Zhang, Tong. Quartz: Randomized dual coordinate ascent with arbitrary sampling. Advances in Neural Information Processing Systems, 2015.
