Importance Sampling for Minibatches
1 Importance Sampling for Minibatches
Dominik Csiba
School of Mathematics, University of Edinburgh
Birmingham — 1 / 15
2 Acknowledgements
This talk is based on [Csiba and Richtárik, 2016]. Co-author: Peter Richtárik, University of Edinburgh.
3 Introduction
Many supervised learning tasks can be written as optimization problems.

Regularized Loss Minimization (RLM)
Let $f_i : \mathbb{R}^d \to \mathbb{R}$ be the loss function associated with the $i$-th example and $R : \mathbb{R}^d \to \mathbb{R}$ be a regularizer. The RLM problem is
$$w^\star = \arg\min_{w \in \mathbb{R}^d} \Big\{ F(w) \equiv \frac{1}{n} \sum_{i=1}^n f_i(w) + R(w) \Big\}$$

Ridge regression: training set $\{(x_i, y_i)\}_{i=1}^n$, linear regression with square loss $f_i(w) = \frac{1}{2}(\langle x_i, w \rangle - y_i)^2$ and squared $\ell_2$ regularizer $R(w) = \frac{\lambda}{2}\|w\|_2^2$.

Direct solution: usually not available → iterative methods
Batch methods: very expensive iterations → stochastic methods
Most popular methods: stochastic iterative methods
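To make the ridge instance concrete, here is a minimal sketch that evaluates the RLM objective $F$ for ridge regression on synthetic data (the data and function names are illustrative, not from the talk):

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """F(w) = (1/n) * sum_i 0.5*(<x_i, w> - y_i)^2 + (lam/2)*||w||_2^2."""
    residuals = X @ w - y
    loss = 0.5 * np.mean(residuals ** 2)
    reg = 0.5 * lam * np.dot(w, w)
    return loss + reg

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # n = 100 examples, d = 5 features
y = rng.standard_normal(100)
w = np.zeros(5)
print(ridge_objective(w, X, y, lam=0.1))  # with w = 0 this equals 0.5 * mean(y^2)
```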
9 Randomized Coordinate Descent
NSync [Richtárik and Takáč, 2015]
Sampling $\hat{S}$: a random set-valued mapping whose values are subsets of $\{1, \dots, d\}$, such that $P(j \in \hat{S}) > 0$ for every $j \in \{1, \dots, d\}$.
Stepsize parameters $v_1, \dots, v_d > 0$, computable from $(F, \hat{S})$.
Algorithm: on each iteration $t > 0$:
1. sample a random set $S_t$ using $\hat{S}$
2. for each $j \in S_t$, update $w_j^t \leftarrow w_j^{t-1} - \frac{1}{v_j} \frac{\partial F}{\partial w_j}(w^{t-1})$
Serial sampling: $P(|\hat{S}| = 1) = 1$.
Ridge regression using serial sampling: $v_j = \frac{1}{n} \sum_{i=1}^n (x_j^i)^2 + \lambda$.
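As a sanity check of the serial special case, the following sketch (my own, not the authors' code) runs NSync with uniform serial sampling on ridge regression, using the stepsizes $v_j$ from the slide, and compares the iterate against the closed-form ridge solution:

```python
import numpy as np

def nsync_serial(X, y, lam, iters=5000, seed=0):
    """Serial NSync: sample one coordinate j per iteration, step by (1/v_j) * grad_j F(w)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    v = np.mean(X ** 2, axis=0) + lam      # v_j = (1/n) sum_i (x_j^i)^2 + lam
    p = np.full(d, 1.0 / d)                # uniform serial sampling
    w = np.zeros(d)
    for _ in range(iters):
        j = rng.choice(d, p=p)
        grad_j = np.mean((X @ w - y) * X[:, j]) + lam * w[j]
        w[j] -= grad_j / v[j]
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = rng.standard_normal(200)
lam = 0.5
w_cd = nsync_serial(X, y, lam)
# Closed-form ridge minimizer: (X^T X / n + lam*I)^{-1} (X^T y / n)
w_star = np.linalg.solve(X.T @ X / 200 + lam * np.eye(10), X.T @ y / 200)
print(np.max(np.abs(w_cd - w_star)))  # close to 0
```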
12 Expected Separable Overapproximation
Assumption 1: ESO [Qu and Richtárik, 2014]
Let the objective $F$ and the sampling $\hat{S}$ be given and let $p_j \equiv P(j \in \hat{S})$. Let $h_{[\hat{S}]}$ be the vector with entries
$$(h_{[\hat{S}]})_j = \begin{cases} h_j & \text{if } j \in \hat{S} \\ 0 & \text{if } j \notin \hat{S} \end{cases}$$
The parameters $v_1, \dots, v_d$ satisfy the ESO if for all $w, h \in \mathbb{R}^d$ it holds that
$$\mathbb{E}\big[F(w + h_{[\hat{S}]})\big] \le F(w) + \sum_{j=1}^d p_j h_j \frac{\partial F}{\partial w_j}(w) + \frac{1}{2} \sum_{j=1}^d p_j v_j h_j^2,$$
and we write $(v_1, \dots, v_d) \in \mathrm{ESO}(F, \hat{S})$.
Interpretation: the expectation of $F$ at the update behaves smoothly.
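For a serial sampling the expectation is simply $\mathbb{E}[F(w + h_{[\hat{S}]})] = \sum_j p_j F(w + h_j e_j)$, and for a quadratic objective such as ridge regression the ESO with the $v_j$ from the previous slide holds with equality. A small numerical sketch (my own, with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 50, 6, 0.3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def F(w):
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * np.dot(w, w)

w = rng.standard_normal(d)
h = rng.standard_normal(d)
p = np.full(d, 1.0 / d)                 # uniform serial sampling
v = np.mean(X ** 2, axis=0) + lam       # stepsize parameters for ridge regression
grad = X.T @ (X @ w - y) / n + lam * w

# Left side: expectation over the serial sampling; right side: the ESO bound.
lhs = sum(p[j] * F(w + h[j] * np.eye(d)[j]) for j in range(d))
rhs = F(w) + np.dot(p * h, grad) + 0.5 * np.dot(p * v, h ** 2)
print(lhs <= rhs + 1e-12)  # True (with equality, since F is quadratic)
```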
14 Strong Convexity
Assumption 2: Strong Convexity
Function $F$ is $\lambda$-strongly convex if for all $w, h \in \mathbb{R}^d$ it holds that
$$F(w + h) \ge F(w) + \langle \nabla F(w), h \rangle + \frac{\lambda}{2} \|h\|_2^2$$
Example: $R(w) = \frac{\lambda}{2}\|w\|_2^2$ is $\lambda$-strongly convex.
Lemma: convex + strongly convex → strongly convex.
Ridge logistic regression is $\lambda$-strongly convex:
$$F(w) = \frac{1}{n} \sum_{i=1}^n \log\big(1 + \exp(-y_i \langle w, x_i \rangle)\big) + \frac{\lambda}{2}\|w\|_2^2$$
Ridge regression is $\lambda$-strongly convex.
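The strong-convexity inequality can be spot-checked numerically for ridge logistic regression; this is an illustrative sketch with synthetic data (function names are my own):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 40, 5, 0.2
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

def F(w):
    """Ridge logistic regression objective."""
    return np.mean(np.log1p(np.exp(-y * (X @ w)))) + 0.5 * lam * np.dot(w, w)

def grad_F(w):
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # sigma(-y_i <w, x_i>)
    return -(X.T @ (y * s)) / n + lam * w

# Check F(w + h) >= F(w) + <grad F(w), h> + (lam/2)||h||^2 on random pairs.
for _ in range(100):
    w, h = rng.standard_normal(d), rng.standard_normal(d)
    assert F(w + h) >= F(w) + grad_F(w) @ h + 0.5 * lam * (h @ h) - 1e-10
print("strong convexity inequality held on 100 random pairs")
```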
19 Convergence Rate
Theorem [Richtárik and Takáč, 2015]
Let the objective $F$ and the sampling $\hat{S}$ be such that $F$ is $\lambda$-strongly convex and $(v_1, \dots, v_d) \in \mathrm{ESO}(F, \hat{S})$, and let $v \equiv [v_1, \dots, v_d]$. Let $p_j = P(j \in \hat{S})$ and $p \equiv [p_1, \dots, p_d]$. Define
$$C(p, v) \equiv \max_{j \in \{1, \dots, d\}} \frac{v_j}{p_j \lambda},$$
and let $\{w^t\}_{t=1}^\infty$ be a sequence generated by NSync. Then
$$T \ge C(p, v) \log\Big(\frac{C}{\epsilon}\Big) \quad \Rightarrow \quad \mathbb{E}[F(w^T) - F(w^\star)] \le \epsilon,$$
where $C$ is a constant depending on $w^0$ and $w^\star$.
Key quantity: $C(p, v)$.
21 Serial Importance Sampling
Let $\hat{S}$ be a serial sampling defined by $P(\hat{S} = \{j\}) = p_j$, where $\sum_j p_j = 1$.
Ridge regression using serial sampling: $v_j = \frac{1}{n} \sum_{i=1}^n (x_j^i)^2 + \lambda$.
Question: how should we choose $p$ so that the complexity
$$C(p, v) = \max_{j \in \{1, \dots, d\}} \frac{v_j}{p_j \lambda}$$
is minimized?

sampling    | p_j               | C(p, v)
uniform     | 1/d               | (d max_j v_j) / λ
importance  | v_j / (Σ_k v_k)   | (Σ_k v_k) / λ

The speedup of importance over uniform is the ratio $\max_j v_j / \mathrm{mean}(v)$.
Note: importance sampling is the optimal fixed serial sampling.
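The two rows of the table can be reproduced numerically; the vector $v$ below is illustrative (one dominant coordinate), not from the talk:

```python
import numpy as np

lam = 0.1
v = np.array([1.0, 1.0, 1.0, 10.0])       # one "hard" coordinate dominates
d = len(v)

C_uniform = d * v.max() / lam             # p_j = 1/d
p_imp = v / v.sum()                       # importance probabilities
C_importance = np.max(v / (p_imp * lam))  # = (sum_k v_k) / lam

print(C_uniform, C_importance, C_uniform / C_importance)
# speedup = d * max(v) / sum(v) = max(v) / mean(v)
```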
28 Minibatch Coordinate Descent
Assume we have $\tau$ cores and we want to make use of all of them.
Parallel sampling: $P(|\hat{S}| = \tau) = 1$; each core processes one coordinate.
Question: how do we design an importance sampling for minibatches?
Complications:
- for serial sampling we had $P(\hat{S} = \{j\}) = P(j \in \hat{S})$, where the left-hand side is the parameter; this is not the case for parallel samplings
- we need to choose $\binom{d}{\tau}$ set-probabilities, instead of just $\binom{d}{1} = d$
- all of them influence the values $P(j \in \hat{S})$
Goal: find a parallel sampling with direct control over $p_j = P(j \in \hat{S})$.
Solution: we introduce the bucket sampling.
37 Bucket Sampling
Idea: divide the coordinates $\{1, \dots, d\}$ into $\tau$ buckets and choose one coordinate non-uniformly from each bucket.

Bucket sampling
Let $C_1, \dots, C_\tau$ be an arbitrary partition of $\{1, \dots, d\}$. Choose probabilities $p_j$ such that $\sum_{j \in C_k} p_j = 1$ for each $k \in \{1, \dots, \tau\}$. The bucket sampling is the procedure that outputs one coordinate from each $C_k$, picking $j \in C_k$ with probability $p_j$.

Example: let $\tau = 4$ and $d = 12$,
p = [ 0.1, 0.4, 0.2, 0.3 | 1.0 | 0.1, 0.1, 0.1, 0.1, 0.6 | 0.5, 0.5 ]
          C_1               C_2            C_3               C_4

Observe: it holds that $P(j \in \hat{S}) = p_j$.
Note: both $p_1, \dots, p_d$ and $C_1, \dots, C_\tau$ are parameters of the sampling.
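A minimal sketch of the bucket sampling procedure (function names are mine), using the partition and probabilities from the example above; the empirical frequencies confirm $P(j \in \hat{S}) = p_j$:

```python
import numpy as np

def bucket_sample(buckets, p, rng):
    """buckets: list of index arrays partitioning {0,...,d-1}; p: length-d probabilities
    summing to 1 within each bucket. Returns one coordinate per bucket."""
    return [rng.choice(C, p=p[C]) for C in buckets]

rng = np.random.default_rng(4)
# The example from the slide: tau = 4, d = 12 (0-based indices)
p = np.array([0.1, 0.4, 0.2, 0.3,  1.0,  0.1, 0.1, 0.1, 0.1, 0.6,  0.5, 0.5])
buckets = [np.arange(0, 4), np.arange(4, 5), np.arange(5, 10), np.arange(10, 12)]

# Empirically, each coordinate j appears with frequency close to p_j.
trials = 20000
counts = np.zeros(12)
for _ in range(trials):
    for j in bucket_sample(buckets, p, rng):
        counts[j] += 1
print(np.round(counts / trials, 2))  # approximately p
```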
42 Importance Sampling for Minibatches
Goal: we aim to minimize
$$C(p^{\mathrm{bucket}}, v^{\mathrm{bucket}}) = \max_{j \in \{1, \dots, d\}} \frac{v_j^{\mathrm{bucket}}}{\lambda \, p_j^{\mathrm{bucket}}}$$
Issue: the values $v^{\mathrm{bucket}}$ depend on $p^{\mathrm{bucket}}$, i.e., $v_j^{\mathrm{bucket}} = v_j^{\mathrm{bucket}}(p^{\mathrm{bucket}})$.
Optimal probabilities: the joint optimization is difficult.
Strategy: choose a reasonable $v^{\mathrm{bucket}}$ → pick $p_j^{\mathrm{bucket}} \propto v_j^{\mathrm{bucket}}$.
In theory: we cannot guarantee a speedup, just convergence.
In practice: we will see.
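The strategy above normalizes the probabilities per bucket rather than globally, since each bucket's probabilities must sum to 1. A sketch (the vector $v$ is a placeholder for a "reasonable" choice of $v^{\mathrm{bucket}}$, and the names are mine):

```python
import numpy as np

def bucket_importance_probs(v, buckets):
    """Set p_j proportional to v_j within each bucket, so each bucket sums to 1."""
    p = np.zeros_like(v, dtype=float)
    for C in buckets:
        p[C] = v[C] / v[C].sum()
    return p

v = np.array([1.0, 3.0, 2.0, 2.0, 5.0, 1.0])
buckets = [np.arange(0, 3), np.arange(3, 6)]
p = bucket_importance_probs(v, buckets)
print(p)  # ~ [0.167, 0.5, 0.333, 0.25, 0.625, 0.125]
```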
48 Experiments: Convergence
Ridge logistic regression using Quartz [Qu et al., 2015].
Randomly chosen, equally sized buckets $C_1, \dots, C_\tau$.
[Figure: $P(w^t) - P(w^\star)$ versus number of effective passes / $\tau$, for the $\tau$-nice and $\tau$-imp samplings with $\tau \in \{1, 2, 4, 8, 16, 32\}$]
url dataset: $d = 2{,}396{,}130$, $n = 3{,}231{,}962$, sparsity = 0.04%
51 Experiments: Theory vs. Practice
Ridge logistic regression using Quartz [Qu et al., 2015].
Randomly chosen, equally sized buckets $C_1, \dots, C_\tau$.

Data    | τ = 1 | τ = 2 | τ = 4 | τ = 8 | τ = 16 | τ = 32
ijcnn1  | 1.2   |   …   |   …   |   …   |   …    | 1.8
protein | 1.3   |   …   |   …   |   …   |   …    | 1.5
w8a     | 2.8   |   …   |   …   |   …   |   …    | 1.8
url     | 3.0   |   …   |   …   |   …   |   …    | 1.7
aloi    | 13    |   …   |   …   |   …   |   …    | 6.7

Table: the theoretical : empirical ratios $\theta^{(\tau\text{-imp})} / \theta^{(\tau\text{-nice})}$.
52 Conclusion
- we introduced a flexible parallel sampling: the bucket sampling
- we used it to obtain an importance sampling for minibatches
- it can be used with various stochastic iterative methods
- we empirically match the improvement gained by serial importance sampling
- in theory, we cannot in general guarantee that importance sampling for minibatches will be better than the uniform parallel sampling
58 References
Csiba, Dominik and Richtárik, Peter. Importance Sampling for Minibatches. arXiv e-prints, 2016.
Richtárik, Peter and Takáč, Martin. On Optimal Probabilities in Stochastic Coordinate Descent Methods. Optimization Letters, 2015.
Qu, Zheng and Richtárik, Peter. Coordinate Descent with Arbitrary Sampling II: Expected Separable Overapproximation. arXiv e-prints, 2014.
Qu, Zheng, Richtárik, Peter and Zhang, Tong. Quartz: Randomized Dual Coordinate Ascent with Arbitrary Sampling. Advances in Neural Information Processing Systems, 2015.
Journal of Machine Learning Research 9 (208) -49 Submitted 0/6; Published 7/8 CoCoA: A General Framework for Communication-Efficient Distributed Optimization arxiv:6.0289v2 [cs.lg] 0 Oct 208 Virginia Smith
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationAdaptive Probabilities in Stochastic Optimization Algorithms
Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights
More informationAccelerating SVRG via second-order information
Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical
More informationStochastic Dual Coordinate Ascent with Adaptive Probabilities
Dominik Csiba Zheng Qu Peter Richtárik University of Edinburgh CDOMINIK@GMAIL.COM ZHENG.QU@ED.AC.UK PETER.RICHTARIK@ED.AC.UK Abstract This paper introduces AdaSDCA: an adaptive variant of stochastic dual
More informationFast Stochastic Optimization Algorithms for ML
Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2
More informationDistributed Box-Constrained Quadratic Optimization for Dual Linear SVM
Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Lee, Ching-pei University of Illinois at Urbana-Champaign Joint work with Dan Roth ICML 2015 Outline Introduction Algorithm Experiments
More informationParallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence
Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationNeural Networks and Deep Learning
Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationCSC321 Lecture 7: Optimization
CSC321 Lecture 7: Optimization Roger Grosse Roger Grosse CSC321 Lecture 7: Optimization 1 / 25 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:
More informationBlock stochastic gradient update method
Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic
More informationFaster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)
Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) 2 How many bits do you need to represent a single number in machine
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationMultilayer Perceptron
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4
More informationCS229 Supplemental Lecture notes
CS229 Supplemental Lecture notes John Duchi Binary classification In binary classification problems, the target y can take on at only two values. In this set of notes, we show how to model this problem
More informationNeural Networks: Backpropagation
Neural Networks: Backpropagation Seung-Hoon Na 1 1 Department of Computer Science Chonbuk National University 2018.10.25 eung-hoon Na (Chonbuk National University) Neural Networks: Backpropagation 2018.10.25
More informationSequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015
Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationARock: an algorithmic framework for asynchronous parallel coordinate updates
ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,
More informationLecture 26: Neural Nets
Lecture 26: Neural Nets ECE 417: Multimedia Signal Processing Mark Hasegawa-Johnson University of Illinois 11/30/2017 1 Intro 2 Knowledge-Based Design 3 Error Metric 4 Gradient Descent 5 Simulated Annealing
More informationLogistic Regression. Stochastic Gradient Descent
Tutorial 8 CPSC 340 Logistic Regression Stochastic Gradient Descent Logistic Regression Model A discriminative probabilistic model for classification e.g. spam filtering Let x R d be input and y { 1, 1}
More informationSparse Principal Component Analysis via Alternating Maximization and Efficient Parallel Implementations
Sparse Principal Component Analysis via Alternating Maximization and Efficient Parallel Implementations Martin Takáč The University of Edinburgh Joint work with Peter Richtárik (Edinburgh University) Selin
More informationMachine Learning: Logistic Regression. Lecture 04
Machine Learning: Logistic Regression Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Supervised Learning Task = learn an (unkon function t : X T that maps input
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationMachine Learning. Linear Models. Fabio Vandin October 10, 2017
Machine Learning Linear Models Fabio Vandin October 10, 2017 1 Linear Predictors and Affine Functions Consider X = R d Affine functions: L d = {h w,b : w R d, b R} where ( d ) h w,b (x) = w, x + b = w
More informationNesterov s Acceleration
Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationIs the test error unbiased for these programs? 2017 Kevin Jamieson
Is the test error unbiased for these programs? 2017 Kevin Jamieson 1 Is the test error unbiased for this program? 2017 Kevin Jamieson 2 Simple Variable Selection LASSO: Sparse Regression Machine Learning
More informationLecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning
Lecture 0 Neural networks and optimization Machine Learning and Data Mining November 2009 UBC Gradient Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function
More informationMachine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5
Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5 Slides adapted from Jordan Boyd-Graber, Tom Mitchell, Ziv Bar-Joseph Machine Learning: Chenhao Tan Boulder 1 of 27 Quiz question For
More informationSDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization
Zheng Qu ZHENGQU@HKU.HK Department of Mathematics, The University of Hong Kong, Hong Kong Peter Richtárik PETER.RICHTARIK@ED.AC.UK School of Mathematics, The University of Edinburgh, UK Martin Takáč TAKAC.MT@GMAIL.COM
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationIntroduction to Logistic Regression and Support Vector Machine
Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel
More informationRandomized Smoothing Techniques in Optimization
Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter Bartlett, Michael Jordan, Martin Wainwright, Andre Wibisono Stanford University Information Systems Laboratory
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationEstimators based on non-convex programs: Statistical and computational guarantees
Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright
More informationProbabilistic Graphical Models & Applications
Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with
More informationLogistic Regression. William Cohen
Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationNeural networks and optimization
Neural networks and optimization Nicolas Le Roux INRIA 8 Nov 2011 Nicolas Le Roux (INRIA) Neural networks and optimization 8 Nov 2011 1 / 80 1 Introduction 2 Linear classifier 3 Convolutional neural networks
More informationRandomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming
Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationModern Optimization Techniques
Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationSummary and discussion of: Dropout Training as Adaptive Regularization
Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial
More informationEngineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers
Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationA Parallel SGD method with Strong Convergence
A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,
More informationThe Randomized Newton Method for Convex Optimization
The Randomized Newton Method for Convex Optimization Vaden Masrani UBC MLRG April 3rd, 2018 Introduction We have some unconstrained, twice-differentiable convex function f : R d R that we want to minimize:
More informationMachine Learning. Regularization and Feature Selection. Fabio Vandin November 14, 2017
Machine Learning Regularization and Feature Selection Fabio Vandin November 14, 2017 1 Regularized Loss Minimization Assume h is defined by a vector w = (w 1,..., w d ) T R d (e.g., linear models) Regularization
More informationOn Nesterov s Random Coordinate Descent Algorithms
On Nesterov s Random Coordinate Descent Algorithms Zheng Xu University of Texas At Arlington February 19, 2015 1 Introduction Full-Gradient Descent Coordinate Descent 2 Random Coordinate Descent Algorithm
More information1 Regression with High Dimensional Data
6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:
More informationGradient Boosting (Continued)
Gradient Boosting (Continued) David Rosenberg New York University April 4, 2016 David Rosenberg (New York University) DS-GA 1003 April 4, 2016 1 / 31 Boosting Fits an Additive Model Boosting Fits an Additive
More informationStochastic Quasi-Newton Methods
Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent
More informationIntroduction to Logistic Regression
Introduction to Logistic Regression Guy Lebanon Binary Classification Binary classification is the most basic task in machine learning, and yet the most frequent. Binary classifiers often serve as the
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationarxiv: v2 [stat.ml] 16 Jun 2015
Semi-Stochastic Gradient Descent Methods Jakub Konečný Peter Richtárik arxiv:1312.1666v2 [stat.ml] 16 Jun 2015 School of Mathematics University of Edinburgh United Kingdom June 15, 2015 (first version:
More informationReinforcement Learning
Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationAccelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization
Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu
More informationRegression with Numerical Optimization. Logistic
CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204
More information