Stochastic Optimization
1 Stochastic Optimization
  Lijun Zhang
  LAMDA, Nanjing University, China
  May 26, 2017
  Sections: Introduction | Related Work | SGD | Epoch-GD
2-3 Outline
  1 Introduction
  2 Related Work
  3 Stochastic Gradient Descent: Expectation Bound; High-probability Bound
  4 Epoch Gradient Descent: Expectation Bound; High-probability Bound
4-5 Definitions [Nemirovski et al., 2009]
  min_{w \in W} F(w) = E_\xi[f(w, \xi)] = \int_\Xi f(w, \xi) dp(\xi)
  - \xi is a random variable
  - The challenge: evaluation of the expectation/integration
  Supervised Learning [Vapnik, 1998]
  min_{h \in H} F(h) = E_{(x,y)}[l(h(x), y)]
  - (x, y) is a random instance-label pair
  - H = {h : X -> R} is a hypothesis class
  - l(., .) is a certain loss, e.g., the hinge loss l(u, v) = max(0, 1 - uv)
6-7 Optimization Methods
  Sample Average Approximation (SAA) / Empirical Risk Minimization (ERM)
    min_{w \in W} (1/n) \sum_{i=1}^n f(w, \xi_i)
    where \xi_1, ..., \xi_n are i.i.d.
    min_{h \in H} (1/n) \sum_{i=1}^n l(h(x_i), y_i)
    where (x_1, y_1), ..., (x_n, y_n) are i.i.d.
  Stochastic Approximation (SA)
    Optimization via noisy observations of F(.)
    - Zero-order SA [Nesterov, 2011]
    - First-order SA, e.g., SGD [Kushner and Yin, 2003]
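As a concrete illustration of the two approaches, the sketch below solves a hypothetical one-dimensional least-squares problem both ways: SAA/ERM minimizes the empirical average over all samples at once, while first-order SA (SGD) processes one noisy gradient per sample. The problem, step-size constant, and seed are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: F(w) = E[(x*w - y)^2] with y = 2x + noise, so the minimizer is near w* = 2.
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + 0.1 * rng.normal(size=n)

# SAA / ERM: minimize the empirical average (closed form for least squares).
w_erm = (x @ y) / (x @ x)

# SA / SGD: one unbiased stochastic gradient per sample, step size 0.1 / sqrt(t).
w = 0.0
for t in range(1, n + 1):
    g = 2.0 * (x[t - 1] * w - y[t - 1]) * x[t - 1]  # gradient of one-sample loss
    w -= 0.1 / np.sqrt(t) * g
w_sgd = w
```

Both estimates land close to the population minimizer; ERM uses all samples jointly, SGD touches each sample once.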
8 Performance
  Optimization Error / Excess Risk: F(\bar{w}) - min_{w \in W} F(w)
  Typical rates: O(1/\sqrt{n}), O(1/n), O(1/n^2) for empirical risk minimization;
  O(1/\sqrt{T}), O(1/T) for stochastic approximation
  - n is the number of samples in empirical risk minimization
  - T is the number of stochastic gradients in first-order stochastic approximation
10 Related Work
  Empirical Risk Minimization: [Shalev-Shwartz et al., 2009] [Koren and Levy, 2015] [Feldman, 2016] [Zhang et al., 2017]
  Stochastic Approximation: [Zinkevich, 2003] [Hazan et al., 2007] [Hazan and Kale, 2011] [Rakhlin et al., 2012] [Lan, 2012] [Agarwal et al., 2012] [Shamir and Zhang, 2013] [Mahdavi et al., 2015] [Bach and Moulines, 2013] ...
  Supervised Learning: [Vapnik, 1998] [Lee et al., 1996] [Panchenko, 2002] [Bartlett and Mendelson, 2002] [Tsybakov, 2004] [Bartlett et al., 2005] [Sridharan et al., 2009] [Srebro et al., 2010] ...
11-19 Risk Bounds of Empirical Risk Minimization
  Supervised learning:
  - Lipschitz: O(1/\sqrt{n}) [Bartlett and Mendelson, 2002]
  - Smooth: O(1/n) [Srebro et al., 2010]
  - Strongly Convex: O(1/(\lambda n)) [Sridharan et al., 2009]
  - Strongly Convex & Smooth: O(1/(\lambda n^2)) [Zhang et al., 2017]
  Stochastic convex optimization:
  - Lipschitz: O(\sqrt{d/n}) [Shalev-Shwartz et al., 2009]
  - Smooth: O(d/n) [Zhang et al., 2017]
  - Strongly Convex: O(1/(\lambda n)) in expectation [Shalev-Shwartz et al., 2009]; O(d/n + 1/(\lambda n)) [Zhang et al., 2017]
  - Strongly Convex & Smooth: O(1/(\lambda n^2)) [Zhang et al., 2017]
20-25 Risk Bounds of Stochastic Approximation
  Supervised learning:
  - Lipschitz: O(1/\sqrt{n}) [Zinkevich, 2003]
  - Smooth: O(1/n) [Srebro et al., 2010]
  - Strongly Convex: O(1/(\lambda n)) [Hazan and Kale, 2011]
  - Square or logistic loss: O(1/n) [Bach and Moulines, 2013]
27-29 Stochastic Gradient Descent
  The Algorithm
  1: for t = 1, ..., T do
  2:   w'_{t+1} = w_t - \eta_t g_t
  3:   w_{t+1} = \Pi_W(w'_{t+1})
  4: end for
  5: return \bar{w}_T = (1/T) \sum_{t=1}^T w_t
  Requirement: E_t[g_t] = \nabla F(w_t), where E_t[.] is the conditional expectation.
  Stochastic optimization min_{w \in W} F(w) = E_\xi[f(w, \xi)]:
    sample \xi_t from the underlying distribution and set g_t = \nabla f(w_t, \xi_t).
  Supervised learning min_{h \in H} F(h) = E_{(x,y)}[l(h(x), y)]:
    sample (x_t, y_t) from the underlying distribution and set g_t = \nabla l(h_t(x_t), y_t).
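The algorithm above can be sketched in a few lines of Python. The projection set (a Euclidean ball), the toy objective, and all constants are illustrative assumptions, not part of the slides.

```python
import numpy as np

def project_ball(w, radius):
    """Euclidean projection Pi_W onto W = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def sgd(grad_oracle, w1, radius, T, eta):
    """Projected SGD returning the averaged iterate bar{w}_T.

    grad_oracle(w) must be unbiased: E_t[g_t] = grad F(w_t).
    """
    w = w1.copy()
    total = np.zeros_like(w1)
    for _ in range(T):
        total += w                                 # accumulate w_t before the update
        g = grad_oracle(w)
        w = project_ball(w - eta * g, radius)      # steps 2-3 of the algorithm
    return total / T                               # step 5: bar{w}_T

# Toy instance: F(w) = E[0.5 * ||w - xi||^2], xi ~ N(w*, 0.01 I); grad f = w - xi.
rng = np.random.default_rng(0)
w_star = np.array([0.5, -0.3])
oracle = lambda w: w - (w_star + 0.1 * rng.normal(size=2))
T = 5000
w_bar = sgd(oracle, np.zeros(2), radius=1.0, T=T, eta=1.0 / np.sqrt(T))
```

With the constant step size eta = c/sqrt(T) used later in the expectation bound, the averaged iterate ends up close to the minimizer w*.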
31 Analysis I
  For any w \in W, by convexity we have
  F(w_t) - F(w)
  \le \langle \nabla F(w_t), w_t - w \rangle
  = \langle g_t, w_t - w \rangle + \underbrace{\langle \nabla F(w_t) - g_t, w_t - w \rangle}_{\delta_t}
  = (1/\eta_t) \langle w_t - w'_{t+1}, w_t - w \rangle + \delta_t
  = (1/(2\eta_t)) ( \|w_t - w\|_2^2 - \|w'_{t+1} - w\|_2^2 + \|w_t - w'_{t+1}\|_2^2 ) + \delta_t
  = (1/(2\eta_t)) ( \|w_t - w\|_2^2 - \|w'_{t+1} - w\|_2^2 ) + (\eta_t/2) \|g_t\|_2^2 + \delta_t
  \le (1/(2\eta_t)) ( \|w_t - w\|_2^2 - \|w_{t+1} - w\|_2^2 ) + (\eta_t/2) \|g_t\|_2^2 + \delta_t
  where the last step uses the non-expansiveness of the projection \Pi_W.
32 Analysis II
  To simplify the above inequality, we assume \eta_t = \eta and \|g_t\|_2 \le G. Then, we have
  F(w_t) - F(w) \le (1/(2\eta)) ( \|w_t - w\|_2^2 - \|w_{t+1} - w\|_2^2 ) + (\eta/2) G^2 + \delta_t
  By summing the inequalities over all iterations, we have
  \sum_{t=1}^T F(w_t) - T F(w) \le (1/(2\eta)) \|w_1 - w\|_2^2 + (\eta T/2) G^2 + \sum_{t=1}^T \delta_t
33 Analysis III
  By Jensen's inequality,
  F(\bar{w}_T) - F(w) = F( (1/T) \sum_{t=1}^T w_t ) - F(w)
  \le (1/T) \sum_{t=1}^T ( F(w_t) - F(w) )
  \le (1/T) ( (1/(2\eta)) \|w_1 - w\|_2^2 + (\eta T/2) G^2 + \sum_{t=1}^T \delta_t )
  In summary, we have
  F(\bar{w}_T) - F(w) \le (1/(2\eta T)) \|w_1 - w\|_2^2 + (\eta/2) G^2 + (1/T) \sum_{t=1}^T \delta_t
34 Expectation Bound
  Under the condition E_t[g_t] = \nabla F(w_t), it is easy to verify
  E[\delta_t] = E[ \langle \nabla F(w_t) - g_t, w_t - w \rangle ]
  = E_{1,...,t-1}[ E_t[ \langle \nabla F(w_t) - g_t, w_t - w \rangle ] ] = 0
  Thus, by taking expectation over both sides, we have
  E[F(\bar{w}_T)] - F(w) \le (1/(2\eta T)) \|w_1 - w\|_2^2 + (\eta/2) G^2
  = (1/(2\sqrt{T})) ( (1/c) \|w_1 - w\|_2^2 + c G^2 )   with \eta = c/\sqrt{T}
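The step-size constant can be tuned: minimizing the right-hand side over c (a standard step the slide leaves implicit) gives

```latex
c = \frac{\|w_1 - w\|_2}{G}
\quad\Longrightarrow\quad
\mathbb{E}[F(\bar{w}_T)] - F(w) \le \frac{\|w_1 - w\|_2 \, G}{\sqrt{T}} .
```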
36 Martingale
  A sequence of random variables Y_1, Y_2, Y_3, ... that satisfies, for any time t,
  E[|Y_t|] < \infty  and  E[Y_{t+1} | Y_1, ..., Y_t] = Y_t
  An Example
  Y_t is a gambler's fortune after t tosses of a fair coin. Let \delta_i be a random
  variable such that \delta_i = 1 if the coin comes up heads and \delta_i = -1 if it
  comes up tails. Then Y_t = \sum_{i=1}^t \delta_i, and
  E[Y_{t+1} | Y_1, ..., Y_t] = E[\delta_{t+1} + Y_t | Y_1, ..., Y_t]
  = Y_t + E[\delta_{t+1} | Y_1, ..., Y_t] = Y_t
37 Martingale Difference
  Suppose Y_1, Y_2, Y_3, ... is a martingale. Then X_t = Y_t - Y_{t-1} is a
  martingale difference sequence:
  E[X_{t+1} | X_1, ..., X_t] = E[Y_{t+1} - Y_t | X_1, ..., X_t] = 0
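The coin-toss example can be checked numerically: conditioning on the fortune at some time t, the average fortune one step later is unchanged. The number of paths, horizon, conditioning time, and seed are arbitrary choices for the simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the gambler's fortune Y_t = sum_{i<=t} delta_i over many paths.
n_paths, T = 100_000, 20
deltas = rng.choice([-1, 1], size=(n_paths, T))
Y = deltas.cumsum(axis=1)

# Martingale property: E[Y_{t+1} | Y_t = y] = y. Check it at t = 10 with y = 2
# by averaging Y_11 over all paths whose fortune at t = 10 equals 2.
mask = Y[:, 9] == 2
cond_mean = Y[mask, 10].mean()
```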
38 Azuma's Inequality for Martingales
  Suppose X_1, X_2, ... is a martingale difference sequence with |X_i| \le c_i almost
  surely. Then, for all t > 0, we have
  Pr( \sum_{i=1}^n X_i \ge t ) \le \exp( -t^2 / (2 \sum_{i=1}^n c_i^2) )
  Pr( \sum_{i=1}^n X_i \le -t ) \le \exp( -t^2 / (2 \sum_{i=1}^n c_i^2) )
  Corollary 1
  Assume c_i \le c. With probability at least 1 - 2\delta, we have
  | \sum_{i=1}^n X_i | \le c \sqrt{2 n \log(1/\delta)}
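Azuma's inequality can be sanity-checked on the fair-coin martingale from the previous slide (here c_i = 1): the empirical tail probability sits below the exponential bound. The number of paths, horizon, and threshold are arbitrary simulation choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Martingale difference sequence with |X_i| <= 1 (fair coin increments).
n_paths, n = 100_000, 100
X = rng.choice([-1, 1], size=(n_paths, n))
S = X.sum(axis=1)

t = 30
empirical_tail = (S >= t).mean()
# Azuma: Pr(S >= t) <= exp(-t^2 / (2 * sum c_i^2)) with c_i = 1.
azuma_bound = np.exp(-t**2 / (2 * n))
```

The bound is loose here (the true Gaussian tail is smaller), but it holds, which is all the high-probability analysis needs.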
39 High-probability Bound I
  Assume \|g_t\|_2 \le G and \|x - y\|_2 \le D for all x, y \in W.
  Under the condition E_t[g_t] = \nabla F(w_t), it is easy to verify that
  \delta_t = \langle \nabla F(w_t) - g_t, w_t - w \rangle
  is a martingale difference sequence with
  |\delta_t| \le \|\nabla F(w_t) - g_t\|_2 \|w_t - w\|_2 \le D ( \|\nabla F(w_t)\|_2 + \|g_t\|_2 ) \le 2GD
  where we use Jensen's inequality:
  \|\nabla F(w_t)\|_2 = \|E_t[g_t]\|_2 \le E_t[ \|g_t\|_2 ] \le G
40 High-probability Bound II
  From Azuma's inequality, with probability at least 1 - \delta,
  \sum_{t=1}^T \delta_t \le 2GD \sqrt{2 T \log(1/\delta)}
  Thus, with probability at least 1 - \delta,
  F(\bar{w}_T) - F(w) \le (1/(2\eta T)) \|w_1 - w\|_2^2 + (\eta/2) G^2 + (1/T) \sum_{t=1}^T \delta_t
  \le (1/(2\eta T)) \|w_1 - w\|_2^2 + (\eta/2) G^2 + 2GD \sqrt{2 \log(1/\delta) / T}
  = (1/\sqrt{T}) ( (1/(2c)) \|w_1 - w\|_2^2 + (c/2) G^2 + 2GD \sqrt{2 \log(1/\delta)} )   with \eta = c/\sqrt{T}
  = O(1/\sqrt{T})
42-43 Strong Convexity
  A General Definition
  F is \lambda-strongly convex if, for all \alpha \in [0, 1] and w, w' \in W,
  F(\alpha w + (1 - \alpha) w') \le \alpha F(w) + (1 - \alpha) F(w') - (\lambda/2) \alpha (1 - \alpha) \|w - w'\|_2^2
  A More Popular Definition
  F is \lambda-strongly convex if, for all w, w' \in W,
  F(w) + \langle \nabla F(w), w' - w \rangle + (\lambda/2) \|w' - w\|_2^2 \le F(w')
  An Important Property
  If F is \lambda-strongly convex, then
  (\lambda/2) \|w - w_*\|_2^2 \le F(w) - F(w_*)
  where w_* = argmin_{w \in W} F(w).
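The property is easy to verify numerically on a hypothetical toy objective (not from the slides): F(w) = 0.5 w^T A w is lambda-strongly convex with lambda equal to the smallest eigenvalue of A.

```python
import numpy as np

rng = np.random.default_rng(0)

# F(w) = 0.5 * w^T A w with A = diag(1, 3): strongly convex with lambda = 1
# (the smallest eigenvalue of A), minimized over R^2 at w* = 0.
A = np.diag([1.0, 3.0])
lam = 1.0
w_star = np.zeros(2)
F = lambda w: 0.5 * w @ A @ w

# Check (lambda/2) ||w - w*||^2 <= F(w) - F(w*) on random points.
ws = rng.normal(size=(1000, 2))
lhs = 0.5 * lam * np.sum((ws - w_star) ** 2, axis=1)
rhs = np.array([F(w) - F(w_star) for w in ws])
holds = bool(np.all(lhs <= rhs + 1e-12))
```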
44-45 Epoch Gradient Descent
  The Algorithm [Hazan and Kale, 2011]
  1: T_1 = 4, \eta_1 = 1/\lambda, k = 1
  2: while \sum_{i=1}^k T_i \le T do
  3:   for t = 1, ..., T_k do
  4:     w^k_{t+1} = \Pi_W( w^k_t - \eta_k g^k_t )
  5:   end for
  6:   w^{k+1}_1 = (1/T_k) \sum_{t=1}^{T_k} w^k_t
  7:   \eta_{k+1} = \eta_k / 2, T_{k+1} = 2 T_k
  8:   k = k + 1
  9: end while
  Our Goal: establish an O(1/T) excess risk bound.
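A minimal Python sketch of the epoch schedule above, run on a hypothetical 1-strongly-convex toy objective; the oracle, the trivial projection, and all constants are illustrative assumptions.

```python
import numpy as np

def epoch_gd(grad_oracle, w1, project, lam, T):
    """Epoch-GD sketch: epoch k runs T_k SGD steps with step size eta_k;
    the epoch average seeds the next epoch, then eta halves and T_k doubles."""
    w_start, T_k, eta_k, used = w1.copy(), 4, 1.0 / lam, 0
    while used + T_k <= T:                     # line 2
        w, total = w_start.copy(), np.zeros_like(w_start)
        for _ in range(T_k):                   # lines 3-5
            total += w
            w = project(w - eta_k * grad_oracle(w))
        w_start = total / T_k                  # line 6: epoch average
        used += T_k
        eta_k, T_k = eta_k / 2, 2 * T_k        # line 7
    return w_start

# Toy instance: F(w) = E[0.5 * ||w - xi||^2], xi ~ N(w*, 0.01 I), so lambda = 1
# and grad f(w, xi) = w - xi. No projection is needed (W = R^2).
rng = np.random.default_rng(0)
w_star = np.array([0.5, -0.3])
oracle = lambda w: w - (w_star + 0.1 * rng.normal(size=2))
w_hat = epoch_gd(oracle, np.zeros(2), project=lambda w: w, lam=1.0, T=4000)
```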
47 Analysis I
  If for every epoch k we have
  E[ F(w^k_1) ] - F(w_*) = O( 1/(\lambda T_k) ),
  then the last solution satisfies
  E[ F(w^k_1) ] - F(w_*) = O( 1/(\lambda T_k) ) = O( 1/(\lambda T) )
  since the epoch lengths double, so the last T_k is a constant fraction of T.
  Suppose E[ F(w^k_1) ] - F(w_*) \le c G^2 / (\lambda T_k).
  From the analysis of SGD, we have
  E[ F(w^{k+1}_1) ] - F(w_*) \le (1/(2 \eta_k T_k)) E[ \|w^k_1 - w_*\|_2^2 ] + (\eta_k/2) G^2
48 Analysis II
  Recall that, by strong convexity,
  \|w^k_1 - w_*\|_2^2 \le (2/\lambda) ( F(w^k_1) - F(w_*) )
  Then, we have
  E[ F(w^{k+1}_1) ] - F(w_*) \le (1/(\lambda \eta_k T_k)) E[ F(w^k_1) - F(w_*) ] + (\eta_k/2) G^2
  = (1/4) E[ F(w^k_1) - F(w_*) ] + 2 G^2 / (\lambda T_k)    (since \lambda \eta_k T_k = 4, i.e., \eta_k = 4/(\lambda T_k))
  \le (1/4) ( c G^2 / (\lambda T_k) ) + 2 G^2 / (\lambda T_k)
  = ( c/2 + 4 ) G^2 / (\lambda T_{k+1})                     (since T_{k+1} = 2 T_k)
  = c G^2 / (\lambda T_{k+1})                               (by choosing c = 8)
50-52 What Do We Need?
  Key inequality in the expectation bound:
  E[ F(w^{k+1}_1) ] - F(w_*) \le (1/(2 \eta_k T_k)) E[ \|w^k_1 - w_*\|_2^2 ] + (\eta_k/2) G^2
  A high-probability inequality based on Azuma's inequality:
  F(w^{k+1}_1) - F(w_*) \le (1/(2 \eta_k T_k)) \|w^k_1 - w_*\|_2^2 + (\eta_k/2) G^2 + 2GD \sqrt{ \log(1/\delta) / T_k }
  The inequality that we need:
  F(w^{k+1}_1) - F(w_*) \le (1/(2 \eta_k T_k)) \|w^k_1 - w_*\|_2^2 + (\eta_k/2) G^2 + O( 1/(\lambda T_k) )
  The 1/\sqrt{T_k} deviation term from Azuma is too large; we need a concentration
  argument whose deviation scales as 1/(\lambda T_k).
53 Concentration Inequality I
  Bernstein's Inequality for Martingales
  Let X_1, ..., X_n be a bounded martingale difference sequence with respect to the
  filtration F = (F_i)_{1 \le i \le n}, with |X_i| \le K. Let
  S_i = \sum_{j=1}^i X_j
  be the associated martingale, and denote the sum of the conditional variances by
  \Sigma^2_n = \sum_{t=1}^n E[ X_t^2 | F_{t-1} ].
  Then, for all constants t, \nu > 0,
  Pr( max_{i=1,...,n} S_i > t and \Sigma^2_n \le \nu ) \le \exp( -t^2 / (2 (\nu + K t / 3)) )
54 Concentration Inequality II
  Corollary 2
  Suppose \Sigma^2_n \le \nu always holds. Then, with probability at least 1 - \delta,
  S_n \le \sqrt{ 2 \nu \log(1/\delta) } + (2K/3) \log(1/\delta)
  Note: \sqrt{ 2 \nu \log(1/\delta) } \le (1/(2\alpha)) \log(1/\delta) + \alpha \nu for any \alpha > 0.
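The note in Corollary 2 follows from the AM-GM inequality 2\sqrt{xy} \le x + y with x = \alpha\nu and y = (1/(2\alpha))\log(1/\delta):

```latex
\sqrt{2\nu\log\tfrac{1}{\delta}}
= 2\sqrt{\alpha\nu \cdot \frac{1}{2\alpha}\log\tfrac{1}{\delta}}
\le \alpha\nu + \frac{1}{2\alpha}\log\frac{1}{\delta} .
```

This is the step that lets the variance term \nu be absorbed into the (\lambda/2)\sum_t \|w_t - w\|_2^2 term in the peeling analysis below.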
55 The Basic Idea
  For any w \in W, by strong convexity we have
  F(w_t) - F(w) \le \langle \nabla F(w_t), w_t - w \rangle - (\lambda/2) \|w_t - w\|_2^2
  = \langle g_t, w_t - w \rangle + \underbrace{\langle \nabla F(w_t) - g_t, w_t - w \rangle}_{\delta_t} - (\lambda/2) \|w_t - w\|_2^2
  Then, following the same analysis, we arrive at
  F(\bar{w}_T) - F(w) \le (1/T) ( (1/(2\eta)) \|w_1 - w\|_2^2 + (\eta T/2) G^2 + \sum_{t=1}^T \delta_t - (\lambda/2) \sum_{t=1}^T \|w_t - w\|_2^2 )
  We are going to prove
  \sum_{t=1}^T \delta_t \le (\lambda/2) \sum_{t=1}^T \|w_t - w\|_2^2 + O( \log\log T / \lambda )
56 Analysis I
  \delta_t = \langle \nabla F(w_t) - g_t, w_t - w \rangle is a martingale difference sequence with
  |\delta_t| \le \|\nabla F(w_t) - g_t\|_2 \|w_t - w\|_2 \le D ( \|\nabla F(w_t)\|_2 + \|g_t\|_2 ) \le 2GD
  where D could be a prior parameter, or 2G/\lambda when w = w_*.
  The sum of the conditional variances is upper bounded by
  \Sigma^2_T = \sum_{t=1}^T E_t[ \delta_t^2 ] \le \sum_{t=1}^T E_t[ \|\nabla F(w_t) - g_t\|_2^2 \|w_t - w\|_2^2 ] \le 4 G^2 \sum_{t=1}^T \|w_t - w\|_2^2
57 Analysis II
  A straightforward way: since
  \Sigma^2_T \le 4 G^2 \sum_{t=1}^T \|w_t - w\|_2^2,
  Bernstein's inequality for martingales would give, with probability at least 1 - \delta,
  \sum_{t=1}^T \delta_t \le \sqrt{ 2 ( 4 G^2 \sum_{t=1}^T \|w_t - w\|_2^2 ) \log(1/\delta) } + (4/3) GD \log(1/\delta)
  \le (\lambda/2) \sum_{t=1}^T \|w_t - w\|_2^2 + (4 G^2/\lambda) \log(1/\delta) + (4/3) GD \log(1/\delta)
  However, the variance proxy 4 G^2 \sum_t \|w_t - w\|_2^2 is itself random, so
  Bernstein's inequality cannot be applied with this data-dependent \nu directly.
58 Analysis III
  We need peeling: divide the range \sum_{t=1}^T \|w_t - w\|_2^2 \in [0, T D^2] into a
  sequence of intervals.
  The initial interval: \sum_{t=1}^T \|w_t - w\|_2^2 \in [0, D^2/T]. Since
  |\delta_t| \le \|\nabla F(w_t) - g_t\|_2 \|w_t - w\|_2 \le 2G \|w_t - w\|_2,
  we have
  \sum_{t=1}^T \delta_t \le 2G \sum_{t=1}^T \|w_t - w\|_2 \le 2G \sqrt{ T \sum_{t=1}^T \|w_t - w\|_2^2 } \le 2GD
  The rest: \sum_{t=1}^T \|w_t - w\|_2^2 \in ( 2^{i-1} D^2/T, 2^i D^2/T ] for i = 1, ..., \lceil 2 \log_2 T \rceil.
59 Analysis IV
  Define A_T = \sum_{t=1}^T \|w_t - w\|_2^2 \le T D^2.
  We are going to bound
  Pr( \sum_{t=1}^T \delta_t \ge 2 \sqrt{ 4 G^2 A_T \tau } + (4/3) GD \tau + 2GD )
60 Analysis V
  Pr( \sum_t \delta_t \ge 2 \sqrt{4 G^2 A_T \tau} + (4/3) GD \tau + 2GD )
  = Pr( \sum_t \delta_t \ge 2 \sqrt{4 G^2 A_T \tau} + (4/3) GD \tau + 2GD, A_T \le D^2/T )
    + Pr( \sum_t \delta_t \ge 2 \sqrt{4 G^2 A_T \tau} + (4/3) GD \tau + 2GD, D^2/T < A_T \le T D^2 )
  = Pr( \sum_t \delta_t \ge 2 \sqrt{4 G^2 A_T \tau} + (4/3) GD \tau + 2GD, \Sigma^2_T \le 4 G^2 A_T, D^2/T < A_T \le T D^2 )
  \le \sum_{i=1}^m Pr( \sum_t \delta_t \ge 2 \sqrt{4 G^2 A_T \tau} + (4/3) GD \tau + 2GD, \Sigma^2_T \le 4 G^2 A_T, 2^{i-1} D^2/T < A_T \le 2^i D^2/T )
  \le \sum_{i=1}^m Pr( \sum_t \delta_t \ge \sqrt{ 2 \cdot 4 G^2 (2^i D^2/T) \tau } + (4/3) GD \tau, \Sigma^2_T \le 4 G^2 \cdot 2^i D^2/T )
  \le m e^{-\tau}
  where m = \lceil 2 \log_2 T \rceil. The first probability vanishes because, on the initial
  interval, \sum_t \delta_t \le 2GD deterministically; each remaining term is bounded by
  e^{-\tau} via Bernstein's inequality with \nu = 4 G^2 \cdot 2^i D^2/T.
61 Analysis VI
  By setting \tau = \log(m/\delta), with probability at least 1 - \delta,
  \sum_{t=1}^T \delta_t \le 2 \sqrt{ 4 G^2 A_T \tau } + (4/3) GD \tau + 2GD
  = 2 \sqrt{ 4 G^2 ( \sum_{t=1}^T \|w_t - w\|_2^2 ) \tau } + (4/3) GD \tau + 2GD
  \le (\lambda/2) \sum_{t=1}^T \|w_t - w\|_2^2 + (8 G^2/\lambda) \tau + (4/3) GD \tau + 2GD
  = (\lambda/2) \sum_{t=1}^T \|w_t - w\|_2^2 + O( \log\log T / \lambda )
  where \tau = \log(m/\delta) = \log( 2 \log_2 T / \delta ) = O(\log\log T).
62-66 References
  Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. (2012). Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5).
  Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems 26.
  Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33(4).
  Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3.
  Feldman, V. (2016). Generalization of ERM in stochastic convex optimization: The dimension strikes back. ArXiv e-prints.
  Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3).
  Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory.
  Koren, T. and Levy, K. (2015). Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems 28.
  Kushner, H. J. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer, second edition.
  Lan, G. (2012). An optimal method for stochastic composite optimization. Mathematical Programming, 133.
  Lee, W. S., Bartlett, P. L., and Williamson, R. C. (1996). The importance of convexity in learning with squared loss. In Proceedings of the 9th Annual Conference on Computational Learning Theory.
  Mahdavi, M., Zhang, L., and Jin, R. (2015). Lower and upper bounds on the generalization of stochastic exponentially concave optimization. In Proceedings of the 28th Conference on Learning Theory.
  Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4).
  Nesterov, Y. (2011). Random gradient-free minimization of convex functions. CORE Discussion Papers.
  Panchenko, D. (2002). Some extensions of an inequality of Vapnik and Chervonenkis. Electronic Communications in Probability, 7:55-65.
  Rakhlin, A., Shamir, O., and Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning.
  Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. (2009). Stochastic convex optimization. In Proceedings of the 22nd Annual Conference on Learning Theory.
  Shamir, O. and Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning.
  Srebro, N., Sridharan, K., and Tewari, A. (2010). Optimistic rates for learning with a smooth loss. ArXiv e-prints.
  Sridharan, K., Shalev-Shwartz, S., and Srebro, N. (2009). Fast rates for regularized objectives. In Advances in Neural Information Processing Systems 21.
  Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32.
  Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.
  Zhang, L., Yang, T., and Jin, R. (2017). Empirical risk minimization for stochastic convex optimization: O(1/n)- and O(1/n^2)-type of risk bounds. In Proceedings of the 30th Annual Conference on Learning Theory.
  Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning.
Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed
More informationRisk Bounds for Lévy Processes in the PAC-Learning Framework
Risk Bounds for Lévy Processes in the PAC-Learning Framework Chao Zhang School of Computer Engineering anyang Technological University Dacheng Tao School of Computer Engineering anyang Technological University
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More informationMarch 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang.
Florida State University March 1, 2018 Framework 1. (Lizhe) Basic inequalities Chernoff bounding Review for STA 6448 2. (Lizhe) Discrete-time martingales inequalities via martingale approach 3. (Boning)
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong
More informationAdvanced Machine Learning
Advanced Machine Learning Online Convex Optimization MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Online projected sub-gradient descent. Exponentiated Gradient (EG). Mirror descent.
More informationConvergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity
Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Benjamin Grimmer Abstract We generalize the classic convergence rate theory for subgradient methods to
More informationEfficient Private ERM for Smooth Objectives
Efficient Private ERM for Smooth Objectives Jiaqi Zhang, Kai Zheng, Wenlong Mou, Liwei Wang Key Laboratory of Machine Perception, MOE, School of Electronics Engineering and Computer Science, Peking University,
More informationAccelerating Online Convex Optimization via Adaptive Prediction
Mehryar Mohri Courant Institute and Google Research New York, NY 10012 mohri@cims.nyu.edu Scott Yang Courant Institute New York, NY 10012 yangs@cims.nyu.edu Abstract We present a powerful general framework
More informationTutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer
Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,
More informationStatistical Machine Learning II Spring 2017, Learning Theory, Lecture 4
Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.
More informationA Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints
A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints Hao Yu and Michael J. Neely Department of Electrical Engineering
More informationarxiv: v1 [math.oc] 18 Mar 2016
Katyusha: Accelerated Variance Reduction for Faster SGD Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University arxiv:1603.05953v1 [math.oc] 18 Mar 016 March 18, 016 Abstract We consider minimizing
More informationInverse Time Dependency in Convex Regularized Learning
Inverse Time Dependency in Convex Regularized Learning Zeyuan Allen Zhu 2*, Weizhu Chen 2, Chenguang Zhu 23, Gang Wang 2, Haixun Wang 2, Zheng Chen 2 Fundamental Science Class, Department of Physics, Tsinghua
More informationStochastic Approximation: Mini-Batches, Optimistic Rates and Acceleration
Stochastic Approximation: Mini-Batches, Optimistic Rates and Acceleration Nati Srebro Toyota Technological Institute at Chicago a philanthropically endowed academic computer science institute dedicated
More informationarxiv: v1 [cs.lg] 2 Apr 2013
JMLR: Worshop and Conference Proceedings vol 30 013) 1 3 Olog T) Projections for Stochastic Optimization of Smooth and Strongly Convex Functions arxiv:1304.0740v1 [cs.lg] Apr 013 Lijun Zhang zhanglij@msu.edu
More informationImproved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations
Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org
More informationarxiv: v1 [stat.ml] 27 Sep 2016
Generalization Error Bounds for Optimization Algorithms via Stability Qi Meng 1, Yue Wang, Wei Chen 3, Taifeng Wang 3, Zhi-Ming Ma 4, Tie-Yan Liu 3 1 School of Mathematical Sciences, Peking University,
More informationOnline Optimization with Gradual Variations
JMLR: Workshop and Conference Proceedings vol (0) 0 Online Optimization with Gradual Variations Chao-Kai Chiang, Tianbao Yang 3 Chia-Jung Lee Mehrdad Mahdavi 3 Chi-Jen Lu Rong Jin 3 Shenghuo Zhu 4 Institute
More informationRandomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity
Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University
More informationOnline Passive-Aggressive Algorithms
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationLarge Scale Semi-supervised Linear SVM with Stochastic Gradient Descent
Journal of Computational Information Systems 9: 15 (2013) 6251 6258 Available at http://www.jofcis.com Large Scale Semi-supervised Linear SVM with Stochastic Gradient Descent Xin ZHOU, Conghui ZHU, Sheng
More informationMachine Learning in the Data Revolution Era
Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,
More informationGeneralization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back
Generalization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back Vitaly Feldman IBM Research Almaden Abstract In stochastic convex optimization the goal is to minimize a convex function
More informationModern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization
Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where
More informationLarge-scale machine learning and convex optimization
Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE IFCAM, Bangalore - July 2014 Big data revolution? A new scientific
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationLearning Bound for Parameter Transfer Learning
Learning Bound for Parameter Transfer Learning Wataru Kumagai Faculty of Engineering Kanagawa University kumagai@kanagawa-u.ac.jp Abstract We consider a transfer-learning problem by using the parameter
More informationFast Rates for Estimation Error and Oracle Inequalities for Model Selection
Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationarxiv: v2 [cs.lg] 14 Sep 2017
Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU arxiv:160.0101v [cs.lg] 14 Sep 017
More informationProximal Minimization by Incremental Surrogate Optimization (MISO)
Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine
More informationSADAGRAD: Strongly Adaptive Stochastic Gradient Methods
Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,
More informationVariance-Reduced and Projection-Free Stochastic Optimization
Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU Abstract The Frank-Wolfe optimization
More informationSub-Sampled Newton Methods for Machine Learning. Jorge Nocedal
Sub-Sampled Newton Methods for Machine Learning Jorge Nocedal Northwestern University Goldman Lecture, Sept 2016 1 Collaborators Raghu Bollapragada Northwestern University Richard Byrd University of Colorado
More informationLarge-scale machine learning and convex optimization
Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Allerton Conference - September 2015 Slides available at www.di.ens.fr/~fbach/gradsto_allerton.pdf
More informationLimitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization
Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization Yossi Arjevani Department of Computer Science and Applied Mathematics Weizmann Institute of Science Rehovot 7610001,
More informationBeating SGD: Learning SVMs in Sublinear Time
Beating SGD: Learning SVMs in Sublinear Time Elad Hazan Tomer Koren Technion, Israel Institute of Technology Haifa, Israel 32000 {ehazan@ie,tomerk@cs}.technion.ac.il Nathan Srebro Toyota Technological
More informationOracle Complexity of Second-Order Methods for Smooth Convex Optimization
racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il
More informationOnline Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems
216 IEEE 55th Conference on Decision and Control (CDC) ARIA Resort & Casino December 12-14, 216, Las Vegas, USA Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems
More informationDistributed online optimization over jointly connected digraphs
Distributed online optimization over jointly connected digraphs David Mateos-Núñez Jorge Cortés University of California, San Diego {dmateosn,cortes}@ucsd.edu Mathematical Theory of Networks and Systems
More informationOnline Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016
Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural
More informationBeating SGD: Learning SVMs in Sublinear Time
Beating SGD: Learning SVMs in Sublinear Time Elad Hazan Tomer Koren Technion, Israel Institute of Technology Haifa, Israel 32000 {ehazan@ie,tomerk@cs}.technion.ac.il Nathan Srebro Toyota Technological
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationFast Stochastic Optimization Algorithms for ML
Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2
More informationAccelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization
Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationLecture 3: Minimizing Large Sums. Peter Richtárik
Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors
More informationLarge-scale Machine Learning and Optimization
1/84 Large-scale Machine Learning and Optimization Zaiwen Wen http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: this slides is based on Dimitris Papailiopoulos and Shiqian Ma lecture notes
More information