Stochastic Optimization


1 Stochastic Optimization. Lijun Zhang, Nanjing University, China. May 26, 2017

2 Outline: 1. Introduction; 2. Related Work; 3. Stochastic Gradient Descent (Expectation Bound, High-probability Bound); 4. Epoch Gradient Descent (Expectation Bound, High-probability Bound)

3 Outline: Introduction

4 Definitions [Nemirovski et al., 2009]. Stochastic optimization: min_{w ∈ W} F(w) = E_ξ[f(w, ξ)] = ∫_Ξ f(w, ξ) dP(ξ), where ξ is a random variable. The challenge: evaluating the expectation/integration.

5 Definitions (continued). Supervised learning [Vapnik, 1998]: min_{h ∈ H} F(h) = E_{(x,y)}[l(h(x), y)], where (x, y) is a random instance-label pair, H = {h : X → R} is a hypothesis class, and l(·, ·) is a certain loss, e.g., the hinge loss l(u, v) = max(0, 1 − uv).

6 Optimization Methods. Sample Average Approximation (SAA), a.k.a. Empirical Risk Minimization (ERM): min_{w ∈ W} F(w) = (1/n) Σ_{i=1}^n f(w, ξ_i), where ξ_1, ..., ξ_n are i.i.d.; in supervised learning, min_{h ∈ H} F(h) = (1/n) Σ_{i=1}^n l(h(x_i), y_i), where (x_1, y_1), ..., (x_n, y_n) are i.i.d.

7 Optimization Methods (continued). Stochastic Approximation (SA): optimization via noisy observations of F(·); zero-order SA [Nesterov, 2011], and first-order SA, e.g., SGD [Kushner and Yin, 2003].
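To make the SAA/SA distinction concrete, here is a minimal sketch (my own toy example, not from the slides) on the scalar objective f(w, ξ) = (w − ξ)²/2 with ξ ~ N(2, 1): SAA/ERM minimizes the sample average in one shot, while first-order SA processes one noisy gradient per step.

```python
# Toy comparison of SAA/ERM and first-order SA on F(w) = E_xi[(w - xi)^2 / 2]
# with xi ~ N(2, 1); the population minimizer is w* = 2.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
xi = rng.normal(2.0, 1.0, size=n)      # i.i.d. samples of xi

# SAA / ERM: minimize the sample average (1/n) * sum_i (w - xi_i)^2 / 2;
# its closed-form minimizer is the sample mean.
w_erm = xi.mean()

# First-order SA (SGD): one noisy gradient g_t = w_t - xi_t per step.
w = 0.0
for t, x in enumerate(xi, start=1):
    eta = 1.0 / np.sqrt(t)             # decreasing step size
    w -= eta * (w - x)

print(f"ERM: {w_erm:.3f}  SGD: {w:.3f}  true minimizer: 2.0")
```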

8 Performance. Optimization error / excess risk: F(ŵ) − min_{w ∈ W} F(w), with typical rates O(1/√n), O(1/√T), O(1/n), O(1/T), O(1/n²), where n is the number of samples in empirical risk minimization and T is the number of stochastic gradients in first-order stochastic approximation.

9 Outline: Related Work

10 Related Work. Empirical risk minimization: [Shalev-Shwartz et al., 2009], [Koren and Levy, 2015], [Feldman, 2016], [Zhang et al., 2017]. Stochastic approximation: [Zinkevich, 2003], [Hazan et al., 2007], [Hazan and Kale, 2011], [Rakhlin et al., 2012], [Lan, 2012], [Agarwal et al., 2012], [Shamir and Zhang, 2013], [Mahdavi et al., 2015], [Bach and Moulines, 2013], ... Supervised learning: [Vapnik, 1998], [Lee et al., 1996], [Panchenko, 2002], [Bartlett and Mendelson, 2002], [Tsybakov, 2004], [Bartlett et al., 2005], [Sridharan et al., 2009], [Srebro et al., 2010], ...

11-19 Risk Bounds of Empirical Risk Minimization (the figure is built up over slides 11-19).
Stochastic convex optimization:
  Lipschitz: O(√(d/n)) [Shalev-Shwartz et al., 2009]
  Smooth: O(d/n) [Zhang et al., 2017]
  Strongly convex: O(1/(λn)) in expectation [Shalev-Shwartz et al., 2009]; O(d/n + 1/(λn)) [Zhang et al., 2017]
  Strongly convex & smooth: O(1/(λn²)) [Zhang et al., 2017]
Supervised learning:
  Lipschitz: O(1/√n) [Bartlett and Mendelson, 2002]
  Smooth: O(1/n) [Srebro et al., 2010]
  Strongly convex: O(1/(λn)) [Sridharan et al., 2009]
  Strongly convex & smooth: O(1/(λn²)) [Zhang et al., 2017]

20-25 Risk Bounds of Stochastic Approximation (the figure is built up over slides 20-25). Supervised learning:
  Lipschitz: O(1/√n) [Zinkevich, 2003]
  Smooth: O(1/n) [Srebro et al., 2010]
  Square or logistic loss: O(1/n) [Bach and Moulines, 2013]
  Strongly convex: O(1/(λn)) [Hazan and Kale, 2011]

26 Outline: Stochastic Gradient Descent

27 Stochastic Gradient Descent: The Algorithm
1: for t = 1, ..., T do
2:   w'_{t+1} = w_t − η_t g_t
3:   w_{t+1} = Π_W(w'_{t+1})
4: end for
5: return w̄_T = (1/T) Σ_{t=1}^T w_t
Requirement: E_t[g_t] = ∇F(w_t), where E_t[·] is the conditional expectation.

28 Stochastic Gradient Descent: The Algorithm (continued). Instantiation for stochastic optimization, min_{w ∈ W} F(w) = E_ξ[f(w, ξ)]: sample ξ_t from the underlying distribution and set g_t = ∇f(w_t, ξ_t).

29 Stochastic Gradient Descent: The Algorithm (continued). Instantiation for supervised learning, min_{h ∈ H} F(h) = E_{(x,y)}[l(h(x), y)]: sample (x_t, y_t) from the underlying distribution and set g_t = ∇l(h_t(x_t), y_t).
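A minimal sketch of this projected SGD scheme for a linear hypothesis class with the hinge loss, assuming W is a Euclidean ball of radius R and using synthetic data (the function name, data, and parameters are illustrative, not from the slides):

```python
# Projected SGD with averaging for a linear hypothesis and the hinge loss
# l(w; x, y) = max(0, 1 - y <w, x>), with W = {w : ||w||_2 <= R}.
import numpy as np

def projected_sgd_hinge(X, y, R=1.0, T=None, c=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    T = T or n
    eta = c / np.sqrt(T)                         # fixed step size eta = c / sqrt(T)
    w, avg = np.zeros(d), np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)                      # sample (x_t, y_t)
        x_t, y_t = X[i], y[i]
        # subgradient of the hinge loss at w_t
        g = -y_t * x_t if y_t * (x_t @ w) < 1 else np.zeros(d)
        w = w - eta * g                          # gradient step
        norm = np.linalg.norm(w)
        if norm > R:                             # projection onto the ball W
            w = w * (R / norm)
        avg += w
    return avg / T                               # averaged iterate w_bar_T

# toy usage on linearly separable synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = np.sign(X @ rng.normal(size=5))
w_hat = projected_sgd_hinge(X, y, R=2.0, T=2000)
print("training error:", np.mean(np.sign(X @ w_hat) != y))
```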

30 Outline: Stochastic Gradient Descent, Expectation Bound

31 Analysis I. For any w ∈ W, by convexity,
F(w_t) − F(w) ≤ ⟨∇F(w_t), w_t − w⟩
  = ⟨g_t, w_t − w⟩ + ⟨∇F(w_t) − g_t, w_t − w⟩    (denote the second term by δ_t)
  = (1/η_t) ⟨w_t − w'_{t+1}, w_t − w⟩ + δ_t
  = (1/(2η_t)) (||w_t − w||_2^2 − ||w'_{t+1} − w||_2^2 + ||w_t − w'_{t+1}||_2^2) + δ_t
  = (1/(2η_t)) (||w_t − w||_2^2 − ||w'_{t+1} − w||_2^2) + (η_t/2) ||g_t||_2^2 + δ_t
  ≤ (1/(2η_t)) (||w_t − w||_2^2 − ||w_{t+1} − w||_2^2) + (η_t/2) ||g_t||_2^2 + δ_t,
where the last step uses the non-expansiveness of the projection Π_W.

32 Analysis II. To simplify the above inequality, assume η_t = η and ||g_t||_2 ≤ G. Then
F(w_t) − F(w) ≤ (1/(2η)) (||w_t − w||_2^2 − ||w_{t+1} − w||_2^2) + (η/2) G^2 + δ_t.
Summing the inequalities over all iterations,
Σ_{t=1}^T F(w_t) − T F(w) ≤ (1/(2η)) ||w_1 − w||_2^2 + (ηT/2) G^2 + Σ_{t=1}^T δ_t.

33 Analysis III. Then, by the convexity of F,
F(w̄_T) − F(w) = F((1/T) Σ_{t=1}^T w_t) − F(w) ≤ (1/T) Σ_{t=1}^T (F(w_t) − F(w))
  ≤ (1/T) ((1/(2η)) ||w_1 − w||_2^2 + (ηT/2) G^2 + Σ_{t=1}^T δ_t)
  = (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) Σ_{t=1}^T δ_t.
In summary,
F(w̄_T) − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) Σ_{t=1}^T δ_t.

34 Expectation Bound. Under the condition E_t[g_t] = ∇F(w_t), it is easy to verify that
E[δ_t] = E[⟨∇F(w_t) − g_t, w_t − w⟩] = E_{1,...,t−1}[E_t[⟨∇F(w_t) − g_t, w_t − w⟩]] = 0.
Thus, taking expectations on both sides,
E[F(w̄_T)] − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2
  = (1/√T) ((1/(2c)) ||w_1 − w||_2^2 + (c/2) G^2)    with η = c/√T.
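As a worked step (my addition, not on the slide), minimizing the right-hand side over η explains the c/√T choice: the optimal step size balances the two terms and yields an O(1/√T) rate.

```latex
\[
  h(\eta) = \frac{\|w_1 - w\|_2^2}{2\eta T} + \frac{\eta}{2} G^2 ,
  \qquad
  h'(\eta^*) = 0 \;\Longrightarrow\; \eta^* = \frac{\|w_1 - w\|_2}{G\sqrt{T}} ,
  \qquad
  h(\eta^*) = \frac{G \, \|w_1 - w\|_2}{\sqrt{T}} .
\]
```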

35 Outline: Stochastic Gradient Descent, High-probability Bound

36 Martingale. A sequence of random variables Y_1, Y_2, Y_3, ... is a martingale if for any time t, E[|Y_t|] < ∞ and E[Y_{t+1} | Y_1, ..., Y_t] = Y_t.
An example: Y_t is a gambler's fortune after t tosses of a fair coin. Let δ_t be a random variable such that δ_t = 1 if the coin comes up heads and δ_t = −1 if it comes up tails. Then Y_t = Σ_{i=1}^t δ_i, and
E[Y_{t+1} | Y_1, ..., Y_t] = E[δ_{t+1} + Y_t | Y_1, ..., Y_t] = Y_t + E[δ_{t+1} | Y_1, ..., Y_t] = Y_t.

37 Martingale Difference. Suppose Y_1, Y_2, Y_3, ... is a martingale (e.g., the gambler's fortune above); then X_t = Y_t − Y_{t−1} is a martingale difference sequence:
E[X_{t+1} | X_1, ..., X_t] = E[Y_{t+1} − Y_t | X_1, ..., X_t] = 0.

38 Azuma's Inequality for Martingales. Suppose X_1, X_2, ... is a martingale difference sequence and |X_i| ≤ c_i almost surely. Then
Pr(Σ_{i=1}^n X_i ≥ t) ≤ exp(−t^2 / (2 Σ_{i=1}^n c_i^2)),   and   Pr(Σ_{i=1}^n X_i ≤ −t) ≤ exp(−t^2 / (2 Σ_{i=1}^n c_i^2)).
Corollary 1: assume c_i ≤ c. Then, with probability at least 1 − 2δ,
|Σ_{i=1}^n X_i| ≤ c √(2n log(1/δ)).
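A quick numerical illustration of Corollary 1 (my own check, not from the slides) on the fair-coin martingale above, where c_i = 1:

```python
# Empirical check of Corollary 1 on Y_n = sum of n fair +/-1 coin flips (c_i = 1):
# with probability at least 1 - 2*delta, |Y_n| <= sqrt(2 * n * log(1/delta)).
import numpy as np

rng = np.random.default_rng(0)
n, trials, delta = 1000, 20000, 0.05
Y_n = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)

bound = np.sqrt(2 * n * np.log(1 / delta))
violations = np.mean(np.abs(Y_n) > bound)
print(f"bound = {bound:.1f}, violation rate = {violations:.4f} (guarantee: <= {2 * delta})")
```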

39 High-probability Bound I. Assume ||g_t||_2 ≤ G and ||x − y||_2 ≤ D for all x, y ∈ W. Under the condition E_t[g_t] = ∇F(w_t), it is easy to verify that δ_t = ⟨∇F(w_t) − g_t, w_t − w⟩ is a martingale difference sequence with
|δ_t| ≤ ||∇F(w_t) − g_t||_2 ||w_t − w||_2 ≤ D (||∇F(w_t)||_2 + ||g_t||_2) ≤ 2GD,
where we use Jensen's inequality: ||∇F(w_t)||_2 = ||E_t[g_t]||_2 ≤ E_t[||g_t||_2] ≤ G.

40 High-probability Bound II. From Azuma's inequality (Corollary 1 with c = 2GD), with probability at least 1 − δ,
Σ_{t=1}^T δ_t ≤ 2GD √(2T log(1/δ)).
Thus, with probability at least 1 − δ,
F(w̄_T) − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) Σ_{t=1}^T δ_t
  ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + 2GD √(2 log(1/δ) / T)
  = (1/√T) ((1/(2c)) ||w_1 − w||_2^2 + (c/2) G^2 + 2GD √(2 log(1/δ)))    with η = c/√T
  = O(1/√T).

41 Outline: Epoch Gradient Descent

42 Strong Convexity. A general definition: F is λ-strongly convex if for all α ∈ [0, 1] and all w, w' ∈ W,
F(αw + (1 − α)w') ≤ α F(w) + (1 − α) F(w') − (λ/2) α(1 − α) ||w − w'||_2^2.
A more popular definition: F is λ-strongly convex if for all w, w' ∈ W,
F(w) + ⟨∇F(w), w' − w⟩ + (λ/2) ||w' − w||_2^2 ≤ F(w').

43 Strong Convexity (continued). An important property: if F is λ-strongly convex, then
(λ/2) ||w − w*||_2^2 ≤ F(w) − F(w*)   for all w ∈ W, where w* = argmin_{w ∈ W} F(w).
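A short derivation of this property (my addition): apply the second definition at w* and use the first-order optimality condition ⟨∇F(w*), w − w*⟩ ≥ 0 for all w ∈ W.

```latex
\[
  F(w) \;\ge\; F(w^*) + \langle \nabla F(w^*),\, w - w^* \rangle + \frac{\lambda}{2}\,\|w - w^*\|_2^2
       \;\ge\; F(w^*) + \frac{\lambda}{2}\,\|w - w^*\|_2^2 .
\]
```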

44 Epoch Gradient Descent: The Algorithm [Hazan and Kale, 2011]
1: T_1 = 4, η_1 = 1/λ, k = 1
2: while Σ_{i=1}^k T_i ≤ T do
3:   for t = 1, ..., T_k do
4:     w^k_{t+1} = Π_W(w^k_t − η_k g^k_t)
5:   end for
6:   w^{k+1}_1 = (1/T_k) Σ_{t=1}^{T_k} w^k_t
7:   η_{k+1} = η_k / 2, T_{k+1} = 2 T_k
8:   k = k + 1
9: end while

45 Epoch Gradient Descent (continued). Our goal: establish an O(1/T) excess risk bound.
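A minimal sketch of this epoch scheme (illustrative only; the gradient oracle, projection, and toy objective below are my assumptions, not from the slides):

```python
# Epoch-GD sketch: run projected SGD in epochs, halving the step size and
# doubling the epoch length, and restart each epoch from the previous average.
import numpy as np

def epoch_gd(grad_oracle, project, d, T, lam, T1=4):
    w = np.zeros(d)
    T_k, eta_k, used, k = T1, 1.0 / lam, 0, 1
    while used + T_k <= T:
        acc = np.zeros(d)
        for _ in range(T_k):
            acc += w                                   # average the epoch's iterates
            w = project(w - eta_k * grad_oracle(w))    # projected SGD step
        w = acc / T_k                                  # w_1^{k+1}
        used += T_k
        eta_k, T_k, k = eta_k / 2.0, 2 * T_k, k + 1
    return w

# toy usage: F(w) = E[||w - xi||^2 / 2] with xi ~ N(w_star, I), so lambda = 1
rng = np.random.default_rng(0)
d, lam = 5, 1.0
w_star = rng.normal(size=d)
grad_oracle = lambda w: w - (w_star + rng.normal(size=d))   # unbiased stochastic gradient
project = lambda w: w if np.linalg.norm(w) <= 10 else 10 * w / np.linalg.norm(w)
w_hat = epoch_gd(grad_oracle, project, d, T=4000, lam=lam)
print("distance to w*:", np.linalg.norm(w_hat - w_star))
```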

46 Outline: Epoch Gradient Descent, Expectation Bound

47 Analysis I. If we can show, for every epoch k,
E[F(w^k_1)] − F(w*) = O(1/(λ T_k)),
then the last solution satisfies E[F(w^k_1)] − F(w*) = O(1/(λ T_k)) = O(1/(λT)) for the final epoch k, since the final epoch length is a constant fraction of T.
So suppose E[F(w^k_1)] − F(w*) ≤ c G^2 / (λ T_k). From the analysis of SGD (applied within epoch k),
E[F(w^{k+1}_1)] − F(w*) ≤ (1/(2 η_k T_k)) E[||w^k_1 − w*||_2^2] + (η_k/2) G^2.

48 Analysis II. Recall that ||w^k_1 − w*||_2^2 ≤ (2/λ) (F(w^k_1) − F(w*)) by strong convexity. Then
E[F(w^{k+1}_1)] − F(w*) ≤ (1/(λ η_k T_k)) E[F(w^k_1) − F(w*)] + (η_k/2) G^2
  = (1/4) E[F(w^k_1) − F(w*)] + 2 G^2 / (λ T_k)    (since λ η_k T_k = 4, i.e., η_k = 4/(λ T_k))
  ≤ (c/4 + 2) G^2 / (λ T_k)
  = (c/2 + 4) G^2 / (λ T_{k+1})    (since T_{k+1} = 2 T_k)
  = c G^2 / (λ T_{k+1})    when c = 8.

49 Outline: Epoch Gradient Descent, High-probability Bound

50 What do We Need? Key inequality in the expectation bound:
E[F(w^{k+1}_1)] − F(w*) ≤ (1/(2 η_k T_k)) E[||w^k_1 − w*||_2^2] + (η_k/2) G^2.

51 What do We Need? (continued) A high-probability inequality based on Azuma's inequality:
F(w^{k+1}_1) − F(w*) ≤ (1/(2 η_k T_k)) ||w^k_1 − w*||_2^2 + (η_k/2) G^2 + 2GD √(2 log(1/δ) / T_k).

52 What do We Need? (continued) The inequality that we need:
F(w^{k+1}_1) − F(w*) ≤ (1/(2 η_k T_k)) ||w^k_1 − w*||_2^2 + (η_k/2) G^2 + O(1/(λ T_k)).

53 Concentration Inequality I. Bernstein's inequality for martingales: let X_1, ..., X_n be a bounded martingale difference sequence with respect to the filtration F = (F_i)_{1 ≤ i ≤ n}, with |X_i| ≤ K. Let S_i = Σ_{j=1}^i X_j be the associated martingale, and denote the sum of the conditional variances by
Σ²_n = Σ_{t=1}^n E[X_t^2 | F_{t−1}].
Then for all constants t, ν > 0,
Pr(max_{i=1,...,n} S_i > t and Σ²_n ≤ ν) ≤ exp(−t^2 / (2(ν + Kt/3))).

54 Concentration Inequality II. Corollary 2: suppose Σ²_n ≤ ν. Then, with probability at least 1 − δ,
S_n ≤ √(2ν log(1/δ)) + (2K/3) log(1/δ).
Note: √(2ν log(1/δ)) ≤ (1/(2α)) log(1/δ) + αν for any α > 0.
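The Note is just the AM-GM inequality √(ab) ≤ (a + b)/2; a one-line check (my addition):

```latex
\[
  \sqrt{2\nu \log\tfrac{1}{\delta}}
  \;=\; \sqrt{(2\alpha\nu)\cdot\tfrac{1}{\alpha}\log\tfrac{1}{\delta}}
  \;\le\; \frac{1}{2}\Big(2\alpha\nu + \frac{1}{\alpha}\log\tfrac{1}{\delta}\Big)
  \;=\; \alpha\nu + \frac{1}{2\alpha}\log\tfrac{1}{\delta},
  \qquad \forall\,\alpha > 0 .
\]
```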

55 The Basic Idea. For any w ∈ W, by λ-strong convexity,
F(w_t) − F(w) ≤ ⟨∇F(w_t), w_t − w⟩ − (λ/2) ||w_t − w||_2^2
  = ⟨g_t, w_t − w⟩ + ⟨∇F(w_t) − g_t, w_t − w⟩ − (λ/2) ||w_t − w||_2^2,
where the middle term is again δ_t. Then, following the same analysis as before,
F(w̄_T) − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) (Σ_{t=1}^T δ_t − (λ/2) Σ_{t=1}^T ||w_t − w||_2^2).
We are going to prove
Σ_{t=1}^T δ_t ≤ (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + O(log log T / λ).

56 Analysis I. δ_t = ⟨∇F(w_t) − g_t, w_t − w⟩ is a martingale difference sequence with
|δ_t| ≤ ||∇F(w_t) − g_t||_2 ||w_t − w||_2 ≤ D (||∇F(w_t)||_2 + ||g_t||_2) ≤ 2GD,
where D could be a prior parameter, or 2G/λ when w = w*. The sum of the conditional variances is upper bounded by
Σ²_T = Σ_{t=1}^T E_t[δ_t^2] ≤ Σ_{t=1}^T E_t[||∇F(w_t) − g_t||_2^2 ||w_t − w||_2^2] ≤ 4G^2 Σ_{t=1}^T ||w_t − w||_2^2.

57 Analysis II. A straightforward way: since Σ²_T ≤ 4G^2 Σ_{t=1}^T ||w_t − w||_2^2, following Bernstein's inequality for martingales, with probability at least 1 − δ,
Σ_{t=1}^T δ_t ≤ 2 √(4G^2 Σ_{t=1}^T ||w_t − w||_2^2 log(1/δ)) + (4/3) GD log(1/δ)
  ≤ (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + (8G^2/λ) log(1/δ) + (4/3) GD log(1/δ).

58 Analysis III. We need peeling, because Bernstein's inequality requires a deterministic bound ν on Σ²_T, while Σ_{t=1}^T ||w_t − w||_2^2 is random. Divide Σ_{t=1}^T ||w_t − w||_2^2 ∈ [0, TD^2] into a sequence of intervals.
The initial interval: Σ_{t=1}^T ||w_t − w||_2^2 ≤ D^2/T. Since |δ_t| ≤ ||∇F(w_t) − g_t||_2 ||w_t − w||_2 ≤ 2G ||w_t − w||_2, we have
Σ_{t=1}^T δ_t ≤ 2G Σ_{t=1}^T ||w_t − w||_2 ≤ 2G √(T Σ_{t=1}^T ||w_t − w||_2^2) ≤ 2GD.
The remaining intervals: Σ_{t=1}^T ||w_t − w||_2^2 ∈ (2^{i−1} D^2/T, 2^i D^2/T], for i = 1, ..., m with m = ⌈2 log_2 T⌉.

59 Analysis IV. Define A_T = Σ_{t=1}^T ||w_t − w||_2^2 ≤ T D^2. We are going to bound
Pr[Σ_{t=1}^T δ_t ≥ 2 √(4G^2 A_T τ) + (4/3) GD τ + 2GD].

60 Analysis V.
Pr[Σ_{t=1}^T δ_t ≥ 2√(4G^2 A_T τ) + (4/3) GD τ + 2GD]
 = Pr[Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4/3) GD τ + 2GD, A_T ≤ D^2/T]
   + Pr[Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4/3) GD τ + 2GD, D^2/T < A_T ≤ TD^2]
 = Pr[Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4/3) GD τ + 2GD, Σ²_T ≤ 4G^2 A_T, D^2/T < A_T ≤ TD^2]
   (the first term vanishes because Σ_t δ_t ≤ 2GD when A_T ≤ D^2/T, and Σ²_T ≤ 4G^2 A_T always holds)
 ≤ Σ_{i=1}^m Pr[Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4/3) GD τ + 2GD, Σ²_T ≤ 4G^2 A_T, (D^2/T) 2^{i−1} < A_T ≤ (D^2/T) 2^i]
 ≤ Σ_{i=1}^m Pr[Σ_t δ_t ≥ √(2 · 4G^2 (D^2/T) 2^i τ) + (4/3) GD τ, Σ²_T ≤ 4G^2 (D^2/T) 2^i]
 ≤ m e^{−τ},
where m = ⌈2 log_2 T⌉.

61 Analysis VI. By setting τ = log(m/δ), with probability at least 1 − δ,
Σ_{t=1}^T δ_t ≤ 2 √(4G^2 A_T τ) + (4/3) GD τ + 2GD
  = 2 √(4G^2 Σ_{t=1}^T ||w_t − w||_2^2 τ) + (4/3) GD τ + 2GD
  ≤ (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + (8G^2/λ) τ + (4/3) GD τ + 2GD
  = (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + O(log log T / λ),
where τ = log(m/δ) = log(2 log_2 T / δ) = O(log log T).

62-66 References
Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. (2012). Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5).
Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems 26.
Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33(4).
Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3.
Feldman, V. (2016). Generalization of ERM in stochastic convex optimization: The dimension strikes back. ArXiv e-prints.
Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3).
Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: An optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory.
Koren, T. and Levy, K. (2015). Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems 28.
Kushner, H. J. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer, second edition.
Lan, G. (2012). An optimal method for stochastic composite optimization. Mathematical Programming, 133.
Lee, W. S., Bartlett, P. L., and Williamson, R. C. (1996). The importance of convexity in learning with squared loss. In Proceedings of the 9th Annual Conference on Computational Learning Theory.
Mahdavi, M., Zhang, L., and Jin, R. (2015). Lower and upper bounds on the generalization of stochastic exponentially concave optimization. In Proceedings of the 28th Conference on Learning Theory.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4).
Nesterov, Y. (2011). Random gradient-free minimization of convex functions. CORE Discussion Papers.
Panchenko, D. (2002). Some extensions of an inequality of Vapnik and Chervonenkis. Electronic Communications in Probability, 7:55-65.
Rakhlin, A., Shamir, O., and Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning.
Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. (2009). Stochastic convex optimization. In Proceedings of the 22nd Annual Conference on Learning Theory.
Shamir, O. and Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning.
Srebro, N., Sridharan, K., and Tewari, A. (2010). Optimistic rates for learning with a smooth loss. ArXiv e-prints.
Sridharan, K., Shalev-Shwartz, S., and Srebro, N. (2009). Fast rates for regularized objectives. In Advances in Neural Information Processing Systems 21.
Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.
Zhang, L., Yang, T., and Jin, R. (2017). Empirical risk minimization for stochastic convex optimization: O(1/n)- and O(1/n²)-type of risk bounds. In Proceedings of the 30th Annual Conference on Learning Theory.
Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning.
