Stochastic Optimization


1 Stochastic Optimization. Lijun Zhang, Nanjing University, China. May 26, 2017

2 Outline: 1. Introduction; 2. Related Work; 3. Stochastic Gradient Descent (Expectation Bound, High-probability Bound); 4. Epoch Gradient Descent (Expectation Bound, High-probability Bound)

3 Outline: Introduction

4 Definitions [Nemirovski et al., 2009]. Stochastic optimization: min_{w ∈ W} F(w) = E_ξ[f(w, ξ)] = ∫_Ξ f(w, ξ) dP(ξ), where ξ is a random variable. The challenge: evaluating the expectation/integration.

5 Definitions (continued). Supervised learning [Vapnik, 1998]: min_{h ∈ H} F(h) = E_{(x,y)}[l(h(x), y)], where (x, y) is a random instance-label pair, H = {h : X → R} is a hypothesis class, and l(·, ·) is a certain loss, e.g., the hinge loss l(u, v) = max(0, 1 − uv).

6 Optimization Methods. Sample Average Approximation (SAA), a.k.a. Empirical Risk Minimization (ERM): min_{w ∈ W} F(w) = (1/n) Σ_{i=1}^n f(w, ξ_i), where ξ_1, ..., ξ_n are i.i.d.; in supervised learning, min_{h ∈ H} F(h) = (1/n) Σ_{i=1}^n l(h(x_i), y_i), where (x_1, y_1), ..., (x_n, y_n) are i.i.d.

7 Optimization Methods (continued). Stochastic Approximation (SA): optimization via noisy observations of F(·); zero-order SA [Nesterov, 2011], and first-order SA, e.g., SGD [Kushner and Yin, 2003].
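To make the SAA/SA distinction concrete, here is a minimal sketch (my own toy example, not from the slides) on the scalar objective f(w, ξ) = (w − ξ)²/2 with ξ ~ N(2, 1): SAA/ERM minimizes the sample average in one shot, while first-order SA processes one noisy gradient per step.

```python
# Toy comparison of SAA/ERM and first-order SA on F(w) = E_xi[(w - xi)^2 / 2]
# with xi ~ N(2, 1); the population minimizer is w* = 2.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
xi = rng.normal(2.0, 1.0, size=n)      # i.i.d. samples of xi

# SAA / ERM: minimize the sample average (1/n) * sum_i (w - xi_i)^2 / 2;
# its closed-form minimizer is the sample mean.
w_erm = xi.mean()

# First-order SA (SGD): one noisy gradient g_t = w_t - xi_t per step.
w = 0.0
for t, x in enumerate(xi, start=1):
    eta = 1.0 / np.sqrt(t)             # decreasing step size
    w -= eta * (w - x)

print(f"ERM: {w_erm:.3f}  SGD: {w:.3f}  true minimizer: 2.0")
```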

8 Performance. Optimization error / excess risk: F(ŵ) − min_{w ∈ W} F(w), with typical rates O(1/√n), O(1/√T), O(1/n), O(1/T), O(1/n²), where n is the number of samples in empirical risk minimization and T is the number of stochastic gradients in first-order stochastic approximation.

9 Outline: Related Work

10 Related Work. Empirical risk minimization: [Shalev-Shwartz et al., 2009], [Koren and Levy, 2015], [Feldman, 2016], [Zhang et al., 2017]. Stochastic approximation: [Zinkevich, 2003], [Hazan et al., 2007], [Hazan and Kale, 2011], [Rakhlin et al., 2012], [Lan, 2012], [Agarwal et al., 2012], [Shamir and Zhang, 2013], [Mahdavi et al., 2015], [Bach and Moulines, 2013], ... Supervised learning: [Vapnik, 1998], [Lee et al., 1996], [Panchenko, 2002], [Bartlett and Mendelson, 2002], [Tsybakov, 2004], [Bartlett et al., 2005], [Sridharan et al., 2009], [Srebro et al., 2010], ...

11-19 Risk Bounds of Empirical Risk Minimization (the figure is built up over slides 11-19).
Stochastic convex optimization:
  Lipschitz: O(√(d/n)) [Shalev-Shwartz et al., 2009]
  Smooth: O(d/n) [Zhang et al., 2017]
  Strongly convex: O(1/(λn)) in expectation [Shalev-Shwartz et al., 2009]; O(d/n + 1/(λn)) [Zhang et al., 2017]
  Strongly convex & smooth: O(1/(λn²)) [Zhang et al., 2017]
Supervised learning:
  Lipschitz: O(1/√n) [Bartlett and Mendelson, 2002]
  Smooth: O(1/n) [Srebro et al., 2010]
  Strongly convex: O(1/(λn)) [Sridharan et al., 2009]
  Strongly convex & smooth: O(1/(λn²)) [Zhang et al., 2017]

20-25 Risk Bounds of Stochastic Approximation (the figure is built up over slides 20-25). Supervised learning:
  Lipschitz: O(1/√n) [Zinkevich, 2003]
  Smooth: O(1/n) [Srebro et al., 2010]
  Square or logistic loss: O(1/n) [Bach and Moulines, 2013]
  Strongly convex: O(1/(λn)) [Hazan and Kale, 2011]

26 Outline: Stochastic Gradient Descent

27 Stochastic Gradient Descent: The Algorithm
1: for t = 1, ..., T do
2:   w'_{t+1} = w_t − η_t g_t
3:   w_{t+1} = Π_W(w'_{t+1})
4: end for
5: return w̄_T = (1/T) Σ_{t=1}^T w_t
Requirement: E_t[g_t] = ∇F(w_t), where E_t[·] is the conditional expectation.

28 Stochastic Gradient Descent: The Algorithm (continued). Instantiation for stochastic optimization, min_{w ∈ W} F(w) = E_ξ[f(w, ξ)]: sample ξ_t from the underlying distribution and set g_t = ∇f(w_t, ξ_t).

29 Stochastic Gradient Descent: The Algorithm (continued). Instantiation for supervised learning, min_{h ∈ H} F(h) = E_{(x,y)}[l(h(x), y)]: sample (x_t, y_t) from the underlying distribution and set g_t = ∇l(h_t(x_t), y_t).
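A minimal sketch of this projected SGD scheme for a linear hypothesis class with the hinge loss, assuming W is a Euclidean ball of radius R and using synthetic data (the function name, data, and parameters are illustrative, not from the slides):

```python
# Projected SGD with averaging for a linear hypothesis and the hinge loss
# l(w; x, y) = max(0, 1 - y <w, x>), with W = {w : ||w||_2 <= R}.
import numpy as np

def projected_sgd_hinge(X, y, R=1.0, T=None, c=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    T = T or n
    eta = c / np.sqrt(T)                         # fixed step size eta = c / sqrt(T)
    w, avg = np.zeros(d), np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)                      # sample (x_t, y_t)
        x_t, y_t = X[i], y[i]
        # subgradient of the hinge loss at w_t
        g = -y_t * x_t if y_t * (x_t @ w) < 1 else np.zeros(d)
        w = w - eta * g                          # gradient step
        norm = np.linalg.norm(w)
        if norm > R:                             # projection onto the ball W
            w = w * (R / norm)
        avg += w
    return avg / T                               # averaged iterate w_bar_T

# toy usage on linearly separable synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = np.sign(X @ rng.normal(size=5))
w_hat = projected_sgd_hinge(X, y, R=2.0, T=2000)
print("training error:", np.mean(np.sign(X @ w_hat) != y))
```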

30 Outline: Stochastic Gradient Descent, Expectation Bound

31 Analysis I. For any w ∈ W, by convexity,
F(w_t) − F(w) ≤ ⟨∇F(w_t), w_t − w⟩
  = ⟨g_t, w_t − w⟩ + ⟨∇F(w_t) − g_t, w_t − w⟩    (denote the second term by δ_t)
  = (1/η_t) ⟨w_t − w'_{t+1}, w_t − w⟩ + δ_t
  = (1/(2η_t)) (||w_t − w||_2^2 − ||w'_{t+1} − w||_2^2 + ||w_t − w'_{t+1}||_2^2) + δ_t
  = (1/(2η_t)) (||w_t − w||_2^2 − ||w'_{t+1} − w||_2^2) + (η_t/2) ||g_t||_2^2 + δ_t
  ≤ (1/(2η_t)) (||w_t − w||_2^2 − ||w_{t+1} − w||_2^2) + (η_t/2) ||g_t||_2^2 + δ_t,
where the last step uses the non-expansiveness of the projection Π_W.

32 Analysis II. To simplify the above inequality, assume η_t = η and ||g_t||_2 ≤ G. Then
F(w_t) − F(w) ≤ (1/(2η)) (||w_t − w||_2^2 − ||w_{t+1} − w||_2^2) + (η/2) G^2 + δ_t.
Summing the inequalities over all iterations,
Σ_{t=1}^T F(w_t) − T F(w) ≤ (1/(2η)) ||w_1 − w||_2^2 + (ηT/2) G^2 + Σ_{t=1}^T δ_t.

33 Analysis III. Then, by the convexity of F,
F(w̄_T) − F(w) = F((1/T) Σ_{t=1}^T w_t) − F(w) ≤ (1/T) Σ_{t=1}^T (F(w_t) − F(w))
  ≤ (1/T) ((1/(2η)) ||w_1 − w||_2^2 + (ηT/2) G^2 + Σ_{t=1}^T δ_t)
  = (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) Σ_{t=1}^T δ_t.
In summary,
F(w̄_T) − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) Σ_{t=1}^T δ_t.

34 Expectation Bound. Under the condition E_t[g_t] = ∇F(w_t), it is easy to verify that
E[δ_t] = E[⟨∇F(w_t) − g_t, w_t − w⟩] = E_{1,...,t−1}[E_t[⟨∇F(w_t) − g_t, w_t − w⟩]] = 0.
Thus, taking expectations on both sides,
E[F(w̄_T)] − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2
  = (1/√T) ((1/(2c)) ||w_1 − w||_2^2 + (c/2) G^2)    with η = c/√T.
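As a worked step (my addition, not on the slide), minimizing the right-hand side over η explains the c/√T choice: the optimal step size balances the two terms and yields an O(1/√T) rate.

```latex
\[
  h(\eta) = \frac{\|w_1 - w\|_2^2}{2\eta T} + \frac{\eta}{2} G^2 ,
  \qquad
  h'(\eta^*) = 0 \;\Longrightarrow\; \eta^* = \frac{\|w_1 - w\|_2}{G\sqrt{T}} ,
  \qquad
  h(\eta^*) = \frac{G \, \|w_1 - w\|_2}{\sqrt{T}} .
\]
```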

35 Outline: Stochastic Gradient Descent, High-probability Bound

36 Martingale. A sequence of random variables Y_1, Y_2, Y_3, ... is a martingale if for any time t, E[|Y_t|] < ∞ and E[Y_{t+1} | Y_1, ..., Y_t] = Y_t.
An example: Y_t is a gambler's fortune after t tosses of a fair coin. Let δ_t be a random variable such that δ_t = 1 if the coin comes up heads and δ_t = −1 if it comes up tails. Then Y_t = Σ_{i=1}^t δ_i, and
E[Y_{t+1} | Y_1, ..., Y_t] = E[δ_{t+1} + Y_t | Y_1, ..., Y_t] = Y_t + E[δ_{t+1} | Y_1, ..., Y_t] = Y_t.

37 Martingale Difference. Suppose Y_1, Y_2, Y_3, ... is a martingale (e.g., the gambler's fortune above); then X_t = Y_t − Y_{t−1} is a martingale difference sequence:
E[X_{t+1} | X_1, ..., X_t] = E[Y_{t+1} − Y_t | X_1, ..., X_t] = 0.

38 Azuma's Inequality for Martingales. Suppose X_1, X_2, ... is a martingale difference sequence and |X_i| ≤ c_i almost surely. Then
Pr(Σ_{i=1}^n X_i ≥ t) ≤ exp(−t^2 / (2 Σ_{i=1}^n c_i^2)),   and   Pr(Σ_{i=1}^n X_i ≤ −t) ≤ exp(−t^2 / (2 Σ_{i=1}^n c_i^2)).
Corollary 1: assume c_i ≤ c. Then, with probability at least 1 − 2δ,
|Σ_{i=1}^n X_i| ≤ c √(2n log(1/δ)).
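A quick numerical illustration of Corollary 1 (my own check, not from the slides) on the fair-coin martingale above, where c_i = 1:

```python
# Empirical check of Corollary 1 on Y_n = sum of n fair +/-1 coin flips (c_i = 1):
# with probability at least 1 - 2*delta, |Y_n| <= sqrt(2 * n * log(1/delta)).
import numpy as np

rng = np.random.default_rng(0)
n, trials, delta = 1000, 20000, 0.05
Y_n = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)

bound = np.sqrt(2 * n * np.log(1 / delta))
violations = np.mean(np.abs(Y_n) > bound)
print(f"bound = {bound:.1f}, violation rate = {violations:.4f} (guarantee: <= {2 * delta})")
```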

39 High-probability Bound I. Assume ||g_t||_2 ≤ G and ||x − y||_2 ≤ D for all x, y ∈ W. Under the condition E_t[g_t] = ∇F(w_t), it is easy to verify that δ_t = ⟨∇F(w_t) − g_t, w_t − w⟩ is a martingale difference sequence with
|δ_t| ≤ ||∇F(w_t) − g_t||_2 ||w_t − w||_2 ≤ D (||∇F(w_t)||_2 + ||g_t||_2) ≤ 2GD,
where we use Jensen's inequality: ||∇F(w_t)||_2 = ||E_t[g_t]||_2 ≤ E_t[||g_t||_2] ≤ G.

40 High-probability Bound II. From Azuma's inequality (Corollary 1 with c = 2GD), with probability at least 1 − δ,
Σ_{t=1}^T δ_t ≤ 2GD √(2T log(1/δ)).
Thus, with probability at least 1 − δ,
F(w̄_T) − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) Σ_{t=1}^T δ_t
  ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + 2GD √(2 log(1/δ) / T)
  = (1/√T) ((1/(2c)) ||w_1 − w||_2^2 + (c/2) G^2 + 2GD √(2 log(1/δ)))    with η = c/√T
  = O(1/√T).

41 Outline: Epoch Gradient Descent

42 Strong Convexity. A general definition: F is λ-strongly convex if for all α ∈ [0, 1] and all w, w' ∈ W,
F(αw + (1 − α)w') ≤ α F(w) + (1 − α) F(w') − (λ/2) α(1 − α) ||w − w'||_2^2.
A more popular definition: F is λ-strongly convex if for all w, w' ∈ W,
F(w) + ⟨∇F(w), w' − w⟩ + (λ/2) ||w' − w||_2^2 ≤ F(w').

43 Strong Convexity (continued). An important property: if F is λ-strongly convex, then
(λ/2) ||w − w*||_2^2 ≤ F(w) − F(w*)   for all w ∈ W, where w* = argmin_{w ∈ W} F(w).
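A short derivation of this property (my addition): apply the second definition at w* and use the first-order optimality condition ⟨∇F(w*), w − w*⟩ ≥ 0 for all w ∈ W.

```latex
\[
  F(w) \;\ge\; F(w^*) + \langle \nabla F(w^*),\, w - w^* \rangle + \frac{\lambda}{2}\,\|w - w^*\|_2^2
       \;\ge\; F(w^*) + \frac{\lambda}{2}\,\|w - w^*\|_2^2 .
\]
```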

44 Epoch Gradient Descent: The Algorithm [Hazan and Kale, 2011]
1: T_1 = 4, η_1 = 1/λ, k = 1
2: while Σ_{i=1}^k T_i ≤ T do
3:   for t = 1, ..., T_k do
4:     w^k_{t+1} = Π_W(w^k_t − η_k g^k_t)
5:   end for
6:   w^{k+1}_1 = (1/T_k) Σ_{t=1}^{T_k} w^k_t
7:   η_{k+1} = η_k / 2, T_{k+1} = 2 T_k
8:   k = k + 1
9: end while

45 Epoch Gradient Descent (continued). Our goal: establish an O(1/T) excess risk bound.
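A minimal sketch of this epoch scheme (illustrative only; the gradient oracle, projection, and toy objective below are my assumptions, not from the slides):

```python
# Epoch-GD sketch: run projected SGD in epochs, halving the step size and
# doubling the epoch length, and restart each epoch from the previous average.
import numpy as np

def epoch_gd(grad_oracle, project, d, T, lam, T1=4):
    w = np.zeros(d)
    T_k, eta_k, used, k = T1, 1.0 / lam, 0, 1
    while used + T_k <= T:
        acc = np.zeros(d)
        for _ in range(T_k):
            acc += w                                   # average the epoch's iterates
            w = project(w - eta_k * grad_oracle(w))    # projected SGD step
        w = acc / T_k                                  # w_1^{k+1}
        used += T_k
        eta_k, T_k, k = eta_k / 2.0, 2 * T_k, k + 1
    return w

# toy usage: F(w) = E[||w - xi||^2 / 2] with xi ~ N(w_star, I), so lambda = 1
rng = np.random.default_rng(0)
d, lam = 5, 1.0
w_star = rng.normal(size=d)
grad_oracle = lambda w: w - (w_star + rng.normal(size=d))   # unbiased stochastic gradient
project = lambda w: w if np.linalg.norm(w) <= 10 else 10 * w / np.linalg.norm(w)
w_hat = epoch_gd(grad_oracle, project, d, T=4000, lam=lam)
print("distance to w*:", np.linalg.norm(w_hat - w_star))
```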

46 Outline: Epoch Gradient Descent, Expectation Bound

47 Analysis I. If we can show, for every epoch k,
E[F(w^k_1)] − F(w*) = O(1/(λ T_k)),
then the last solution satisfies E[F(w^k_1)] − F(w*) = O(1/(λ T_k)) = O(1/(λT)) for the final epoch k, since the final epoch length is a constant fraction of T.
So suppose E[F(w^k_1)] − F(w*) ≤ c G^2 / (λ T_k). From the analysis of SGD (applied within epoch k),
E[F(w^{k+1}_1)] − F(w*) ≤ (1/(2 η_k T_k)) E[||w^k_1 − w*||_2^2] + (η_k/2) G^2.

48 Analysis II. Recall that ||w^k_1 − w*||_2^2 ≤ (2/λ) (F(w^k_1) − F(w*)) by strong convexity. Then
E[F(w^{k+1}_1)] − F(w*) ≤ (1/(λ η_k T_k)) E[F(w^k_1) − F(w*)] + (η_k/2) G^2
  = (1/4) E[F(w^k_1) − F(w*)] + 2 G^2 / (λ T_k)    (since λ η_k T_k = 4, i.e., η_k = 4/(λ T_k))
  ≤ (c/4 + 2) G^2 / (λ T_k)
  = (c/2 + 4) G^2 / (λ T_{k+1})    (since T_{k+1} = 2 T_k)
  = c G^2 / (λ T_{k+1})    when c = 8.

49 Outline: Epoch Gradient Descent, High-probability Bound

50 What do We Need? Key inequality in the expectation bound:
E[F(w^{k+1}_1)] − F(w*) ≤ (1/(2 η_k T_k)) E[||w^k_1 − w*||_2^2] + (η_k/2) G^2.

51 What do We Need? (continued) A high-probability inequality based on Azuma's inequality:
F(w^{k+1}_1) − F(w*) ≤ (1/(2 η_k T_k)) ||w^k_1 − w*||_2^2 + (η_k/2) G^2 + 2GD √(2 log(1/δ) / T_k).

52 What do We Need? (continued) The inequality that we need:
F(w^{k+1}_1) − F(w*) ≤ (1/(2 η_k T_k)) ||w^k_1 − w*||_2^2 + (η_k/2) G^2 + O(1/(λ T_k)).

53 Concentration Inequality I. Bernstein's inequality for martingales: let X_1, ..., X_n be a bounded martingale difference sequence with respect to the filtration F = (F_i)_{1 ≤ i ≤ n}, with |X_i| ≤ K. Let S_i = Σ_{j=1}^i X_j be the associated martingale, and denote the sum of the conditional variances by
Σ²_n = Σ_{t=1}^n E[X_t^2 | F_{t−1}].
Then for all constants t, ν > 0,
Pr(max_{i=1,...,n} S_i > t and Σ²_n ≤ ν) ≤ exp(−t^2 / (2(ν + Kt/3))).

54 Concentration Inequality II. Corollary 2: suppose Σ²_n ≤ ν. Then, with probability at least 1 − δ,
S_n ≤ √(2ν log(1/δ)) + (2K/3) log(1/δ).
Note: √(2ν log(1/δ)) ≤ (1/(2α)) log(1/δ) + αν for any α > 0.
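The Note is just the AM-GM inequality √(ab) ≤ (a + b)/2; a one-line check (my addition):

```latex
\[
  \sqrt{2\nu \log\tfrac{1}{\delta}}
  \;=\; \sqrt{(2\alpha\nu)\cdot\tfrac{1}{\alpha}\log\tfrac{1}{\delta}}
  \;\le\; \frac{1}{2}\Big(2\alpha\nu + \frac{1}{\alpha}\log\tfrac{1}{\delta}\Big)
  \;=\; \alpha\nu + \frac{1}{2\alpha}\log\tfrac{1}{\delta},
  \qquad \forall\,\alpha > 0 .
\]
```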

55 The Basic Idea. For any w ∈ W, by λ-strong convexity,
F(w_t) − F(w) ≤ ⟨∇F(w_t), w_t − w⟩ − (λ/2) ||w_t − w||_2^2
  = ⟨g_t, w_t − w⟩ + ⟨∇F(w_t) − g_t, w_t − w⟩ − (λ/2) ||w_t − w||_2^2,
where the middle term is again δ_t. Then, following the same analysis as before,
F(w̄_T) − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) (Σ_{t=1}^T δ_t − (λ/2) Σ_{t=1}^T ||w_t − w||_2^2).
We are going to prove
Σ_{t=1}^T δ_t ≤ (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + O(log log T / λ).

56 Analysis I. δ_t = ⟨∇F(w_t) − g_t, w_t − w⟩ is a martingale difference sequence with
|δ_t| ≤ ||∇F(w_t) − g_t||_2 ||w_t − w||_2 ≤ D (||∇F(w_t)||_2 + ||g_t||_2) ≤ 2GD,
where D could be a prior parameter, or 2G/λ when w = w*. The sum of the conditional variances is upper bounded by
Σ²_T = Σ_{t=1}^T E_t[δ_t^2] ≤ Σ_{t=1}^T E_t[||∇F(w_t) − g_t||_2^2 ||w_t − w||_2^2] ≤ 4G^2 Σ_{t=1}^T ||w_t − w||_2^2.

57 Analysis II. A straightforward way: since Σ²_T ≤ 4G^2 Σ_{t=1}^T ||w_t − w||_2^2, following Bernstein's inequality for martingales, with probability at least 1 − δ,
Σ_{t=1}^T δ_t ≤ 2 √(4G^2 Σ_{t=1}^T ||w_t − w||_2^2 log(1/δ)) + (4/3) GD log(1/δ)
  ≤ (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + (8G^2/λ) log(1/δ) + (4/3) GD log(1/δ).

58 Analysis III. We need peeling, because Bernstein's inequality requires a deterministic bound ν on Σ²_T, while Σ_{t=1}^T ||w_t − w||_2^2 is random. Divide Σ_{t=1}^T ||w_t − w||_2^2 ∈ [0, TD^2] into a sequence of intervals.
The initial interval: Σ_{t=1}^T ||w_t − w||_2^2 ≤ D^2/T. Since |δ_t| ≤ ||∇F(w_t) − g_t||_2 ||w_t − w||_2 ≤ 2G ||w_t − w||_2, we have
Σ_{t=1}^T δ_t ≤ 2G Σ_{t=1}^T ||w_t − w||_2 ≤ 2G √(T Σ_{t=1}^T ||w_t − w||_2^2) ≤ 2GD.
The remaining intervals: Σ_{t=1}^T ||w_t − w||_2^2 ∈ (2^{i−1} D^2/T, 2^i D^2/T], for i = 1, ..., m with m = ⌈2 log_2 T⌉.

59 Analysis IV. Define A_T = Σ_{t=1}^T ||w_t − w||_2^2 ≤ T D^2. We are going to bound
Pr[Σ_{t=1}^T δ_t ≥ 2 √(4G^2 A_T τ) + (4/3) GD τ + 2GD].

60 Analysis V.
Pr[Σ_{t=1}^T δ_t ≥ 2√(4G^2 A_T τ) + (4/3) GD τ + 2GD]
 = Pr[Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4/3) GD τ + 2GD, A_T ≤ D^2/T]
   + Pr[Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4/3) GD τ + 2GD, D^2/T < A_T ≤ TD^2]
 = Pr[Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4/3) GD τ + 2GD, Σ²_T ≤ 4G^2 A_T, D^2/T < A_T ≤ TD^2]
   (the first term vanishes because Σ_t δ_t ≤ 2GD when A_T ≤ D^2/T, and Σ²_T ≤ 4G^2 A_T always holds)
 ≤ Σ_{i=1}^m Pr[Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4/3) GD τ + 2GD, Σ²_T ≤ 4G^2 A_T, (D^2/T) 2^{i−1} < A_T ≤ (D^2/T) 2^i]
 ≤ Σ_{i=1}^m Pr[Σ_t δ_t ≥ √(2 · 4G^2 (D^2/T) 2^i τ) + (4/3) GD τ, Σ²_T ≤ 4G^2 (D^2/T) 2^i]
 ≤ m e^{−τ},
where m = ⌈2 log_2 T⌉.

61 Analysis VI. By setting τ = log(m/δ), with probability at least 1 − δ,
Σ_{t=1}^T δ_t ≤ 2 √(4G^2 A_T τ) + (4/3) GD τ + 2GD
  = 2 √(4G^2 Σ_{t=1}^T ||w_t − w||_2^2 τ) + (4/3) GD τ + 2GD
  ≤ (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + (8G^2/λ) τ + (4/3) GD τ + 2GD
  = (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + O(log log T / λ),
where τ = log(m/δ) = log(2 log_2 T / δ) = O(log log T).

62-66 References
Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. (2012). Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5).
Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems 26.
Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33(4).
Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3.
Feldman, V. (2016). Generalization of ERM in stochastic convex optimization: The dimension strikes back. ArXiv e-prints.
Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3).
Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: An optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory.
Koren, T. and Levy, K. (2015). Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems 28.
Kushner, H. J. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer, second edition.
Lan, G. (2012). An optimal method for stochastic composite optimization. Mathematical Programming, 133.
Lee, W. S., Bartlett, P. L., and Williamson, R. C. (1996). The importance of convexity in learning with squared loss. In Proceedings of the 9th Annual Conference on Computational Learning Theory.
Mahdavi, M., Zhang, L., and Jin, R. (2015). Lower and upper bounds on the generalization of stochastic exponentially concave optimization. In Proceedings of the 28th Conference on Learning Theory.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4).
Nesterov, Y. (2011). Random gradient-free minimization of convex functions. CORE Discussion Papers.
Panchenko, D. (2002). Some extensions of an inequality of Vapnik and Chervonenkis. Electronic Communications in Probability, 7:55-65.
Rakhlin, A., Shamir, O., and Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning.
Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. (2009). Stochastic convex optimization. In Proceedings of the 22nd Annual Conference on Learning Theory.
Shamir, O. and Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning.
Srebro, N., Sridharan, K., and Tewari, A. (2010). Optimistic rates for learning with a smooth loss. ArXiv e-prints.
Sridharan, K., Shalev-Shwartz, S., and Srebro, N. (2009). Fast rates for regularized objectives. In Advances in Neural Information Processing Systems 21.
Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.
Zhang, L., Yang, T., and Jin, R. (2017). Empirical risk minimization for stochastic convex optimization: O(1/n)- and O(1/n²)-type of risk bounds. In Proceedings of the 30th Annual Conference on Learning Theory.
Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning.
