Accelerate Subgradient Methods


1 Accelerate Subgradient Methods. Tianbao Yang, Department of Computer Science, The University of Iowa. Contributors: students Yi Xu and Yan Yan, and colleague Qihang Lin. Yang (CS@Uiowa), Accelerate Subgradient Methods, 1 / 49.

2 Outline: 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

3 Introduction and Background — Convex Optimization: f* = min_{w∈Ω} f(w), where Ω ⊆ R^d is a convex set and f(w) is a convex function. Goal: for a sufficiently small ɛ > 0, find a solution w such that f(w) − f* ≤ ɛ.

4 Introduction and Background — Applications in Machine Learning: Classification and Regression: min_{w∈R^d} (1/n) Σ_{i=1}^n l(wᵀx_i, y_i) + λR(w), with training examples (x_i, y_i), i = 1, …, n, where x_i ∈ R^d, y_i ∈ {1, −1} (classification) or y_i ∈ R (regression); l(z, y) is a convex loss function w.r.t. z, and R(w) is a regularizer.

5 Introduction and Background — Examples. SVM: min_{w∈R^d} (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + (λ/2)‖w‖₂². LASSO: min_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − wᵀx_i)² + λ‖w‖₁. Many others (examples given later).

7 Introduction and Background — How to Solve the Optimization Problem Efficiently? Concern: running time (RT). Iterative algorithm: w_{t+1} = w_t + Δw_t; RT = (# of iterations) × (per-iteration RT). Iteration complexity: how many iterations T(ɛ) are needed in order to have f(ŵ_T) − f* ≤ ɛ?

9 Introduction and Background — Iteration Complexity of Convex Optimization. Properties of the objective function: smooth (the function is upper bounded by a quadratic function) and strongly convex (the function is lower bounded by a quadratic function). Minimax iteration complexity: smooth and strongly convex: O(log(1/ɛ)) (Accel. Gradient Method); smooth: O(1/√ɛ) (Accel. Gradient Method); strongly convex: O(1/ɛ) (SubGradient (SG) Method); non-smooth and non-strongly convex: O(1/ɛ²) (SG).

11 Introduction and Background — Non-smooth and Non-strongly Convex Optimization. Robust regression: min_w (1/n) Σ_{i=1}^n |wᵀx_i − y_i|^p, p ∈ [1, 2). Sparse classification: min_w (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + λ‖w‖₁.
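
Both objectives above are non-smooth but still admit cheap subgradient oracles. As a small illustration (my own sketch, not from the talk; the data and names are made up), a subgradient of the p = 1 robust regression objective takes only a few lines of numpy:

```python
import numpy as np

def robust_reg_subgrad(w, X, y):
    """A subgradient of f(w) = (1/n) * sum_i |w' x_i - y_i|  (p = 1).

    sign(0) = 0 is one valid choice inside the subdifferential [-1, 1]
    at a kink, which is all a subgradient method needs.
    """
    r = X @ w - y                       # residuals
    return X.T @ np.sign(r) / len(y)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_star = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_star                          # noiseless data: w_star is optimal
g = robust_reg_subgrad(w_star, X, y)    # all residuals are 0, so g = 0
```

At the optimum all residuals vanish, so the chosen subgradient is the zero vector, consistent with 0 being in the subdifferential.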

12 Introduction and Background — Deep Learning. (Figure-only slide.)

13 Introduction and Background — Subgradient Method. For f* = min_{w∈Ω} f(w), the SG method iterates w_{t+1} = Π_Ω[w_t − η_t ∂f(w_t)], where ∂f(w_t) is a subgradient and the step size is η_t ∝ 1/√t. Its iteration complexity O(1/ɛ²) is very slow for small ɛ.

14 Introduction and Background — Accelerating the Subgradient Method. For f* = min_{w∈Ω} f(w), one can exploit special structure/conditions of the objective. One example is the explicit max-structure f(w) = max_{u∈Ω₂} uᵀAw − φ(u). Nesterov's smoothing technique (Nesterov, 2005) replaces f by F_µ(w) = max_{u∈Ω₂} uᵀAw − φ(u) − (µ/2)‖u‖₂², and running AG on the smoothed problem yields O(1/ɛ) for the original problem.

15 Introduction and Background — Our Methodology for Accelerating SG. For f* = min_{w∈Ω} f(w), we explore the local structure around the optimal solution (a local error bound): ‖w − w*‖₂ ≤ c(f(w) − f*)^θ, θ ∈ (0, 1], ∀w ∈ S_ɛ. This family is broad enough. Improved iteration complexity: Õ(1/ɛ^{2(1−θ)}); with the explicit max-structure: Õ(1/ɛ^{1−θ}).

16 Outline (next: A Key Error Inequality): 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

17 A Key Error Inequality — Some Notations. Optimal set: Ω* = {w ∈ Ω : f(w) = f*}. ɛ-level set: L_ɛ = {w ∈ Ω : f(w) − f* = ɛ}. ɛ-sublevel set: S_ɛ = {w ∈ Ω : f(w) − f* ≤ ɛ}. w*: the closest optimal solution to w, w* = arg min_{u∈Ω*} ‖u − w‖₂. w_ɛ: the closest solution to w in the ɛ-sublevel set, w_ɛ = arg min_{u∈S_ɛ} ‖u − w‖₂.

18 A Key Error Inequality — Assumptions: (i) there exist w₀ and ɛ₀ such that f(w₀) − f* ≤ ɛ₀; (ii) there exists G such that ‖∂f(w)‖₂ ≤ G; (iii) Ω* is a non-empty compact set.

19 A Key Error Inequality — A Key Error Inequality. (Figure: along the line L from w through its closest optimal solution w*, convexity implies that for any w′ on the segment between w* and w, ‖w′ − w*‖₂/(f(w′) − f(w*)) ≥ ‖w − w*‖₂/(f(w) − f(w*)).)

20 A Key Error Inequality — A Key Error Inequality. From the figure, ‖w_ɛ − w*_ɛ‖₂/(f(w_ɛ) − f(w*_ɛ)) ≥ ‖w − w_ɛ‖₂/(f(w) − f(w_ɛ)). Since f(w_ɛ) − f(w*_ɛ) = ɛ, this gives the Key Error Inequality (Yang & Lin, 2016): ‖w − w_ɛ‖₂ ≤ (‖w_ɛ − w*_ɛ‖₂/ɛ)(f(w) − f(w_ɛ)).

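
The inequality can be sanity-checked numerically on a one-dimensional example. The sketch below (my own illustration, not from the talk) uses f(w) = |w|, for which the ɛ-sublevel set is [−ɛ, ɛ] and the optimal set is {0}, so for w > ɛ the closest sublevel point is w_ɛ = ɛ and ‖w_ɛ − w*_ɛ‖₂ = ɛ:

```python
import numpy as np

eps = 0.1
w = np.linspace(0.2, 2.0, 50)       # points outside the eps-sublevel set
w_eps = np.full_like(w, eps)        # closest point of S_eps to each w
w_eps_star = 0.0                    # closest optimal solution to w_eps

lhs = np.abs(w - w_eps)                                       # ||w - w_eps||
rhs = (np.abs(w_eps - w_eps_star) / eps) * (np.abs(w) - np.abs(w_eps))
# Key error inequality: lhs <= rhs; for this 1-D example it holds
# with equality, since |w_eps - w*_eps| / eps = 1.
```
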
22 Outline (next: Restarted Subgradient Method): 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

23 Restarted Subgradient Method — SG Method: ŵ_T = SG(w₀, η, T)
1: Input: the number of iterations T and the initial solution w₀ ∈ Ω
2: Let w₁ = w₀
3: for t = 1, …, T do
4:   compute a subgradient ∂f(w_t)
5:   update w_{t+1} = Π_Ω[w_t − η ∂f(w_t)]
6: end for
7: Output: ŵ_T = (1/T) Σ_{t=1}^T w_t
Convergence guarantee: for any w ∈ Ω, f(ŵ_T) − f(w) ≤ ηG²/2 + ‖w₀ − w‖₂²/(2ηT). Setting T = G²‖w₀ − w‖₂²/ɛ² and η = ɛ/G² gives f(ŵ_T) − f* ≤ ɛ.
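
A minimal numpy sketch of this SG subroutine (illustrative only; the subgradient and projection oracles are assumptions supplied by the caller):

```python
import numpy as np

def sg(subgrad, project, w0, eta, T):
    """Projected subgradient method with a fixed step size eta.

    Returns the averaged iterate, as in the convergence guarantee above.
    """
    w = w0.copy()
    avg = np.zeros_like(w0)
    for _ in range(T):
        w = project(w - eta * subgrad(w))
        avg += w
    return avg / T

# Example: f(w) = |w| over Omega = [-1, 1] (optimum w* = 0, G = 1)
subg = lambda w: np.sign(w)
proj = lambda w: np.clip(w, -1.0, 1.0)
w_hat = sg(subg, proj, np.array([0.9]), eta=0.01, T=2000)
```

With a fixed step the iterates descend to the kink and then oscillate with amplitude ≈ η around it, which is why the averaged output ŵ_T, not the last iterate, carries the guarantee.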

24 Restarted Subgradient Method — RSG Method: w_K = RSG(w₀, t, K)
1: Input: the number of stages K, the number of iterations t per stage, and w₀ ∈ Ω
2: Set η₁ = ɛ₀/(2G²), where ɛ₀ is from our assumption
3: for k = 1, …, K do
4:   call the subroutine SG to obtain w_k = SG(w_{k−1}, η_k, t)
5:   set η_{k+1} = η_k/2
6: end for
7: Output: w_K
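
RSG is a thin wrapper around the SG subroutine: warm-start each stage from the previous output and halve the step size. A self-contained numpy sketch (illustrative; it inlines the fixed-step SG stage):

```python
import numpy as np

def rsg(subgrad, project, w0, eps0, G, t, K):
    """Restarted SG: stage k runs fixed-step SG for t steps, then halves eta."""
    w = w0.copy()
    eta = eps0 / (2 * G ** 2)          # eta_1, from f(w0) - f* <= eps0
    for _ in range(K):
        cur, avg = w.copy(), np.zeros_like(w0)
        for _ in range(t):             # inner SG subroutine with fixed eta
            cur = project(cur - eta * subgrad(cur))
            avg += cur
        w = avg / t                    # averaged iterate warm-starts next stage
        eta /= 2.0
    return w

# Example: f(w) = |w| on [-2, 2]; the epigraph is polyhedral, so theta = 1
subg = lambda w: np.sign(w)
proj = lambda w: np.clip(w, -2.0, 2.0)
w_K = rsg(subg, proj, np.array([1.5]), eps0=2.0, G=1.0, t=200, K=8)
```

On this polyhedral example each stage roughly halves the error, matching the linear convergence the theory predicts for θ = 1.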

25 Restarted Subgradient Method — A General Convergence Result for RSG. Distance between the ɛ-level set and the optimal set: B_ɛ = max_{w∈L_ɛ} ‖w − w*‖₂. Convergence of RSG: if t ≥ 4G²B_ɛ²/ɛ² and K = ⌈log₂(ɛ₀/ɛ)⌉, then f(w_K) − f* ≤ 2ɛ. Iteration complexity of RSG: O((G²B_ɛ²/ɛ²) log(ɛ₀/ɛ)).

26 Restarted Subgradient Method — Comparison with SG. Iteration complexity of RSG: O((G²B_ɛ²/ɛ²) log(ɛ₀/ɛ)). Iteration complexity of SG: O(G²‖w₀ − w*‖₂²/ɛ²). SG depends on the distance from the initial solution to the optimal set; RSG depends on the distance from the ɛ-level set to the optimal set, and depends only logarithmically on the quality ɛ₀ of the initial solution.

27 Restarted Subgradient Method — Improved Convergence under a Local Error Bound Condition. Iteration complexity: O((G²B_ɛ²/ɛ²) log(ɛ₀/ɛ)), where B_ɛ = max_{w∈L_ɛ} ‖w − w*‖₂. Local error bound: ‖w − w*‖₂ ≤ c(f(w) − f*)^θ, θ ∈ (0, 1], ∀w ∈ S_ɛ. This implies B_ɛ ≤ cɛ^θ, and hence iteration complexity O((G²c²/ɛ^{2(1−θ)}) log(ɛ₀/ɛ)). (Figure: growth curves F(x) = x for θ = 1, x^{1.5} for θ = 2/3, and x² for θ = 0.5.)
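
The two extreme cases of the condition can be checked numerically. This small sketch (my own illustration) uses f(w) = |w| (θ = 1, c = 1) and f(w) = w² (θ = 1/2, c = 1), both minimized at w* = 0:

```python
import numpy as np

ws = np.linspace(-1.0, 1.0, 201)
dist = np.abs(ws)                 # ||w - w*|| with w* = 0

# f(w) = |w|: f(w) - f* = |w|, so dist = (f - f*)^1     -> theta = 1
gap_abs = np.abs(ws)
# f(w) = w^2: f(w) - f* = w^2, so dist = (f - f*)^(1/2) -> theta = 1/2
gap_sq = ws ** 2
```

The sharper (larger) θ is, the faster f grows away from the optimal set, and the better the resulting complexity O(1/ɛ^{2(1−θ)}).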

28 Restarted Subgradient Method — Polyhedral Convex Optimization. Linear convergence when the epigraph is a polyhedron (θ = 1). Examples: robust regression min_w (1/n) Σ_{i=1}^n |wᵀx_i − y_i| and sparse classification min_w (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + λ‖w‖₁.

30 Restarted Subgradient Method — Locally Semi-strongly Convex Functions: ‖w − w*‖₂ ≤ c(f(w) − f*)^{1/2}, ∀w ∈ S_ɛ; iteration complexity Õ(c²G²/ɛ). Examples: robust regression min_w (1/n) Σ_{i=1}^n |wᵀx_i − y_i|^p with p ∈ (1, 2); l₁-regularized problems f(w) = h(Aw) + λ‖w‖₁, where h(·) is strongly convex on any compact set; the Huber loss l_δ(z) = z²/2 if |z| ≤ δ, and δ(|z| − δ/2) otherwise.

32 Restarted Subgradient Method — Why RSG Converges Faster. The step size decreases in a stage-wise manner (sounds familiar?), and each stage warm-starts from the previous one.

33 Restarted Subgradient Method — Proof of RSG. The proof is very simple given the key inequality. Let ɛ_k = ɛ₀/2^k, so that η_k = ɛ_k/G². By induction, assume f(w_{k−1}) − f* ≤ ɛ_{k−1} + ɛ. Apply the convergence result of SG to the k-th stage,
f(w_k) − f(w_{k−1,ɛ}) ≤ η_k G²/2 + ‖w_{k−1} − w_{k−1,ɛ}‖₂²/(2η_k t),
and our key error inequality,
‖w_{k−1} − w_{k−1,ɛ}‖₂ ≤ (B_ɛ/ɛ)(f(w_{k−1}) − f(w_{k−1,ɛ})) ≤ (B_ɛ/ɛ)ɛ_{k−1} = (B_ɛ/ɛ)·2ɛ_k.
Together these give f(w_k) − f(w_{k−1,ɛ}) ≤ ɛ_k/2 + 2G²B_ɛ²ɛ_k/(ɛ²t) ≤ ɛ_k whenever t ≥ 4G²B_ɛ²/ɛ², which completes the induction since f(w_{k−1,ɛ}) − f* ≤ ɛ.

34 Outline (next: Homotopy Smoothing): 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

35 Homotopy Smoothing — Nesterov's Smoothing Improves SG. Explicit max-structure: f(w) = max_{u∈Ω₂} uᵀAw − φ(u). Nesterov's smoothing technique introduces a smoothing parameter µ: F_µ(w) = max_{u∈Ω₂} uᵀAw − φ(u) − (µ/2)‖u‖₂², which is an L_µ = ‖A‖²/µ-smooth function. Assume max_{u∈Ω₂} ‖u‖₂ ≤ D.

37 Homotopy Smoothing — Accelerated Gradient Method for a Smooth Function: w_T = AG(w₀, t, L_µ)
1: Input: the number of iterations t and the initial solution w₀ ∈ Ω
2: Let t₀ = t₁ = 1 and w₁ = w₀
3: for k = 1, …, t do
4:   compute y_k = w_k + ((t_{k−1} − 1)/t_k)(w_k − w_{k−1})
5:   compute w_{k+1} = arg min_{w∈Ω} { wᵀ∇F_µ(y_k) + (L_µ/2)‖w − y_k‖₂² }
6:   compute t_{k+1} = (1 + √(1 + 4t_k²))/2
7: end for
8: Output: w_{t+1}
Convergence of AG: for any w ∈ Ω, f(w_T) − f(w) ≤ µD²/2 + 2‖A‖²‖w₀ − w‖₂²/(µT²), where the first term is the approximation error introduced by smoothing.

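
A compact numpy sketch of this AG scheme (illustrative; the gradient and projection oracles are assumptions supplied by the caller, and the momentum weights follow t_{k+1} = (1 + √(1 + 4t_k²))/2):

```python
import numpy as np

def accel_gradient(grad, project, w0, L, t):
    """Nesterov's accelerated gradient method for an L-smooth objective.

    grad(w) is the gradient oracle; project is Euclidean projection onto Omega
    (the arg-min step reduces to a projected gradient step on y_k).
    """
    w_prev = w = w0.copy()
    t_prev = t_cur = 1.0
    for _ in range(t):
        y = w + (t_prev - 1.0) / t_cur * (w - w_prev)   # extrapolation point
        w_prev = w
        w = project(y - grad(y) / L)                    # gradient step at y
        t_prev, t_cur = t_cur, (1.0 + np.sqrt(1.0 + 4.0 * t_cur ** 2)) / 2.0
    return w

# Example: smooth quadratic f(w) = 0.5 * ||w||^2 (L = 1), unconstrained
quad_grad = lambda w: w
ident = lambda w: w
w_t = accel_gradient(quad_grad, ident, np.array([5.0, -3.0]), L=1.0, t=50)
```

On this quadratic with exact smoothness constant, a single full gradient step already lands on the minimizer, so the remaining iterations stay there.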
39 Homotopy Smoothing — Homotopy Smoothing (HOPS) (Xu et al., 2016b)
1: Input: the number of stages K, the number of iterations t per stage, and the initial solution w₀ ∈ Ω₁
2: Let µ₁ = ɛ₀/(2D²)
3: for s = 1, …, K do
4:   let w_s = AG(w_{s−1}, t, L_{µ_s})
5:   update µ_{s+1} = µ_s/2
6: end for
7: Output: w_K
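
The following self-contained sketch (my illustration, not the authors' code) instantiates HOPS for f(w) = Σ_i |w_i| = max_{‖u‖_∞≤1} uᵀw, whose smoothed version F_µ is the Huber function with gradient clip(w/µ, −1, 1) and smoothness L_µ = 1/µ; each stage runs AG on F_µ and then halves µ:

```python
import numpy as np

def smoothed_abs_grad(w, mu):
    # Gradient of F_mu(w) = max_{||u||_inf <= 1} u'w - (mu/2)||u||^2,
    # i.e. the (coordinate-wise) Huber smoothing of |w|.
    return np.clip(w / mu, -1.0, 1.0)

def hops(w0, eps0, D, t, K):
    """HOPS sketch for f(w) = sum_i |w_i| (unconstrained, illustrative):
    run AG on F_mu for t iterations, halve mu, and warm-start each stage."""
    w = w0.copy()
    mu = eps0 / (2 * D ** 2)
    for _ in range(K):
        L = 1.0 / mu                               # smoothness of F_mu
        w_prev, t_prev, t_cur = w.copy(), 1.0, 1.0
        for _ in range(t):                         # inner AG stage
            y = w + (t_prev - 1.0) / t_cur * (w - w_prev)
            w_prev = w
            w = y - smoothed_abs_grad(y, mu) / L
            t_prev, t_cur = t_cur, (1.0 + np.sqrt(1.0 + 4.0 * t_cur ** 2)) / 2.0
        mu /= 2.0                                  # homotopy on the smoothing
    return w

w_K = hops(np.array([2.0, -1.0]), eps0=2.0, D=1.0, t=100, K=6)
```

The homotopy keeps µ large (and L_µ small) early, when high accuracy is not needed, and only pays for a stiff smoothed problem in the final stages.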

40 Homotopy Smoothing — Improved Convergence under the Local Error Bound Condition. Local error bound: ‖w − w*‖₂ ≤ c(f(w) − f*)^θ, θ ∈ (0, 1], ∀w ∈ S_ɛ. This implies an iteration complexity of O((cD‖A‖/ɛ^{1−θ}) log(ɛ₀/ɛ)).

41 Outline (next: Accelerated Stochastic Subgradient Method): 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

42 Accelerated Stochastic Subgradient Method — Stochastic Subgradient (SSG) Method: min_{w∈Ω} F(w) ≜ E_ξ[f(w; ξ)]. SSG iterates w_{t+1} = Π_Ω[w_t − η_t ∂f(w_t; ξ_t)] with η_t ∝ 1/√t, and is more scalable for large-scale problems. Example: empirical risk minimization min_w (1/n) Σ_{i=1}^n l(wᵀx_i, y_i) + λR(w).

43 Accelerated Stochastic Subgradient Method — Accelerated Stochastic Subgradient (ASSG) Method. Do the same tricks suffice (stage-wise decreasing step size and warm-start)? Not in theory: the SSG guarantee E[f(ŵ_T)] − f(w) ≤ ηG²/2 + E[‖w₀ − w‖₂²]/(2ηT) is only an expectation bound, and an expectation bound does not work for the stage-wise restart argument.

45 Accelerated Stochastic Subgradient Method — ASSG. Use a high-probability analysis instead: f(ŵ_T) − f(w) ≤ ηG²/2 + ‖w₀ − w‖₂²/(2ηT) + a deviation term (the variance of SSG) involving (1/T) Σ_{t=1}^T ‖w_t − w‖₂. Another trick controls this term: domain shrinking (Xu et al., 2016a).

46 Accelerated Stochastic Subgradient Method — ASSG. In the k-th stage, SSG iterates w^k_{t+1} = Π_{Ω∩B(w_{k−1}, D_k)}[w^k_t − η_k ∂f(w^k_t; ξ^k_t)], where B(w, D) is a ball centered at w with radius D and D_k is decreased by half every stage. With high probability 1 − δ, the iteration complexity is Õ(c²G² log(1/δ)/ɛ^{2(1−θ)}).
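
A self-contained numpy sketch of the stage structure (my illustration; the noisy oracle, parameters, and the unconstrained Ω are assumptions, and the ball projection is written out explicitly):

```python
import numpy as np

def assg(stoch_subgrad, w0, eta1, D1, t, K, rng):
    """ASSG sketch: stage-wise SSG where iterates are projected back onto a
    ball B(center, D_k) around the previous stage's output; eta and D halve."""
    w_stage = w0.copy()
    eta, D = eta1, D1
    for _ in range(K):
        center = w_stage.copy()
        w, avg = w_stage.copy(), np.zeros_like(w0)
        for _ in range(t):
            w = w - eta * stoch_subgrad(w, rng)
            r = np.linalg.norm(w - center)     # project onto B(center, D)
            if r > D:
                w = center + (w - center) * (D / r)
            avg += w
        w_stage = avg / t                      # averaged iterate, warm-start
        eta /= 2.0
        D /= 2.0                               # shrink the domain
    return w_stage

# Example: noisy subgradient of f(w) = |w| with zero-mean Gaussian noise
rng = np.random.default_rng(0)
noisy = lambda w, rng: np.sign(w) + 0.5 * rng.standard_normal(w.shape)
w_K = assg(noisy, np.array([1.0]), eta1=0.5, D1=1.0, t=500, K=6, rng=rng)
```

Shrinking the ball keeps every iterate close to the stage center, which is exactly what bounds the ‖w_t − w‖₂ terms in the high-probability analysis.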

47 Accelerated Stochastic Subgradient Method — Domain Shrinking. Domain shrinking mitigates the variance in the stochastic subgradient. (Figure: ASSG vs. SSG.)

48 Outline (next: Experiments): 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

49 Experiments — RSG: Convergence on robust regression min_{w∈R^d} (1/n) Σ_{i=1}^n |wᵀx_i − y_i|^p for p = 1 and p = 1.5. (Figures: log(level gap) vs. number of stages for t = 10², 10³, 10⁴.)

50 Experiments — RSG vs. SG on robust regression min_{w∈R^d} (1/n) Σ_{i=1}^n |wᵀx_i − y_i|^p, p = 1 and p = 1.5. (Figures: log(objective gap) vs. number of iterations for SG, RSG (t = 10³), R²SG (t₁ = 10³), and RSG (t = 10⁴).)

51 Experiments — RSG vs. SG on graph-lasso regularized SVM min_{w∈R^d} (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + λ‖Fw‖₁. (Figures: objective vs. number of iterations for SG, ADMM, and R²SG, and for two different initial solutions.)

52 Experiments — HOPS vs. Smoothing.
Table: Comparison of different optimization algorithms by the number of iterations and running time for achieving a solution that satisfies F(x) − F* ≤ ɛ.
           Sparse Classification              Matrix Decomposition
           ɛ = 10⁻⁴        ɛ = 10⁻⁵          ɛ = 10⁻³        ɛ = 10⁻⁴
PD         526 (4.48s)     1777 (13.91s)     2523 (7.57s)    3441 (9.82s)
APG-D      591 (5.63s)     1122 (10.61s)     1967 (10.30s)   8622 (44.90s)
APG-F      573 (5.12s)     943 (8.51s)       1115 (3.25s)    4151 (11.82s)
HOPS-D     501 (4.39s)     873 (8.35s)       224 (1.54s)     313 (2.16s)
HOPS-F     490 (4.38s)     868 (7.81s)       230 (0.90s)     312 (1.23s)
PD-HOPS    427 (3.41s)     609 (4.87s)       124 (0.48s)     162 (0.66s)

53 Experiments — ASSG vs. SSG on the Million Songs data (n = 463,715). (Figures: objective vs. number of iterations for robust regression with p = 1 and p = 1.5, comparing SG, ASSG (t = 10⁵), and RASSG (t₁ = 10⁴).)

54 Experiments — ASSG vs. SSG with hinge loss + l₁ regularizer on the Covtype data (n = 581,012). (Figure: log₁₀(F(w_t) − F*) vs. number of iterations for SG, ASSG, and RASSG.)

55 Outline (next: Conclusion): 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

56 Conclusion. We developed: a key error inequality that can accelerate many algorithms, including stochastic momentum methods and stochastic Nesterov's accelerated gradient methods (Yang et al., 2016); a restarted subgradient (RSG) method with faster convergence than SG; a homotopy smoothing (HOPS) algorithm with even faster convergence; and an accelerated stochastic subgradient (ASSG) method. Preliminary experiments show very promising results.

57 Conclusion — Thank You! Questions?

58 Conclusion — References.
Nesterov, Yu. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
Xu, Yi, Lin, Qihang, and Yang, Tianbao. Accelerate stochastic subgradient method by leveraging local error bound. CoRR, 2016a.
Xu, Yi, Yan, Yan, Lin, Qihang, and Yang, Tianbao. Homotopy smoothing for non-smooth problems with lower complexity than O(1/ɛ). CoRR, 2016b.
Yang, Tianbao and Lin, Qihang. RSG: Beating SGD without smoothness and/or strong convexity. CoRR, abs/1512.03107, 2016.
Yang, Tianbao, Lin, Qihang, and Li, Zhe. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. CoRR, 2016.


More information

Warm up. Regrade requests submitted directly in Gradescope, do not instructors.

Warm up. Regrade requests submitted directly in Gradescope, do not  instructors. Warm up Regrade requests submitted directly in Gradescope, do not email instructors. 1 float in NumPy = 8 bytes 10 6 2 20 bytes = 1 MB 10 9 2 30 bytes = 1 GB For each block compute the memory required

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Adaptive Gradient Methods AdaGrad / Adam

Adaptive Gradient Methods AdaGrad / Adam Case Study 1: Estimating Click Probabilities Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 The Problem with GD (and SGD)

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Stochastic Subgradient Method

Stochastic Subgradient Method Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs E0 270 Machine Learning Lecture 5 (Jan 22, 203) Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in

More information

Randomized Smoothing for Stochastic Optimization

Randomized Smoothing for Stochastic Optimization Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

A Unified Analysis of Stochastic Momentum Methods for Deep Learning

A Unified Analysis of Stochastic Momentum Methods for Deep Learning A Unified Analysis of Stochastic Momentum Methods for Deep Learning Yan Yan,2, Tianbao Yang 3, Zhe Li 3, Qihang Lin 4, Yi Yang,2 SUSTech-UTS Joint Centre of CIS, Southern University of Science and Technology

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

Smoothing Proximal Gradient Method. General Structured Sparse Regression

Smoothing Proximal Gradient Method. General Structured Sparse Regression for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:

More information

Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang

Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity Noname manuscript No. (will be inserted by the editor) Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity Jason D. Lee Qihang Lin Tengyu Ma Tianbao

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Robust high-dimensional linear regression: A statistical perspective

Robust high-dimensional linear regression: A statistical perspective Robust high-dimensional linear regression: A statistical perspective Po-Ling Loh University of Wisconsin - Madison Departments of ECE & Statistics STOC workshop on robustness and nonconvexity Montreal,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction

More information

Statistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003

Statistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003 Statistical Approaches to Learning and Discovery Week 4: Decision Theory and Risk Minimization February 3, 2003 Recall From Last Time Bayesian expected loss is ρ(π, a) = E π [L(θ, a)] = L(θ, a) df π (θ)

More information

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Lee, Ching-pei University of Illinois at Urbana-Champaign Joint work with Dan Roth ICML 2015 Outline Introduction Algorithm Experiments

More information

Adaptive Probabilities in Stochastic Optimization Algorithms

Adaptive Probabilities in Stochastic Optimization Algorithms Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights

More information

Support Vector Machines for Classification and Regression

Support Vector Machines for Classification and Regression CIS 520: Machine Learning Oct 04, 207 Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may

More information

Gradient Sliding for Composite Optimization

Gradient Sliding for Composite Optimization Noname manuscript No. (will be inserted by the editor) Gradient Sliding for Composite Optimization Guanghui Lan the date of receipt and acceptance should be inserted later Abstract We consider in this

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Adaptive restart of accelerated gradient methods under local quadratic growth condition

Adaptive restart of accelerated gradient methods under local quadratic growth condition Adaptive restart of accelerated gradient methods under local quadratic growth condition Olivier Fercoq Zheng Qu September 6, 08 arxiv:709.0300v [math.oc] 7 Sep 07 Abstract By analyzing accelerated proximal

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Estimators based on non-convex programs: Statistical and computational guarantees

Estimators based on non-convex programs: Statistical and computational guarantees Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

Oslo Class 2 Tikhonov regularization and kernels

Oslo Class 2 Tikhonov regularization and kernels RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n

More information

Stochastic Optimization Part I: Convex analysis and online stochastic optimization

Stochastic Optimization Part I: Convex analysis and online stochastic optimization Stochastic Optimization Part I: Convex analysis and online stochastic optimization Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical

More information

l 1 and l 2 Regularization

l 1 and l 2 Regularization David Rosenberg New York University February 5, 2015 David Rosenberg (New York University) DS-GA 1003 February 5, 2015 1 / 32 Tikhonov and Ivanov Regularization Hypothesis Spaces We ve spoken vaguely about

More information

RegML 2018 Class 2 Tikhonov regularization and kernels

RegML 2018 Class 2 Tikhonov regularization and kernels RegML 2018 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT June 17, 2018 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n = (x i,

More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

Adaptive Restarting for First Order Optimization Methods

Adaptive Restarting for First Order Optimization Methods Adaptive Restarting for First Order Optimization Methods Nesterov method for smooth convex optimization adpative restarting schemes step-size insensitivity extension to non-smooth optimization continuation

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning

More information

Optimization and Gradient Descent

Optimization and Gradient Descent Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function

More information

Lasso: Algorithms and Extensions

Lasso: Algorithms and Extensions ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

More information

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725 Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Large Scale Semi-supervised Linear SVMs. University of Chicago

Large Scale Semi-supervised Linear SVMs. University of Chicago Large Scale Semi-supervised Linear SVMs Vikas Sindhwani and Sathiya Keerthi University of Chicago SIGIR 2006 Semi-supervised Learning (SSL) Motivation Setting Categorize x-billion documents into commercial/non-commercial.

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data

Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data DD2424 March 23, 2017 Binary classification problem given labelled training data Have labelled training examples? Given

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Algorithmic Stability and Generalization Christoph Lampert

Algorithmic Stability and Generalization Christoph Lampert Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts

More information

Optimization Methods for Machine Learning

Optimization Methods for Machine Learning Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information