Accelerate Subgradient Methods


1 Accelerate Subgradient Methods. Tianbao Yang, Department of Computer Science, The University of Iowa. Contributors: students Yi Xu and Yan Yan, and colleague Qihang Lin. Yang (CS@Uiowa), Accelerate Subgradient Methods, 1 / 49.

2 Outline: 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

3 Introduction and Background — Convex Optimization: f* = min_{w∈Ω} f(w), where Ω ⊆ R^d is a convex set and f(w) is a convex function. Goal: for a sufficiently small ɛ > 0, find a solution w such that f(w) − f* ≤ ɛ.

4 Introduction and Background — Applications in Machine Learning: Classification and Regression: min_{w∈R^d} (1/n) Σ_{i=1}^n l(wᵀx_i, y_i) + λR(w), with training examples (x_i, y_i), i = 1, …, n, where x_i ∈ R^d, y_i ∈ {1, −1} (classification) or y_i ∈ R (regression); l(z, y) is a convex loss function w.r.t. z, and R(w) is a regularizer.

5 Introduction and Background — Examples. SVM: min_{w∈R^d} (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + (λ/2)‖w‖₂². LASSO: min_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − wᵀx_i)² + λ‖w‖₁. Many others (examples given later).

7 Introduction and Background — How to Solve the Optimization Problem Efficiently? Concern: running time (RT). Iterative algorithm: w_{t+1} = w_t + Δw_t; RT = (# of iterations) × (per-iteration RT). Iteration complexity: how many iterations T(ɛ) are needed in order to have f(ŵ_T) − f* ≤ ɛ?

9 Introduction and Background — Iteration Complexity of Convex Optimization. Properties of the objective function: smooth (the function is upper bounded by a quadratic function) and strongly convex (the function is lower bounded by a quadratic function). Minimax iteration complexity: smooth and strongly convex: O(log(1/ɛ)) (Accel. Gradient Method); smooth: O(1/√ɛ) (Accel. Gradient Method); strongly convex: O(1/ɛ) (SubGradient (SG) Method); non-smooth and non-strongly convex: O(1/ɛ²) (SG).

11 Introduction and Background — Non-smooth and Non-strongly Convex Optimization. Robust regression: min_w (1/n) Σ_{i=1}^n |wᵀx_i − y_i|^p, p ∈ [1, 2). Sparse classification: min_w (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + λ‖w‖₁.
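
Both objectives above are non-smooth but still admit cheap subgradient oracles. As a small illustration (my own sketch, not from the talk; the data and names are made up), a subgradient of the p = 1 robust regression objective takes only a few lines of numpy:

```python
import numpy as np

def robust_reg_subgrad(w, X, y):
    """A subgradient of f(w) = (1/n) * sum_i |w' x_i - y_i|  (p = 1).

    sign(0) = 0 is one valid choice inside the subdifferential [-1, 1]
    at a kink, which is all a subgradient method needs.
    """
    r = X @ w - y                       # residuals
    return X.T @ np.sign(r) / len(y)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_star = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_star                          # noiseless data: w_star is optimal
g = robust_reg_subgrad(w_star, X, y)    # all residuals are 0, so g = 0
```

At the optimum all residuals vanish, so the chosen subgradient is the zero vector, consistent with 0 being in the subdifferential.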

12 Introduction and Background — Deep Learning. (Figure-only slide.)

13 Introduction and Background — Subgradient Method. For f* = min_{w∈Ω} f(w), the SG method iterates w_{t+1} = Π_Ω[w_t − η_t ∂f(w_t)], where ∂f(w_t) is a subgradient and the step size is η_t ∝ 1/√t. Its iteration complexity O(1/ɛ²) is very slow for small ɛ.

14 Introduction and Background — Accelerating the Subgradient Method. For f* = min_{w∈Ω} f(w), one can exploit special structure/conditions of the objective. One example is the explicit max-structure f(w) = max_{u∈Ω₂} uᵀAw − φ(u). Nesterov's smoothing technique (Nesterov, 2005) replaces f by F_µ(w) = max_{u∈Ω₂} uᵀAw − φ(u) − (µ/2)‖u‖₂², and running AG on the smoothed problem yields O(1/ɛ) for the original problem.

15 Introduction and Background — Our Methodology for Accelerating SG. For f* = min_{w∈Ω} f(w), we explore the local structure around the optimal solution (a local error bound): ‖w − w*‖₂ ≤ c(f(w) − f*)^θ, θ ∈ (0, 1], ∀w ∈ S_ɛ. This family is broad enough. Improved iteration complexity: Õ(1/ɛ^{2(1−θ)}); with the explicit max-structure: Õ(1/ɛ^{1−θ}).

16 Outline (next: A Key Error Inequality): 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

17 A Key Error Inequality — Some Notations. Optimal set: Ω* = {w ∈ Ω : f(w) = f*}. ɛ-level set: L_ɛ = {w ∈ Ω : f(w) − f* = ɛ}. ɛ-sublevel set: S_ɛ = {w ∈ Ω : f(w) − f* ≤ ɛ}. w*: the closest optimal solution to w, w* = arg min_{u∈Ω*} ‖u − w‖₂. w_ɛ: the closest solution to w in the ɛ-sublevel set, w_ɛ = arg min_{u∈S_ɛ} ‖u − w‖₂.

18 A Key Error Inequality — Assumptions: (i) there exist w₀ and ɛ₀ such that f(w₀) − f* ≤ ɛ₀; (ii) there exists G such that ‖∂f(w)‖₂ ≤ G; (iii) Ω* is a non-empty compact set.

19 A Key Error Inequality — A Key Error Inequality. (Figure: along the line L from w through its closest optimal solution w*, convexity implies that for any w′ on the segment between w* and w, ‖w′ − w*‖₂/(f(w′) − f(w*)) ≥ ‖w − w*‖₂/(f(w) − f(w*)).)

20 A Key Error Inequality — A Key Error Inequality. From the figure, ‖w_ɛ − w*_ɛ‖₂/(f(w_ɛ) − f(w*_ɛ)) ≥ ‖w − w_ɛ‖₂/(f(w) − f(w_ɛ)). Since f(w_ɛ) − f(w*_ɛ) = ɛ, this gives the Key Error Inequality (Yang & Lin, 2016): ‖w − w_ɛ‖₂ ≤ (‖w_ɛ − w*_ɛ‖₂/ɛ)(f(w) − f(w_ɛ)).

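
The inequality can be sanity-checked numerically on a one-dimensional example. The sketch below (my own illustration, not from the talk) uses f(w) = |w|, for which the ɛ-sublevel set is [−ɛ, ɛ] and the optimal set is {0}, so for w > ɛ the closest sublevel point is w_ɛ = ɛ and ‖w_ɛ − w*_ɛ‖₂ = ɛ:

```python
import numpy as np

eps = 0.1
w = np.linspace(0.2, 2.0, 50)       # points outside the eps-sublevel set
w_eps = np.full_like(w, eps)        # closest point of S_eps to each w
w_eps_star = 0.0                    # closest optimal solution to w_eps

lhs = np.abs(w - w_eps)                                       # ||w - w_eps||
rhs = (np.abs(w_eps - w_eps_star) / eps) * (np.abs(w) - np.abs(w_eps))
# Key error inequality: lhs <= rhs; for this 1-D example it holds
# with equality, since |w_eps - w*_eps| / eps = 1.
```
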
22 Outline (next: Restarted Subgradient Method): 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

23 Restarted Subgradient Method — SG Method: ŵ_T = SG(w₀, η, T)
1: Input: the number of iterations T and the initial solution w₀ ∈ Ω
2: Let w₁ = w₀
3: for t = 1, …, T do
4:   compute a subgradient ∂f(w_t)
5:   update w_{t+1} = Π_Ω[w_t − η ∂f(w_t)]
6: end for
7: Output: ŵ_T = (1/T) Σ_{t=1}^T w_t
Convergence guarantee: for any w ∈ Ω, f(ŵ_T) − f(w) ≤ ηG²/2 + ‖w₀ − w‖₂²/(2ηT). Setting T = G²‖w₀ − w‖₂²/ɛ² and η = ɛ/G² gives f(ŵ_T) − f* ≤ ɛ.
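
A minimal numpy sketch of this SG subroutine (illustrative only; the subgradient and projection oracles are assumptions supplied by the caller):

```python
import numpy as np

def sg(subgrad, project, w0, eta, T):
    """Projected subgradient method with a fixed step size eta.

    Returns the averaged iterate, as in the convergence guarantee above.
    """
    w = w0.copy()
    avg = np.zeros_like(w0)
    for _ in range(T):
        w = project(w - eta * subgrad(w))
        avg += w
    return avg / T

# Example: f(w) = |w| over Omega = [-1, 1] (optimum w* = 0, G = 1)
subg = lambda w: np.sign(w)
proj = lambda w: np.clip(w, -1.0, 1.0)
w_hat = sg(subg, proj, np.array([0.9]), eta=0.01, T=2000)
```

With a fixed step the iterates descend to the kink and then oscillate with amplitude ≈ η around it, which is why the averaged output ŵ_T, not the last iterate, carries the guarantee.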

24 Restarted Subgradient Method — RSG Method: w_K = RSG(w₀, t, K)
1: Input: the number of stages K, the number of iterations t per stage, and w₀ ∈ Ω
2: Set η₁ = ɛ₀/(2G²), where ɛ₀ is from our assumption
3: for k = 1, …, K do
4:   call the subroutine SG to obtain w_k = SG(w_{k−1}, η_k, t)
5:   set η_{k+1} = η_k/2
6: end for
7: Output: w_K
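
RSG is a thin wrapper around the SG subroutine: warm-start each stage from the previous output and halve the step size. A self-contained numpy sketch (illustrative; it inlines the fixed-step SG stage):

```python
import numpy as np

def rsg(subgrad, project, w0, eps0, G, t, K):
    """Restarted SG: stage k runs fixed-step SG for t steps, then halves eta."""
    w = w0.copy()
    eta = eps0 / (2 * G ** 2)          # eta_1, from f(w0) - f* <= eps0
    for _ in range(K):
        cur, avg = w.copy(), np.zeros_like(w0)
        for _ in range(t):             # inner SG subroutine with fixed eta
            cur = project(cur - eta * subgrad(cur))
            avg += cur
        w = avg / t                    # averaged iterate warm-starts next stage
        eta /= 2.0
    return w

# Example: f(w) = |w| on [-2, 2]; the epigraph is polyhedral, so theta = 1
subg = lambda w: np.sign(w)
proj = lambda w: np.clip(w, -2.0, 2.0)
w_K = rsg(subg, proj, np.array([1.5]), eps0=2.0, G=1.0, t=200, K=8)
```

On this polyhedral example each stage roughly halves the error, matching the linear convergence the theory predicts for θ = 1.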

25 Restarted Subgradient Method — A General Convergence Result for RSG. Distance between the ɛ-level set and the optimal set: B_ɛ = max_{w∈L_ɛ} ‖w − w*‖₂. Convergence of RSG: if t ≥ 4G²B_ɛ²/ɛ² and K = ⌈log₂(ɛ₀/ɛ)⌉, then f(w_K) − f* ≤ 2ɛ. Iteration complexity of RSG: O((G²B_ɛ²/ɛ²) log(ɛ₀/ɛ)).

26 Restarted Subgradient Method — Comparison with SG. Iteration complexity of RSG: O((G²B_ɛ²/ɛ²) log(ɛ₀/ɛ)). Iteration complexity of SG: O(G²‖w₀ − w*‖₂²/ɛ²). SG depends on the distance from the initial solution to the optimal set; RSG depends on the distance from the ɛ-level set to the optimal set, and depends only logarithmically on the quality ɛ₀ of the initial solution.

27 Restarted Subgradient Method — Improved Convergence under a Local Error Bound Condition. Iteration complexity: O((G²B_ɛ²/ɛ²) log(ɛ₀/ɛ)), where B_ɛ = max_{w∈L_ɛ} ‖w − w*‖₂. Local error bound: ‖w − w*‖₂ ≤ c(f(w) − f*)^θ, θ ∈ (0, 1], ∀w ∈ S_ɛ. This implies B_ɛ ≤ cɛ^θ, and hence iteration complexity O((G²c²/ɛ^{2(1−θ)}) log(ɛ₀/ɛ)). (Figure: growth curves F(x) = x for θ = 1, x^{1.5} for θ = 2/3, and x² for θ = 0.5.)
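
The two extreme cases of the condition can be checked numerically. This small sketch (my own illustration) uses f(w) = |w| (θ = 1, c = 1) and f(w) = w² (θ = 1/2, c = 1), both minimized at w* = 0:

```python
import numpy as np

ws = np.linspace(-1.0, 1.0, 201)
dist = np.abs(ws)                 # ||w - w*|| with w* = 0

# f(w) = |w|: f(w) - f* = |w|, so dist = (f - f*)^1     -> theta = 1
gap_abs = np.abs(ws)
# f(w) = w^2: f(w) - f* = w^2, so dist = (f - f*)^(1/2) -> theta = 1/2
gap_sq = ws ** 2
```

The sharper (larger) θ is, the faster f grows away from the optimal set, and the better the resulting complexity O(1/ɛ^{2(1−θ)}).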

28 Restarted Subgradient Method — Polyhedral Convex Optimization. Linear convergence when the epigraph is a polyhedron (θ = 1). Examples: robust regression min_w (1/n) Σ_{i=1}^n |wᵀx_i − y_i| and sparse classification min_w (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + λ‖w‖₁.

30 Restarted Subgradient Method — Locally Semi-strongly Convex Functions: ‖w − w*‖₂ ≤ c(f(w) − f*)^{1/2}, ∀w ∈ S_ɛ; iteration complexity Õ(c²G²/ɛ). Examples: robust regression min_w (1/n) Σ_{i=1}^n |wᵀx_i − y_i|^p with p ∈ (1, 2); l₁-regularized problems f(w) = h(Aw) + λ‖w‖₁, where h(·) is strongly convex on any compact set; the Huber loss l_δ(z) = z²/2 if |z| ≤ δ, and δ(|z| − δ/2) otherwise.

32 Restarted Subgradient Method — Why RSG Converges Faster. The step size decreases in a stage-wise manner (sounds familiar?), and each stage warm-starts from the previous one.

33 Restarted Subgradient Method — Proof of RSG. The proof is very simple given the key inequality. Let ɛ_k = ɛ₀/2^k, so that η_k = ɛ_k/G². By induction, assume f(w_{k−1}) − f* ≤ ɛ_{k−1} + ɛ. Apply the convergence result of SG to the k-th stage,
f(w_k) − f(w_{k−1,ɛ}) ≤ η_k G²/2 + ‖w_{k−1} − w_{k−1,ɛ}‖₂²/(2η_k t),
and our key error inequality,
‖w_{k−1} − w_{k−1,ɛ}‖₂ ≤ (B_ɛ/ɛ)(f(w_{k−1}) − f(w_{k−1,ɛ})) ≤ (B_ɛ/ɛ)ɛ_{k−1} = (B_ɛ/ɛ)·2ɛ_k.
Together these give f(w_k) − f(w_{k−1,ɛ}) ≤ ɛ_k/2 + 2G²B_ɛ²ɛ_k/(ɛ²t) ≤ ɛ_k whenever t ≥ 4G²B_ɛ²/ɛ², which completes the induction since f(w_{k−1,ɛ}) − f* ≤ ɛ.

34 Outline (next: Homotopy Smoothing): 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

35 Homotopy Smoothing — Nesterov's Smoothing Improves SG. Explicit max-structure: f(w) = max_{u∈Ω₂} uᵀAw − φ(u). Nesterov's smoothing technique introduces a smoothing parameter µ: F_µ(w) = max_{u∈Ω₂} uᵀAw − φ(u) − (µ/2)‖u‖₂², which is an L_µ = ‖A‖²/µ-smooth function. Assume max_{u∈Ω₂} ‖u‖₂ ≤ D.

37 Homotopy Smoothing — Accelerated Gradient Method for a Smooth Function: w_T = AG(w₀, t, L_µ)
1: Input: the number of iterations t and the initial solution w₀ ∈ Ω
2: Let t₀ = t₁ = 1 and w₁ = w₀
3: for k = 1, …, t do
4:   compute y_k = w_k + ((t_{k−1} − 1)/t_k)(w_k − w_{k−1})
5:   compute w_{k+1} = arg min_{w∈Ω} { wᵀ∇F_µ(y_k) + (L_µ/2)‖w − y_k‖₂² }
6:   compute t_{k+1} = (1 + √(1 + 4t_k²))/2
7: end for
8: Output: w_{t+1}
Convergence of AG: for any w ∈ Ω, f(w_T) − f(w) ≤ µD²/2 + 2‖A‖²‖w₀ − w‖₂²/(µT²), where the first term is the approximation error introduced by smoothing.

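
A compact numpy sketch of this AG scheme (illustrative; the gradient and projection oracles are assumptions supplied by the caller, and the momentum weights follow t_{k+1} = (1 + √(1 + 4t_k²))/2):

```python
import numpy as np

def accel_gradient(grad, project, w0, L, t):
    """Nesterov's accelerated gradient method for an L-smooth objective.

    grad(w) is the gradient oracle; project is Euclidean projection onto Omega
    (the arg-min step reduces to a projected gradient step on y_k).
    """
    w_prev = w = w0.copy()
    t_prev = t_cur = 1.0
    for _ in range(t):
        y = w + (t_prev - 1.0) / t_cur * (w - w_prev)   # extrapolation point
        w_prev = w
        w = project(y - grad(y) / L)                    # gradient step at y
        t_prev, t_cur = t_cur, (1.0 + np.sqrt(1.0 + 4.0 * t_cur ** 2)) / 2.0
    return w

# Example: smooth quadratic f(w) = 0.5 * ||w||^2 (L = 1), unconstrained
quad_grad = lambda w: w
ident = lambda w: w
w_t = accel_gradient(quad_grad, ident, np.array([5.0, -3.0]), L=1.0, t=50)
```

On this quadratic with exact smoothness constant, a single full gradient step already lands on the minimizer, so the remaining iterations stay there.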
39 Homotopy Smoothing — Homotopy Smoothing (HOPS) (Xu et al., 2016b)
1: Input: the number of stages K, the number of iterations t per stage, and the initial solution w₀ ∈ Ω₁
2: Let µ₁ = ɛ₀/(2D²)
3: for s = 1, …, K do
4:   let w_s = AG(w_{s−1}, t, L_{µ_s})
5:   update µ_{s+1} = µ_s/2
6: end for
7: Output: w_K
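
The following self-contained sketch (my illustration, not the authors' code) instantiates HOPS for f(w) = Σ_i |w_i| = max_{‖u‖_∞≤1} uᵀw, whose smoothed version F_µ is the Huber function with gradient clip(w/µ, −1, 1) and smoothness L_µ = 1/µ; each stage runs AG on F_µ and then halves µ:

```python
import numpy as np

def smoothed_abs_grad(w, mu):
    # Gradient of F_mu(w) = max_{||u||_inf <= 1} u'w - (mu/2)||u||^2,
    # i.e. the (coordinate-wise) Huber smoothing of |w|.
    return np.clip(w / mu, -1.0, 1.0)

def hops(w0, eps0, D, t, K):
    """HOPS sketch for f(w) = sum_i |w_i| (unconstrained, illustrative):
    run AG on F_mu for t iterations, halve mu, and warm-start each stage."""
    w = w0.copy()
    mu = eps0 / (2 * D ** 2)
    for _ in range(K):
        L = 1.0 / mu                               # smoothness of F_mu
        w_prev, t_prev, t_cur = w.copy(), 1.0, 1.0
        for _ in range(t):                         # inner AG stage
            y = w + (t_prev - 1.0) / t_cur * (w - w_prev)
            w_prev = w
            w = y - smoothed_abs_grad(y, mu) / L
            t_prev, t_cur = t_cur, (1.0 + np.sqrt(1.0 + 4.0 * t_cur ** 2)) / 2.0
        mu /= 2.0                                  # homotopy on the smoothing
    return w

w_K = hops(np.array([2.0, -1.0]), eps0=2.0, D=1.0, t=100, K=6)
```

The homotopy keeps µ large (and L_µ small) early, when high accuracy is not needed, and only pays for a stiff smoothed problem in the final stages.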

40 Homotopy Smoothing — Improved Convergence under the Local Error Bound Condition. Local error bound: ‖w − w*‖₂ ≤ c(f(w) − f*)^θ, θ ∈ (0, 1], ∀w ∈ S_ɛ. This implies an iteration complexity of O((cD‖A‖/ɛ^{1−θ}) log(ɛ₀/ɛ)).

41 Outline (next: Accelerated Stochastic Subgradient Method): 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

42 Accelerated Stochastic Subgradient Method — Stochastic Subgradient (SSG) Method: min_{w∈Ω} F(w) ≜ E_ξ[f(w; ξ)]. SSG iterates w_{t+1} = Π_Ω[w_t − η_t ∂f(w_t; ξ_t)] with η_t ∝ 1/√t, and is more scalable for large-scale problems. Example: empirical risk minimization min_w (1/n) Σ_{i=1}^n l(wᵀx_i, y_i) + λR(w).

43 Accelerated Stochastic Subgradient Method — Accelerated Stochastic Subgradient (ASSG) Method. Do the same tricks suffice (stage-wise decreasing step size and warm-start)? Not in theory: the SSG guarantee E[f(ŵ_T)] − f(w) ≤ ηG²/2 + E[‖w₀ − w‖₂²]/(2ηT) is only an expectation bound, and an expectation bound does not work for the stage-wise restart argument.

45 Accelerated Stochastic Subgradient Method — ASSG. Use a high-probability analysis instead: f(ŵ_T) − f(w) ≤ ηG²/2 + ‖w₀ − w‖₂²/(2ηT) + a deviation term (the variance of SSG) involving (1/T) Σ_{t=1}^T ‖w_t − w‖₂. Another trick controls this term: domain shrinking (Xu et al., 2016a).

46 Accelerated Stochastic Subgradient Method — ASSG. In the k-th stage, SSG iterates w^k_{t+1} = Π_{Ω∩B(w_{k−1}, D_k)}[w^k_t − η_k ∂f(w^k_t; ξ^k_t)], where B(w, D) is a ball centered at w with radius D and D_k is decreased by half every stage. With high probability 1 − δ, the iteration complexity is Õ(c²G² log(1/δ)/ɛ^{2(1−θ)}).
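
A self-contained numpy sketch of the stage structure (my illustration; the noisy oracle, parameters, and the unconstrained Ω are assumptions, and the ball projection is written out explicitly):

```python
import numpy as np

def assg(stoch_subgrad, w0, eta1, D1, t, K, rng):
    """ASSG sketch: stage-wise SSG where iterates are projected back onto a
    ball B(center, D_k) around the previous stage's output; eta and D halve."""
    w_stage = w0.copy()
    eta, D = eta1, D1
    for _ in range(K):
        center = w_stage.copy()
        w, avg = w_stage.copy(), np.zeros_like(w0)
        for _ in range(t):
            w = w - eta * stoch_subgrad(w, rng)
            r = np.linalg.norm(w - center)     # project onto B(center, D)
            if r > D:
                w = center + (w - center) * (D / r)
            avg += w
        w_stage = avg / t                      # averaged iterate, warm-start
        eta /= 2.0
        D /= 2.0                               # shrink the domain
    return w_stage

# Example: noisy subgradient of f(w) = |w| with zero-mean Gaussian noise
rng = np.random.default_rng(0)
noisy = lambda w, rng: np.sign(w) + 0.5 * rng.standard_normal(w.shape)
w_K = assg(noisy, np.array([1.0]), eta1=0.5, D1=1.0, t=500, K=6, rng=rng)
```

Shrinking the ball keeps every iterate close to the stage center, which is exactly what bounds the ‖w_t − w‖₂ terms in the high-probability analysis.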

47 Accelerated Stochastic Subgradient Method — Domain Shrinking. Domain shrinking mitigates the variance in the stochastic subgradient. (Figure: ASSG vs. SSG.)

48 Outline (next: Experiments): 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

49 Experiments — RSG: Convergence on robust regression min_{w∈R^d} (1/n) Σ_{i=1}^n |wᵀx_i − y_i|^p for p = 1 and p = 1.5. (Figures: log(level gap) vs. number of stages for t = 10², 10³, 10⁴.)

50 Experiments — RSG vs. SG on robust regression min_{w∈R^d} (1/n) Σ_{i=1}^n |wᵀx_i − y_i|^p, p = 1 and p = 1.5. (Figures: log(objective gap) vs. number of iterations for SG, RSG (t = 10³), R²SG (t₁ = 10³), and RSG (t = 10⁴).)

51 Experiments — RSG vs. SG on graph-lasso regularized SVM min_{w∈R^d} (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + λ‖Fw‖₁. (Figures: objective vs. number of iterations for SG, ADMM, and R²SG, and for two different initial solutions.)

52 Experiments — HOPS vs. Smoothing.
Table: Comparison of different optimization algorithms by the number of iterations and running time for achieving a solution that satisfies F(x) − F* ≤ ɛ.
           Sparse Classification              Matrix Decomposition
           ɛ = 10⁻⁴        ɛ = 10⁻⁵          ɛ = 10⁻³        ɛ = 10⁻⁴
PD         526 (4.48s)     1777 (13.91s)     2523 (7.57s)    3441 (9.82s)
APG-D      591 (5.63s)     1122 (10.61s)     1967 (10.30s)   8622 (44.90s)
APG-F      573 (5.12s)     943 (8.51s)       1115 (3.25s)    4151 (11.82s)
HOPS-D     501 (4.39s)     873 (8.35s)       224 (1.54s)     313 (2.16s)
HOPS-F     490 (4.38s)     868 (7.81s)       230 (0.90s)     312 (1.23s)
PD-HOPS    427 (3.41s)     609 (4.87s)       124 (0.48s)     162 (0.66s)

53 Experiments — ASSG vs. SSG on the Million Songs data (n = 463,715). (Figures: objective vs. number of iterations for robust regression with p = 1 and p = 1.5, comparing SG, ASSG (t = 10⁵), and RASSG (t₁ = 10⁴).)

54 Experiments — ASSG vs. SSG with hinge loss + l₁ regularizer on the Covtype data (n = 581,012). (Figure: log₁₀(F(w_t) − F*) vs. number of iterations for SG, ASSG, and RASSG.)

55 Outline (next: Conclusion): 1 Introduction and Background; 2 A Key Error Inequality; 3 Restarted Subgradient Method; 4 Homotopy Smoothing; 5 Accelerated Stochastic Subgradient Method; 6 Experiments; 7 Conclusion.

56 Conclusion. We developed: a key error inequality that can accelerate many algorithms, including stochastic momentum methods and stochastic Nesterov's accelerated gradient methods (Yang et al., 2016); a restarted subgradient (RSG) method with faster convergence than SG; a homotopy smoothing (HOPS) algorithm with even faster convergence; and an accelerated stochastic subgradient (ASSG) method. Preliminary experiments show very promising results.

57 Conclusion — Thank You! Questions?

58 Conclusion — References.
Nesterov, Yu. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
Xu, Yi, Lin, Qihang, and Yang, Tianbao. Accelerate stochastic subgradient method by leveraging local error bound. CoRR, 2016a.
Xu, Yi, Yan, Yan, Lin, Qihang, and Yang, Tianbao. Homotopy smoothing for non-smooth problems with lower complexity than O(1/ɛ). CoRR, 2016b.
Yang, Tianbao and Lin, Qihang. RSG: Beating SGD without smoothness and/or strong convexity. CoRR, abs/1512.03107, 2016.
Yang, Tianbao, Lin, Qihang, and Li, Zhe. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. CoRR, 2016.


More information

Warm up. Regrade requests submitted directly in Gradescope, do not instructors.

Warm up. Regrade requests submitted directly in Gradescope, do not  instructors. Warm up Regrade requests submitted directly in Gradescope, do not email instructors. 1 float in NumPy = 8 bytes 10 6 2 20 bytes = 1 MB 10 9 2 30 bytes = 1 GB For each block compute the memory required

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Adaptive Gradient Methods AdaGrad / Adam

Adaptive Gradient Methods AdaGrad / Adam Case Study 1: Estimating Click Probabilities Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 The Problem with GD (and SGD)

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Stochastic Subgradient Method

Stochastic Subgradient Method Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs E0 270 Machine Learning Lecture 5 (Jan 22, 203) Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in

More information

Randomized Smoothing for Stochastic Optimization

Randomized Smoothing for Stochastic Optimization Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

A Unified Analysis of Stochastic Momentum Methods for Deep Learning

A Unified Analysis of Stochastic Momentum Methods for Deep Learning A Unified Analysis of Stochastic Momentum Methods for Deep Learning Yan Yan,2, Tianbao Yang 3, Zhe Li 3, Qihang Lin 4, Yi Yang,2 SUSTech-UTS Joint Centre of CIS, Southern University of Science and Technology

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

Smoothing Proximal Gradient Method. General Structured Sparse Regression

Smoothing Proximal Gradient Method. General Structured Sparse Regression for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:

More information

Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang

Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity Noname manuscript No. (will be inserted by the editor) Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity Jason D. Lee Qihang Lin Tengyu Ma Tianbao

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Robust high-dimensional linear regression: A statistical perspective

Robust high-dimensional linear regression: A statistical perspective Robust high-dimensional linear regression: A statistical perspective Po-Ling Loh University of Wisconsin - Madison Departments of ECE & Statistics STOC workshop on robustness and nonconvexity Montreal,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction

More information

Statistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003

Statistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003 Statistical Approaches to Learning and Discovery Week 4: Decision Theory and Risk Minimization February 3, 2003 Recall From Last Time Bayesian expected loss is ρ(π, a) = E π [L(θ, a)] = L(θ, a) df π (θ)

More information

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Lee, Ching-pei University of Illinois at Urbana-Champaign Joint work with Dan Roth ICML 2015 Outline Introduction Algorithm Experiments

More information

Adaptive Probabilities in Stochastic Optimization Algorithms

Adaptive Probabilities in Stochastic Optimization Algorithms Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights

More information

Support Vector Machines for Classification and Regression

Support Vector Machines for Classification and Regression CIS 520: Machine Learning Oct 04, 207 Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may

More information

Gradient Sliding for Composite Optimization

Gradient Sliding for Composite Optimization Noname manuscript No. (will be inserted by the editor) Gradient Sliding for Composite Optimization Guanghui Lan the date of receipt and acceptance should be inserted later Abstract We consider in this

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Adaptive restart of accelerated gradient methods under local quadratic growth condition

Adaptive restart of accelerated gradient methods under local quadratic growth condition Adaptive restart of accelerated gradient methods under local quadratic growth condition Olivier Fercoq Zheng Qu September 6, 08 arxiv:709.0300v [math.oc] 7 Sep 07 Abstract By analyzing accelerated proximal

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Estimators based on non-convex programs: Statistical and computational guarantees

Estimators based on non-convex programs: Statistical and computational guarantees Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

Oslo Class 2 Tikhonov regularization and kernels

Oslo Class 2 Tikhonov regularization and kernels RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n

More information

Stochastic Optimization Part I: Convex analysis and online stochastic optimization

Stochastic Optimization Part I: Convex analysis and online stochastic optimization Stochastic Optimization Part I: Convex analysis and online stochastic optimization Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical

More information

l 1 and l 2 Regularization

l 1 and l 2 Regularization David Rosenberg New York University February 5, 2015 David Rosenberg (New York University) DS-GA 1003 February 5, 2015 1 / 32 Tikhonov and Ivanov Regularization Hypothesis Spaces We ve spoken vaguely about

More information

RegML 2018 Class 2 Tikhonov regularization and kernels

RegML 2018 Class 2 Tikhonov regularization and kernels RegML 2018 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT June 17, 2018 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n = (x i,

More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

Adaptive Restarting for First Order Optimization Methods

Adaptive Restarting for First Order Optimization Methods Adaptive Restarting for First Order Optimization Methods Nesterov method for smooth convex optimization adpative restarting schemes step-size insensitivity extension to non-smooth optimization continuation

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning

More information

Optimization and Gradient Descent

Optimization and Gradient Descent Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function

More information

Lasso: Algorithms and Extensions

Lasso: Algorithms and Extensions ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

More information

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725 Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Large Scale Semi-supervised Linear SVMs. University of Chicago

Large Scale Semi-supervised Linear SVMs. University of Chicago Large Scale Semi-supervised Linear SVMs Vikas Sindhwani and Sathiya Keerthi University of Chicago SIGIR 2006 Semi-supervised Learning (SSL) Motivation Setting Categorize x-billion documents into commercial/non-commercial.

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data

Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data DD2424 March 23, 2017 Binary classification problem given labelled training data Have labelled training examples? Given

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Algorithmic Stability and Generalization Christoph Lampert

Algorithmic Stability and Generalization Christoph Lampert Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts

More information

Optimization Methods for Machine Learning

Optimization Methods for Machine Learning Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information