Accelerate Subgradient Methods
1 Accelerate Subgradient Methods. Tianbao Yang, Department of Computer Science, The University of Iowa. Contributors: students Yi Xu, Yan Yan, and colleague Qihang Lin. Yang Accelerate Subgradient Methods 1 / 49
2 Introduction and Background. Outline: 1. Introduction and Background; 2. A Key Error Inequality; 3. Restarted Subgradient Method; 4. Homotopy Smoothing; 5. Accelerated Stochastic Subgradient Method; 6. Experiments; 7. Conclusion.
3 Introduction and Background. Convex Optimization: f* = min_{w ∈ Ω} f(w), where Ω ⊆ R^d is a convex set and f(w) is a convex function. Goal: for a sufficiently small ε > 0, find a solution ŵ such that f(ŵ) − f* ≤ ε.
4 Introduction and Background. Applications in Machine Learning: Classification and Regression. min_{w ∈ R^d} (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + λ R(w), with training examples (x_i, y_i), i = 1, …, n, where x_i ∈ R^d and y_i ∈ {1, −1} (classification) or y_i ∈ R (regression); ℓ(z, y) is a loss function convex w.r.t. z, and R(w) is a regularizer.
5 Introduction and Background. Applications in Machine Learning: Classification and Regression. General form: min_{w ∈ R^d} (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + λ R(w). Example, SVM: min_{w ∈ R^d} (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + (λ/2)‖w‖₂². Example, LASSO: min_{w ∈ R^d} (1/n) Σ_{i=1}^n (y_i − wᵀx_i)² + λ‖w‖₁. Many others (examples given later).
7 Introduction and Background. How to Solve the Optimization Efficiently? Concern: running time (RT). Iterative algorithm: w_{t+1} = w_t + Δw_t; RT = (# of iterations) × (per-iteration RT). Iteration complexity: how many iterations T(ε) are needed in order to have f(ŵ_T) − f* ≤ ε.
9 Introduction and Background. Iteration Complexity of Convex Optimization. Properties of the objective function: a smooth function is upper bounded by a quadratic function; a strongly convex function is lower bounded by a quadratic function. Minimax iteration complexity: smooth and strongly convex, O(log(1/ε)) (accelerated gradient method); smooth, O(1/√ε) (accelerated gradient method); strongly convex, O(1/ε) (subgradient (SG) method); non-smooth and non-strongly convex, O(1/ε²) (SG).
11 Introduction and Background. Non-smooth and Non-strongly Convex Optimization. Robust regression: min_w (1/n) Σ_{i=1}^n |wᵀx_i − y_i|^p, p ∈ [1, 2). Sparse classification: min_w (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + λ‖w‖₁.
12 Introduction and Background. Deep Learning. (Figure slide.)
13 Introduction and Background. Subgradient Method. f* = min_{w ∈ Ω} f(w). SG method: w_{t+1} = Π_Ω[w_t − η_t ∂f(w_t)], where ∂f(w_t) is a subgradient and the step size is η_t ∝ 1/√t. Iteration complexity O(1/ε²): very slow for small ε.
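The projected subgradient update above is easy to put into code. A minimal sketch (not from the talk; the toy problem f(w) = |w − 3| on Ω = [−10, 10] and all parameter values are illustrative assumptions):

```python
import numpy as np

def subgradient_method(subgrad, project, w0, eta0, T):
    """Projected subgradient method with the classic eta_t = eta0/sqrt(t)
    step size; returns the averaged iterate hat{w}_T."""
    w = np.asarray(w0, dtype=float)
    avg = np.zeros_like(w)
    for t in range(1, T + 1):
        w = project(w - (eta0 / np.sqrt(t)) * subgrad(w))
        avg += w
    return avg / T

# Toy non-smooth problem: f(w) = |w - 3| on Omega = [-10, 10]; minimizer w* = 3.
subgrad = lambda w: np.sign(w - 3.0)          # a valid subgradient everywhere
project = lambda w: np.clip(w, -10.0, 10.0)   # projection onto the box Omega
w_hat = subgradient_method(subgrad, project, w0=[-8.0], eta0=1.0, T=5000)
```

Note that np.sign returns 0 at the kink, which is a valid subgradient of |·| at 0; averaging the iterates is what the O(1/ε²) guarantee is stated for.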
14 Introduction and Background. Accelerate Subgradient Method. f* = min_{w ∈ Ω} f(w). Exploit special structure/conditions of the objective function. One example is an explicit max-structure: f(w) = max_{u ∈ Ω₂} uᵀAw − φ(u). Nesterov's smoothing technique (Nesterov, 2005): F_µ(w) = max_{u ∈ Ω₂} uᵀAw − φ(u) − (µ/2)‖u‖₂². Running AG on the smoothed problem gives O(1/ε) for the original problem.
15 Introduction and Background. Our Methodology for Accelerating SG. f* = min_{w ∈ Ω} f(w). Explore the local structure around the optimal solution (local error bound): ‖w − w*‖₂ ≤ c (f(w) − f*)^θ, θ ∈ (0, 1], ∀ w ∈ S_ε. This family is broad enough. Improved iteration complexity: Õ(1/ε^{2(1−θ)}); with an explicit max-structure: Õ(1/ε^{1−θ}).
17 A Key Error Inequality. Some Notations. Optimal set: Ω* = {w ∈ Ω : f(w) = f*}. ε-level set: L_ε = {w ∈ Ω : f(w) − f* = ε}. ε-sublevel set: S_ε = {w ∈ Ω : f(w) − f* ≤ ε}. w*: the closest optimal solution to w, w* = arg min_{u ∈ Ω*} ‖u − w‖₂. w_ε: the closest solution to w in the ε-sublevel set, w_ε = arg min_{u ∈ S_ε} ‖u − w‖₂.
18 A Key Error Inequality. Assumptions. There exist w₀ and ε₀ such that f(w₀) − f* ≤ ε₀. There exists G such that ‖∂f(w)‖₂ ≤ G. The optimal set Ω* is a non-empty compact set.
20 A Key Error Inequality. A Key Error Inequality. (Figure: a point w, its closest point w_ε in the ε-sublevel set S_ε, and the closest optimal solution (w_ε)*; the sketch compares the ratio ‖w_ε − (w_ε)*‖₂ to f(w_ε) − f* against the ratio ‖w − w_ε‖₂ to f(w) − f(w_ε).) Key Error Inequality (Yang & Lin, 2016): ‖w − w_ε‖₂ ≤ (‖w_ε − (w_ε)*‖₂ / ε) · (f(w) − f(w_ε)).
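A one-dimensional reading of the figure (a heuristic, not the general proof, which is given in Yang & Lin, 2016): when (w_ε)*, w_ε, and w lie on a line, convexity makes average slopes nondecreasing along that line, and f(w_ε) − f* = ε for w outside S_ε, so

```latex
\frac{f(w_\epsilon) - f_*}{\|w_\epsilon - (w_\epsilon)^*\|_2}
\;\le\;
\frac{f(w) - f(w_\epsilon)}{\|w - w_\epsilon\|_2},
\qquad
f(w_\epsilon) - f_* = \epsilon
\;\Longrightarrow\;
\|w - w_\epsilon\|_2 \;\le\;
\frac{\|w_\epsilon - (w_\epsilon)^*\|_2}{\epsilon}\,
\bigl(f(w) - f(w_\epsilon)\bigr).
```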
23 Restarted Subgradient Method. SG Method: ŵ_T = SG(w₀, η, T).
1: Input: the number of iterations T and the initial solution w₀ ∈ Ω.
2: Let w₁ = w₀.
3: for t = 1, …, T do
4: Compute a subgradient ∂f(w_t).
5: Update w_{t+1} = Π_Ω[w_t − η ∂f(w_t)].
6: end for
7: Output: ŵ_T = (1/T) Σ_{t=1}^T w_t.
Convergence guarantee: for any w ∈ Ω, f(ŵ_T) − f(w) ≤ ηG²/2 + ‖w₀ − w‖₂²/(2ηT); choosing T = G²‖w₀ − w‖₂²/ε² and η = ε/G² gives f(ŵ_T) − f(w) ≤ ε, and in particular f(ŵ_T) − f* ≤ ε.
24 Restarted Subgradient Method. RSG Method: w_K = RSG(w₀, t, K).
1: Input: the number of stages K, the number of iterations t per stage, and w₀ ∈ Ω.
2: Set η₁ = ε₀/(2G²), where ε₀ is from our assumption.
3: for k = 1, …, K do
4: Call the subroutine SG to obtain w_k = SG(w_{k−1}, η_k, t).
5: Set η_{k+1} = η_k/2.
6: end for
7: Output: w_K.
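The restarting scheme above can be sketched as follows (illustrative Python, not the authors' code; the inner SG runs with a constant step per stage, and the toy problem, ε₀, and G are assumptions):

```python
import numpy as np

def sg_constant(subgrad, project, w0, eta, t):
    """Inner SG subroutine: t projected subgradient steps with a constant
    step size eta, returning the averaged iterate."""
    w = np.asarray(w0, dtype=float)
    avg = np.zeros_like(w)
    for _ in range(t):
        w = project(w - eta * subgrad(w))
        avg += w
    return avg / t

def rsg(subgrad, project, w0, eps0, G, K, t):
    """RSG sketch: K stages of SG, halving the step size eta_k (initialized
    to eps0 / (2 G^2)) and warm-starting each stage from the previous output."""
    w, eta = np.asarray(w0, dtype=float), eps0 / (2.0 * G**2)
    for _ in range(K):
        w = sg_constant(subgrad, project, w, eta, t)
        eta /= 2.0
    return w

# Toy sharp (theta = 1) problem: f(w) = |w - 3| on [-10, 10], G = 1.
subgrad = lambda w: np.sign(w - 3.0)
project = lambda w: np.clip(w, -10.0, 10.0)
w_rsg = rsg(subgrad, project, w0=[-8.0], eps0=11.0, G=1.0, K=8, t=200)
```

With a fixed per-stage budget t, halving the step each stage roughly halves the residual error, matching the log₂(ε₀/ε)-stage picture.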
25 Restarted Subgradient Method. A General Convergence Result for RSG. Distance between the ε-level set and the optimal set: B_ε = max_{w ∈ L_ε} ‖w − w*‖₂. Convergence of RSG: if t ≥ 4G²B_ε²/ε² and K = ⌈log₂(ε₀/ε)⌉, then f(w_K) − f* ≤ 2ε. Iteration complexity of RSG: O((G²B_ε²/ε²) log(ε₀/ε)).
26 Restarted Subgradient Method. Comparison with SG. Iteration complexity of RSG: O((G²B_ε²/ε²) log(ε₀/ε)). Iteration complexity of SG: O(G²‖w₀ − w*‖₂²/ε²). SG depends on the distance from the initial solution to the optimal set; RSG depends on the distance from the ε-level set to the optimal set; RSG has only log-dependence on the quality of the initial solution (ε₀).
27 Restarted Subgradient Method. Improved Convergence under a Local Error Bound Condition. Iteration complexity: O((G²B_ε²/ε²) log(ε₀/ε)), where B_ε = max_{w ∈ L_ε} ‖w − w*‖₂. Local error bound: ‖w − w*‖₂ ≤ c (f(w) − f*)^θ, θ ∈ (0, 1], ∀ w ∈ S_ε. This implies B_ε ≤ c ε^θ, and hence O((G²c²/ε^{2(1−θ)}) log(ε₀/ε)). (Figure: growth curves F(x) = x, x^{1.5}, x² corresponding to θ = 1, 2/3, 0.5.)
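The substitution behind the improved rate can be spelled out in one line (on the level set L_ε we have f(w) − f* = ε, so the error bound applies directly):

```latex
B_\epsilon \;=\; \max_{w \in \mathcal{L}_\epsilon} \|w - w^*\|_2
\;\le\; c\,\epsilon^{\theta}
\qquad\Longrightarrow\qquad
O\!\left(\frac{G^2 B_\epsilon^2}{\epsilon^2}\log\frac{\epsilon_0}{\epsilon}\right)
\;=\; O\!\left(\frac{G^2 c^2}{\epsilon^{2(1-\theta)}}\log\frac{\epsilon_0}{\epsilon}\right).
```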
28 Restarted Subgradient Method. Polyhedral Convex Optimization. Linear convergence when the epigraph is a polyhedron (θ = 1). Examples: robust regression, min_w (1/n) Σ_{i=1}^n |wᵀx_i − y_i|; sparse classification, min_w (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + λ‖w‖₁.
30 Restarted Subgradient Method. Locally Semi-strongly Convex Functions: ‖w − w*‖₂ ≤ c (f(w) − f*)^{1/2}, ∀ w ∈ S_ε. Iteration complexity: Õ(c²G²/ε). Examples: robust regression, min_w (1/n) Σ_{i=1}^n |wᵀx_i − y_i|^p, p ∈ (1, 2); ℓ₁-regularized problems f(w) = h(Aw) + λ‖w‖₁, where h(·) is strongly convex on any compact set; the Huber loss ℓ_δ(z) = z²/2 if |z| ≤ δ, and δ(|z| − δ/2) otherwise.
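The piecewise Huber loss above, written out (a minimal sketch; the default δ = 1 is just an illustrative choice):

```python
def huber(z, delta=1.0):
    """Huber loss: quadratic near zero (|z| <= delta), linear in the tails,
    so it is smooth yet grows like the absolute loss for large residuals."""
    a = abs(z)
    return 0.5 * z * z if a <= delta else delta * (a - 0.5 * delta)
```

The two branches meet with matching value and slope at |z| = δ, which is what makes the loss differentiable everywhere.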
32 Restarted Subgradient Method. Why RSG Converges Faster. The step size decreases in a stage-wise manner (sounds familiar?). Warm-start: each stage starts from the previous stage's output.
33 Restarted Subgradient Method. Proof of RSG. The proof is very simple given the key inequality. Let ε_k = ε₀/2^k, so η_k = ε_k/G². By induction, assume f(w_{k−1}) − f* ≤ ε_{k−1} + ε. Apply the convergence result of SG for the k-th stage, f(w_k) − f(w_{k−1,ε}) ≤ η_k G²/2 + ‖w_{k−1} − w_{k−1,ε}‖₂²/(2 η_k t), together with our key error inequality, ‖w_{k−1} − w_{k−1,ε}‖₂ ≤ (B_ε/ε)(f(w_{k−1}) − f(w_{k−1,ε})) ≤ (B_ε/ε) ε_{k−1} = (B_ε/ε) 2ε_k. This gives f(w_k) − f(w_{k−1,ε}) ≤ ε_k/2 + 2ε_k G² B_ε²/(ε² t) ≤ ε_k once t ≥ 4G²B_ε²/ε², and hence f(w_k) − f* ≤ ε_k + ε, completing the induction.
35 Homotopy Smoothing. Nesterov's Smoothing Improves SG. Explicit max-structure: f(w) = max_{u ∈ Ω₂} uᵀAw − φ(u). Nesterov's smoothing technique with smoothing parameter µ: F_µ(w) = max_{u ∈ Ω₂} uᵀAw − φ(u) − (µ/2)‖u‖₂², which is an L_µ = ‖A‖²/µ-smooth function. Assume max_{u ∈ Ω₂} ‖u‖₂ ≤ D.
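A concrete one-dimensional instance of this construction (illustrative, not from the talk): |z| = max_{|u| ≤ 1} u·z, and subtracting (µ/2)u² makes the maximizer u* = clip(z/µ, −1, 1) available in closed form, giving a (1/µ)-smooth Huber-type surrogate whose gradient is exactly u*:

```python
import numpy as np

def smoothed_abs(z, mu):
    """Nesterov smoothing of |z| = max_{|u|<=1} u*z: the penalized maximizer
    is u* = clip(z/mu, -1, 1), giving a (1/mu)-smooth surrogate."""
    u = np.clip(z / mu, -1.0, 1.0)
    return u * z - 0.5 * mu * u * u

def smoothed_abs_grad(z, mu):
    """The gradient of the smoothed function equals the maximizer u*."""
    return np.clip(z / mu, -1.0, 1.0)
```

Here D = 1, so the uniform approximation gap |z| − F_µ(z) is at most µD²/2 = µ/2, matching the µD²/2 approximation-error term in the AG bound.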
37 Homotopy Smoothing. Accelerated Gradient Method for a Smooth Function: w_T = AG(w₀, t, L_µ).
1: Input: the number of iterations t and the initial solution w₀ ∈ Ω.
2: Let t₀ = t₁ = 1 and w₁ = w₀.
3: for k = 1, …, t do
4: Compute y_k = w_k + ((t_{k−1} − 1)/t_k)(w_k − w_{k−1}).
5: Compute w_{k+1} = arg min_{w ∈ Ω} { wᵀ∇F_µ(y_k) + (L_µ/2)‖w − y_k‖₂² }.
6: Compute t_{k+1} = (1 + √(1 + 4t_k²))/2.
7: end for
8: Output: w_{t+1}.
Convergence of AG: for any w ∈ Ω, f(w_T) − f(w) ≤ µD²/2 (approximation error) + 2‖A‖²‖w₀ − w‖₂²/(µT²).
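A compact sketch of the AG scheme above (illustrative Python; the quadratic test function with L = 1 and unconstrained Ω are assumptions, and step 5 reduces to a projected gradient step of length 1/L in the Euclidean case):

```python
import numpy as np

def accelerated_gradient(grad, project, w0, L, T):
    """Nesterov-style accelerated gradient for an L-smooth objective:
    extrapolate with momentum weight (t_{k-1} - 1)/t_k, then take a
    projected gradient step of length 1/L from the extrapolated point."""
    w_prev = w = np.asarray(w0, dtype=float)
    t_prev, t_cur = 1.0, 1.0
    for _ in range(T):
        y = w + ((t_prev - 1.0) / t_cur) * (w - w_prev)
        w_prev, w = w, project(y - grad(y) / L)
        t_prev, t_cur = t_cur, (1.0 + np.sqrt(1.0 + 4.0 * t_cur**2)) / 2.0
    return w

# Toy 1-smooth problem: f(w) = 0.5 * ||w - b||_2^2 over Omega = R^2.
b = np.array([1.0, -2.0])
w_ag = accelerated_gradient(lambda w: w - b, lambda w: w, np.zeros(2), L=1.0, T=50)
```

The t_k recursion grows like k/2, which is what turns the O(1/T) rate of plain gradient descent into O(1/T²).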
39 Homotopy Smoothing. Homotopy Smoothing (HOPS) (Xu et al., 2016b).
1: Input: the number of stages K, the number of iterations t per stage, and the initial solution w₀ ∈ Ω.
2: Let µ₁ = ε₀/(2D²).
3: for s = 1, …, K do
4: Let w_s = AG(w_{s−1}, t, L_{µ_s}).
5: Update µ_{s+1} = µ_s/2.
6: end for
7: Output: w_K.
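The homotopy loop can be sketched as follows (illustrative only: plain gradient steps stand in for the AG subroutine, and the smoothed absolute-value problem, µ₁, K, and t are all assumptions):

```python
import numpy as np

def hops(make_smoothed, w0, mu1, K, t):
    """HOPS sketch: solve the smoothed problems F_mu in stages, halving mu
    and warm-starting each stage from the previous stage's solution;
    (1/L)-step gradient descent stands in for the AG subroutine."""
    w, mu = float(w0), mu1
    for _ in range(K):
        grad, L = make_smoothed(mu)       # gradient of F_mu and its smoothness L
        for _ in range(t):
            w -= grad(w) / L
        mu /= 2.0
    return w

# Smoothed |w - 3|: grad F_mu(w) = clip((w - 3)/mu, -1, 1), which is (1/mu)-smooth.
make_smoothed = lambda mu: (lambda w: float(np.clip((w - 3.0) / mu, -1.0, 1.0)), 1.0 / mu)
w_hops = hops(make_smoothed, w0=-8.0, mu1=2.0, K=6, t=100)
```

Starting each stage from the previous solution is what lets every stage use a moderate iteration budget even as µ shrinks.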
40 Homotopy Smoothing. Improved Convergence under a Local Error Bound Condition. Local error bound: ‖w − w*‖₂ ≤ c (f(w) − f*)^θ, θ ∈ (0, 1], ∀ w ∈ S_ε. This implies the iteration complexity O((cD‖A‖/ε^{1−θ}) log(ε₀/ε)).
42 Accelerated Stochastic Subgradient Method. Stochastic Subgradient (SSG) Method: min_{w ∈ Ω} F(w) ≜ E_ξ[f(w; ξ)]. SSG method: w_{t+1} = Π_Ω[w_t − η_t ∂f(w_t; ξ_t)], with η_t ∝ 1/√t. More scalable for large-scale problems. Example: empirical risk minimization, min_w (1/n) Σ_{i=1}^n ℓ(wᵀx_i, y_i) + λ R(w).
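A runnable sketch of SSG on the robust-regression example (p = 1; the synthetic data, step size, and iteration count are illustrative assumptions, not the talk's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                            # noiseless targets for min (1/n) sum |x_i.w - y_i|

def ssg(w0, eta0, T):
    """Stochastic subgradient method: sample one example per iteration and
    step along a subgradient of its loss with eta_t = eta0/sqrt(t)."""
    w = np.asarray(w0, dtype=float)
    avg = np.zeros_like(w)
    for t in range(1, T + 1):
        i = rng.integers(len(X))
        g = np.sign(X[i] @ w - y[i]) * X[i]   # subgradient of |x_i.w - y_i|
        w -= (eta0 / np.sqrt(t)) * g
        avg += w
    return avg / T

w_ssg = ssg(np.zeros(3), eta0=0.5, T=20000)
```

Each step costs O(d) regardless of n, which is the scalability advantage over the full subgradient method.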
43 Accelerated Stochastic Subgradient Method. Accelerated Stochastic Subgradient (ASSG) Method. Do the same tricks suffice? Stage-wise decreasing step size and warm-start are not enough in theory: SSG only gives the expectation bound E[f(ŵ_T)] − f(w) ≤ ηG²/2 + E[‖w₀ − w‖₂²]/(2ηT), and an expectation bound does not work (the stage-wise restarting argument needs each stage's guarantee to hold with high probability).
45 Accelerated Stochastic Subgradient Method. Accelerated Stochastic Subgradient (ASSG) Method. Use a high-probability analysis: f(ŵ_T) − f(w) ≤ ηG²/2 + ‖w₀ − w‖₂²/(2ηT) + a deviation term (the variance of SSG) that scales with (1/T) Σ_{t=1}^T ‖w_t − w‖₂. Another trick: domain shrinking (Xu et al., 2016a), which keeps ‖w_t − w‖₂ under control.
46 Accelerated Stochastic Subgradient Method. Accelerated Stochastic Subgradient (ASSG) Method. In the k-th stage, SSG runs as w^k_{t+1} = Π_{Ω ∩ B(w_{k−1}, D_k)}[w^k_t − η_k ∂f(w^k_t; ξ^k_t)], where B(w, D) is a ball centered at w with radius D, and D_k is decreased by half every stage. With high probability 1 − δ, the iteration complexity is Õ(c²G² log(1/δ)/ε^{2(1−θ)}).
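The stage-wise ball projection can be sketched as follows (illustrative Python; the one-dimensional noisy problem and all constants are assumptions, and Ω = R here so only the shrinking ball constraint is active):

```python
import numpy as np

def project_ball(w, center, D):
    """Project onto B(center, D), the shrinking domain used within a stage."""
    diff = w - center
    r = np.linalg.norm(diff)
    return w if r <= D else center + (D / r) * diff

def assg(stoch_subgrad, w0, eta1, D1, K, t):
    """ASSG sketch: K stages of constant-step SSG; every iterate of stage k
    is projected onto B(w_{k-1}, D_k), and eta_k, D_k are halved per stage."""
    center = np.asarray(w0, dtype=float)
    eta, D = eta1, D1
    for _ in range(K):
        w = center.copy()
        avg = np.zeros_like(w)
        for _ in range(t):
            w = project_ball(w - eta * stoch_subgrad(w), center, D)
            avg += w
        center = avg / t              # warm start next stage at the averaged iterate
        eta, D = eta / 2.0, D / 2.0
    return center

rng = np.random.default_rng(1)
noisy_sg = lambda w: np.sign(w - 3.0) + 0.1 * rng.normal(size=w.shape)
w_assg = assg(noisy_sg, w0=np.array([-8.0]), eta1=2.0, D1=8.0, K=8, t=500)
```

Keeping every iterate inside a ball around the previous stage's solution is what bounds the ‖w_t − w‖₂ terms in the high-probability analysis.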
47 Accelerated Stochastic Subgradient Method. Domain Shrinking. Mitigates the variance in the stochastic subgradient updates. (Figure: ASSG vs SSG.)
49 Experiments. RSG: Convergence. Robust regression, min_{w ∈ R^d} (1/n) Σ_{i=1}^n |wᵀx_i − y_i|^p, with p = 1 and p = 1.5. (Figures: log(level gap) of the objective vs. number of stages, for t = 10², 10³, 10⁴, for p = 1 and p = 1.5.)
50 Experiments. RSG vs SG. Robust regression, min_{w ∈ R^d} (1/n) Σ_{i=1}^n |wᵀx_i − y_i|^p, with p = 1 and p = 1.5. (Figures: log(objective gap) vs. number of iterations, comparing SG, RSG (t = 10³), R²SG (t₁ = 10³), and RSG (t = 10⁴).)
51 Experiments. RSG vs SG. Graph-lasso-regularized SVM: min_{w ∈ R^d} (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i) + λ‖Fw‖₁. (Figures: objective vs. number of iterations, comparing SG, ADMM, and R²SG; and SG vs. R²SG from two different initial solutions.)
52 Experiments. HOPS vs Smoothing. Table: comparison of different optimization algorithms by the number of iterations and running time for achieving a solution that satisfies F(x) − F* ≤ ε. Columns: Sparse Classification (ε = 10⁻⁴, ε = 10⁻⁵), Matrix Decomposition (ε = 10⁻³, ε = 10⁻⁴).
PD: 526 (4.48s), 1777 (13.91s), 2523 (7.57s), 3441 (9.82s)
APG-D: 591 (5.63s), 1122 (10.61s), 1967 (10.30s), 8622 (44.90s)
APG-F: 573 (5.12s), 943 (8.51s), 1115 (3.25s), 4151 (11.82s)
HOPS-D: 501 (4.39s), 873 (8.35s), 224 (1.54s), 313 (2.16s)
HOPS-F: 490 (4.38s), 868 (7.81s), 230 (0.90s), 312 (1.23s)
PD-HOPS: 427 (3.41s), 609 (4.87s), 124 (0.48s), 162 (0.66s)
53 Experiments. ASSG vs SSG. Million Songs data (n = 463,715), robust regression with p = 1 and p = 1.5. (Figures: objective vs. number of iterations, comparing SG, ASSG (t = 10⁵), and RASSG (t₁ = 10⁴).)
54 Experiments. ASSG vs SSG. Hinge loss + ℓ₁ regularizer, Covtype data (n = 581,012). (Figure: log₁₀(F(w_t) − F*) vs. number of iterations, comparing SG, ASSG, and RASSG.)
56 Conclusion. Conclusion. Developed a key error inequality that can accelerate many algorithms, including stochastic momentum methods and stochastic Nesterov's accelerated gradient methods (Yang et al., 2016). Developed a restarted subgradient (RSG) method with faster convergence than SG. Developed a homotopy smoothing (HOPS) algorithm with even faster convergence. Developed an accelerated stochastic subgradient (ASSG) method. Preliminary experiments show very promising results.
57 Conclusion. Thank You! Questions?
58 Conclusion. References I.
Nesterov, Yu. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
Xu, Yi, Lin, Qihang, and Yang, Tianbao. Accelerate stochastic subgradient method by leveraging local error bound. CoRR, 2016a.
Xu, Yi, Yan, Yan, Lin, Qihang, and Yang, Tianbao. Homotopy smoothing for non-smooth problems with lower complexity than O(1/ε). CoRR, 2016b.
Yang, Tianbao and Lin, Qihang. RSG: Beating SGD without smoothness and/or strong convexity. CoRR, abs/1512.03107, 2016.
Yang, Tianbao, Lin, Qihang, and Li, Zhe. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. CoRR, 2016.
More informationSubgradient Method. Ryan Tibshirani Convex Optimization
Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial
More informationSmoothing Proximal Gradient Method. General Structured Sparse Regression
for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:
More informationMachine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang
Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationCS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationDistributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity
Noname manuscript No. (will be inserted by the editor) Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity Jason D. Lee Qihang Lin Tengyu Ma Tianbao
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationRobust high-dimensional linear regression: A statistical perspective
Robust high-dimensional linear regression: A statistical perspective Po-Ling Loh University of Wisconsin - Madison Departments of ECE & Statistics STOC workshop on robustness and nonconvexity Montreal,
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction
More informationStatistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003
Statistical Approaches to Learning and Discovery Week 4: Decision Theory and Risk Minimization February 3, 2003 Recall From Last Time Bayesian expected loss is ρ(π, a) = E π [L(θ, a)] = L(θ, a) df π (θ)
More informationDistributed Box-Constrained Quadratic Optimization for Dual Linear SVM
Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Lee, Ching-pei University of Illinois at Urbana-Champaign Joint work with Dan Roth ICML 2015 Outline Introduction Algorithm Experiments
More informationAdaptive Probabilities in Stochastic Optimization Algorithms
Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights
More informationSupport Vector Machines for Classification and Regression
CIS 520: Machine Learning Oct 04, 207 Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may
More informationGradient Sliding for Composite Optimization
Noname manuscript No. (will be inserted by the editor) Gradient Sliding for Composite Optimization Guanghui Lan the date of receipt and acceptance should be inserted later Abstract We consider in this
More informationA Parallel SGD method with Strong Convergence
A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,
More informationTopics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families
Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationAdaptive restart of accelerated gradient methods under local quadratic growth condition
Adaptive restart of accelerated gradient methods under local quadratic growth condition Olivier Fercoq Zheng Qu September 6, 08 arxiv:709.0300v [math.oc] 7 Sep 07 Abstract By analyzing accelerated proximal
More informationStochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine
More informationEstimators based on non-convex programs: Statistical and computational guarantees
Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based
More informationSupport Vector Machine
Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary
More informationOslo Class 2 Tikhonov regularization and kernels
RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n
More informationStochastic Optimization Part I: Convex analysis and online stochastic optimization
Stochastic Optimization Part I: Convex analysis and online stochastic optimization Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical
More informationl 1 and l 2 Regularization
David Rosenberg New York University February 5, 2015 David Rosenberg (New York University) DS-GA 1003 February 5, 2015 1 / 32 Tikhonov and Ivanov Regularization Hypothesis Spaces We ve spoken vaguely about
More informationRegML 2018 Class 2 Tikhonov regularization and kernels
RegML 2018 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT June 17, 2018 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n = (x i,
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationAdaptive Restarting for First Order Optimization Methods
Adaptive Restarting for First Order Optimization Methods Nesterov method for smooth convex optimization adpative restarting schemes step-size insensitivity extension to non-smooth optimization continuation
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning
More informationOptimization and Gradient Descent
Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function
More informationLasso: Algorithms and Extensions
ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions
More informationGradient Descent. Ryan Tibshirani Convex Optimization /36-725
Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationLarge Scale Semi-supervised Linear SVMs. University of Chicago
Large Scale Semi-supervised Linear SVMs Vikas Sindhwani and Sathiya Keerthi University of Chicago SIGIR 2006 Semi-supervised Learning (SSL) Motivation Setting Categorize x-billion documents into commercial/non-commercial.
More informationNesterov s Acceleration
Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x
More informationLinear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationDistributed Inexact Newton-type Pursuit for Non-convex Sparse Learning
Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology
More informationLecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data
Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data DD2424 March 23, 2017 Binary classification problem given labelled training data Have labelled training examples? Given
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationOptimization Methods for Machine Learning
Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction
More informationBits of Machine Learning Part 1: Supervised Learning
Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification
More informationAccelerating SVRG via second-order information
Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical
More informationFrank-Wolfe Method. Ryan Tibshirani Convex Optimization
Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)
More information