Lecture 3: Minimizing Large Sums. Peter Richtárik
1 Lecture 3: Minimizing Large Sums. Peter Richtárik. Graduate School in Systems, Control and Networks, Belgium, 2015.
2 Machine Learning & Empirical Risk
3 Training Linear Predictors
4 Nature of Data. Data (e.g., image, text, measurements, ...): $A_i \in \mathbb{R}^{d \times m}$. Label: $y_i \in \mathbb{R}^m$. Each pair $(A_i, y_i)$ is drawn from an (unknown) Distribution.
5 Prediction of Labels from Data. Find a linear predictor $w \in \mathbb{R}^d$ such that when a (data, label) pair $(A_i, y_i)$ is drawn from the Distribution, then $A_i^\top w \approx y_i$ (predicted label $\approx$ true label).
6 Measure of Success. Given a loss function $\mathrm{loss}(a, b)$ comparing a prediction $a$ (from data) with a label $b$, we want the expected loss (= risk) to be small: $\mathbb{E}_{(A_i, y_i) \sim \text{Distribution}}\,[\mathrm{loss}(A_i^\top w, y_i)]$.
7 Finding a Linear Predictor via Empirical Risk Minimization (ERM). Draw i.i.d. data (samples) from the distribution: $(A_1, y_1), (A_2, y_2), \dots, (A_n, y_n) \sim$ Distribution. Output the predictor which minimizes the empirical risk: $\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \mathrm{loss}(A_i^\top w, y_i)$.
8 Primal and Dual Problems
9 Primal Problem: ERM. $\min_{w \in \mathbb{R}^d} \Big[ P(w) \equiv \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \Big]$, where $\phi_i : \mathbb{R}^m \to \mathbb{R}$ are $1/\gamma$-smooth and convex loss functions, $\lambda$ is the regularization parameter, $g$ is a 1-strongly convex function (regularizer), $d$ = # features (parameters), $n$ = # samples, and $A_i \in \mathbb{R}^{d \times m}$.
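As a concrete illustration (added here, not on the slide), the sketch below evaluates $P(w)$ in the common special case $m = 1$, binary logistic loss and $g(w) = \tfrac12\|w\|^2$; the data and function names are made up.

```python
# A minimal sketch (assumed special case: m = 1, logistic loss, g(w) = ||w||^2 / 2).
import numpy as np

def primal_objective(w, A, y, lam):
    """P(w) = (1/n) sum_i phi_i(A_i^T w) + lam * g(w), with phi_i the logistic loss."""
    margins = y * (A.T @ w)                      # A has the examples A_i as columns, y_i in {-1, +1}
    losses = np.log1p(np.exp(-margins))          # phi_i(A_i^T w) = log(1 + exp(-y_i A_i^T w))
    return losses.mean() + lam * 0.5 * (w @ w)   # lam * g(w) with g(w) = ||w||^2 / 2

rng = np.random.default_rng(0)
A, y = rng.standard_normal((10, 100)), rng.choice([-1.0, 1.0], size=100)
print(primal_objective(np.zeros(10), A, y, lam=1e-2))   # equals log(2) at w = 0
```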
10 Is the difficulty in n or d?
- Big n, work in the primal: process one loss function (= one example) at a time. Type of methods: stochastic gradient descent (modern variants: SAG, SVRG, S2GD, mS2GD, SAGA, S2CD, MISO, FINITO, ...).
- Big d, work in the primal: process one primal variable at a time. Type of methods: randomized coordinate descent (e.g., Hydra, Hydra2).
- Big n, work in the dual: process one dual variable (= one example) at a time. Type of methods: randomized coordinate descent applied to the dual (modern variants: RCDM, PCDM, Shotgun, SDCA, APPROX, Quartz, ALPHA, SDNA, SPDC, ASDCA, ...). E.g., SDCA = run coordinate descent on the dual problem.
11 Dual Problem. $\max_{\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^N = \mathbb{R}^{nm}} D(\alpha) \equiv -\lambda\, g^*\!\Big(\frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i\Big) - \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i)$, with $\alpha_i \in \mathbb{R}^m$. Here $g^*(w') = \max_{w \in \mathbb{R}^d} (w')^\top w - g(w)$ is 1-smooth and convex (since $g$ is 1-strongly convex), and $\phi_i^*(a') = \max_{a \in \mathbb{R}^m} (a')^\top a - \phi_i(a)$ is $\gamma$-strongly convex (since $\phi_i$ is $1/\gamma$-smooth).
12 An Efficient Dual Method Zheng Qu, P.R. and Tong Zhang Randomized dual coordinate ascent with arbitrary sampling In NIPS 2015 (arxiv: )
13 Empirical Risk
14 Primal Problem: ERM. $\min_{w \in \mathbb{R}^d} \Big[ P(w) \equiv \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \Big]$: the $\phi_i$ are $1/\gamma$-smooth and convex, $\lambda$ is the regularization parameter, $g$ is a 1-strongly convex function (regularizer), $d$ = # features (parameters), $n$ = # samples.
15 Assumption 1: The loss functions $\phi_i : \mathbb{R}^m \to \mathbb{R}$ are $1/\gamma$-smooth: $\|\nabla \phi_i(a) - \nabla \phi_i(a')\| \le \frac{1}{\gamma} \|a - a'\|$ for all $a, a' \in \mathbb{R}^m$; $1/\gamma$ is the Lipschitz constant of the gradient of the function.
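For instance (an illustration added here, not on the slide), the binary logistic loss $\phi(a) = \log(1 + e^{-ya})$ with $y \in \{-1, +1\}$ is $1/\gamma$-smooth with $\gamma = 4$, since $|\phi''(a)| \le 1/4$; a quick numerical check:

```python
# Numerical check (illustrative) that the gradient of the logistic loss is 1/4-Lipschitz.
import numpy as np

def phi_grad(a, y=1.0):
    return -y / (1.0 + np.exp(y * a))   # derivative of log(1 + exp(-y a))

a = np.linspace(-5.0, 5.0, 1001)
ratios = np.abs(np.diff(phi_grad(a))) / np.diff(a)
print(ratios.max() <= 0.25 + 1e-9)      # True: empirical Lipschitz constant <= 1/4 = 1/gamma
```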
16 Assumption 2: The regularizer $g$ is 1-strongly convex: $g(w) \ge g(w') + \langle \nabla g(w'), w - w' \rangle + \frac{1}{2} \|w - w'\|^2$ for all $w, w' \in \mathbb{R}^d$ (where $\nabla g(w')$ may be any subgradient). For example, $g(w) = \frac{1}{2}\|w\|^2$ satisfies this with equality.
17 Dual Problem. $\max_{\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^N = \mathbb{R}^{nm}} D(\alpha) \equiv -\lambda\, g^*\!\Big(\frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i\Big) - \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i)$, with $\alpha_i \in \mathbb{R}^m$; $g^*(w') = \max_{w \in \mathbb{R}^d} (w')^\top w - g(w)$ (1-smooth and convex), $\phi_i^*(a') = \max_{a \in \mathbb{R}^m} (a')^\top a - \phi_i(a)$ ($\gamma$-strongly convex).
18 The Algorithm
19 Quartz
20 Fenchel Duality. Let $\bar\alpha = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i$. Then
$P(w) - D(\alpha) = \lambda \big(g(w) + g^*(\bar\alpha)\big) + \frac{1}{n} \sum_{i=1}^n \big(\phi_i(A_i^\top w) + \phi_i^*(-\alpha_i)\big) = \lambda \big(g(w) + g^*(\bar\alpha) - \langle w, \bar\alpha \rangle\big) + \frac{1}{n} \sum_{i=1}^n \big(\phi_i(A_i^\top w) + \phi_i^*(-\alpha_i) + \langle A_i^\top w, \alpha_i \rangle\big)$.
Both bracketed terms are $\ge 0$ (weak duality). Optimality conditions: $w = \nabla g^*(\bar\alpha)$ and $\alpha_i = -\nabla \phi_i(A_i^\top w)$.
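As a concrete instance (added for illustration; the slide keeps $g$ and $\phi_i$ generic), the conjugates are explicit for the squared loss and the L2 regularizer, which makes the duality gap above directly computable:

$$\phi_i(a) = \tfrac{1}{2}(a - y_i)^2 \;\Rightarrow\; \phi_i^*(u) = \tfrac{1}{2}u^2 + u\,y_i, \qquad g(w) = \tfrac{1}{2}\|w\|^2 \;\Rightarrow\; g^*(w') = \tfrac{1}{2}\|w'\|^2, \;\; \nabla g^*(\bar\alpha) = \bar\alpha.$$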
21 The Algorithm: $(\alpha^t, w^t) \Rightarrow (\alpha^{t+1}, w^{t+1})$.
22 Quartz: Bird's Eye View.
STEP 1 (PRIMAL UPDATE): $w^{t+1} \leftarrow (1 - \theta) w^t + \theta\, \nabla g^*(\bar\alpha^t)$.
STEP 2 (DUAL UPDATE): Choose a random set $S_t$ of dual variables; for $i \in S_t$ do $\alpha_i^{t+1} \leftarrow \big(1 - \frac{\theta}{p_i}\big) \alpha_i^t + \frac{\theta}{p_i} \big({-\nabla \phi_i(A_i^\top w^{t+1})}\big)$, where $p_i = \mathbb{P}(i \in S_t)$.
23 The Algorithm (annotations). STEP 1 is a convex combination; $\theta$ is a constant step parameter. STEP 2 updates the chosen dual variables and then just maintains the relationship $\bar\alpha^{t+1} = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i^{t+1}$.
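A minimal sketch of Quartz (serial uniform sampling, $|S_t| = 1$, $p_i = 1/n$), specialized to the squared loss and $g(w) = \tfrac12\|w\|^2$ as in the worked conjugates above, so that $\nabla g^*(\bar\alpha) = \bar\alpha$ and $\gamma = 1$. This is an illustrative implementation under those assumptions, not the lecture's own code; the synthetic data is made up.

```python
import numpy as np

def quartz_ridge(A, y, lam, epochs=30, seed=0):
    """Quartz sketch for phi_i(a) = (a - y_i)^2 / 2, g(w) = ||w||^2 / 2, serial uniform sampling.
    A: d x n matrix whose columns are the examples A_i; y: labels in R^n."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    v = (A ** 2).sum(axis=0)                                 # v_i = ||A_i||^2 (ESO parameters, serial sampling)
    theta = np.min((1.0 / n) * lam * n / (v + lam * n))      # theta = min_i p_i*lam*gamma*n/(v_i + lam*gamma*n)
    alpha = np.zeros(n)                                      # dual variables
    abar = A @ alpha / (lam * n)                             # abar = (1/(lam n)) sum_i A_i alpha_i
    w = abar.copy()
    for t in range(epochs * n):
        # STEP 1: primal update  w <- (1-theta) w + theta * grad g*(abar) = (1-theta) w + theta * abar
        w = (1 - theta) * w + theta * abar
        # STEP 2: dual update on one random coordinate i; p_i = 1/n so theta/p_i = theta*n
        i = rng.integers(n)
        new_ai = (1 - theta * n) * alpha[i] + theta * n * (-(A[:, i] @ w - y[i]))
        abar += A[:, i] * (new_ai - alpha[i]) / (lam * n)    # maintain abar cheaply
        alpha[i] = new_ai
    return w, alpha

def gap(A, y, lam, w, alpha):
    """Duality gap P(w) - D(alpha) for this loss/regularizer pair."""
    n = A.shape[1]
    residuals = A.T @ w - y
    P = 0.5 * np.mean(residuals ** 2) + 0.5 * lam * (w @ w)
    abar = A @ alpha / (lam * n)
    # phi_i^*(-alpha_i) = alpha_i^2 / 2 - alpha_i y_i for the squared loss
    D = -0.5 * lam * (abar @ abar) - np.mean(0.5 * alpha ** 2 - alpha * y)
    return P - D

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A, y = rng.standard_normal((50, 200)), rng.standard_normal(200)
    w, alpha = quartz_ridge(A, y, lam=0.1, epochs=50)
    print("duality gap:", gap(A, y, 0.1, w, alpha))
```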
24 Other Dual Methods for ERM
25 Randomized Dual Coordinate Ascent Methods for ERM. [Table indicating, for each algorithm, which samplings it handles (1-nice, 1-optimal, τ-nice, arbitrary) and which features it has (additional speedup for sparse data, direct primal-dual analysis, acceleration).] Algorithms and references:
- SDCA: S. Shalev-Shwartz & T. Zhang, 09/2012
- mSDCA: M. Takáč, A. Bijral, P. R. & N. Srebro, 03/2013
- ASDCA: S. Shalev-Shwartz & T. Zhang, 05/2013
- AccProx-SDCA: S. Shalev-Shwartz & T. Zhang, 10/2013
- DisDCA: T. Yang, 2013
- Iprox-SDCA: P. Zhao & T. Zhang, 01/2014
- APCG: Q. Lin, Z. Lu & L. Xiao, 07/2014
- SPDC: Y. Zhang & L. Xiao, 09/2014
- Quartz: Z. Qu, P. R. & T. Zhang, 11/2014
26 Complexity
27 Assumption 3 (Expected Separable Overapproximation). Parameters $v_1, \dots, v_n$ satisfy $\mathbb{E}\,\big\| \sum_{i \in S_t} A_i \alpha_i \big\|^2 \le \sum_{i=1}^n p_i v_i \|\alpha_i\|^2$, where $p_i = \mathbb{P}(i \in S_t)$; the inequality must hold for all $\alpha_1, \dots, \alpha_n \in \mathbb{R}^m$.
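A quick numerical check (illustrative, not from the slides) that the ESO inequality holds for serial uniform sampling $S_t = \{i\}$ with $p_i = 1/n$ and $v_i = \lambda_{\max}(A_i^\top A_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 20, 3, 50
A = [rng.standard_normal((d, m)) for _ in range(n)]          # blocks A_i in R^{d x m}
h = [rng.standard_normal(m) for _ in range(n)]                # arbitrary vectors h_1, ..., h_n in R^m
v = [np.linalg.eigvalsh(Ai.T @ Ai).max() for Ai in A]         # v_i = lambda_max(A_i^T A_i)

lhs = np.mean([np.linalg.norm(A[i] @ h[i]) ** 2 for i in range(n)])        # E || sum_{i in S} A_i h_i ||^2
rhs = sum((1.0 / n) * v[i] * np.linalg.norm(h[i]) ** 2 for i in range(n))  # sum_i p_i v_i ||h_i||^2
print(lhs <= rhs + 1e-12)   # True
```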
28 Complexity Theorem [Qu, R. & Zhang '14]. Let $\theta = \min_i \frac{p_i \lambda \gamma n}{v_i + \lambda \gamma n}$. Then $\mathbb{E}[P(w^t) - D(\alpha^t)] \le (1 - \theta)^t \big(P(w^0) - D(\alpha^0)\big)$, so $t \ge \max_i \Big(\frac{1}{p_i} + \frac{v_i}{p_i \lambda \gamma n}\Big) \log\Big(\frac{P(w^0) - D(\alpha^0)}{\epsilon}\Big)$ implies $\mathbb{E}\big[P(w^t) - D(\alpha^t)\big] \le \epsilon$. Example: with normalized data ($v_i = \lambda_{\max}(A_i^\top A_i) \le 1$) and serial uniform sampling ($|S_t| = 1$, $p_i = 1/n$), $\theta$ is on the order of $1/n$, so the expected gap shrinks by a constant factor (e.g., below $1/10$ of its initial value) after a small multiple of $n$ iterations.
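A tiny helper (an illustration added here, not from the slides) that evaluates the quantities in the theorem: the rate parameter $\theta$ and the resulting iteration count.

```python
# Illustrative helper: theta = min_i p_i*lam*gamma*n/(v_i + lam*gamma*n), and the number of
# iterations t with (1 - theta)^t <= eps, i.e. t >= log(1/eps)/theta, which equals
# max_i (1/p_i + v_i/(p_i*lam*gamma*n)) * log(1/eps).
# (Read eps here as the target accuracy divided by the initial gap P(w^0) - D(alpha^0).)
import numpy as np

def quartz_rate_and_iterations(v, p, lam, gamma, eps):
    v, p = np.asarray(v, float), np.asarray(p, float)
    n = len(v)
    theta = np.min(p * lam * gamma * n / (v + lam * gamma * n))
    return theta, np.ceil(np.log(1.0 / eps) / theta)

# serial uniform sampling on normalized data: v_i <= 1, p_i = 1/n
n = 10**5
theta, t = quartz_rate_and_iterations(np.ones(n), np.full(n, 1.0 / n), lam=1e-4, gamma=1.0, eps=1e-6)
print(theta, t)
```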
29 One Dual Variable at a Time
30 Complexity of Quartz specialized to serial sampling. Optimal sampling: $n + \frac{1}{\lambda \gamma} \cdot \frac{1}{n} \sum_{i=1}^n L_i$. Uniform sampling: $n + \frac{\max_i L_i}{\lambda \gamma}$. Here $L_i = \lambda_{\max}(A_i^\top A_i)$.
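A small comparison of the two bounds on synthetic data (a sketch added here, not from the slides; it assumes $m = 1$, so that $L_i = \lambda_{\max}(A_i^\top A_i) = \|A_i\|^2$):

```python
# Compare the leading factors of the two serial-sampling bounds from the slide:
# optimal sampling: n + mean(L_i)/(lam*gamma); uniform sampling: n + max(L_i)/(lam*gamma).
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, gamma = 10_000, 100, 1e-4, 1.0
A = rng.standard_normal((d, n)) * rng.uniform(0.1, 3.0, size=n)   # columns with uneven norms
L = (A ** 2).sum(axis=0)                                          # L_i = ||A_i||^2 (m = 1)

bound_optimal = n + L.mean() / (lam * gamma)
bound_uniform = n + L.max() / (lam * gamma)
print(bound_optimal, bound_uniform)   # optimal (importance) sampling replaces max_i L_i by the average
```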
31 Experiment: Quartz vs SDCA, uniform vs optimal sampling. [Plot of the primal-dual gap against the number of epochs for Prox-SDCA, Quartz-U (10θ), Iprox-SDCA and Quartz-IP (10θ).] Data = cov1, n = 522,911, $\lambda = 10^{-6}$.
32 An Efficient Primal Method. S. Shalev-Shwartz. SDCA without Duality. NIPS 2015 (arXiv). Dominik Csiba and P. R. Primal method for ERM with flexible mini-batching schemes and non-convex losses. arXiv, 2015.
33 Empirical Risk
34 Primal Problem: ERM. $\min_{w \in \mathbb{R}^d} \Big[ P(w) \equiv \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \frac{\lambda}{2} \|w\|_2^2 \Big]$, where $\phi_i : \mathbb{R}^m \to \mathbb{R}$ are $1/\gamma$-smooth and convex, $\lambda$ is the regularization parameter, $d$ = # features (parameters), $n$ = # samples, $A_i \in \mathbb{R}^{d \times m}$. (We had a general 1-strongly convex function $g$ here before.)
35 The loss functions $\phi_i : \mathbb{R}^m \to \mathbb{R}$ are $1/\gamma$-smooth: $\|\nabla \phi_i(a) - \nabla \phi_i(a')\| \le \frac{1}{\gamma} \|a - a'\|$ for all $a, a' \in \mathbb{R}^m$; $1/\gamma$ is the Lipschitz constant of the gradient of the function.
36 Dual Problem. $\max_{\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^N = \mathbb{R}^{nm}} D(\alpha) \equiv -\lambda\, g^*\!\Big(\frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i\Big) - \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i)$, with $g^*(w') = \max_{w \in \mathbb{R}^d} (w')^\top w - g(w)$ (1-smooth and convex) and $\phi_i^*(a') = \max_{a \in \mathbb{R}^m} (a')^\top a - \phi_i(a)$ ($\gamma$-strongly convex). Goal: an efficient algorithm which naturally operates in the primal space (i.e., on the primal problem) only. The method will have the same theoretical guarantee as Quartz. The computer lab will be based on this.
37 6.2 The Algorithm
38 I. If $w^*$ is optimal, then $0 = \nabla P(w^*) = \frac{1}{n} \sum_{i=1}^n A_i \nabla \phi_i(A_i^\top w^*) + \lambda w^*$, hence $w^* = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i^*$ where $\alpha_i^* := -\nabla \phi_i(A_i^\top w^*)$.
39 II. Algorithmic ideas: (1) simultaneously search for both $w^*$ and $\alpha_1^*, \dots, \alpha_n^*$; (2) try to do something like $\alpha_i^{t+1} \leftarrow -\nabla \phi_i(A_i^\top w^t)$ (does not quite work: too greedy); (3) maintain the relationship $w^t = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i^t$.
40 The Algorithm: dfSDCA.
STEP 0 (INITIALIZE): Choose $\alpha_1^0, \dots, \alpha_n^0 \in \mathbb{R}^m$ and initialize the relationship: $w^0 = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i^0$.
STEP 1 (DUAL UPDATE): Choose a random set $S_t$ of dual variables; for $i \in S_t$ do $\alpha_i^{t+1} \leftarrow \big(1 - \frac{\theta}{p_i}\big) \alpha_i^t + \frac{\theta}{p_i} \big({-\nabla \phi_i(A_i^\top w^t)}\big)$ (controlling greed by taking a convex combination), where $p_i = \mathbb{P}(i \in S_t)$ and $\theta = \min_i \frac{p_i \lambda \gamma n}{v_i + \lambda \gamma n}$.
STEP 2 (PRIMAL UPDATE): $w^{t+1} \leftarrow w^t - \sum_{i \in S_t} \frac{\theta}{p_i \lambda n} A_i \big(\nabla \phi_i(A_i^\top w^t) + \alpha_i^t\big)$; this is just maintaining the relationship $w^{t+1} = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i^{t+1}$.
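A minimal sketch of dfSDCA with serial uniform sampling (assumptions: $m = 1$, binary logistic loss so $\gamma = 4$, regularizer $\tfrac{\lambda}{2}\|w\|^2$); the data and names are illustrative, not the lecture's own code. Note that only gradients of $\phi_i$ are needed, no conjugate functions.

```python
import numpy as np

def dfsdca_logistic(A, y, lam, gamma=4.0, epochs=30, seed=0):
    """dfSDCA sketch. A: d x n data (columns A_i); y: labels in {-1, +1}; logistic loss is 1/4-smooth."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    v = (A ** 2).sum(axis=0)                               # v_i = ||A_i||^2 (ESO, serial sampling)
    theta = np.min(lam * gamma / (v + lam * gamma * n))    # theta = min_i p_i*lam*gamma*n/(v_i + lam*gamma*n), p_i = 1/n
    alpha = np.zeros(n)                                    # "dual-like" auxiliary variables
    w = A @ alpha / (lam * n)                              # initialize w = (1/(lam n)) sum_i A_i alpha_i
    for t in range(epochs * n):
        i = rng.integers(n)
        margin = y[i] * (A[:, i] @ w)
        grad_i = -y[i] / (1.0 + np.exp(margin))            # grad phi_i(A_i^T w)
        # STEP 1: dual update  alpha_i <- (1 - theta/p_i) alpha_i + (theta/p_i)(-grad_i), theta/p_i = theta*n
        delta = (theta * n) * (-grad_i - alpha[i])
        alpha[i] += delta
        # STEP 2: primal update, maintaining w = (1/(lam n)) sum_i A_i alpha_i
        w += A[:, i] * delta / (lam * n)
    return w
```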
41 Complexity
42 ESO (same as before!). Parameters $v_1, \dots, v_n$ satisfy $\mathbb{E}\,\big\| \sum_{i \in S_t} A_i \alpha_i \big\|^2 \le \sum_{i=1}^n p_i v_i \|\alpha_i\|^2$, where $p_i = \mathbb{P}(i \in S_t)$; the inequality must hold for all $\alpha_1, \dots, \alpha_n \in \mathbb{R}^m$.
43 Complexity Theorem [Csiba & R. '15]. If $t \ge \max_i \Big(\frac{1}{p_i} + \frac{v_i}{p_i \lambda \gamma n}\Big) \log\Big(\frac{C}{\epsilon}\Big)$, where $p_i = \mathbb{P}(i \in S_t)$ and $C$ is a constant depending on $P$, $w^0$, $\alpha_i^0$, $w^*$, $\alpha_i^*$, then $\mathbb{E}\big[P(w^t) - P(w^*)\big] \le \epsilon$.
44 Experiments
45 Some More Efficient Primal Methods for ERM: SAG, SVRG and S2GD.
SAG: Stochastic Average Gradient. N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. NIPS, 2012.
SVRG: Stochastic Variance Reduced Gradient. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NIPS.
S2GD: Semi-Stochastic Gradient Descent. J. Konečný and P. R. Semi-stochastic gradient descent methods. arXiv, 2013.
46 Modern Methods for ERM vs SGD. Dataset: rcv1 (n = 20,241; d = 47,232). [Plot; the compared methods include dfSDCA.]
47 Behavior of dfSDCA for various parameters. Dataset: rcv1 (n = 20,241; d = 47,232). [Plot.]
48 (Minibatching)
49 NAIVE APPROACH
50-54 Failure of the naive approach. [Sequence of illustrations with points labeled 0, 1a, 1b, 1, 2a, 2b, 2.]
55 Idea: averaging updates may help. [Illustration.]
56-57 Averaging can be bad too!!! [Illustrations; the point we WANT is marked.]
58 Experiment with a 1 billion-by-2 billion LASSO problem. P. R. and Martin Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 2015 (arXiv).
59 Coordinate Updates. [Plot.]
60 [Plots.]
61 Wall Time. [Plot.]
62 Minibatching & Quartz [Qu, R & Zhang 14]
63 Data Sparsity. $\omega$ is a normalized measure of average sparsity of the data, with $1 \le \omega \le n$: $\omega = 1$ for fully sparse data and $\omega = n$ for fully dense data.
64 Complexity of Quartz [Qu, R. & Zhang '14] with minibatch size $\tau$, writing $T(\tau)$ for the iteration bound:
- Fully sparse data ($\omega = 1$): $\frac{n}{\tau} + \frac{\max_i L_i}{\lambda \gamma \tau}$.
- Fully dense data ($\omega = n$): $\frac{n}{\tau} + \frac{\max_i L_i}{\lambda \gamma}$.
- Any data ($1 \le \omega \le n$): $T(\tau) = \frac{n}{\tau} + \Big(1 + \frac{(\omega - 1)(\tau - 1)}{n - 1}\Big) \frac{\max_i L_i}{\lambda \gamma \tau}$.
65 Speedup. Assume the data is normalized: $L_i = \lambda_{\max}(A_i^\top A_i) \le 1$. Then $T(\tau) = \frac{1}{\tau}\Big(1 + \frac{(\omega - 1)(\tau - 1)}{(n - 1)(1 + \lambda \gamma n)}\Big) T(1)$. Linear speedup up to a certain data-independent minibatch size: $\tau \le 2 + \lambda \gamma n \;\Rightarrow\; T(\tau) \le \frac{2\, T(1)}{\tau}$. Further data-dependent speedup, up to the extreme case: $\omega = O(\lambda \gamma n) \;\Rightarrow\; T(\tau) = O\big(\frac{T(1)}{\tau}\big)$.
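The speedup formula from this slide, evaluated numerically (an illustrative sketch added here; the parameter values are made up):

```python
# Evaluate T(tau)/T(1) = (1/tau) * (1 + (omega-1)*(tau-1)/((n-1)*(1 + lam*gamma*n)))
# and report the speedup factor T(1)/T(tau), for normalized data (L_i <= 1).
import numpy as np

def quartz_speedup(tau, n, omega, lam, gamma=1.0):
    ratio = (1.0 + (omega - 1.0) * (tau - 1.0) / ((n - 1.0) * (1.0 + lam * gamma * n))) / tau
    return 1.0 / ratio            # speedup factor T(1)/T(tau)

n = 10**6
for omega in (10**2, 10**4, 10**6):   # sparse, denser, fully dense data
    print(omega, [round(quartz_speedup(tau, n, omega, lam=1e-4), 1) for tau in (1, 10, 100, 1000)])
```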
66 Quartz: Parallelization Speedup [Qu, R. & Zhang '14]. Setup: # examples $n = 10^6$; smoothness of loss functions $\gamma = 1$; regularization ranging from high ($\lambda = 1/\sqrt{n}$) to low ($\lambda = 1/n$). [Three panels plot the speedup factor against $\tau$ for $\lambda = 10^{-3}, 10^{-4}, 10^{-6}$: Sparse Data ($\omega = 10^2$), Denser Data ($\omega = 10^4$), Fully Dense Data ($\omega = 10^6$).]
67 astro_ph: n = 29,882, density = 0.08%. [Plot of the speedup factor T(1,1)/T(1,τ) against τ, comparing practice and theory for λ = 5.8e-3, 1e-3 and 1e-4.]
68 CCAT: n = 781,265, density = 0.16%. [Plot of the speedup factor T(1,1)/T(1,τ) against τ, comparing practice and theory for λ = 1.1e-3, 1e-4 and 1e-5.]
69 Primal-dual methods with τ-nice sampling (iteration complexity; admissible regularizer $g$):
- SDCA [S. Shalev-Shwartz & Zhang '12] (serial baseline; assumes $L_i = 1$): $n + \frac{1}{\lambda\gamma}$; $g = \frac{1}{2}\|\cdot\|^2$.
- ASDCA [S. Shalev-Shwartz & Zhang '13a]: $4 \max\Big(\frac{n}{\tau}, \sqrt{\frac{n}{\lambda\gamma\tau}}, \frac{1}{\lambda\gamma\tau}, \frac{n^{1/3}}{(\lambda\gamma\tau)^{2/3}}\Big)$; $g = \frac{1}{2}\|\cdot\|^2$.
- SPDC [Zhang & Xiao '14]: $\frac{n}{\tau} + \sqrt{\frac{n}{\lambda\gamma\tau}}$; general $g$.
- Quartz: $\frac{n}{\tau} + \Big(1 + \frac{(\omega - 1)(\tau - 1)}{n - 1}\Big)\frac{1}{\lambda\gamma\tau}$; general $g$.
70 For sufficiently sparse data, Quartz wins even when compared against accelerated methods. The slide compares the methods across four regularization regimes, $\lambda\gamma n = \Theta(1/\tau)$, $\Theta(1)$, $\Theta(\tau)$ and $\Theta(\sqrt{n})$ (equivalently, $\kappa := \frac{1}{\lambda\gamma} = \tau n$, $n$, $n/\tau$ and $\sqrt{n}$): SDCA's bound stays of order $n$ (up to a factor $\tau$ in the most ill-conditioned regime); the accelerated methods ASDCA and SPDC have bounds featuring terms such as $n\sqrt{\tau}$ and $n + n^{3/4}\sqrt{\tau}$; Quartz's bounds feature $\omega$-dependent terms such as $n + \omega\tau$, so when the sparsity measure $\omega$ is small, Quartz has the best bound of the three.
71 THE END