Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure
1 Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure
Alberto Bietti, Julien Mairal
Inria Grenoble (Thoth)
March 21, 2017
2 Stochastic optimization in machine learning

Stochastic approximation: $\min_x \mathbb{E}_{\zeta \sim D}[f(x, \zeta)]$
- Infinite datasets (expected risk, $D$: data distribution), or single pass
- SGD, stochastic mirror descent, FOBOS, RDA
- $O(1/\epsilon)$ complexity

Incremental methods with variance reduction: $\min_x \frac{1}{n} \sum_{i=1}^n f_i(x)$
- Finite datasets (empirical risk): $f_i(x) = \ell(y_i, x^\top \xi_i) + (\mu/2)\|x\|^2$
- SAG, SDCA, SVRG, SAGA, MISO, etc.
- $O(\log 1/\epsilon)$ complexity
3 Data perturbations in machine learning

Perturbations of data are useful for regularization, stable feature selection, and privacy-aware learning.

We focus on data augmentation of a finite training set, for regularization purposes (better performance on test data), e.g.:
- Image data augmentation: add random transformations of each image in the training set (crop, scale, rotate, brightness, contrast, etc.)
- Dropout: set coordinates of feature vectors to 0 with probability $\delta$.

Figure: Data augmentation on an MNIST digit (left), Dropout on text (right).
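As a concrete illustration of the dropout perturbation mentioned above, here is a minimal sketch. The inverse-probability rescaling of the surviving coordinates is an added convention to keep the perturbed vector unbiased; the slides only specify the zeroing.

```python
import numpy as np

def dropout_perturbation(xi, delta, rng):
    """Zero each coordinate of a feature vector independently with
    probability delta; rescale survivors by 1/(1-delta) (convention
    added here so that E_rho[perturbed xi] = xi)."""
    mask = rng.random(xi.shape) >= delta
    return xi * mask / (1.0 - delta)

rng = np.random.default_rng(0)
xi = rng.standard_normal(10)   # one feature vector from the finite training set
print(dropout_perturbation(xi, delta=0.3, rng=rng))
```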
4 Optimization objective with perturbations

$$\min_{x \in \mathbb{R}^p} \left\{ F(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{\rho \sim \Gamma}[f_i(x, \rho)] + h(x) \right\}, \qquad f_i(x) = \mathbb{E}_{\rho \sim \Gamma}[f_i(x, \rho)]$$

- $\rho$: perturbation
- $f_i(\cdot, \rho)$ is convex with $L$-Lipschitz gradients
- $F$ is $\mu$-strongly convex
- $h$: convex, possibly non-smooth penalty, e.g. $\ell_1$ norm
5 Can we do better than SGD?

$$\min_{x \in \mathbb{R}^p} \left\{ f(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{\rho \sim \Gamma}[f_i(x, \rho)] \right\}$$

SGD is a natural choice:
- Sample index $i_t$, perturbation $\rho_t \sim \Gamma$
- Update: $x_t = x_{t-1} - \eta_t \nabla f_{i_t}(x_{t-1}, \rho_t)$
- $O(\sigma^2_{\mathrm{tot}}/\mu t)$ convergence, with $\sigma^2_{\mathrm{tot}} := \mathbb{E}_{i,\rho}[\|\nabla f_i(x^*, \rho)\|^2]$

Key observation: the variance from perturbations alone is small compared to the variance across all examples.

Contribution: improve the convergence of SGD by exploiting the finite-sum structure using variance reduction. This yields $O(\sigma^2/\mu t)$ convergence, with
$$\mathbb{E}_\rho\left[\|\nabla f_i(x^*, \rho) - \nabla f_i(x^*)\|^2\right] \le \sigma^2 \ll \sigma^2_{\mathrm{tot}}.$$
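As a point of reference, a minimal sketch of the SGD baseline above for an $\ell_2$-regularized square loss with data perturbations. The loss, data layout, and step-size decay are illustrative choices, not taken from the slides; the perturbation oracle can be, e.g., the dropout sketch above via `perturb=lambda xi, rng: dropout_perturbation(xi, 0.3, rng)`.

```python
import numpy as np

def sgd_with_perturbations(Xi, y, mu, perturb, steps, eta0, rng):
    """SGD on f(x) = (1/n) sum_i E_rho[f_i(x, rho)] with the illustrative loss
    f_i(x, rho) = 0.5*(y_i - <rho(xi_i), x>)^2 + (mu/2)*||x||^2.
    Each step samples an index i_t and a perturbation rho_t, then takes a
    stochastic gradient step with step-size eta_t."""
    n, p = Xi.shape
    x = np.zeros(p)
    for t in range(1, steps + 1):
        i = rng.integers(n)
        xi = perturb(Xi[i], rng)                 # rho_t ~ Gamma applied to example i_t
        grad = -(y[i] - xi @ x) * xi + mu * x    # gradient of f_i(x, rho_t)
        eta = eta0 / (1.0 + mu * eta0 * t)       # standard O(1/(mu t)) decay (illustrative)
        x -= eta * grad
    return x
```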
6 Background: MISO algorithm (Mairal, 2015)

Finite-sum problem: $\min_x f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$

Maintains a quadratic lower-bound model $d_i^t(x) = \frac{\mu}{2}\|x - z_i^t\|^2 + c_i^t$ on each $f_i$.

$d_i^t$ is updated using a strong-convexity lower bound on $f_i$:
$$f_i(x) \ge f_i(x_{t-1}) + \langle \nabla f_i(x_{t-1}), x - x_{t-1} \rangle + \frac{\mu}{2}\|x - x_{t-1}\|^2 =: l_i^t(x)$$

Two steps:
- Select $i_t$, update:
$$d_i^t(x) = \begin{cases} (1-\alpha)\, d_i^{t-1}(x) + \alpha\, l_i^t(x), & \text{if } i = i_t \\ d_i^{t-1}(x), & \text{otherwise} \end{cases}$$
- Minimize the model: $x_t = \arg\min_x \left\{ D_t(x) = \frac{1}{n}\sum_{i=1}^n d_i^t(x) \right\}$
7 MISO algorithm (Mairal, 2015)

Final algorithm: at iteration $t$, choose index $i_t$ at random and update:
$$z_i^t = \begin{cases} (1-\alpha)\, z_i^{t-1} + \alpha \left( x_{t-1} - \frac{1}{\mu} \nabla f_i(x_{t-1}) \right), & \text{if } i = i_t \\ z_i^{t-1}, & \text{otherwise} \end{cases}$$
$$x_t = \frac{1}{n} \sum_{i=1}^n z_i^t$$

- Complexity $O((n + L/\mu) \log 1/\epsilon)$, typical of variance reduction
- Similar to SDCA without duality (Shalev-Shwartz, 2016)
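A minimal sketch of this update for the exact-gradient, smooth case, assuming a user-supplied gradient oracle `grad_i(i, x)` returning $\nabla f_i(x)$. The constant step-size bound reuses the condition stated in the convergence analysis later in these slides; it is an illustrative choice, not MISO's original tuning.

```python
import numpy as np

def miso(grad_i, n, p, mu, L, epochs, rng):
    """MISO for min_x (1/n) sum_i f_i(x): keep one anchor z_i per f_i and
    maintain x as the average of the anchors."""
    kappa = L / mu
    alpha = min(0.5, n / (2 * (2 * kappa - 1)))   # bound reused from the later analysis
    z = np.zeros((n, p))                          # z_i^0 = x_0 = 0
    x = z.mean(axis=0)
    for _ in range(epochs * n):
        i = rng.integers(n)
        z_new = (1 - alpha) * z[i] + alpha * (x - grad_i(i, x) / mu)
        x += (z_new - z[i]) / n                   # incremental update of the average
        z[i] = z_new
    return x
```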
8 Stochastic MISO

$$\min_{x \in \mathbb{R}^p} \left\{ f(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{\rho \sim \Gamma}[f_i(x, \rho)] \right\}$$

- With perturbations, we cannot compute exact strong-convexity lower bounds on $f_i = \mathbb{E}_\rho[f_i(\cdot, \rho)]$
- Instead, use approximate lower bounds based on stochastic gradient estimates $\nabla f_{i_t}(x_{t-1}, \rho_t)$
- Allow decreasing step-sizes $\alpha_t$ in order to guarantee convergence, as in stochastic approximation
9 Stochastic MISO: algorithm

Input: step-size sequence $(\alpha_t)_{t \ge 1}$
for $t = 1, \ldots$ do
  Sample $i_t$ uniformly at random, $\rho_t \sim \Gamma$, and update:
  $$z_i^t = \begin{cases} (1-\alpha_t)\, z_i^{t-1} + \alpha_t \left( x_{t-1} - \frac{1}{\mu} \nabla f_{i_t}(x_{t-1}, \rho_t) \right), & \text{if } i = i_t \\ z_i^{t-1}, & \text{otherwise} \end{cases}$$
  $$x_t = \frac{1}{n} \sum_{i=1}^n z_i^t = x_{t-1} + \frac{1}{n}\left( z_{i_t}^t - z_{i_t}^{t-1} \right)$$
end for

Note: reduces to MISO for $\sigma^2 = 0$, $\alpha_t = \alpha$, and to SGD for $n = 1$.
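A runnable sketch of this update, differing from the MISO sketch above only by the perturbed gradient and the decreasing step-size sequence. The ridge-regression loss and the `perturb` oracle are illustrative assumptions; only the update rule follows the slide.

```python
import numpy as np

def smiso(Xi, y, mu, alphas, epochs, perturb, rng):
    """Stochastic MISO on f(x) = (1/n) sum_i E_rho[f_i(x, rho)] with the
    illustrative loss f_i(x, rho) = 0.5*(y_i - <perturb(xi_i), x>)^2 + (mu/2)*||x||^2.
    alphas(t) returns the step-size alpha_t."""
    n, p = Xi.shape
    z = np.zeros((n, p))                       # anchors z_i^0 = x_0 = 0
    x = z.mean(axis=0)
    t = 0
    for _ in range(epochs * n):
        t += 1
        alpha = alphas(t)                      # decreasing step-size sequence
        i = rng.integers(n)                    # i_t uniform
        xi = perturb(Xi[i], rng)               # rho_t ~ Gamma
        g = -(y[i] - xi @ x) * xi + mu * x     # grad of f_i(x_{t-1}, rho_t)
        z_new = (1 - alpha) * z[i] + alpha * (x - g / mu)
        x += (z_new - z[i]) / n                # x_t = x_{t-1} + (z_it^t - z_it^{t-1}) / n
        z[i] = z_new
    return x
```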
10 Stochastic MISO: convergence analysis

Define the Lyapunov function (with $z_i^* := x^* - \frac{1}{\mu} \nabla f_i(x^*)$)
$$C_t = \frac{1}{2}\|x_t - x^*\|^2 + \frac{\alpha_t}{2n} \sum_{i=1}^n \|z_i^t - z_i^*\|^2.$$

Theorem (Recursion on $C_t$, smooth case). If $(\alpha_t)_{t \ge 1}$ are positive, non-increasing step-sizes with
$$\alpha_1 \le \min\left\{ \frac{1}{2}, \frac{n}{2(2\kappa - 1)} \right\}, \qquad \kappa = L/\mu,$$
then $C_t$ obeys the recursion
$$\mathbb{E}[C_t] \le \left(1 - \frac{\alpha_t}{n}\right) \mathbb{E}[C_{t-1}] + 2 \left(\frac{\alpha_t}{n}\right)^2 \frac{\sigma^2}{\mu^2}.$$

Note: SGD satisfies a similar recursion with $\sigma^2_{\mathrm{tot}}$ in place of $\sigma^2$.
11 Stochastic MISO: convergence with decreasing step-sizes

Similar to SGD (Bottou et al., 2016).

Theorem (Convergence of Lyapunov function). Let the sequence of step-sizes $(\alpha_t)_{t \ge 1}$ be defined by
$$\alpha_t = \frac{2n}{\gamma + t} \quad \text{for } \gamma \ge 0 \text{ s.t. } \alpha_1 \le \min\left\{ \frac{1}{2}, \frac{n}{2(2\kappa-1)} \right\}.$$
Then, for $t \ge 0$,
$$\mathbb{E}[C_t] \le \frac{\nu}{\gamma + t + 1}, \qquad \text{where } \nu := \max\left\{ \frac{8\sigma^2}{\mu^2}, (\gamma + 1) C_0 \right\}.$$

Q: How can we get rid of the dependence on $C_0$?
12 Practical step-size strategy

- Following Bottou et al. (2016), we keep the step-size constant for a few epochs in order to quickly forget the initial condition $C_0$
- Using a constant step-size $\bar\alpha$, we converge linearly to within a constant error $\bar C = \frac{2\bar\alpha \sigma^2}{n\mu^2}$ (in practice: a few epochs)
- We then start decreasing the step-sizes, with $\gamma$ large enough so that $\alpha_1 = 2n/(\gamma+1) \le \bar\alpha$; no more $C_0$ in the convergence rate!

Overall, the complexity for reaching $\mathbb{E}[\|x_t - x^*\|^2] \le \epsilon$ is
$$O\left( (n + L/\mu) \log \frac{C_0}{\epsilon} \right) + O\left( \frac{\sigma^2}{\mu^2 \epsilon} \right).$$

For $\mathbb{E}[f(x_t) - f(x^*)] \le \epsilon$, the second term becomes $O(L\sigma^2/\mu^2\epsilon)$ via smoothness. Iterate averaging brings this down to $O(\sigma^2/\mu\epsilon)$.
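A sketch of this two-phase schedule: a constant step $\bar\alpha$ for a few warm-up epochs, then $\alpha_t = 2n/(\gamma + t)$ with $\gamma$ chosen so the decreasing phase starts at $\bar\alpha$. The warm-up length and the restart of the iteration counter at the switch are implementation choices, not prescribed by the slides; the result can be passed as `alphas` to the `smiso` sketch above.

```python
def make_stepsize_schedule(n, kappa, warmup_epochs=2):
    """Return alpha(t): constant alpha_bar for warmup_epochs*n iterations,
    then alpha_t = 2n/(gamma + t') with gamma s.t. 2n/(gamma + 1) = alpha_bar."""
    alpha_bar = min(0.5, n / (2 * (2 * kappa - 1)))   # bound from the analysis
    t_switch = warmup_epochs * n
    gamma = 2 * n / alpha_bar - 1                     # first decreasing step equals alpha_bar
    def alpha(t):
        if t <= t_switch:
            return alpha_bar
        return 2 * n / (gamma + (t - t_switch))
    return alpha
```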
13 Extensions

Composite objectives ($h \neq 0$, e.g., $\ell_1$ penalty)
- MISO extends to this case by adding $h$ to the lower-bound model (Lin et al., 2015)
- Different Lyapunov function ($\|x_t - x^*\|^2$ replaced by an upper bound)
- Similar to Regularized Dual Averaging when $n = 1$

Non-uniform sampling
- The smoothness constants $L_i$ of the individual $f_i$ can vary a lot in heterogeneous datasets
- Sampling difficult examples more often can improve the dependence on the smoothness constant from $L_{\max}$ to the average $\bar L$
- The same convergence results apply (same Lyapunov recursion, decreasing step-sizes, iterate averaging)
14 Experiments: dropout

The dropout rate $\delta$ controls the variance of the perturbations.

Figure: suboptimality $f - f^*$ versus epochs on the gene dataset for several dropout rates ($\delta = 0.30$, $\delta = 0.01$, and one intermediate value), comparing S-MISO, SGD, and N-SAGA at step-sizes $\eta = 0.1$ and $\eta = 1.0$.
15 Experiments: image data augmentation

Random image crops and scalings, encoded with an unsupervised deep convolutional network. Different conditioning, controlled by $\mu$.

Figure: suboptimality $f - f^*$ versus epochs on STL-10 with CKN features, for $\mu = 10^{-3}$ and two other values of $\mu$, comparing S-MISO, SGD, and N-SAGA at step-sizes $\eta = 0.1$ and $\eta = 1.0$.
16 Conclusion

- Exploit the underlying finite-sum structure of stochastic optimization problems using variance reduction
- Brings the SGD variance term down to the variance induced by the perturbations only
- Useful for data augmentation (e.g. random image transformations, Dropout)
- Future work: application to stable feature selection?

C++/Eigen library with Cython extension available.
17 References

- L. Bottou, F. E. Curtis, and J. Nocedal. Optimization Methods for Large-Scale Machine Learning. arXiv preprint, 2016.
- S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint, 2012.
- H. Lin, J. Mairal, and Z. Harchaoui. A Universal Catalyst for First-Order Optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.
- J. Mairal. Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 25(2), 2015.
- S. Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In International Conference on Machine Learning (ICML), 2016.
18 Acceleration by iterate averaging

- For function values, averaging helps bring the complexity term $O(L\sigma^2/\mu^2\epsilon)$ down to $O(\sigma^2/\mu\epsilon)$
- Similar technique to Lacoste-Julien et al. (2012), but allows small initial step-sizes

Theorem (Convergence under iterate averaging). Let the step-size sequence $(\alpha_t)_{t \ge 1}$ be defined by
$$\alpha_t = \frac{2n}{\gamma + t} \quad \text{for } \gamma \ge 1 \text{ s.t. } \alpha_1 \le \min\left\{ \frac{1}{2}, \frac{n}{4(2\kappa-1)} \right\}.$$
We have
$$\mathbb{E}[f(\bar x_T) - f(x^*)] \le \frac{2\mu\gamma(\gamma-1) C_0}{T(2\gamma + T - 1)} + \frac{16\sigma^2}{\mu(2\gamma + T - 1)},$$
where $\bar x_T := \frac{2}{T(2\gamma + T - 1)} \sum_{t=0}^{T-1} (\gamma + t)\, x_t$.
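For completeness, a small sketch of the weighted average $\bar x_T$ from the theorem, maintained online; the optimizer loop is assumed to yield the iterates $x_t$ one at a time (for instance, a variant of the `smiso` sketch above turned into a generator).

```python
def averaged_iterate(iterates, gamma):
    """Online weighted average x_bar_T = 2/(T*(2*gamma + T - 1)) *
    sum_{t=0}^{T-1} (gamma + t) * x_t, updated one iterate at a time."""
    x_bar, weight_sum = None, 0.0
    for t, x_t in enumerate(iterates):         # t = 0, 1, ..., T-1
        w = gamma + t
        weight_sum += w
        # running weighted mean: x_bar <- x_bar + (w / W) * (x_t - x_bar)
        x_bar = x_t.copy() if x_bar is None else x_bar + (w / weight_sum) * (x_t - x_bar)
    return x_bar
```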
19 Stochastic MISO (composite, non-uniform sampling)

Input: step-sizes $(\alpha_t)_{t \ge 1}$, sampling distribution $q$
for $t = 1, \ldots$ do
  Sample an index $i_t \sim q$, a perturbation $\rho_t \sim \Gamma$, and update:
  $$z_i^t = \begin{cases} \left(1 - \frac{\alpha_t}{q_i n}\right) z_i^{t-1} + \frac{\alpha_t}{q_i n} \left( x_{t-1} - \frac{1}{\mu} \nabla f_{i_t}(x_{t-1}, \rho_t) \right), & \text{if } i = i_t \\ z_i^{t-1}, & \text{otherwise} \end{cases}$$
  $$\bar z_t = \frac{1}{n} \sum_{i=1}^n z_i^t = \bar z_{t-1} + \frac{1}{n}\left( z_{i_t}^t - z_{i_t}^{t-1} \right)$$
  $$x_t = \mathrm{prox}_{h/\mu}(\bar z_t)$$
end for

Note: similar to RDA for $n = 1$ when $\alpha_t = 1/t$.
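A sketch of this composite variant for $h(x) = \lambda\|x\|_1$, so that $\mathrm{prox}_{h/\mu}$ is soft-thresholding at level $\lambda/\mu$. The sampling distribution $q$ is left as an input (e.g., $q_i \propto L_i$, one natural reading of "sampling difficult examples more often" on the extensions slide); the loss and perturbation oracle are illustrative assumptions as in the earlier sketches.

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau*||.||_1: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def smiso_composite(Xi, y, mu, lam, alphas, q, epochs, perturb, rng):
    """Composite S-MISO with sampling distribution q and h(x) = lam*||x||_1."""
    n, p = Xi.shape
    z = np.zeros((n, p))
    z_bar = z.mean(axis=0)
    x = soft_threshold(z_bar, lam / mu)        # x_0 = prox_{h/mu}(z_bar_0)
    t = 0
    for _ in range(epochs * n):
        t += 1
        i = rng.choice(n, p=q)                 # i_t ~ q
        xi = perturb(Xi[i], rng)               # rho_t ~ Gamma
        g = -(y[i] - xi @ x) * xi + mu * x     # grad of f_i(x_{t-1}, rho_t)
        step = alphas(t) / (q[i] * n)          # effective step alpha_t / (q_i * n)
        z_new = (1 - step) * z[i] + step * (x - g / mu)
        z_bar += (z_new - z[i]) / n            # z_bar_t = z_bar_{t-1} + (z_i^t - z_i^{t-1})/n
        z[i] = z_new
        x = soft_threshold(z_bar, lam / mu)    # x_t = prox_{h/mu}(z_bar_t)
    return x
```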
20 General S-MISO: analysis

Lyapunov function:
$$C_t^q = F(x^*) - D_t(x_t) + \frac{\mu\alpha_t}{2n} \sum_{i=1}^n \frac{1}{q_i n} \|z_i^t - z_i^*\|^2.$$

Bound on the iterates:
$$\frac{\mu}{2} \mathbb{E}[\|x_t - x^*\|^2] \le \mathbb{E}[F(x^*) - D_t(x_t)].$$

Recursion, with $\sigma_q^2 = \frac{1}{n} \sum_i \frac{\sigma_i^2}{q_i n}$:
$$\mathbb{E}[C_t^q] \le \left(1 - \frac{\alpha_t}{n}\right) \mathbb{E}[C_{t-1}^q] + 2\left(\frac{\alpha_t}{n}\right)^2 \frac{\sigma_q^2}{\mu}.$$