Stochastic Composition Optimization
Stochastic Composition Optimization: Algorithms and Sample Complexities
Mengdi Wang
Joint work with Ethan X. Fang, Han Liu, and Ji Liu
ICCOPT, Tokyo, August 8-11
Collaborators
- M. Wang, X. Fang, and H. Liu. Stochastic Compositional Gradient Descent: Algorithms for Minimizing Compositions of Expected-Value Functions. Mathematical Programming, submitted in 2014, to appear.
- M. Wang and J. Liu. Accelerating Stochastic Composition Optimization.
- M. Wang and J. Liu. A Stochastic Compositional Subgradient Method Using Markov Samples.
Outline
1. Background: Why is SGD a good method?
2. A New Problem: Stochastic Composition Optimization
3. Stochastic Composition Algorithms: Convergence and Sample Complexity
4. Acceleration via Smoothing-Extrapolation
Part 1. Background: Why is SGD a good method?
Background
Machine learning is optimization.
- Learning from batch data:   min_{x∈R^d} (1/n) Σ_{i=1}^n l(x; A_i, b_i) + ρ(x)
- Learning from online data:  min_{x∈R^d} E_{A,b}[l(x; A, b)] + ρ(x)
Both problems can be formulated as stochastic convex optimization:
    min_x E[f(x, ξ)]
where the expectation is taken over the batch data set or an unknown distribution.
This general framework encompasses likelihood estimation, online learning, empirical risk minimization, multi-armed bandits, and online MDPs.
Stochastic gradient descent (SGD) updates by taking sample gradients:
    x_{k+1} = x_k − α_k ∇f(x_k, ξ_k)
A special case of stochastic approximation, with a long history (Robbins and Monro, Kushner and Yin, Polyak and Juditsky, Benveniste et al., Ruszczyński, Borkar, Bertsekas and Tsitsiklis, and many others).
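The SGD update above can be sketched in a few lines. This is a minimal toy instance of min_x E[f(x, ξ)] (streaming least squares); the ground-truth vector x_star, the noise level, and the stepsize constants are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth for the demo: b_k = a_k^T x* + noise.
x_star = np.array([1.0, -2.0])

def sample():
    a = rng.normal(size=2)
    b = a @ x_star + 0.1 * rng.normal()
    return a, b

x = np.zeros(2)
for k in range(1, 20001):
    a, b = sample()
    grad = 2.0 * (a @ x - b) * a      # sample gradient of f(x, xi_k)
    x = x - grad / (10.0 + k)         # diminishing stepsize alpha_k ~ 1/k
print(x)
```

Each update touches only one data point, which is exactly the scalability property the slide emphasizes.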
Background: Stochastic first-order methods
Stochastic gradient descent (SGD) updates by taking sample gradients:
    x_{k+1} = x_k − α_k ∇f(x_k, ξ_k)
(1,410,000 results on Google Scholar, and 24,400 of them since 2016!)
Why is SGD a good method in practice? When processing either batch or online data, a scalable algorithm needs to update using partial information (a small subset of all data).
Answer: we have no other choice.
Why is SGD a good method beyond practical reasons? SGD achieves optimal convergence after processing k samples:
    E[F(x_k) − F*] = O(1/√k)  for convex minimization
    E[F(x_k) − F*] = O(1/k)   for strongly convex minimization
(Nemirovski and Yudin 1983; Agarwal et al. 2012; Rakhlin et al. 2012; Ghadimi and Lan 2012, 2013; Shamir and Zhang 2013; and many more.)
Beyond convexity: nearly optimal online PCA (Li, Wang, Liu, Zhang 2015).
Answer: strong theoretical guarantees for data-driven problems.
Part 2. A New Problem: Stochastic Composition Optimization
Stochastic Composition Optimization
Consider the problem
    min_{x∈X} F(x) := (f ∘ g)(x) = f(g(x)),
where the outer and inner functions are expected-value functions,
    f(y) = E[f_v(y)],   g(x) = E[g_w(x)],
with f: R^m → R, g: R^n → R^m, and X a closed and convex set in R^n.
- We focus on the case where the overall problem is convex (for now).
- No structural assumptions on f, g (they may be nonconvex, nonmonotone, or nondifferentiable).
- We may not know the distributions of v, w.
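A concrete toy instance may help fix ideas. Below the inner function is a noisy linear map and the outer function is the squared norm; the matrices, noise level, and sample count are assumptions for illustration. Note that evaluating F requires averaging inner samples before applying f.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance of F(x) = f(E[g_w(x)]):
#   inner  g_w(x) = A_w x - b_w  with E[A_w] = A0, E[b_w] = b0 (noisy linear map)
#   outer  f(y) = ||y||^2        (deterministic, for simplicity)
# so F(x) = ||A0 x - b0||^2; the expectation sits *inside* the outer function.
A0 = np.array([[2.0, 0.0], [0.0, 1.0]])
b0 = np.array([1.0, -1.0])

def g_sample(x):
    A = A0 + 0.1 * rng.normal(size=(2, 2))
    b = b0 + 0.1 * rng.normal(size=2)
    return A @ x - b

def F_true(x):
    return np.sum((A0 @ x - b0) ** 2)

x = np.array([0.5, 0.5])
# Average inner samples first, then apply the outer function:
g_bar = np.mean([g_sample(x) for _ in range(50000)], axis=0)
print(F_true(x), np.sum(g_bar ** 2))
```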
Expectation Minimization vs. Stochastic Composition Optimization
Recall the classical problem:
    min_{x∈X} E[f(x, ξ)]    (linear w.r.t. the distribution of ξ)
In stochastic composition optimization, the objective is no longer a linear functional of the (v, w) distribution:
    min_{x∈X} E[f_v(E[g_w(x)])]    (nonlinear w.r.t. the distribution of (w, v))
In the classical problem, nice properties come from linearity w.r.t. the data distribution; in stochastic composition optimization, they are all lost.
A little nonlinearity goes a long way.
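The loss of linearity can be seen numerically: for a nonlinear outer function, averaging f over raw inner samples does not estimate f of the inner expectation. A minimal sketch, with f(y) = y^2 and g_w(x) = x + w chosen as assumptions, where E[f(g_w(x))] = x^2 + 1 while f(E[g_w(x)]) = x^2:

```python
import numpy as np

rng = np.random.default_rng(2)

x = 2.0
w = rng.normal(size=1_000_000)           # inner noise, w ~ N(0, 1)
plug_in = np.mean((x + w) ** 2)          # average of f over raw inner samples
true_val = x ** 2                        # f applied to the inner expectation
print(plug_in, true_val)                 # the gap is the Jensen bias, here = 1
```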
Motivating Example: High-Dimensional Nonparametric Estimation
Sparse Additive Model (SpAM):
    y_i = Σ_{j=1}^d h_j(x_{ij}) + ε_i
High-dimensional feature space with relatively few data samples: sample size n ≪ number of features d.
Optimization model for SpAM¹, an instance of min_x E[f_v(E[g_w(x)])]:
    min_{h_j ∈ H_j}  E[(Y − Σ_{j=1}^d h_j(X_j))²] + λ Σ_{j=1}^d √(E[h_j²(X_j)])
The penalty term λ Σ_{j=1}^d √(E[h_j²(X_j)]) induces sparsity in the feature space; the square root of an expectation makes the objective compositional.
¹ P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B, 71(5).
Motivating Example: Risk-Averse Learning
Consider the mean-variance minimization problem
    min_x E_{a,b}[l(x; a, b)] + λ Var_{a,b}[l(x; a, b)].
Its batch version is
    min_x (1/N) Σ_{i=1}^N l(x; a_i, b_i) + (λ/N) Σ_{i=1}^N ( l(x; a_i, b_i) − (1/N) Σ_{j=1}^N l(x; a_j, b_j) )².
- The variance Var[Z] = E[(Z − E[Z])²] is a composition of two expected-value functions.
- Many other risk functions are equivalent to compositions of multiple expected-value functions (Shapiro, Dentcheva, Ruszczyński 2014).
- A central limit theorem for compositions of multiple smooth functions of expectations has been established for risk metrics (Dentcheva, Penev, Ruszczyński 2016).
- There is no good way to optimize a risk-averse objective while learning from online data.
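The compositional structure of the variance can be checked directly: Var[Z] = E[(Z − E[Z])²] = E[Z²] − (E[Z])², i.e. an outer function applied to the inner expectation of (Z, Z²). A small sketch with a made-up loss sample and penalty weight:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in batch of losses l(x; a_i, b_i) (hypothetical data for the demo).
losses = rng.exponential(scale=2.0, size=10000)
lam = 0.5

# Direct batch mean-variance objective.
direct = losses.mean() + lam * losses.var()

# Same objective as f(E[g_w]) with inner map g_w = (l_w, l_w^2)
# and outer f(m1, m2) = m1 + lam * (m2 - m1^2).
m1, m2 = losses.mean(), (losses ** 2).mean()
composed = m1 + lam * (m2 - m1 ** 2)
print(direct, composed)
```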
Motivating Example: Reinforcement Learning
On-policy reinforcement learning is to learn the value-per-state of a stochastic system. We want to solve the (huge) Bellman equation
    γ P^π V^π + r^π = V^π,
where P^π is the transition probability matrix and r^π is the reward vector, both unknown.
On-policy learning aims to solve the Bellman equation via blackbox simulation. It becomes a special stochastic composition optimization problem:
    min_x E[f_v(E[g_w(x)])]   ⟶   min_{x∈R^{|S|}} ‖E[A]x − E[b]‖²,
where E[A] = I − γP^π and E[b] = r^π.
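On a tiny example with known dynamics, the minimizer of ‖E[A]x − E[b]‖² is exactly the value function solving the Bellman equation. The 3-state transition matrix, rewards, and discount below are assumptions for the demo:

```python
import numpy as np

# Hypothetical 3-state on-policy evaluation problem: V = gamma * P V + r.
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, 2.0])

# The composition objective min_x ||E[A] x - E[b]||^2 with
# E[A] = I - gamma * P and E[b] = r is minimized by the value function:
A = np.eye(3) - gamma * P
V = np.linalg.solve(A, r)
residual = np.sum((A @ V - r) ** 2)
print(V, residual)
```

When P and r are only accessible through simulation, E[A] and E[b] cannot be formed, which is where the composition algorithms below come in.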
Part 3. Stochastic Composition Algorithms: Convergence and Sample Complexity
Problem Formulation
    min_{x∈X} { F(x) := E[f_v(E[g_w(x)])] }    (nonlinear w.r.t. the distribution of (w, v))
Sampling Oracle (SO): upon a query (x, y), the oracle returns
- a noisy inner sample g_w(x) and its noisy subgradient ∇g_w(x);
- a noisy outer gradient ∇f_v(y).
Challenges
- The stochastic gradient descent (SGD) method does not work, since an unbiased sample of the gradient ∇g(x_k)^T ∇f(g(x_k)) is not available.
- Fenchel duality does not work except under rare conditions.
- Sample average approximation (SAA) is subject to the curse of dimensionality.
- Sample complexity is unclear.
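The first challenge can be demonstrated numerically: plugging a single noisy inner sample into the chain rule yields a biased gradient estimate whenever ∇f is nonlinear. A sketch under assumed choices g_w(x) = x + w (so g(x) = x) and f(y) = y⁴, for which F(x) = x⁴ and F'(x) = 4x³, while the naive one-sample estimator centers on 4x³ + 12x:

```python
import numpy as np

rng = np.random.default_rng(4)

x = 1.0
w = rng.normal(size=1_000_000)          # inner noise, w ~ N(0, 1)
naive = np.mean(4 * (x + w) ** 3)       # E[ g_w'(x) * f'(g_w(x)) ]
true_grad = 4 * x ** 3                  # F'(x) for F(x) = x^4
print(naive, true_grad)
```

At x = 1 the naive estimator concentrates around 16 rather than the true gradient 4, so plain SGD follows the wrong vector field.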
Basic Idea
Approximate the iteration
    x_{k+1} = Π_X { x_k − α_k ∇g(x_k)^T ∇f(g(x_k)) }
by a quasi-gradient iteration using running estimates of g(x_k).

Algorithm 1: Stochastic Compositional Gradient Descent (SCGD)
Require: x_0 ∈ R^n, y_0 ∈ R^m, the SO, K, stepsizes {α_k}_{k=1}^K and {β_k}_{k=1}^K.
Ensure: {x_k}_{k=1}^K.
for k = 1, ..., K do
    Query the SO and obtain g_{w_k}(x_k), ∇g_{w_k}(x_k), ∇f_{v_k}(y_{k+1}).
    Update by
        y_{k+1} = (1 − β_k) y_k + β_k g_{w_k}(x_k),
        x_{k+1} = Π_X { x_k − α_k ∇g_{w_k}(x_k)^T ∇f_{v_k}(y_{k+1}) }.
end for

Remarks
- Each iteration makes simple updates by interacting with the SO.
- Scalable with large-scale batch data; can process streaming data points online.
- First considered by Ermoliev (1976) as a stochastic approximation method, without rate analysis.
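A minimal SCGD sketch on a toy instance (X = R^n, so the projection is trivial). The inner map g_w(x) = A_w x − b_w with E[A_w] = I, the deterministic outer f(y) = ‖y‖², the noise level, and the stepsize exponents (the strongly convex schedule from the next slide) are all assumptions for the demo; here F(x) = ‖x − b0‖² with minimizer b0.

```python
import numpy as np

rng = np.random.default_rng(5)

b0 = np.array([1.0, -1.0])                     # minimizer of F(x) = ||x - b0||^2

def oracle(x):
    # Noisy inner value g_w(x) = A_w x - b_w and its Jacobian A_w.
    A = np.eye(2) + 0.1 * rng.normal(size=(2, 2))
    b = b0 + 0.1 * rng.normal(size=2)
    return A @ x - b, A

x = np.zeros(2)
y = np.zeros(2)
for k in range(1, 50001):
    alpha, beta = 1.0 / k, k ** (-2.0 / 3.0)   # strongly convex stepsizes
    g_val, g_jac = oracle(x)
    y = (1 - beta) * y + beta * g_val          # running estimate of g(x_k)
    x = x - alpha * g_jac.T @ (2.0 * y)        # quasi-gradient step, grad f = 2y
print(x)
```

The auxiliary sequence y plays the role of the unavailable inner expectation, exactly as in Algorithm 1.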
Sample Complexity (Wang et al., 2016)
Under suitable conditions (inner function nonsmooth, outer function smooth) and X bounded, let the stepsizes be α_k = k^{−3/4}, β_k = k^{−1/2}. Then, for k large enough,
    E[ F( (2/k) Σ_{t=k/2+1}^k x_t ) − F* ] = O(k^{−1/4}).
(Optimal rate, matching the lower bound for stochastic programming.)

Sample Complexity in the Strongly Convex Case (Wang et al., 2016)
Under suitable conditions (inner function nonsmooth, outer function smooth), suppose the compositional function F(·) is strongly convex, and let the stepsizes be
    α_k = 1/k  and  β_k = 1/k^{2/3}.
Then, for k sufficiently large,
    E[ ‖x_k − x*‖² ] = O(k^{−2/3}).
Outline of Analysis
The auxiliary variable y_k keeps a weighted running average of inner samples taken at biased query points x_0, ..., x_k:
    y_{k+1} = Σ_{t=0}^k ( β_t Π_{t'=t+1}^k (1 − β_{t'}) ) g_{w_t}(x_t).
Two entangled stochastic sequences:
    ε_k = ‖y_{k+1} − g(x_k)‖²,   ξ_k = ‖x_k − x*‖².
Coupled supermartingale analysis:
    E[ε_{k+1} | F_k] ≤ (1 − β_k) ε_k + O( ‖x_{k+1} − x_k‖² / β_k ),
    E[ξ_{k+1} | F_k] ≤ (1 + α_k²) ξ_k − α_k (F(x_k) − F*) + O(β_k) ε_k.
- Almost sure convergence follows from a coupled supermartingale convergence theorem (Wang and Bertsekas, 2013).
- Convergence rates follow by optimizing over stepsizes and balancing the noise-bias tradeoff.
Part 4. Acceleration via Smoothing-Extrapolation
Acceleration
When the inner function g(·) is smooth, can the algorithm be accelerated? Yes!

Algorithm 2: Accelerated SCGD
Require: x_0, z_0 ∈ R^n, y_0 ∈ R^m, the SO, K, stepsizes {α_k}_{k=1}^K and {β_k}_{k=1}^K.
Ensure: {x_k}_{k=1}^K.
for k = 1, ..., K do
    Query the SO and obtain ∇f_{v_k}(y_k), ∇g_{w_k}(x_k).
    Update the main iterate by
        x_{k+1} = x_k − α_k ∇g_{w_k}(x_k)^T ∇f_{v_k}(y_k).
    Update the auxiliary variables via extrapolation-smoothing:
        z_{k+1} = (1 − 1/β_k) x_k + (1/β_k) x_{k+1},
        y_{k+1} = (1 − β_k) y_k + β_k g_{w_{k+1}}(z_{k+1}),
    where the sample g_{w_{k+1}}(z_{k+1}) is obtained by querying the SO.
end for

Key to the Acceleration
Bias reduction by averaging over extrapolated points (extrapolation-smoothing): y_k is a weighted average of inner samples at the extrapolated points,
    y_k = Σ_{t=0}^k w_t^{(k)} g_{w_t}(z_t),  with weights w_t^{(k)} = β_t Π_{t'=t+1}^k (1 − β_{t'}),
and the extrapolation is chosen so that Σ_{t=0}^k w_t^{(k)} z_t ≈ x_k, hence y_k ≈ g(x_k) when g is smooth.
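The extrapolation-smoothing scheme can be sketched on the same toy instance used above (assumptions: g_w(x) = A_w x − b_w with E[A_w] = I, deterministic outer f(y) = ‖y‖², made-up noise level, and the strongly convex stepsize schedule), so F(x) = ‖x − b0‖² with minimizer b0. Note z extrapolates past x_{k+1}, since 1/β_k ≥ 1.

```python
import numpy as np

rng = np.random.default_rng(6)

b0 = np.array([1.0, -1.0])

def oracle(z):
    A = np.eye(2) + 0.1 * rng.normal(size=(2, 2))
    b = b0 + 0.1 * rng.normal(size=2)
    return A @ z - b, A

x = np.zeros(2)
y, _ = oracle(x)                                 # seed y with one inner sample
for k in range(1, 50001):
    alpha, beta = 1.0 / k, k ** (-4.0 / 5.0)     # strongly convex stepsizes
    _, g_jac = oracle(x)
    x_new = x - alpha * g_jac.T @ (2.0 * y)      # main iterate, grad f = 2y
    z = (1 - 1 / beta) * x + (1 / beta) * x_new  # extrapolated query point
    g_val, _ = oracle(z)
    y = (1 - beta) * y + beta * g_val            # smoothed inner estimate
    x = x_new
print(x)
```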
Accelerated Sample Complexity (Wang et al., 2016)
Under suitable conditions (inner and outer functions smooth), if the stepsizes are chosen as α_k = k^{−5/7}, β_k = k^{−4/7}, we have
    E[ F(x̂_k) − F* ] = O(k^{−2/7}),  where x̂_k = (2/k) Σ_{t=k/2+1}^k x_t.

Strongly Convex Case (Wang et al., 2016)
Under suitable conditions (inner and outer functions smooth), assume that F is strongly convex. If the stepsizes are chosen as
    α_k = 1/k  and  β_k = 1/k^{4/5},
we have
    E[ ‖x_k − x*‖² ] = O(k^{−4/5}).
Regularized Stochastic Composition Optimization (Wang and Liu, 2016)
    min_{x∈R^n} (E_v f_v ∘ E_w g_w)(x) + R(x)
The penalty term R(x) is convex and nonsmooth.

Algorithm 3: Accelerated Stochastic Compositional Proximal Gradient (ASC-PG)
Require: x_0, z_0 ∈ R^n, y_0 ∈ R^m, the SO, K, stepsizes {α_k}_{k=1}^K and {β_k}_{k=1}^K.
Ensure: {x_k}_{k=1}^K.
for k = 1, ..., K do
    Query the SO and obtain ∇f_{v_k}(y_k), ∇g_{w_k}(x_k).
    Update the main iterate by
        x_{k+1} = prox_{α_k R(·)} ( x_k − α_k ∇g_{w_k}(x_k)^T ∇f_{v_k}(y_k) ).
    Update the auxiliary iterates by the extrapolation-smoothing scheme:
        z_{k+1} = (1 − 1/β_k) x_k + (1/β_k) x_{k+1},
        y_{k+1} = (1 − β_k) y_k + β_k g_{w_{k+1}}(z_{k+1}),
    where the sample g_{w_{k+1}}(z_{k+1}) is obtained by querying the SO.
end for
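For R(x) = λ‖x‖₁ the proximal step is soft-thresholding, which is what makes ASC-PG produce sparse iterates. A sketch on an assumed toy instance (g_w(x) = A_w x − b_w with E[A_w] = I, outer f(y) = ‖y‖², b0 = (1.0, 0.05), λ = 0.2): the objective ‖x − b0‖² + λ‖x‖₁ has the sparse minimizer x* = (0.9, 0).

```python
import numpy as np

rng = np.random.default_rng(7)

b0 = np.array([1.0, 0.05])
lam = 0.2

def prox_l1(u, t):
    # Proximal operator of t * lam * ||.||_1: soft-thresholding at t * lam.
    return np.sign(u) * np.maximum(np.abs(u) - t * lam, 0.0)

def oracle(z):
    A = np.eye(2) + 0.1 * rng.normal(size=(2, 2))
    b = b0 + 0.1 * rng.normal(size=2)
    return A @ z - b, A

x = np.zeros(2)
y, _ = oracle(x)
for k in range(1, 50001):
    alpha, beta = 1.0 / k, k ** (-4.0 / 5.0)
    _, g_jac = oracle(x)
    x_new = prox_l1(x - alpha * g_jac.T @ (2.0 * y), alpha)  # proximal step
    z = (1 - 1 / beta) * x + (1 / beta) * x_new
    g_val, _ = oracle(z)
    y = (1 - beta) * y + beta * g_val
    x = x_new
print(x)
```

The second coordinate is driven to (near) zero by the penalty, while the first settles near b0[0] − λ/2 = 0.9.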
Sample Complexity for Smooth Optimization (Wang and Liu, 2016)
Under suitable conditions (inner and outer functions smooth) and X bounded, with properly chosen stepsizes, for k large enough,
    E[ F( (2/k) Σ_{t=k/2+1}^k x_t ) − F* ] = O(k^{−4/9}).
If either the outer or the inner function is linear, the rate improves to the optimal
    E[ F( (2/k) Σ_{t=k/2+1}^k x_t ) − F* ] = O(k^{−1/2}).

Sample Complexity in the Strongly Convex Case (Wang and Liu, 2016)
Suppose that the compositional function F(·) is strongly convex. With properly chosen stepsizes, for k sufficiently large,
    E[ ‖x_k − x*‖² ] = O(k^{−4/5}).
If either the outer or the inner function is linear, the rate improves to the optimal
    E[ ‖x_k − x*‖² ] = O(k^{−1}).
When there are two nested uncertainties:
- A class of two-timescale algorithms that update using first-order samples.
- Analyzing convergence becomes harder: two coupled stochastic processes, and a smoothness-noise interplay.
- Convergence rates of the stochastic algorithms establish sample complexity upper bounds for the new problem class.

Table: Summary of best known sample complexities
                                            | General Convex | Strongly Convex
Outer Nonsmooth, Inner Smooth               | O(k^{−1/4})    | O(k^{−2/3})?
Outer and Inner Smooth                      | O(k^{−4/9})?   | O(k^{−4/5})?
Special case: min_x E[f(x; ξ)]              | O(k^{−1/2})    | O(k^{−1})
Special case: min_x E[f(E[A]x − E[b]; ξ)]   | O(k^{−1/2})    | O(k^{−1})

Applications and computations
- First scalable algorithm for sparse nonparametric estimation.
- Optimal algorithm for on-policy reinforcement learning.
Summary
Stochastic Composition Optimization - a new and rich problem class:
    min_x E[f_v(E[g_w(x)])]    (nonlinear w.r.t. the distribution of (w, v))
- Applications in risk, data analysis, machine learning, and real-time intelligent systems.
- A class of stochastic compositional gradient methods with convergence guarantees.
- Basic sample complexity being developed:
                                            | General Convex | Strongly Convex
Outer Nonsmooth, Inner Smooth               | O(k^{−1/4})    | O(k^{−2/3})?
Outer and Inner Smooth                      | O(k^{−4/9})?   | O(k^{−4/5})?
Special case: min_x E[f(x; ξ)]              | O(k^{−1/2})    | O(k^{−1})
Special case: min_x E[f(E[A]x − E[b]; ξ)]   | O(k^{−1/2})    | O(k^{−1})
Many open questions remain and more work is needed!
Thank you very much!
More informationNesterov s Acceleration
Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x
More informationStochastic Gradient Descent with Only One Projection
Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationStochastic Optimization Part I: Convex analysis and online stochastic optimization
Stochastic Optimization Part I: Convex analysis and online stochastic optimization Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical
More informationExpanding the reach of optimal methods
Expanding the reach of optimal methods Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with C. Kempton (UW), M. Fazel (UW), A.S. Lewis (Cornell), and S. Roy (UW) BURKAPALOOZA! WCOM
More informationApproximate Dynamic Programming
Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.
More informationProximal Gradient Temporal Difference Learning Algorithms
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16) Proximal Gradient Temporal Difference Learning Algorithms Bo Liu, Ji Liu, Mohammad Ghavamzadeh, Sridhar
More informationORIE 4741: Learning with Big Messy Data. Proximal Gradient Method
ORIE 4741: Learning with Big Messy Data Proximal Gradient Method Professor Udell Operations Research and Information Engineering Cornell November 13, 2017 1 / 31 Announcements Be a TA for CS/ORIE 1380:
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationMATH 680 Fall November 27, Homework 3
MATH 680 Fall 208 November 27, 208 Homework 3 This homework is due on December 9 at :59pm. Provide both pdf, R files. Make an individual R file with proper comments for each sub-problem. Subgradients and
More informationStochastic First-Order Methods with Random Constraint Projection
Stochastic First-Order Methods with Random Constraint Projection The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationARock: an algorithmic framework for asynchronous parallel coordinate updates
ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control
ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Tianyu Wang: tiw161@eng.ucsd.edu Yongxi Lu: yol070@eng.ucsd.edu
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More information1 Sparsity and l 1 relaxation
6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the
More informationLecture 17: October 27
0-725/36-725: Convex Optimiation Fall 205 Lecturer: Ryan Tibshirani Lecture 7: October 27 Scribes: Brandon Amos, Gines Hidalgo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationComposite nonlinear models at scale
Composite nonlinear models at scale Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell) C. Paquette (Lehigh), and S. Roy (UW)
More informationReinforcement Learning as Variational Inference: Two Recent Approaches
Reinforcement Learning as Variational Inference: Two Recent Approaches Rohith Kuditipudi Duke University 11 August 2017 Outline 1 Background 2 Stein Variational Policy Gradient 3 Soft Q-Learning 4 Closing
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationStochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure
Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,
More informationA Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices
A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22
More informationLinear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra
Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra Michael W. Mahoney ICSI and Dept of Statistics, UC Berkeley ( For more info,
More informationNon-Smooth, Non-Finite, and Non-Convex Optimization
Non-Smooth, Non-Finite, and Non-Convex Optimization Deep Learning Summer School Mark Schmidt University of British Columbia August 2015 Complex-Step Derivative Using complex number to compute directional
More informationJ. Sadeghi E. Patelli M. de Angelis
J. Sadeghi E. Patelli Institute for Risk and, Department of Engineering, University of Liverpool, United Kingdom 8th International Workshop on Reliable Computing, Computing with Confidence University of
More informationStatic Parameter Estimation using Kalman Filtering and Proximal Operators Vivak Patel
Static Parameter Estimation using Kalman Filtering and Proximal Operators Viva Patel Department of Statistics, University of Chicago December 2, 2015 Acnowledge Mihai Anitescu Senior Computational Mathematician
More informationAsynchronous Non-Convex Optimization For Separable Problem
Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent
More information