Stochastic Composition Optimization

Similar documents
Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Accelerating Stochastic Composition Optimization

Convex Optimization Lecture 16

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Stochastic Optimization

Block stochastic gradient update method

Learning with stochastic proximal gradient

Stochastic gradient descent and robustness to ill-conditioning

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Ergodic Subgradient Descent

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization

Beyond stochastic gradient descent for large-scale machine learning

How hard is this function to optimize?

An interior-point stochastic approximation method and an L1-regularized delta rule

Stochastic and online algorithms

Stochastic Optimization Algorithms Beyond SG

Linearly-Convergent Stochastic-Gradient Methods

Stochastic gradient methods for machine learning

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning

ECS289: Scalable Machine Learning

Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Stochastic gradient methods for machine learning

A random perturbation approach to some stochastic approximation algorithms in optimization.

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Stochastic Proximal Gradient Algorithm

Gradient Sliding for Composite Optimization

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

ECE521 lecture 4: 19 January Optimization, MLE, regularization

Stochastic Gradient Descent. CS 584: Big Data Analytics

Incremental Constraint Projection-Proximal Methods for Nonsmooth Convex Optimization

Fast Stochastic Optimization Algorithms for ML

Ad Placement Strategies

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4

Stochastic Subgradient Method

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Trade-Offs in Distributed Learning and Optimization

Stochastic Gradient Descent in Continuous Time

CPSC 540: Machine Learning

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Optimization methods

Accelerating Stochastic Optimization

Conditional Gradient (Frank-Wolfe) Method

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Basis Adaptation for Sparse Nonlinear Reinforcement Learning

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Proximal Methods for Optimization with Spasity-inducing Norms

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Statistical Sparse Online Regression: A Diffusion Approximation Perspective

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Optimization for Machine Learning

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization

arxiv: v1 [stat.ml] 12 Nov 2015

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Linear Convergence under the Polyak-Łojasiewicz Inequality

Approximate Dynamic Programming

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Large-scale Stochastic Optimization

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo

On the fast convergence of random perturbations of the gradient flow.

Lasso: Algorithms and Extensions

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Accelerated Proximal Gradient Methods for Convex Optimization

Nesterov s Acceleration

Stochastic Gradient Descent with Only One Projection

Stochastic Analogues to Deterministic Optimizers

Stochastic Optimization Part I: Convex analysis and online stochastic optimization

Expanding the reach of optimal methods

Approximate Dynamic Programming

Proximal Gradient Temporal Difference Learning Algorithms

ORIE 4741: Learning with Big Messy Data. Proximal Gradient Method

Linear classifiers: Overfitting and regularization

MATH 680 Fall November 27, Homework 3

Stochastic First-Order Methods with Random Constraint Projection

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Ad Placement Strategies

ARock: an algorithmic framework for asynchronous parallel coordinate updates

Big Data Analytics: Optimization and Randomization

ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control

STA141C: Big Data & High Performance Statistical Computing

1 Sparsity and l 1 relaxation

Lecture 17: October 27

Composite nonlinear models at scale

Reinforcement Learning as Variational Inference: Two Recent Approaches

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices

Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra

Non-Smooth, Non-Finite, and Non-Convex Optimization

J. Sadeghi E. Patelli M. de Angelis

Static Parameter Estimation using Kalman Filtering and Proximal Operators Vivak Patel

Asynchronous Non-Convex Optimization For Separable Problem

Transcription:

Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24

Collaborators M. Wang, X. Fang, and H. Liu. Stochastic Compositional Gradient Descent: Algorithms for Minimizing Compositions of Expected-Value Functions. Mathematical Programming, Submitted in 2014, to appear in 2016. M. Wang and J. Liu. Accelerating Stochastic Composition Optimization. 2016. M. Wang and J. Liu. A Stochastic Compositional Subgradient Method Using Markov Samples. 2016. 2 / 24

Outline 1 Background: Why is SGD a good method? 2 A New Problem: Stochastic Composition Optimization 3 Stochastic Composition Algorithms: Convergence and Sample Complexity 4 Acceleration via Smoothing-Extrapolation 3 / 24

Background: Why is SGD a good method? Outline 1 Background: Why is SGD a good method? 2 A New Problem: Stochastic Composition Optimization 3 Stochastic Composition Algorithms: Convergence and Sample Complexity 4 Acceleration via Smoothing-Extrapolation 4 / 24

Background: Why is SGD a good method? Background Machine learning is optimization Learning from batch data Learning from online data min x R d 1 n n i=1 l(x; A i, b i ) + ρ(x) min x R d E A,b [l(x; A, b)] + ρ(x) Both problems can be formulated as Stochastic Convex Optimization min E [f (x, ξ)] x }{{} expectation over batch data set or unknown distribution A general framework encompasses likelihood estimation, online learning, empirical risk minimization, multi-arm bandit, online MDP Stochastic gradient descent (SGD) updates by taking sample gradients: x k+1 = x k α f (x k, ξ k ) A special case of stochastic approximation with a long history (Robbins and Monro, Kushner and Yin, Polyak and Juditsky, Benveniste et al., Ruszcyǹski, Borkar, Bertsekas and Tsitsiklis, and many) 5 / 24

Background: Why is SGD a good method? Background: Stochastic first-order methods Stochastic gradient descent (SGD) updates by taking sample gradients: x k+1 = x k α f (x k, ξ k ) (1,410,000 results on Google Scholar Search and 24,400 since 2016!) Why is SGD a good method in practice? When processing either batch or online data, a scalable algorithm needs to update using partial information (a small subset of all data) Answer: We have no other choice Why is SGD a good method beyond practical reasons? SGD achieves optimal convergence after processing k samples: E [F (x k ) F ] = O(1/ k) for convex minimization E [F (x k ) F ] = O(1/k) for strongly convex minimization (Nemirovski and Yudin 1983, Agarwal et al. 2012, Rakhlin et al. 2012, Ghadimi and Lan 2012,2013, Shamir and Zhang 2013 and many more) Beyond convexity: nearly optimal online PCA (Li, Wang, Liu, Zhang 2015) Answer: Strong theoretical guarantees for data-driven problems 6 / 24

A New Problem: Stochastic Composition Optimization Outline 1 Background: Why is SGD a good method? 2 A New Problem: Stochastic Composition Optimization 3 Stochastic Composition Algorithms: Convergence and Sample Complexity 4 Acceleration via Smoothing-Extrapolation 7 / 24

A New Problem: Stochastic Composition Optimization Stochastic Composition Optimization Consider the problem { } min x X F (x) := (f g)(x) = f (g(x)), where the outer and inner functions are f : R m R, g : R n R m and X is a closed and convex set in R n. f (y) = E [f v (y)], g(x) = E [g w (x)], We focus on the case where the overall problem is convex (for now) No structural assumptions on f, g (nonconvex/nonmonotone/nondifferentiable) We may not know the distribution of v, w. 8 / 24

A New Problem: Stochastic Composition Optimization Expectation Minimization vs. Stochastic Composition Optimization Recall the classical problem: min x X E [f (x, ξ)] }{{} linear w.r.t the distribution of ξ In stochastic composition optimization, the objective is no longer a linear functional of the (v, w) distribution: min E[f v (E [g w (x)])] x X }{{} nonlinear w.r.t the distribution of (w,v) In the classical problem, nice properties come from linearity w.r.t. data distribution In stochastic composition optimization, they are all lost A little nonlinearity takes a long way to go. 9 / 24

A New Problem: Stochastic Composition Optimization Motivating Example: High-Dimensional Nonparametric Estimation Sparse Additive Model (SpAM): d y i = h j (x ij ) + ɛ i. j=1 High-dimensional feature space with relatively few data samples: Sample size n Features!d Features d Sample size!n n d n d Optimization model for SpAM 1 : [ min E[f v (E [g w (x)])] x min E Y h j H j d ] 2 d h j (X j ) + λ E [ hj 2(X j ) ] The term λ d j=1 E [ hj 2(X j ) ] induces sparsity in the feature space. j=1 j=1 1 P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B, 71(5):1009-1030, 2009. 10 / 24

A New Problem: Stochastic Composition Optimization Motivating Example: Risk-Averse Learning Consider the mean-variance minimization problem Its batch version is 1 min x N min x E a,b [l(x; a, b)] + λvar a,b [l(x; a, b)], ( 2 N l(x; a i, b i ) + λ N l(x; a i, b i ) 1 N l(x; a i, b i )). N N i=1 i=1 i=1 The variance Var[Z] = E [ (Z E[Z]) 2] is a composition between two functions Many other risk functions are equivalent to compositions of multiple expected-value functions (Shapiro, Dentcheva, Ruszcyǹski 2014) A central limit theorem for composite of multiple smooth functions has been established for risk metrics (Dentcheva, Penev, Ruszcyǹski 2016) No good way to optimize a risk-averse objective while learning from online data 11 / 24

A New Problem: Stochastic Composition Optimization Motivating Example: Reinforcement Learning On-policy reinforcement learning is to learn the value-per-state of a stochastic system. We want to solve a (huge) Bellman equations γp π V π + r π = V π, where P π is transition prob. matrix and r π are rewards, both unknown. On-policy learning aims to solve Bellman equation via blackbox simulation. It becomes a special stochastic composition optimization problem: min x E[f v (E [g w (x)])] min x R S E[A]x E[b] 2, where E[A] = I γp π and E[b] = r π. 12 / 24

Stochastic Composition Algorithms: Convergence and Sample Complexity Outline 1 Background: Why is SGD a good method? 2 A New Problem: Stochastic Composition Optimization 3 Stochastic Composition Algorithms: Convergence and Sample Complexity 4 Acceleration via Smoothing-Extrapolation 13 / 24

Stochastic Composition Algorithms: Convergence and Sample Complexity Problem Formulation min x X { F (x) := E[f v (E [g w (x)])] }{{} nonlinear w.r.t the distribution of (w,v) Sampling Oracle (SO) Upon query (x, y), the oracle returns: Noisy inner sample g w (x) and its noisy subgradient g w (x); Noisy outer gradient f v (y) Challenges Stochastic gradient descent (SGD) method does not work since an unbiased sample of the gradient g(x k ) f (g(x k )) is not available. Fenchel dual does not work except for rare conditions Sample average approximation (SAA) subject to curse of dimensionality. Sample complexity unclear }, 14 / 24

Stochastic Composition Algorithms: Convergence and Sample Complexity Basic Idea To approximate x k+1 = Π X { xk α k g(x k ) f (g(x k )) }, by a quasi-gradient iteration using estimates of g(x k ) Algorithm 1: Stochastic Compositional Gradient Descent (SCGD) Require: x 0, z 0 R n, y 0 R m, SO, K, stepsizes {α k } K k=1, and {β k} K k=1. Ensure: {x k } K k=1 for k = 1,, K do Query the SO and obtain g wk (x k ), g wk (x k ), f vk (y k+1 ) Update by y k+1 = (1 β k )y k + β k g wk (x k ), end for Remarks x k+1 = Π X { xk α k g wk (x k ) f vk (y k+1 ) }, Each iteration makes simple updates by interacting with SO Scalable with large-scale batch data and can process streaming data points online Considered for the first time by (Ermoliev 1976) as a stochastic approximation method without rate analysis 15 / 24

Stochastic Composition Algorithms: Convergence and Sample Complexity Sample Complexity (Wang et al., 2016) Under suitable conditions (inner function nonsmooth, outer function smooth), and X is bounded, let the stepsizes be α k = k 3/4, β k = k 1/2, we have that, if k is large enough, E F 2 k ( x t F 1 ) = O k k 1/4. t=k/2+1 (Optimal rate which matches the lowerbound for stochastic programming) Sample Convexity in Strongly Convex Case (Wang et al., 2016) Under suitable conditions (inner function nonsmooth, outer function smooth), suppose that the compositional function F ( ) is strongly convex, let the stepsizes be we have, if k is sufficiently large, α k = 1 k, and β k = 1 k 2/3, E [ x k x 2] ( 1 ) = O k 2/3. 16 / 24

Stochastic Composition Algorithms: Convergence and Sample Complexity Outline of Analysis The auxilary variable y k is taking a running estimate of g(x k ) at biased query points x 0,..., x k : k y k+1 = (Π k t =t β t )gwt (xt). t=0 Two entangled stochastic sequences {ɛ k = y k+1 g(x k ) 2 }, {ξ k = x k x 2 }. Coupled supermatingale analysis E [ɛ k+1 F k ] (1 β k )ɛ k + O(βk 2 + 1 x k+1 x k 2 ) β k E [ξ k+1 F k ] (1 + α 2 k )ξ k α k (F (x k ) F ) + O(β k )ɛ k Almost sure convergence by using a Coupled Supermartingale Convergence Theorem (Wang and Bertsekas, 2013) Convergence rate analysis via optimizing over stepsizes and balancing noise-bias tradeoff 17 / 24

Acceleration via Smoothing-Extrapolation Outline 1 Background: Why is SGD a good method? 2 A New Problem: Stochastic Composition Optimization 3 Stochastic Composition Algorithms: Convergence and Sample Complexity 4 Acceleration via Smoothing-Extrapolation 18 / 24

Acceleration via Smoothing-Extrapolation Acceleration When the function g( ) is smooth, can the algorithms be accelerated? Yes! Algorithm 2: Accelerated SCGD Require: x 0, z 0 R n, y 0 R m, SO, K, stepsizes {α k } K k=1, and {β k } K k=1. Ensure: {x k } K k=1 for k = 1,, K do Query the SO and obtain f vk (y k ), g wk (z k ). Update by x k+1 = x k α k g w k (x k ) f vk (y k ). Update auxiliary variables via extrapolation-smoothing: ) z k+1 = (1 1βk x k + 1 x k+1, β k y k+1 = (1 β k )y k + β k g wk+1 (z k+1 ), where the sample g wk+1 (z k+1 ) is obtained via querying the SO. end for Key to the Acceleration Bias reduction by averaging over extrapolated points (extrapolation-smoothing) y k = k ( k (Π k t =t β t )gwt (z t) g(x k ) = g (Π k t =t β t ). )zt t=0 t=0 19 / 24

Acceleration via Smoothing-Extrapolation Accelerated Sample Complexity (Wang et al. 2016) Under suitable conditions (inner function smooth, outer function smooth), if the stepsizes are chosen as α k = k 5/7, β k = k 4/7, we have, where ˆx k = 2 k k t=k/2+1 xt. [ E F (ˆx k ) F ] ( 1 ) = O k 2/7, Strongly Convex Case (Wang et al. 2016) Under suitable conditions (inner function smooth, outer function smooth),, and we assume that F is strongly-convex. If the stepsizes are chosen as we have, α k = 1 k, and β k = 1 k 4/5, ( 1 ) E[ x k x 2 ] = O k 4/5. 20 / 24

Acceleration via Smoothing-Extrapolation Regularized Stochastic Composition Optimization(Wang and Liu 2016) min x R n The penalty term R(x) is convex and nonsmooth. (E v f v E w g w )(x) + R(x) Algorithm 3: Accelerated Stochastic Compositional Proximal Gradient (ASC-PG) Require: x 0, z 0 R n, y 0 R m, SO, K, stepsizes {α k } K k=1, and {β k} K k=1. Ensure: {x k } K k=1 for k = 1,, K do Query the SO and obtain f vk (y k ), g wk (z k ). Update the main iterate by ( ) x k+1 = prox αk R( ) x k α k gw k (x k ) f vk (y k ). Update auxillary iterates by an extrapolation-smoothing scheme: ) z k+1 = (1 1βk x k + 1 x k+1, β k y k+1 = (1 β k )y k + β k g wk+1 (z k+1 ), where the sample g wk+1 (z k+1 ) is obtained via querying the SO. end for 21 / 24

Acceleration via Smoothing-Extrapolation Sample Complexity for Smooth Optimization (Wang and Liu 2016) Under suitable conditions (inner function nonsmooth, outer function smooth), and X is bounded, let the stepsizes be chosen properly, we have that, if k is large enough, E F 2 k ( x t F 1 ) = O k k 4/9, t=k/2+1 If either the outer or the inner function is linear, the rate improves to be optimal: E F 2 k ( x t F 1 ) = O k k 1/2, t=k/2+1 Sample Convexity in Strongly Convex Case (Wang and Liu 2016) Suppose that the compositional function F ( ) is strongly convex, let the stepsizes be chosen properly, we have, if k is sufficiently large, E [ x k x 2] ( 1 ) = O k 4/5. If either the outer or the inner function is linear, the rate improves to be optimal: E [ x k x 2] ( 1 ) = O k 1. 22 / 24

Acceleration via Smoothing-Extrapolation When there are two nested uncertainties: A class of two-timescale algorithms that update using first-order samples Analyzing convergence becomes harder: two coupled stochastic processes and smoothness-noise interplay Convergence rates of stochastic algorithms establish sample complexity upper bounds for the new problem class General Convex Strongly Convex Outer Nonsmooth, Inner Smooth O ( k 1/4) O ( k 2/3)? Outer and Inner Smooth O ( k 4/9)? O ( k 4/5)? Special Case: min x E [f (x; ξ)] O ( k 1/2) O ( k 1) Special Case: min x E [f (E[A]x E[b]; ξ)] O ( k 1/2) O ( k 1) Applications and computations Table : Summary of best known sample complexities First scalable algorithm for sparse nonparametric estimation Optimal algorithm for on-policy reinforcement learning 23 / 24

Acceleration via Smoothing-Extrapolation Summary Stochastic Composition Optimization - a new and rich problem class min E[f v (E [g w (x)])] x }{{} nonlinear w.r.t the distribution of (w,v) Applications in risk, data analysis, machine learning, real-time intelligent systems A class of stochastic compositional gradient methods with convergence guarantees. Basic sample complexity being developed: Outer Nonsmooth, Inner Smooth Outer and Inner Smooth Special Case: minx E [f (x; ξ)] Special Case: minx E [f (E[A]x E[b]; ξ)] General/Convex ( O k 1/4) ( O k 4/9)? ( O k 1/2) ( O k 1/2) Strongly Convex ( O k 2/3)? ( O k 4/5)? O (k 1) O (k 1) Many open questions remaining and more works needed! Thank you very much! 24 / 24