Stochastic Optimization


Lijun Zhang
Nanjing University, China
May 26, 2017

Outline
1 Introduction
2 Related Work
3 Stochastic Gradient Descent: Expectation Bound; High-probability Bound
4 Epoch Gradient Descent: Expectation Bound; High-probability Bound

Definitions [Nemirovski et al., 2009]

min_{w ∈ W} F(w) = E_ξ[f(w, ξ)] = ∫_Ξ f(w, ξ) dP(ξ)

where ξ is a random variable. The challenge: evaluation of the expectation/integration.

Supervised Learning [Vapnik, 1998]

min_{h ∈ H} F(h) = E_{(x,y)}[l(h(x), y)]

where (x, y) is a random instance-label pair, H = {h : X → R} is a hypothesis class, and l(·, ·) is a certain loss, e.g., the hinge loss l(u, v) = max(0, 1 − uv).

Optimization Methods

Sample Average Approximation (SAA) / Empirical Risk Minimization (ERM)

min_{w ∈ W} (1/n) Σ_{i=1}^n f(w, ξ_i),  where ξ_1, ..., ξ_n are i.i.d.

In supervised learning:

min_{h ∈ H} (1/n) Σ_{i=1}^n l(h(x_i), y_i),  where (x_1, y_1), ..., (x_n, y_n) are i.i.d.

Stochastic Approximation (SA)

Optimization via noisy observations of F(·):
- Zero-order SA [Nesterov, 2011]
- First-order SA, e.g., SGD [Kushner and Yin, 2003]
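To make the distinction concrete, here is a minimal Python sketch on hypothetical toy data (the dataset and all variable names are illustrative, not from the slides): SAA/ERM evaluates an average of the hinge loss over a fixed i.i.d. sample, while stochastic approximation only ever sees one noisy draw at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: i.i.d. instance-label pairs (x_i, y_i), y in {-1, +1},
# with linear hypotheses h_w(x) = <w, x>.
n, d = 200, 5
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] > 0, 1.0, -1.0)

def hinge(u, v):
    """Hinge loss l(u, v) = max(0, 1 - u v)."""
    return np.maximum(0.0, 1.0 - u * v)

def empirical_risk(w):
    """SAA / ERM objective: (1/n) sum_i l(h_w(x_i), y_i)."""
    return hinge(X @ w, y).mean()

# SAA replaces the intractable expectation E[l(h_w(x), y)] by a sample average;
# e.g. hinge(0, y) = 1 for every pair, so the empirical risk of w = 0 is exactly 1:
print(empirical_risk(np.zeros(d)))  # 1.0

# Stochastic approximation instead works from one noisy observation at a time:
i = rng.integers(n)
one_sample_loss = hinge(X[i] @ np.zeros(d), y[i])
```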

Performance

Optimization error / excess risk:

F(ŵ) − min_{w ∈ W} F(w) = O(1/√n), O(1/n), O(1/n^2), or O(1/√T), O(1/T), ...

where n is the number of samples in empirical risk minimization and T is the number of stochastic gradients in first-order stochastic approximation.

Related Work

Empirical Risk Minimization: [Shalev-Shwartz et al., 2009], [Koren and Levy, 2015], [Feldman, 2016], [Zhang et al., 2017]

Stochastic Approximation: [Zinkevich, 2003], [Hazan et al., 2007], [Hazan and Kale, 2011], [Rakhlin et al., 2012], [Lan, 2012], [Agarwal et al., 2012], [Shamir and Zhang, 2013], [Mahdavi et al., 2015], [Bach and Moulines, 2013], ...

Supervised Learning: [Vapnik, 1998], [Lee et al., 1996], [Panchenko, 2002], [Bartlett and Mendelson, 2002], [Tsybakov, 2004], [Bartlett et al., 2005], [Sridharan et al., 2009], [Srebro et al., 2010], ...

Risk Bounds of Empirical Risk Minimization

Stochastic convex optimization:
- Lipschitz: O(√(d/n)) [Shalev-Shwartz et al., 2009]
- Smooth: O(d/n) [Zhang et al., 2017]
- Strongly convex: O(1/(λn)) in expectation [Shalev-Shwartz et al., 2009]; O(d/n + 1/(λn)) [Zhang et al., 2017]
- Strongly convex & smooth: O(1/(λn^2)) [Zhang et al., 2017]

Supervised learning:
- Lipschitz: O(1/√n) [Bartlett and Mendelson, 2002]
- Smooth: O(1/n) [Srebro et al., 2010]
- Strongly convex: O(1/(λn)) [Sridharan et al., 2009]
- Strongly convex & smooth: O(1/(λn^2)) [Zhang et al., 2017]

Risk Bounds of Stochastic Approximation

- Lipschitz: O(1/√n) [Zinkevich, 2003]
- Smooth: O(1/n) [Srebro et al., 2010]
- Strongly convex: O(1/(λn)) [Hazan and Kale, 2011]
- Square or logistic loss: O(1/n) [Bach and Moulines, 2013]

Stochastic Gradient Descent

The Algorithm
1: for t = 1, ..., T do
2:   w'_{t+1} = w_t − η_t g_t
3:   w_{t+1} = Π_W(w'_{t+1})
4: end for
5: return w̄_T = (1/T) Σ_{t=1}^T w_t

Requirement: E_t[g_t] = ∇F(w_t), where E_t[·] is the conditional expectation given the randomness up to round t.

Instantiation for stochastic optimization, min_{w ∈ W} F(w) = E_ξ[f(w, ξ)]:
- Sample ξ_t from the underlying distribution
- Set g_t = ∇f(w_t, ξ_t)

Instantiation for supervised learning, min_{h ∈ H} F(h) = E_{(x,y)}[l(h(x), y)]:
- Sample (x_t, y_t) from the underlying distribution
- Set g_t = ∇l(h_t(x_t), y_t)
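As an illustration, here is a minimal runnable sketch of the algorithm above; the quadratic objective, the ball constraint W, and all parameter values are hypothetical choices for which the requirement E_t[g_t] = ∇F(w_t) holds exactly.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy problem: minimize F(w) = E_xi[ 0.5 * ||w - xi||^2 ] over W = {w : ||w||_2 <= R},
# with xi ~ N(mu, I). Then grad f(w, xi) = w - xi, so g_t = w_t - xi_t is unbiased.

def project_ball(w, R):
    """Pi_W: Euclidean projection onto the ball of radius R."""
    norm = np.linalg.norm(w)
    return w if norm <= R else w * (R / norm)

def sgd(mu, R, T, c=1.0):
    w = np.zeros_like(mu)
    eta = c / np.sqrt(T)                     # fixed step size eta = c / sqrt(T)
    avg = np.zeros_like(mu)
    for _ in range(T):
        xi = mu + rng.normal(size=mu.shape)  # sample xi_t
        g = w - xi                           # stochastic gradient g_t
        w = project_ball(w - eta * g, R)     # gradient step + projection
        avg += w
    return avg / T                           # averaged iterate w_bar_T

mu = np.array([3.0, 0.0])                    # unconstrained optimum, outside W
w_bar = sgd(mu, R=1.0, T=5000)
# The constrained optimum is the projection of mu onto W, i.e. (1, 0),
# and w_bar should land close to it.
```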

Stochastic Gradient Descent: Expectation Bound

Analysis I

For any w ∈ W, convexity gives

F(w_t) − F(w) ≤ ⟨∇F(w_t), w_t − w⟩
  = ⟨g_t, w_t − w⟩ + ⟨∇F(w_t) − g_t, w_t − w⟩    (denote the last term by δ_t)
  = (1/η_t) ⟨w_t − w'_{t+1}, w_t − w⟩ + δ_t
  = (1/(2η_t)) (||w_t − w||_2^2 − ||w'_{t+1} − w||_2^2 + ||w_t − w'_{t+1}||_2^2) + δ_t
  = (1/(2η_t)) (||w_t − w||_2^2 − ||w'_{t+1} − w||_2^2) + (η_t/2) ||g_t||_2^2 + δ_t
  ≤ (1/(2η_t)) (||w_t − w||_2^2 − ||w_{t+1} − w||_2^2) + (η_t/2) ||g_t||_2^2 + δ_t

where the last step uses the nonexpansiveness of the projection Π_W.
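The chain of (in)equalities above relies on the elementary identity 2⟨a − b, a − c⟩ = ||a − b||^2 + ||a − c||^2 − ||b − c||^2, applied with a = w_t, b = w'_{t+1}, c = w; spelled out:

```latex
% Polarization identity with a = w_t, b = w'_{t+1}, c = w:
\langle w_t - w'_{t+1},\, w_t - w \rangle
  = \tfrac{1}{2}\bigl( \|w_t - w'_{t+1}\|_2^2 + \|w_t - w\|_2^2 - \|w'_{t+1} - w\|_2^2 \bigr)
% and, since w_t - w'_{t+1} = \eta_t g_t by the update rule,
\|w_t - w'_{t+1}\|_2^2 = \eta_t^2 \, \|g_t\|_2^2 .
```

Dividing by η_t and substituting the second equation into the first recovers the two middle lines of the derivation.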

Analysis II

To simplify the above inequality, we assume

η_t = η,  and  ||g_t||_2 ≤ G.

Then we have

F(w_t) − F(w) ≤ (1/(2η)) (||w_t − w||_2^2 − ||w_{t+1} − w||_2^2) + (η/2) G^2 + δ_t.

By adding the inequalities of all iterations, we have

Σ_{t=1}^T F(w_t) − T F(w) ≤ (1/(2η)) ||w_1 − w||_2^2 + (ηT/2) G^2 + Σ_{t=1}^T δ_t.

Analysis III

By Jensen's inequality, we have

F(w̄_T) − F(w) = F((1/T) Σ_{t=1}^T w_t) − F(w) ≤ (1/T) Σ_{t=1}^T (F(w_t) − F(w))
  ≤ (1/T) [ (1/(2η)) ||w_1 − w||_2^2 + (ηT/2) G^2 + Σ_{t=1}^T δ_t ]
  = (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) Σ_{t=1}^T δ_t.

In summary, we have

F(w̄_T) − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) Σ_{t=1}^T δ_t.

Expectation Bound

Under the condition E_t[g_t] = ∇F(w_t), it is easy to verify

E[δ_t] = E[⟨∇F(w_t) − g_t, w_t − w⟩] = E_{1,...,t−1}[ E_t[⟨∇F(w_t) − g_t, w_t − w⟩] ] = 0.

Thus, by taking expectation over both sides, we have

E[F(w̄_T)] − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2
  = (1/(2√T)) ( (1/c) ||w_1 − w||_2^2 + c G^2 )    with η = c/√T.

Stochastic Gradient Descent: High-probability Bound

Martingale

A sequence of random variables Y_1, Y_2, Y_3, ... is a martingale if, for any time t,

E[|Y_t|] < ∞  and  E[Y_{t+1} | Y_1, ..., Y_t] = Y_t.

An Example: Y_t is a gambler's fortune after t tosses of a fair coin. Let δ_t be a random variable such that δ_t = 1 if the coin comes up heads and δ_t = −1 if it comes up tails. Then Y_t = Σ_{i=1}^t δ_i, and

E[Y_{t+1} | Y_1, ..., Y_t] = E[δ_{t+1} + Y_t | Y_1, ..., Y_t] = Y_t + E[δ_{t+1} | Y_1, ..., Y_t] = Y_t.

Martingale Difference

Suppose Y_1, Y_2, Y_3, ... is a martingale. Then X_t = Y_t − Y_{t−1} is a martingale difference sequence:

E[X_{t+1} | X_1, ..., X_t] = E[Y_{t+1} − Y_t | X_1, ..., X_t] = 0.

Azuma's Inequality for Martingales

Suppose X_1, X_2, ... is a martingale difference sequence and |X_i| ≤ c_i almost surely. Then, for all t > 0,

Pr( Σ_{i=1}^n X_i ≥ t ) ≤ exp( −t^2 / (2 Σ_{i=1}^n c_i^2) )  and  Pr( Σ_{i=1}^n X_i ≤ −t ) ≤ exp( −t^2 / (2 Σ_{i=1}^n c_i^2) ).

Corollary 1: Assume c_i ≤ c. With probability at least 1 − 2δ, we have

| Σ_{i=1}^n X_i | ≤ c √(2n log(1/δ)).
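A quick empirical sanity check (a hypothetical simulation, not part of the slides): for the fair-coin martingale from the earlier example, |X_i| ≤ c_i = 1, so Azuma's inequality gives Pr(Σ X_i ≥ t) ≤ exp(−t^2/(2n)), and the simulated tail frequency should sit below this bound.

```python
import math
import random

random.seed(0)

# Coin-flip martingale differences: X_i = +1 or -1 with equal probability, |X_i| <= 1.
n, trials, t = 100, 20000, 20.0

exceed = 0
for _ in range(trials):
    s = sum(random.choice((-1, 1)) for _ in range(n))
    if s >= t:
        exceed += 1

empirical = exceed / trials                 # estimate of Pr(sum X_i >= t)
bound = math.exp(-t * t / (2 * n))          # Azuma: exp(-20^2 / 200) = e^-2 ~ 0.135
print(empirical <= bound)                   # True: the bound holds (loosely)
```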

High-probability Bound I

Assume ||g_t||_2 ≤ G, and ||x − y||_2 ≤ D for all x, y ∈ W. Under the condition E_t[g_t] = ∇F(w_t), it is easy to verify that δ_t = ⟨∇F(w_t) − g_t, w_t − w⟩ is a martingale difference sequence with

|δ_t| ≤ ||∇F(w_t) − g_t||_2 ||w_t − w||_2 ≤ D (||∇F(w_t)||_2 + ||g_t||_2) ≤ 2GD,

where we use Jensen's inequality:

||∇F(w_t)||_2 = ||E_t[g_t]||_2 ≤ E_t[ ||g_t||_2 ] ≤ G.

High-probability Bound II

From Azuma's inequality, with probability at least 1 − δ,

Σ_{t=1}^T δ_t ≤ 2GD √(2T log(1/δ)).

Thus, with probability at least 1 − δ,

F(w̄_T) − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) Σ_{t=1}^T δ_t
  ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + 2GD √(2 log(1/δ) / T)
  = (1/√T) ( (1/(2c)) ||w_1 − w||_2^2 + (c/2) G^2 + 2√2 GD √(log(1/δ)) )    with η = c/√T
  = O(1/√T).

Strong Convexity

A General Definition: F is λ-strongly convex if for all α ∈ [0, 1] and all w, w' ∈ W,

F(αw + (1 − α)w') ≤ αF(w) + (1 − α)F(w') − (λ/2) α(1 − α) ||w − w'||_2^2.

A More Popular Definition: F is λ-strongly convex if for all w, w' ∈ W,

F(w) + ⟨∇F(w), w' − w⟩ + (λ/2) ||w' − w||_2^2 ≤ F(w').

An Important Property: If F is λ-strongly convex, then

(λ/2) ||w − w*||_2^2 ≤ F(w) − F(w*),  where w* = argmin_{w ∈ W} F(w).
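The important property follows from the second definition together with first-order optimality of w*; a short derivation (a standard argument, not spelled out on the slide):

```latex
% Set w -> w^*, w' -> w in the gradient-based definition:
F(w^{*}) + \langle \nabla F(w^{*}),\, w - w^{*} \rangle + \tfrac{\lambda}{2}\|w - w^{*}\|_2^2 \le F(w)
% First-order optimality of w^* over W:
\langle \nabla F(w^{*}),\, w - w^{*} \rangle \ge 0 \quad \forall\, w \in W
% Combining the two:
\tfrac{\lambda}{2}\|w - w^{*}\|_2^2 \le F(w) - F(w^{*})
```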

Epoch Gradient Descent

The Algorithm [Hazan and Kale, 2011]
1: T_1 = 4, η_1 = 1/λ
2: while Σ_{i=1}^k T_i ≤ T do
3:   for t = 1, ..., T_k do
4:     w^k_{t+1} = Π_W(w^k_t − η_k g^k_t)
5:   end for
6:   w^{k+1}_1 = (1/T_k) Σ_{t=1}^{T_k} w^k_t
7:   η_{k+1} = η_k/2, T_{k+1} = 2T_k
8:   k = k + 1
9: end while

Our Goal: establish an O(1/(λT)) excess risk bound.
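A minimal runnable sketch of Epoch-GD on a toy λ-strongly convex problem; the objective, the constraint set, and the constants are hypothetical choices for illustration, not from [Hazan and Kale, 2011].

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy problem: F(w) = E_xi[ 0.5 * ||w - xi||^2 ], xi ~ N(mu, I), over W = {||w|| <= R}.
# F is 1-strongly convex (lambda = 1) and g = w - xi is an unbiased stochastic gradient.

def project_ball(w, R):
    norm = np.linalg.norm(w)
    return w if norm <= R else w * (R / norm)

def epoch_gd(mu, R, T, lam=1.0):
    w = np.zeros_like(mu)                  # w^1_1
    T_k, eta_k, used = 4, 1.0 / lam, 0     # T_1 = 4, eta_1 = 1/lambda
    while used + T_k <= T:                 # while sum_i T_i <= T
        avg = np.zeros_like(mu)
        for _ in range(T_k):               # inner SGD loop with fixed eta_k
            xi = mu + rng.normal(size=mu.shape)
            w = project_ball(w - eta_k * (w - xi), R)
            avg += w
        w = avg / T_k                      # w^{k+1}_1: average iterate of epoch k
        used += T_k
        eta_k, T_k = eta_k / 2.0, 2 * T_k  # halve the step size, double the epoch
    return w

mu = np.array([0.5, -0.25])                # ||mu|| < 1, so the optimum is w* = mu
w_out = epoch_gd(mu, R=1.0, T=20000)
```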

Epoch Gradient Descent: Expectation Bound

Analysis I

If for every epoch k we have

E[F(w^k_1)] − F(w*) = O(1/(λ T_k)),

then the last solution w^K_1 satisfies

E[F(w^K_1)] − F(w*) = O(1/(λ T_K)) = O(1/(λT)).

So suppose

E[F(w^k_1)] − F(w*) ≤ c G^2/(λ T_k).

From the analysis of SGD, we have

E[F(w^{k+1}_1)] − F(w*) ≤ (1/(2 η_k T_k)) E[ ||w^k_1 − w*||_2^2 ] + (η_k/2) G^2.

Analysis II

Recall that strong convexity gives

||w^k_1 − w*||_2^2 ≤ (2/λ) (F(w^k_1) − F(w*)).

Then we have

E[F(w^{k+1}_1)] − F(w*) ≤ (1/(λ η_k T_k)) E[F(w^k_1) − F(w*)] + (η_k/2) G^2
  = (1/4) E[F(w^k_1) − F(w*)] + 2G^2/(λ T_k)       (using λ η_k T_k = 4)
  ≤ (1/4) (c G^2/(λ T_k)) + 2G^2/(λ T_k)
  = (c G^2/(λ T_{k+1})) (1/2 + 4/c)                (using T_{k+1} = 2 T_k)
  = c G^2/(λ T_{k+1})                              (setting c = 8).

Epoch Gradient Descent: High-probability Bound

What do We Need?

Key inequality in the expectation bound:

E[F(w^{k+1}_1)] − F(w*) ≤ (1/(2 η_k T_k)) E[ ||w^k_1 − w*||_2^2 ] + (η_k/2) G^2.

A high-probability inequality based on Azuma's inequality:

F(w^{k+1}_1) − F(w*) ≤ (1/(2 η_k T_k)) ||w^k_1 − w*||_2^2 + (η_k/2) G^2 + 2GD √(2 log(1/δ) / T_k).

The inequality that we need:

F(w^{k+1}_1) − F(w*) ≤ (1/(2 η_k T_k)) ||w^k_1 − w*||_2^2 + (η_k/2) G^2 + O(1/(λ T_k)).

Concentration Inequality I

Bernstein's Inequality for Martingales: Let X_1, ..., X_n be a bounded martingale difference sequence with respect to the filtration F = (F_i)_{1 ≤ i ≤ n}, with |X_i| ≤ K. Let

S_i = Σ_{j=1}^i X_j

be the associated martingale, and denote the sum of the conditional variances by

Σ_n^2 = Σ_{t=1}^n E[ X_t^2 | F_{t−1} ].

Then, for all constants t, ν > 0,

Pr( max_{i=1,...,n} S_i > t and Σ_n^2 ≤ ν ) ≤ exp( −t^2 / (2(ν + Kt/3)) ).

Concentration Inequality II

Corollary 2: Suppose Σ_n^2 ≤ ν. Then, with probability at least 1 − δ,

S_n ≤ √(2ν log(1/δ)) + (2K/3) log(1/δ).

Note that, by the AM-GM inequality, for any α > 0,

√(2ν log(1/δ)) ≤ (1/(2α)) log(1/δ) + αν.

The Basic Idea

For any w ∈ W, strong convexity gives

F(w_t) − F(w) ≤ ⟨∇F(w_t), w_t − w⟩ − (λ/2) ||w_t − w||_2^2
  = ⟨g_t, w_t − w⟩ + δ_t − (λ/2) ||w_t − w||_2^2,

where δ_t = ⟨∇F(w_t) − g_t, w_t − w⟩. Then, following the same analysis as before, we arrive at

F(w̄_T) − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) ( Σ_{t=1}^T δ_t − (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 ).

We are going to prove

Σ_{t=1}^T δ_t ≤ (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + O( log log T / λ ).

Analysis I

δ_t = ⟨∇F(w_t) − g_t, w_t − w⟩ is a martingale difference sequence with

|δ_t| ≤ ||∇F(w_t) − g_t||_2 ||w_t − w||_2 ≤ D (||∇F(w_t)||_2 + ||g_t||_2) ≤ 2GD,

where D could be an a priori parameter, or 2G/λ when w = w*.

The sum of the conditional variances is upper bounded by

Σ_T^2 = Σ_{t=1}^T E_t[δ_t^2] ≤ Σ_{t=1}^T E_t[ ||∇F(w_t) − g_t||_2^2 ||w_t − w||_2^2 ] ≤ 4G^2 Σ_{t=1}^T ||w_t − w||_2^2.

Analysis II

A straightforward way: since Σ_T^2 ≤ 4G^2 Σ_{t=1}^T ||w_t − w||_2^2, following Bernstein's inequality for martingales, with probability at least 1 − δ we would have

Σ_{t=1}^T δ_t ≤ √( 2 · 4G^2 Σ_{t=1}^T ||w_t − w||_2^2 · log(1/δ) ) + (2/3) · 2GD log(1/δ)
  ≤ (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + (4G^2/λ) log(1/δ) + (4GD/3) log(1/δ).

Analysis III

However, the straightforward way is not valid: Bernstein's inequality requires a deterministic upper bound ν on Σ_T^2, while Σ_{t=1}^T ||w_t − w||_2^2 is random. We need peeling: divide the range Σ_{t=1}^T ||w_t − w||_2^2 ∈ [0, TD^2] into a sequence of intervals.

The initial interval: Σ_{t=1}^T ||w_t − w||_2^2 ∈ [0, D^2/T]. Since

|δ_t| ≤ ||∇F(w_t) − g_t||_2 ||w_t − w||_2 ≤ 2G ||w_t − w||_2,

we have, by the Cauchy-Schwarz inequality,

Σ_{t=1}^T δ_t ≤ 2G Σ_{t=1}^T ||w_t − w||_2 ≤ 2G √( T Σ_{t=1}^T ||w_t − w||_2^2 ) ≤ 2GD.

The rest: Σ_{t=1}^T ||w_t − w||_2^2 ∈ (2^{i−1} D^2/T, 2^i D^2/T], where i = 1, ..., ⌈2 log_2 T⌉.

Analysis IV

Define

A_T = Σ_{t=1}^T ||w_t − w||_2^2 ≤ TD^2.

We are going to bound

Pr[ Σ_{t=1}^T δ_t ≥ 2 √(4G^2 A_T τ) + (2/3) · 2GD τ + 2GD ].

Analysis V

Pr[ Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4GD/3) τ + 2GD ]
  = Pr[ Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4GD/3) τ + 2GD, A_T ≤ D^2/T ]
    + Pr[ Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4GD/3) τ + 2GD, D^2/T < A_T ≤ TD^2 ]
  = Pr[ Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4GD/3) τ + 2GD, Σ_T^2 ≤ 4G^2 A_T, D^2/T < A_T ≤ TD^2 ]
  ≤ Σ_{i=1}^m Pr[ Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4GD/3) τ + 2GD, Σ_T^2 ≤ 4G^2 A_T, (D^2/T) 2^{i−1} < A_T ≤ (D^2/T) 2^i ]
  ≤ Σ_{i=1}^m Pr[ Σ_t δ_t ≥ √( 2 · (4G^2 D^2 2^i / T) · τ ) + (4GD/3) τ, Σ_T^2 ≤ 4G^2 D^2 2^i / T ]
  ≤ m e^{−τ},

where m = ⌈2 log_2 T⌉. (The first term after the first equality vanishes: when A_T ≤ D^2/T, we always have Σ_t δ_t ≤ 2GD.)

Analysis VI

By setting τ = log(m/δ), with probability at least 1 − δ,

Σ_{t=1}^T δ_t ≤ 2 √(4G^2 A_T τ) + (4GD/3) τ + 2GD
  = 2 √( 4G^2 Σ_{t=1}^T ||w_t − w||_2^2 · τ ) + (4GD/3) τ + 2GD
  ≤ (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + (8G^2/λ) τ + (4GD/3) τ + 2GD
  = (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + O( log log T / λ ),

where τ = log(m/δ) = log( 2 log_2 T / δ ) = O(log log T).

Reference I

Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. (2012). Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235-3249.

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems 26, pages 773-781.

Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33(4):1497-1537.

Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482.

Feldman, V. (2016). Generalization of ERM in stochastic convex optimization: The dimension strikes back. arXiv:1608.04414.

Reference II

Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192.

Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: An optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421-436.

Koren, T. and Levy, K. (2015). Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems 28, pages 1477-1485.

Kushner, H. J. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer, second edition.

Lan, G. (2012). An optimal method for stochastic composite optimization. Mathematical Programming, 133:365-397.

Reference III

Lee, W. S., Bartlett, P. L., and Williamson, R. C. (1996). The importance of convexity in learning with squared loss. In Proceedings of the 9th Annual Conference on Computational Learning Theory, pages 140-146.

Mahdavi, M., Zhang, L., and Jin, R. (2015). Lower and upper bounds on the generalization of stochastic exponentially concave optimization. In Proceedings of the 28th Conference on Learning Theory.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609.

Nesterov, Y. (2011). Random gradient-free minimization of convex functions. CORE Discussion Papers.

Panchenko, D. (2002). Some extensions of an inequality of Vapnik and Chervonenkis. Electronic Communications in Probability, 7:55-65.

Reference IV

Rakhlin, A., Shamir, O., and Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, pages 449-456.

Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. (2009). Stochastic convex optimization. In Proceedings of the 22nd Annual Conference on Learning Theory.

Shamir, O. and Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pages 71-79.

Srebro, N., Sridharan, K., and Tewari, A. (2010). Optimistic rates for learning with a smooth loss. arXiv:1009.3896.

Sridharan, K., Shalev-Shwartz, S., and Srebro, N. (2009). Fast rates for regularized objectives. In Advances in Neural Information Processing Systems 21, pages 1545-1552.

Reference V

Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32:135-166.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.

Zhang, L., Yang, T., and Jin, R. (2017). Empirical risk minimization for stochastic convex optimization: O(1/n)- and O(1/n^2)-type of risk bounds. In Proceedings of the 30th Annual Conference on Learning Theory.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928-936.