Stochastic Optimization


Lijun Zhang
Nanjing University, China
May 26, 2017

Outline
1 Introduction
2 Related Work
3 Stochastic Gradient Descent: Expectation Bound; High-probability Bound
4 Epoch Gradient Descent: Expectation Bound; High-probability Bound

Definitions [Nemirovski et al., 2009]

min_{w ∈ W} F(w) = E_ξ[f(w, ξ)] = ∫_Ξ f(w, ξ) dP(ξ)

where ξ is a random variable. The challenge: evaluation of the expectation/integration.

Supervised Learning [Vapnik, 1998]

min_{h ∈ H} F(h) = E_{(x,y)}[l(h(x), y)]

where (x, y) is a random instance-label pair, H = {h : X → R} is a hypothesis class, and l(·, ·) is a certain loss, e.g., the hinge loss l(u, v) = max(0, 1 − uv).

Optimization Methods

Sample Average Approximation (SAA) / Empirical Risk Minimization (ERM)

min_{w ∈ W} (1/n) Σ_{i=1}^n f(w, ξ_i),  where ξ_1, ..., ξ_n are i.i.d.

In supervised learning:

min_{h ∈ H} (1/n) Σ_{i=1}^n l(h(x_i), y_i),  where (x_1, y_1), ..., (x_n, y_n) are i.i.d.

Stochastic Approximation (SA)

Optimization via noisy observations of F(·):
- Zero-order SA [Nesterov, 2011]
- First-order SA, e.g., SGD [Kushner and Yin, 2003]
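To make the distinction concrete, here is a minimal Python sketch on hypothetical toy data (the dataset and all variable names are illustrative, not from the slides): SAA/ERM evaluates an average of the hinge loss over a fixed i.i.d. sample, while stochastic approximation only ever sees one noisy draw at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: i.i.d. instance-label pairs (x_i, y_i), y in {-1, +1},
# with linear hypotheses h_w(x) = <w, x>.
n, d = 200, 5
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] > 0, 1.0, -1.0)

def hinge(u, v):
    """Hinge loss l(u, v) = max(0, 1 - u v)."""
    return np.maximum(0.0, 1.0 - u * v)

def empirical_risk(w):
    """SAA / ERM objective: (1/n) sum_i l(h_w(x_i), y_i)."""
    return hinge(X @ w, y).mean()

# SAA replaces the intractable expectation E[l(h_w(x), y)] by a sample average;
# e.g. hinge(0, y) = 1 for every pair, so the empirical risk of w = 0 is exactly 1:
print(empirical_risk(np.zeros(d)))  # 1.0

# Stochastic approximation instead works from one noisy observation at a time:
i = rng.integers(n)
one_sample_loss = hinge(X[i] @ np.zeros(d), y[i])
```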

Performance

Optimization error / excess risk:

F(ŵ) − min_{w ∈ W} F(w) = O(1/√n), O(1/n), O(1/n^2), or O(1/√T), O(1/T), ...

where n is the number of samples in empirical risk minimization and T is the number of stochastic gradients in first-order stochastic approximation.

Related Work

Empirical Risk Minimization: [Shalev-Shwartz et al., 2009], [Koren and Levy, 2015], [Feldman, 2016], [Zhang et al., 2017]

Stochastic Approximation: [Zinkevich, 2003], [Hazan et al., 2007], [Hazan and Kale, 2011], [Rakhlin et al., 2012], [Lan, 2012], [Agarwal et al., 2012], [Shamir and Zhang, 2013], [Mahdavi et al., 2015], [Bach and Moulines, 2013], ...

Supervised Learning: [Vapnik, 1998], [Lee et al., 1996], [Panchenko, 2002], [Bartlett and Mendelson, 2002], [Tsybakov, 2004], [Bartlett et al., 2005], [Sridharan et al., 2009], [Srebro et al., 2010], ...

Risk Bounds of Empirical Risk Minimization

Stochastic convex optimization:
- Lipschitz: O(√(d/n)) [Shalev-Shwartz et al., 2009]
- Smooth: O(d/n) [Zhang et al., 2017]
- Strongly convex: O(1/(λn)) in expectation [Shalev-Shwartz et al., 2009]; O(d/n + 1/(λn)) [Zhang et al., 2017]
- Strongly convex & smooth: O(1/(λn^2)) [Zhang et al., 2017]

Supervised learning:
- Lipschitz: O(1/√n) [Bartlett and Mendelson, 2002]
- Smooth: O(1/n) [Srebro et al., 2010]
- Strongly convex: O(1/(λn)) [Sridharan et al., 2009]
- Strongly convex & smooth: O(1/(λn^2)) [Zhang et al., 2017]

Risk Bounds of Stochastic Approximation

- Lipschitz: O(1/√n) [Zinkevich, 2003]
- Smooth: O(1/n) [Srebro et al., 2010]
- Strongly convex: O(1/(λn)) [Hazan and Kale, 2011]
- Square or logistic loss: O(1/n) [Bach and Moulines, 2013]

Stochastic Gradient Descent

The Algorithm
1: for t = 1, ..., T do
2:   w'_{t+1} = w_t − η_t g_t
3:   w_{t+1} = Π_W(w'_{t+1})
4: end for
5: return w̄_T = (1/T) Σ_{t=1}^T w_t

Requirement: E_t[g_t] = ∇F(w_t), where E_t[·] is the conditional expectation given the randomness up to round t.

Instantiation for stochastic optimization, min_{w ∈ W} F(w) = E_ξ[f(w, ξ)]:
- Sample ξ_t from the underlying distribution
- Set g_t = ∇f(w_t, ξ_t)

Instantiation for supervised learning, min_{h ∈ H} F(h) = E_{(x,y)}[l(h(x), y)]:
- Sample (x_t, y_t) from the underlying distribution
- Set g_t = ∇l(h_t(x_t), y_t)
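As an illustration, here is a minimal runnable sketch of the algorithm above; the quadratic objective, the ball constraint W, and all parameter values are hypothetical choices for which the requirement E_t[g_t] = ∇F(w_t) holds exactly.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy problem: minimize F(w) = E_xi[ 0.5 * ||w - xi||^2 ] over W = {w : ||w||_2 <= R},
# with xi ~ N(mu, I). Then grad f(w, xi) = w - xi, so g_t = w_t - xi_t is unbiased.

def project_ball(w, R):
    """Pi_W: Euclidean projection onto the ball of radius R."""
    norm = np.linalg.norm(w)
    return w if norm <= R else w * (R / norm)

def sgd(mu, R, T, c=1.0):
    w = np.zeros_like(mu)
    eta = c / np.sqrt(T)                     # fixed step size eta = c / sqrt(T)
    avg = np.zeros_like(mu)
    for _ in range(T):
        xi = mu + rng.normal(size=mu.shape)  # sample xi_t
        g = w - xi                           # stochastic gradient g_t
        w = project_ball(w - eta * g, R)     # gradient step + projection
        avg += w
    return avg / T                           # averaged iterate w_bar_T

mu = np.array([3.0, 0.0])                    # unconstrained optimum, outside W
w_bar = sgd(mu, R=1.0, T=5000)
# The constrained optimum is the projection of mu onto W, i.e. (1, 0),
# and w_bar should land close to it.
```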

Stochastic Gradient Descent: Expectation Bound

Analysis I

For any w ∈ W, convexity gives

F(w_t) − F(w) ≤ ⟨∇F(w_t), w_t − w⟩
  = ⟨g_t, w_t − w⟩ + ⟨∇F(w_t) − g_t, w_t − w⟩    (denote the last term by δ_t)
  = (1/η_t) ⟨w_t − w'_{t+1}, w_t − w⟩ + δ_t
  = (1/(2η_t)) (||w_t − w||_2^2 − ||w'_{t+1} − w||_2^2 + ||w_t − w'_{t+1}||_2^2) + δ_t
  = (1/(2η_t)) (||w_t − w||_2^2 − ||w'_{t+1} − w||_2^2) + (η_t/2) ||g_t||_2^2 + δ_t
  ≤ (1/(2η_t)) (||w_t − w||_2^2 − ||w_{t+1} − w||_2^2) + (η_t/2) ||g_t||_2^2 + δ_t

where the last step uses the nonexpansiveness of the projection Π_W.
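The chain of (in)equalities above relies on the elementary identity 2⟨a − b, a − c⟩ = ||a − b||^2 + ||a − c||^2 − ||b − c||^2, applied with a = w_t, b = w'_{t+1}, c = w; spelled out:

```latex
% Polarization identity with a = w_t, b = w'_{t+1}, c = w:
\langle w_t - w'_{t+1},\, w_t - w \rangle
  = \tfrac{1}{2}\bigl( \|w_t - w'_{t+1}\|_2^2 + \|w_t - w\|_2^2 - \|w'_{t+1} - w\|_2^2 \bigr)
% and, since w_t - w'_{t+1} = \eta_t g_t by the update rule,
\|w_t - w'_{t+1}\|_2^2 = \eta_t^2 \, \|g_t\|_2^2 .
```

Dividing by η_t and substituting the second equation into the first recovers the two middle lines of the derivation.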

Analysis II

To simplify the above inequality, we assume

η_t = η,  and  ||g_t||_2 ≤ G.

Then we have

F(w_t) − F(w) ≤ (1/(2η)) (||w_t − w||_2^2 − ||w_{t+1} − w||_2^2) + (η/2) G^2 + δ_t.

By adding the inequalities of all iterations, we have

Σ_{t=1}^T F(w_t) − T F(w) ≤ (1/(2η)) ||w_1 − w||_2^2 + (ηT/2) G^2 + Σ_{t=1}^T δ_t.

Analysis III

By Jensen's inequality, we have

F(w̄_T) − F(w) = F((1/T) Σ_{t=1}^T w_t) − F(w) ≤ (1/T) Σ_{t=1}^T (F(w_t) − F(w))
  ≤ (1/T) [ (1/(2η)) ||w_1 − w||_2^2 + (ηT/2) G^2 + Σ_{t=1}^T δ_t ]
  = (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) Σ_{t=1}^T δ_t.

In summary, we have

F(w̄_T) − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) Σ_{t=1}^T δ_t.

Expectation Bound

Under the condition E_t[g_t] = ∇F(w_t), it is easy to verify

E[δ_t] = E[⟨∇F(w_t) − g_t, w_t − w⟩] = E_{1,...,t−1}[ E_t[⟨∇F(w_t) − g_t, w_t − w⟩] ] = 0.

Thus, by taking expectation over both sides, we have

E[F(w̄_T)] − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2
  = (1/(2√T)) ( (1/c) ||w_1 − w||_2^2 + c G^2 )    with η = c/√T.

Stochastic Gradient Descent: High-probability Bound

Martingale

A sequence of random variables Y_1, Y_2, Y_3, ... is a martingale if, for any time t,

E[|Y_t|] < ∞  and  E[Y_{t+1} | Y_1, ..., Y_t] = Y_t.

An Example: Y_t is a gambler's fortune after t tosses of a fair coin. Let δ_t be a random variable such that δ_t = 1 if the coin comes up heads and δ_t = −1 if it comes up tails. Then Y_t = Σ_{i=1}^t δ_i, and

E[Y_{t+1} | Y_1, ..., Y_t] = E[δ_{t+1} + Y_t | Y_1, ..., Y_t] = Y_t + E[δ_{t+1} | Y_1, ..., Y_t] = Y_t.

Martingale Difference

Suppose Y_1, Y_2, Y_3, ... is a martingale. Then X_t = Y_t − Y_{t−1} is a martingale difference sequence:

E[X_{t+1} | X_1, ..., X_t] = E[Y_{t+1} − Y_t | X_1, ..., X_t] = 0.

Azuma's Inequality for Martingales

Suppose X_1, X_2, ... is a martingale difference sequence and |X_i| ≤ c_i almost surely. Then, for all t > 0,

Pr( Σ_{i=1}^n X_i ≥ t ) ≤ exp( −t^2 / (2 Σ_{i=1}^n c_i^2) )  and  Pr( Σ_{i=1}^n X_i ≤ −t ) ≤ exp( −t^2 / (2 Σ_{i=1}^n c_i^2) ).

Corollary 1: Assume c_i ≤ c. With probability at least 1 − 2δ, we have

| Σ_{i=1}^n X_i | ≤ c √(2n log(1/δ)).
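A quick empirical sanity check (a hypothetical simulation, not part of the slides): for the fair-coin martingale from the earlier example, |X_i| ≤ c_i = 1, so Azuma's inequality gives Pr(Σ X_i ≥ t) ≤ exp(−t^2/(2n)), and the simulated tail frequency should sit below this bound.

```python
import math
import random

random.seed(0)

# Coin-flip martingale differences: X_i = +1 or -1 with equal probability, |X_i| <= 1.
n, trials, t = 100, 20000, 20.0

exceed = 0
for _ in range(trials):
    s = sum(random.choice((-1, 1)) for _ in range(n))
    if s >= t:
        exceed += 1

empirical = exceed / trials                 # estimate of Pr(sum X_i >= t)
bound = math.exp(-t * t / (2 * n))          # Azuma: exp(-20^2 / 200) = e^-2 ~ 0.135
print(empirical <= bound)                   # True: the bound holds (loosely)
```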

High-probability Bound I

Assume ||g_t||_2 ≤ G, and ||x − y||_2 ≤ D for all x, y ∈ W. Under the condition E_t[g_t] = ∇F(w_t), it is easy to verify that δ_t = ⟨∇F(w_t) − g_t, w_t − w⟩ is a martingale difference sequence with

|δ_t| ≤ ||∇F(w_t) − g_t||_2 ||w_t − w||_2 ≤ D (||∇F(w_t)||_2 + ||g_t||_2) ≤ 2GD,

where we use Jensen's inequality:

||∇F(w_t)||_2 = ||E_t[g_t]||_2 ≤ E_t[ ||g_t||_2 ] ≤ G.

High-probability Bound II

From Azuma's inequality, with probability at least 1 − δ,

Σ_{t=1}^T δ_t ≤ 2GD √(2T log(1/δ)).

Thus, with probability at least 1 − δ,

F(w̄_T) − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) Σ_{t=1}^T δ_t
  ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + 2GD √(2 log(1/δ) / T)
  = (1/√T) ( (1/(2c)) ||w_1 − w||_2^2 + (c/2) G^2 + 2√2 GD √(log(1/δ)) )    with η = c/√T
  = O(1/√T).

Strong Convexity

A General Definition: F is λ-strongly convex if for all α ∈ [0, 1] and all w, w' ∈ W,

F(αw + (1 − α)w') ≤ αF(w) + (1 − α)F(w') − (λ/2) α(1 − α) ||w − w'||_2^2.

A More Popular Definition: F is λ-strongly convex if for all w, w' ∈ W,

F(w) + ⟨∇F(w), w' − w⟩ + (λ/2) ||w' − w||_2^2 ≤ F(w').

An Important Property: If F is λ-strongly convex, then

(λ/2) ||w − w*||_2^2 ≤ F(w) − F(w*),  where w* = argmin_{w ∈ W} F(w).
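The important property follows from the second definition together with first-order optimality of w*; a short derivation (a standard argument, not spelled out on the slide):

```latex
% Set w -> w^*, w' -> w in the gradient-based definition:
F(w^{*}) + \langle \nabla F(w^{*}),\, w - w^{*} \rangle + \tfrac{\lambda}{2}\|w - w^{*}\|_2^2 \le F(w)
% First-order optimality of w^* over W:
\langle \nabla F(w^{*}),\, w - w^{*} \rangle \ge 0 \quad \forall\, w \in W
% Combining the two:
\tfrac{\lambda}{2}\|w - w^{*}\|_2^2 \le F(w) - F(w^{*})
```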

Epoch Gradient Descent

The Algorithm [Hazan and Kale, 2011]
1: T_1 = 4, η_1 = 1/λ
2: while Σ_{i=1}^k T_i ≤ T do
3:   for t = 1, ..., T_k do
4:     w^k_{t+1} = Π_W(w^k_t − η_k g^k_t)
5:   end for
6:   w^{k+1}_1 = (1/T_k) Σ_{t=1}^{T_k} w^k_t
7:   η_{k+1} = η_k/2, T_{k+1} = 2T_k
8:   k = k + 1
9: end while

Our Goal: establish an O(1/(λT)) excess risk bound.
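A minimal runnable sketch of Epoch-GD on a toy λ-strongly convex problem; the objective, the constraint set, and the constants are hypothetical choices for illustration, not from [Hazan and Kale, 2011].

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy problem: F(w) = E_xi[ 0.5 * ||w - xi||^2 ], xi ~ N(mu, I), over W = {||w|| <= R}.
# F is 1-strongly convex (lambda = 1) and g = w - xi is an unbiased stochastic gradient.

def project_ball(w, R):
    norm = np.linalg.norm(w)
    return w if norm <= R else w * (R / norm)

def epoch_gd(mu, R, T, lam=1.0):
    w = np.zeros_like(mu)                  # w^1_1
    T_k, eta_k, used = 4, 1.0 / lam, 0     # T_1 = 4, eta_1 = 1/lambda
    while used + T_k <= T:                 # while sum_i T_i <= T
        avg = np.zeros_like(mu)
        for _ in range(T_k):               # inner SGD loop with fixed eta_k
            xi = mu + rng.normal(size=mu.shape)
            w = project_ball(w - eta_k * (w - xi), R)
            avg += w
        w = avg / T_k                      # w^{k+1}_1: average iterate of epoch k
        used += T_k
        eta_k, T_k = eta_k / 2.0, 2 * T_k  # halve the step size, double the epoch
    return w

mu = np.array([0.5, -0.25])                # ||mu|| < 1, so the optimum is w* = mu
w_out = epoch_gd(mu, R=1.0, T=20000)
```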

Epoch Gradient Descent: Expectation Bound

Analysis I

If for every epoch k we have

E[F(w^k_1)] − F(w*) = O(1/(λ T_k)),

then the last solution w^K_1 satisfies

E[F(w^K_1)] − F(w*) = O(1/(λ T_K)) = O(1/(λT)).

So suppose

E[F(w^k_1)] − F(w*) ≤ c G^2/(λ T_k).

From the analysis of SGD, we have

E[F(w^{k+1}_1)] − F(w*) ≤ (1/(2 η_k T_k)) E[ ||w^k_1 − w*||_2^2 ] + (η_k/2) G^2.

Analysis II

Recall that strong convexity gives

||w^k_1 − w*||_2^2 ≤ (2/λ) (F(w^k_1) − F(w*)).

Then we have

E[F(w^{k+1}_1)] − F(w*) ≤ (1/(λ η_k T_k)) E[F(w^k_1) − F(w*)] + (η_k/2) G^2
  = (1/4) E[F(w^k_1) − F(w*)] + 2G^2/(λ T_k)       (using λ η_k T_k = 4)
  ≤ (1/4) (c G^2/(λ T_k)) + 2G^2/(λ T_k)
  = (c G^2/(λ T_{k+1})) (1/2 + 4/c)                (using T_{k+1} = 2 T_k)
  = c G^2/(λ T_{k+1})                              (setting c = 8).

Epoch Gradient Descent: High-probability Bound

What do We Need?

Key inequality in the expectation bound:

E[F(w^{k+1}_1)] − F(w*) ≤ (1/(2 η_k T_k)) E[ ||w^k_1 − w*||_2^2 ] + (η_k/2) G^2.

A high-probability inequality based on Azuma's inequality:

F(w^{k+1}_1) − F(w*) ≤ (1/(2 η_k T_k)) ||w^k_1 − w*||_2^2 + (η_k/2) G^2 + 2GD √(2 log(1/δ) / T_k).

The inequality that we need:

F(w^{k+1}_1) − F(w*) ≤ (1/(2 η_k T_k)) ||w^k_1 − w*||_2^2 + (η_k/2) G^2 + O(1/(λ T_k)).

Concentration Inequality I

Bernstein's Inequality for Martingales: Let X_1, ..., X_n be a bounded martingale difference sequence with respect to the filtration F = (F_i)_{1 ≤ i ≤ n}, with |X_i| ≤ K. Let

S_i = Σ_{j=1}^i X_j

be the associated martingale, and denote the sum of the conditional variances by

Σ_n^2 = Σ_{t=1}^n E[ X_t^2 | F_{t−1} ].

Then, for all constants t, ν > 0,

Pr( max_{i=1,...,n} S_i > t and Σ_n^2 ≤ ν ) ≤ exp( −t^2 / (2(ν + Kt/3)) ).

Concentration Inequality II

Corollary 2: Suppose Σ_n^2 ≤ ν. Then, with probability at least 1 − δ,

S_n ≤ √(2ν log(1/δ)) + (2K/3) log(1/δ).

Note that, by the AM-GM inequality, for any α > 0,

√(2ν log(1/δ)) ≤ (1/(2α)) log(1/δ) + αν.

The Basic Idea

For any w ∈ W, strong convexity gives

F(w_t) − F(w) ≤ ⟨∇F(w_t), w_t − w⟩ − (λ/2) ||w_t − w||_2^2
  = ⟨g_t, w_t − w⟩ + δ_t − (λ/2) ||w_t − w||_2^2,

where δ_t = ⟨∇F(w_t) − g_t, w_t − w⟩. Then, following the same analysis as before, we arrive at

F(w̄_T) − F(w) ≤ (1/(2ηT)) ||w_1 − w||_2^2 + (η/2) G^2 + (1/T) ( Σ_{t=1}^T δ_t − (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 ).

We are going to prove

Σ_{t=1}^T δ_t ≤ (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + O( log log T / λ ).

Analysis I

δ_t = ⟨∇F(w_t) − g_t, w_t − w⟩ is a martingale difference sequence with

|δ_t| ≤ ||∇F(w_t) − g_t||_2 ||w_t − w||_2 ≤ D (||∇F(w_t)||_2 + ||g_t||_2) ≤ 2GD,

where D could be an a priori parameter, or 2G/λ when w = w*.

The sum of the conditional variances is upper bounded by

Σ_T^2 = Σ_{t=1}^T E_t[δ_t^2] ≤ Σ_{t=1}^T E_t[ ||∇F(w_t) − g_t||_2^2 ||w_t − w||_2^2 ] ≤ 4G^2 Σ_{t=1}^T ||w_t − w||_2^2.

Analysis II

A straightforward way: since Σ_T^2 ≤ 4G^2 Σ_{t=1}^T ||w_t − w||_2^2, following Bernstein's inequality for martingales, with probability at least 1 − δ we would have

Σ_{t=1}^T δ_t ≤ √( 2 · 4G^2 Σ_{t=1}^T ||w_t − w||_2^2 · log(1/δ) ) + (2/3) · 2GD log(1/δ)
  ≤ (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + (4G^2/λ) log(1/δ) + (4GD/3) log(1/δ).

Analysis III

However, the straightforward way is not valid: Bernstein's inequality requires a deterministic upper bound ν on Σ_T^2, while Σ_{t=1}^T ||w_t − w||_2^2 is random. We need peeling: divide the range Σ_{t=1}^T ||w_t − w||_2^2 ∈ [0, TD^2] into a sequence of intervals.

The initial interval: Σ_{t=1}^T ||w_t − w||_2^2 ∈ [0, D^2/T]. Since

|δ_t| ≤ ||∇F(w_t) − g_t||_2 ||w_t − w||_2 ≤ 2G ||w_t − w||_2,

we have, by the Cauchy-Schwarz inequality,

Σ_{t=1}^T δ_t ≤ 2G Σ_{t=1}^T ||w_t − w||_2 ≤ 2G √( T Σ_{t=1}^T ||w_t − w||_2^2 ) ≤ 2GD.

The rest: Σ_{t=1}^T ||w_t − w||_2^2 ∈ (2^{i−1} D^2/T, 2^i D^2/T], where i = 1, ..., ⌈2 log_2 T⌉.

Analysis IV

Define

A_T = Σ_{t=1}^T ||w_t − w||_2^2 ≤ TD^2.

We are going to bound

Pr[ Σ_{t=1}^T δ_t ≥ 2 √(4G^2 A_T τ) + (2/3) · 2GD τ + 2GD ].

Analysis V

Pr[ Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4GD/3) τ + 2GD ]
  = Pr[ Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4GD/3) τ + 2GD, A_T ≤ D^2/T ]
    + Pr[ Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4GD/3) τ + 2GD, D^2/T < A_T ≤ TD^2 ]
  = Pr[ Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4GD/3) τ + 2GD, Σ_T^2 ≤ 4G^2 A_T, D^2/T < A_T ≤ TD^2 ]
  ≤ Σ_{i=1}^m Pr[ Σ_t δ_t ≥ 2√(4G^2 A_T τ) + (4GD/3) τ + 2GD, Σ_T^2 ≤ 4G^2 A_T, (D^2/T) 2^{i−1} < A_T ≤ (D^2/T) 2^i ]
  ≤ Σ_{i=1}^m Pr[ Σ_t δ_t ≥ √( 2 · (4G^2 D^2 2^i / T) · τ ) + (4GD/3) τ, Σ_T^2 ≤ 4G^2 D^2 2^i / T ]
  ≤ m e^{−τ},

where m = ⌈2 log_2 T⌉. (The first term after the first equality vanishes: when A_T ≤ D^2/T, we always have Σ_t δ_t ≤ 2GD.)

Analysis VI

By setting τ = log(m/δ), with probability at least 1 − δ,

Σ_{t=1}^T δ_t ≤ 2 √(4G^2 A_T τ) + (4GD/3) τ + 2GD
  = 2 √( 4G^2 Σ_{t=1}^T ||w_t − w||_2^2 · τ ) + (4GD/3) τ + 2GD
  ≤ (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + (8G^2/λ) τ + (4GD/3) τ + 2GD
  = (λ/2) Σ_{t=1}^T ||w_t − w||_2^2 + O( log log T / λ ),

where τ = log(m/δ) = log( 2 log_2 T / δ ) = O(log log T).

Reference I

Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. (2012). Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235-3249.

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems 26, pages 773-781.

Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33(4):1497-1537.

Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482.

Feldman, V. (2016). Generalization of ERM in stochastic convex optimization: The dimension strikes back. arXiv:1608.04414.

Reference II

Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192.

Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: An optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421-436.

Koren, T. and Levy, K. (2015). Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems 28, pages 1477-1485.

Kushner, H. J. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer, second edition.

Lan, G. (2012). An optimal method for stochastic composite optimization. Mathematical Programming, 133:365-397.

Reference III

Lee, W. S., Bartlett, P. L., and Williamson, R. C. (1996). The importance of convexity in learning with squared loss. In Proceedings of the 9th Annual Conference on Computational Learning Theory, pages 140-146.

Mahdavi, M., Zhang, L., and Jin, R. (2015). Lower and upper bounds on the generalization of stochastic exponentially concave optimization. In Proceedings of the 28th Conference on Learning Theory.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609.

Nesterov, Y. (2011). Random gradient-free minimization of convex functions. CORE Discussion Papers.

Panchenko, D. (2002). Some extensions of an inequality of Vapnik and Chervonenkis. Electronic Communications in Probability, 7:55-65.

Reference IV

Rakhlin, A., Shamir, O., and Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, pages 449-456.

Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. (2009). Stochastic convex optimization. In Proceedings of the 22nd Annual Conference on Learning Theory.

Shamir, O. and Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pages 71-79.

Srebro, N., Sridharan, K., and Tewari, A. (2010). Optimistic rates for learning with a smooth loss. arXiv:1009.3896.

Sridharan, K., Shalev-Shwartz, S., and Srebro, N. (2009). Fast rates for regularized objectives. In Advances in Neural Information Processing Systems 21, pages 1545-1552.

Reference V

Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32:135-166.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.

Zhang, L., Yang, T., and Jin, R. (2017). Empirical risk minimization for stochastic convex optimization: O(1/n)- and O(1/n^2)-type of risk bounds. In Proceedings of the 30th Annual Conference on Learning Theory.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928-936.