Stochastic Optimization Part I: Convex analysis and online stochastic optimization


Stochastic Optimization Part I: Convex analysis and online stochastic optimization
Taiji Suzuki
Tokyo Institute of Technology, Graduate School of Information Science and Engineering, Department of Mathematical and Computing Sciences

Outline
1. Introduction
2. Short course on convex analysis: convexity and related concepts; duality; smoothness and strong convexity
3. First-order methods: proximal gradient descent; Nesterov's acceleration and optimal convergence
4. Online stochastic optimization: stochastic gradient descent; stochastic regularized dual averaging

Lecture plan
Part I: introduction to stochastic optimization; convex analysis; first-order methods; online stochastic optimization methods (SGD, SRDA).
Part II: faster stochastic gradient methods and batch methods; AdaGrad, acceleration of SGD; batch stochastic optimization methods (SDCA, SVRG, SAG).
Part III: stochastic ADMM and distributed optimization; stochastic ADMM for structured sparsity; distributed optimization (very briefly).

Outline: 1. Introduction

Machine learning as optimization
Machine learning is a methodology to deal with a lot of uncertain data.
Generalization error minimization: $\min_{\theta \in \Theta} \mathbb{E}_Z[\ell_\theta(Z)]$.
Empirical approximation: $\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \ell_\theta(z_i)$.
Stochastic optimization is an intersection of learning and optimization.

[Figure: a stream of data $x_1, x_2, x_3, x_4, \dots$]
Recently, stochastic optimization has been used to treat huge data:
$\frac{1}{n}\sum_{i=1}^n \ell_\theta(z_i) + \psi(\theta)$, where $n$ is huge.
How can we optimize this efficiently? Do we need to go through the whole data set at every iteration?

History of stochastic optimization for ML
- 1951 Robbins and Monro: stochastic approximation for the root finding problem
- 1957 Rosenblatt: perceptron
- 1978 Nemirovskii and Yudin: robustification via averaging for non-smooth objectives
- 1988 Ruppert: robust step size policy and averaging
- 1992 Polyak and Juditsky: averaging for smooth objectives
- 1998 Bottou: online stochastic optimization
- 2004 Bottou and LeCun: online methods for large-scale ML tasks
- 2009-2011 Singer and Duchi; Duchi et al.; Xiao: FOBOS, AdaGrad, RDA
- 2012 Le Roux et al.: linear convergence on batch data (SAG)
- 2013 Shalev-Shwartz and Zhang (SDCA); Johnson and Zhang (SVRG)

Overview of stochastic optimization: $\min_x f(x)$
Stochastic approximation (SA): optimization for systems with uncertainty, e.g., machine control, traffic management, social science, and so on. We observe $g_t = \nabla f(x^{(t)}) + \xi_t$, where $\xi_t$ is noise (typically i.i.d.).
Stochastic approximation for machine learning and statistics: typically generalization error minimization, $\min_x f(x) = \min_x \mathbb{E}_Z[\ell(Z, x)]$. We observe $g_t = \nabla_x \ell(z_t, x^{(t)})$, where $z_t \sim P(Z)$ is i.i.d. data and $\ell(z, x)$ is a loss function, e.g., the logistic loss $\ell((w, y), x) = \log(1 + \exp(-y w^\top x))$ for $z = (w, y) \in \mathbb{R}^p \times \{\pm 1\}$.
Used for huge data sets: we don't need exact optimization; optimization up to a certain precision (typically $O(1/n)$) is sufficient.
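As a concrete illustration of the observation model above, here is a minimal Python/NumPy sketch (the helper name and the toy data are hypothetical) of drawing one observation and computing the stochastic gradient of the logistic loss:

```python
import numpy as np

def logistic_grad(w, y, x):
    """Gradient w.r.t. x of the logistic loss log(1 + exp(-y * w^T x))."""
    return -y * w / (1.0 + np.exp(y * w.dot(x)))

rng = np.random.default_rng(0)
p = 5
x_t = np.zeros(p)                    # current iterate x^(t)
w, y = rng.normal(size=p), 1.0       # one observation z_t = (w, y) ~ P(Z)
g_t = logistic_grad(w, y, x_t)       # the noisy first-order information g_t
```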

Two types of stochastic optimization
Online-type stochastic optimization: we observe data sequentially, and each observation is used just once (basically):
$\min_x \mathbb{E}_Z[\ell(Z, x)]$.
Batch-type stochastic optimization: the whole sample has already been observed, and we may use the training data multiple times:
$\min_x \frac{1}{n}\sum_{i=1}^n \ell(z_i, x)$.

Summary of convergence rates
Online methods (expected risk minimization):
- $\frac{GR}{\sqrt{T}}$ (non-smooth, non-strongly convex)
- $\frac{G^2}{\mu T}$ (non-smooth, strongly convex)
- $\frac{\sigma R}{\sqrt{T}} + \frac{L R^2}{T^2}$ (smooth, non-strongly convex)
- $\frac{\sigma^2}{\mu T} + \exp\left(-\frac{\mu}{L}T\right)$ (smooth, strongly convex)
Batch methods (empirical risk minimization):
- $\exp\left(-\frac{T}{n + L/\mu}\right)$ (smooth loss, strongly convex regularization)
- $\exp\left(-\frac{T}{n + \sqrt{nL/\mu}}\right)$ (smooth loss, strongly convex regularization, with acceleration)
$G$: upper bound on the norm of the gradient; $R$: diameter of the domain; $L$: smoothness; $\mu$: strong convexity; $\sigma$: variance of the gradient.


Example of empirical risk minimization: high-dimensional data analysis
Redundant information deteriorates the estimation accuracy.
[Figures: bio-informatics, text data, image data.]

Sparse estimation
Cut off redundant information $\to$ sparsity.
R. Tibshirani (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288.


Lasso estimator
Lasso ($L_1$ regularization), a convex optimization problem:
$\hat{\beta}_{\mathrm{Lasso}} = \mathrm{argmin}_{\beta\in\mathbb{R}^p} \|Y - X\beta\|^2 + \lambda\|\beta\|_1$, where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$.
- The $L_1$ norm is the convex hull of the $L_0$ norm on $[-1, 1]^p$ (the largest convex function that supports it from below).
- The $L_1$ norm is the Lovász extension of the cardinality function.
More generally, for a loss function $\ell$ (logistic loss, hinge loss, ...):
$\min_x \left\{ \sum_{i=1}^n \ell(z_i, x) + \lambda\|x\|_1 \right\}$.

Outline: 2. Short course on convex analysis

Regularized empirical risk minimization
Basically, we want to solve:
- empirical risk minimization: $\min_{x\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \ell(z_i, x)$;
- regularized empirical risk minimization: $\min_{x\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \ell(z_i, x) + \psi(x)$.
In this lecture we assume $\ell$ and $\psi$ are convex, and use convex analysis to exploit the properties of convex functions.

Outline: 2. Short course on convex analysis (convexity and related concepts)

Convex set
Definition (convex set). A convex set is a set that contains the segment connecting any two points in the set:
$x_1, x_2 \in C \implies \theta x_1 + (1-\theta)x_2 \in C \quad (\forall \theta \in [0, 1])$.

Epigraph and domain
Let $\bar{\mathbb{R}} := \mathbb{R} \cup \{\infty\}$.
Definition (epigraph and domain). The epigraph of a function $f: \mathbb{R}^p \to \bar{\mathbb{R}}$ is
$\mathrm{epi}(f) := \{(x, \mu) \in \mathbb{R}^{p+1} : f(x) \le \mu\}$,
and the domain of $f$ is
$\mathrm{dom}(f) := \{x \in \mathbb{R}^p : f(x) < \infty\}$.

Convex function
Definition (convex function). A function $f: \mathbb{R}^p \to \bar{\mathbb{R}}$ is a convex function if
$\theta f(x) + (1-\theta) f(y) \ge f(\theta x + (1-\theta) y) \quad (\forall x, y \in \mathbb{R}^p,\ \theta \in [0, 1])$,
with the convention $\infty + \infty = \infty$.
$f$ is convex $\iff$ $\mathrm{epi}(f)$ is a convex set.

Proper and closed convex functions
- If the domain of a function $f$ is not empty ($\mathrm{dom}(f) \ne \emptyset$), $f$ is called proper.
- If the epigraph of a convex function $f$ is a closed set, then $f$ is called closed. (We are interested only in proper closed functions in this lecture.)
- Even if $f$ is closed, its domain is not necessarily closed (even in one dimension).
- That $f$ is closed does not imply that $f$ is continuous; a closed convex function is continuous on any segment in its domain.
- A closed function is lower semicontinuous.

Convex loss functions (regression)
All widely used loss functions are (closed) convex. The following are convex w.r.t. $u$ for a fixed label $y \in \mathbb{R}$:
- Squared loss: $\ell(y, u) = \frac{1}{2}(y - u)^2$.
- $\tau$-quantile loss: $\ell(y, u) = (1-\tau)\max\{u - y, 0\} + \tau\max\{y - u, 0\}$ for some $\tau \in (0, 1)$; used for quantile regression.
- $\epsilon$-sensitive loss: $\ell(y, u) = \max\{|y - u| - \epsilon, 0\}$ for some $\epsilon > 0$; used for support vector regression.
[Figure: the three losses plotted against $u - y$.]

Convex surrogate losses (classification), $y \in \{\pm 1\}$
- Logistic loss: $\ell(y, u) = \log((1 + \exp(-yu))/2)$.
- Hinge loss: $\ell(y, u) = \max\{1 - yu, 0\}$.
- Exponential loss: $\ell(y, u) = \exp(-yu)$.
- Smoothed hinge loss:
$\ell(y, u) = \begin{cases} 0 & (yu \ge 1), \\ \frac{1}{2} - yu & (yu < 0), \\ \frac{1}{2}(1 - yu)^2 & (\text{otherwise}). \end{cases}$
[Figure: the losses plotted against $yu$.]

Convex regularization functions
- Ridge regularization: $R(x) = \|x\|_2^2 := \sum_{j=1}^p x_j^2$.
- $L_1$ regularization: $R(x) = \|x\|_1 := \sum_{j=1}^p |x_j|$.
- Trace norm regularization: $R(X) = \|X\|_{\mathrm{tr}} = \sum_{j=1}^{\min\{q, r\}} \sigma_j(X)$, where $\sigma_j(X) \ge 0$ is the $j$-th singular value.
Examples:
- $\frac{1}{n}\sum_{i=1}^n (y_i - z_i^\top x)^2 + \lambda\|x\|_1$: Lasso.
- $\frac{1}{n}\sum_{i=1}^n \log(1 + \exp(-y_i \langle z_i, X \rangle)) + \lambda\|X\|_{\mathrm{tr}}$: low-rank matrix recovery.

Other definitions of sets
- Convex hull: $\mathrm{conv}(C)$ is the smallest convex set that contains a set $C \subseteq \mathbb{R}^p$.
- Affine set: a set $A$ is an affine set if and only if, for all $x, y \in A$, the line through $x$ and $y$ lies in $A$: $\lambda x + (1-\lambda)y \in A$ ($\forall \lambda \in \mathbb{R}$).
- Affine hull: the smallest affine set that contains a set $C \subseteq \mathbb{R}^p$.
- Relative interior $\mathrm{ri}(C)$: let $A$ be the affine hull of a convex set $C \subseteq \mathbb{R}^p$; $\mathrm{ri}(C)$ is the set of interior points of $C$ with respect to the relative topology induced by the affine hull $A$.

Continuity of a closed convex function
Theorem. For a (possibly non-convex) function $f: \mathbb{R}^p \to \bar{\mathbb{R}}$, the following three conditions are equivalent:
1. $f$ is lower semi-continuous;
2. for any convergent sequence $\{x_n\}_{n=1}^\infty \subseteq \mathbb{R}^p$ with $x_\infty = \lim_{n\to\infty} x_n$, we have $\liminf_{n\to\infty} f(x_n) \ge f(x_\infty)$;
3. $f$ is closed.
Remark: any convex function $f$ is continuous on $\mathrm{ri}(\mathrm{dom}(f))$; continuity can break down on the boundary of the domain.

Outline: 2. Short course on convex analysis (duality)

Subgradient
We want to deal with non-differentiable functions such as the $L_1$ regularization, so we need an object that plays the role of the gradient.
Definition (subdifferential, subgradient). For a proper convex function $f: \mathbb{R}^p \to \bar{\mathbb{R}}$, the subdifferential of $f$ at $x \in \mathrm{dom}(f)$ is defined by
$\partial f(x) := \{g \in \mathbb{R}^p \mid \langle x' - x, g \rangle + f(x) \le f(x') \ (\forall x' \in \mathbb{R}^p)\}$.
An element of the subdifferential is called a subgradient.

Properties of the subgradient
- A subgradient does not necessarily exist ($\partial f(x)$ can be empty): $f(x) = x\log(x)$ ($x \ge 0$) is proper convex but not subdifferentiable at $x = 0$. A subgradient always exists on $\mathrm{ri}(\mathrm{dom}(f))$.
- If $f$ is differentiable at $x$, its gradient is the unique element of the subdifferential: $\partial f(x) = \{\nabla f(x)\}$.
- If $\mathrm{ri}(\mathrm{dom}(f)) \cap \mathrm{ri}(\mathrm{dom}(h)) \ne \emptyset$, then
$\partial(f + h)(x) = \partial f(x) + \partial h(x) = \{g + g' \mid g \in \partial f(x),\ g' \in \partial h(x)\}$ ($\forall x \in \mathrm{dom}(f) \cap \mathrm{dom}(h)$).
- Monotonicity: for all $g \in \partial f(x)$ and all $g' \in \partial f(x')$ ($x, x' \in \mathrm{dom}(f)$), $\langle g - g', x - x' \rangle \ge 0$.
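For instance, $\psi(x) = \|x\|_1$ is subdifferentiable everywhere; a minimal sketch (hypothetical helper name) returning one valid subgradient coordinate-wise:

```python
import numpy as np

def subgrad_l1(x):
    """One element of the subdifferential of ||x||_1:
    sign(x_j) where x_j != 0; at x_j = 0 any value in [-1, 1] works (we pick 0)."""
    return np.sign(x)
```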

Legendre transform
This defines the other representation of a function on the dual space (the space of gradients).
Definition (Legendre transform). Let $f$ be a (possibly non-convex) function $f: \mathbb{R}^p \to \bar{\mathbb{R}}$ with $\mathrm{dom}(f) \ne \emptyset$. Its convex conjugate is given by
$f^*(y) := \sup_{x\in\mathbb{R}^p} \{\langle x, y \rangle - f(x)\}$.
The map from $f$ to $f^*$ is called the Legendre transform.
[Figure: $f(x)$ with the supporting line of slope $y$ touching at $x^*$, from which $f^*(y)$ is read off.]

Examples
- Squared loss: $f(x) = \frac{1}{2}x^2$, $f^*(y) = \frac{1}{2}y^2$.
- Hinge loss: $f(x) = \max\{1 - x, 0\}$, $f^*(y) = y$ for $-1 \le y \le 0$, and $\infty$ otherwise.
- Logistic loss: $f(x) = \log(1 + \exp(-x))$, $f^*(y) = (-y)\log(-y) + (1+y)\log(1+y)$ for $-1 \le y \le 0$, and $\infty$ otherwise.
- $L_1$ regularization: $f(x) = \|x\|_1$, $f^*(y) = 0$ if $\max_j |y_j| \le 1$, and $\infty$ otherwise.
- $L_p$ regularization ($p > 1$): $f(x) = \frac{1}{p}\sum_{j=1}^d |x_j|^p$, $f^*(y) = \frac{p-1}{p}\sum_{j=1}^d |y_j|^{\frac{p}{p-1}}$.
[Figures: the logistic loss and its dual; the $L_1$ norm and its dual.]
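The one-dimensional entries of this table can be verified numerically; a sketch that approximates $f^*(y) = \sup_x\{xy - f(x)\}$ on a grid, assuming the supremum is attained inside the grid:

```python
import numpy as np

xs = np.linspace(-20.0, 20.0, 200001)      # primal grid (assumed to contain the maximizer)

def conjugate(f, y):
    """Grid approximation of f*(y) = sup_x { x*y - f(x) }."""
    return np.max(xs * y - f(xs))

squared = lambda x: 0.5 * x ** 2
logistic = lambda x: np.log1p(np.exp(-x))

print(conjugate(squared, 0.7), 0.5 * 0.7 ** 2)    # both ~0.245
y = -0.3                                          # logistic conjugate is finite on [-1, 0]
print(conjugate(logistic, y), -y * np.log(-y) + (1 + y) * np.log(1 + y))   # both ~-0.611
```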

Properties of the Legendre transform
- $f^*$ is a convex function even if $f$ is not.
- $f^{**}$ is the closure of the convex hull of $f$: $f^{**} = \mathrm{cl}(\mathrm{conv}(f))$.
- Corollary: the Legendre transform is a bijection from the set of proper closed convex functions onto the set of proper closed convex functions on the dual space:
$f$ (proper closed convex) $\leftrightarrow$ $f^*$ (proper closed convex), with $f^{**} = f$.

Connection to the subgradient
Lemma. $y \in \partial f(x) \iff f(x) + f^*(y) = \langle x, y \rangle \iff x \in \partial f^*(y)$.
Indeed, $y \in \partial f(x) \iff x \in \mathrm{argmax}_{x'\in\mathbb{R}^p}\{\langle x', y \rangle - f(x')\}$ (set the derivative of $\langle x', y \rangle - f(x')$ to zero) $\implies f^*(y) = \langle x, y \rangle - f(x)$.
Remark: by definition, we always have the Young-Fenchel inequality
$f(x) + f^*(y) \ge \langle x, y \rangle$.

Fenchel's duality theorem
Theorem (Fenchel's duality theorem). Let $f: \mathbb{R}^p \to \bar{\mathbb{R}}$ and $g: \mathbb{R}^q \to \bar{\mathbb{R}}$ be proper closed convex, and let $A \in \mathbb{R}^{q\times p}$. Suppose that condition (a) or (b) below is satisfied. Then
$\inf_{x\in\mathbb{R}^p}\{f(x) + g(Ax)\} = \sup_{y\in\mathbb{R}^q}\{-f^*(A^\top y) - g^*(-y)\}$.
(a) $\exists x \in \mathbb{R}^p$ s.t. $x \in \mathrm{ri}(\mathrm{dom}(f))$ and $Ax \in \mathrm{ri}(\mathrm{dom}(g))$.
(b) $\exists y \in \mathbb{R}^q$ s.t. $A^\top y \in \mathrm{ri}(\mathrm{dom}(f^*))$ and $-y \in \mathrm{ri}(\mathrm{dom}(g^*))$.
If (a) is satisfied, there exists $y^* \in \mathbb{R}^q$ attaining the sup on the RHS; if (b) is satisfied, there exists $x^* \in \mathbb{R}^p$ attaining the inf on the LHS. Under (a) and (b), $x^*, y^*$ are the optimal solutions of the two sides iff the Karush-Kuhn-Tucker conditions hold:
$A^\top y^* \in \partial f(x^*), \quad Ax^* \in \partial g^*(-y^*)$.

Equivalence to the separation theorem
[Figure omitted.]

Applying Fenchel's duality theorem to RERM
RERM (Regularized Empirical Risk Minimization): let $\ell_i(z_i^\top x) = \ell(y_i, z_i^\top x)$, where $(z_i, y_i)$ is the input-output pair of the $i$-th observation, and write $f(Zx) = \sum_{i=1}^n \ell_i(z_i^\top x)$.
(Primal) $\inf_{x\in\mathbb{R}^p}\left\{ \sum_{i=1}^n \ell_i(z_i^\top x) + \psi(x) \right\}$
By Fenchel's duality theorem,
$\inf_{x\in\mathbb{R}^p}\{f(Zx) + \psi(x)\} = \sup_{y\in\mathbb{R}^n}\{-f^*(y) - \psi^*(-Z^\top y)\}$,
(Dual) $\sup_{y\in\mathbb{R}^n}\left\{ -\sum_{i=1}^n \ell_i^*(y_i) - \psi^*(-Z^\top y) \right\}$.
This fact will be used to derive the dual coordinate descent algorithm.

Outline: 2. Short course on convex analysis (smoothness and strong convexity)

Smoothness and strong convexity
Definition.
- Smoothness: the gradient is Lipschitz continuous, $\|\nabla f(x) - \nabla f(x')\| \le L\|x - x'\|$.
- Strong convexity: for all $\theta \in (0, 1)$ and $x, y \in \mathrm{dom}(f)$,
$\frac{\mu}{2}\theta(1-\theta)\|x - y\|^2 + f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta) f(y)$.
[Figures: smooth but not strongly convex; smooth and strongly convex; strongly convex but not smooth.]

Duality between smoothness and strong convexity
Smoothness and strong convexity are dual to each other.
Theorem. Let $f: \mathbb{R}^p \to \bar{\mathbb{R}}$ be proper closed convex. Then $f$ is $L$-smooth $\iff$ $f^*$ is $1/L$-strongly convex.
[Figure: the logistic loss (smooth but not strongly convex) and its dual function (strongly convex but not smooth, with diverging gradient).]

Outline: 3. First-order methods


Regularized learning problems
Lasso: $\min_{x\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n (y_i - z_i^\top x)^2 + \underbrace{\|x\|_1}_{\text{regularization}}$.
General regularized learning problem: $\min_{x\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \ell(z_i, x) + \psi(x)$.
Difficulty: sparsity-inducing regularization is usually non-smooth.

First-order optimization
[Figure: iterates $x_{t-1}, x_t, x_{t+1}$.]
First-order methods use only the function value $f(x)$ and a first-order (sub)gradient $g \in \partial f(x)$. The computation per iteration is light, so they are suited to high-dimensional problems. (Newton's method, by contrast, is a second-order method.)

Outline: 3. First-order methods (proximal gradient descent)

Gradient descent
Let $f(x) = \sum_{i=1}^n \ell(z_i, x)$ and consider $\min_x f(x)$.
Subgradient method (differentiable $f$):
$x_t = x_{t-1} - \eta_t \nabla f(x_{t-1})$.

Gradient descent
Subgradient method (subdifferentiable $f$):
$g_t \in \partial f(x_{t-1}), \quad x_t = x_{t-1} - \eta_t g_t$.

Gradient descent
Subgradient method (equivalent formulation): with $g_t \in \partial f(x_{t-1})$,
$x_t = \mathrm{argmin}_x \left\{ \langle x, g_t \rangle + \frac{1}{2\eta_t}\|x - x_{t-1}\|^2 \right\}$.
Proximal point algorithm:
$x_t = \mathrm{argmin}_x \left\{ f(x) + \frac{1}{2\eta_t}\|x - x_{t-1}\|^2 \right\}$.
$f(x_t)$ converges to the optimum for any convex $f$ and fixed $\eta_t = \eta > 0$ (Güler, 1991). If $f$ is $\sigma$-strongly convex:
$f(x_t) - f(x^*) \le \frac{1}{2\eta}\left(\frac{1}{1+\sigma\eta}\right)^{t-1} \|x_0 - x^*\|^2$.
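A minimal sketch of the subgradient method on the toy non-smooth objective $f(x) = \|x - a\|_1$ (all names and data are made up; the $1/\sqrt{t}$ step size matches the non-smooth, non-strongly convex setting discussed below):

```python
import numpy as np

a = np.array([1.0, -2.0, 3.0])          # minimizer of f(x) = ||x - a||_1

x = np.zeros_like(a)
for t in range(1, 1001):
    g = np.sign(x - a)                  # g_t in the subdifferential of f at x_{t-1}
    x = x - (1.0 / np.sqrt(t)) * g      # eta_t = 1/sqrt(t)
print(np.round(x, 2))                   # close to a (within the final step size)
```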

Proximal gradient descent
Let $f(x) = \sum_{i=1}^n \ell(z_i, x)$ and consider $\min_x f(x) + \psi(x)$.
Proximal gradient descent: with $g_t \in \partial f(x_{t-1})$,
$x_t = \mathrm{argmin}_x \left\{ \langle x, g_t \rangle + \psi(x) + \frac{1}{2\eta_t}\|x - x_{t-1}\|^2 \right\} = \mathrm{argmin}_x \left\{ \eta_t \psi(x) + \frac{1}{2}\|x - (x_{t-1} - \eta_t g_t)\|^2 \right\}$.
The update rule is given by the proximal mapping
$\mathrm{prox}(q \mid \psi) = \mathrm{argmin}_x \left\{ \psi(x) + \frac{1}{2}\|x - q\|^2 \right\}$,
i.e., $x_t = \mathrm{prox}(x_{t-1} - \eta_t g_t \mid \eta_t\psi)$. By using the proximal mapping, we can avoid the bad properties (e.g., non-smoothness) of $\psi$.
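The update in code, as a sketch with the proximal mapping passed in as a function (the helper name proximal_gradient is hypothetical; $f$ is assumed differentiable so that $g_t = \nabla f(x_{t-1})$):

```python
import numpy as np

def proximal_gradient(grad_f, prox_psi, x0, eta, T):
    """Iterate x_t = prox(x_{t-1} - eta * grad_f(x_{t-1}) | eta * psi)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(T):
        x = prox_psi(x - eta * grad_f(x), eta)
    return x
```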

Example
$L_1$ regularization, $\psi(x) = C\|x\|_1$: the update is coordinate-wise soft-thresholding,
$x_{t,j} = \mathrm{ST}_{C\eta_t}(x_{t-1,j} - \eta_t g_{t,j})$ (the $j$-th component), where $\mathrm{ST}_C(q) = \mathrm{sign}(q)\max\{|q| - C, 0\}$.
Unimportant elements are forced to be exactly 0. For many practically used regularizations, the proximal mapping has an analytic form; see the sketch below.
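The soft-thresholding operator in code, combined with the proximal_gradient sketch above to solve a small Lasso instance (data and $\lambda$ are made up):

```python
import numpy as np

def soft_threshold(q, c):
    """ST_c(q) = sign(q) * max(|q| - c, 0), applied elementwise."""
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 0.1
Z = rng.normal(size=(n, p))
y = Z[:, 0] - 2.0 * Z[:, 1] + 0.1 * rng.normal(size=n)    # sparse ground truth

grad_f = lambda x: Z.T @ (Z @ x - y) / n                  # gradient of the averaged squared loss
prox_psi = lambda q, eta: soft_threshold(q, eta * lam)    # prox of eta * lam * ||.||_1
eta = 1.0 / np.linalg.eigvalsh(Z.T @ Z / n).max()         # 1/gamma, gamma = smoothness of f

x_hat = proximal_gradient(grad_f, prox_psi, np.zeros(p), eta, 500)
print(np.round(x_hat, 2))                                 # most coordinates are exactly 0
```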

Example of proximal mapping (cont.)
Trace norm: $\psi(X) = C\|X\|_{\mathrm{tr}} = C\sum_j \sigma_j(X)$ (the sum of the singular values). Let $X_{t-1} - \eta_t G_t = U\,\mathrm{diag}(\sigma_1, \dots, \sigma_d)\,V^\top$; then
$X_t = U\,\mathrm{diag}\big(\mathrm{ST}_{C\eta_t}(\sigma_1), \dots, \mathrm{ST}_{C\eta_t}(\sigma_d)\big)\,V^\top$.
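A sketch of this proximal mapping via a full SVD (fine for small matrices; the helper name is hypothetical):

```python
import numpy as np

def prox_trace_norm(Q, c):
    """argmin_X { c*||X||_tr + (1/2)*||X - Q||_F^2 }: soft-threshold the singular values of Q."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return U @ np.diag(np.maximum(s - c, 0.0)) @ Vt
```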


Convergence of proximal gradient descent
The strong convexity and smoothness of $f$ determine the convergence rate of $x_t = \mathrm{prox}(x_{t-1} - \eta_t g_t \mid \eta_t\psi)$.
Convergence rate, by property of $f$:
- $\gamma$-smooth: $\exp(-t\,\mu/\gamma)$ if $\mu$-strongly convex; $\gamma/t$ if non-strongly convex.
- non-smooth: $1/(\mu t)$ if $\mu$-strongly convex; $1/\sqrt{t}$ if non-strongly convex.
The step size $\eta_t$ should be chosen accordingly:
- smooth: $\eta_t = 1/\gamma$ (both cases).
- non-smooth: $\eta_t = 1/(\mu t)$ if strongly convex; $\eta_t = 1/\sqrt{t}$ if non-strongly convex.
To achieve these rates in the non-smooth case, we need to average the iterates $\{x_t\}_t$ appropriately (Polyak-Ruppert averaging, polynomially decaying averaging). For smooth losses, the convergence can be improved to the optimal rate by Nesterov's acceleration (see below).

Outline: 3. First-order methods (Nesterov's acceleration and optimal convergence)

Nesterov's acceleration (non-strongly convex)
$\min_x \{f(x) + \psi(x)\}$. Assumption: $f(x)$ is $\gamma$-smooth.
Nesterov's acceleration scheme: let $s_1 = 1$ and $\eta = 1/\gamma$, and iterate for $t = 1, 2, \dots$:
1. Let $g_t = \nabla f(y_t)$ and update $x_t = \mathrm{prox}(y_t - \eta g_t \mid \eta\psi)$.
2. Set $s_{t+1} = \frac{1 + \sqrt{1 + 4s_t^2}}{2}$.
3. Update $y_{t+1} = x_t + \left(\frac{s_t - 1}{s_{t+1}}\right)(x_t - x_{t-1})$.
If $f$ is $\gamma$-smooth, then $f(x_t) - f(x^*) \le \frac{2\gamma\|x_0 - x^*\|^2}{t^2}$.
This is also called the Fast Iterative Shrinkage Thresholding Algorithm (FISTA) (Beck and Teboulle, 2009). The step size $\eta = 1/\gamma$ can be determined adaptively by a back-tracking technique.
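The scheme in code, reusing the hypothetical grad_f/prox_psi interface of the earlier sketches:

```python
import numpy as np

def fista(grad_f, prox_psi, x0, eta, T):
    """Proximal gradient with Nesterov momentum (FISTA); eta = 1/gamma."""
    x_prev = np.asarray(x0, dtype=float).copy()
    y, s = x_prev.copy(), 1.0
    for _ in range(T):
        x = prox_psi(y - eta * grad_f(y), eta)
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s * s)) / 2.0
        y = x + ((s - 1.0) / s_next) * (x - x_prev)   # momentum extrapolation
        x_prev, s = x, s_next
    return x_prev
```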

Nesterov's acceleration (strongly convex)
$\min_x \{f(x) + \psi(x)\}$. Assumption: $f(x)$ is $\gamma$-smooth and $\mu$-strongly convex (necessarily $\gamma > \mu$).
Nesterov's acceleration scheme: let $A_1 = 1$, $\alpha_1 = \sqrt{\gamma/\mu}$ and $\eta = 1/\gamma$, and iterate for $t = 1, 2, \dots$:
1. Let $g_t = \nabla f(y_t)$ and update $x_t = \mathrm{prox}(y_t - \eta g_t \mid \eta\psi)$.
2. Set $\alpha_{t+1} > 1$ so that $(\gamma - \mu)\alpha_{t+1}^2 - (2\gamma + A_t)\alpha_{t+1} + \gamma = 0$, and let $A_{t+1} = A_t/\alpha_{t+1}$.
3. Update $y_{t+1} = x_t + \frac{\mu + A_t}{(\gamma - \mu)(\alpha_{t+1} - 1)(\alpha_t - 1)}(x_t - x_{t-1})$.
If $f$ is $\gamma$-smooth and $\mu$-strongly convex, then
$f(x_t) - f(x^*) \le \gamma\left(1 - \sqrt{\mu/\gamma}\right)^t \|x_0 - x^*\|^2$.


[Figure: relative objective $f(x_t) - f^*$ versus iteration for the normal gradient method and Nesterov's acceleration. Nesterov's acceleration vs. normal gradient descent on the Lasso, $n = 8{,}000$.]

Outline: 4. Online stochastic optimization

Two types of stochastic optimization
Online-type stochastic optimization:
- We observe data sequentially; we don't need to wait until the whole sample is obtained.
- Each observation is used just once (basically).
$\min_x \mathbb{E}_Z[\ell(Z, x)]$
Batch-type stochastic optimization:
- The whole sample has already been observed; we can make use of the (finite and fixed) sample size.
- We may use each sample multiple times.
$\min_x \frac{1}{n}\sum_{i=1}^n \ell(z_i, x)$

Online method
You don't need to wait until the whole sample arrives: the parameter is updated at each data observation.
[Figure: observe $z_{t-1}$, update $x_t$, observe $z_t$, update $x_{t+1}$, with $z_t \sim P(z)$.]
$\min_x \mathbb{E}[\ell(Z, x)] + \psi(x)$ or $\min_x \frac{1}{T}\sum_{t=1}^T \ell(z_t, x) + \psi(x)$

The objective of online stochastic optimization
Let $\ell(z, x)$ be the loss of $x$ for an observation $z$:
(expected loss) $L(x) = \mathbb{E}_Z[\ell(Z, x)]$, or (expected loss with regularization) $L_\psi(x) = \mathbb{E}_Z[\ell(Z, x)] + \psi(x)$.
The distribution of $Z$ could be
- the true population, in which case $L(x)$ is the generalization error, or
- the empirical distribution of the data stored in a storage, in which case $L$ (or $L_\psi$) is the (regularized) empirical risk.
Online stochastic optimization itself is learning!

Outline: 4. Online stochastic optimization (stochastic gradient descent)

Three steps to stochastic gradient descent
$\mathbb{E}[\ell(Z, x)] = \int \ell(z, x)\,dP(z)$
$\approx \ell(z_t, x)$ (sampling) (1)
$\approx \ell(z_t, x_{t-1}) + \langle \nabla_x \ell(z_t, x_{t-1}), x - x_{t-1} \rangle$ (linearization) (2)
The approximation is correct just around $x_{t-1}$, so we add a proximity term:
$\min_x \mathbb{E}[\ell(Z, x)] \to \min_x \left\{ \langle \nabla_x \ell(z_t, x_{t-1}), x \rangle + \frac{1}{2\eta_t}\|x - x_{t-1}\|^2 \right\}$ (proximation) (3)

Stochastic gradient descent (SGD)
SGD (without regularization):
- Observe $z_t \sim P(Z)$, and let $\ell_t(x) := \ell(z_t, x)$.
- Calculate a subgradient: $g_t \in \partial_x \ell_t(x_{t-1})$.
- Update $x$ as $x_t = x_{t-1} - \eta_t g_t$.
We only need to observe one training example $z_t$ at each iteration: $O(1)$ computation per iteration ($O(n)$ for batch gradient descent); we do not need to go through the whole sample $\{z_i\}_{i=1}^n$.

Stochastic gradient descent (SGD)
SGD (with regularization):
- Observe $z_t \sim P(Z)$, and let $\ell_t(x) := \ell(z_t, x)$.
- Calculate a subgradient: $g_t \in \partial_x \ell_t(x_{t-1})$.
- Update $x$ as $x_t = \mathrm{prox}(x_{t-1} - \eta_t g_t \mid \eta_t\psi)$.
We only need to observe one training example $z_t$ at each iteration: $O(1)$ computation per iteration ($O(n)$ for batch gradient descent); we do not need to go through the whole sample $\{z_i\}_{i=1}^n$.
Reminder: $\mathrm{prox}(q \mid \psi) := \mathrm{argmin}_x \left\{ \psi(x) + \frac{1}{2}\|x - q\|^2 \right\}$.
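A minimal sketch of proximal SGD for $L_1$-regularized logistic regression on a synthetic stream (all data and names are made up; $\eta_t = 1/\sqrt{t}$ as in the analysis below):

```python
import numpy as np

def soft_threshold(q, c):
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

rng = np.random.default_rng(0)
p, lam, T = 10, 0.01, 10000
x = np.zeros(p)

for t in range(1, T + 1):
    w = rng.normal(size=p)                        # observe z_t = (w, y) ~ P(Z)
    y = 1.0 if w[0] - w[1] + 0.1 * rng.normal() > 0 else -1.0
    g = -y * w / (1.0 + np.exp(y * w.dot(x)))     # g_t: gradient of the logistic loss
    eta = 1.0 / np.sqrt(t)                        # eta_t = eta_0 / sqrt(t)
    x = soft_threshold(x - eta * g, eta * lam)    # x_t = prox(x_{t-1} - eta_t g_t | eta_t psi)
print(np.round(x, 2))
```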

Convergence analysis of SGD
Assumptions: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$; (A2) $\mathbb{E}[\|x_t - x^*\|^2] \le D^2$.
Theorem. Let $\bar{x}_T = \frac{1}{T+1}\sum_{t=0}^T x_t$. For $\eta_t = \eta_0/\sqrt{t}$, it holds that
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{\eta_0 G^2 + D^2/\eta_0}{\sqrt{T}}$.
For $\eta_0 = D/G$, the bound becomes $\frac{2GD}{\sqrt{T}}$.
- This is minimax optimal (up to a constant).
- $G$ is independent of $\psi$ thanks to the proximal mapping; note that a subgradient of $\psi(x) = C\|x\|_1$ can have norm as large as $C\sqrt{p}$.

Convergence analysis of SGD (strongly convex)
Assumptions: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$; (A3) $L_\psi$ is $\mu$-strongly convex.
Theorem. Let $\bar{x}_T = \frac{1}{T+1}\sum_{t=0}^T x_t$. For $\eta_t = \frac{1}{\mu t}$, it holds that
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{G^2 \log T}{\mu T}$.
Better than the non-strongly convex situation, but not minimax optimal; the bound is tight (Rakhlin et al., 2012).

Polynomial averaging for a strongly convex risk
Assumptions: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$; (A3) $L_\psi$ is $\mu$-strongly convex.
Modify the update rule as
$x_t = \mathrm{prox}\!\left(x_{t-1} - \eta_t \tfrac{t}{t+1} g_t \,\middle|\, \eta_t\psi\right)$,
and take the weighted average
$\bar{x}_T = \frac{2}{(T+1)(T+2)}\sum_{t=0}^T (t+1)x_t$.
Theorem. For $\eta_t = \frac{2}{\mu t}$, it holds that $\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{2G^2}{\mu T}$.
The $\log T$ factor is removed. This is minimax optimal (explained later).

Remark on polynomial averaging
Does computing $\bar{x}_T = \frac{2}{(T+1)(T+2)}\sum_{t=0}^T (t+1)x_t$ require $O(T)$ work? No: $\bar{x}_t$ can be updated efficiently,
$\bar{x}_t = \frac{t}{t+2}\bar{x}_{t-1} + \frac{2}{t+2}x_t$.
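Both ideas in code: a sketch on a toy strongly convex problem with $\psi = 0$ (so the prox is the identity), keeping the polynomially weighted average with the $O(1)$ recursion above (all data made up):

```python
import numpy as np

mu, T = 1.0, 5000
a = np.array([2.0, -1.0])     # minimizer of the mu-strongly convex f(x) = ||x - a||^2 / 2
rng = np.random.default_rng(0)

x = np.zeros_like(a)
x_bar = x.copy()
for t in range(1, T + 1):
    g = (x - a) + rng.normal(size=2)                   # noisy gradient g_t
    x = x - (2.0 / (mu * t)) * (t / (t + 1.0)) * g     # step with the t/(t+1) gradient weight
    x_bar = (t / (t + 2.0)) * x_bar + (2.0 / (t + 2.0)) * x   # O(1) update of the average
print(np.round(x_bar, 2))                              # close to a
```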

General step size and weighting policy
Let $s_t$ ($t = 1, 2, \dots, T+1$) be a positive sequence such that $\sum_{t=1}^{T+1} s_t = 1$, and set
$x_t = \mathrm{prox}\!\left(x_{t-1} - \eta_t \tfrac{s_t}{s_{t+1}} g_t \,\middle|\, \eta_t\psi\right)$ ($t = 1, \dots, T$), $\quad \bar{x}_T = \sum_{t=0}^T s_{t+1} x_t$.
Assumptions: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$; (A2) $\mathbb{E}[\|x_t - x^*\|^2] \le D^2$; (A3) $L_\psi$ is $\mu$-strongly convex.
Theorem.
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \sum_{t=1}^T \frac{s_{t+1}\eta_{t+1}}{2} G^2 + \sum_{t=0}^T \frac{1}{2}\max\!\left\{ \frac{s_{t+2}}{\eta_{t+1}} - s_{t+1}\!\left(\frac{1}{\eta_t} + \mu\right),\, 0 \right\} D^2$,
where we set $1/\eta_0 = 0$ for the $t = 0$ term.

Special case
Let the weights be proportional to the step sizes (the step size can be seen as an importance):
$s_t = \frac{\eta_t}{\sum_{\tau=1}^{T+1} \eta_\tau}$.
In this setting, the previous theorem gives
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{\sum_{t=1}^T \eta_t^2 G^2 + D^2}{2\sum_{t=1}^T \eta_t}$,
so the conditions $\sum_{t=1}^\infty \eta_t = \infty$ and $\sum_{t=1}^\infty \eta_t^2 < \infty$ ensure convergence.

Outline: 4. Online stochastic optimization (stochastic regularized dual averaging)

Stochastic regularized dual averaging (SRDA)
The second assumption $\mathbb{E}[\|x_t - x^*\|^2] \le D^2$ can be removed by using dual averaging (Nesterov, 2009; Xiao, 2009).
SRDA:
- Observe $z_t \sim P(Z)$, and let $\ell_t(x) := \ell(z_t, x)$.
- Calculate a gradient: $g_t \in \partial_x \ell_t(x_{t-1})$.
- Take the average of the gradients: $\bar{g}_t = \frac{1}{t}\sum_{\tau=1}^t g_\tau$.
- Update as
$x_t = \mathrm{argmin}_{x\in\mathbb{R}^p}\left\{ \langle \bar{g}_t, x \rangle + \psi(x) + \frac{1}{2\eta_t}\|x\|^2 \right\} = \mathrm{prox}(-\eta_t \bar{g}_t \mid \eta_t\psi)$.
The information from old observations is maintained by averaging the gradients.
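A sketch of SRDA for the same synthetic $L_1$-regularized logistic model as before (hypothetical setup; $\eta_t = \eta_0\sqrt{t}$ as in the next theorem):

```python
import numpy as np

def soft_threshold(q, c):
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

rng = np.random.default_rng(0)
p, lam, eta0, T = 10, 0.01, 1.0, 10000
x, g_sum = np.zeros(p), np.zeros(p)

for t in range(1, T + 1):
    w = rng.normal(size=p)
    y = 1.0 if w[0] - w[1] + 0.1 * rng.normal() > 0 else -1.0
    g_sum += -y * w / (1.0 + np.exp(y * w.dot(x)))    # accumulate g_1, ..., g_t
    eta = eta0 * np.sqrt(t)                           # eta_t = eta_0 * sqrt(t)
    x = soft_threshold(-eta * g_sum / t, eta * lam)   # x_t = prox(-eta_t * bar_g_t | eta_t psi)
print(np.round(x, 2))
```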

Convergence analysis of SRDA
Assumption: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$.
Theorem. Let $\bar{x}_T = \frac{1}{T+1}\sum_{t=0}^T x_t$. For $\eta_t = \eta_0\sqrt{t}$, it holds that
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{\eta_0 G^2 + \|x^* - x_0\|^2/\eta_0}{\sqrt{T}}$.
If $\|x^* - x_0\| \le R$, then $\eta_0 = R/G$ gives $\frac{2RG}{\sqrt{T}}$.
- This is minimax optimal (up to a constant).
- The norm of the intermediate solutions $x_t$ is well controlled, so (A2) is not required.

Convergence analysis of SRDA (strongly convex)
Assumptions: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$; (A3) $\psi$ is $\mu$-strongly convex.
Modify the update rule as
$\bar{g}_t = \frac{2}{(t+1)(t+2)}\sum_{\tau=1}^t \tau g_\tau, \quad x_t = \mathrm{prox}(-\eta_t \bar{g}_t \mid \eta_t\psi)$,
and take the weighted average $\bar{x}_T = \frac{2}{(T+1)(T+2)}\sum_{t=0}^T (t+1)x_t$.
Theorem. For $\eta_t = (t+1)(t+2)/\xi$, it holds that
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{\xi\|x^* - x_0\|^2}{T^2} + \frac{2G^2}{T\mu}$.
Letting $\xi \to 0$ yields $\frac{2G^2}{T\mu}$: minimax optimal.

General convergence analysis
Let the weights $s_t > 0$ ($t = 1, \dots$) be any positive sequence. We generalize the update rule as
$\bar{g}_t = \frac{\sum_{\tau=1}^t s_\tau g_\tau}{\sum_{\tau=1}^{t+1} s_\tau}, \quad x_t = \mathrm{prox}(-\eta_t \bar{g}_t \mid \eta_t\psi)$ ($t = 1, \dots, T$),
and let the weighted average of $(x_t)_t$ be $\bar{x}_T = \frac{\sum_{\tau=0}^T s_{\tau+1} x_\tau}{\sum_{\tau=0}^T s_{\tau+1}}$.
Assumptions: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$; (A3) $L_\psi$ is $\mu$-strongly convex ($\mu$ can be 0).
Theorem. Suppose that $\eta_t/(\sum_{\tau=1}^{t+1} s_\tau)$ is non-decreasing. Then
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{1}{\sum_{t=1}^{T+1} s_t}\left( \sum_{t=1}^{T+1} \frac{s_t^2\, G^2}{2(\sum_{\tau=1}^t s_\tau)(\mu + 1/\eta_{t-1})} + \frac{\sum_{t=1}^{T+2} s_t}{2\eta_{T+1}}\|x^* - x_0\|^2 \right)$.

Computational cost and generalization error
The optimal rate for a strongly convex expected risk (generalization error) is $O(1/n)$ ($n$ is the sample size). Hence, to achieve $O(1/n)$ generalization error, we only need to decrease the training error to $O(1/n)$.

                                Batch gradient descent    SGD
  Time per iteration            n                         1
  Iterations until ϵ error      log(1/ϵ)                  1/ϵ
  Time until ϵ error            n log(1/ϵ)                1/ϵ
  Time until 1/n error          n log(n)                  n

(Bottou, 2010.) SGD is faster by a factor of $O(\log n)$ with respect to the generalization error.

Typical behavior
[Figure: training error and generalization error versus elapsed time (s) for SGD and the batch gradient method.]
Normal gradient descent vs. SGD on logistic regression with $L_1$ regularization ($n = 10{,}000$, $p = 2$): SGD decreases the objective rapidly, and after a while the batch gradient method catches up and slightly surpasses it.

Summary of Part I
- Overview of stochastic optimization
- Convex analysis
- First-order methods: proximal gradient descent
- Online stochastic optimization: stochastic gradient descent (SGD) and stochastic regularized dual averaging (SRDA); $RG/\sqrt{T}$ convergence in general, $G^2/(\mu T)$ convergence for a strongly convex risk
Next part: faster algorithms for online stochastic optimization, and batch stochastic optimization methods.

References
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
L. Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.
L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010. Springer, 2010.
L. Bottou and Y. LeCun. Large scale online learning. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, 2004.
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.
O. Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM Journal on Control and Optimization, 29(2):403-419, 1991.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315-323. Curran Associates, Inc., 2013.
N. Le Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2663-2671. Curran Associates, Inc., 2012.
A. Nemirovskii and D. Yudin. On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions. Soviet Mathematics Doklady, 19(2), 1978.
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning. Omnipress, 2012.
H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.
F. Rosenblatt. The perceptron: A perceiving and recognizing automaton. Technical Report 85-460-1, Project PARA, Cornell Aeronautical Laboratory, 1957.
D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567-599, 2013.
Y. Singer and J. C. Duchi. Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems 22, pages 495-503, 2009.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 22, 2009.


Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Notes on AdaGrad. Joseph Perla 2014

Notes on AdaGrad. Joseph Perla 2014 Notes on AdaGrad Joseph Perla 2014 1 Introduction Stochastic Gradient Descent (SGD) is a common online learning algorithm for optimizing convex (and often non-convex) functions in machine learning today.

More information

Classification Logistic Regression

Classification Logistic Regression Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham

More information

Accelerating Online Convex Optimization via Adaptive Prediction

Accelerating Online Convex Optimization via Adaptive Prediction Mehryar Mohri Courant Institute and Google Research New York, NY 10012 mohri@cims.nyu.edu Scott Yang Courant Institute New York, NY 10012 yangs@cims.nyu.edu Abstract We present a powerful general framework

More information

BASICS OF CONVEX ANALYSIS

BASICS OF CONVEX ANALYSIS BASICS OF CONVEX ANALYSIS MARKUS GRASMAIR 1. Main Definitions We start with providing the central definitions of convex functions and convex sets. Definition 1. A function f : R n R + } is called convex,

More information

Convex Functions. Pontus Giselsson

Convex Functions. Pontus Giselsson Convex Functions Pontus Giselsson 1 Today s lecture lower semicontinuity, closure, convex hull convexity preserving operations precomposition with affine mapping infimal convolution image function supremum

More information

Full-information Online Learning

Full-information Online Learning Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods 2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University

More information

CS260: Machine Learning Algorithms

CS260: Machine Learning Algorithms CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {

More information

Dual Proximal Gradient Method

Dual Proximal Gradient Method Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information