Stochastic Optimization Part I: Convex analysis and online stochastic optimization


Stochastic Optimization Part I: Convex analysis and online stochastic optimization
Taiji Suzuki
Tokyo Institute of Technology, Graduate School of Information Science and Engineering, Department of Mathematical and Computing Sciences

Outline
1. Introduction
2. Short course on convex analysis: convexity and related concepts; duality; smoothness and strong convexity
3. First-order methods: proximal gradient descent; Nesterov's acceleration and optimal convergence
4. Online stochastic optimization: stochastic gradient descent; stochastic regularized dual averaging

Lecture plan
Part I: introduction to stochastic optimization; convex analysis; first-order methods; online stochastic optimization methods (SGD, SRDA).
Part II: faster stochastic gradient methods and batch methods; AdaGrad, acceleration of SGD; batch stochastic optimization methods (SDCA, SVRG, SAG).
Part III: stochastic ADMM and distributed optimization; stochastic ADMM for structured sparsity; distributed optimization (very briefly).

Outline: 1. Introduction

Machine learning as optimization
Machine learning is a methodology to deal with a lot of uncertain data.
Generalization error minimization: $\min_{\theta \in \Theta} \mathbb{E}_Z[\ell_\theta(Z)]$.
Empirical approximation: $\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \ell_\theta(z_i)$.
Stochastic optimization is an intersection of learning and optimization.

[Figure: a stream of data $x_1, x_2, x_3, x_4, \dots$]
Recently, stochastic optimization has been used to treat huge data:
$\frac{1}{n}\sum_{i=1}^n \ell_\theta(z_i) + \psi(\theta)$, where $n$ is huge.
How can we optimize this efficiently? Do we need to go through the whole data set at every iteration?

History of stochastic optimization for ML
- 1951 Robbins and Monro: stochastic approximation for the root finding problem
- 1957 Rosenblatt: perceptron
- 1978 Nemirovskii and Yudin: robustification via averaging for non-smooth objectives
- 1988 Ruppert: robust step size policy and averaging
- 1992 Polyak and Juditsky: averaging for smooth objectives
- 1998 Bottou: online stochastic optimization
- 2004 Bottou and LeCun: online methods for large-scale ML tasks
- 2009-2011 Singer and Duchi; Duchi et al.; Xiao: FOBOS, AdaGrad, RDA
- 2012 Le Roux et al.: linear convergence on batch data (SAG)
- 2013 Shalev-Shwartz and Zhang (SDCA); Johnson and Zhang (SVRG)

Overview of stochastic optimization: $\min_x f(x)$
Stochastic approximation (SA): optimization for systems with uncertainty, e.g., machine control, traffic management, social science, and so on. We observe $g_t = \nabla f(x^{(t)}) + \xi_t$, where $\xi_t$ is noise (typically i.i.d.).
Stochastic approximation for machine learning and statistics: typically generalization error minimization, $\min_x f(x) = \min_x \mathbb{E}_Z[\ell(Z, x)]$. We observe $g_t = \nabla_x \ell(z_t, x^{(t)})$, where $z_t \sim P(Z)$ is i.i.d. data and $\ell(z, x)$ is a loss function, e.g., the logistic loss $\ell((w, y), x) = \log(1 + \exp(-y w^\top x))$ for $z = (w, y) \in \mathbb{R}^p \times \{\pm 1\}$.
Used for huge data sets: we don't need exact optimization; optimization up to a certain precision (typically $O(1/n)$) is sufficient.
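As a concrete illustration of the observation model above, here is a minimal Python/NumPy sketch (the helper name and the toy data are hypothetical) of drawing one observation and computing the stochastic gradient of the logistic loss:

```python
import numpy as np

def logistic_grad(w, y, x):
    """Gradient w.r.t. x of the logistic loss log(1 + exp(-y * w^T x))."""
    return -y * w / (1.0 + np.exp(y * w.dot(x)))

rng = np.random.default_rng(0)
p = 5
x_t = np.zeros(p)                    # current iterate x^(t)
w, y = rng.normal(size=p), 1.0       # one observation z_t = (w, y) ~ P(Z)
g_t = logistic_grad(w, y, x_t)       # the noisy first-order information g_t
```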

Two types of stochastic optimization
Online-type stochastic optimization: we observe data sequentially, and each observation is used just once (basically):
$\min_x \mathbb{E}_Z[\ell(Z, x)]$.
Batch-type stochastic optimization: the whole sample has already been observed, and we may use the training data multiple times:
$\min_x \frac{1}{n}\sum_{i=1}^n \ell(z_i, x)$.

Summary of convergence rates
Online methods (expected risk minimization):
- $\frac{GR}{\sqrt{T}}$ (non-smooth, non-strongly convex)
- $\frac{G^2}{\mu T}$ (non-smooth, strongly convex)
- $\frac{\sigma R}{\sqrt{T}} + \frac{L R^2}{T^2}$ (smooth, non-strongly convex)
- $\frac{\sigma^2}{\mu T} + \exp\left(-\frac{\mu}{L}T\right)$ (smooth, strongly convex)
Batch methods (empirical risk minimization):
- $\exp\left(-\frac{T}{n + L/\mu}\right)$ (smooth loss, strongly convex regularization)
- $\exp\left(-\frac{T}{n + \sqrt{nL/\mu}}\right)$ (smooth loss, strongly convex regularization, with acceleration)
$G$: upper bound on the norm of the gradient; $R$: diameter of the domain; $L$: smoothness; $\mu$: strong convexity; $\sigma$: variance of the gradient.


Example of empirical risk minimization: high-dimensional data analysis
Redundant information deteriorates the estimation accuracy.
[Figures: bio-informatics, text data, image data.]

Sparse estimation
Cut off redundant information $\to$ sparsity.
R. Tibshirani (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288.


Lasso estimator
Lasso ($L_1$ regularization), a convex optimization problem:
$\hat{\beta}_{\mathrm{Lasso}} = \mathrm{argmin}_{\beta\in\mathbb{R}^p} \|Y - X\beta\|^2 + \lambda\|\beta\|_1$, where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$.
- The $L_1$ norm is the convex hull of the $L_0$ norm on $[-1, 1]^p$ (the largest convex function that supports it from below).
- The $L_1$ norm is the Lovász extension of the cardinality function.
More generally, for a loss function $\ell$ (logistic loss, hinge loss, ...):
$\min_x \left\{ \sum_{i=1}^n \ell(z_i, x) + \lambda\|x\|_1 \right\}$.

Outline: 2. Short course on convex analysis

Regularized empirical risk minimization
Basically, we want to solve:
- empirical risk minimization: $\min_{x\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \ell(z_i, x)$;
- regularized empirical risk minimization: $\min_{x\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \ell(z_i, x) + \psi(x)$.
In this lecture we assume $\ell$ and $\psi$ are convex, and use convex analysis to exploit the properties of convex functions.

Outline: 2. Short course on convex analysis (convexity and related concepts)

Convex set
Definition (convex set). A convex set is a set that contains the segment connecting any two points in the set:
$x_1, x_2 \in C \implies \theta x_1 + (1-\theta)x_2 \in C \quad (\forall \theta \in [0, 1])$.

Epigraph and domain
Let $\bar{\mathbb{R}} := \mathbb{R} \cup \{\infty\}$.
Definition (epigraph and domain). The epigraph of a function $f: \mathbb{R}^p \to \bar{\mathbb{R}}$ is
$\mathrm{epi}(f) := \{(x, \mu) \in \mathbb{R}^{p+1} : f(x) \le \mu\}$,
and the domain of $f$ is
$\mathrm{dom}(f) := \{x \in \mathbb{R}^p : f(x) < \infty\}$.

Convex function
Definition (convex function). A function $f: \mathbb{R}^p \to \bar{\mathbb{R}}$ is a convex function if
$\theta f(x) + (1-\theta) f(y) \ge f(\theta x + (1-\theta) y) \quad (\forall x, y \in \mathbb{R}^p,\ \theta \in [0, 1])$,
with the convention $\infty + \infty = \infty$.
$f$ is convex $\iff$ $\mathrm{epi}(f)$ is a convex set.

Proper and closed convex functions
- If the domain of a function $f$ is not empty ($\mathrm{dom}(f) \ne \emptyset$), $f$ is called proper.
- If the epigraph of a convex function $f$ is a closed set, then $f$ is called closed. (We are interested only in proper closed functions in this lecture.)
- Even if $f$ is closed, its domain is not necessarily closed (even in one dimension).
- That $f$ is closed does not imply that $f$ is continuous; a closed convex function is continuous on any segment in its domain.
- A closed function is lower semicontinuous.

Convex loss functions (regression)
All widely used loss functions are (closed) convex. The following are convex w.r.t. $u$ for a fixed label $y \in \mathbb{R}$:
- Squared loss: $\ell(y, u) = \frac{1}{2}(y - u)^2$.
- $\tau$-quantile loss: $\ell(y, u) = (1-\tau)\max\{u - y, 0\} + \tau\max\{y - u, 0\}$ for some $\tau \in (0, 1)$; used for quantile regression.
- $\epsilon$-sensitive loss: $\ell(y, u) = \max\{|y - u| - \epsilon, 0\}$ for some $\epsilon > 0$; used for support vector regression.
[Figure: the three losses plotted against $u - y$.]

Convex surrogate losses (classification), $y \in \{\pm 1\}$
- Logistic loss: $\ell(y, u) = \log((1 + \exp(-yu))/2)$.
- Hinge loss: $\ell(y, u) = \max\{1 - yu, 0\}$.
- Exponential loss: $\ell(y, u) = \exp(-yu)$.
- Smoothed hinge loss:
$\ell(y, u) = \begin{cases} 0 & (yu \ge 1), \\ \frac{1}{2} - yu & (yu < 0), \\ \frac{1}{2}(1 - yu)^2 & (\text{otherwise}). \end{cases}$
[Figure: the losses plotted against $yu$.]

Convex regularization functions
- Ridge regularization: $R(x) = \|x\|_2^2 := \sum_{j=1}^p x_j^2$.
- $L_1$ regularization: $R(x) = \|x\|_1 := \sum_{j=1}^p |x_j|$.
- Trace norm regularization: $R(X) = \|X\|_{\mathrm{tr}} = \sum_{j=1}^{\min\{q, r\}} \sigma_j(X)$, where $\sigma_j(X) \ge 0$ is the $j$-th singular value.
Examples:
- $\frac{1}{n}\sum_{i=1}^n (y_i - z_i^\top x)^2 + \lambda\|x\|_1$: Lasso.
- $\frac{1}{n}\sum_{i=1}^n \log(1 + \exp(-y_i \langle z_i, X \rangle)) + \lambda\|X\|_{\mathrm{tr}}$: low-rank matrix recovery.

Other definitions of sets
- Convex hull: $\mathrm{conv}(C)$ is the smallest convex set that contains a set $C \subseteq \mathbb{R}^p$.
- Affine set: a set $A$ is an affine set if and only if, for all $x, y \in A$, the line through $x$ and $y$ lies in $A$: $\lambda x + (1-\lambda)y \in A$ ($\forall \lambda \in \mathbb{R}$).
- Affine hull: the smallest affine set that contains a set $C \subseteq \mathbb{R}^p$.
- Relative interior $\mathrm{ri}(C)$: let $A$ be the affine hull of a convex set $C \subseteq \mathbb{R}^p$; $\mathrm{ri}(C)$ is the set of interior points of $C$ with respect to the relative topology induced by the affine hull $A$.

Continuity of a closed convex function
Theorem. For a (possibly non-convex) function $f: \mathbb{R}^p \to \bar{\mathbb{R}}$, the following three conditions are equivalent:
1. $f$ is lower semi-continuous;
2. for any convergent sequence $\{x_n\}_{n=1}^\infty \subseteq \mathbb{R}^p$ with $x_\infty = \lim_{n\to\infty} x_n$, we have $\liminf_{n\to\infty} f(x_n) \ge f(x_\infty)$;
3. $f$ is closed.
Remark: any convex function $f$ is continuous on $\mathrm{ri}(\mathrm{dom}(f))$; continuity can break down on the boundary of the domain.

Outline: 2. Short course on convex analysis (duality)

Subgradient
We want to deal with non-differentiable functions such as the $L_1$ regularization, so we need an object that plays the role of the gradient.
Definition (subdifferential, subgradient). For a proper convex function $f: \mathbb{R}^p \to \bar{\mathbb{R}}$, the subdifferential of $f$ at $x \in \mathrm{dom}(f)$ is defined by
$\partial f(x) := \{g \in \mathbb{R}^p \mid \langle x' - x, g \rangle + f(x) \le f(x') \ (\forall x' \in \mathbb{R}^p)\}$.
An element of the subdifferential is called a subgradient.

Properties of the subgradient
- A subgradient does not necessarily exist ($\partial f(x)$ can be empty): $f(x) = x\log(x)$ ($x \ge 0$) is proper convex but not subdifferentiable at $x = 0$. A subgradient always exists on $\mathrm{ri}(\mathrm{dom}(f))$.
- If $f$ is differentiable at $x$, its gradient is the unique element of the subdifferential: $\partial f(x) = \{\nabla f(x)\}$.
- If $\mathrm{ri}(\mathrm{dom}(f)) \cap \mathrm{ri}(\mathrm{dom}(h)) \ne \emptyset$, then
$\partial(f + h)(x) = \partial f(x) + \partial h(x) = \{g + g' \mid g \in \partial f(x),\ g' \in \partial h(x)\}$ ($\forall x \in \mathrm{dom}(f) \cap \mathrm{dom}(h)$).
- Monotonicity: for all $g \in \partial f(x)$ and all $g' \in \partial f(x')$ ($x, x' \in \mathrm{dom}(f)$), $\langle g - g', x - x' \rangle \ge 0$.
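For instance, $\psi(x) = \|x\|_1$ is subdifferentiable everywhere; a minimal sketch (hypothetical helper name) returning one valid subgradient coordinate-wise:

```python
import numpy as np

def subgrad_l1(x):
    """One element of the subdifferential of ||x||_1:
    sign(x_j) where x_j != 0; at x_j = 0 any value in [-1, 1] works (we pick 0)."""
    return np.sign(x)
```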

Legendre transform
This defines the other representation of a function on the dual space (the space of gradients).
Definition (Legendre transform). Let $f$ be a (possibly non-convex) function $f: \mathbb{R}^p \to \bar{\mathbb{R}}$ with $\mathrm{dom}(f) \ne \emptyset$. Its convex conjugate is given by
$f^*(y) := \sup_{x\in\mathbb{R}^p} \{\langle x, y \rangle - f(x)\}$.
The map from $f$ to $f^*$ is called the Legendre transform.
[Figure: $f(x)$ with the supporting line of slope $y$ touching at $x^*$, from which $f^*(y)$ is read off.]

Examples
- Squared loss: $f(x) = \frac{1}{2}x^2$, $f^*(y) = \frac{1}{2}y^2$.
- Hinge loss: $f(x) = \max\{1 - x, 0\}$, $f^*(y) = y$ for $-1 \le y \le 0$, and $\infty$ otherwise.
- Logistic loss: $f(x) = \log(1 + \exp(-x))$, $f^*(y) = (-y)\log(-y) + (1+y)\log(1+y)$ for $-1 \le y \le 0$, and $\infty$ otherwise.
- $L_1$ regularization: $f(x) = \|x\|_1$, $f^*(y) = 0$ if $\max_j |y_j| \le 1$, and $\infty$ otherwise.
- $L_p$ regularization ($p > 1$): $f(x) = \frac{1}{p}\sum_{j=1}^d |x_j|^p$, $f^*(y) = \frac{p-1}{p}\sum_{j=1}^d |y_j|^{\frac{p}{p-1}}$.
[Figures: the logistic loss and its dual; the $L_1$ norm and its dual.]
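The one-dimensional entries of this table can be verified numerically; a sketch that approximates $f^*(y) = \sup_x\{xy - f(x)\}$ on a grid, assuming the supremum is attained inside the grid:

```python
import numpy as np

xs = np.linspace(-20.0, 20.0, 200001)      # primal grid (assumed to contain the maximizer)

def conjugate(f, y):
    """Grid approximation of f*(y) = sup_x { x*y - f(x) }."""
    return np.max(xs * y - f(xs))

squared = lambda x: 0.5 * x ** 2
logistic = lambda x: np.log1p(np.exp(-x))

print(conjugate(squared, 0.7), 0.5 * 0.7 ** 2)    # both ~0.245
y = -0.3                                          # logistic conjugate is finite on [-1, 0]
print(conjugate(logistic, y), -y * np.log(-y) + (1 + y) * np.log(1 + y))   # both ~-0.611
```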

Properties of the Legendre transform
- $f^*$ is a convex function even if $f$ is not.
- $f^{**}$ is the closure of the convex hull of $f$: $f^{**} = \mathrm{cl}(\mathrm{conv}(f))$.
- Corollary: the Legendre transform is a bijection from the set of proper closed convex functions onto the set of proper closed convex functions on the dual space:
$f$ (proper closed convex) $\leftrightarrow$ $f^*$ (proper closed convex), with $f^{**} = f$.

Connection to the subgradient
Lemma. $y \in \partial f(x) \iff f(x) + f^*(y) = \langle x, y \rangle \iff x \in \partial f^*(y)$.
Indeed, $y \in \partial f(x) \iff x \in \mathrm{argmax}_{x'\in\mathbb{R}^p}\{\langle x', y \rangle - f(x')\}$ (set the derivative of $\langle x', y \rangle - f(x')$ to zero) $\implies f^*(y) = \langle x, y \rangle - f(x)$.
Remark: by definition, we always have the Young-Fenchel inequality
$f(x) + f^*(y) \ge \langle x, y \rangle$.

Fenchel's duality theorem
Theorem (Fenchel's duality theorem). Let $f: \mathbb{R}^p \to \bar{\mathbb{R}}$ and $g: \mathbb{R}^q \to \bar{\mathbb{R}}$ be proper closed convex, and let $A \in \mathbb{R}^{q\times p}$. Suppose that condition (a) or (b) below is satisfied. Then
$\inf_{x\in\mathbb{R}^p}\{f(x) + g(Ax)\} = \sup_{y\in\mathbb{R}^q}\{-f^*(A^\top y) - g^*(-y)\}$.
(a) $\exists x \in \mathbb{R}^p$ s.t. $x \in \mathrm{ri}(\mathrm{dom}(f))$ and $Ax \in \mathrm{ri}(\mathrm{dom}(g))$.
(b) $\exists y \in \mathbb{R}^q$ s.t. $A^\top y \in \mathrm{ri}(\mathrm{dom}(f^*))$ and $-y \in \mathrm{ri}(\mathrm{dom}(g^*))$.
If (a) is satisfied, there exists $y^* \in \mathbb{R}^q$ attaining the sup on the RHS; if (b) is satisfied, there exists $x^* \in \mathbb{R}^p$ attaining the inf on the LHS. Under (a) and (b), $x^*, y^*$ are the optimal solutions of the two sides iff the Karush-Kuhn-Tucker conditions hold:
$A^\top y^* \in \partial f(x^*), \quad Ax^* \in \partial g^*(-y^*)$.

Equivalence to the separation theorem
[Figure omitted.]

Applying Fenchel's duality theorem to RERM
RERM (Regularized Empirical Risk Minimization): let $\ell_i(z_i^\top x) = \ell(y_i, z_i^\top x)$, where $(z_i, y_i)$ is the input-output pair of the $i$-th observation, and write $f(Zx) = \sum_{i=1}^n \ell_i(z_i^\top x)$.
(Primal) $\inf_{x\in\mathbb{R}^p}\left\{ \sum_{i=1}^n \ell_i(z_i^\top x) + \psi(x) \right\}$
By Fenchel's duality theorem,
$\inf_{x\in\mathbb{R}^p}\{f(Zx) + \psi(x)\} = \sup_{y\in\mathbb{R}^n}\{-f^*(y) - \psi^*(-Z^\top y)\}$,
(Dual) $\sup_{y\in\mathbb{R}^n}\left\{ -\sum_{i=1}^n \ell_i^*(y_i) - \psi^*(-Z^\top y) \right\}$.
This fact will be used to derive the dual coordinate descent algorithm.

Outline: 2. Short course on convex analysis (smoothness and strong convexity)

Smoothness and strong convexity
Definition.
- Smoothness: the gradient is Lipschitz continuous, $\|\nabla f(x) - \nabla f(x')\| \le L\|x - x'\|$.
- Strong convexity: for all $\theta \in (0, 1)$ and $x, y \in \mathrm{dom}(f)$,
$\frac{\mu}{2}\theta(1-\theta)\|x - y\|^2 + f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta) f(y)$.
[Figures: smooth but not strongly convex; smooth and strongly convex; strongly convex but not smooth.]

Duality between smoothness and strong convexity
Smoothness and strong convexity are dual to each other.
Theorem. Let $f: \mathbb{R}^p \to \bar{\mathbb{R}}$ be proper closed convex. Then $f$ is $L$-smooth $\iff$ $f^*$ is $1/L$-strongly convex.
[Figure: the logistic loss (smooth but not strongly convex) and its dual function (strongly convex but not smooth, with diverging gradient).]

Outline: 3. First-order methods


Regularized learning problems
Lasso: $\min_{x\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n (y_i - z_i^\top x)^2 + \underbrace{\|x\|_1}_{\text{regularization}}$.
General regularized learning problem: $\min_{x\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \ell(z_i, x) + \psi(x)$.
Difficulty: sparsity-inducing regularization is usually non-smooth.

First-order optimization
[Figure: iterates $x_{t-1}, x_t, x_{t+1}$.]
First-order methods use only the function value $f(x)$ and a first-order (sub)gradient $g \in \partial f(x)$. The computation per iteration is light, so they are suited to high-dimensional problems. (Newton's method, by contrast, is a second-order method.)

Outline: 3. First-order methods (proximal gradient descent)

Gradient descent
Let $f(x) = \sum_{i=1}^n \ell(z_i, x)$ and consider $\min_x f(x)$.
Subgradient method (differentiable $f$):
$x_t = x_{t-1} - \eta_t \nabla f(x_{t-1})$.

Gradient descent
Subgradient method (subdifferentiable $f$):
$g_t \in \partial f(x_{t-1}), \quad x_t = x_{t-1} - \eta_t g_t$.

Gradient descent
Subgradient method (equivalent formulation): with $g_t \in \partial f(x_{t-1})$,
$x_t = \mathrm{argmin}_x \left\{ \langle x, g_t \rangle + \frac{1}{2\eta_t}\|x - x_{t-1}\|^2 \right\}$.
Proximal point algorithm:
$x_t = \mathrm{argmin}_x \left\{ f(x) + \frac{1}{2\eta_t}\|x - x_{t-1}\|^2 \right\}$.
$f(x_t)$ converges to the optimum for any convex $f$ and fixed $\eta_t = \eta > 0$ (Güler, 1991). If $f$ is $\sigma$-strongly convex:
$f(x_t) - f(x^*) \le \frac{1}{2\eta}\left(\frac{1}{1+\sigma\eta}\right)^{t-1} \|x_0 - x^*\|^2$.
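A minimal sketch of the subgradient method on the toy non-smooth objective $f(x) = \|x - a\|_1$ (all names and data are made up; the $1/\sqrt{t}$ step size matches the non-smooth, non-strongly convex setting discussed below):

```python
import numpy as np

a = np.array([1.0, -2.0, 3.0])          # minimizer of f(x) = ||x - a||_1

x = np.zeros_like(a)
for t in range(1, 1001):
    g = np.sign(x - a)                  # g_t in the subdifferential of f at x_{t-1}
    x = x - (1.0 / np.sqrt(t)) * g      # eta_t = 1/sqrt(t)
print(np.round(x, 2))                   # close to a (within the final step size)
```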

Proximal gradient descent
Let $f(x) = \sum_{i=1}^n \ell(z_i, x)$ and consider $\min_x f(x) + \psi(x)$.
Proximal gradient descent: with $g_t \in \partial f(x_{t-1})$,
$x_t = \mathrm{argmin}_x \left\{ \langle x, g_t \rangle + \psi(x) + \frac{1}{2\eta_t}\|x - x_{t-1}\|^2 \right\} = \mathrm{argmin}_x \left\{ \eta_t \psi(x) + \frac{1}{2}\|x - (x_{t-1} - \eta_t g_t)\|^2 \right\}$.
The update rule is given by the proximal mapping
$\mathrm{prox}(q \mid \psi) = \mathrm{argmin}_x \left\{ \psi(x) + \frac{1}{2}\|x - q\|^2 \right\}$,
i.e., $x_t = \mathrm{prox}(x_{t-1} - \eta_t g_t \mid \eta_t\psi)$. By using the proximal mapping, we can avoid the bad properties (e.g., non-smoothness) of $\psi$.
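The update in code, as a sketch with the proximal mapping passed in as a function (the helper name proximal_gradient is hypothetical; $f$ is assumed differentiable so that $g_t = \nabla f(x_{t-1})$):

```python
import numpy as np

def proximal_gradient(grad_f, prox_psi, x0, eta, T):
    """Iterate x_t = prox(x_{t-1} - eta * grad_f(x_{t-1}) | eta * psi)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(T):
        x = prox_psi(x - eta * grad_f(x), eta)
    return x
```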

Example
$L_1$ regularization, $\psi(x) = C\|x\|_1$: the update is coordinate-wise soft-thresholding,
$x_{t,j} = \mathrm{ST}_{C\eta_t}(x_{t-1,j} - \eta_t g_{t,j})$ (the $j$-th component), where $\mathrm{ST}_C(q) = \mathrm{sign}(q)\max\{|q| - C, 0\}$.
Unimportant elements are forced to be exactly 0. For many practically used regularizations, the proximal mapping has an analytic form; see the sketch below.
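The soft-thresholding operator in code, combined with the proximal_gradient sketch above to solve a small Lasso instance (data and $\lambda$ are made up):

```python
import numpy as np

def soft_threshold(q, c):
    """ST_c(q) = sign(q) * max(|q| - c, 0), applied elementwise."""
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 0.1
Z = rng.normal(size=(n, p))
y = Z[:, 0] - 2.0 * Z[:, 1] + 0.1 * rng.normal(size=n)    # sparse ground truth

grad_f = lambda x: Z.T @ (Z @ x - y) / n                  # gradient of the averaged squared loss
prox_psi = lambda q, eta: soft_threshold(q, eta * lam)    # prox of eta * lam * ||.||_1
eta = 1.0 / np.linalg.eigvalsh(Z.T @ Z / n).max()         # 1/gamma, gamma = smoothness of f

x_hat = proximal_gradient(grad_f, prox_psi, np.zeros(p), eta, 500)
print(np.round(x_hat, 2))                                 # most coordinates are exactly 0
```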

Example of proximal mapping (cont.)
Trace norm: $\psi(X) = C\|X\|_{\mathrm{tr}} = C\sum_j \sigma_j(X)$ (the sum of the singular values). Let $X_{t-1} - \eta_t G_t = U\,\mathrm{diag}(\sigma_1, \dots, \sigma_d)\,V^\top$; then
$X_t = U\,\mathrm{diag}\big(\mathrm{ST}_{C\eta_t}(\sigma_1), \dots, \mathrm{ST}_{C\eta_t}(\sigma_d)\big)\,V^\top$.
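A sketch of this proximal mapping via a full SVD (fine for small matrices; the helper name is hypothetical):

```python
import numpy as np

def prox_trace_norm(Q, c):
    """argmin_X { c*||X||_tr + (1/2)*||X - Q||_F^2 }: soft-threshold the singular values of Q."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return U @ np.diag(np.maximum(s - c, 0.0)) @ Vt
```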


Convergence of proximal gradient descent
The strong convexity and smoothness of $f$ determine the convergence rate of $x_t = \mathrm{prox}(x_{t-1} - \eta_t g_t \mid \eta_t\psi)$.
Convergence rate, by property of $f$:
- $\gamma$-smooth: $\exp(-t\,\mu/\gamma)$ if $\mu$-strongly convex; $\gamma/t$ if non-strongly convex.
- non-smooth: $1/(\mu t)$ if $\mu$-strongly convex; $1/\sqrt{t}$ if non-strongly convex.
The step size $\eta_t$ should be chosen accordingly:
- smooth: $\eta_t = 1/\gamma$ (both cases).
- non-smooth: $\eta_t = 1/(\mu t)$ if strongly convex; $\eta_t = 1/\sqrt{t}$ if non-strongly convex.
To achieve these rates in the non-smooth case, we need to average the iterates $\{x_t\}_t$ appropriately (Polyak-Ruppert averaging, polynomially decaying averaging). For smooth losses, the convergence can be improved to the optimal rate by Nesterov's acceleration (see below).

Outline: 3. First-order methods (Nesterov's acceleration and optimal convergence)

Nesterov's acceleration (non-strongly convex)
$\min_x \{f(x) + \psi(x)\}$. Assumption: $f(x)$ is $\gamma$-smooth.
Nesterov's acceleration scheme: let $s_1 = 1$ and $\eta = 1/\gamma$, and iterate for $t = 1, 2, \dots$:
1. Let $g_t = \nabla f(y_t)$ and update $x_t = \mathrm{prox}(y_t - \eta g_t \mid \eta\psi)$.
2. Set $s_{t+1} = \frac{1 + \sqrt{1 + 4s_t^2}}{2}$.
3. Update $y_{t+1} = x_t + \left(\frac{s_t - 1}{s_{t+1}}\right)(x_t - x_{t-1})$.
If $f$ is $\gamma$-smooth, then $f(x_t) - f(x^*) \le \frac{2\gamma\|x_0 - x^*\|^2}{t^2}$.
This is also called the Fast Iterative Shrinkage Thresholding Algorithm (FISTA) (Beck and Teboulle, 2009). The step size $\eta = 1/\gamma$ can be determined adaptively by a back-tracking technique.
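The scheme in code, reusing the hypothetical grad_f/prox_psi interface of the earlier sketches:

```python
import numpy as np

def fista(grad_f, prox_psi, x0, eta, T):
    """Proximal gradient with Nesterov momentum (FISTA); eta = 1/gamma."""
    x_prev = np.asarray(x0, dtype=float).copy()
    y, s = x_prev.copy(), 1.0
    for _ in range(T):
        x = prox_psi(y - eta * grad_f(y), eta)
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s * s)) / 2.0
        y = x + ((s - 1.0) / s_next) * (x - x_prev)   # momentum extrapolation
        x_prev, s = x, s_next
    return x_prev
```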

Nesterov's acceleration (strongly convex)
$\min_x \{f(x) + \psi(x)\}$. Assumption: $f(x)$ is $\gamma$-smooth and $\mu$-strongly convex (necessarily $\gamma > \mu$).
Nesterov's acceleration scheme: let $A_1 = 1$, $\alpha_1 = \sqrt{\gamma/\mu}$ and $\eta = 1/\gamma$, and iterate for $t = 1, 2, \dots$:
1. Let $g_t = \nabla f(y_t)$ and update $x_t = \mathrm{prox}(y_t - \eta g_t \mid \eta\psi)$.
2. Set $\alpha_{t+1} > 1$ so that $(\gamma - \mu)\alpha_{t+1}^2 - (2\gamma + A_t)\alpha_{t+1} + \gamma = 0$, and let $A_{t+1} = A_t/\alpha_{t+1}$.
3. Update $y_{t+1} = x_t + \frac{\mu + A_t}{(\gamma - \mu)(\alpha_{t+1} - 1)(\alpha_t - 1)}(x_t - x_{t-1})$.
If $f$ is $\gamma$-smooth and $\mu$-strongly convex, then
$f(x_t) - f(x^*) \le \gamma\left(1 - \sqrt{\mu/\gamma}\right)^t \|x_0 - x^*\|^2$.


[Figure: relative objective $f(x_t) - f^*$ versus iteration for the normal gradient method and Nesterov's acceleration. Nesterov's acceleration vs. normal gradient descent on the Lasso, $n = 8{,}000$.]

Outline: 4. Online stochastic optimization

Two types of stochastic optimization
Online-type stochastic optimization:
- We observe data sequentially; we don't need to wait until the whole sample is obtained.
- Each observation is used just once (basically).
$\min_x \mathbb{E}_Z[\ell(Z, x)]$
Batch-type stochastic optimization:
- The whole sample has already been observed; we can make use of the (finite and fixed) sample size.
- We may use each sample multiple times.
$\min_x \frac{1}{n}\sum_{i=1}^n \ell(z_i, x)$

Online method
You don't need to wait until the whole sample arrives: the parameter is updated at each data observation.
[Figure: observe $z_{t-1}$, update $x_t$, observe $z_t$, update $x_{t+1}$, with $z_t \sim P(z)$.]
$\min_x \mathbb{E}[\ell(Z, x)] + \psi(x)$ or $\min_x \frac{1}{T}\sum_{t=1}^T \ell(z_t, x) + \psi(x)$

The objective of online stochastic optimization
Let $\ell(z, x)$ be the loss of $x$ for an observation $z$:
(expected loss) $L(x) = \mathbb{E}_Z[\ell(Z, x)]$, or (expected loss with regularization) $L_\psi(x) = \mathbb{E}_Z[\ell(Z, x)] + \psi(x)$.
The distribution of $Z$ could be
- the true population, in which case $L(x)$ is the generalization error, or
- the empirical distribution of the data stored in a storage, in which case $L$ (or $L_\psi$) is the (regularized) empirical risk.
Online stochastic optimization itself is learning!

Outline: 4. Online stochastic optimization (stochastic gradient descent)

Three steps to stochastic gradient descent
$\mathbb{E}[\ell(Z, x)] = \int \ell(z, x)\,dP(z)$
$\approx \ell(z_t, x)$ (sampling) (1)
$\approx \ell(z_t, x_{t-1}) + \langle \nabla_x \ell(z_t, x_{t-1}), x - x_{t-1} \rangle$ (linearization) (2)
The approximation is correct just around $x_{t-1}$, so we add a proximity term:
$\min_x \mathbb{E}[\ell(Z, x)] \to \min_x \left\{ \langle \nabla_x \ell(z_t, x_{t-1}), x \rangle + \frac{1}{2\eta_t}\|x - x_{t-1}\|^2 \right\}$ (proximation) (3)

Stochastic gradient descent (SGD)
SGD (without regularization):
- Observe $z_t \sim P(Z)$, and let $\ell_t(x) := \ell(z_t, x)$.
- Calculate a subgradient: $g_t \in \partial_x \ell_t(x_{t-1})$.
- Update $x$ as $x_t = x_{t-1} - \eta_t g_t$.
We only need to observe one training example $z_t$ at each iteration: $O(1)$ computation per iteration ($O(n)$ for batch gradient descent); we do not need to go through the whole sample $\{z_i\}_{i=1}^n$.

Stochastic gradient descent (SGD)
SGD (with regularization):
- Observe $z_t \sim P(Z)$, and let $\ell_t(x) := \ell(z_t, x)$.
- Calculate a subgradient: $g_t \in \partial_x \ell_t(x_{t-1})$.
- Update $x$ as $x_t = \mathrm{prox}(x_{t-1} - \eta_t g_t \mid \eta_t\psi)$.
We only need to observe one training example $z_t$ at each iteration: $O(1)$ computation per iteration ($O(n)$ for batch gradient descent); we do not need to go through the whole sample $\{z_i\}_{i=1}^n$.
Reminder: $\mathrm{prox}(q \mid \psi) := \mathrm{argmin}_x \left\{ \psi(x) + \frac{1}{2}\|x - q\|^2 \right\}$.
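A minimal sketch of proximal SGD for $L_1$-regularized logistic regression on a synthetic stream (all data and names are made up; $\eta_t = 1/\sqrt{t}$ as in the analysis below):

```python
import numpy as np

def soft_threshold(q, c):
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

rng = np.random.default_rng(0)
p, lam, T = 10, 0.01, 10000
x = np.zeros(p)

for t in range(1, T + 1):
    w = rng.normal(size=p)                        # observe z_t = (w, y) ~ P(Z)
    y = 1.0 if w[0] - w[1] + 0.1 * rng.normal() > 0 else -1.0
    g = -y * w / (1.0 + np.exp(y * w.dot(x)))     # g_t: gradient of the logistic loss
    eta = 1.0 / np.sqrt(t)                        # eta_t = eta_0 / sqrt(t)
    x = soft_threshold(x - eta * g, eta * lam)    # x_t = prox(x_{t-1} - eta_t g_t | eta_t psi)
print(np.round(x, 2))
```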

Convergence analysis of SGD
Assumptions: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$; (A2) $\mathbb{E}[\|x_t - x^*\|^2] \le D^2$.
Theorem. Let $\bar{x}_T = \frac{1}{T+1}\sum_{t=0}^T x_t$. For $\eta_t = \eta_0/\sqrt{t}$, it holds that
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{\eta_0 G^2 + D^2/\eta_0}{\sqrt{T}}$.
For $\eta_0 = D/G$, the bound becomes $\frac{2GD}{\sqrt{T}}$.
- This is minimax optimal (up to a constant).
- $G$ is independent of $\psi$ thanks to the proximal mapping; note that a subgradient of $\psi(x) = C\|x\|_1$ can have norm as large as $C\sqrt{p}$.

Convergence analysis of SGD (strongly convex)
Assumptions: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$; (A3) $L_\psi$ is $\mu$-strongly convex.
Theorem. Let $\bar{x}_T = \frac{1}{T+1}\sum_{t=0}^T x_t$. For $\eta_t = \frac{1}{\mu t}$, it holds that
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{G^2 \log T}{\mu T}$.
Better than the non-strongly convex situation, but not minimax optimal; the bound is tight (Rakhlin et al., 2012).

Polynomial averaging for a strongly convex risk
Assumptions: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$; (A3) $L_\psi$ is $\mu$-strongly convex.
Modify the update rule as
$x_t = \mathrm{prox}\!\left(x_{t-1} - \eta_t \tfrac{t}{t+1} g_t \,\middle|\, \eta_t\psi\right)$,
and take the weighted average
$\bar{x}_T = \frac{2}{(T+1)(T+2)}\sum_{t=0}^T (t+1)x_t$.
Theorem. For $\eta_t = \frac{2}{\mu t}$, it holds that $\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{2G^2}{\mu T}$.
The $\log T$ factor is removed. This is minimax optimal (explained later).

Remark on polynomial averaging
Does computing $\bar{x}_T = \frac{2}{(T+1)(T+2)}\sum_{t=0}^T (t+1)x_t$ require $O(T)$ work? No: $\bar{x}_t$ can be updated efficiently,
$\bar{x}_t = \frac{t}{t+2}\bar{x}_{t-1} + \frac{2}{t+2}x_t$.
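Both ideas in code: a sketch on a toy strongly convex problem with $\psi = 0$ (so the prox is the identity), keeping the polynomially weighted average with the $O(1)$ recursion above (all data made up):

```python
import numpy as np

mu, T = 1.0, 5000
a = np.array([2.0, -1.0])     # minimizer of the mu-strongly convex f(x) = ||x - a||^2 / 2
rng = np.random.default_rng(0)

x = np.zeros_like(a)
x_bar = x.copy()
for t in range(1, T + 1):
    g = (x - a) + rng.normal(size=2)                   # noisy gradient g_t
    x = x - (2.0 / (mu * t)) * (t / (t + 1.0)) * g     # step with the t/(t+1) gradient weight
    x_bar = (t / (t + 2.0)) * x_bar + (2.0 / (t + 2.0)) * x   # O(1) update of the average
print(np.round(x_bar, 2))                              # close to a
```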

General step size and weighting policy
Let $s_t$ ($t = 1, 2, \dots, T+1$) be a positive sequence such that $\sum_{t=1}^{T+1} s_t = 1$, and set
$x_t = \mathrm{prox}\!\left(x_{t-1} - \eta_t \tfrac{s_t}{s_{t+1}} g_t \,\middle|\, \eta_t\psi\right)$ ($t = 1, \dots, T$), $\quad \bar{x}_T = \sum_{t=0}^T s_{t+1} x_t$.
Assumptions: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$; (A2) $\mathbb{E}[\|x_t - x^*\|^2] \le D^2$; (A3) $L_\psi$ is $\mu$-strongly convex.
Theorem.
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \sum_{t=1}^T \frac{s_{t+1}\eta_{t+1}}{2} G^2 + \sum_{t=0}^T \frac{1}{2}\max\!\left\{ \frac{s_{t+2}}{\eta_{t+1}} - s_{t+1}\!\left(\frac{1}{\eta_t} + \mu\right),\, 0 \right\} D^2$,
where we set $1/\eta_0 = 0$ for the $t = 0$ term.

Special case
Let the weights be proportional to the step sizes (the step size can be seen as an importance):
$s_t = \frac{\eta_t}{\sum_{\tau=1}^{T+1} \eta_\tau}$.
In this setting, the previous theorem gives
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{\sum_{t=1}^T \eta_t^2 G^2 + D^2}{2\sum_{t=1}^T \eta_t}$,
so the conditions $\sum_{t=1}^\infty \eta_t = \infty$ and $\sum_{t=1}^\infty \eta_t^2 < \infty$ ensure convergence.

Outline: 4. Online stochastic optimization (stochastic regularized dual averaging)

Stochastic regularized dual averaging (SRDA)
The second assumption $\mathbb{E}[\|x_t - x^*\|^2] \le D^2$ can be removed by using dual averaging (Nesterov, 2009; Xiao, 2009).
SRDA:
- Observe $z_t \sim P(Z)$, and let $\ell_t(x) := \ell(z_t, x)$.
- Calculate a gradient: $g_t \in \partial_x \ell_t(x_{t-1})$.
- Take the average of the gradients: $\bar{g}_t = \frac{1}{t}\sum_{\tau=1}^t g_\tau$.
- Update as
$x_t = \mathrm{argmin}_{x\in\mathbb{R}^p}\left\{ \langle \bar{g}_t, x \rangle + \psi(x) + \frac{1}{2\eta_t}\|x\|^2 \right\} = \mathrm{prox}(-\eta_t \bar{g}_t \mid \eta_t\psi)$.
The information from old observations is maintained by averaging the gradients.
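A sketch of SRDA for the same synthetic $L_1$-regularized logistic model as before (hypothetical setup; $\eta_t = \eta_0\sqrt{t}$ as in the next theorem):

```python
import numpy as np

def soft_threshold(q, c):
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

rng = np.random.default_rng(0)
p, lam, eta0, T = 10, 0.01, 1.0, 10000
x, g_sum = np.zeros(p), np.zeros(p)

for t in range(1, T + 1):
    w = rng.normal(size=p)
    y = 1.0 if w[0] - w[1] + 0.1 * rng.normal() > 0 else -1.0
    g_sum += -y * w / (1.0 + np.exp(y * w.dot(x)))    # accumulate g_1, ..., g_t
    eta = eta0 * np.sqrt(t)                           # eta_t = eta_0 * sqrt(t)
    x = soft_threshold(-eta * g_sum / t, eta * lam)   # x_t = prox(-eta_t * bar_g_t | eta_t psi)
print(np.round(x, 2))
```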

Convergence analysis of SRDA
Assumption: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$.
Theorem. Let $\bar{x}_T = \frac{1}{T+1}\sum_{t=0}^T x_t$. For $\eta_t = \eta_0\sqrt{t}$, it holds that
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{\eta_0 G^2 + \|x^* - x_0\|^2/\eta_0}{\sqrt{T}}$.
If $\|x^* - x_0\| \le R$, then $\eta_0 = R/G$ gives $\frac{2RG}{\sqrt{T}}$.
- This is minimax optimal (up to a constant).
- The norm of the intermediate solutions $x_t$ is well controlled, so (A2) is not required.

Convergence analysis of SRDA (strongly convex)
Assumptions: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$; (A3) $\psi$ is $\mu$-strongly convex.
Modify the update rule as
$\bar{g}_t = \frac{2}{(t+1)(t+2)}\sum_{\tau=1}^t \tau g_\tau, \quad x_t = \mathrm{prox}(-\eta_t \bar{g}_t \mid \eta_t\psi)$,
and take the weighted average $\bar{x}_T = \frac{2}{(T+1)(T+2)}\sum_{t=0}^T (t+1)x_t$.
Theorem. For $\eta_t = (t+1)(t+2)/\xi$, it holds that
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{\xi\|x^* - x_0\|^2}{T^2} + \frac{2G^2}{T\mu}$.
Letting $\xi \to 0$ yields $\frac{2G^2}{T\mu}$: minimax optimal.

General convergence analysis
Let the weights $s_t > 0$ ($t = 1, \dots$) be any positive sequence. We generalize the update rule as
$\bar{g}_t = \frac{\sum_{\tau=1}^t s_\tau g_\tau}{\sum_{\tau=1}^{t+1} s_\tau}, \quad x_t = \mathrm{prox}(-\eta_t \bar{g}_t \mid \eta_t\psi)$ ($t = 1, \dots, T$),
and let the weighted average of $(x_t)_t$ be $\bar{x}_T = \frac{\sum_{\tau=0}^T s_{\tau+1} x_\tau}{\sum_{\tau=0}^T s_{\tau+1}}$.
Assumptions: (A1) $\mathbb{E}[\|g_t\|^2] \le G^2$; (A3) $L_\psi$ is $\mu$-strongly convex ($\mu$ can be 0).
Theorem. Suppose that $\eta_t/(\sum_{\tau=1}^{t+1} s_\tau)$ is non-decreasing. Then
$\mathbb{E}_{z_{1:T}}[L_\psi(\bar{x}_T) - L_\psi(x^*)] \le \frac{1}{\sum_{t=1}^{T+1} s_t}\left( \sum_{t=1}^{T+1} \frac{s_t^2\, G^2}{2(\sum_{\tau=1}^t s_\tau)(\mu + 1/\eta_{t-1})} + \frac{\sum_{t=1}^{T+2} s_t}{2\eta_{T+1}}\|x^* - x_0\|^2 \right)$.

Computational cost and generalization error
The optimal rate for a strongly convex expected risk (generalization error) is $O(1/n)$ ($n$ is the sample size). Hence, to achieve $O(1/n)$ generalization error, we only need to decrease the training error to $O(1/n)$.

                                Batch gradient descent    SGD
  Time per iteration            n                         1
  Iterations until ϵ error      log(1/ϵ)                  1/ϵ
  Time until ϵ error            n log(1/ϵ)                1/ϵ
  Time until 1/n error          n log(n)                  n

(Bottou, 2010.) SGD is faster by a factor of $O(\log n)$ with respect to the generalization error.

Typical behavior
[Figure: training error and generalization error versus elapsed time (s) for SGD and the batch gradient method.]
Normal gradient descent vs. SGD on logistic regression with $L_1$ regularization ($n = 10{,}000$, $p = 2$): SGD decreases the objective rapidly, and after a while the batch gradient method catches up and slightly surpasses it.

Summary of Part I
- Overview of stochastic optimization
- Convex analysis
- First-order methods: proximal gradient descent
- Online stochastic optimization: stochastic gradient descent (SGD) and stochastic regularized dual averaging (SRDA); $RG/\sqrt{T}$ convergence in general, $G^2/(\mu T)$ convergence for a strongly convex risk
Next part: faster algorithms for online stochastic optimization, and batch stochastic optimization methods.

References
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
L. Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.
L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010. Springer, 2010.
L. Bottou and Y. LeCun. Large scale online learning. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, 2004.
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.
O. Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM Journal on Control and Optimization, 29(2):403-419, 1991.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315-323. Curran Associates, Inc., 2013.
N. Le Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2663-2671. Curran Associates, Inc., 2012.
A. Nemirovskii and D. Yudin. On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions. Soviet Mathematics Doklady, 19(2), 1978.
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning. Omnipress, 2012.
H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.
F. Rosenblatt. The perceptron: A perceiving and recognizing automaton. Technical Report 85-460-1, Project PARA, Cornell Aeronautical Laboratory, 1957.
D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567-599, 2013.
Y. Singer and J. C. Duchi. Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems 22, pages 495-503, 2009.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 22, 2009.


Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Notes on AdaGrad. Joseph Perla 2014

Notes on AdaGrad. Joseph Perla 2014 Notes on AdaGrad Joseph Perla 2014 1 Introduction Stochastic Gradient Descent (SGD) is a common online learning algorithm for optimizing convex (and often non-convex) functions in machine learning today.

More information

Classification Logistic Regression

Classification Logistic Regression Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham

More information

Accelerating Online Convex Optimization via Adaptive Prediction

Accelerating Online Convex Optimization via Adaptive Prediction Mehryar Mohri Courant Institute and Google Research New York, NY 10012 mohri@cims.nyu.edu Scott Yang Courant Institute New York, NY 10012 yangs@cims.nyu.edu Abstract We present a powerful general framework

More information

BASICS OF CONVEX ANALYSIS

BASICS OF CONVEX ANALYSIS BASICS OF CONVEX ANALYSIS MARKUS GRASMAIR 1. Main Definitions We start with providing the central definitions of convex functions and convex sets. Definition 1. A function f : R n R + } is called convex,

More information

Convex Functions. Pontus Giselsson

Convex Functions. Pontus Giselsson Convex Functions Pontus Giselsson 1 Today s lecture lower semicontinuity, closure, convex hull convexity preserving operations precomposition with affine mapping infimal convolution image function supremum

More information

Full-information Online Learning

Full-information Online Learning Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods 2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University

More information

CS260: Machine Learning Algorithms

CS260: Machine Learning Algorithms CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {

More information

Dual Proximal Gradient Method

Dual Proximal Gradient Method Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information