Stochastic Optimization: First order method


1 Stochastic Optimization: First order method
Taiji Suzuki
Tokyo Institute of Technology, Graduate School of Information Science and Engineering, Department of Mathematical and Computing Sciences
JST, PRESTO
Intensive Nagoya University

2 Outline
1. First order method
2. Proximal gradient descent
3. Nesterov's acceleration and optimal convergence



5 Regularized learning problem
Lasso:
\[ \min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} (y_i - z_i^\top x)^2 + \underbrace{\|x\|_1}_{\text{regularization}}. \]
General regularized learning problem:
\[ \min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \ell(z_i, x) + \psi(x). \]
Difficulty: sparsity-inducing regularization is usually non-smooth.

6 First order optimization
[Figure: successive iterates $x_{t-1}$, $x_t$, $x_{t+1}$.]
Optimization methods that use only the function value $f(x)$ and a first-order (sub)gradient $g \in \partial f(x)$.
The computation per iteration is light, which makes these methods well suited to high-dimensional problems.
By contrast, Newton's method is a second-order method.


8 Gradient descent
Let $f(x) = \sum_{i=1}^{n} \ell(z_i, x)$ and consider $\min_x f(x)$.

Subgradient method
Differentiable $f(x)$:
\[ x_t = x_{t-1} - \eta_t \nabla f(x_{t-1}). \]
Subdifferentiable $f(x)$:
\[ g_t \in \partial f(x_{t-1}), \qquad x_t = x_{t-1} - \eta_t g_t. \]

Subgradient method (equivalent formula)
Subdifferentiable $f(x)$:
\[ x_t = \operatorname*{argmin}_x \left\{ \langle x, g_t \rangle + \frac{1}{2\eta_t} \|x - x_{t-1}\|^2 \right\}, \]
where $g_t \in \partial f(x_{t-1})$.

Proximal point algorithm:
\[ x_t = \operatorname*{argmin}_x \left\{ f(x) + \frac{1}{2\eta_t} \|x - x_{t-1}\|^2 \right\}. \]
$f(x_t) \to$ optimum for any convex $f$ and $\eta_t = \eta > 0$ (Güler, 1991).
If $f(x)$ is $\sigma$-strongly convex:
\[ f(x_t) - f(x^*) \le \frac{1}{2\eta} \left( \frac{1}{1 + \sigma\eta} \right)^{t-1} \|x_0 - x^*\|^2. \]
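The subgradient update can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the loss is taken to be the absolute error $\ell(z_i, x) = |y_i - z_i^\top x|$, the step size is $\eta_t = \eta_0/\sqrt{t}$, and the function name `subgradient_method` is hypothetical. Because the subgradient method is not a descent method, the sketch keeps track of the best iterate seen so far.

```python
import numpy as np

def subgradient_method(Z, y, eta0=0.01, n_iter=500):
    """Subgradient method for f(x) = sum_i |y_i - z_i^T x| (an assumed example loss)."""
    n, p = Z.shape
    x = np.zeros(p)                                # x_0
    best_x, best_f = x.copy(), np.abs(y - Z @ x).sum()
    for t in range(1, n_iter + 1):
        residual = y - Z @ x
        g = -Z.T @ np.sign(residual)               # a subgradient g_t of f at x_{t-1}
        x = x - (eta0 / np.sqrt(t)) * g            # x_t = x_{t-1} - eta_t g_t
        f = np.abs(y - Z @ x).sum()
        if f < best_f:                             # remember the best iterate
            best_x, best_f = x.copy(), f
    return best_x, best_f
```

Calling `subgradient_method(Z, y)` with an `(n, p)` design matrix `Z` and a length-`n` response `y` returns the best iterate and its objective value.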

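11 Proximal gradient descent
Let $f(x) = \sum_{i=1}^{n} \ell(z_i, x)$ and consider $\min_x f(x) + \psi(x)$.

Proximal gradient descent
\[ x_t = \operatorname*{argmin}_x \left\{ \langle x, g_t \rangle + \psi(x) + \frac{1}{2\eta_t} \|x - x_{t-1}\|^2 \right\}
       = \operatorname*{argmin}_x \left\{ \eta_t \psi(x) + \frac{1}{2} \|x - (x_{t-1} - \eta_t g_t)\|^2 \right\}, \]
where $g_t \in \partial f(x_{t-1})$.

The update rule is given by the proximal mapping:
\[ \mathrm{prox}(q \mid \psi) = \operatorname*{argmin}_x \left\{ \psi(x) + \frac{1}{2} \|x - q\|^2 \right\}. \]
By using the proximal mapping, we can avoid bad properties (e.g., non-smoothness) of $\psi$.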
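The equality between the two argmin expressions follows by completing the square. A short derivation, writing $\eta = \eta_t$ and collecting the terms that do not depend on $x$ into a constant:

```latex
\begin{align*}
\langle x, g_t \rangle + \psi(x) + \frac{1}{2\eta}\|x - x_{t-1}\|^2
  &= \psi(x) + \frac{1}{2\eta}\Big( \|x - x_{t-1}\|^2
       + 2\eta \langle x - x_{t-1}, g_t \rangle + \eta^2 \|g_t\|^2 \Big)
     + \langle x_{t-1}, g_t \rangle - \frac{\eta}{2}\|g_t\|^2 \\
  &= \psi(x) + \frac{1}{2\eta}\big\| x - (x_{t-1} - \eta g_t) \big\|^2 + \mathrm{const}.
\end{align*}
```

Multiplying by $\eta > 0$ does not change the minimizer, so $x_t = \mathrm{prox}(x_{t-1} - \eta_t g_t \mid \eta_t \psi)$.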

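12 Example
$L_1$ regularization: $\psi(x) = C\|x\|_1$.
\[ x_{t,j} = \mathrm{ST}_{C\eta_t}(x_{t-1,j} - \eta_t g_{t,j}) \quad \text{($j$-th component)}, \]
where $\mathrm{ST}_C(q) = \mathrm{sign}(q) \max\{|q| - C, 0\}$ is the soft-thresholding function.
Unimportant elements are forced to be exactly 0.
For many regularizers used in practice, the proximal mapping has an analytic form.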
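The soft-thresholding update can be sketched in NumPy as follows. The function names `soft_threshold` and `ista_step` are hypothetical, chosen for illustration; the gradient `grad` is assumed to be supplied by the caller.

```python
import numpy as np

def soft_threshold(q, c):
    """ST_c(q) = sign(q) * max(|q| - c, 0), applied elementwise."""
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

def ista_step(x_prev, grad, eta, C):
    """One proximal gradient step for psi(x) = C * ||x||_1.

    x_prev : previous iterate x_{t-1}
    grad   : (sub)gradient g_t of f at x_{t-1}
    eta    : step size eta_t
    """
    return soft_threshold(x_prev - eta * grad, C * eta)
```

Components whose magnitude after the gradient step is at most $C\eta_t$ are set to zero, which is how the sparsity arises.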

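13 Example of proximal mapping (cont.)
Trace norm: $\psi(X) = C\|X\|_{\mathrm{tr}} = C \sum_j \sigma_j(X)$ (sum of singular values).
Let $X_{t-1} - \eta_t G_t = U \,\mathrm{diag}(\sigma_1, \dots, \sigma_d)\, V^\top$. Then
\[ X_t = U \begin{pmatrix} \mathrm{ST}_{C\eta_t}(\sigma_1) & & \\ & \ddots & \\ & & \mathrm{ST}_{C\eta_t}(\sigma_d) \end{pmatrix} V^\top. \]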
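The trace-norm proximal mapping amounts to soft-thresholding the singular values. A minimal NumPy sketch, in which the name `prox_trace_norm` is hypothetical and the argument `c` stands for the threshold $C\eta_t$:

```python
import numpy as np

def prox_trace_norm(Q, c):
    """Proximal mapping of c * (trace norm) at Q via singular value soft-thresholding."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    s_shrunk = np.maximum(s - c, 0.0)   # singular values are nonnegative, so ST reduces to this
    return (U * s_shrunk) @ Vt          # U diag(s_shrunk) V^T
```

In a proximal gradient step one would call it as `prox_trace_norm(X_prev - eta * G, C * eta)`; small singular values are set to zero, so the result tends to be low rank.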

14 Convergence of proximal gradient descent
Strong convexity and smoothness of $f$ determine the convergence rate of
\[ x_t = \mathrm{prox}(x_{t-1} - \eta_t g_t \mid \eta_t \psi). \]

| Property of $f$ | $\mu$-strongly convex | Non-strongly convex |
|---|---|---|
| $\gamma$-smooth | $\exp\left(-t \frac{\mu}{\gamma}\right)$ | $\frac{\gamma}{t}$ |
| Non-smooth | $\frac{1}{\mu t}$ | $\frac{1}{\sqrt{t}}$ |

The step size $\eta_t$ should be chosen appropriately.

| Setting of $\eta_t$ | Strongly convex | Non-strongly convex |
|---|---|---|
| Smooth | $\frac{1}{\gamma}$ | $\frac{1}{\gamma}$ |
| Non-smooth | $\frac{2}{\mu t}$ | $\frac{1}{\sqrt{t}}$ |

To achieve these convergence rates, we need to take an appropriate average of $\{x_t\}_t$: Polyak-Ruppert averaging or polynomially decaying averaging.

Convergence for smooth losses can be improved by Nesterov's acceleration, which attains the optimal rates:

| Property of $f$ | $\mu$-strongly convex | Non-strongly convex |
|---|---|---|
| $\gamma$-smooth (accelerated) | $\exp\left(-t \sqrt{\frac{\mu}{\gamma}}\right)$ | $\frac{\gamma}{t^2}$ |


17 Nesterov's acceleration (non-strongly convex)
\[ \min_x \{ f(x) + \psi(x) \} \]
Assumption: $f(x)$ is $\gamma$-smooth.

Nesterov's acceleration scheme
Let $s_1 = 1$ and $\eta = 1/\gamma$, and iterate the following for $t = 1, 2, \dots$
1. Let $g_t \in \partial f(y_t)$, and update $x_t = \mathrm{prox}(y_t - \eta g_t \mid \eta\psi)$.
2. Set $s_{t+1} = \frac{1 + \sqrt{1 + 4 s_t^2}}{2}$.
3. Update $y_{t+1} = x_t + \left( \frac{s_t - 1}{s_{t+1}} \right) (x_t - x_{t-1})$.

If $f$ is $\gamma$-smooth, then
\[ f(x_t) - f(x^*) \le \frac{2\gamma \|x_0 - x^*\|^2}{t^2}. \]
This is also called the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) (Beck and Teboulle, 2009).
The step size $\eta = 1/\gamma$ can be determined adaptively by back-tracking.
The momentum method is also important for deep learning (Sutskever et al., 2013).
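The three steps of the accelerated scheme can be sketched for the Lasso as follows. This is a sketch under stated assumptions rather than the lecture's own implementation: $f(x) = \frac{1}{n}\|y - Zx\|^2$ and $\psi(x) = C\|x\|_1$, the smoothness constant is taken as $\gamma = \frac{2}{n}\|Z\|_2^2$, the initialization $y_1 = x_0 = 0$ is assumed (the slide does not state it), and the names `fista_lasso` and `lasso_objective` are hypothetical.

```python
import numpy as np

def soft_threshold(q, c):
    """ST_c(q) = sign(q) * max(|q| - c, 0), applied elementwise."""
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

def lasso_objective(Z, y, x, C):
    """(1/n) * ||y - Z x||^2 + C * ||x||_1."""
    return np.sum((y - Z @ x) ** 2) / len(y) + C * np.sum(np.abs(x))

def fista_lasso(Z, y, C, n_iter=200):
    """Accelerated proximal gradient (FISTA-style) iterations for the Lasso."""
    n, p = Z.shape
    gamma = 2.0 * np.linalg.norm(Z, 2) ** 2 / n    # smoothness constant of f
    eta = 1.0 / gamma
    x_prev = np.zeros(p)                           # x_0
    y_t = x_prev.copy()                            # y_1 = x_0 (assumed initialization)
    s = 1.0                                        # s_1 = 1
    for _ in range(n_iter):
        g = 2.0 * Z.T @ (Z @ y_t - y) / n          # gradient of f at y_t
        x = soft_threshold(y_t - eta * g, C * eta) # step 1: prox step
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s * s)) / 2.0      # step 2
        y_t = x + ((s - 1.0) / s_next) * (x - x_prev)           # step 3: momentum
        x_prev, s = x, s_next
    return x_prev
```

Replacing the momentum coefficient `(s - 1.0) / s_next` with zero recovers plain proximal gradient descent, which makes it easy to compare the two on the same data.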

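18 Nesterov's acceleration (strongly convex)
\[ \min_x \{ f(x) + \psi(x) \} \]
Assumption: $f(x)$ is $\gamma$-smooth and $\mu$-strongly convex (so necessarily $\gamma > \mu$).

Nesterov's acceleration scheme
Let $A_1 = 1$, $\alpha_1 = \gamma/\mu$ and $\eta = 1/\gamma$, and iterate the following for $t = 1, 2, \dots$
1. Let $g_t \in \partial f(y_t)$, and update $x_t = \mathrm{prox}(y_t - \eta g_t \mid \eta\psi)$.
2. Set $\alpha_{t+1} > 1$ so that $(\gamma - \mu)\alpha_{t+1}^2 - (2\gamma + A_t)\alpha_{t+1} + \gamma = 0$, and let $A_{t+1} = A_t / \alpha_{t+1}$.
3. Update $y_{t+1} = x_t + \left( \frac{\mu + A_t}{(\gamma - \mu)(\alpha_{t+1} - 1)(\alpha_t - 1)} \right) (x_t - x_{t-1})$.

If $f$ is $\gamma$-smooth and $\mu$-strongly convex, then
\[ f(x_t) - f(x^*) \le \gamma \left( 1 - \sqrt{\frac{\mu}{\gamma}} \right)^{t} \|x_0 - x^*\|^2. \]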
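To see how step 2 connects to the stated rate, the recursion for $\alpha_t$ can be iterated numerically. The sketch below assumes the quadratic reads $(\gamma - \mu)\alpha^2 - (2\gamma + A_t)\alpha + \gamma = 0$ with $A_{t+1} = A_t/\alpha_{t+1}$ and a starting value $A_1 = 1$; the function name `alpha_sequence` is hypothetical. Since the quadratic is negative at $\alpha = 1$ and has a positive leading coefficient when $\gamma > \mu$, its larger root is the unique root above 1.

```python
import math

def alpha_sequence(gamma, mu, A1=1.0, n_steps=50):
    """Iterate step 2: take the root alpha_{t+1} > 1, then set A_{t+1} = A_t / alpha_{t+1}."""
    A = A1
    alphas = []
    for _ in range(n_steps):
        a = gamma - mu                 # requires gamma > mu
        b = -(2.0 * gamma + A)
        c = gamma
        alpha = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)  # larger root, > 1
        A = A / alpha                  # A_t shrinks geometrically
        alphas.append(alpha)
    return alphas, A
```

Under these assumptions $A_t \to 0$ and $\alpha_t \to 1/(1 - \sqrt{\mu/\gamma})$, so each step eventually contracts by the factor $1/\alpha_t \approx 1 - \sqrt{\mu/\gamma}$ that appears in the bound.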


20 [Figure: Nesterov's acceleration vs. normal gradient descent on the Lasso with $n = 8{,}000$. Vertical axis: relative objective $f(x_t) - f^*$ (log scale); horizontal axis: iteration; curves: Normal, Nesterov.]

21 References
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
O. Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM Journal on Control and Optimization, 29(2):403–419, 1991.
I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139–1147, 2013.
