Nesterov s Acceleration

Size: px

Start display at page:

Download "Nesterov s Acceleration"

Eustacia May
5 years ago
Views:

1 Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x t x t 1 ) x t+1 =prox(y t g t ) 1 a.k.a FISTA

Iterate: g t 2 @f(y t ) x t =prox(y t g t ) f(x t ) f(x

2 Nesterov s Acceleration f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate: g t t ) x t =prox(y t g t ) f(x t ) f(x ) apple 2 kx t x k t 2 s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x t x t 1 ) 2 source: T. Suzuki

3 Nesterov s Acceleration min 1 n P i ( T x i y i ) 2 + k k Normal Nesterov 10 2 Relative objective (f(x t ) - f * ) Iteration 3 source: T. Suzuki

4 Stochastic Gradient We want to minimize min L( ) :=E[l (Z)] 2R p Due to practical constraints, samples only come one by one, each at a time t, and cannot be stored. Only previous parameter is stored. We use a double approximation E[l(,Z)] l(,z t ) l( t 1,z t )+hrl( t 1,z t ), i 4

5 Stochastic Gradient To approximate the minimization of min E[l (Z)] 2R p we use the approximated problem, only valid around the previous iterate t := arg min 2R phrl( t 1,z t ), i t k t 1 k 2 5

6 SG (no regularization) min L( ) :=E[l (Z)] 2R p Stochastic Gradient Method (regularization) Set 0 and sequence t. Repeat: Sample z t P (Z). Compute subgradient g t l(,z t ) Update t = t 1 t g t Output : T = 1 T +1 P T t=0 t 6

7 SG (regularization) We want to minimize now: min L ( ) :=E[l (Z)] + ( ) 2R p Stochastic Gradient Method (regularization) Set 0 and sequence t. Repeat: Sample z t P (Z). Compute subgradient g t l(,z t ) Update t =prox( t 1 t g t t ) Output : T = 1 T +1 P T t=0 t 7

8 Polynomial Averaging Stochastic Gradient Method (regularization) Set 0 and sequence t. Repeat: Sample z t P (Z). Compute subgradient g t l(,z t ) Update t =prox( t 1 t g t t ) Output : T = 2 (T +1)(T +2) P T t=0 (t + 1) t 8

9 Batch Problems SGMethods have several drawbacks, chief among them is the choice of a stepsize. Is there a setting where this can be mitigated? Yes, when the expectation is in fact a large sum: min L ( ) :=E[l (Z)] + ( ) 2R p + 1 min 2R p n nx i=1 l(,z i )+ ( ) 9

10 Batch Methods We would like to have the benefits of SGM (low cost per iteration) without the disadvantages (slow convergence near optimum, step size selection) 10 source:

11 Batch Methods Trainig error SGD Batch Elaplsed time (s) Generalization error Elaplsed time (s) Logistic Regression L1 regularization 11 source: T. Suzuki

12 Three Methods Primal methods! Stochastic Average Gradient (A) descent, SAG(A) (Le Roux et al., 2012, Schmidt et al., 2013, Defazio et al., 2014)! Stochastic Variance Reduced Gradient descent, SVRG (Johnson and Zhang, 2013, Xiao and Zhang, 2014)! Dual methods (see Fenchel duality) Stochastic Dual Coordinate ascent, SDCA (Shalev-Shwartz and Zhang, 2013a) 12

13 Primal Methods 1 min 2R p n nx l i ( )+ ( ) i=1 strongly! smooth convex We want to approximate r 1 n P n i=1 l i( ) 13

14 Primal Methods 1 min 2R p n nx l i ( )+ ( ) i=1 strongly! smooth convex We want to approximate r 1 n P n i=1 l i( ) X E i unif{1,...,n} [rl i ( )] = 1 n rl i ( ) =rl( ) Randomizing points in the dataset gives a way to get an unbiased estimator of the gradient. 13 i

15 Primal Methods 1 min 2R p n nx l i ( )+ ( ) i=1 strongly! smooth convex We want to approximate r 1 n P n i=1 l i( ) X E i unif{1,...,n} [rl i ( )] = 1 n rl i ( ) =rl( ) Problem: Variance! i 13

16 SVRG nx g = rl i ( ) rl i (ˆ )+ 1 n rl j (ˆ ) j=1 easy to show that this gradient estimate is unbiased Variance is controlled by how far, ˆ are. 14

17 SVRG g = rl i ( ) var[g] = 1 n rl i (ˆ )+ 1 n nx rl j (ˆ ) j=1 easy to show i=1that this gradient estimate is unbiased Variance is controlled by how far = 1 n apple 1 n nx nx i=1 nx i=1 apple 2 k ˆ k 2 krl i ( ) rl i (ˆ )+rl(ˆ ) rl( )k 2, ˆ are. krl i ( ) rl i (ˆ )k 2 krl(ˆ ) rl( )k 2 krl i ( ) rl i (ˆ )k 2 14

18 SVRG SVRG Set ˆ 0. For t =1,...,T, Set ˆ = ˆ t 1. 0 = ˆ. ĝ = 1 P n n i=1 rl i(ˆ ) : full gradient, at ˆ. For k =1,...,m Sample i {1,...,n} g = rl i ( k 1 ) rl i (ˆ )+ĝ : variance reduction k =prox( k 1 g ) ˆ t = 1 m P m k=1 k 15

19 SAGA ˆ depends on the data index. (SVRG) g = rl i ( t 1 ) rl i (ˆ )+ 1 n P n j=1 rl j(ˆ ) (SAGA) g = rl i ( t 1 ) rl i (ˆ i )+ 1 n P n j=1 rl j(ˆ j ) ˆ i is updated at every iteration. (ˆ i = t 1 i chosen ˆ i unchanged otherwise. onsequence: larger storage is necessary, but no double loop 16

20 SAGA SAGA Set ĝ i =ḡ =0,i2 {1,...,n}, Set.. Pick i 2 {1,...,n} randomly. Update g i = rl i ( ) Estimate gradient ĝ = g i ĝ i +ḡ Update average gradient ḡ =ḡ + 1 n (g i ĝ i ). Update stored gradients ĝ i = g i. Update prox( ĝ ) Step size: ~1/γ, convergence guaranteed. In practice: important to use mini-batches. 17

21 Recent Extensions: Point SAGA No regularizer required, proximal operator of each function. 18

22 Dual Methods Set x 0 =(x 0 1,...,x 0 n), For k =1,...,K x k+1 i = arg min y2r f(x k+1 1,...,x k+1 i 1,y,xk i+1,...,xk n) 19 source: wikipedia

23 Dual Methods Reminders on Coordinate Descent Set x 0 =(x 0 1,...,x 0 n), For k =1,...,K x k+1 i = arg min y2r f(x k+1 1,...,x k+1 i 1,y,xk i+1,...,xk n) 19 source: wikipedia

24 Dual Methods Reminders on Coordinate Descent 19 source: wikipedia

25 Dual Methods Reminders on Coordinate Descent 19 source: wikipedia

26 Dual Methods Reminders on Coordinate Descent 19 source: wikipedia

27 Dual Methods Reminders on Coordinate Descent To ensure success of CD, some progress must be guaranteed. Separability of the objective function helps. 19 source: wikipedia

28 Dual Methods Set 0 =( 0 1,..., 0 p), For k =1,...,K Sample j. Compute g j )/@ j j arg min y2r g j y + j (y)+ 1 2 t ky j k 2 20 source: wikipedia

29 Dual Methods Coordinate Descent on Primal Problem Set 0 =( 0 1,..., 0 p), For k =1,...,K Sample j. Compute g j )/@ j j arg min y2r g j y + j (y)+ 1 2 t ky j k 2 Regularizer must be separable. 20 source: wikipedia

30 Fenchel Duality Theorem Theorem Let f : R p! R and g : R q! R be closed convex, and A 2 R q p a linear map. Suppose that either condition (a) or (b) is satisfied. Then inf f(x)+g(ax) = sup f (A T y) g ( y) x2rp y2r q 1 min 2R p n nx i=1 l (z i )+ ( ) l (z i )=l(y i,x T i ) 1 sup y2r n n X i l i (y i )+ ( X T y/n) = r ( X T y /n) 21

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.