Block stochastic gradient update method

Size: px

Start display at page:

Download "Block stochastic gradient update method"

Gabriel Goodman
6 years ago
Views:

1 Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26

2 Stochastic gradient method Consider the stochastic programming Stochastic gradient update (SG): min F(x) = E ξ f (x; ξ). x X x k+1 = P X ( x k α k g k) g k a stochastic gradient, often E[ g k ] F(x k ) Originally for stochastic problem where exact gradient not available Now also popular for deterministic problem where exact gradient expensive; e.g., F(x) = 1 N fi(x) with large N N i=1 Faster than deterministic gradient method to reach not-high accuracy 2 / 26

3 Stochastic gradient method First appears in [Robbins-Monro 51]; now tons of works O(1/ k) rate for weakly convex problem and O(1/k) for strongly convex problem (e.g., [Nemirovski et. al 09]) For deterministic problem, linear convergence is possible if exact gradient allowed periodically [Xiao-Zhang 14] Convergence in terms of first-order optimality condition for nonconvex problem [Ghadimi-Lan 13] 3 / 26

4 Block gradient descent Consider the problem f is smooth min F(x) = f (x 1,..., x s) + x r i s possibly nonsmooth and extended-valued Block gradient update (BGD): x k+1 i k s r i(x i). = arg min x ik ik f (x k ), x ik x k i k + 1 2α k x ik x k i k r ik (x ik ) Simpler than classic block coordinate descent, i.e., block minimization Allows different ways to choose i k : cyclicly, greedily, or randomly Low iteration complexity Larger stepsize than full gradient and often faster convergence i=1 4 / 26

5 Block gradient descent First appears in [Tseng-Yun 09]; famous since [Nesverov 12] Both cyclic and randomized selection: O(1/k) for weakly convex problem and linear convergence for strongly convex problem (e.g., [Hong et. al 15]) Cyclic version harder than random or greedy version to analyze Subsequence convergence for nonconvex problem and whole sequence convergence if certain local property holds (e.g., [X.-Yin 14]) 5 / 26

6 Stochastic programming with block structure Consider problem min Φ(x) = E ξ f (x 1,..., x s; ξ) + x s r i(x i) i=1 (BSP) Example: tensor regression [Zhou-Li-Zhu 13] min E[l(X 1 X s; A, b)] X 1,...,X s This talk presents an algorithm for (BSP) with properties: Only requiring stochastic block gradient Simple update and low computational complexity Guaranteed convergence Optimal convergence rate if the problem is convex 6 / 26

7 How and why use stochastic partial gradient in BGD exact partial gradient unavailable or expensive stochastic gradient works but performs not as well random cyclic selection, i.e., shuffle and then cycle random shuffling for faster convergence and more stable performance [Chang-Hsieh-Lin 08] cyclic for lower computational complexity but analysis more difficult 7 / 26

8 Block stochastic gradient method At each iteration/cycle k 1. Sample one function or a batch of functions 2. Random shuffle blocks to (k 1,..., k s) 3. From i = 1 through s, do x k+1 k i = arg min x ki g k k i, x ki + 1 2α k k i x ki x k k i 2 + r ki (x ki ) g k k i stochastic partial gradient, dependent on sampled functions and intermediate point (x k+1 k <i, xk k i ) possibly biased estimate, i.e., E[ g k k i F(x k+1 k <i, x k k i )] 0, where F(x) = E ξ f (x; ξ) 8 / 26

9 Pros and cons of cyclic selection Pros: lower computational complexity, e.g., for Φ(x) = E (a,b) (a x b) 2 with x R n cyclic selection takes about 2n to update all coordinates once random selection takes n to update one coordinate Gauss-seidel type fast convergence (see numerical results later) Cons: biased stochastic partial gradient makes analysis more difficult 9 / 26

10 Literature Just a few papers so far [Liu-Wright, arxiv14]: an asynchronous parallel randomized Kaczmarz algorithm [Dang-Lan, SIOPT15]: stochastic block mirror descent methods for nonsmooth and stochastic optimization [Zhao et al. NIPS14]: accelerated mini-batch randomized block coordinate descent method [Wang-Banerjee, arxiv14]: randomized block coordinate descent for online and stochastic optimization [Hua-Kadomoto-Yamashita, OptOnline15]: regret analysis of block coordinate gradient methods for online convex programming 10 / 26

11 Assumptions Recall F(x) = E ξ f (x; ξ). Let δ k i = g k i xi F(x k+1 <i, x k i). Error bound of stochastic partial gradient: E[δ k i x k 1 ] A max αj k, i, k, (A = 0 if unbiased) j E δ k i 2 σ 2 k σ 2, i, k. Lipschitz continuous partial gradient: xi F(x + (0,..., d i,..., 0)) xi F(x) L i d i, i, x, d. 11 / 26

12 Convergence of block stochastic gradient Convex case: F and r i s are convex. Take α k i = α k = O( 1 k ), i, k, and let x k = k κ=1 ακxκ+1 k κ=1 ακ. Then E[Φ( x k ) Φ(x )] O(log k/ k). Can be improved to O(1/ k) if the number of iterations is pre-known, and thus achieves the optimal order of rate Strongly convex case: Φ is strongly convex. Take αi k = α k = O( 1 ), i, k. k Then E x k x 2 O(1/k). Again, optimal order of rate is achieved 12 / 26

13 Convergence of block stochastic gradient Unconstrained smooth nonconvex case: If {α k i } is taken such that then αi k =, k=1 (αi k ) 2 <, i, k=1 lim E Φ(x k ) = 0. k Nonsmooth nonconvex case: If α k i < 2 L i, i, k, and k=1 σ2 k <, then lim E [ dist(0, Φ(x k )) ] = 0. k 13 / 26

14 Numerical experiments Tested problems Stochastic least square Linear logistic regression Bilinear logistic regression Low-rank tensor recovery from Gaussian measurements Tested methods block stochastic gradient (BSG) [proposed] block gradient (deterministic) stochastic gradient method (SG) stochastic block mirror descent (SBMD) [Dang-Lan 15] 14 / 26

15 Stochastic least square Consider min x E (a,b) 1 2 (a x b) 2 a N (0, I), b = a ˆx + η, and η N (0, 0.01) ˆx is the optimal solution, and minimum value {(a i, b i)} observed sequentially from i = 1 to N Deterministic (partial) gradient unavailable 15 / 26

16 Objective values by different methods N Samples BSG SG SBMD-10 SBMD-50 SBMD e e e e e e e e e e e e e e-3 One coordinate as one block SBMD-t: SBMD with t coordinates selected each update Objective valued by another 100,000 samples Each update of all methods costs O(n) Observation: Better to update more coordinates and BSG performs best 16 / 26

17 Logistic regression min w,b 1 N N log ( 1 + exp [ ( y i x i w + b )]) i=1 (LR) Training samples {(x i, y i)} N i=1 with y i { 1, +1} Deterministic problem but exact gradient expensive for large N Stochastic gradient faster than deterministic gradient for not-high accuracy 17 / 26

18 Performance of different methods on logistic regression Objective Optimal value BSG SBMD 10 SBMD 50 SBMD 100 SG Epochs (a) random dataset Objective Optimal value BSG SBMD 1k SBMD 3k SG Epochs (b) gisette dataset Random dataset: 2000 Gaussian random samples of dimension 200 gisette dataset: 6,000 samples of dimension 5000, from LIBSVM Datasets Observation: BSG gives best performance among compared methods. 18 / 26

19 Low-rank tensor recovery from Gaussian measurements min X 1 2N N (A l (X 1 X 2 X 3) b l ) 2 (LRTR) l=1 b l = A l (M) = G l, M with G l N (0, I), l G l s are dense For large N, reading all G l may out of memory even for medium G l Deterministic problem but exact gradient too expensive for large N 19 / 26

Peformance of block deterministic and stochastic gradient 10 0 Objective 10 2 10 4 10 6 BSG

gradient G l R 60 60 60 and N = 40, 000 Original (top) and recovered (bottom) by BSG with 50

20 Peformance of block deterministic and stochastic gradient 10 0 Objective BSG BCGD Epochs G l R and N = 15, 000 BCGD: block deterministic gradient G l R and N = 40, 000 Original (top) and recovered (bottom) by BSG with 50 epochs BSG: block stochastic gradient Observation: BSG faster and BCGD trapped at bad local solution 20 / 26

21 Bilinear logistic regression 1 min U,V,b N N log ( 1 + exp [ ( y i UV, X i + b )]) i=1 (BLR) Training samples {(X i, y i)} N i=1 with y i { 1, +1} Better than linear logistic regression for 2D dataset [Dyrholm et al. 07] 21 / 26

22 BCI competition EEG dataset Recorded from a healthy person using 118 channels; Visual cues (letter presentation) were shown; Performed: left hand, right foot, or tongue; 2100 marked data points of left hand and right foot were used; Each data point is / 26

23 Performance of block deterministic and stochastic gradient Deterministic Stochastic Objective Epochs Observation: stochastic method faster than deterministic one 23 / 26

24 Performance of linear and bilinear logistic regression 1 Prediction Accuracy BSG BCGD LibLinear Runs Each run, 2000 for training and 100 for testing BSG and BCGD run to 30 epochs LibLinear solves linear logistic regression Observation: bilinear model better than linear one on EEG data 24 / 26

25 Conclusions Proposed a block stochastic gradient method for stochastic programming Combines block gradient and stochastic gradient methods Inherits both advantages and better than either one individually Analyzed its convergence and rate Optimal order of convergence rate for convex problems Convergence in terms of first-order optimality condition for nonconvex problems Tested on both convex and nonconvex problems stochastic least square linear and bilinear logistic regression low-rank tensor recovery from dense Gaussian measurements 25 / 26

26 References Y. Xu and W. Yin. Block stochastic gradient iteration for convex and nonconvex optimization. SIOPT15. J. Shi, Y. Xu and R. Baraniuk. Sparse bilinear logistic regression. arxiv14. C. Dang and G. Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIOPT15. Y. Xu and W. Yin. A block coordinate descent method for regularized multi-convex optimization with applications to nonnegative tensor factorization and completion. SIIMS13. A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming, SIOPT / 26

ARock: an algorithmic framework for asynchronous parallel coordinate updates

ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,