Coordinate Update Algorithm Short Course The Package TMAC

1 Coordinate Update Algorithm Short Course: The Package TMAC
Instructor: Wotao Yin (UCLA Math)
Summer

2 TMAC: A Toolbox of Async-Parallel, Coordinate, Splitting, and Stochastic Methods
C++11 multi-threading (Matlab has no shared-memory parallelism)
Plug in your operators and get coordinate updates and async parallelism for free
github.com/uclaopt/tmac
committers: Brent Edmunds, Zhimin Peng
contributors: Yerong Li, Yezheng Li, Tianyu Wu
mentor: Y.
supports: Windows, Mac, Linux

3 Speedup
Example: $\ell_1$ logistic regression
model:
$$\min_{x\in\mathbb{R}^n}\ \lambda\|x\|_1 + \frac{1}{N}\sum_{i=1}^{N}\log\big(1+\exp(-b_i a_i^T x)\big) \qquad (1)$$
Sparse numerical linear algebra is used.
datasets: news20, url
[Figure: speedup versus number of threads for sync, async, and ideal; one panel per dataset (news20, url)]

4 Example: nonnegative matrix factorization
model:
$$\min_{X,Y\ge 0}\ \|A - X^T Y\|_F^2 \qquad (2)$$
Although nonconvex, the problem is amenable to parallel coordinate descent.
[Figures: objective versus time (s) for 1, 2, 4, 8, and 16 cores; speedup versus number of threads, async versus ideal]

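5 Case study: $\ell_1$ logistic regression
training data: $\{(a_i, b_i)\}_{i=1}^m$, with $a_i\in\mathbb{R}^n$ and $b_i\in\{1,-1\}$
problem:
$$\min_{x\in\mathbb{R}^n}\ \lambda\|x\|_1 + \sum_{i=1}^m \log\big(1+e^{-b_i a_i^T x}\big)$$
forward-backward splitting iteration:
$$x^{k+1} = \underbrace{\mathrm{prox}_{\gamma\lambda\|\cdot\|_1}}_{\text{backward operator}}\Big(\underbrace{x^k - \gamma\nabla_x\sum_{i=1}^m \log\big(1+e^{-b_i a_i^T x^k}\big)}_{\text{forward operator}}\Big)$$
The full right-hand side is the forward-backward splitting scheme.
model parameter $\lambda$ controls solution sparsity
step size $\gamma$ determines whether the iteration converges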
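The forward-backward iteration above can be sketched in a few lines of plain C++. This is a minimal illustration on dense data, not TMAC code: the names `soft_threshold`, `logistic_gradient`, and `fbs_step` and the `Vec`/`Mat` aliases are assumptions made for this example. It relies on the standard fact that the proximal operator of $t\|\cdot\|_1$ is componentwise soft-thresholding.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense, row-major: A[i] holds the sample a_i

// Backward operator: prox of t*||.||_1, i.e. componentwise soft-thresholding.
double soft_threshold(double v, double t) {
  if (v > t) return v - t;
  if (v < -t) return v + t;
  return 0.0;
}

// Gradient of sum_i log(1 + exp(-b_i * a_i^T x)) with respect to x.
Vec logistic_gradient(const Mat& A, const Vec& b, const Vec& x) {
  assert(A.size() == b.size());
  Vec g(x.size(), 0.0);
  for (std::size_t i = 0; i < A.size(); ++i) {
    assert(A[i].size() == x.size());
    double margin = 0.0;
    for (std::size_t j = 0; j < x.size(); ++j) margin += A[i][j] * x[j];
    margin *= b[i];
    // d/dz log(1 + exp(-z)) = -1 / (1 + exp(z)), chained with z = b_i * a_i^T x
    const double coeff = -b[i] / (1.0 + std::exp(margin));
    for (std::size_t j = 0; j < x.size(); ++j) g[j] += coeff * A[i][j];
  }
  return g;
}

// One forward-backward step: forward (gradient) step, then backward (prox) step.
Vec fbs_step(const Mat& A, const Vec& b, const Vec& x, double gamma, double lambda) {
  const Vec g = logistic_gradient(A, b, x);
  Vec next(x.size());
  for (std::size_t j = 0; j < x.size(); ++j)
    next[j] = soft_threshold(x[j] - gamma * g[j], gamma * lambda);
  return next;
}
```

Calling `fbs_step` repeatedly with a fixed `gamma` gives the full-vector iteration; the coordinate version used by TMAC updates only one entry of `x` at a time.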

6 Algorithm 1: TMAC for $\ell_1$ logistic regression.
input: A, b and x are shared variables, p agents, K > 0. // interface
initialization:
  forward(x) := $x - \gamma\nabla_x \sum_{i=1}^m \log\big(1+e^{-b_i a_i^T x}\big)$ // forward operator
  backward(x) := $\mathrm{prox}_{\gamma\lambda\|\cdot\|_1}(x)$ // backward operator
  fbs(x) := backward(forward(x)) // forward-backward splitting
create p computing agents // multicore driver
while each of the p agents continuously do
  selects $i \in \{1,\dots,n\}$ based on some index rule // kernel
  updates $x_i \leftarrow x_i - \eta\,(x_i - \mathrm{fbs}_i(x))$ // kernel
output: x // interface
The damping parameter $\eta$ adapts to the async delay to ensure convergence.
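The kernel update in Algorithm 1 can be sketched as follows. This is a serial illustration of what a single agent does, under stated assumptions: `CoordinateOperator`, `coordinate_update`, and `run_agent` are hypothetical names, the index rule is taken to be cyclic, and the damping `eta` is held fixed rather than adapted to delay. Thread creation and shared-memory access are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
// Returns the i-th coordinate of fbs(x); a hypothetical signature for this sketch.
using CoordinateOperator = std::function<double(const Vec&, std::size_t)>;

// Kernel update: x_i <- x_i - eta * (x_i - fbs_i(x)).
void coordinate_update(Vec& x, std::size_t i, double eta,
                       const CoordinateOperator& fbs_i) {
  assert(i < x.size());
  x[i] -= eta * (x[i] - fbs_i(x, i));
}

// One agent with a cyclic index rule, run for a fixed number of epochs.
void run_agent(Vec& x, double eta, std::size_t epochs,
               const CoordinateOperator& fbs_i) {
  for (std::size_t k = 0; k < epochs; ++k)
    for (std::size_t i = 0; i < x.size(); ++i)
      coordinate_update(x, i, eta, fbs_i);
}
```

With `eta = 1` the update reduces to `x_i = fbs_i(x)`; smaller values of `eta` damp each step, which is the knob the algorithm adjusts when agents work with stale copies of `x`.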

7 For best performance, BLAS is called for numerical linear algebra operations such as
vector-vector: $y \leftarrow \alpha x + y$
matrix-vector: $y \leftarrow \alpha A x + \beta y$
matrix-matrix: $C \leftarrow \alpha A B + \beta C$
LAPACK is called for tasks such as solving linear systems, least squares, and eigenvalue problems.

8 Architecture layers
numerical linear algebra (NLA): BLAS, LAPACK
operator: for many functions and sets; calls NLA
scheme: FBS, BFS, DRS, ADMM, DYS; calls operators
kernel: chooses coordinates, calls scheme
driver: C++11 multi-threading + a controller thread; spawns threads to run kernel

9 Call TMAC from command line
  # running with 1 thread
  $ tmac_fbs_l1_log -data news20.svm -epoch ... -lambda 1 -nthread 1
  [some output skipped]
  Computing time is: 29.53(s).
  # running with 4 threads
  $ tmac_fbs_l1_log -data news20.svm -epoch ... -lambda 1 -nthread 4
  [some output skipped]
  Computing time is: 11.01(s).
  # running with 16 threads
  $ tmac_fbs_l1_log -data news20.svm -epoch ... -lambda 1 -nthread 16
  [some output skipped]
  Computing time is: 3.87(s).

10 Code snippet
apps/tmac_fbs_l1_log.cc
  // [...] parameters are defined above
  // forward operator: gradient step for logistic loss
  forward_grad_for_log_loss<SpMat> forward(&A, &b, &Atx, eta);
  // backward operator: proximal operator for l1 norm
  prox_l1 backward(eta, lambda);
  // forward-backward splitting scheme
  ForwardBackwardSplitting<forward_grad_for_log_loss<SpMat>,\
  prox_l1> fbs(&x, forward, backward);
  // the multicore driver
  TMAC(fbs, params);

11 Change the regularization function
Use Tikhonov regularization:
$$\min_{x\in\mathbb{R}^n}\ \lambda\|x\|_2^2 + \sum_{i=1}^m \log\big(1+\exp(-b_i a_i^T x)\big).$$
Replace lines 5 and 7 of the previous snippet with:
  prox_sum_square backward(eta, lambda);
  ForwardBackwardSplitting<forward_grad_for_log_loss<SpMat>,\
  prox_sum_square> fbs(&x, forward, backward);
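To see what swapping the backward operator amounts to, here is a minimal sketch of the proximal map of a squared 2-norm. The function name `tikhonov_prox` is an assumption for this example, and the scaling convention is also an assumption: it takes the regularizer to be $t\|u\|_2^2$ with no factor of 1/2, so the minimizer of $t\|u\|_2^2 + \tfrac12\|u-v\|_2^2$ is $v/(1+2t)$. TMAC's own `prox_sum_square` may use a different scaling.

```cpp
#include <cassert>
#include <vector>

// prox of t*||.||_2^2 at v: argmin_u t*||u||^2 + 0.5*||u - v||^2 = v / (1 + 2t).
// Unlike soft-thresholding, this shrinks every entry and sets none to zero.
std::vector<double> tikhonov_prox(const std::vector<double>& v, double t) {
  assert(t >= 0.0);
  std::vector<double> out(v);
  for (double& u : out) u /= (1.0 + 2.0 * t);
  return out;
}
```

In the forward-backward iteration, `t` would play the role of the product of step size and regularization weight.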

12 Change the loss function
LASSO uses the square loss:
$$\min_{x\in\mathbb{R}^n}\ \lambda\|x\|_1 + \|Ax-b\|^2.$$
Replace lines 3 and 7 of the original snippet with:
  forward_grad_for_square_loss<Matrix> forward(&A, &b, &Atx, eta);
  ForwardBackwardSplitting<forward_grad_for_square_loss<Matrix>,\
  prox_l1> fbs(&x, forward, backward);

13 Templating
motivation: some code is identical for double and single precision, and for dense and sparse matrices
goal: reduce code redundancy
templates: not concrete types, but blueprints from which the compiler generates them
examples:
dense matrix: forward_grad_for_square_loss<Matrix>
sparse matrix: forward_grad_for_log_loss<SpMat>
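A small sketch of the templating idea, using a dot product rather than a TMAC operator: the function `dot` below is a hypothetical example, written once and instantiated by the compiler separately for `float` and `double`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One blueprint; the compiler generates a separate function for each Real used.
template <typename Real>
Real dot(const std::vector<Real>& a, const std::vector<Real>& b) {
  assert(a.size() == b.size());
  Real sum = Real(0);
  for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
  return sum;
}
```

The same mechanism lets an operator class take the matrix type as a template parameter, so that dense and sparse data share one implementation of the surrounding logic.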

14 Interface: the operator example
purpose: separate structure from implementation
  class OperatorInterface {
  public:
    // compute operator at index
    virtual double operator()(Vector* v, int index = 0) = 0;
    // compute operator using val at index
    virtual double operator()(double val, int index = 0) = 0;
    // compute full operator using v_in, storing in v_out
    virtual void operator()(Vector* v_in, Vector* v_out) = 0;
    // update operator related step size
    virtual void update_step_size(double step_size) = 0;
    // update cache variable following an update at index i
    virtual void update_cache_vars(double old_x_i, double new_x_i, int i) = 0;
    // update cache variables based upon rank of calling thread
    virtual void update_cache_vars(Vector* x, int rank, int num_threads) = 0;
  };
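The following sketch shows how a concrete operator might sit behind an interface of this kind. It uses a reduced, hypothetical interface (`MiniOperatorInterface`, with only two of the methods listed above and simplified signatures) so that it stays self-contained; `SoftThreshold` and `apply_all` are likewise names invented for this example and are not TMAC classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;

// Reduced, hypothetical interface: structure only, no implementation.
class MiniOperatorInterface {
 public:
  virtual ~MiniOperatorInterface() = default;
  // compute operator at index
  virtual double operator()(const Vector& v, std::size_t index) const = 0;
  // update operator related step size
  virtual void update_step_size(double step_size) = 0;
};

// Soft-thresholding (prox of the l1 norm) implemented behind the interface.
class SoftThreshold : public MiniOperatorInterface {
 public:
  SoftThreshold(double step_size, double lambda)
      : step_size_(step_size), lambda_(lambda) {}
  double operator()(const Vector& v, std::size_t index) const override {
    assert(index < v.size());
    const double t = step_size_ * lambda_;
    const double val = v[index];
    if (val > t) return val - t;
    if (val < -t) return val + t;
    return 0.0;
  }
  void update_step_size(double step_size) override { step_size_ = step_size; }

 private:
  double step_size_;
  double lambda_;
};

// Code written against the interface works with any operator that implements it.
Vector apply_all(const MiniOperatorInterface& op, const Vector& v) {
  Vector out(v.size());
  for (std::size_t i = 0; i < v.size(); ++i) out[i] = op(v, i);
  return out;
}
```

A scheme or kernel that only sees the interface can be reused unchanged when a different operator is plugged in, which is the separation of structure from implementation the slide refers to.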

15 Open development
GitHub: github.com/uclaopt/tmac
Lots to do still: features, applications, interfaces (next slide)
Credit: developers get credit for the code they write and for the papers they publish
Our roles: mentoring and moderating
Typical flow: Fork → Write code → Test → Pull request and merge → Publication

16 Possible future work...
New applications
Stochastic (gradient) algorithms
Cluster computing, add MPI
Interface with Matlab, R, Python
Automatic parameters
Join us today or in the future!
