Parallel Numerical Algorithms

Chapter 6: Structured and Low Rank Matrices
Section 6.3: Numerical Optimization

Michael T. Heath and Edgar Solomonik
Department of Computer Science
University of Illinois at Urbana-Champaign
CS 554 / CSE 512

Outline

1. Alternating Least Squares: Quadratic Optimization, Parallel ALS
2. Cyclic Coordinate Descent
3. Stochastic Gradient Descent: Parallel SGD
4. Nonlinear Equations, Optimization

Quadratic Optimization: Matrix Completion

Given a subset $\Omega \subseteq \{1,\ldots,m\} \times \{1,\ldots,n\}$ of the entries of a matrix $A \in \mathbb{R}^{m \times n}$, seek a rank-$k$ approximation

$$\operatorname*{argmin}_{W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{n \times k}} \sum_{(i,j) \in \Omega} \Big( \underbrace{a_{ij} - \sum_{l} w_{il} h_{jl}}_{(A - W H^T)_{ij}} \Big)^2 + \lambda \left( \|W\|_F^2 + \|H\|_F^2 \right)$$

- Problems of this type are studied in sparse approximation
- $\Omega$ may be a randomly selected sample subset
- Methods for this problem are typical of numerical optimization and machine learning
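
As a concrete reading of the objective above, the following minimal sketch evaluates it over the observed entries only. The function name `completion_objective` and the dictionary-based storage of $\Omega$ are illustrative assumptions, not part of the slides, and the regularizer is taken to be the squared Frobenius norm.

```python
def completion_objective(entries, W, H, lam):
    """Regularized squared error summed over the observed entries only.

    entries: dict mapping (i, j) -> a_ij for every (i, j) in Omega (assumed layout)
    W: list of m rows, each a list of k floats
    H: list of n rows, each a list of k floats
    lam: regularization weight lambda
    """
    loss = 0.0
    for (i, j), a_ij in entries.items():
        # (W H^T)_ij is the inner product of row i of W and row j of H
        pred = sum(w * h for w, h in zip(W[i], H[j]))
        loss += (a_ij - pred) ** 2
    # squared Frobenius norms of both factors
    reg = sum(x * x for row in W for x in row) + sum(x * x for row in H for x in row)
    return loss + lam * reg


# Three observed entries of a 2 x 2 matrix that a rank-1 model fits exactly
entries = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 4.0}
W = [[1.0], [2.0]]
H = [[1.0], [2.0]]
print(completion_objective(entries, W, H, lam=0.1))
```

Only the entries in $\Omega$ contribute to the data term, so the cost of one evaluation is proportional to $|\Omega| k$ rather than $mnk$.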

Alternating Least Squares

- Alternating least squares (ALS) fixes $W$ and solves for $H$, then vice versa, until convergence
- Each step improves the approximation; convergence to a minimum is expected given a satisfactory starting guess
- With $H$ fixed, we have a quadratic optimization problem

$$\operatorname*{argmin}_{W \in \mathbb{R}^{m \times k}} \sum_{(i,j) \in \Omega} \Big( a_{ij} - \sum_{l} w_{il} h_{jl} \Big)^2 + \lambda \|W\|_F^2$$

- The optimization problem is independent for each row of $W$
- Letting $w_i$ denote row $i$ of $W$, $h_j$ row $j$ of $H$, and $\Omega_i = \{ j : (i,j) \in \Omega \}$, seek

$$\operatorname*{argmin}_{w_i \in \mathbb{R}^{k}} \sum_{j \in \Omega_i} \left( a_{ij} - w_i h_j^T \right)^2 + \lambda \|w_i\|_2^2$$

ALS: Quadratic Optimization

Seek the minimizer $w_i$ of the quadratic function

$$f(w_i) = \sum_{j \in \Omega_i} \left( a_{ij} - w_i h_j^T \right)^2 + \lambda \|w_i\|_2^2$$

Differentiating with respect to $w_i$ (written as a row vector) gives

$$\frac{\partial f(w_i)}{\partial w_i} = 2 \sum_{j \in \Omega_i} \left( w_i h_j^T - a_{ij} \right) h_j + 2 \lambda w_i = 0$$

Transposing, rotating $w_i h_j^T = h_j w_i^T$, and defining $G^{(i)} = \sum_{j \in \Omega_i} h_j^T h_j$,

$$\left( G^{(i)} + \lambda I \right) w_i^T = \sum_{j \in \Omega_i} h_j^T a_{ij}$$

which is a $k \times k$ symmetric linear system of equations
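
The row update can be sketched directly from the normal equations above: accumulate $G^{(i)}$ and the right-hand side over $\Omega_i$, add $\lambda$ to the diagonal, and solve. This is a minimal pure-Python illustration; the helper names `solve_linear` and `als_row_update` and the `{j: a_ij}` storage of a row are assumptions, and a library Cholesky solve would normally replace the hand-written elimination.

```python
def solve_linear(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [M[r][:] + [b[r]] for r in range(n)]  # augmented matrix [M | b]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for q in range(c, n + 1):
                A[r][q] -= f * A[c][q]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(A[r][q] * x[q] for q in range(r + 1, n))
        x[r] = (A[r][n] - s) / A[r][r]
    return x


def als_row_update(a_row, H, lam):
    """Return the minimizer w_i, given observed entries a_row = {j: a_ij}."""
    k = len(H[0])
    G = [[0.0] * k for _ in range(k)]
    rhs = [0.0] * k
    for j, a_ij in a_row.items():  # j ranges over Omega_i
        h = H[j]
        for s in range(k):
            rhs[s] += h[s] * a_ij          # sum_j h_j^T a_ij
            for t in range(k):
                G[s][t] += h[s] * h[t]     # sum_j h_j^T h_j
    for s in range(k):
        G[s][s] += lam                     # G^(i) + lambda I
    return solve_linear(G, rhs)


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a_row = {0: 2.0, 1: 3.0, 2: 5.0}
print(als_row_update(a_row, H, lam=0.0))
```

For $\lambda > 0$ the matrix $G^{(i)} + \lambda I$ is symmetric positive definite, so pivoting is not strictly needed; it is kept here so the sketch also behaves sensibly when $\lambda = 0$ and $G^{(i)}$ is nonsingular.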

ALS: Iteration Cost

For updating each $w_i$, ALS is dominated in cost by two steps:

1. Form $G^{(i)} = \sum_{j \in \Omega_i} h_j^T h_j$
   - dense matrix-matrix product
   - $O(|\Omega_i| k^2)$ work
   - logarithmic depth
2. Solve the linear system with $G^{(i)} + \lambda I$
   - dense symmetric $k \times k$ linear solve
   - $O(k^3)$ work
   - typically $O(k)$ depth

These steps can be done for all $m$ rows of $W$ independently

Parallel ALS

- Let each task optimize a row $w_i$ of $W$
- Need to compute $G^{(i)}$ for each task
- A specific subset of rows of $H$ is needed for each $G^{(i)}$
- Task execution is embarrassingly parallel if all of $H$ is stored on each processor

Memory-Constrained Parallel ALS

- May not have enough memory to replicate $H$ on all processors
- Communication is required and the pattern is data-dependent
- Could rotate rows of $H$ along a ring of processors
- Each processor computes contributions to the $G^{(i)}$ it owns
- Requires $\Theta(p)$ latency cost for each iteration of ALS
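
The ring rotation can be illustrated with a sequential simulation. In this sketch, the cyclic assignment of rows to processors, the function name `ring_gram`, and the representation of each $\Omega_i$ as a Python set are all assumptions made for illustration; a real implementation would exchange blocks of $H$ with message passing instead of reshuffling a list.

```python
def ring_gram(omega_rows, H, p):
    """Simulate p ring steps and return G[i] = sum_{j in Omega_i} h_j^T h_j.

    omega_rows[i] is the set Omega_i. Processor q owns rows i of W with
    i % p == q and initially holds rows j of H with j % p == q
    (a cyclic layout, assumed here for simplicity).
    """
    m, n, k = len(omega_rows), len(H), len(H[0])
    held = [[j for j in range(n) if j % p == q] for q in range(p)]
    G = [[[0.0] * k for _ in range(k)] for _ in range(m)]
    for _ in range(p):                      # p steps, one message per step
        for q in range(p):                  # conceptually concurrent
            for i in range(q, m, p):        # rows of W owned by processor q
                for j in held[q]:           # rows of H currently on processor q
                    if j in omega_rows[i]:
                        h = H[j]
                        for s in range(k):
                            for t in range(k):
                                G[i][s][t] += h[s] * h[t]
        # pass each block of H to the next processor on the ring
        held = [held[(q - 1) % p] for q in range(p)]
    return G


omega_rows = [{0, 2}, {1}, {0, 1, 3}, {2, 3}]
H = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5]]
print(ring_gram(omega_rows, H, p=2)[0])
```

Each processor sees every block of $H$ exactly once over the $p$ steps, which is where the $\Theta(p)$ latency term per ALS iteration comes from.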

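Updating a Single Variable

Rather than whole rows $w_i$, solve for individual elements of $W$; recall

$$\operatorname*{argmin}_{W \in \mathbb{R}^{m \times k}} \sum_{(i,j) \in \Omega} \Big( a_{ij} - \sum_{l} w_{il} h_{jl} \Big)^2 + \lambda \|W\|_F^2$$

Coordinate descent finds the best replacement $\mu$ for $w_{it}$

$$\mu = \operatorname*{argmin}_{\mu} \sum_{j \in \Omega_i} \Big( a_{ij} - \mu h_{jt} - \sum_{l \neq t} w_{il} h_{jl} \Big)^2 + \lambda \mu^2$$

The solution is given by

$$\mu = \frac{\sum_{j \in \Omega_i} h_{jt} \left( a_{ij} - \sum_{l \neq t} w_{il} h_{jl} \right)}{\lambda + \sum_{j \in \Omega_i} h_{jt}^2}$$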

Coordinate Descent

For $(i,j) \in \Omega$ compute elements $r_{ij}$ of $R = A - W H^T$ so that we can optimize via

$$\mu = \frac{\sum_{j \in \Omega_i} h_{jt} \left( a_{ij} - \sum_{l \neq t} w_{il} h_{jl} \right)}{\lambda + \sum_{j \in \Omega_i} h_{jt}^2} = \frac{\sum_{j \in \Omega_i} h_{jt} \left( r_{ij} + w_{it} h_{jt} \right)}{\lambda + \sum_{j \in \Omega_i} h_{jt}^2}$$

after which we can update $R$ via

$$r_{ij} \leftarrow r_{ij} - (\mu - w_{it}) h_{jt} \quad \forall j \in \Omega_i$$

both using $O(|\Omega_i|)$ operations
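
A single-coordinate update with residual maintenance might look as follows. This is a sketch under assumptions: the name `ccd_update`, the in-place mutation of `W` and `R`, and the dictionary keyed by `(i, j)` are illustrative choices, and $\lambda > 0$ is assumed so the denominator cannot vanish.

```python
def ccd_update(i, t, W, H, R, omega_i, lam):
    """Replace w_it by its optimal value mu and update residuals in place.

    R maps (i, j) -> r_ij = a_ij - w_i . h_j for (i, j) in Omega.
    omega_i lists the column indices j with (i, j) in Omega.
    """
    # numerator uses r_ij + w_it h_jt = a_ij - sum_{l != t} w_il h_jl
    num = sum(H[j][t] * (R[i, j] + W[i][t] * H[j][t]) for j in omega_i)
    den = lam + sum(H[j][t] ** 2 for j in omega_i)
    mu = num / den
    for j in omega_i:
        R[i, j] -= (mu - W[i][t]) * H[j][t]   # O(|Omega_i|) residual update
    W[i][t] = mu
    return mu


H = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
W = [[0.3, -0.2]]
a = {0: 1.0, 1: 2.0, 2: 3.0}
R = {(0, j): a[j] - sum(w * h for w, h in zip(W[0], H[j])) for j in a}
for t in range(2):                 # one sweep over the coordinates of w_0
    ccd_update(0, t, W, H, R, [0, 1, 2], lam=0.1)
print(W[0])
```

Keeping $R$ up to date is what makes each coordinate update cost $O(|\Omega_i|)$ rather than $O(|\Omega_i| k)$.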

Cyclic Coordinate Descent (CCD)

- Updating $w_i$ costs $O(|\Omega_i| k)$ operations with coordinate descent rather than $O(|\Omega_i| k^2 + k^3)$ operations with ALS
- By solving for all of $w_i$ at once, ALS obtains a more accurate solution than coordinate descent
- Coordinate descent with different update orderings:
  - Cyclic coordinate descent (CCD) updates all columns of $W$ then all columns of $H$ (ALS-like ordering)
  - CCD++ alternates between columns of $W$ and $H$
- All entries within a column can be updated concurrently

Parallel CCD++

- Yu, Hsieh, Si, and Dhillon 2013 propose using a row-blocked layout of $H$ and $W$
- Each processor keeps the corresponding $m/p$ rows and $n/p$ columns of $A$ and $R$ (using twice the minimal amount of memory)
- Every column update in CCD++ is then fully parallelized, but an allgather of each column is required to update $R$
- The complexity of updating all of $W$ and all of $H$ is then

$$T_p(m,n,k) = \Theta\!\left( k\, T_p^{\text{allgather}}(m+n) + \gamma\, Q_1(m,n,k)/p \right) = O\!\left( \alpha k \log p + \beta (m+n) k + \gamma |\Omega| k / p \right)$$
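
The step between the two expressions for $T_p$ can be spelled out as a short worked derivation. It assumes the usual cost model for an allgather of $s$ total words on $p$ processors, $O(\alpha \log p + \beta s)$, and takes the sequential work $Q_1$ from the per-row coordinate descent cost $O(|\Omega_i| k)$ summed over all rows of $W$ and $H$; neither assumption is stated explicitly on the slide.

```latex
\begin{align*}
T_p^{\text{allgather}}(s) &= O(\alpha \log p + \beta s)
  && \text{(assumed allgather cost model)} \\
k \, T_p^{\text{allgather}}(m+n) &= O\big(\alpha k \log p + \beta (m+n) k\big)
  && \text{($k$ columns of $W$ and of $H$)} \\
Q_1(m,n,k) &= O\Big(\sum_{i} |\Omega_i| \, k\Big) + O(|\Omega| k) = O(|\Omega| k)
  && \text{(all of $W$, then all of $H$)} \\
T_p(m,n,k) &= O\big(\alpha k \log p + \beta (m+n) k + \gamma |\Omega| k / p\big)
\end{align*}
```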

Gradient-Based Update

ALS minimizes $f$ with respect to $w_i$ exactly; gradient descent methods only improve it

Recall that we seek to minimize

$$f(w_i) = \sum_{j \in \Omega_i} \left( a_{ij} - w_i h_j^T \right)^2 + \lambda \|w_i\|_2^2$$

and use the partial derivative (written as a row vector, with $r_{ij} = a_{ij} - w_i h_j^T$)

$$\frac{\partial f(w_i)}{\partial w_i} = 2 \sum_{j \in \Omega_i} \left( w_i h_j^T - a_{ij} \right) h_j + 2 \lambda w_i = 2 \Big( \lambda w_i - \sum_{j \in \Omega_i} r_{ij} h_j \Big)$$

The gradient descent method updates

$$w_i \leftarrow w_i - \eta \, \frac{\partial f(w_i)}{\partial w_i}$$

where the parameter $\eta$ is our step size
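
The gradient step above can be sketched for a single row as follows. The names `row_gradient` and `gd_step`, the `{j: a_ij}` row storage, and the step size used in the demonstration are assumptions; residuals are recomputed from $w_i$ on each call rather than stored.

```python
def row_gradient(w, a_row, H, lam):
    """Gradient 2 * (lam * w - sum_j r_ij h_j), with r_ij = a_ij - w . h_j."""
    k = len(w)
    g = [2.0 * lam * w[s] for s in range(k)]
    for j, a_ij in a_row.items():          # j ranges over Omega_i
        r = a_ij - sum(w[s] * H[j][s] for s in range(k))
        for s in range(k):
            g[s] -= 2.0 * r * H[j][s]
    return g


def gd_step(w, a_row, H, lam, eta):
    """One gradient descent update w <- w - eta * grad f(w)."""
    g = row_gradient(w, a_row, H, lam)
    return [w[s] - eta * g[s] for s in range(len(w))]


H = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
a_row = {0: 1.0, 1: 2.0, 2: 3.0}
w = [0.0, 0.0]
for _ in range(200):
    w = gd_step(w, a_row, H, lam=0.1, eta=0.05)
print(w, row_gradient(w, a_row, H, 0.1))
```

The step size must stay below $2$ divided by the largest eigenvalue of the Hessian $2(G^{(i)} + \lambda I)$ for the iteration to converge; the value $0.05$ satisfies this for the small example shown.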

Stochastic Gradient Descent (SGD)

Stochastic gradient descent (SGD) performs fine-grained updates based on a component of the gradient

Again the full gradient is

$$\frac{\partial f(w_i)}{\partial w_i} = 2 \Big( \lambda w_i - \sum_{j \in \Omega_i} r_{ij} h_j \Big) = 2 \sum_{j \in \Omega_i} \left( \lambda w_i / |\Omega_i| - r_{ij} h_j \right)$$

SGD selects a random $(i,j) \in \Omega$ and updates $w_i$ using $h_j$ (the factor of 2 is absorbed into $\eta$)

$$w_i \leftarrow w_i - \eta \left( \lambda w_i / |\Omega_i| - r_{ij} h_j \right)$$

SGD then updates $r_{ij} = a_{ij} - w_i h_j^T$

Each update costs $O(k)$ operations
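
One SGD update for a sampled entry can be sketched as below. The function names, the fixed seed, and the choice to recompute $r_{ij}$ from the current $w_i$ at the start of each update (rather than reading a stored, possibly stale residual) are assumptions of this sketch; only $w_i$ is updated, with $H$ held fixed.

```python
import random


def sgd_update(w, j, a_ij, H, lam, eta, n_obs):
    """One SGD step on w_i for the sampled entry (i, j); n_obs = |Omega_i|.

    Returns the new w_i and the refreshed residual r_ij. Costs O(k).
    """
    k = len(w)
    r = a_ij - sum(w[s] * H[j][s] for s in range(k))
    w_new = [w[s] - eta * (lam * w[s] / n_obs - r * H[j][s]) for s in range(k)]
    r_new = a_ij - sum(w_new[s] * H[j][s] for s in range(k))
    return w_new, r_new


def row_loss(w, a_row, H, lam):
    """f(w_i): squared error over Omega_i plus lam * ||w_i||^2."""
    err = sum((a - sum(x * h for x, h in zip(w, H[j]))) ** 2 for j, a in a_row.items())
    return err + lam * sum(x * x for x in w)


H = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
a_row = {0: 1.0, 1: 2.0, 2: 3.0}
rng = random.Random(0)             # fixed seed so the run is repeatable
w = [0.0, 0.0]
for _ in range(500):
    j = rng.choice(list(a_row))    # sample an observed entry of row i
    w, _r = sgd_update(w, j, a_row[j], H, lam=0.1, eta=0.05, n_obs=len(a_row))
print(row_loss([0.0, 0.0], a_row, H, 0.1), row_loss(w, a_row, H, 0.1))
```

With a constant step size the iterates settle in a neighborhood of the minimizer rather than converging to it exactly, which is the usual trade-off for the $O(k)$ cost per update.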

Asynchronous SGD

- Parallelizing SGD is easy aside from ensuring concurrent updates do not conflict
- Asynchronous shared-memory implementations of SGD are popular and achieve high performance
- For sufficiently small step size, inconsistencies among updates (e.g. duplication) are not problematic statistically
- Asynchronicity can slow down convergence

Blocked SGD

- Distributed blocked SGD introduces further considerations
- Associate a task with updates on a block
- Can define a $p \times p$ grid of blocks of dimension $m/p \times n/p$
- Diagonals, superdiagonals, and subdiagonals of blocks can be updated independently, so $p$ tasks can execute concurrently
- Assuming $\Theta(|\Omega|/p^2)$ updates are performed on each block, the execution time for $|\Omega|$ updates is

$$T_p(m,n,k) = \Theta\!\left( \alpha p \log p + \beta \min(m,n) k + \gamma |\Omega| k / p \right)$$
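
One way to realize the diagonal scheduling is to pair each superdiagonal with the subdiagonal that completes it into a wrapped diagonal of $p$ blocks. The sketch below generates such a schedule; the function name `block_schedule` and the wrap-around pairing are assumptions for illustration, not a description of any particular implementation.

```python
def block_schedule(p):
    """Return p rounds; round s lists the block (q, (q + s) % p) for each task q.

    Blocks within a round share no block row and no block column, so the
    p tasks touch disjoint rows of W and disjoint rows of H.
    """
    return [[(q, (q + s) % p) for q in range(p)] for s in range(p)]


for s, blocks in enumerate(block_schedule(3)):
    print(s, blocks)
```

Over the $p$ rounds every one of the $p^2$ blocks is visited exactly once, and between rounds each task only needs to exchange its block of $H$ (or of $W$) with another task.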

Nonlinear Equations

Potential sources of parallelism in solving a nonlinear equation $f(x) = 0$ include

- Evaluation of function $f$ and its derivatives in parallel
- Parallel implementation of linear algebra computations (e.g., solving linear system in Newton-like methods)
- Simultaneous exploration of different regions via multiple starting points (e.g., if many solutions are sought or convergence is difficult to achieve)
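
The multiple-starting-point idea can be illustrated for a scalar equation with independent Newton iterations launched from several initial guesses. Everything named here (`newton`, `multistart_roots`, the tolerances, the test function) is a hypothetical example; a thread pool is used only to show the task structure, and in CPython separate processes would be needed for a real speedup on CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Scalar Newton iteration; returns a root estimate or None on failure."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        d = df(x)
        if d == 0.0:
            return None
        x -= fx / d
    return None


def multistart_roots(f, df, starts, workers=4):
    """Run Newton from every start concurrently and keep the distinct roots."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda x0: newton(f, df, x0), starts))
    roots = []
    for r in results:
        if r is not None and all(abs(r - s) > 1e-6 for s in roots):
            roots.append(r)
    return sorted(roots)


f = lambda x: x ** 3 - x            # roots at -1, 0, and 1
df = lambda x: 3 * x ** 2 - 1
print(multistart_roots(f, df, [-2.0, -0.1, 0.1, 2.0]))
```

The independent runs need no communication until the final step that merges duplicate roots, which makes this source of parallelism the easiest of the three to exploit.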

Optimization

Sources of parallelism in optimization problems include

- Evaluation of objective and constraint functions and their derivatives in parallel
- Parallel implementation of linear algebra computations (e.g., solving linear system in Newton-like methods)
- Simultaneous exploration of different regions via multiple starting points (e.g., if global optimum is sought or convergence is difficult to achieve)
- Multi-directional searches in direct search methods
- Decomposition methods for structured problems, such as linear, quadratic, or separable programming

References

- Candès, Emmanuel J., and Benjamin Recht. "Exact matrix completion via convex optimization." Foundations of Computational Mathematics 9.6 (2009): 717.
- Jain, Prateek, Praneeth Netrapalli, and Sujay Sanghavi. "Low-rank matrix completion using alternating minimization." Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing. ACM.
- Yu, H.F., Hsieh, C.J., Si, S., and Dhillon, I. "Scalable coordinate descent approaches to parallel matrix factorization for recommender systems." In 2012 IEEE 12th International Conference on Data Mining, December 2012.

References (continued)

- Recht, Benjamin, Christopher Re, Stephen Wright, and Feng Niu. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." In Advances in Neural Information Processing Systems.
- Gemulla, Rainer, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. "Large-scale matrix factorization with distributed stochastic gradient descent." In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
- Karlsson, Lars, Daniel Kressner, and André Uschmajew. "Parallel algorithms for tensor completion in the CP format." Parallel Computing 57 (2016).

References: Parallel Optimization

- J. E. Dennis and V. Torczon, Direct search methods on parallel machines, SIAM J. Optimization 1, 1991
- J. E. Dennis and Z. Wu, Parallel continuous optimization, J. Dongarra et al., eds., Sourcebook of Parallel Computing, Morgan Kaufmann, 2003
- F. A. Lootsma and K. M. Ragsdell, State-of-the-art in parallel nonlinear optimization, Parallel Computing 6, 1988
- R. Schnabel, Sequential and parallel methods for unconstrained optimization, M. Iri and K. Tanabe, eds., Mathematical Programming: Recent Developments and Applications, Kluwer, 1989
- S. A. Zenios, Parallel numerical optimization: current trends and an annotated bibliography, ORSA J. Comput. 1:20-43, 1989
