Tall-and-skinny! QRs and SVDs in MapReduce

Size: px

Start display at page:

Download "Tall-and-skinny! QRs and SVDs in MapReduce"

Derick Russell
5 years ago
Views:

1 A 1 A 2 A 3 Tall-and-skinny! QRs and SVDs in MapReduce Yangyang Hou " Purdue, CS Austin Benson " Stanford University Paul G. Constantine Col. School. Mines " Joe Nichols U. of Minn James Demmel " UC Berkeley Joe Ruthruff " Jeremy Templeton" Sandia CA Mainly funded by Sandia National Labs CSAR project, recently by NSF CAREER," and Purdue Research Foundation. David F. Gleich! Computer Science! Purdue University! 1

2 Big " simulation data 2

Nonlinear heat transfer model in random media Each run takes 5 hours on 8 processors, outputs 4M (node) by 9 (time-step) simulation Look at temperature Apply heat We

3 Nonlinear heat transfer model in random media Each run takes 5 hours on 8 processors, outputs 4M (node) by 9 (time-step) simulation Look at temperature Apply heat We did 8192 runs (128 samples of bubble locations, 64 bubble radii) 4.5 TB of data in Exodus II (NetCDF) publicdata/heat-transfer/ 3

4 Non-insulator regime 1 Proportion of temp. > 475 K True ROM RS 0.1 Insulator regime Bubble radius 4

5 S V T Extract 128 x 128 face to laptop A SVD U U F S V T Each simulation is a column 5B-by-64 matrix 2.2TB 5

6 Non-insulator regime Proportion of temp. > 475 K True ROM RS 0.1 Insulator regime Bubble radius 6

7 20 P. G. CONSTANTINE, D. F. GLEICH, Y. HOU, AND J. TEMPLETON s τ error s R(s, ) E(s, ) e e e e-04 std e e (c) Error, s =1.95 cm (d) Std, 1.01 s =1.95 cm e e Fig. 4.5: Error in the reduce order model compared to the prediction standard deviation23for one realization of the bubble locations at the final time for two values of e e-03 the bubble 19 radius, s =0.39 and s =1.95 cm. (Colors are visible in the electronic e-03 version.) e e e-02 std the varying conductivity fields took approximately 20 twenty minutes 2.26 to construct e-02 using Cubit Fig. after 4.4: substantial The log of optimizations. the relative10error in Working with the simulation data0 the mean prediction of the ROM as involved a func-tion of s and (d) Std, the s threshold =1.95 cm. (Colors are Constantine, Gleich,! few pre- and post-processing steps: interpret 4TB of Exodus II files from Aria, globally transpose the data, compute the TSSVD, and compute predictions and errors. The preprocessing steps took approximately 8-15 hours. We collected precise timing information, but we do not report visible in the electronic version.) values of Hou s. & Templeton arxiv it as these times are from a multi-tenant, unoptimized David Gleich Hadoop Purdue cluster Simons where other PDAIO Table 4.1: The split and the correspond ing ROM error for =0.55 and di eren 7

8 Dynamic Mode Decomposition One simulation, ~10TB of data, compute the SVD of a space-by-time matrix. DMD video 8

9 Is this BIG Data? A BIG Data has two properties - too big for one hard drive - skewed distribution BIG Data = Big Internet Giant Data BIG Data = Big In Gineering Data Engineering 9

10 A matrix A : m n, m n! is tall and skinny when O(n 2 )! work and storage is cheap compared to m. -- Austin Benson 10

11 Quick review of QR Let A : m n, m n, real A = QR Q is m n orthogonal (Q T Q = I ) R is n n upper triangular A = Q 0 R 11

12 Tall-and-skinny SVD and RSVD R SVD V Let A : m n, m n, real A TSQR Q A = UΣV T U is m n orthogonal Σ is m n nonneg, diag. V is n n orthogonal 12

13 There are good MPI implementations. What s left to do? 13

14 Moving data to an MPI cluster may be infeasible or costly 14

15 How to store tall-and-skinny matrices in Hadoop A : m x n, m n Key is an arbitrary row-id Value is the 1 x n array " for a row (or b x n block) Each submatrix A i is an " the input to a map task. A 1 A 2 A 3 15

16 Still, isn t this easy to do? Current MapReduce algs use the normal equations A = QR A T A Cholesky! R T R Q = AR 1 A 1 A 2 A 3 Map! A ii to A it A i Reduce! R T R = Sum(A it A i ) Map 2! A ii to A i R -1 Two problems! R inaccurate if A illconditioned Q not numerically orthogonal (House-" holder assures this) 16

17 Numerical stability was a problem for prior approaches 10 0 Prior work Previous methods couldn t ensure that the matrix Q was orthogonal norm ( Q T Q I ) AR Condition number 17

18 Four things that are better 1. A simple algorithm to compute R accurately. (but doesn t help get Q orthogonal). 2. Fast algorithm to get Q numerically orthogonal in most cases. 3. Multi-pass algorithm to get Q numerically orthogonal in virtually all cases. 4. A direct algorithm for a numerically orthogonal Q in all cases. Constantine & Gleich MapReduce 2011 Benson, Gleich & Demmel IEEE BigData

19 Numerical stability was a problem for prior approaches Previous methods couldn t ensure that the matrix Q was orthogonal norm ( Q T Q I ) Prior work AR Constantine & Gleich, MapReduce 2011 AR -1 + " iterative refinement Benson, Gleich, Demmel, BigData 13 Condition number 2. Benson, Gleich, Demmel, BigData Direct TSQR Benson, Gleich, " Demmel, BigData 13 19

20 MapReduce is great for TSQR! You don t need A T A Data A tall and skinny (TS) matrix by rows Input 500,000,000-by-50 matrix" Each record 1-by-50 row" HDFS Size GB Time to compute read A 253 sec. write A 848 sec.! Time to compute R in qr(a) 526 sec. w/ Q=AR sec. " Time to compute Q in qr(a) 3090 sec. (numerically stable)! 20

21 Communication avoiding QR (Demmel et al. 2008) First, do QR factorizations of each local matrix Second, compute a QR factorization of the new R 21

22 Serial QR factorizations! (Demmel et al. 2008) Compute QR of, read, update QR, 22

2 A 3 A 1 A 2 qr Q 2 R 2 A 3 qr Map QR factorization of rows Reduce QR factorization of rows Q 3 R 3

23 Communication avoiding QR (Demmel et al. 2008)! on MapReduce (Constantine and Gleich, 2011) Algorithm Data Rows of a matrix Mapper 1 Serial TSQR A 1 A 2 A 3 A 1 A 2 qr Q 2 R 2 A 3 qr Map QR factorization of rows Reduce QR factorization of rows Q 3 R 3 qr Q 4 R 4 emit Mapper 2 Serial TSQR A 5 A 6 A 7 A 5 A 6 qr Q 6 R 6 A 7 qr Q 7 R 7 A 8 A 8 qr Q 8 R 8 emit Reducer 1 Serial TSQR R 4 R 8 R 4 R 8 qr Q R emit 23

24 Too many maps cause too much data to one reducer! map R 1 emit reduce R 2,1 emit reduce R emit A A 1 A 2 Mapper 1-1 Serial TSQR map R 2 Mapper 1-2 Serial TSQR map R 3 emit emit shuffle S (1) S 1 AS 2 Reducer 1-1 Serial TSQR reduce R 2,2 Reducer 1-2 Serial TSQR reduce R 2,3 emit emit identity map shuffle S A (2) 2 Reducer 2-1 Serial TSQR A 3 Mapper 1-3 Serial TSQR AS 23 Reducer 1-3 Serial TSQR map R 4 emit A 34 Mapper 1-4 Serial TSQR Iteration 1 Iteration 2 24

25 Getting Q 25

26 Numerical stability was a problem for prior approaches Previous methods couldn t ensure that the matrix Q was orthogonal norm ( Q T Q I ) Prior work AR Constantine & Gleich, MapReduce 2011! AR -1 AR -1 + " iterative refinement Benson, Gleich, Demmel, BigData 13 Condition number 2. Benson, Gleich, Demmel, BigData Direct TSQR Benson, Gleich, " Demmel, BigData 13 26

27 Iterative refinement helps Mapper 1 R -1 A 1 Q 1 Mapper 2 T -1 Q 1 Q 1 TSQR R A 2 A 3 R -1 R -1 Q 2 Q 3 TDistribute T TSQR Q 2 Q 3 T -1 T -1 Q 2 Q 3 Distribute R R -1 Q 4 Q 4 T -1 Q 4 Iterative refinement is like using Newton s method to solve Ax = b. It s folklore that two iterations of iterative refinement are enough 27

28 What if iterative refinement is too slow? Mapper 1 R -1 A 1 Q 1 Mapper 2 R -1 T -1 A 1 Q 1 Sample S Compute QR, " distribute R A 2 A 3 R -1 R -1 Q 2 Q 3 TDistribute TR TSQR A 2 A 3 R -1 R -1 T -1 T -1 Q 2 Q 3 R -1 Q 4 R -1 T -1 Q 4 Estimate the norm by S Based on recent work by random matrix community on approximating QR with a random subset of rows. Also assumes that you can get a subset of rows cheaply possible, but nontrivial in Hadoop. 28

29 Numerical stability was a problem for prior approaches Previous methods couldn t ensure that the matrix Q was orthogonal norm ( Q T Q I ) Prior work AR Constantine & Gleich, MapReduce 2011 AR -1 AR -1 + " iterative refinement Benson, Gleich, Demmel, BigData 13! Condition number 2. Benson, Gleich, Demmel, BigData 13! 4. Direct TSQR Benson, Gleich, " Demmel, BigData 13 29

30 Recreate Q by storing the history of the factorization 3. Distribute the pieces of Q *1 and form the true Q Mapper 1 A 1 A 2 Q 1 Q 2 R 1 R 2 R output R 1 R 2 R 3 R 4 Task 2 Q 11 R Q 21 Q 31 Q 41 Q output Mapper 3 Q 11 Q 1 Q 21 Q 2 Q 1 Q 2 A 3 Q 3 Q 4 R 3 R 4 2. Collect R on one node, compute Qs for each piece Q 3 Q 4 Q 31 Q 41 Q 3 Q 4 1. Output local Q and R in separate files 30

31 Theoretical lower bound on runtime for a few cases on our small cluster Model Actual Rows Cols Old R-only + no IR R-only + PIR R-only + IR Direct TSQR 4.0B B B B Rows Cols Old R-only + no IR R-only + PIR R-only + IR Direct TSQR 4.0B B B B All values in seconds Only two params needed read and write bandwidth for the cluster in order to derive a performance model of the algorithm. This simple model is almost within a factor of two of the true runtime. " (10-node cluster, 60 disks) 31

32 " Papers Constantine & Gleich, MapReduce 2011 Benson, Gleich & Demmel, BigData 13 Constantine & Gleich, ICASSP 2012 Constantine, Gleich, Hou & Templeton, " arxiv 2013 Code Questions? BIG Bloody Imposing Graphs Building Impressions of Groundtruth Blockwise Independent Guesses Best Implemented at Google 32

Experiences with Model Reduction and Interpolation

Experiences with Model Reduction and Interpolation Paul Constantine Stanford University, Sandia National Laboratories Qiqi Wang (MIT) David Gleich (Purdue) Emory University March 7, 2012 Joe Ruthruff (SNL)