Practicality of Large Scale Fast Matrix Multiplication
1 Practicality of Large Scale Fast Matrix Multiplication
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz
UC Berkeley. IWASEP, June 5, 2012, Napa Valley, CA.
Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG ). Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, and Samsung. Research is also supported by DOE grants DE-SC , DE-SC , and DE-AC02-05CH11231; the Sofja Kovalevskaja programme of the Alexander von Humboldt Foundation; and by the National Science Foundation under agreement DMS .
2 Introduction
Classical matrix multiplication is nearly ubiquitous, even though asymptotically faster algorithms have been known since 1969.
Concerns about fast matrix multiplication: practical speed and stability. This talk addresses both concerns.
3 Outline
- Strassen's algorithm
- New parallel algorithm: communication optimal, faster in practice
- Stability of Strassen: normwise error bound; diagonal scaling and improved error bounds; stability experiments
4 Recall: Strassen's fast matrix multiplication
Strassen's original algorithm uses 7 multiplies and 18 adds for n = 2. It is applied recursively (blockwise).
M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 − B22)
M4 = A22 (B21 − B11)
M5 = (A11 + A12) B22
M6 = (A21 − A11)(B11 + B12)
M7 = (A12 − A22)(B21 + B22)
C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6
T(n) = 7 T(n/2) + O(n^2), so T(n) = Θ(n^{log2 7}).
Improved by Winograd to 15 additions.
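As a concrete illustration (not code from the talk), here is a minimal recursive Strassen in Python/NumPy; the function name, the power-of-two restriction, and the `cutoff` fallback to the classical product are assumptions of this sketch.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices A and B with Strassen's recursion.

    Assumes n is a power of 2; below `cutoff` it falls back to the
    classical product (both the name and the cutoff are illustrative).
    """
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven recursive products from the slide.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

# Quick check against the classical product.
A = np.random.randn(256, 256); B = np.random.randn(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```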
5 Communication costs
Two kinds of costs: arithmetic (FLOPs) and communication, i.e. moving data between levels of a memory hierarchy (sequential case) or over a network connecting processors (parallel case).
Communication is becoming more expensive relative to computation.
6 Communication lower bounds for matrix multiplication
Algorithms attaining these bounds?
Strassen: sequential bandwidth Ω( (n/√M)^{log2 7} · M ); parallel bandwidth Ω( (n/√M)^{log2 7} · M/P ), and memory-independent Ω( n^2 / P^{2/log2 7} ).
Classic (cubic): sequential bandwidth Ω( (n/√M)^{log2 8} · M ); parallel bandwidth Ω( (n/√M)^{log2 8} · M/P ), and memory-independent Ω( n^2 / P^{2/log2 8} ).
9 Lessons from lower bounds
Don't use a classical algorithm for the communication: Strassen can communicate less than classical.
Make local multiplies as large as possible: use all available memory, up to O(n^2 / P^{2/log2 7}); the communication bound decreases with increased memory.
Send memory-sized messages to minimize latency.
10 Main Idea of CAPS algorithm
At each level of the recursion tree, choose either breadth-first or depth-first traversal of the recursion tree.
Breadth-First-Search (BFS): runs all 7 multiplies in parallel, each on P/7 processors; requires 7/4 as much extra memory; requires communication, but all-BFS minimizes communication if memory allows.
Depth-First-Search (DFS): runs all 7 multiplies sequentially, each on all P processors; requires only 1/4 as much extra memory; no immediate communication, but increases bandwidth by a factor of 7/4 and latency by a factor of 7.
11 Main Idea of CAPS algorithm
At each level of the recursion tree, choose either breadth-first or depth-first traversal; a toy code sketch of this rule follows.
CAPS:
  if enough memory and P ≥ 7 then
    BFS step
  else
    DFS step
  end if
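A toy Python sketch of this traversal rule, modeling only the control flow (no real interprocessor communication); the memory estimate and all names are illustrative assumptions, not the talk's implementation.

```python
def caps_step(n, P, M):
    """Toy sketch of the CAPS traversal choice (control flow only).

    n: current (sub)matrix dimension, P: processors available at this
    level, M: local memory in words.
    """
    if n <= 1 or P == 1:
        return ["local multiply"]
    # Rough extra-memory need of a BFS step: 7/4 of the current working
    # set (three n-by-n blocks), spread over the P processors.
    bfs_extra = (7.0 / 4.0) * (3 * n * n) / P
    if P >= 7 and bfs_extra <= M:
        # BFS: all 7 subproblems proceed in parallel on P/7 processors each.
        return ["BFS"] + caps_step(n // 2, P // 7, M)
    # DFS: the 7 subproblems run one after another on all P processors.
    return ["DFS"] + caps_step(n // 2, P, M)

# Prints the chosen step types, ending in 'local multiply'.
print(caps_step(n=2**16, P=7**3, M=2**22))
```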
12 Asymptotic costs analysis (ω0 = log2 7; ℓ is the number of Strassen steps taken)
Strassen-based:
  Lower bound:  flops n^ω0 / P;  bandwidth max{ n^ω0 / (P M^{ω0/2 − 1}), n^2 / P^{2/ω0} }
  2D-Strassen:  flops n^ω0 / P^{(ω0−1)/2};  bandwidth n^2 / P^{1/2}
  Strassen-2D:  flops (7/8)^ℓ n^3 / P;  bandwidth (7/4)^ℓ n^2 / P^{1/2}
  CAPS:         flops n^ω0 / P;  bandwidth max{ n^ω0 / (P M^{ω0/2 − 1}), n^2 / P^{2/ω0} }
Classical:
  Lower bound:  flops n^3 / P;  bandwidth max{ n^3 / (P M^{1/2}), n^2 / P^{2/3} }
  2D:           flops n^3 / P;  bandwidth n^2 / P^{1/2}
  2.5D:         flops n^3 / P;  bandwidth max{ n^3 / (P M^{1/2}), n^2 / P^{2/3} }
13 Performance of CAPS
Strong scaling on Franklin (Cray XT4). Up to 184% faster than previous Strassen-based algorithms; 51%-84% faster than the best classical algorithm.
14 CAPS Summary
The CAPS matrix multiplication algorithm:
- is communication optimal: it matches the communication lower bounds and moves asymptotically less data than all existing algorithms;
- is faster, asymptotically and in practice: faster than any parallel classical algorithm can be, and faster than any parallel Strassen-based algorithm we are aware of;
- applies to other fast matrix multiplication algorithms, though there might not be any other practical ones.
15 Stability
CAPS has the same stability properties as any other Strassen (Strassen-Winograd) algorithm: a weaker stability guarantee than classical, but still normwise stable.
This can be improved through diagonal scaling. The two best scaling schemes give incomparable bounds; one can check which bound is better in O(n^2) time.
The improved error bounds match those of matrix factorizations such as classical LU and QR.
16 Diagonal Scaling
Outside scaling: D^A C D^B = (D^A A)(B D^B). Scale so each row of A and each column of B has unit norm. Explicitly:
- Let D^A_ii = ‖A(i,:)‖^{-1} and D^B_jj = ‖B(:,j)‖^{-1}.
- Scale A' = D^A A and B' = B D^B.
- Use Strassen for the product C' = A' B'.
- Unscale C = (D^A)^{-1} C' (D^B)^{-1}.
Inside scaling: C = (AD)(D^{-1} B). Scale so each column of A has the same norm as the corresponding row of B. Explicitly:
- Let D_ii = ( ‖A(:,i)‖ / ‖B(i,:)‖ )^{-1/2}.
- Scale A' = AD and B' = D^{-1} B.
- Use Strassen for the product C = A' B'.
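A NumPy sketch of both scalings, wrapped around an ordinary matrix product standing in for Strassen; the function names and the `multiply` parameter are illustrative assumptions, and zero rows or columns would need guarding.

```python
import numpy as np

def outside_scaled_multiply(A, B, multiply=np.matmul):
    """Outside scaling: unit-norm rows of A and columns of B.

    `multiply` stands in for a Strassen routine; it defaults to the
    classical product in this sketch.
    """
    dA = 1.0 / np.linalg.norm(A, axis=1)   # D^A_ii = ||A(i,:)||^{-1}
    dB = 1.0 / np.linalg.norm(B, axis=0)   # D^B_jj = ||B(:,j)||^{-1}
    C_scaled = multiply(dA[:, None] * A, B * dB[None, :])
    return C_scaled / dA[:, None] / dB[None, :]  # unscale

def inside_scaled_multiply(A, B, multiply=np.matmul):
    """Inside scaling: column i of A gets the same norm as row i of B."""
    d = (np.linalg.norm(A, axis=0) / np.linalg.norm(B, axis=1)) ** -0.5
    return multiply(A * d[None, :], B / d[:, None])

A = np.random.randn(128, 128); B = np.random.randn(128, 128)
assert np.allclose(outside_scaled_multiply(A, B), A @ B)
assert np.allclose(inside_scaled_multiply(A, B), A @ B)
```

With an exact `multiply`, scaling followed by unscaling reproduces A·B, which is what the asserts check; the stability benefit only appears when `multiply` is an actual floating-point Strassen implementation.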
18 Error bounds
|C_ij − Ĉ_ij| ≤ O(ε) f(n) · X, with X given by:
Classical: ( |A| · |B| )_ij
Outside-Inside: ‖A(i,:)‖ · ‖B(:,j)‖
Inside-Outside: ‖(AD)(i,:)‖ · ‖(D^{-1}B)(:,j)‖
Outside: ‖A(i,:)‖ · ‖B(:,j)‖ · ‖D^A A‖ · ‖B D^B‖
Inside: ‖AD‖ · ‖D^{-1}B‖
No Scaling: ‖A‖ · ‖B‖
X ≺ Y means that bound X is stronger than bound Y.
19 Scaling example: easy case
[Plot: max_ij |Ĉ_ij − C_ij| / (|A| · |B|)_ij, from 1e-16 up to 1e-06, versus the number of Strassen steps, n = 1024; curves for No scaling, Outer, Inner, Outer-Inner, Inner-Outer. Each of the following example slides shows the same axes for a different pair of 2x2-block test matrices with entries of size 1 and ε, not recoverable from the transcription.]
20 Scaling example: needs outer scaling [plot as above]
21 Scaling example: needs inner and outer [plot as above]
22 Scaling example: inner-outer better than outer-inner [plot as above]
23 Scaling example: outer-inner better than inner-outer [plot as above]
24 Scaling example: problems scaling can't fix [plot as above]
25 Error bounds (repeats slide 18)
26 Stability summary
Scaling improves the error bound of Strassen to be comparable to many other algorithms, but still not as good as the classical algorithm. It applies to other fast matrix multiplication algorithms as well.
Inner-then-outer or outer-then-inner scaling are best; one can choose between them by evaluating their bounds.
Open problem: simultaneously attain the inner and outer scaling bounds?
27 Practicality of Large Scale Fast Matrix Multiplication
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz
Thank You!
(Funding acknowledgments as on the title slide.)
28 References
G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Communication-optimal parallel algorithm for Strassen's matrix multiplication. Technical Report, EECS Department, UC Berkeley, March 2012. To appear in SPAA 2012.
G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds. Technical Report, EECS Department, UC Berkeley, March 2012. To appear in SPAA 2012.
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Graph expansion and communication costs of fast matrix multiplication. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '11, pages 1-12, New York, NY, USA, 2011. ACM.
B. Lipshitz, G. Ballard, O. Schwartz, and J. Demmel. Communication-avoiding parallel Strassen: Implementation and performance. Technical Report, EECS Department, UC Berkeley, May 2012. Submitted to SC 2012.
29 Extra slides
1. Previous Parallel Strassen
2. Data Layout
3. Strassen-Winograd Algorithm
4. Hardware scaling
5. Time breakdown
6. Small problems
30 Previous parallel Strassen-based algorithms
2D-Strassen [Luo, Drake '95]: run classical 2D across processors (same communication costs as classical 2D), then run Strassen locally. Can't use Strassen on the full matrix size.
Strassen-2D [Luo, Drake '95; Grayson, Shah, van de Geijn '95]: run Strassen across processors (this part can be done without communication), then run classical 2D. Communication costs grow exponentially with the number of Strassen steps.
Neither is communication optimal.
33 Data Layout
[Figure only; the layout diagram is not recoverable from the transcription.]
34 Strassen-Winograd Algorithm
C = (C11 C12; C21 C22) = A · B = (A11 A12; A21 A22)(B11 B12; B21 B22)
Q_i = S_i · T_i for i = 0, ..., 6, with:
S0 = A11         T0 = B11
S1 = A12         T1 = B21
S2 = A21 + A22   T2 = B12 − B11
S3 = S2 − A11    T3 = B22 − T2
S4 = A11 − A21   T4 = B22 − B12
S5 = A12 − S3    T5 = B22
S6 = A22         T6 = T3 − B21
U1 = Q0 + Q3, U2 = U1 + Q4, U3 = U1 + Q2
C11 = Q0 + Q1, C12 = U3 + Q5, C21 = U2 − Q6, C22 = U2 + Q2
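The formulas above transcribe directly into a one-level NumPy check (an illustrative sketch assuming even n, not the talk's implementation):

```python
import numpy as np

def winograd_step(A, B):
    """One level of the Strassen-Winograd recursion above (n even)."""
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The S/T combinations (15 additions in total, counting the C updates).
    S2 = A21 + A22; S3 = S2 - A11; S5 = A12 - S3
    T2 = B12 - B11; T3 = B22 - T2; T6 = T3 - B21
    # The seven products Q_i = S_i T_i.
    Q0 = A11 @ B11; Q1 = A12 @ B21; Q2 = S2 @ T2; Q3 = S3 @ T3
    Q4 = (A11 - A21) @ (B22 - B12); Q5 = S5 @ B22; Q6 = A22 @ T6
    U1 = Q0 + Q3; U2 = U1 + Q4; U3 = U1 + Q2
    C = np.empty_like(A)
    C[:h, :h] = Q0 + Q1; C[:h, h:] = U3 + Q5
    C[h:, :h] = U2 - Q6; C[h:, h:] = U2 + Q2
    return C

A = np.random.randn(200, 200); B = np.random.randn(200, 200)
assert np.allclose(winograd_step(A, B), A @ B)
```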
35 Implications for hardware scaling
Requirements so that dense matrix multiplication is computation bound (γ = time per flop, β = time per word of bandwidth, α = time per message, M = local memory in words):
            Bandwidth requirement     Latency requirement
Classic     γ M^{1/2} ≥ β             γ M^{3/2} ≥ α
Strassen    γ M^{ω0/2 − 1} ≥ β        γ M^{ω0/2} ≥ α
Strassen performs fewer flops and less communication, but is more demanding on the hardware.
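As a toy numerical check of these conditions (the machine parameters below are made up, not measurements from the talk):

```python
# Evaluate the computation-bound conditions for a hypothetical machine.
gamma = 1e-11   # seconds per flop
beta  = 4e-10   # seconds per word moved
alpha = 1e-6    # seconds per message
omega0 = 2.81   # roughly log2(7)

for M in (2**16, 2**20, 2**24):  # local memory sizes in words
    classic_bw   = gamma * M**0.5             >= beta
    classic_lat  = gamma * M**1.5             >= alpha
    strassen_bw  = gamma * M**(omega0/2 - 1)  >= beta
    strassen_lat = gamma * M**(omega0/2)      >= alpha
    # Strassen's left-hand sides are smaller, so its conditions are
    # harder to satisfy on the same hardware.
    print(M, classic_bw, classic_lat, strassen_bw, strassen_lat)
```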
36 CAPS time breakdown
[Plot: wall time in seconds versus P ∈ {49, 343, 2401} on Franklin, for a fixed n; components: local multiplications, communication, local additions, re-arrangement, other; a perfect-scaling line is shown and the strong-scaling range is marked.]
37 Performance on small problems
[Plot: execution time in seconds versus number of cores (1e1 to 1e4) on Franklin, n = 3136.]
38 Sequential recursive Strassen is communication optimal
Run the Strassen algorithm recursively. When blocks are small enough, they fit in local memory, so there is no further bandwidth cost:
W(n, M) = 7 W(n/2, M) + O(n^2) if 3n^2 > M, and W(n, M) = O(n^2) otherwise.
The solution is W(n, M) = O( n^{ω0} / M^{ω0/2 − 1} ).
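Unrolling the recurrence gives the stated solution; a short sketch of the standard reasoning (not verbatim from the slide), with ω0 = log2 7:

```latex
% Recurse k times, where 2^k = \Theta(n/\sqrt{M}), until 3(n/2^k)^2 \le M:
W(n,M) \;=\; \sum_{i=0}^{k-1} 7^i\,O\!\Big(\frac{n^2}{4^i}\Big) + 7^k\,O(M)
       \;=\; O\big(7^k M\big)
       \;=\; O\!\Big(\Big(\tfrac{n}{\sqrt{M}}\Big)^{\omega_0} M\Big)
       \;=\; O\!\Big(\frac{n^{\omega_0}}{M^{\omega_0/2-1}}\Big),
% using n^2/4^k = \Theta(M) to absorb the geometric sum into 7^k M.
```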
39 Extra slides (repeats slide 29)