Discovering Fast Matrix Multiplication Algorithms via Tensor Decomposition
1 Discovering Fast Matrix Multiplication Algorithms via Tensor Decomposition
Grey Ballard
SIAM Conference on Computational Science & Engineering, March 2017
2 Collaborators
This is joint work with...
Austin Benson - Stanford University
Christian Ikenmeyer - Max Planck Institute for Informatics
Tammy Kolda - Sandia National Laboratories
J.M. Landsberg - Texas A&M University
Kathryn Rouse - Wake Forest University
Nick Ryder - University of California, Berkeley
Ballard 4
3 Fast Matrix Multiplication
Definition: Fast matrix multiplication algorithms require o(n^3) arithmetic operations to multiply n x n matrices.
Examples:
Strassen [Str69]: O(n^(log2 7)) = O(n^2.81)
Coppersmith-Winograd [CW87]: O(n^2.376)
Le Gall [LG14]: O(n^2.3728639)
4 Strassen's Algorithm
Strassen's algorithm uses 7 multiplies for 2 x 2 multiplication:
[C11 C12; C21 C22] = [A11 A12; A21 A22] [B11 B12; B21 B22]

Classical Algorithm (8 multiplies):
M1 = A11*B11, M2 = A12*B21, M3 = A11*B12, M4 = A12*B22,
M5 = A21*B11, M6 = A22*B21, M7 = A21*B12, M8 = A22*B22
C11 = M1 + M2, C12 = M3 + M4, C21 = M5 + M6, C22 = M7 + M8

Strassen's Algorithm (7 multiplies):
M1 = (A11 + A22)*(B11 + B22)
M2 = (A21 + A22)*B11
M3 = A11*(B12 - B22)
M4 = A22*(B21 - B11)
M5 = (A11 + A12)*B22
M6 = (A21 - A11)*(B11 + B12)
M7 = (A12 - A22)*(B21 + B22)
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
5 Strassen's Algorithm
Strassen's algorithm uses 7 multiplies for 2 x 2 multiplication:
[C11 C12; C21 C22] = [A11 A12; A21 A22] [B11 B12; B21 B22]
For n x n matrices, we split into quadrants and use recursion.
Flop count recurrence:
F(n) = 7*F(n/2) + O(n^2), F(1) = 1  =>  F(n) = O(n^(log2 7))
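The recursion described above can be sketched directly. The following is an illustrative NumPy implementation (not from the talk), assuming n is a power of 2 and recursing all the way down to 1 x 1 blocks:

```python
import numpy as np

def strassen(A, B):
    """Multiply square matrices via Strassen's recursion.

    Assumes n is a power of 2; the recursion bottoms out at 1x1,
    matching the recurrence F(n) = 7 F(n/2) + O(n^2).
    """
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])
```

In practice one would stop the recursion at a larger base-case size and call a tuned classical kernel, since the O(n^2) additions dominate for small blocks.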
6 Recursion allows us to focus on base case
2 x 2 base case:
[a11 a12; a21 a22] [b11 b12; b21 b22] = [c11 c12; c21 c22]

multiplies | flop count
6          | O(n^2.58)
7          | O(n^2.81)
8          | O(n^3)
8 Recursion allows us to focus on base case
2 x 2 base case:
[a11 a12; a21 a22] [b11 b12; b21 b22] = [c11 c12; c21 c22]

multiplies | flop count
6          | O(n^2.58)
7          | O(n^2.81)
8          | O(n^3)

3 x 3 base case:
[a11 a12 a13; a21 a22 a23; a31 a32 a33] [b11 b12 b13; b21 b22 b23; b31 b32 b33] = [c11 c12 c13; c21 c22 c23; c31 c32 c33]

multiplies | flop count
19         | O(n^2.68)
21         | O(n^2.77)
23         | O(n^2.85)
27         | O(n^3)
9 Searching for a base case algorithm
Finding a base case algorithm corresponds to computing an exact CP decomposition of a particular 3D tensor:
T = sum_{r=1}^{R} u_r o v_r o w_r
*Note the = sign: we're not looking for an approximation
10 2 x 2 matrix multiplication as a tensor operation
A*B = C, i.e. [a11 a12; a21 a22] [b11 b12; b21 b22] = [c11 c12; c21 c22],
is equivalent to
T x1 [a11; a21; a12; a22] x2 [b11; b21; b12; b22] = [c11; c21; c12; c22]
where T is a 4 x 4 x 4 tensor with the following slices:
T1 = [1 0 0 0; 0 0 0 0; 0 1 0 0; 0 0 0 0]
T2 = [0 0 0 0; 1 0 0 0; 0 0 0 0; 0 1 0 0]
T3 = [0 0 1 0; 0 0 0 0; 0 0 0 1; 0 0 0 0]
T4 = [0 0 0 0; 0 0 1 0; 0 0 0 0; 0 0 0 1]
11 2 x 2 matrix multiplication as a tensor operation
A*B = C, i.e. [a11 a12; a21 a22] [b11 b12; b21 b22] = [c11 c12; c21 c22],
is equivalent to
T x1 [a11; a21; a12; a22] x2 [b11; b21; b12; b22] = [c11; c21; c12; c22]
For example, the second slice gives:
[a11 a21 a12 a22] T2 [b11; b21; b12; b22] = a21*b11 + a22*b21 = c21
12 General matrix multiplication tensor
⟨M, P, N⟩ means multiplying an M x P matrix A by a P x N matrix B to get an M x N matrix C.
The matrix multiplication tensor T is MP x PN x MN.
The number of 1s in T is MPN.
⟨M, P, N⟩, ⟨N, M, P⟩, ⟨P, N, M⟩, ⟨P, M, N⟩, ⟨M, N, P⟩, ⟨N, P, M⟩ are all equivalent.
13 Matrix multiplication using low-rank decomposition
Here's the matrix multiplication as a tensor operation again:
T x1 a x2 b = c
Here's our low-rank decomposition:
T = sum_{r=1}^{R} u_r o v_r o w_r
Here's an encoding of our matrix multiplication algorithm:
T x1 a x2 b = sum_{r=1}^{R} (u_r o v_r o w_r) x1 a x2 b = sum_{r=1}^{R} (a^T u_r)(b^T v_r) w_r = c
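The formula above is directly executable. Below is an illustrative sketch (not from the talk) that evaluates c = sum_r (a^T u_r)(b^T v_r) w_r using Strassen's factor matrices for the 2 x 2 case; the column-major vec ordering (A11, A21, A12, A22) is my assumption, and the slides' ordering may differ:

```python
import numpy as np

# Strassen's factor matrices for <2,2,2>; rows follow the column-major
# vec ordering (A11, A21, A12, A22), columns index the multiplies M1..M7.
U = np.array([[ 1,  0,  1,  0,  1, -1,  0],
              [ 0,  1,  0,  0,  0,  1,  0],
              [ 0,  0,  0,  0,  1,  0,  1],
              [ 1,  1,  0,  1,  0,  0, -1]])
V = np.array([[ 1,  1,  0, -1,  0,  1,  0],
              [ 0,  0,  0,  1,  0,  0,  1],
              [ 0,  0,  1,  0,  0,  1,  0],
              [ 1,  0, -1,  0,  1,  0,  1]])
W = np.array([[ 1,  0,  0,  1, -1,  0,  1],
              [ 0,  1,  0,  1,  0,  0,  0],
              [ 0,  0,  1,  0,  1,  0,  0],
              [ 1, -1,  1,  0,  0,  1,  0]])

def multiply_from_factors(A, B, U, V, W):
    """Compute C = A*B via c = sum_r (u_r^T a)(v_r^T b) w_r."""
    a = A.flatten(order="F")       # column-major vec(A)
    b = B.flatten(order="F")
    m = (U.T @ a) * (V.T @ b)      # the R scalar products M_r
    c = W @ m                      # linear combinations forming vec(C)
    return c.reshape(A.shape, order="F")
```

The same function works unchanged for any (U, V, W) triple that exactly decomposes the matrix multiplication tensor, which is the whole point of the tensor formulation.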
14 Connection between factor matrices and algorithm
Strassen's algorithm:
M1 = (A11 + A22)*(B11 + B22)
M2 = (A21 + A22)*B11
M3 = A11*(B12 - B22)
M4 = A22*(B21 - B11)
M5 = (A11 + A12)*B22
M6 = (A21 - A11)*(B11 + B12)
M7 = (A12 - A22)*(B21 + B22)
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
Strassen's factor matrices (columns indexed by M1, ..., M7; rows by the column-major vec orderings A11, A21, A12, A22 / B11, B21, B12, B22 / C11, C21, C12, C22):
U = [ 1  0  1  0  1 -1  0
      0  1  0  0  0  1  0
      0  0  0  0  1  0  1
      1  1  0  1  0  0 -1 ]
V = [ 1  1  0 -1  0  1  0
      0  0  0  1  0  0  1
      0  0  1  0  0  1  0
      1  0 -1  0  1  0  1 ]
W = [ 1  0  0  1 -1  0  1
      0  1  0  1  0  0  0
      0  0  1  0  1  0  0
      1 -1  1  0  0  1  0 ]
The U, V, W matrices encode the algorithm.
16 How do you solve it?
Problem: Find U, V, W such that T = sum_r u_r o v_r o w_r
the problem is NP-complete (for general T)
many combinatorial formulations of the problem
efficient numerical methods can compute low-rank approximations
typical approach is alternating least squares (ALS)
  pitfall: getting stuck at local minima with residual > 0
  pitfall: facing ill-conditioned linear least squares problems
  pitfall: numerical solution is good only to machine precision
we seek exact, discrete, and sparse solutions
17 Alternating least squares with regularization
Most successful scheme due to Smirnov [Smi13]. Repeat:
1. U = argmin_U || T_(U) - U (W kr V)^T ||_F^2 + lambda || U - Utilde ||_F^2
2. V = argmin_V || T_(V) - V (W kr U)^T ||_F^2 + lambda || V - Vtilde ||_F^2
3. W = argmin_W || T_(W) - W (V kr U)^T ||_F^2 + lambda || W - Wtilde ||_F^2
until convergence (kr denotes the Khatri-Rao product, T_(U) the corresponding unfolding of T).
The art of the optimization scheme is in tinkering with lambda, Utilde, Vtilde, Wtilde (each iteration).
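The updates above can be sketched as follows. This is an illustrative NumPy version, not Smirnov's implementation; as a simplification the targets Utilde, Vtilde, Wtilde are taken to be the previous iterates (a proximal choice), whereas the talk stresses that tinkering with them per iteration is the real art:

```python
import numpy as np

def khatri_rao(A, B):
    """Khatri-Rao product: column r is kron(A[:, r], B[:, r])."""
    return np.einsum("ir,jr->ijr", A, B).reshape(-1, A.shape[1])

def regularized_step(T_unf, K, lam, F_target):
    """Solve min_F ||T_unf - F K^T||_F^2 + lam ||F - F_target||_F^2."""
    G = K.T @ K + lam * np.eye(K.shape[1])
    return np.linalg.solve(G.T, (T_unf @ K + lam * F_target).T).T

def matmul_tensor(n):
    """<n,n,n> tensor: entry (a,b,c) is 1 when vec(A)[a]*vec(B)[b]
    contributes to vec(C)[c], with column-major vec ordering."""
    T = np.zeros((n * n, n * n, n * n))
    for i in range(n):
        for j in range(n):
            for l in range(n):
                T[i + n * j, j + n * l, i + n * l] = 1.0
    return T

def als(T, R, lam, iters, seed=0):
    I, J, K = T.shape
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((I, R))
    V = rng.standard_normal((J, R))
    W = rng.standard_normal((K, R))
    # Mode-wise unfoldings consistent with T_(U) = U (W kr V)^T, etc.
    T1 = T.transpose(0, 2, 1).reshape(I, K * J)
    T2 = T.transpose(1, 2, 0).reshape(J, K * I)
    T3 = T.transpose(2, 1, 0).reshape(K, J * I)
    for _ in range(iters):
        U = regularized_step(T1, khatri_rao(W, V), lam, U)  # target = previous U
        V = regularized_step(T2, khatri_rao(W, U), lam, V)
        W = regularized_step(T3, khatri_rao(V, U), lam, W)
    return U, V, W

def residual(T, U, V, W):
    return np.linalg.norm(T - np.einsum("ir,jr,kr->ijk", U, V, W))
```

With this proximal choice of target, each exact subproblem solve cannot increase the residual, so the fit is monotone; the pitfalls listed on the previous slide (local minima, ill-conditioning, machine-precision accuracy) are exactly what this sketch runs into on harder cases.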
18 Discovered algorithms (a subset)

Algorithm          Multiplies (fast)  Multiplies (classical)  Speedup per recursive step  Exponent
⟨2,2,3⟩ [BB15]     11                 12                      9%                          2.89
⟨2,2,5⟩ [BB15]     18                 20                      11%                         2.89
⟨2,2,2⟩ [Str69]    7                  8                       14%                         2.81
⟨2,2,4⟩ [BB15]     14                 16                      14%                         2.85
⟨3,3,3⟩ [BB15]     23                 27                      17%                         2.85
⟨2,3,3⟩ [BB15]     15                 18                      20%                         2.81
⟨2,3,4⟩ [BB15]     20                 24                      20%                         2.83
⟨2,4,4⟩ [BB15]     26                 32                      23%                         2.82
⟨3,3,4⟩ [BB15]     29                 36                      24%                         2.82
⟨3,4,4⟩ [Smi17]    38                 48                      26%                         2.82
⟨3,3,6⟩ [Smi13]    40                 54                      35%                         2.77
⟨2,2,3⟩* [BCRL79]  10                 12                      20%                         2.78
⟨3,3,3⟩* [Sch81]   21                 27                      29%                         2.77
20 Are these algorithms faster in practice?
[Figure: Effective GFLOPS vs. dimension n for square matrix multiplication [BB15], comparing MKL, STRASSEN, BINI, ⟨3,2,3⟩, ⟨3,3,3⟩, ⟨4,2,4⟩, and ⟨3,3,6⟩.]
21 Are these algorithms faster in practice?
[Figure from [HRMvdG17]: Effective GFLOPS (2mnk/time) vs. k for rectangular matrix multiplication with m = n = 14400, one level of recursion, one core, comparing one step of many ⟨m,p,n⟩ algorithms (⟨2,2,2⟩, ⟨2,3,2⟩, ⟨2,3,4⟩, ..., ⟨6,3,3⟩) against MKL and BLIS.]
22 How big can we go?
Current numerical techniques are hitting their limits:
tensor size grows like N^6 if M = P = N
number of variables grows faster than 3N^4 if M = P = N
Nothing new has been found for ⟨4,4,4⟩ (Strassen's ⟨2,2,2⟩ algorithm can be used twice).
Can we exploit properties particular to matrix multiplication?
25 Cyclic symmetry of square matrix multiplication
Let M be the matrix multiplication tensor for M = P = N.
M has cyclic symmetry: m_ijk = m_kij = m_jki
This means if (U, V, W) is a solution, then so are (W, U, V) and (V, W, U).
Is this property reflected in the low-rank decomposition?
sum_r u_r o v_r o w_r  vs.  sum_r w_r o u_r o v_r  vs.  sum_r v_r o w_r o u_r ?
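The symmetry is easy to check numerically. The sketch below (mine, not from the talk) builds the ⟨N,N,N⟩ tensor in the ordering where the third mode indexes the transpose of C; that ordering convention is my assumption, and it is what makes the cyclic symmetry hold exactly:

```python
import numpy as np

def matmul_tensor_cyclic(n):
    """<n,n,n> tensor with entry ((i,j), (j,l), (l,i)) = 1, linearizing
    matrix position (p, q) as p + n*q; mode 3 then indexes C transposed."""
    M = np.zeros((n * n, n * n, n * n), dtype=int)
    for i in range(n):
        for j in range(n):
            for l in range(n):
                M[i + n * j, j + n * l, l + n * i] = 1
    return M

M = matmul_tensor_cyclic(3)
# Rotating the three modes leaves the tensor unchanged: m_ijk = m_kij = m_jki.
cyclic = (np.array_equal(M, M.transpose(2, 0, 1))
          and np.array_equal(M, M.transpose(1, 2, 0)))
```

As a side check, M.sum() equals N^3 = MPN, the count of 1s from slide 12.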
27 Cyclic invariance of Strassen's algorithm
[Strassen's factor matrices U, V, W, as shown on slide 14.]
If you cyclically permute U, V, W, you get the same rank-one components in a different order.
28 Searching for cyclic-invariant solutions
U = [A B C D], V = [A C D B], W = [A D B C]
Each factor matrix is N^2 x R; block A has S columns and blocks B, C, D have T columns each, so
R = S + 3T
29 Cyclic-invariant decompositions
Decomposition with no constraints:
T = sum_{r=1}^{R} u_r o v_r o w_r
Decomposition with cyclic-invariant constraint:
T = sum_{s=1}^{S} a_s o a_s o a_s + sum_{t=1}^{T} (b_t o c_t o d_t + d_t o b_t o c_t + c_t o d_t o b_t)
Number of variables reduced by a factor of 3, but the expression is no longer multilinear (not linear in A).
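As a concrete check of this form (my own regrouping, not from the slides): Strassen's ⟨2,2,2⟩ solution is cyclic invariant with S = 1 and T = 2, so R = 1 + 3*2 = 7. Summing the symmetric component a o a o a plus the three rotations of the two (b, c, d) triples reproduces the matrix multiplication tensor in the C-transposed ordering used above:

```python
import numpy as np

def outer3(x, y, z):
    return np.einsum("i,j,k->ijk", x, y, z)

# Strassen's rank-7 decomposition regrouped into cyclic-invariant form,
# with 2x2 matrices vectorized column-major (position (p,q) -> p + 2q).
a = np.array([1, 0, 0, 1])                  # symmetric component (A11 + A22 pattern)
triples = [
    (np.array([0, 1, 0, 1]),                # b: A21 + A22 pattern
     np.array([1, 0, 0, 0]),                # c: B11 pattern
     np.array([0, 0, 1, -1])),              # d: output coefficients
    (np.array([0, 0, 0, 1]),
     np.array([-1, 1, 0, 0]),
     np.array([1, 0, 1, 0])),
]

T_hat = outer3(a, a, a)
for b, c, d in triples:
    T_hat = T_hat + outer3(b, c, d) + outer3(d, b, c) + outer3(c, d, b)

# Reference: <2,2,2> tensor with mode 3 indexing C transposed.
T_mm = np.zeros((4, 4, 4), dtype=int)
for i in range(2):
    for j in range(2):
        for l in range(2):
            T_mm[i + 2 * j, j + 2 * l, l + 2 * i] = 1
```

The search problem on this slide is exactly to find such a, b, c, d vectors for larger cases, with far fewer unknowns than an unconstrained rank-R search.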
30 New rank-23 cyclic-invariant solutions for ⟨3,3,3⟩
U = [A B C D], V = [A C D B], W = [A D B C]
Rank 23 is the best known exact rank for ⟨3,3,3⟩; many previous solutions exist, but none are cyclic invariant.
We computed cyclic-invariant solutions with S = 2, 5, ...
32 What about ⟨4,4,4⟩?
Performing two steps of Strassen's algorithm yields a rank-49 cyclic-invariant solution.
No exact decomposition of rank < 49 is known, cyclic invariant or otherwise.
No success yet in computing new cyclic-invariant solutions, but the truth is out there.
33 Summary
Discovering fast matrix multiplication algorithms corresponds to computing an exact CP decomposition.
Any new algorithm can be implemented efficiently.
Exploiting symmetry of matrix multiplication can reduce the size of the problem; we want to tackle ⟨4,4,4⟩.
Still exploring how to use more structure to scale up to larger dimensions.
34 Thank You!
Discovering Fast Matrix Multiplication Algorithms via Tensor Decomposition
Grey Ballard
36 More structure...
Transposition symmetry:
T' = T x1 P x2 P x3 P, where t'_ijk = t_jik and P is the vec-permutation (perfect shuffle) matrix.
This is derived from the fact that AB = C implies B^T A^T = C^T.
38 More structure...
Multiplication invariance: T is unchanged under mode-wise multiplication by Kronecker-product factors built from nonsingular matrices X, Y, Z (factors of the form Y^T (x) X in each mode).
This is derived from the fact that AB = C implies (XAY)(Y^{-1}BZ) = XCZ.
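One way to see this invariance in action (an illustrative sketch, not from the talk; transformed_multiply is a hypothetical helper name): any nonsingular X, Y, Z turn one fast algorithm into another, because C can be recovered from the product (XAY)(Y^{-1}BZ) = XCZ while still spending only 7 multiplies in the bilinear step:

```python
import numpy as np

def strassen2x2(A, B):
    """One step of Strassen's algorithm for 2x2 matrices (7 multiplies)."""
    M1 = (A[0, 0] + A[1, 1]) * (B[0, 0] + B[1, 1])
    M2 = (A[1, 0] + A[1, 1]) * B[0, 0]
    M3 = A[0, 0] * (B[0, 1] - B[1, 1])
    M4 = A[1, 1] * (B[1, 0] - B[0, 0])
    M5 = (A[0, 0] + A[0, 1]) * B[1, 1]
    M6 = (A[1, 0] - A[0, 0]) * (B[0, 0] + B[0, 1])
    M7 = (A[0, 1] - A[1, 1]) * (B[1, 0] + B[1, 1])
    return np.array([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

def transformed_multiply(A, B, X, Y, Z):
    """A 'new' 7-multiply algorithm obtained from Strassen's via nonsingular
    X, Y, Z: compute (XAY)(Y^-1 B Z) = X C Z, then undo X and Z."""
    P = strassen2x2(X @ A @ Y, np.linalg.solve(Y, B) @ Z)
    return np.linalg.solve(X, P) @ np.linalg.inv(Z)
```

Folding X, Y, Z into the factor matrices themselves gives an entire orbit of rank-7 decompositions equivalent to Strassen's, which is why search methods must contend with huge families of essentially identical solutions.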
39 Example algorithm: ⟨4, 2, 4⟩
Partition the matrices like this:
A = [A11 A12; A21 A22; A31 A32; A41 A42] (4 x 2 blocks),
B = [B11 B12 B13 B14; B21 B22 B23 B24] (2 x 4 blocks)
1. Take 26 linear combos of the Aij's according to U (68 adds)
2. Take 26 linear combos of the Bij's according to V (52 adds)
3. Perform 26 multiplies (recursively)
4. Take linear combos of the outputs to form the Cij's according to W (69 adds)
The classical algorithm performs 32 multiplies, yielding a possible speedup of 23% per step.
40 References I
[BB15] Austin R. Benson and Grey Ballard. A framework for practical parallel fast matrix multiplication. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 42-53, New York, NY, USA, February 2015. ACM.
[BCRL79] D. Bini, M. Capovani, F. Romani, and G. Lotti. O(n^2.7799) complexity for n x n approximate matrix multiplication. Information Processing Letters, 8(5):234-235, 1979.
[CW87] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, STOC '87, pages 1-6, New York, NY, USA, 1987. ACM.
[HRMvdG17] Jianyu Huang, Leslie Rice, Devin A. Matthews, and Robert van de Geijn. Generating families of practical fast matrix multiplication algorithms. In Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium, 2017. To appear.
[LG14] François Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, ISSAC '14, pages 296-303, New York, NY, USA, 2014. ACM.
[Sch81] A. Schönhage. Partial and total matrix multiplication. SIAM Journal on Computing, 10(3):434-455, 1981.
41 References II
[Smi13] A. V. Smirnov. The bilinear complexity and practical algorithms for matrix multiplication. Computational Mathematics and Mathematical Physics, 53(12):1781-1795, 2013.
[Smi17] Alexey Smirnov. Several bilinear algorithms for matrix multiplication. Technical report, ResearchGate, 2017.
[Str69] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354-356, 1969.
A. LINEAR ALGEBRA. CONVEX SETS 1. Matrices and vectors 1.1 Matrix operations 1.2 The rank of a matrix 2. Systems of linear equations 2.1 Basic solutions 3. Vector spaces 3.1 Linear dependence and independence
More informationFast Matrix Multiplication: Limitations of the Laser Method
Electronic Colloquium on Computational Complexity, Report No. 154 (2014) Fast Matrix Multiplication: Limitations of the Laser Method Andris Ambainis University of Latvia Riga, Latvia andris.ambainis@lu.lv
More informationEfficient algorithms for symmetric tensor contractions
Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to
More informationComputation of the mtx-vec product based on storage scheme on vector CPUs
BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix Computation of the mtx-vec product based on storage scheme on
More informationCS 580: Algorithm Design and Analysis
CS 580: Algorithm Design and Analysis Jeremiah Blocki Purdue University Spring 208 Announcement: Homework 3 due February 5 th at :59PM Final Exam (Tentative): Thursday, May 3 @ 8AM (PHYS 203) Recap: Divide
More informationLinear Algebra Methods for Data Mining
Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 1. Basic Linear Algebra Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki Example
More informationExploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning
Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February
More informationOn the Exponent of the All Pairs Shortest Path Problem
On the Exponent of the All Pairs Shortest Path Problem Noga Alon Department of Mathematics Sackler Faculty of Exact Sciences Tel Aviv University Zvi Galil Department of Computer Science Sackler Faculty
More informationLecture 20 FUNDAMENTAL Theorem of Finitely Generated Abelian Groups (FTFGAG)
Lecture 20 FUNDAMENTAL Theorem of Finitely Generated Abelian Groups (FTFGAG) Warm up: 1. Let n 1500. Find all sequences n 1 n 2... n s 2 satisfying n i 1 and n 1 n s n (where s can vary from sequence to
More informationApproximation. Inderjit S. Dhillon Dept of Computer Science UT Austin. SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina.
Using Quadratic Approximation Inderjit S. Dhillon Dept of Computer Science UT Austin SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina Sept 12, 2012 Joint work with C. Hsieh, M. Sustik and
More informationTensors. Notes by Mateusz Michalek and Bernd Sturmfels for the lecture on June 5, 2018, in the IMPRS Ringvorlesung Introduction to Nonlinear Algebra
Tensors Notes by Mateusz Michalek and Bernd Sturmfels for the lecture on June 5, 2018, in the IMPRS Ringvorlesung Introduction to Nonlinear Algebra This lecture is divided into two parts. The first part,
More information12. Cholesky factorization
L. Vandenberghe ECE133A (Winter 2018) 12. Cholesky factorization positive definite matrices examples Cholesky factorization complex positive definite matrices kernel methods 12-1 Definitions a symmetric
More informationLinear Systems of n equations for n unknowns
Linear Systems of n equations for n unknowns In many application problems we want to find n unknowns, and we have n linear equations Example: Find x,x,x such that the following three equations hold: x
More informationUsing Hankel structured low-rank approximation for sparse signal recovery
Using Hankel structured low-rank approximation for sparse signal recovery Ivan Markovsky 1 and Pier Luigi Dragotti 2 Department ELEC Vrije Universiteit Brussel (VUB) Pleinlaan 2, Building K, B-1050 Brussels,
More informationNext topics: Solving systems of linear equations
Next topics: Solving systems of linear equations 1 Gaussian elimination (today) 2 Gaussian elimination with partial pivoting (Week 9) 3 The method of LU-decomposition (Week 10) 4 Iterative techniques:
More informationIntroduction to Algorithms
Lecture 1 Introduction to Algorithms 1.1 Overview The purpose of this lecture is to give a brief overview of the topic of Algorithms and the kind of thinking it involves: why we focus on the subjects that
More informationNumerical Analysis Lecture Notes
Numerical Analysis Lecture Notes Peter J Olver 3 Review of Matrix Algebra Vectors and matrices are essential for modern analysis of systems of equations algebrai, differential, functional, etc In this
More informationMAT 2037 LINEAR ALGEBRA I web:
MAT 237 LINEAR ALGEBRA I 2625 Dokuz Eylül University, Faculty of Science, Department of Mathematics web: Instructor: Engin Mermut http://kisideuedutr/enginmermut/ HOMEWORK 2 MATRIX ALGEBRA Textbook: Linear
More informationA Computation- and Communication-Optimal Parallel Direct 3-body Algorithm
A Computation- and Communication-Optimal Parallel Direct 3-body Algorithm Penporn Koanantakool and Katherine Yelick {penpornk, yelick}@cs.berkeley.edu Computer Science Division, University of California,
More informationVECTORS, TENSORS AND INDEX NOTATION
VECTORS, TENSORS AND INDEX NOTATION Enrico Nobile Dipartimento di Ingegneria e Architettura Università degli Studi di Trieste, 34127 TRIESTE March 5, 2018 Vectors & Tensors, E. Nobile March 5, 2018 1 /
More informationNumerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization
Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725 Consider Last time: proximal Newton method min x g(x) + h(x) where g, h convex, g twice differentiable, and h simple. Proximal
More informationReview of Vectors and Matrices
A P P E N D I X D Review of Vectors and Matrices D. VECTORS D.. Definition of a Vector Let p, p, Á, p n be any n real numbers and P an ordered set of these real numbers that is, P = p, p, Á, p n Then P
More information