Fast Matrix Multiplication Over GF3
1 Fast Matrix Multiplication Over GF3 Matthew Lambert Advised by Dr. B. David Saunders & Dr. Stephen Siegel February 13, 2015
2 Matrix Multiplication
6 Outline: Fast Matrix Multiplication
Strassen's algorithm: well-studied O(n^2.81) divide-and-conquer algorithm.
Method of the Four Russians (Arlazarov et al., 1970): O(n³/log n) algorithm that is effective over small finite fields (GF2, GF3, GF5). Very well studied for GF2 and anecdotally good for GF3.
Aim: find thresholds for each algorithm over GF3.
9 Arithmetic over GF3
Simply arithmetic mod 3.
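The slide above can be made concrete with the addition and multiplication tables, which are just integer arithmetic reduced mod 3:

```python
# Addition and multiplication tables for GF3 = Z/3Z.
add = [[(a + b) % 3 for b in range(3)] for a in range(3)]
mul = [[(a * b) % 3 for b in range(3)] for a in range(3)]

# e.g. 1 + 2 = 0 and 2 * 2 = 1 in GF3
print(add[1][2], mul[2][2])
```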
14 Representation of GF3
With only three possible values, we need two bits for each element.
Packed storage: elements stored consecutively. To account for a possible carry when adding, we actually need 3 bits for each element; 5 bit ops to add two 21-element words.
Bitsliced storage: low bits and high bits stored separately (Boothby & Bradshaw, 2009). Store 2 as 11₂ instead of 10₂. 6 bit ops to add or subtract two 64-element SlicedUnits; 1 bit op to negate.
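A small sketch of the bitsliced layout with the 2 = 11₂ encoding (the helper names are illustrative, not from the thesis implementation; Python ints stand in for 64-bit words):

```python
def to_sliced(elems):
    """Pack a list of GF3 elements (0, 1, 2) into a (low, high) pair of bit words.
    Encoding: 0 -> 00, 1 -> 01, 2 -> 11, so the high bit is set only for 2."""
    low = high = 0
    for i, e in enumerate(elems):
        if e:             # 1 and 2 both set the low bit
            low |= 1 << i
        if e == 2:        # 2 additionally sets the high bit
            high |= 1 << i
    return low, high

def from_sliced(low, high, n):
    """Decode n elements: with this encoding, the value is low bit + high bit."""
    return [((low >> i) & 1) + ((high >> i) & 1) for i in range(n)]

def neg_sliced(low, high):
    """Negation (x -> -x mod 3) in one bit operation: swaps 1 and 2, fixes 0."""
    return low, low ^ high
```

Every element of a word is negated with the single xor in `neg_sliced`, which is the "1 bit op to negate" claim above.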
19 Matrix Multiplication over GF2
Row i of A determines row i of C: C_i = sum_{j=0}^{n-1} A_{i,j} B_j.
Over GF2, row addition is an xor operation.
We perform up to n² row additions, yielding O(n³) running time.
We have a speedup if we do not have to do O(n²) row additions.
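The row-oriented view above can be sketched in a few lines of Python, with each row stored as a bit-packed integer (bit j of a row is column j), so a row addition is one xor:

```python
def gf2_matmul_rows(A, B, n):
    """C_i = sum_j A_{i,j} * B_j over GF2; A and B are lists of bit-packed row ints."""
    C = []
    for i in range(n):
        row = 0
        for j in range(n):
            if (A[i] >> j) & 1:   # A_{i,j} = 1: row j of B contributes to row i of C
                row ^= B[j]       # row addition over GF2 is a single xor
        C.append(row)
    return C
```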
22 Four Russians Multiplication over GF2
Instead of indexing one bit at a time, we index with t = 2 bits at a time, yielding n²/t additions and thus O(n³/t) running time.
With multiple bits as index, we are adding multiple rows at once.
We need to quickly compute the linear combinations of the t = 2 rows we are adding.
26 Four Russians Multiplication over GF2
Example table for t = 2, where bit i of the index selects row r_{i+1}:
index  contents
00     0
01     r1
10     r2
11     r1 + r2
32 Four Russians Multiplication over GF2: Fast table creation
We perform n²/t row additions using n/t tables.
The additions are performed in O(n³/t) time.
The tables can be constructed in O(2^t n²/t) time: i.e., one vector addition for each of the 2^t rows in each of the n/t tables.
index  description
000    0
001    r1
010    r2
011    r1 + r2
100    r3
101    r1 + r3
110    r2 + r3
111    r1 + r2 + r3
35 Four Russians Multiplication over GF2: Algorithm
Data: A, B, C, n x n matrices
Result: C ← A · B
for i ← 0 to n/t do
    T ← table of the 2^t combinations of rows [i·t, (i+1)·t) of B
    for j ← 0 to n do
        a ← bits [i·t, (i+1)·t) of row j of A
        C_j ← C_j + T[a]
    end
end
Running time: n²/t vector adds, totaling O(n³/t) time; n/t tables created, totaling O(2^t n²/t) time.
Let t = log₂(n); then the additions take O(n³/log₂(n)) time and the tables take O(n³/log₂(n)) time to create.
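A direct, unoptimized sketch of the GF2 algorithm above, again with rows as bit-packed integers; each table entry is built with one xor from a previously computed entry, matching the fast-table-creation argument:

```python
def four_russians_gf2(A, B, n, t=4):
    """C = A * B over GF2. Rows are ints; bit j of a row int is column j."""
    C = [0] * n
    for i in range(0, n, t):
        w = min(t, n - i)                  # width of this block of rows of B
        T = [0] * (1 << w)                 # T[a] = xor of rows of B selected by bits of a
        for a in range(1, 1 << w):
            lsb = a & -a                   # one new row xored onto a smaller entry
            T[a] = T[a ^ lsb] ^ B[i + lsb.bit_length() - 1]
        for j in range(n):
            a = (A[j] >> i) & ((1 << w) - 1)   # bits [i, i+w) of row j of A
            C[j] ^= T[a]                   # one table lookup replaces w row additions
    return C
```

Each row of A now triggers n/t lookups instead of up to n row additions, which is exactly the O(n³/t) count from the slide.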
41 Four Russians Multiplication over GF3: Considerations
In GF2, we had 2^t row combinations to compute. In GF3, we have to compute 3^t combinations of rows.
In GF2, extracting bits for indices was easy; with the bitsliced representation, this is problematic for GF3.
How do we use the bitsliced bits of a combination such as r2 − r3 as an index, when the pattern read as a binary number does not fall in the range 0 to 3^t?
If t = 3, there are only 27 combinations of rows, but the low bits and the high bits each range between 000 and 111.
It is too expensive to map an index into the range 0 to 3^t directly, so we concatenate the high and low bits to give an index between 0 and 4^t.
How do our tables look with these indices?
45 Four Russians Multiplication over GF3: Tables
Three approaches were considered to address the creation of row combinations and the indexing problem:
Use one table of 2^t rows and add with it twice.
Use one table of 4^t rows and add with it once.
Use one table of 3^t rows plus a table of 4^t indices, and add with it once.
47 Four Russians Multiplication over GF3: 2^t approach
First approach: create a table of 2^t combinations of rows as in the GF2 case. Index once into the table with the low t bits and once with the high t bits. The 2 = 11₂ representation used in the bitsliced storage is advantageous here: an element's value equals its low bit plus its high bit, so the two lookups together contribute each row the correct number of times.
index  contents
00     0
01     r1
10     r2
11     r1 + r2
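A sketch of why indexing the same 2^t table twice works under the 2 = 11₂ encoding, shown on small GF3 row vectors rather than bitsliced words (the helper names are illustrative):

```python
def subset_table(rows, t):
    """T[a] = GF3 sum of the rows selected by the bits of a (2^t entries)."""
    n = len(rows[0])
    T = [[0] * n]
    for a in range(1, 1 << t):
        lsb = a & -a
        prev, r = T[a ^ lsb], rows[lsb.bit_length() - 1]
        T.append([(p + x) % 3 for p, x in zip(prev, r)])
    return T

def combine(rows, coeffs):
    """sum_i coeffs[i] * rows[i] mod 3 via two lookups into the 2^t table.
    Since 1 = 01 and 2 = 11, each coefficient is its low bit plus its high bit."""
    t = len(coeffs)
    low = sum((1 << i) for i, c in enumerate(coeffs) if c >= 1)
    high = sum((1 << i) for i, c in enumerate(coeffs) if c == 2)
    T = subset_table(rows, t)
    return [(a + b) % 3 for a, b in zip(T[low], T[high])]
```

A coefficient of 2 sets both the low and high index bits, so its row is looked up twice, i.e. added twice, which is adding 2·r over GF3.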
49 Four Russians Multiplication over GF3: 4^t approach
Second approach: create a table of 4^t rows containing all 3^t combinations of rows and some unused rows. We index into the rows directly. For t = 2, with the index formed as high bits followed by low bits:
index  contents          index  contents
0000   0                 1010   −r2
0001   r1                1011   r1 − r2
0010   r2                1111   −r1 − r2
0011   r1 + r2
0101   −r1               (indices with a high bit set
0111   −r1 + r2          but not the low bit are unused)
51 Four Russians Multiplication over GF3: 3^t approach
Third approach: create a table of 3^t rows and a table of 4^t indices (or pointers) to map each bitsliced index to a combination. For t = 2, the row table holds the nine combinations 0, r1, r2, r1 + r2, −r1, −r2, r1 − r2, −r1 + r2, −r1 − r2, and the index table maps each of the 16 concatenated bit patterns (the 9 valid ones plus 7 unused) to the position of its combination in the row table.
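The supplemental index table for the 3^t approach can be sketched as follows: enumerate all 4^t concatenated patterns once and map each valid one to the rank of its combination in the 3^t-row table (a sketch; the names and the base-3 slot numbering are illustrative):

```python
def build_index_map(t):
    """Map each 2t-bit index (high bits << t | low bits) to a slot in 0..3^t - 1.
    A pattern is valid when every position has (high, low) in {00, 01, 11};
    the encoded coefficient of row i is its low bit plus its high bit.
    Invalid patterns map to None."""
    idx = [None] * (1 << (2 * t))
    for pattern in range(1 << (2 * t)):
        high, low = pattern >> t, pattern & ((1 << t) - 1)
        if high & ~low:              # a high bit set without its low bit: unused
            continue
        slot = 0
        for i in range(t):           # coefficient of row i becomes a base-3 digit
            c = ((low >> i) & 1) + ((high >> i) & 1)
            slot += c * 3 ** i
        idx[pattern] = slot
    return idx
```

Only one such map is needed per value of t, which is the "supplemental table" point on the next slide.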
54 Four Russians Multiplication over GF3: Table Summary
method  memory cost              element access  adds per t columns
2^t     2^t rows                 direct          2
3^t     3^t rows + 4^t indices   indirect        1
4^t     4^t rows                 direct          1
Because the 3^t and 4^t approaches contain ± each combination of rows, we only need the six-operation addition to construct half of the elements; the one-operation negation constructs the other half.
Only one supplemental index table for the 3^t approach needs to be created for each value of t used.
57 Four Russians Multiplication over GF3: Results
Initial development showed the 4^t method to be considerably slower, so it was abandoned in favor of optimizing the other two approaches.
Table: running times for various n (columns: n, 3^t time, 2^t time, classical time).
63 Four Russians Multiplication over GF3: Conclusions
Reasonable speedup over classical: 2.5x (and improving).
The 3^t approach is not as successful as it theoretically should be.
Larger tables do not fit into L1 cache as well as 2^t-sized tables?
More complicated access increases cost?
Some precedent for multiple additions: M4RI (Albrecht, 2010) increases the number of additions if it means smaller tables, so multiple tables can fit into L1 cache.
Preliminary testing with multiple tables sees the 3^t approach beat the 2^t approach's 1.01 seconds.
65 Other Four Russians Thoughts
We assume we can operate on full 64-bit words; performance decreases if we must not modify unused bits.
If t = log(n), then the table of row combinations is the same size as the matrix, so there are definite practical limitations.
67 Strassen's Algorithm vs Classical Divide-and-Conquer
Classical divide-and-conquer multiplication was compared against Strassen multiplication on top of the 3^t approach of the Method of the Four Russians. Strassen was faster at all tested dimensions.
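A minimal Strassen sketch over GF3 on plain nested lists (n a power of 2, with a classical base case; the thesis implementation instead recurses down to bitsliced Four Russians base cases):

```python
def strassen_gf3(A, B, base=2):
    """Multiply two n x n matrices over GF3 (n a power of 2) using Strassen's
    seven recursive products instead of the classical eight."""
    n = len(A)
    if n <= base:  # classical O(n^3) base case, reduced mod 3
        return [[sum(A[i][k] * B[k][j] for k in range(n)) % 3 for j in range(n)]
                for i in range(n)]
    h = n // 2
    quad = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    add = lambda X, Y: [[(a + b) % 3 for a, b in zip(x, y)] for x, y in zip(X, Y)]
    sub = lambda X, Y: [[(a - b) % 3 for a, b in zip(x, y)] for x, y in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    M1 = strassen_gf3(add(A11, A22), add(B11, B22), base)
    M2 = strassen_gf3(add(A21, A22), B11, base)
    M3 = strassen_gf3(A11, sub(B12, B22), base)
    M4 = strassen_gf3(A22, sub(B21, B11), base)
    M5 = strassen_gf3(add(A11, A12), B22, base)
    M6 = strassen_gf3(sub(A21, A11), add(B11, B12), base)
    M7 = strassen_gf3(sub(A12, A22), add(B21, B22), base)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    return [C11[i] + C12[i] for i in range(h)] + [C21[i] + C22[i] for i in range(h)]
```

Note that GF3 subtraction is cheap here (a mod-3 subtract per entry), mirroring the one-operation bitsliced negation that makes Strassen's additive pre-combinations inexpensive in the real implementation.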
68 Strassen's Algorithm: 2^t vs 3^t as base case
Three levels of recursion were performed. Even with base cases where the 3^t approach should outperform the 2^t approach, the 2^t base case yielded faster results in all cases (though the multiple-tables method outperforms 2^t by about 14%).
71 Strassen's Algorithm: Conclusions and Thresholds
Strassen's algorithm is faster than classical, and due to memory requirements it is at some point faster than the Method of the Four Russians.
When using a base case of the 2^t approach, the threshold to switch from Strassen's algorithm to Four Russians on the machine used is somewhere between 960 and 1728.
Normally Strassen's algorithm works best on matrices with dimensions a power of 2, or a power of 2 times the base-case dimension. With bitsliced or packed storage it is important to keep matrix dimensions multiples of 64 at all levels of recursion, as working with less than one machine word is more expensive.
Further improvements ongoing?
75 Summary
Successfully developed and implemented the Method of the Four Russians over GF3, yielding a noticeable performance gain over classical multiplication.
Implemented Strassen's algorithm on top of the Method of the Four Russians.
Different implementations of the Method of the Four Russians might be preferred depending on the specific problem and machine.
76 Best value of t
For GF2 we can assume all operations take essentially equal time: the only distinct cost is extracting bits; all other costs are 64-bit xors.
For (m x n) · (n x k) = (m x k), we spend mnk/t time adding and nk·2^t/t time creating tables.
Minimize mnk/t + nk·2^t/t.
The partial derivative w.r.t. t is (kn/t²)·(2^t(t·ln 2 − 1) − m).
Setting this equal to zero and solving gives t = (W(m/e) + 1)/ln 2.
Table: dimension vs. optimal t and log₂(dim).
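The closed form above uses the Lambert W function (W(x)·e^{W(x)} = x). A quick numeric sketch, using Newton's method for W rather than any particular library, shows how the optimal t compares with log₂ of the dimension:

```python
import math

def lambert_w(x, iters=50):
    """Principal branch of the Lambert W function (w * e^w = x) for x > 0,
    computed with Newton's method."""
    w = math.log(1 + x)  # reasonable starting point for x > 0
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (1 + w))
    return w

def best_t_gf2(m):
    """Optimal table width t for GF2: t = (W(m/e) + 1) / ln 2."""
    return (lambert_w(m / math.e) + 1) / math.log(2)
```

For example, `best_t_gf2(1024)` comes out below log₂(1024) = 10, consistent with the slide's point that the cost-minimizing t is smaller than the textbook t = log₂(n).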
77 Selecting t in GF3
For the 3^t approach: we spend mnk/t time adding and nk·3^t/t time creating tables, giving t = (W(m/e) + 1)/ln 3.
For the 2^t approach: we spend 2mnk/t time adding and nk·2^t/t time creating tables, giving t = (W(2m/e) + 1)/ln 2.
Tables: dimension vs. minimizing t, experimental t, and log₃(dim) for the 3^t approach; dimension vs. minimizing t, experimental t, and log₂(dim) for the 2^t approach.
78 GF3 bitsliced operations
(x0, y0, r0 denote the low-bit words; x1, y1, r1 the high-bit words.)

s ← x0 ⊕ y1
t ← x1 ⊕ y0
r1 ← s ∧ t
r0 ← (x0 ⊕ y0) ∨ (s ⊕ x1)
Figure: GF3 bitsliced addition in six operations: r ← x + y

t ← x0 ⊕ y0
r0 ← t ∨ (x1 ⊕ y1)
r1 ← (t ⊕ y1) ∧ (y0 ⊕ x1)
Figure: GF3 bitsliced subtraction in six operations: r ← x − y

r0 ← x0
r1 ← x0 ⊕ x1
Figure: GF3 bitsliced negation in one operation: r ← −x
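Six-operation bitsliced addition and subtraction circuits for the 2 = 11₂ encoding can be checked exhaustively in a few lines (shown on single bits; on 64-bit words the same expressions apply lane-wise). The exact gate choices here are one valid reconstruction, verified below against mod-3 arithmetic:

```python
def gf3_add_sliced(x0, x1, y0, y1):
    """Six-op bitsliced GF3 addition; (low, high) encoding with 2 = 11."""
    s = x0 ^ y1
    t = x1 ^ y0
    r1 = s & t
    r0 = (x0 ^ y0) | (s ^ x1)
    return r0, r1

def gf3_sub_sliced(x0, x1, y0, y1):
    """Six-op bitsliced GF3 subtraction."""
    t = x0 ^ y0
    r0 = t | (x1 ^ y1)
    r1 = (t ^ y1) & (y0 ^ x1)
    return r0, r1

ENC = {0: (0, 0), 1: (1, 0), 2: (1, 1)}     # value -> (low, high)
DEC = {v: k for k, v in ENC.items()}

# Exhaustive check of both circuits over all nine element pairs.
for x in range(3):
    for y in range(3):
        x0, x1 = ENC[x]
        y0, y1 = ENC[y]
        assert DEC[gf3_add_sliced(x0, x1, y0, y1)] == (x + y) % 3
        assert DEC[gf3_sub_sliced(x0, x1, y0, y1)] == (x - y) % 3
```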
Outline 1 midterm exam on Friday 11 July 2014 policies for the first part 2 questions with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014 Intro
More information1 Short adders. t total_ripple8 = t first + 6*t middle + t last = 4t p + 6*2t p + 2t p = 18t p
UNIVERSITY OF CALIFORNIA College of Engineering Department of Electrical Engineering and Computer Sciences Study Homework: Arithmetic NTU IC54CA (Fall 2004) SOLUTIONS Short adders A The delay of the ripple
More informationParallelism and Machine Models
Parallelism and Machine Models Andrew D Smith University of New Brunswick, Fredericton Faculty of Computer Science Overview Part 1: The Parallel Computation Thesis Part 2: Parallelism of Arithmetic RAMs
More informationLecture 19: The Determinant
Math 108a Professor: Padraic Bartlett Lecture 19: The Determinant Week 10 UCSB 2013 In our last class, we talked about how to calculate volume in n-dimensions Specifically, we defined a parallelotope:
More informationBindel, Fall 2016 Matrix Computations (CS 6210) Notes for
1 Logistics Notes for 2016-08-26 1. Our enrollment is at 50, and there are still a few students who want to get in. We only have 50 seats in the room, and I cannot increase the cap further. So if you are
More informationHardware Design I Chap. 4 Representative combinational logic
Hardware Design I Chap. 4 Representative combinational logic E-mail: shimada@is.naist.jp Already optimized circuits There are many optimized circuits which are well used You can reduce your design workload
More informationReductions, Recursion and Divide and Conquer
Chapter 5 Reductions, Recursion and Divide and Conquer CS 473: Fundamental Algorithms, Fall 2011 September 13, 2011 5.1 Reductions and Recursion 5.1.0.1 Reduction Reducing problem A to problem B: (A) Algorithm
More informationCS 542G: Conditioning, BLAS, LU Factorization
CS 542G: Conditioning, BLAS, LU Factorization Robert Bridson September 22, 2008 1 Why some RBF Kernel Functions Fail We derived some sensible RBF kernel functions, like φ(r) = r 2 log r, from basic principles
More informationCS227-Scientific Computing. Lecture 4: A Crash Course in Linear Algebra
CS227-Scientific Computing Lecture 4: A Crash Course in Linear Algebra Linear Transformation of Variables A common phenomenon: Two sets of quantities linearly related: y = 3x + x 2 4x 3 y 2 = 2.7x 2 x
More informationSOLVING LINEAR SYSTEMS
SOLVING LINEAR SYSTEMS We want to solve the linear system a, x + + a,n x n = b a n, x + + a n,n x n = b n This will be done by the method used in beginning algebra, by successively eliminating unknowns
More informationMultiplicative Complexity Reductions in Cryptography and Cryptanalysis
Multiplicative Complexity Reductions in Cryptography and Cryptanalysis THEODOSIS MOUROUZIS SECURITY OF SYMMETRIC CIPHERS IN NETWORK PROTOCOLS - ICMS - EDINBURGH 25-29 MAY/2015 1 Presentation Overview Linearity
More informationSparse Polynomial Multiplication and Division in Maple 14
Sparse Polynomial Multiplication and Division in Maple 4 Michael Monagan and Roman Pearce Department of Mathematics, Simon Fraser University Burnaby B.C. V5A S6, Canada October 5, 9 Abstract We report
More informationCache-Oblivious Algorithms
Cache-Oblivious Algorithms 1 Cache-Oblivious Model 2 The Unknown Machine Algorithm C program gcc Object code linux Execution Can be executed on machines with a specific class of CPUs Algorithm Java program
More informationPARALLEL MULTIPLICATION IN F 2
PARALLEL MULTIPLICATION IN F 2 n USING CONDENSED MATRIX REPRESENTATION Christophe Negre Équipe DALI, LP2A, Université de Perpignan avenue P Alduy, 66 000 Perpignan, France christophenegre@univ-perpfr Keywords:
More informationAnalysis of Algorithm Efficiency. Dr. Yingwu Zhu
Analysis of Algorithm Efficiency Dr. Yingwu Zhu Measure Algorithm Efficiency Time efficiency How fast the algorithm runs; amount of time required to accomplish the task Our focus! Space efficiency Amount
More informationWilliam Stallings Copyright 2010
A PPENDIX E B ASIC C ONCEPTS FROM L INEAR A LGEBRA William Stallings Copyright 2010 E.1 OPERATIONS ON VECTORS AND MATRICES...2 Arithmetic...2 Determinants...4 Inverse of a Matrix...5 E.2 LINEAR ALGEBRA
More informationMathematics for Engineers. Numerical mathematics
Mathematics for Engineers Numerical mathematics Integers Determine the largest representable integer with the intmax command. intmax ans = int32 2147483647 2147483647+1 ans = 2.1475e+09 Remark The set
More informationWhat s the best data structure for multivariate polynomials in a world of 64 bit multicore computers?
What s the best data structure for multivariate polynomials in a world of 64 bit multicore computers? Michael Monagan Center for Experimental and Constructive Mathematics Simon Fraser University British
More informationCache-Oblivious Algorithms
Cache-Oblivious Algorithms 1 Cache-Oblivious Model 2 The Unknown Machine Algorithm C program gcc Object code linux Execution Can be executed on machines with a specific class of CPUs Algorithm Java program
More informationToward High Performance Matrix Multiplication for Exact Computation
Toward High Performance Matrix Multiplication for Exact Computation Pascal Giorgi Joint work with Romain Lebreton (U. Waterloo) Funded by the French ANR project HPAC Séminaire CASYS - LJK, April 2014 Motivations
More informationA Review of Matrix Analysis
Matrix Notation Part Matrix Operations Matrices are simply rectangular arrays of quantities Each quantity in the array is called an element of the matrix and an element can be either a numerical value
More informationFields in Cryptography. Çetin Kaya Koç Winter / 30
Fields in Cryptography http://koclab.org Çetin Kaya Koç Winter 2017 1 / 30 Field Axioms Fields in Cryptography A field F consists of a set S and two operations which we will call addition and multiplication,
More informationImage Compression. 1. Introduction. Greg Ames Dec 07, 2002
Image Compression Greg Ames Dec 07, 2002 Abstract Digital images require large amounts of memory to store and, when retrieved from the internet, can take a considerable amount of time to download. The
More informationA Divide-and-Conquer Algorithm for Functions of Triangular Matrices
A Divide-and-Conquer Algorithm for Functions of Triangular Matrices Ç. K. Koç Electrical & Computer Engineering Oregon State University Corvallis, Oregon 97331 Technical Report, June 1996 Abstract We propose
More informationMod 2 linear algebra and tabulation of rational eigenforms
Mod 2 linear algebra and tabulation of rational eigenforms Kiran S. Kedlaya Department of Mathematics, University of California, San Diego kedlaya@ucsd.edu http://kskedlaya.org/slides/ (see also this SageMathCloud
More informationSEQUENCES AND SERIES
A sequence is an ordered list of numbers. SEQUENCES AND SERIES Note, in this context, ordered does not mean that the numbers in the list are increasing or decreasing. Instead it means that there is a first
More informationA field F is a set of numbers that includes the two numbers 0 and 1 and satisfies the properties:
Byte multiplication 1 Field arithmetic A field F is a set of numbers that includes the two numbers 0 and 1 and satisfies the properties: F is an abelian group under addition, meaning - F is closed under
More informationPUTTING FÜRER ALGORITHM INTO PRACTICE WITH THE BPAS LIBRARY. (Thesis format: Monograph) Linxiao Wang. Graduate Program in Computer Science
PUTTING FÜRER ALGORITHM INTO PRACTICE WITH THE BPAS LIBRARY. (Thesis format: Monograph) by Linxiao Wang Graduate Program in Computer Science A thesis submitted in partial fulfillment of the requirements
More informationcompare to comparison and pointer based sorting, binary trees
Admin Hashing Dictionaries Model Operations. makeset, insert, delete, find keys are integers in M = {1,..., m} (so assume machine word size, or unit time, is log m) can store in array of size M using power:
More informationAMS 209, Fall 2015 Final Project Type A Numerical Linear Algebra: Gaussian Elimination with Pivoting for Solving Linear Systems
AMS 209, Fall 205 Final Project Type A Numerical Linear Algebra: Gaussian Elimination with Pivoting for Solving Linear Systems. Overview We are interested in solving a well-defined linear system given
More informationMcBits: Fast code-based cryptography
McBits: Fast code-based cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands Joint work with Daniel Bernstein, Tung Chou December 17, 2013 IMA International Conference on Cryptography
More information1 The Algebraic Normal Form
1 The Algebraic Normal Form Boolean maps can be expressed by polynomials this is the algebraic normal form (ANF). The degree as a polynomial is a first obvious measure of nonlinearity linear (or affine)
More informationIntroduction to Algorithms 6.046J/18.401J/SMA5503
Introduction to Algorithms 6.046J/8.40J/SMA5503 Lecture 3 Prof. Piotr Indyk The divide-and-conquer design paradigm. Divide the problem (instance) into subproblems. 2. Conquer the subproblems by solving
More informationOn queueing in coded networks queue size follows degrees of freedom
On queueing in coded networks queue size follows degrees of freedom Jay Kumar Sundararajan, Devavrat Shah, Muriel Médard Laboratory for Information and Decision Systems, Massachusetts Institute of Technology,
More informationCMP 334: Seventh Class
CMP 334: Seventh Class Performance HW 5 solution Averages and weighted averages (review) Amdahl's law Ripple-carry adder circuits Binary addition Half-adder circuits Full-adder circuits Subtraction, negative
More informationTR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems
TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a
More informationLinear Methods (Math 211) - Lecture 2
Linear Methods (Math 211) - Lecture 2 David Roe September 11, 2013 Recall Last time: Linear Systems Matrices Geometric Perspective Parametric Form Today 1 Row Echelon Form 2 Rank 3 Gaussian Elimination
More informationNew attacks on Keccak-224 and Keccak-256
New attacks on Keccak-224 and Keccak-256 Itai Dinur 1, Orr Dunkelman 1,2 and Adi Shamir 1 1 Computer Science department, The Weizmann Institute, Rehovot, Israel 2 Computer Science Department, University
More informationCMPS 2200 Fall Divide-and-Conquer. Carola Wenk. Slides courtesy of Charles Leiserson with changes and additions by Carola Wenk
CMPS 2200 Fall 2017 Divide-and-Conquer Carola Wenk Slides courtesy of Charles Leiserson with changes and additions by Carola Wenk 1 The divide-and-conquer design paradigm 1. Divide the problem (instance)
More informationData Structures and Algorithms
Data Structures and Algorithms Spring 2017-2018 Outline 1 Sorting Algorithms (contd.) Outline Sorting Algorithms (contd.) 1 Sorting Algorithms (contd.) Analysis of Quicksort Time to sort array of length
More informationThe Design, Implementation, and Evaluation of a Symmetric Banded Linear Solver for Distributed-Memory Parallel Computers
The Design, Implementation, and Evaluation of a Symmetric Banded Linear Solver for Distributed-Memory Parallel Computers ANSHUL GUPTA and FRED G. GUSTAVSON IBM T. J. Watson Research Center MAHESH JOSHI
More informationAlgorithms (II) Yu Yu. Shanghai Jiaotong University
Algorithms (II) Yu Yu Shanghai Jiaotong University Chapter 1. Algorithms with Numbers Two seemingly similar problems Factoring: Given a number N, express it as a product of its prime factors. Primality:
More informationB. Cyclic Codes. Primitive polynomials are the generator polynomials of cyclic codes.
B. Cyclic Codes A cyclic code is a linear block code with the further property that a shift of a codeword results in another codeword. These are based on polynomials whose elements are coefficients from
More informationB-Spline Interpolation on Lattices
B-Spline Interpolation on Lattices David Eberly, Geometric Tools, Redmond WA 98052 https://www.geometrictools.com/ This work is licensed under the Creative Commons Attribution 4.0 International License.
More informationCPSC 518 Introduction to Computer Algebra Asymptotically Fast Integer Multiplication
CPSC 518 Introduction to Computer Algebra Asymptotically Fast Integer Multiplication 1 Introduction We have now seen that the Fast Fourier Transform can be applied to perform polynomial multiplication
More informationCPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication
CPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication March, 2006 1 Introduction We have now seen that the Fast Fourier Transform can be applied to perform
More informationCSCI Honor seminar in algorithms Homework 2 Solution
CSCI 493.55 Honor seminar in algorithms Homework 2 Solution Saad Mneimneh Visiting Professor Hunter College of CUNY Problem 1: Rabin-Karp string matching Consider a binary string s of length n and another
More informationCSE 421 Algorithms: Divide and Conquer
CSE 42 Algorithms: Divide and Conquer Larry Ruzzo Thanks to Richard Anderson, Paul Beame, Kevin Wayne for some slides Outline: General Idea algorithm design paradigms: divide and conquer Review of Merge
More informationCSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis. Catie Baker Spring 2015
CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis Catie Baker Spring 2015 Today Registration should be done. Homework 1 due 11:59pm next Wednesday, April 8 th. Review math
More informationDense LU factorization and its error analysis
Dense LU factorization and its error analysis Laura Grigori INRIA and LJLL, UPMC February 2016 Plan Basis of floating point arithmetic and stability analysis Notation, results, proofs taken from [N.J.Higham,
More information