Implementing Fast Carryless Multiplication


Joris van der Hoeven, Robin Larrieu and Grégoire Lecerf
CNRS & École polytechnique
MACIS 2017, Nov. 15, Vienna, Austria

Outline

1. Introduction: carryless multiplication; state of the art
2. Presentation of the algorithm
3. Implementation details
4. Perspectives

Carryless multiplication

Carryless multiplication is multiplication in $\mathbb{F}_2[X]$. We consider large degrees (typically $10^6$); fast algorithms for such sizes use FFT multiplication.

Problem: $\mathbb{F}_2$ does not contain enough evaluation points, so we must work in an extension field. How can we minimize the corresponding overhead?
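To make the operation concrete, here is a minimal schoolbook sketch of carryless multiplication. It assumes that a polynomial over $\mathbb{F}_2$ is packed into a Python integer, bit $i$ holding the coefficient of $X^i$; the name `clmul` is chosen for illustration and is not taken from the talk. It runs in quadratic time and only serves as a reference for the fast algorithm discussed here.

```python
def clmul(a: int, b: int) -> int:
    """Carryless product of two polynomials over F_2 packed as integers.

    Bit i of each integer holds the coefficient of X^i.
    """
    result = 0
    shift = 0
    while b:
        if b & 1:
            # Addition in F_2[X] is XOR, so no carry ever propagates.
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


# (X + 1) * (X + 1) = X^2 + 1 over F_2, whereas 3 * 3 = 9 over the integers.
print(bin(clmul(0b11, 0b11)))  # prints 0b101
```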

State of the art

1. Triadic Schönhage-Strassen algorithm (gf2x: Brent, Gaudry, Thomé, Zimmermann)
2. FFT over $\mathbb{F}_{2^{60}}$ (Harvey, van der Hoeven, Lecerf 2016)
3. Additive FFT over binary extension fields (Chen, Cheng, Kuo, Li, Yang 2017)

This work: an improvement of strategy no. 2 using ideas from the Frobenius FFT algorithm (van der Hoeven, Larrieu 2017). It achieves a speedup by a factor of 2. The source code is available from revision … of our SVN server (…), in the justinline library.

Why $\mathbb{F}_{2^{60}}$?

Efficient arithmetic in $\mathbb{F}_{2^{60}}$:
- An element is slightly smaller than a machine word.
- $\mu(X) := \frac{X^{61} - 1}{X - 1}$ is irreducible over $\mathbb{F}_2$.

Efficient FFT:
- Roots of unity of order $2^{60} - 1 = 3^2 \cdot 5^2 \cdot 7 \cdot 11 \cdot 13 \cdot 31 \cdot 41 \cdot 61 \cdot 151 \cdot 331 \cdot 1321$.

Bonus:
- 61 divides $2^{60} - 1$ (Fermat's theorem).
- 2 generates $(\mathbb{Z}/61\mathbb{Z})^*$ (equivalently, $\mu(X)$ is irreducible).
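The arithmetic facts behind this choice are easy to check by machine. The sketch below is only a sanity check, with helper names (`factorize`, `multiplicative_order`) invented for illustration: it factors $2^{60} - 1$ by trial division, confirms that 61 divides it, and confirms that 2 has order 60 modulo 61, which means 2 generates $(\mathbb{Z}/61\mathbb{Z})^*$.

```python
def factorize(n: int) -> dict[int, int]:
    """Trial-division factorization; adequate here since all prime factors are small."""
    factors: dict[int, int] = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def multiplicative_order(g: int, p: int) -> int:
    """Order of g in (Z/pZ)^*, assuming gcd(g, p) = 1."""
    order, x = 1, g % p
    while x != 1:
        x = x * g % p
        order += 1
    return order


N = 2**60 - 1
print(factorize(N))                 # only small primes appear
print(N % 61 == 0)                  # prints True
print(multiplicative_order(2, 61))  # prints 60
```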

Presentation of the algorithm

This part covers: Kronecker segmentation vs. Frobenius FFT; our variant of the Frobenius FFT; Frobenius encoding.

Kronecker segmentation vs. Frobenius FFT

Naive strategy: $\mathbb{F}_2[X]_{<n} \to \mathbb{F}_{2^{60}}[X]_{<n}$ (one field element per input bit).

$A, B \in \mathbb{F}_2[X] \;\to\; \tilde{A}, \tilde{B} \in \mathbb{F}_{2^{60}}[X] \;\to\; \tilde{A}\tilde{B} \in \mathbb{F}_{2^{60}}[X] \;\to\; AB \in \mathbb{F}_2[X]$

Kronecker segmentation: $\mathbb{F}_2[X]_{<n} \to \mathbb{F}_2[X]_{<30}[Z]_{<n/30} \to \mathbb{F}_{2^{60}}[Z]_{<n/30}$ (30 input bits per field element).

$A, B \in \mathbb{F}_2[X] \;\to\; \tilde{A}, \tilde{B} \in \mathbb{F}_2[X]_{<30}[Z] \;\to\; \tilde{A}\tilde{B} \in \mathbb{F}_2[X]_{<60}[Z] \;\to\; AB \in \mathbb{F}_2[X]$, where the last step substitutes $Z = X^{30}$.

Frobenius FFT: for $\omega$ a root of unity, the Frobenius map $\phi : x \mapsto x^2$ acts on $\{1, \omega, \omega^2, \omega^3, \ldots\}$. The naive DFT $A \mapsto [A(1), A(\omega), A(\omega^2), \ldots]$ therefore performs redundant computations: for all $A \in \mathbb{F}_2[X]$ and $x \in \mathbb{F}_{2^{60}}$, we have $A(x^2) = A(x)^2$.
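The redundancy identity $A(x^2) = A(x)^2$ can be checked numerically. The following sketch models $\mathbb{F}_{2^{60}}$ as $\mathbb{F}_2[X]/(\mu(X))$ with $\mu(X) = 1 + X + \cdots + X^{60}$, packs elements into Python integers, and uses plain bit-by-bit arithmetic. The function names are illustrative assumptions, and this is not how the authors' library implements the field.

```python
import random

MU = (1 << 61) - 1  # mu(X) = 1 + X + ... + X^60 = (X^61 - 1)/(X - 1)


def clmul(a: int, b: int) -> int:
    """Product in F_2[X], polynomials packed as integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_mod_mu(a: int) -> int:
    """Remainder of a modulo mu, obtained by cancelling the leading bit repeatedly."""
    while a.bit_length() > 60:
        a ^= MU << (a.bit_length() - 61)
    return a


def mul(x: int, y: int) -> int:
    """Product in F_2[X]/(mu)."""
    return reduce_mod_mu(clmul(x, y))


def evaluate(A: int, x: int) -> int:
    """Horner evaluation of A (coefficients in F_2, packed) at the field element x."""
    acc = 0
    for i in reversed(range(A.bit_length())):
        acc = mul(acc, x) ^ ((A >> i) & 1)
    return acc


rng = random.Random(2017)
A = rng.getrandbits(200)   # a random polynomial of degree < 200 over F_2
x = rng.getrandbits(60)    # a random field element
print(evaluate(A, mul(x, x)) == mul(evaluate(A, x), evaluate(A, x)))  # prints True
```

The identity relies on the coefficients of $A$ lying in $\mathbb{F}_2$: squaring fixes each coefficient, so one evaluation determines the evaluations along the whole Frobenius orbit of $x$.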

Our variant of the Frobenius FFT

Let $A \in \mathbb{F}_2[X]_{<60m}$. The values of the full $\mathrm{DFT}_\omega$ of $A$ fall into 61 blocks, each indexed by $i$:

$[A(\omega^{61i})],\ [A(\omega^{61i+1})],\ [A(\omega^{61i+2})],\ [A(\omega^{61i+3})],\ [A(\omega^{61i+4})],\ \ldots$

The Frobenius map $\phi$ sends the block $[A(\omega^{61i+1})]$ to $[A(\omega^{61i+2})]$, then to $[A(\omega^{61i+4})]$, and so on, so these blocks need not be computed separately. The block $[A(\omega^{61i+1})]$ is obtained directly: the Frobenius encoding maps $A$ to $\bar{A} \in \mathbb{F}_{2^{60}}[X]_{<m}$, and a DFT of $\bar{A}$ with respect to $\omega^{61}$ yields $[A(\omega^{61i+1})]$.

Multiplication algorithm

Choose $m$ such that $a + b < 60m$ and $61m$ divides $2^{60} - 1$.

1. Frobenius encoding: $A \in \mathbb{F}_2[X]_{<a} \mapsto \bar{A} \in \mathbb{F}_{2^{60}}[X]_{<a/60}$ and $B \in \mathbb{F}_2[X]_{<b} \mapsto \bar{B} \in \mathbb{F}_{2^{60}}[X]_{<b/60}$.
2. DFT: $\bar{A} \mapsto E_\omega(A) \in \mathbb{F}_{2^{60}}^m$ and $\bar{B} \mapsto E_\omega(B) \in \mathbb{F}_{2^{60}}^m$.
3. Pointwise product: $E_\omega(A),\ E_\omega(B) \mapsto E_\omega(AB) \in \mathbb{F}_{2^{60}}^m$.
4. Inverse DFT: $E_\omega(AB) \mapsto \overline{AB} \in \mathbb{F}_{2^{60}}[X]_{<m}$.
5. Frobenius decoding: $\overline{AB} \mapsto AB \in \mathbb{F}_2[X]_{<a+b}$.
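The two conditions on $m$ translate into a simple search. The sketch below enumerates every $m$ such that $61m$ divides $2^{60} - 1$ and picks the smallest one with $a + b < 60m$. Choosing the smallest admissible length is an assumption made for illustration; an actual implementation may prefer lengths for which the FFT is cheaper, and the function names are hypothetical.

```python
# Prime factorization of (2**60 - 1) // 61.
PRIME_POWERS = {3: 2, 5: 2, 7: 1, 11: 1, 13: 1, 31: 1, 41: 1, 151: 1, 331: 1, 1321: 1}


def admissible_lengths() -> list[int]:
    """All m such that 61 * m divides 2**60 - 1, in increasing order."""
    divisors = [1]
    for p, e in PRIME_POWERS.items():
        divisors = [d * p**k for d in divisors for k in range(e + 1)]
    return sorted(divisors)


def choose_length(a: int, b: int) -> int:
    """Smallest admissible m with a + b < 60 * m (illustrative selection rule)."""
    for m in admissible_lengths():
        if a + b < 60 * m:
            return m
    raise ValueError("operands too large for a single transform over F_{2^60}")


m = choose_length(10**6, 10**6)
print(m, 60 * m > 2 * 10**6, (2**60 - 1) % (61 * m) == 0)
```

Because only divisors of $(2^{60} - 1)/61$ are admissible, the chosen length jumps from one divisor to the next as the input grows, which is one source of a staircase-shaped cost curve.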

Frobenius encoding

Cooley-Tukey FFT:

$$A(\omega^{61i+1}) = \sum_{k<m} \omega^k \left( \sum_{l<61} a_{k+ml}\, \omega^{ml} \right) \omega^{61ki}$$

Set $\theta := \omega^m$ and $\omega' := \omega^{61}$. Since $A \in \mathbb{F}_2[X]_{<60m}$, we can define

$$\tilde{A}_k := \sum_{l<61} a_{k+ml} X^l \in \mathbb{F}_2[X]_{<60}, \qquad \bar{A} := \sum_{k<m} \omega^k \tilde{A}_k(\theta)\, Z^k \in \mathbb{F}_{2^{60}}[Z]_{<m},$$

so that $A(\omega^{61i+1}) = \bar{A}(\omega'^i)$.

Technical assumption: $\omega$ is chosen such that $\theta = z \bmod \mu(z)$, with $\mu(z) := \frac{z^{61} - 1}{z - 1}$. With this choice, the representation of $\tilde{A}_k(\theta)$ in $\mathbb{F}_{2^{60}}$ is simply the coefficient vector of $\tilde{A}_k$.
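The encoding can be tested end to end on a scaled-down analogue. The sketch below replaces 61 by 5 and $\mathbb{F}_{2^{60}}$ by $\mathbb{F}_{2^4} = \mathbb{F}_2[z]/(\mu(z))$ with $\mu(z) = 1 + z + z^2 + z^3 + z^4$ (irreducible because 2 generates $(\mathbb{Z}/5\mathbb{Z})^*$), and takes $m = 3$, so that $5m = 15 = 2^4 - 1$. All names and the toy parameters are assumptions made for illustration; the talk itself works over $\mathbb{F}_{2^{60}}$.

```python
P, DEG, M = 5, 4, 3      # toy analogues of 61, 60 and the transform length m
MU = (1 << P) - 1        # mu(z) = 1 + z + ... + z^(P-1)
THETA = 0b10             # the class of z modulo mu(z)


def mul(a: int, b: int) -> int:
    """Product in F_2[z]/(mu), elements packed as integers below 2**DEG."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> DEG) & 1:   # degree reached DEG: subtract mu
            a ^= MU
    return r


def power(x: int, e: int) -> int:
    r = 1
    for _ in range(e):
        r = mul(r, x)
    return r


def eval_f2(A: int, x: int) -> int:
    """Horner evaluation of A (coefficients in F_2, packed in an integer) at x."""
    acc = 0
    for i in reversed(range(A.bit_length())):
        acc = mul(acc, x) ^ ((A >> i) & 1)
    return acc


def eval_field(coeffs: list[int], x: int) -> int:
    """Horner evaluation of a polynomial with field coefficients at x."""
    acc = 0
    for c in reversed(coeffs):
        acc = mul(acc, x) ^ c
    return acc


# Technical assumption: omega has order P*M and omega^M equals theta.
omega = next(x for x in range(2, 1 << DEG)
             if power(x, M) == THETA and power(x, P) != 1)
omega_prime = power(omega, P)


def frobenius_encode(A: int) -> list[int]:
    """Coefficients omega^k * A~_k(theta) of the encoding, for A of degree < DEG*M."""
    coeffs = []
    for k in range(M):
        # Bit l of the slice is a_{k + M*l}; since theta is the class of z,
        # this integer already represents A~_k(theta).
        slice_k = sum(((A >> (k + M * l)) & 1) << l for l in range(DEG))
        coeffs.append(mul(power(omega, k), slice_k))
    return coeffs


A = 0b101100111001
A_bar = frobenius_encode(A)
print(all(eval_f2(A, power(omega, P * i + 1)) == eval_field(A_bar, power(omega_prime, i))
          for i in range(M)))  # prints True
```

The check confirms that a length-$m$ DFT of the encoded polynomial reproduces the values $A(\omega^{5i+1})$, the toy counterpart of $A(\omega^{61i+1})$.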

Implementation details

This part covers: data representation; Frobenius encoding; timings.

Data representation

- Polynomials over $\mathbb{F}_2$ are stored in packed representation.
- An element of $\mathbb{F}_{2^{60}}$ fits on one machine word; a polynomial over $\mathbb{F}_{2^{60}}$ is an array of words.
- Matrices over $\mathbb{F}_2$ are stored in packed column representation.

A polynomial $A \in \mathbb{F}_2[X]_{<60m}$ is viewed as a $60 \times m$ matrix over $\mathbb{F}_2$; the slice $\tilde{A}_k(X)$ gathers the entries $a_{k+ml}$ for $l < 60$.
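As a rough illustration of these layouts, the sketch below packs the coefficients of a polynomial over $\mathbb{F}_2$ into 64-bit words and reads one slice $\tilde{A}_k$ back as a single 60-bit word. The word size of 64, the function names and the entry-by-entry access are assumptions for illustration; they say nothing about how the authors' library lays out memory.

```python
WORD_BITS = 64


def pack(coeffs: list[int]) -> list[int]:
    """Pack coefficients a_0, a_1, ... of a polynomial over F_2, 64 per word."""
    words = [0] * ((len(coeffs) + WORD_BITS - 1) // WORD_BITS)
    for j, bit in enumerate(coeffs):
        words[j // WORD_BITS] |= (bit & 1) << (j % WORD_BITS)
    return words


def coefficient(words: list[int], j: int) -> int:
    """Coefficient a_j, or 0 beyond the stored words."""
    q, r = divmod(j, WORD_BITS)
    return (words[q] >> r) & 1 if q < len(words) else 0


def slice_polynomial(words: list[int], m: int, k: int) -> int:
    """The slice whose bit l equals a_{k + m*l} for l < 60, as one 60-bit word."""
    return sum(coefficient(words, k + m * l) << l for l in range(60))


# A = 1 + X^2 + X^3 with m = 2: slice 0 holds (a_0, a_2), slice 1 holds (a_1, a_3).
words = pack([1, 0, 1, 1])
print(slice_polynomial(words, 2, 0), slice_polynomial(words, 2, 1))  # prints 3 2
```

Reading the matrix one entry at a time, as done here, is exactly the cost that a word-level transposition avoids.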

Frobenius encoding

1. See $A$ as a $60 \times m$ matrix; add 4 columns for alignment.
2. Transpose the $64 \times m$ matrix (this yields $[\tilde{A}_k(\theta)]_{k<m}$).
3. Multiply by the twiddle factors $\omega^k$ (this yields $\bar{A}$).

Matrix transposition:
- Exploit the AVX2 instruction set.
- Reduction: $(64 \times m) \to (64 \times 256) \to (8 \times 8)$.
- Transpose 4 packed $8 \times 8$ matrices at once using vector instructions.

Finally, call the efficient FFT over $\mathbb{F}_{2^{60}}$ on $\bar{A}$.
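The base case of the reduction is the transposition of an $8 \times 8$ bit matrix held in 64 bits. The sketch below is a scalar stand-in for that step, using three classical masked swaps; it handles one matrix per call, whereas the talk processes four at once with AVX2 vector instructions. The bit layout (entry $(r, c)$ stored at bit $8r + c$) and the function names are assumptions for illustration.

```python
def transpose8x8(x: int) -> int:
    """Transpose an 8x8 bit matrix packed in 64 bits (entry (r, c) at bit 8*r + c)."""
    # Swap the off-diagonal entries inside every 2x2 block.
    t = (x ^ (x >> 7)) & 0x00AA00AA00AA00AA
    x ^= t ^ (t << 7)
    # Swap the off-diagonal 2x2 blocks inside every 4x4 block.
    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCC
    x ^= t ^ (t << 14)
    # Swap the two off-diagonal 4x4 blocks.
    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0
    x ^= t ^ (t << 28)
    return x


def transpose8x8_naive(x: int) -> int:
    """Reference implementation that moves one entry at a time."""
    y = 0
    for r in range(8):
        for c in range(8):
            if (x >> (8 * r + c)) & 1:
                y |= 1 << (8 * c + r)
    return y


# The single entry (0, 1) moves to (1, 0), that is, from bit 1 to bit 8.
print(transpose8x8(0b10))  # prints 256
```

Each masked swap exchanges one bit of the row index with the matching bit of the column index, so three rounds suffice for $8 = 2^3$.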

Timings

[Figure: timings (ms) as a function of the input size (quadwords) for the old implementation, Chen et al., gf2x version 1.2 and the new implementation. A second plot shows only the old and new implementations.]

Perspectives

Better use of vector instructions:
- Vectorize the FFT routine over $\mathbb{F}_{2^{60}}$.
- Support the new AVX-512 instruction set.

Others:
- Use the Truncated Fourier Transform (to reduce the staircase effect).
- Generalize to other finite fields.

Questions?

Thank you for your attention.
