Speeding up characteristic 2: I. Linear maps II. The $M(n)$ game III. Batching IV. Normal bases. D. J. Bernstein, University of Illinois at Chicago


1 Speeding up characteristic 2: I. Linear maps II. The $M(n)$ game III. Batching IV. Normal bases. D. J. Bernstein, University of Illinois at Chicago. NSF ITR

2 Part I. Linear maps. Consider computing $h_0 = q_0$; $h_1 = q_1$; $h_2 = q_2 \oplus (p_0 \oplus q_0 \oplus r_0)$; $h_3 = (p_1 \oplus q_1 \oplus r_1)$; $h_4 = (p_2 \oplus q_2 \oplus r_2) \oplus r_0$; $h_5 = r_1$; $h_6 = r_2$. Easy: 8 additions. Can find these 8 additions in several papers. But 8 is not optimal!

3 Wasting brain power is bad for the environment. Use existing algorithms to find addition chains. Apply, e.g., greedy additive CSE algorithm from 1997 Paar: find input pair $i_0, i_1$ with most popular $i_0 \oplus i_1$; compute $i_0 \oplus i_1$; simplify using $i_0 \oplus i_1$; repeat. This algorithm finds repeated $q_2 \oplus r_0$; uses 7 additions.
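
The greedy pair-extraction idea can be sketched in a few lines of Python. This is a simplified illustration, not Paar's exact formulation: the helper name `greedy_cse`, the temporary names `t0, t1, ...`, and the stopping rule (stop once no pair is shared by two outputs) are assumptions made here.

```python
from collections import Counter
from itertools import combinations

def greedy_cse(rows):
    """Greedy additive CSE sketch: repeatedly factor out the most popular pair.

    rows: one collection of variable names per output (the output is their xor).
    Returns (temps, reduced_rows, additions).
    """
    rows = [set(r) for r in rows]
    temps = []                      # (new_name, a, b) meaning new_name = a xor b
    while True:
        counts = Counter()
        for r in rows:
            counts.update(combinations(sorted(r), 2))
        if not counts:
            break
        (a, b), hits = counts.most_common(1)[0]
        if hits < 2:                # assumption: stop when no pair is shared
            break
        t = f"t{len(temps)}"
        temps.append((t, a, b))
        for r in rows:
            if a in r and b in r:
                r.difference_update((a, b))
                r.add(t)
    additions = len(temps) + sum(len(r) - 1 for r in rows if r)
    return temps, rows, additions

# The seven outputs of the running example, written as sets of summands.
outputs = [
    ["q0"], ["q1"],
    ["q2", "p0", "q0", "r0"],
    ["p1", "q1", "r1"],
    ["p2", "q2", "r2", "r0"],
    ["r1"], ["r2"],
]
naive = sum(len(r) - 1 for r in outputs)
temps, reduced, additions = greedy_cse(outputs)
print(naive, additions, temps)
```

On this input the only pair shared by two outputs is `q2`, `r0`, so a single temporary is created and the addition count drops from 8 to 7.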

4 A new algorithm: xor largest. Start with the matrix mod 2 for the desired linear map. If two largest rows have same first bit, replace largest row by its xor with second-largest row. Otherwise change largest row by clearing first bit. In both cases, compute result recursively, and finish with one xor.

5 A small example: 1011 = $x_0 + x_2 + x_3$; 1111 = $x_0 + x_1 + x_2 + x_3$; 0110 = $x_1 + x_2$; 0101 = $x_1 + x_3$. Replace largest row by its xor with second-largest row.

6 Recursively compute 1011 = $x_0 + x_2 + x_3$; 0100 = $x_1$; 0110 = $x_1 + x_2$; 0101 = $x_1 + x_3$; plus 1 xor of first output into second output.

7 Recursively compute plus 1 input load, 2 xors.

8 Recursively compute plus 1 input load, 3 xors.

9 Recursively compute plus 1 input load, 4 xors.

10 Recursively compute plus 2 input loads, 4 xors. Note: this was just a copy.

11 Recursively compute plus 2 input loads, 4 xors.

12 Recursively compute plus 3 input loads, 5 xors.

13 Recursively compute plus 3 input loads, 5 xors.

14 Recursively compute plus 4 input loads, 5 xors.
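
A minimal Python sketch of the xor-largest rule, written iteratively: the reduction steps are recorded and then replayed in reverse, which matches "compute result recursively, and finish with one xor". The function names, the bit-string row encoding, the tie-breaking between equal rows, and the 4x4 example rows (as read from the small example) are assumptions of this sketch.

```python
import random

def xor_largest(bitrows):
    """Return straight-line steps for a 0/1 matrix given as equal-length bit strings.

    The first character of each string is the coefficient of input 0.
    Each step is (dst, kind, src, is_xor): output dst absorbs output src
    (kind "out") or input src (kind "in"); is_xor is False when dst was
    still zero at that point, so the step is only a copy.
    """
    width = len(bitrows[0])
    rows = [int(s, 2) for s in bitrows]
    steps = []
    while any(rows):
        order = sorted(range(len(rows)), key=lambda i: rows[i], reverse=True)
        big = order[0]
        second = order[1] if len(order) > 1 else None
        top = rows[big].bit_length() - 1          # first set bit of largest row
        if second is not None and (rows[second] >> top) & 1:
            rows[big] ^= rows[second]             # same first bit: xor the rows
            steps.append((big, "out", second, rows[big] != 0))
        else:
            rows[big] ^= 1 << top                 # otherwise clear the first bit
            steps.append((big, "in", width - 1 - top, rows[big] != 0))
    steps.reverse()                               # the first choice is executed last
    return steps

def run(steps, inputs, n_outputs):
    """Execute the steps, writing only to the output registers."""
    out = [0] * n_outputs
    for dst, kind, src, _ in steps:
        out[dst] ^= out[src] if kind == "out" else inputs[src]
    return out

def direct(bitrows, inputs):
    """Reference evaluation: xor the selected inputs row by row."""
    result = []
    for s in bitrows:
        acc = 0
        for j, c in enumerate(s):
            if c == "1":
                acc ^= inputs[j]
        result.append(acc)
    return result

def cost(steps):
    xors = sum(1 for s in steps if s[3])
    loads = sum(1 for s in steps if s[1] == "in")
    return xors, loads

small = ["1011", "1111", "0110", "0101"]
# Seven-output example; columns assumed ordered p0 p1 p2 q0 q1 q2 r0 r1 r2.
seven = ["000100000", "000010000", "100101100", "010010010",
         "001001101", "000000010", "000000001"]
for matrix in (small, seven):
    steps = xor_largest(matrix)
    xs = [random.getrandbits(32) for _ in matrix[0]]
    assert run(steps, xs, len(matrix)) == direct(matrix, xs)
    print(cost(steps))
```

The two printed pairs are (xors, input loads); for these matrices the sketch reproduces the totals reached at the end of the two walk-throughs in the slides.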

15 Memory friendliness: Algorithm writes only to the output registers. No temporary storage. $n$ inputs, $n$ outputs: total $2n$ registers with 0 loads, 0 stores. Or $n + 1$ registers with $n$ loads, 0 stores: each input is read only once. Or $n$ registers with $n$ loads, 0 stores, if platform has load-xor insn.

16 Two-operand friendliness: Platform with $a \leftarrow a \oplus b$ but without $a \leftarrow b \oplus c$ uses only $n$ extra copies. Naive column sweep also uses $n + 1$ registers, $n$ loads, but usually many more xors. Input partitioning (e.g., 1956 Lupanov) uses somewhat more xors, copies; somewhat more registers. Greedy additive CSE uses somewhat fewer xors but many more copies, registers.

17 For $m$ inputs and $n$ outputs, average $n \times m$ matrix: The xor-largest algorithm uses $\approx mn/\lg n$ two-operand xors; $n$ copies; $m$ loads; $n + 1$ regs.

18 For $m$ inputs and $n$ outputs, average $n \times m$ matrix: The xor-largest algorithm uses $\approx mn/\lg n$ two-operand xors; $n$ copies; $m$ loads; $n + 1$ regs. Pippenger's algorithm uses $\approx mn/\lg mn$ three-operand xors but seems to need many regs. Pippenger proved that his algebraic complexity was near optimal for most matrices (at least without mod 2), but didn't consider regs, two-operand complexity, etc.

19 Our original example: Each row has coefficients of $p_0, p_1, p_2, q_0, q_1, q_2, r_0, r_1, r_2$.

20 Our original example: plus 1 xor, 1 input load.

21 Our original example: plus 2 xors, 2 input loads.

22 Our original example: plus 3 xors, 3 input loads.

23 Our original example: plus 4 xors, 3 input loads.

24 Our original example: plus 4 xors, 4 input loads.

25 Our original example: plus 5 xors, 4 input loads.

26 Our original example: plus 5 xors, 5 input loads.

27 Our original example: plus 6 xors, 5 input loads.

28 Our original example: plus 7 xors, 6 input loads.

29 Our original example: plus 7 xors, 7 input loads.

30 Our original example: plus 7 xors, 7 input loads.

31 Our original example: plus 7 xors, 8 input loads.

32 Our original example: plus 7 xors, 8 input loads.

33 Our original example: plus 7 xors, 9 input loads. Algorithm found the speedup.

34 Part II. The $M(n)$ game. Define $M(n)$ as the minimum number of bit operations (ands, xors) needed to multiply $n$-bit polys $f, g \in \mathbf{F}_2[x]$ (in standard representation). e.g. $M(2) \le 5$: to compute $h_0 + h_1 x + h_2 x^2 = (f_0 + f_1 x)(g_0 + g_1 x)$ can compute $h_0 = f_0 g_0$, $h_1 = f_0 g_1 + f_1 g_0$, $h_2 = f_1 g_1$ with 4 ands, 1 xor.

35 Schoolbook multiplication: $M(n) \le \Theta(n^2)$. Karatsuba: $M(n) \le \Theta(n^{\lg 3})$. Toom: $M(n) \le n 2^{\Theta(\sqrt{\lg n})}$. Schönhage–Strassen: $M(n) \le \Theta(n \lg n \lg\lg n)$. Fürer improves $\lg\lg n$ for integers but doesn't help mod 2.

36 What does this tell us about $M(131)$ or $M(251)$? Absolutely nothing! Reanalyze algorithms to see exact complexity. Rethink algorithm design to find constant-factor (and sub-constant-factor) speedups that are not visible in the asymptotics.

37 Schoolbook recursion: $M(n + 1) \le M(n) + 4n$. Hence $M(n) \le 2n^2 - 2n + 1$. Karatsuba recursion as commonly stated: $M(2n) \le 3M(n) + 8n - 4$. e.g. Karatsuba for $n = 1$: $f = f_0 + f_1 x$, $g = g_0 + g_1 x$, $h_0 = f_0 g_0$, $h_2 = f_1 g_1$, $h_1 = (f_0 + f_1)(g_0 + g_1) - h_0 - h_2$ $\Rightarrow$ $fg = h_0 + h_1 x + h_2 x^2$.
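
To see what the two recursions give for concrete sizes, here is a small Python tabulation. The function names are illustrative; the only inputs are the schoolbook closed form $2n^2 - 2n + 1$ and the commonly stated Karatsuba recursion, with the better of the two taken at each even size.

```python
def schoolbook_bound(n):
    """Closed form of M(1) = 1, M(n + 1) <= M(n) + 4n."""
    return 2 * n * n - 2 * n + 1

def karatsuba_bound(n):
    """Best bound from M(2k) <= 3 M(k) + 8k - 4, falling back to schoolbook."""
    best = schoolbook_bound(n)
    if n % 2 == 0:
        k = n // 2
        best = min(best, 3 * karatsuba_bound(k) + 8 * k - 4)
    return best

for n in (1, 2, 4, 8, 16, 32):
    print(n, schoolbook_bound(n), karatsuba_bound(n))
```

With these two recursions alone, Karatsuba does not beat schoolbook until size 8 (103 versus 113 bit operations), which is why exact counts matter more than asymptotics at cryptographic sizes.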

38 Karatsuba for $n = 2$: $f = f_0 + f_1 x + f_2 x^2 + f_3 x^3$, $g = g_0 + g_1 x + g_2 x^2 + g_3 x^3$, $H_0 = (f_0 + f_1 x)(g_0 + g_1 x)$, $H_2 = (f_2 + f_3 x)(g_2 + g_3 x)$, $H_1 = ((f_0 + f_2) + (f_1 + f_3)x)((g_0 + g_2) + (g_1 + g_3)x) - H_0 - H_2$ $\Rightarrow$ $fg = H_0 + H_1 x^2 + H_2 x^4$.

39 Initial linear computation: $f_0 + f_2$, $f_1 + f_3$, $g_0 + g_2$, $g_1 + g_3$; cost 4. Three size-2 mults producing $H_0 = q_0 + q_1 x + q_2 x^2$; $H_2 = r_0 + r_1 x + r_2 x^2$; $H_0 + H_1 + H_2 = p_0 + p_1 x + p_2 x^2$. Final linear reconstruction: $H_1 = (p_0 - q_0 - r_0) + (p_1 - q_1 - r_1)x + (p_2 - q_2 - r_2)x^2$, cost 6; $fg = H_0 + H_1 x^2 + H_2 x^4$, cost 2.

40 Let's look more closely at the reconstruction: $fg = h_0 + h_1 x + \cdots + h_6 x^6$ with $h_0 = q_0$; $h_1 = q_1$; $h_2 = q_2 + (p_0 - q_0 - r_0)$; $h_3 = (p_1 - q_1 - r_1)$; $h_4 = (p_2 - q_2 - r_2) + r_0$; $h_5 = r_1$; $h_6 = r_2$.

41 Let's look more closely at the reconstruction: $fg = h_0 + h_1 x + \cdots + h_6 x^6$ with $h_0 = q_0$; $h_1 = q_1$; $h_2 = q_2 + (p_0 - q_0 - r_0)$; $h_3 = (p_1 - q_1 - r_1)$; $h_4 = (p_2 - q_2 - r_2) + r_0$; $h_5 = r_1$; $h_6 = r_2$. We've seen this before! Reduce $6 + 2 = 8$ ops to 7 ops by reusing $q_2 - r_0$.
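
The reuse of the shared middle term can be checked with a bit-level Python sketch that counts ands and xors. It generalizes the size-4 reconstruction by first forming $u = H_0 + x^n H_2$ (where the shared terms live) and then $(1 + x^n)u + x^n(H_0 + H_1 + H_2)$. The function names, the list-of-bits representation, and the one-element list used as an operation counter are assumptions of this sketch.

```python
def schoolbook(f, g, count):
    """Schoolbook product of bit lists; count[0] accumulates ands and xors."""
    n = len(f)
    h = [None] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            t = f[i] & g[j]
            count[0] += 1                  # one and
            if h[i + j] is None:
                h[i + j] = t               # first term is a copy, not an xor
            else:
                h[i + j] ^= t
                count[0] += 1              # one xor
    return h

def refined_karatsuba(f, g, count):
    """Karatsuba with the shared-term reconstruction; len(f) == len(g)."""
    size = len(f)
    if size <= 2 or size % 2:
        return schoolbook(f, g, count)
    n = size // 2
    q = refined_karatsuba(f[:n], g[:n], count)           # H0
    r = refined_karatsuba(f[n:], g[n:], count)           # H2
    fs = [a ^ b for a, b in zip(f[:n], f[n:])]
    gs = [a ^ b for a, b in zip(g[:n], g[n:])]
    count[0] += 2 * n                                    # initial linear step
    p = refined_karatsuba(fs, gs, count)                 # H0 + H1 + H2
    # u = H0 + x^n H2: the two products overlap in n - 1 coefficients.
    u = q[:n] + [q[n + i] ^ r[i] for i in range(n - 1)] + r[n - 1:]
    count[0] += n - 1
    # (1 + x^n) u + x^n p: two xors for each of the 2n - 1 middle coefficients.
    mid = [u[i] ^ u[i - n] ^ p[i - n] for i in range(n, 3 * n - 1)]
    count[0] += 2 * (2 * n - 1)
    return u[:n] + mid + u[2 * n - 1:]

def bits(value, width):
    return [(value >> i) & 1 for i in range(width)]

ops = [0]
product = refined_karatsuba(bits(0b1011, 4), bits(0b0111, 4), ops)
print(product, ops[0])
```

For 4-coefficient inputs the counter reaches 26, one less than the 27 given by three size-2 schoolbook products (5 each) plus the 12 linear operations of the commonly stated reconstruction.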

42 2000 Bernstein: $M(2n) \le 3M(n) + 7n - 3$. Bernstein: new bounds on $M(n)$ from further improvements to Karatsuba, Toom, etc. binary.cr.yp.to/m.html Typically 20% smaller than 2003 Rodríguez-Henríquez–Koç, 2005 Chang–Kim–Park–Lim, 2006 Weimerskirch–Paar, 2006 von zur Gathen–Shokrollahi, 2007 Peter–Langendörfer.

43 So far have focused on $M(n)$ for small $n$, but different techniques are better for large $n$. I'm now exploring impact of 2008 Gao–Mateer. For $\mathbf{F}_2 \subseteq \mathbf{F}_q$: 1988 Wang–Zhu, 1989 Cantor diagonalize $\mathbf{F}_q[t]/(t^q + t)$ using $0.5 q \lg q$ mults in $\mathbf{F}_q$, $0.5 q (\lg q)^{\lg 3}$ adds in $\mathbf{F}_q$. Gao–Mateer use $0.5 q \lg q$ mults, $0.25 q \lg q \lg\lg q$ adds.

44 Who cares? Conventional wisdom: Detailed $M(n)$ analysis has very little relevance to software speed. We multiply $f$ by $g$ by looking up 4 bits of $f$ in a size-16 table of precomputed multiples of $g$; looking up next 4 bits; etc. One table lookup replaces many bit operations! Might use Karatsuba etc., but only for large $n$.
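
The conventional table-lookup approach can be sketched as follows, with polynomials over $\mathbf{F}_2$ stored as Python integers (bit $i$ is the coefficient of $x^i$). The function names and the integer representation are assumptions of this sketch; real implementations work on fixed-width machine words.

```python
import random

def clmul_bitwise(f, g):
    """Reference product in F_2[x], one coefficient of f at a time."""
    h = 0
    while f:
        if f & 1:
            h ^= g
        f >>= 1
        g <<= 1
    return h

def clmul_window4(f, g):
    """Multiply f by g using a size-16 table of precomputed multiples of g."""
    table = [0] * 16
    for i in range(1, 16):
        # i*g = x * (i >> 1)*g, plus g if the low bit of i is set
        table[i] = (table[i >> 1] << 1) ^ (g if i & 1 else 0)
    h = 0
    shift = 0
    while f >> shift:
        h ^= table[(f >> shift) & 15] << shift   # look up the next 4 bits of f
        shift += 4
    return h

f = random.getrandbits(131)
g = random.getrandbits(131)
assert clmul_window4(f, g) == clmul_bitwise(f, g)
```

Each lookup stands in for up to four conditional shifted xors of $g$, which is the sense in which one table lookup replaces many bit operations.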

45 Part III. Batching. Classic $\mathbf{F}_p^*$ index calculus needs to check smoothness of many positive integers $< p$. Smooth integer: integer with no prime divisors $> y$. Typical: $(\log y)^2 \in (1/2 + o(1)) \log p \log\log p$. Many: typically $y^{2+o(1)}$, of which $y^{1+o(1)}$ are smooth. (Modern index calculus, NFS: smaller integers; smaller $y$.) How to check smoothness?

46 Old answers: Trial division, time $y^{1+o(1)}$; rho, time $y^{1/2+o(1)}$, assuming standard conjectures. Better answer: ECM etc. Time $y^{o(1)}$; specifically $\exp\sqrt{(2 + o(1)) \log y \log\log y}$, assuming standard conjectures. Much better answer (in standard RAM model): Known batch algorithms test smoothness of many integers simultaneously. Time per input: $(\log y)^{O(1)} = \exp O(\log\log y)$.
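
One way such a batch test can be organized is with a product tree and a remainder tree: reduce the product $z$ of all primes up to $y$ modulo every input at once, then square each residue until any $y$-smooth input must divide the result. The sketch below follows that pattern in Python; the function names are illustrative, the prime generation is deliberately naive, and the code does not attempt the fast multiplication that the quoted running time relies on.

```python
from math import prod

def primes_upto(y):
    """Naive prime list, adequate for a small demonstration."""
    return [p for p in range(2, y + 1)
            if all(p % d for d in range(2, int(p ** 0.5) + 1))]

def product_tree(xs):
    """tree[0] is xs; each higher level holds products of adjacent pairs."""
    tree = [list(xs)]
    while len(tree[-1]) > 1:
        level = tree[-1]
        tree.append([prod(level[i:i + 2]) for i in range(0, len(level), 2)])
    return tree

def remainders(z, xs):
    """Compute z mod x for every x in xs by walking down the product tree."""
    tree = product_tree(xs)
    rems = [z % tree[-1][0]]
    for level in reversed(tree[:-1]):
        rems = [rems[i // 2] % x for i, x in enumerate(level)]
    return rems

def batch_smooth(xs, y):
    """For a non-empty list of positive integers, flag the y-smooth ones."""
    z = prod(primes_upto(y))
    flags = []
    for x, r in zip(xs, remainders(z, xs)):
        e = 0
        while (1 << (1 << e)) < x:     # need 2^(2^e) >= x
            e += 1
        for _ in range(e):
            r = r * r % x              # r = z^(2^e) mod x
        flags.append(r == 0)           # x divides z^(2^e) iff x is y-smooth
    return flags

print(batch_smooth([12, 14, 1024, 77, 875], 7))
```

The squaring loop works because no prime can appear in $x$ with exponent above $\lg x$, so $x$ is $y$-smooth exactly when it divides $z^{2^e}$ once $2^e \ge \lg x$.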

47 General pattern: Algorithm designer optimizes algorithm for one input. But algorithm is then applied to many inputs! Oops. Often much better speed from batch algorithms optimized for many inputs. e.g. Batch ECDL: $\sqrt{\#}$ speedup. Batch NFS: smaller exponent. Can find many more examples.

48 Surprising recent example: Batching can save time in multiplication! Largest speedups: $\mathbf{F}_2[x]$. Consequence: New speed record for public-key cryptography scalar mults/second on a 3.2GHz Phenom II X4 for a secure elliptic curve/f

49 Surprising recent example: Batching can save time in multiplication! Largest speedups: $\mathbf{F}_2[x]$. Consequence: New speed record for public-key cryptography scalar mults/second on a 3.2GHz Phenom II X4 for a secure elliptic curve/f Note: No subfields were exploited in the creation of this record.

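50 Simplest batching technique: bitslicing. Transpose 128 polynomials $f_0, \ldots, f_{127} \in \mathbf{F}_2[x]$, each having $n$ coefficients, into $n$ vectors $v_0, \ldots, v_{n-1} \in \mathbf{F}_2^{128}$, where $v_j[i] = f_i[j]$. Vector operation $v_1 \oplus v_{33}$ adds bit 1 of $f_i$ to bit 33 of $f_i$ for each $i$ in parallel.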
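
A small Python sketch of the transposition and of a bitsliced schoolbook product. Python integers stand in for 128-bit vector registers, and the helper names (`bitslice`, `unbitslice`, `sliced_mul`, `clmul`) are assumptions of this sketch rather than anything from the talk.

```python
import random

def bitslice(polys, nbits):
    """Slice j collects coefficient j of every polynomial (bit i comes from polys[i])."""
    return [sum(((p >> j) & 1) << i for i, p in enumerate(polys))
            for j in range(nbits)]

def unbitslice(slices, count):
    """Inverse transposition: rebuild `count` polynomials from the slices."""
    return [sum(((s >> i) & 1) << j for j, s in enumerate(slices))
            for i in range(count)]

def sliced_mul(fs, gs):
    """Schoolbook product on slices: each & and ^ acts on the whole batch."""
    n = len(fs)
    hs = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            hs[i + j] ^= fs[i] & gs[j]
    return hs

def clmul(f, g):
    """Reference product of two polynomials over F_2 stored as ints."""
    h = 0
    while f:
        if f & 1:
            h ^= g
        f >>= 1
        g <<= 1
    return h

batch, nbits = 128, 8
F = [random.getrandbits(nbits) for _ in range(batch)]
G = [random.getrandbits(nbits) for _ in range(batch)]
H = unbitslice(sliced_mul(bitslice(F, nbits), bitslice(G, nbits)), batch)
assert H == [clmul(f, g) for f, g in zip(F, G)]
```

Every `&` and `^` in `sliced_mul` is one vector operation serving all 128 products, so a bit-operation count for a single multiplication translates directly into a vector-instruction count for the batch.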

51 Bitslicing disadvantages: Table lookups are expensive. e.g. tab[$f$ mod 16]. Conditional branches are expensive. 128× volume of data; harder to avoid load/store bottlenecks. Transposition costs roughly 1 cycle per byte; frequent transposition is bad.

52 Bitslicing advantages: Free bit extraction, bit shuffling, etc. No word-size penalty. e.g. 128 additions of $n$-bit polynomials cost $n$ vector xors instead of $128\lceil n/128 \rceil$. Huge speedup for small $n$. $\Rightarrow$ Productive synergy with $M(n)$ techniques.

53 Elliptic-curve addition $P + Q$ traditionally uses conditional branches: $Q = P$? $Q = -P$? etc. Bernstein: cheaply avoid conditional branches in $P \mapsto nP$ if $2 \ne 0$. Bernstein–Lange, using Edwards curves: arbitrary group ops if $2 \ne 0$. Bernstein–Lange–Rezaeian Farashahi, binary Edwards curves: arbitrary group ops if $2 = 0$.

54 Part IV. Normal bases. Current ECRYPT project, spearheaded by Tanja Lange: break Certicom's ECC2K-130. i.e., compute discrete log of a challenge point on $y^2 + xy = x^3 + 1$ over $\mathbf{F}_{2^{131}}$. Carefully selected iteration function for Pollard rho involves 5 mults, 21 squarings, 7 adds, occasional inversions, and one computation of weight in normal basis.

55 $\mathbf{F}_{2^{131}}$ has type-2 normal basis $\zeta + \zeta^{-1}, \zeta^2 + \zeta^{-2}, \zeta^4 + \zeta^{-4}, \ldots$, where $\zeta$ is primitive 263rd root of 1. Weight is sum of coefficients. Squaring is rotation. Multi-squaring is rotation. Inversion by Fermat uses many multi-squarings.

56 But fast ECDL software uses polynomial basis: e.g., basis $1, x, x^2, \ldots, x^{130}$ of $\mathbf{F}_2[x]/(x^{131} + x^{13} + x^2 + x + 1)$. Many obvious disadvantages: more expensive squaring, multi-squaring, inversion; must convert to normal basis (e.g., with xor-largest) before computing weight. But huge speedup in the 5 mults: polynomial multiplication uses Karatsuba etc.; reduction is very fast.

57 How slow is normal-basis mult? Type-1 normal basis of $\mathbf{F}_{2^n}$, where 2 has order $n$ mod $n + 1$, is a permutation of $\zeta, \zeta^2, \ldots, \zeta^n$ in $\mathbf{F}_2[\zeta]/(\zeta^{n+1} - 1)$. $M(n)$ operations to multiply, obtaining coefficients of $\zeta^2, \zeta^3, \ldots, \zeta^{2n}$. $2n - 1$ operations to reduce $\zeta^2, \zeta^3, \ldots, \zeta^{2n}$ to $\zeta, \zeta^2, \ldots, \zeta^n$. Alternative: $M(n + 1) + n$ for redundant $1, \zeta, \ldots, \zeta^n$.

58 Type-2 normal basis of $\mathbf{F}_{2^n}$, where 2 has order $n$ mod $2n + 1$, is a permutation of $\zeta + \zeta^{-1}, \zeta^2 + \zeta^{-2}, \zeta^3 + \zeta^{-3}, \ldots, \zeta^n + \zeta^{-n}$ in $\mathbf{F}_2[\zeta]/(\zeta^{2n+1} - 1)$. Gao–von zur Gathen–Panario–Shoup: $2M(n) + O(n)$ operations to multiply on this basis. Polynomial basis of $\mathbf{F}_{2^n}$ is about twice as fast.

59 2007 von zur Gathen–Shokrollahi–Shokrollahi: $M(n) + O(n \lg n)$ operations to multiply on this basis. Bernstein: improved variant of algorithm sets Core 2 speed records for the ECC2K-130 attack. Schwabe: also Cell speed records. Bernstein–Lange: mix normal bases with polynomial bases and speed up reduction.

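60 vzG–S–S in a nutshell: Write $N_k = \zeta^k + \zeta^{-k}$ and $P_k = (\zeta + \zeta^{-1})^k$. If $f_0 + f_1 P_1 + f_2 P_2 + f_3 P_3 = g_0 + g_1 N_1 + g_2 N_2 + g_3 N_3$ and $f_4 + f_5 P_1 + f_6 P_2 + f_7 P_3 = g_4 + g_5 N_1 + g_6 N_2 + g_7 N_3$ then $f_0 + f_1 P_1 + f_2 P_2 + f_3 P_3 + f_4 P_4 + f_5 P_5 + f_6 P_6 + f_7 P_7 = g_0 + (g_1 + g_7) N_1 + (g_2 + g_6) N_2 + (g_3 + g_5) N_3 + g_4 N_4 + g_5 N_5 + g_6 N_6 + g_7 N_7$.

61 Proof: e.g., $(\zeta + \zeta^{-1})^4 (\zeta^3 + \zeta^{-3}) = \zeta^7 + \zeta + \zeta^{-1} + \zeta^{-7}$ so $P_4 N_3 = N_7 + N_1$. Q.E.D. So size-8 conversion from $1, P_1, P_2, \ldots, P_7$ to $1, N_1, N_2, \ldots, N_7$ can be done with two size-4 conversions and three additions. Apply same idea recursively: size-$n$ conversion uses $(n/2)(\lg n - 2) + 1$ additions. Inverse has same cost.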
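
The recursive conversion can be sketched in Python and checked against a direct expansion in powers of the root of unity. Here $P_k = (z + z^{-1})^k$ and $N_k = z^k + z^{-k}$ for an indeterminate $z$; both sides are scaled by $z^n$ so that they become ordinary polynomials over $\mathbf{F}_2$, stored as integers. All function names and the list-of-bits interface are assumptions of this sketch.

```python
def p_to_n(f):
    """Coefficients on 1, P_1, ..., P_{n-1} -> coefficients on 1, N_1, ..., N_{n-1}.

    len(f) must be a power of 2. Returns (coefficients, number of xors used).
    """
    n = len(f)
    if n <= 2:
        return list(f), 0                 # P_1 = N_1, nothing to do
    m = n // 2
    lo, cost_lo = p_to_n(f[:m])
    hi, cost_hi = p_to_n(f[m:])           # to be multiplied by P_m = N_m
    g = lo + hi                           # N_m * N_k contributes to N_{m+k} ...
    for k in range(1, m):
        g[m - k] ^= hi[k]                 # ... and to N_{m-k}
    return g, cost_lo + cost_hi + m - 1

def clmul(a, b):
    """Product in F_2[z] with polynomials stored as ints."""
    h = 0
    while a:
        if a & 1:
            h ^= b
        a >>= 1
        b <<= 1
    return h

def expand_p(f):
    """z^n * sum_k f_k (z + 1/z)^k as a polynomial in z."""
    n = len(f)
    total, power = 0, 1                   # power = (z^2 + 1)^k
    for k, c in enumerate(f):
        if c:
            total ^= power << (n - k)
        power = clmul(power, 0b101)
    return total

def expand_n(g):
    """z^n * (g_0 + sum_k g_k (z^k + z^-k)) as a polynomial in z."""
    n = len(g)
    total = g[0] << n
    for k in range(1, n):
        if g[k]:
            total ^= ((1 << (2 * k)) | 1) << (n - k)
    return total

f = [1, 0, 1, 1, 0, 1, 1, 1]
g, xors = p_to_n(f)
assert expand_p(f) == expand_n(g)
print(g, xors)
```

For size 8 the recursion performs two size-4 conversions (one xor each) plus three xors, five in total.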

62 To multiply on basis $N_1, N_2, \ldots, N_n$: Convert to $1, P_1, \ldots, P_n$; cost $0.5 n \lg n$, twice. Polynomial product; $M(n + 1)$. Convert $1, P_1, \ldots, P_{2n}$ to $1, N_1, \ldots, N_{2n}$; cost $n \lg n$. Eliminate $N_{n+1}, \ldots, N_{2n}$ using $N_{2n+1-i} = N_i$; cost $n$. Eliminate 1 using $1 + N_1 + \cdots + N_n = 0$; cost $n$.

63 Some new improvements: 1. For $1, P_1, \ldots, P_n$: coefficient of 1 is 0. Cost $M(n)$ instead of $M(n + 1)$. 2. For $1, P_1, \ldots, P_{2n}$: coefficients of $1, P_1$ are 0. Reduces cost by $n$. 3. If mults share input, reuse input conversion. Reduces cost by $0.5 n \lg n$. 4. If output is an input, use different reduction strategy to skip a first-half conversion. Reduces cost by $0.5 n \lg n$.

64 Can represent field element using basis $P_1, \ldots, P_n$ for fast multiplication; or basis $N_1, \ldots, N_n$ for fast multi-squarings; or both. Can vary this choice across field-element variables. Can also vary over time. Approximate costs: $P \to N$: $0.5 n \lg n$. $N \to P$: $0.5 n \lg n$. $P \cdot P \to N$: $M(n) + n \lg n$. $P \cdot P \to P$: $M(n) + n \lg n$. $N^{2^k} \to N$: 0.
