Software implementation of ECC

Size: px

Start display at page:

Download "Software implementation of ECC"

Cynthia Harmon
5 years ago
Views:

1 Software implementation of ECC Radboud University, Nijmegen, The Netherlands June 4, 2015 Summer school on real-world crypto and privacy Šibenik, Croatia

2 Software implementation of (H)ECC Radboud University, Nijmegen, The Netherlands June 4, 2015 Summer school on real-world crypto and privacy Šibenik, Croatia

3 The ECDLP Definition Given two points P and Q on an elliptic curve, such that Q P, find an integer k such that kp = Q. 3

4 The ECDLP Definition Given two points P and Q on an elliptic curve, such that Q P, find an integer k such that kp = Q. Efficient ECC First idea: user needs to compute kp, so make that fast 3

5 The ECDLP Definition Given two points P and Q on an elliptic curve, such that Q P, find an integer k such that kp = Q. Efficient ECC First idea: user needs to compute kp, so make that fast Actual situation is more complex: Keypair generation: Compute kp for fixed P, don t leak information about scalar k 3

6 The ECDLP Definition Given two points P and Q on an elliptic curve, such that Q P, find an integer k such that kp = Q. Efficient ECC First idea: user needs to compute kp, so make that fast Actual situation is more complex: Keypair generation: Compute kp for fixed P, don t leak information about scalar k DH common-key computation: Compute kp for variable P, don t leak information about scalar k 3

7 The ECDLP Definition Given two points P and Q on an elliptic curve, such that Q P, find an integer k such that kp = Q. Efficient ECC First idea: user needs to compute kp, so make that fast Actual situation is more complex: Keypair generation: Compute kp for fixed P, don t leak information about scalar k DH common-key computation: Compute kp for variable P, don t leak information about scalar k Signature verification needs k1p 1 + k 2P 2, ok to leak information about (public) scalars k 1 and k 2 3

8 The ECDLP Definition Given two points P and Q on an elliptic curve, such that Q P, find an integer k such that kp = Q. Efficient ECC First idea: user needs to compute kp, so make that fast Actual situation is more complex: Keypair generation: Compute kp for fixed P, don t leak information about scalar k DH common-key computation: Compute kp for variable P, don t leak information about scalar k Signature verification needs k1p 1 + k 2P 2, ok to leak information about (public) scalars k 1 and k 2 3

9 The ECC implementation pyramid Scalar multiplication ECC add/double Finite-field arithmetic Big-integer or polynomial arithmetic 4

10 Why I don t like the pyramid... Pyramid levels are not independent Interactions through all levels, relevant for Correctness, Security, and Performance 5

11 Why I don t like the pyramid... Pyramid levels are not independent Interactions through all levels, relevant for Correctness, Security, and Performance Plan for today: demonstrate these dependencies 5

12 Why I don t like the pyramid... Pyramid levels are not independent Interactions through all levels, relevant for Correctness, Security, and Performance Plan for today: demonstrate these dependencies Fix target architecture: AMD64 (aka x86_64, aka x64) Fix target microarchitecture: Intel Sandy Bridge and Ivy Bridge 5

13 Let s start with 255-bit integers typedef struct{ unsigned long long a[4]; } bigint255; void bigint255_add(bigint255 *r, const bigint255 *x, const bigint255 *y) { r->a[0] = x->a[0] + y->a[0]; r->a[1] = x->a[1] + y->a[1]; r->a[2] = x->a[2] + y->a[2]; r->a[3] = x->a[3] + y->a[3]; } 6

14 Let s start with 255-bit integers typedef struct{ unsigned long long a[4]; } bigint255; void bigint255_add(bigint255 *r, const bigint255 *x, const bigint255 *y) { r->a[0] = x->a[0] + y->a[0]; r->a[1] = x->a[1] + y->a[1]; r->a[2] = x->a[2] + y->a[2]; r->a[3] = x->a[3] + y->a[3]; } What s wrong about this? 6

15 Let s start with 255-bit integers typedef struct{ unsigned long long a[4]; } bigint255; void bigint255_add(bigint255 *r, const bigint255 *x, const bigint255 *y) { r->a[0] = x->a[0] + y->a[0]; r->a[1] = x->a[1] + y->a[1]; r->a[2] = x->a[2] + y->a[2]; r->a[3] = x->a[3] + y->a[3]; } What s wrong about this? This performs arithmetic on a vector of 4 independent 64-bit integers (modulo 2 64 ) 6

16 Let s start with 255-bit integers typedef struct{ unsigned long long a[4]; } bigint255; void bigint255_add(bigint255 *r, const bigint255 *x, const bigint255 *y) { r->a[0] = x->a[0] + y->a[0]; r->a[1] = x->a[1] + y->a[1]; r->a[2] = x->a[2] + y->a[2]; r->a[3] = x->a[3] + y->a[3]; } What s wrong about this? This performs arithmetic on a vector of 4 independent 64-bit integers (modulo 2 64 ) This is not the same as arithmetic on 256-bit integers Need to ripple the carries of all additions! 6

17 Radix-2 51 representation Radix-2 64 representation works and is sometimes a good choice Highly depends on the efficiency of handling carries 7

18 Radix-2 51 representation Radix-2 64 representation works and is sometimes a good choice Highly depends on the efficiency of handling carries Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6!) 7

19 Radix-2 51 representation Radix-2 64 representation works and is sometimes a good choice Highly depends on the efficiency of handling carries Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6!) Example 2: When using vector arithmetic, carries are typically lost (expensive to recompute) 7

20 Radix-2 51 representation Radix-2 64 representation works and is sometimes a good choice Highly depends on the efficiency of handling carries Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6!) Example 2: When using vector arithmetic, carries are typically lost (expensive to recompute) Let s get rid of the carries, represent A as (a 0, a 1, a 2, a 3, a 4 ) with A = 4 a i 2 51 i i=0 This is called radix-2 51 representation 7

21 Radix-2 51 representation Radix-2 64 representation works and is sometimes a good choice Highly depends on the efficiency of handling carries Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6!) Example 2: When using vector arithmetic, carries are typically lost (expensive to recompute) Let s get rid of the carries, represent A as (a 0, a 1, a 2, a 3, a 4 ) with A = 4 a i 2 51 i i=0 This is called radix-2 51 representation Multiple ways to write the same integer A, for example A = 2 52 : (2 52, 0, 0, 0, 0) (0, 2, 0, 0, 0) 7

22 Addition of two bigint255 typedef struct{ unsigned long long a[5]; } bigint255; void bigint255_add(bigint255 *r, const bigint255 *x, const bigint255 *y) { r->a[0] = x->a[0] + y->a[0]; r->a[1] = x->a[1] + y->a[1]; r->a[2] = x->a[2] + y->a[2]; r->a[3] = x->a[3] + y->a[3]; r->a[4] = x->a[4] + y->a[4]; } 8

23 Addition of two bigint255 typedef struct{ unsigned long long a[5]; } bigint255; void bigint255_add(bigint255 *r, const bigint255 *x, const bigint255 *y) { r->a[0] = x->a[0] + y->a[0]; r->a[1] = x->a[1] + y->a[1]; r->a[2] = x->a[2] + y->a[2]; r->a[3] = x->a[3] + y->a[3]; r->a[4] = x->a[4] + y->a[4]; } 8

24 Addition of two bigint255 typedef struct{ unsigned long long a[5]; } bigint255; void bigint255_add(bigint255 *r, const bigint255 *x, const bigint255 *y) { r->a[0] = x->a[0] + y->a[0]; r->a[1] = x->a[1] + y->a[1]; r->a[2] = x->a[2] + y->a[2]; r->a[3] = x->a[3] + y->a[3]; r->a[4] = x->a[4] + y->a[4]; } This works as long as all coefficients are in [0,..., ] 8

25 Addition of two bigint255 typedef struct{ unsigned long long a[5]; } bigint255; void bigint255_add(bigint255 *r, const bigint255 *x, const bigint255 *y) { r->a[0] = x->a[0] + y->a[0]; r->a[1] = x->a[1] + y->a[1]; r->a[2] = x->a[2] + y->a[2]; r->a[3] = x->a[3] + y->a[3]; r->a[4] = x->a[4] + y->a[4]; } This works as long as all coefficients are in [0,..., ] When starting with 51-bit coefficients, we can do quite a few additions before we have to carry 8

26 Subtraction of two bigint255 typedef struct{ signed long long a[5]; } bigint255; void bigint255_sub(bigint255 *r, const bigint255 *x, const bigint255 *y) { r->a[0] = x->a[0] - y->a[0]; r->a[1] = x->a[1] - y->a[1]; r->a[2] = x->a[2] - y->a[2]; r->a[3] = x->a[3] - y->a[3]; r->a[4] = x->a[4] - y->a[4]; } Slightly update our bigint255 definition to work with signed 64-bit integers 9

27 Carrying in radix-2 51 With many additions, coefficients may grow larger than 63 bits They grow even faster in multiplication 10

28 Carrying in radix-2 51 With many additions, coefficients may grow larger than 63 bits They grow even faster in multiplication Eventually we have to carry en bloc: signed long long carry = r.a[0] >> 51; r.a[1] += carry; carry <<= 51; r.a[0] -= carry; 10

29 Carrying in radix-2 51 With many additions, coefficients may grow larger than 63 bits They grow even faster in multiplication Eventually we have to carry en bloc: signed long long carry = r.a[0] >> 51; r.a[1] += carry; carry <<= 51; r.a[0] -= carry; Similar for all higher coefficients... 10

30 Big integers and polynomials Addition/Subtraction code would look exactly the same for 5-coefficient polynomial addition 11

31 Big integers and polynomials Addition/Subtraction code would look exactly the same for 5-coefficient polynomial addition This is no coincidence: We actually perform arithmetic in Z[x] Inputs to addition/subtraction are 5-coefficient polynomials 11

32 Big integers and polynomials Addition/Subtraction code would look exactly the same for 5-coefficient polynomial addition This is no coincidence: We actually perform arithmetic in Z[x] Inputs to addition/subtraction are 5-coefficient polynomials Nice thing about arithmetic Z[x]: no carries! 11

33 Big integers and polynomials Addition/Subtraction code would look exactly the same for 5-coefficient polynomial addition This is no coincidence: We actually perform arithmetic in Z[x] Inputs to addition/subtraction are 5-coefficient polynomials Nice thing about arithmetic Z[x]: no carries! To go from Z[x] to Z, evaluate at the radix (this is a ring homomorphism) Carrying means evaluating at the radix 11

34 Big integers and polynomials Addition/Subtraction code would look exactly the same for 5-coefficient polynomial addition This is no coincidence: We actually perform arithmetic in Z[x] Inputs to addition/subtraction are 5-coefficient polynomials Nice thing about arithmetic Z[x]: no carries! To go from Z[x] to Z, evaluate at the radix (this is a ring homomorphism) Carrying means evaluating at the radix Thinking of multiprecision integers as polynomials is very powerful for efficient arithmetic 11

35 Using floating-point limbs Now we can also use floats for our coefficients An IEEE-754 floating-point number has value ( 1) s (1.b m 1 b m 2... b 0 ) 2 e t with b i {0, 1} 12

36 Using floating-point limbs Now we can also use floats for our coefficients An IEEE-754 floating-point number has value ( 1) s (1.b m 1 b m 2... b 0 ) 2 e t with b i {0, 1} For double-precision floats: s {0, 1} sign bit m = 52 mantissa bits e {1,..., 2046} exponent t =

37 Using floating-point limbs Now we can also use floats for our coefficients An IEEE-754 floating-point number has value ( 1) s (1.b m 1 b m 2... b 0 ) 2 e t with b i {0, 1} For double-precision floats: s {0, 1} sign bit m = 52 mantissa bits e {1,..., 2046} exponent t = 1023 For single-precision floats: s {0, 1} sign bit m = 23 mantissa bits e {1,..., 254} exponent t =

38 Using floating-point limbs Now we can also use floats for our coefficients An IEEE-754 floating-point number has value ( 1) s (1.b m 1 b m 2... b 0 ) 2 e t with b i {0, 1} For double-precision floats: s {0, 1} sign bit m = 52 mantissa bits e {1,..., 2046} exponent t = 1023 For single-precision floats: s {0, 1} sign bit m = 23 mantissa bits e {1,..., 254} exponent t = 127 Exponent = 0 used to represent 0 12

39 Using floating-point limbs Now we can also use floats for our coefficients An IEEE-754 floating-point number has value ( 1) s (1.b m 1 b m 2... b 0 ) 2 e t with b i {0, 1} For double-precision floats: s {0, 1} sign bit m = 52 mantissa bits e {1,..., 2046} exponent t = 1023 For single-precision floats: s {0, 1} sign bit m = 23 mantissa bits e {1,..., 254} exponent t = 127 Exponent = 0 used to represent 0 Any number that can be represented like this, will be precise Other numbers will be rounded, according to a rounding mode 12

40 Addition typedef struct{ double a[12]; } bigint255; void bigint255_add(bigint255 *r, const bigint255 *x, const bigint255 *y) { int i; for(i=0;i<12;i++) r->a[i] = x->a[i] + y->a[i]; } 13

41 Subtraction typedef struct{ double a[12]; } bigint255; void bigint255_sub(bigint255 *r, const bigint255 *x, const bigint255 *y) { int i; for(i=0;i<12;i++) r->a[i] = x->a[i] - y->a[i]; } 14

42 Carrying For carrying integers we used a right shift (discard lowest bits) 15

43 Carrying For carrying integers we used a right shift (discard lowest bits) For floating-point numbers we can use multiplication by the inverse of the radix Example: Radix 2 22, multiply by 2 22 This does not cut off lowest bits, need to round 15

44 Carrying For carrying integers we used a right shift (discard lowest bits) For floating-point numbers we can use multiplication by the inverse of the radix Example: Radix 2 22, multiply by 2 22 This does not cut off lowest bits, need to round Some processors have efficient rounding instructions, e.g., vroundpd 15

45 Carrying For carrying integers we used a right shift (discard lowest bits) For floating-point numbers we can use multiplication by the inverse of the radix Example: Radix 2 22, multiply by 2 22 This does not cut off lowest bits, need to round Some processors have efficient rounding instructions, e.g., vroundpd Otherwise (for double-precision): add constant subtract constant This will round the number to an integer according to the rounding mode (to nearest, towards zero, away from zero, or truncate) 15

46 Why would you want this? ECC is typically bottlenecked by speed of multiplier Intel Sandy Bridge, Ivy Bridge: One multiplication per cycle 16

47 Why would you want this? ECC is typically bottlenecked by speed of multiplier Intel Sandy Bridge, Ivy Bridge: One multiplication per cycle Four (vectorized) double-precision multiplications per cycle 16

48 Why would you want this? ECC is typically bottlenecked by speed of multiplier Intel Sandy Bridge, Ivy Bridge: One multiplication per cycle Four (vectorized) double-precision multiplications per cycle Four (vectorized) double-precision additions in the same cycle 16

49 Why would you want this? ECC is typically bottlenecked by speed of multiplier Intel Sandy Bridge, Ivy Bridge: One multiplication per cycle Four (vectorized) double-precision multiplications per cycle Four (vectorized) double-precision additions in the same cycle Operations on 256-bit vector registers introduced with AVX 16

50 Why would you want this? ECC is typically bottlenecked by speed of multiplier Intel Sandy Bridge, Ivy Bridge: One multiplication per cycle Four (vectorized) double-precision multiplications per cycle Four (vectorized) double-precision additions in the same cycle Operations on 256-bit vector registers introduced with AVX Integer operations on those registers introduced only with AVX2 Sandy Bridge and Ivy Bridge don t have AVX2 16

51 Vectorizing EC scalar multiplication Computing multiple scalar multiplications Changes the rules of the game Increases size of active data set 17

52 Vectorizing EC scalar multiplication Computing multiple scalar multiplications Changes the rules of the game Increases size of active data set Parallelism inside multiprecision arithmetic Addition (in redundant representation) is trivially vectorized Vectorizing multiplication needs many shuffles Vectorization eats up instruction-level parallelism 17

53 Vectorizing EC scalar multiplication Computing multiple scalar multiplications Changes the rules of the game Increases size of active data set Parallelism inside multiprecision arithmetic Addition (in redundant representation) is trivially vectorized Vectorizing multiplication needs many shuffles Vectorization eats up instruction-level parallelism Parallelism inside EC arithmetic Vectorize independent multiplications in EC addition May still need some shuffles (after each block of operations) Efficiency depends on EC formulas 17

54 Example: Montgomery ladder step function ladderstep(x Q P, X P, Z P, X Q, Z Q ) t 1 X P + Z P t 6 t 2 1 t 2 X P Z P t 7 t 2 2 t 5 t 6 t 7 t 3 X Q + Z Q t 4 X Q Z Q t 8 t 4 t 1 t 9 t 3 t 2 X P +Q (t 8 + t 9 ) 2 Z P +Q x Q P (t 8 t 9 ) 2 X [2]P t 6 t 7 Z [2]P t 5 (t 7 + ((A + 2)/4) t 5 ) return (X [2]P, Z [2]P, X P +Q, Z P +Q ) end function 18

55 Example: Montgomery ladder step function ladderstep(x Q P, X P, Z P, X Q, Z Q ) t 1 X P + Z P ; t 2 X P Z P ; t 3 X Q + Z Q ; t 4 X Q Z Q t 6 t 1 t 1 ; t 7 t 2 t 2 ; t 8 t 4 t 1 ; t 9 t 3 t 2 t 10 ((A + 2)/4) t 6 t 11 ((A + 2)/4 1) t 7 t 5 t 6 t 7 ; t 4 t 10 t 11 ; t 1 t 8 t 9 ; t 0 t 8 + t 9 Z [2]P t 5 t 4 ; X P +Q t 2 0; X [2]P t 6 t 7 ; t 2 t 1 t 1 Z P +Q x Q P t 2 return (X [2]P, Z [2]P, X P +Q, Z P +Q ) end function 18

56 A better candidate: Kummer surfaces Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation 19

57 A better candidate: Kummer surfaces Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation Easier way to think about it: Group modulo negation Map from group to Kummer surface by rational map X Elements represented projectively as (x : y : z : t) (x : y : z : t) = (rx : ry : rz : rt) for any r 0 Efficient doubling and efficient differential addition 19

58 A better candidate: Kummer surfaces Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation Easier way to think about it: Group modulo negation Map from group to Kummer surface by rational map X Elements represented projectively as (x : y : z : t) (x : y : z : t) = (rx : ry : rz : rt) for any r 0 Efficient doubling and efficient differential addition Ladderstep: gets as input X(P ) = (x 2 : y 2 : z 2 : t 2 ), X(Q) = (x 3 : y 3 : z 3 : t 3 ), and X(Q P ) = (x 1 : y 1 : z 1 : t 1 ) Computes X(2P ) = (x4 : y 4 : z 4 : t 4) Computes X(P + Q) = (x5 : y 5 : z 5 : t 5) 19

59 A better candidate: Kummer surfaces Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation Easier way to think about it: Group modulo negation Map from group to Kummer surface by rational map X Elements represented projectively as (x : y : z : t) (x : y : z : t) = (rx : ry : rz : rt) for any r 0 Efficient doubling and efficient differential addition Ladderstep: gets as input X(P ) = (x 2 : y 2 : z 2 : t 2 ), X(Q) = (x 3 : y 3 : z 3 : t 3 ), and X(Q P ) = (x 1 : y 1 : z 1 : t 1 ) Computes X(2P ) = (x4 : y 4 : z 4 : t 4) Computes X(P + Q) = (x5 : y 5 : z 5 : t 5) Coordinates are elements of a (large) finite field 19

60 A better candidate: Kummer surfaces Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation Easier way to think about it: Group modulo negation Map from group to Kummer surface by rational map X Elements represented projectively as (x : y : z : t) (x : y : z : t) = (rx : ry : rz : rt) for any r 0 Efficient doubling and efficient differential addition Ladderstep: gets as input X(P ) = (x 2 : y 2 : z 2 : t 2 ), X(Q) = (x 3 : y 3 : z 3 : t 3 ), and X(Q P ) = (x 1 : y 1 : z 1 : t 1 ) Computes X(2P ) = (x4 : y 4 : z 4 : t 4) Computes X(P + Q) = (x5 : y 5 : z 5 : t 5) Coordinates are elements of a (large) finite field For same security level, underlying field has half the size as for ECC Example: Choose 128-bit field for 128 bits of security 19

61 Arithmetic on the Kummer surface x 2 H H y 2 A2 B 2 z 2 A2 C 2 t 2 a b a c A2 D 2 a d x 3 H y 3 z 3 t 3 H x1 y 1 x1 z 1 x1 t 1 x 4 y 4 z 4 t 4 x 5 y 5 z 5 t 5 10M + 9S + 6m ladder formulas 20

62 Arithmetic on the Kummer surface x 2 H H y 2 A2 B 2 z 2 A2 C 2 t 2 a b a c A2 D 2 a d x 3 H y 3 z 3 t 3 H x1 y 1 x1 z 1 x1 t 1 x 4 y 4 z 4 t 4 x 5 y 5 z 5 t 5 10M + 9S + 6m ladder formulas x 2 H H y 2 A2 B 2 z 2 A2 C 2 t 2 a b a c A2 D 2 a d x 3 H y 3 z 3 t 3 H A2 B 2 A2 C 2 A2 D 2 x1 y 1 x1 z 1 x1 t 1 x 4 y 4 z 4 t 4 x 5 y 5 z 5 t 5 7M + 12S + 9m ladder formulas 20

63 The squared Kummer surface In fact, we use arithmetic on a different, squared surface Each point (x : y : z : t) on the original surface corresponds to (x 2 : y 2 : z 2 : t 2 ) on the squared surface No operation-count advantages Easier to construct squared surface with small constants In the following rename (x 2 : y 2 : z 2 : t 2 ) to (x : y : z : t) 21

64 Arithmetic on the squared Kummer surface x 2 H H y 2 A2 B 2 z 2 A2 C 2 t 2 A2 D 2 x 3 H y 3 z 3 t 3 H a2 b a2 2 c a2 2 d x1 2 y 1 x1 z 1 x1 t 1 x 4 y 4 z 4 t 4 x 5 y 5 z 5 t 5 10M + 9S + 6m ladder formulas 22

65 Arithmetic on the squared Kummer surface x 2 H H y 2 A2 B 2 z 2 A2 C 2 t 2 A2 D 2 x 3 H y 3 z 3 t 3 H a2 b a2 2 c a2 2 d x1 2 y 1 x1 z 1 x1 t 1 x 4 y 4 z 4 t 4 x 5 y 5 z 5 t 5 x 2 H H y 2 A2 B 2 z 2 A2 C 2 t 2 A2 D 2 x 3 H y 3 z 3 t 3 H A2 B 2 A2 C 2 A2 D 2 a2 b a2 2 c a2 2 d x1 2 y 1 x1 z 1 x1 t 1 x 4 y 4 z 4 t 4 x 5 y 5 z 5 t 5 10M + 9S + 6m ladder formulas 7M + 12S + 9m ladder formulas 22

66 Arithmetic on the (original) Kummer surface x 2 H H y 2 A2 B 2 z 2 A2 C 2 t 2 a b a c A2 D 2 a d x 3 H y 3 z 3 t 3 H x1 y 1 x1 z 1 x1 t 1 x 4 y 4 z 4 t 4 x 5 y 5 z 5 t 5 10M + 9S + 6m ladder formulas x 2 H H y 2 A2 B 2 z 2 A2 C 2 t 2 a b a c A2 D 2 a d x 3 H y 3 z 3 t 3 H A2 B 2 A2 C 2 A2 D 2 x1 y 1 x1 z 1 x1 t 1 x 4 y 4 z 4 t 4 x 5 y 5 z 5 t 5 7M + 12S + 9m ladder formulas 23

67 A suitable Kummer surface Formulas for efficient Kummer surface arithmetic known for a while Originally proposed by Chudnovsky, Chudnovsky, M + 9S + 6m formulas by Gaudry, M + 12S + 9m formulas by Bernstein,

68 A suitable Kummer surface Formulas for efficient Kummer surface arithmetic known for a while Originally proposed by Chudnovsky, Chudnovsky, M + 9S + 6m formulas by Gaudry, M + 12S + 9m formulas by Bernstein, 2006 Problem: find cryptographically secure surface with small constants 24

69 A suitable Kummer surface Formulas for efficient Kummer surface arithmetic known for a while Originally proposed by Chudnovsky, Chudnovsky, M + 9S + 6m formulas by Gaudry, M + 12S + 9m formulas by Bernstein, 2006 Problem: find cryptographically secure surface with small constants Gaudry, Schost, 2012: suitable (squared) surface: Defined over the field F (1 : a 2 /b 2 : a 2 /c 2 : a 2 /d 2 ) = ( 114 : 57 : 66 : 418) (1 : A 2 /B 2 : A 2 /C 2 : A 2 /D 2 ) = ( 833 : 2499 : 1617 : 561) 24

70 A suitable Kummer surface Formulas for efficient Kummer surface arithmetic known for a while Originally proposed by Chudnovsky, Chudnovsky, M + 9S + 6m formulas by Gaudry, M + 12S + 9m formulas by Bernstein, 2006 Problem: find cryptographically secure surface with small constants Gaudry, Schost, 2012: suitable (squared) surface: Defined over the field F (1 : a 2 /b 2 : a 2 /c 2 : a 2 /d 2 ) = ( 114 : 57 : 66 : 418) (1 : A 2 /B 2 : A 2 /C 2 : A 2 /D 2 ) = ( 833 : 2499 : 1617 : 561) Finding this surface cost CPU hours The same surface has been used by Bos, Costello, Hisil, and Lauter (Eurocrypt 2013) 24

71 Representing elements of F Represent an element A in radix-2 127/6 Write A as a 0, a 1, a 2, a 3, a 4, a 5, where a0 is a small multiple of 2 0 a1 is a small multiple of 2 22 a2 is a small multiple of 2 43 a3 is a small multiple of 2 64 a4 is a small multiple of 2 85 a5 is a small multiple of

72 Multiplication Consider multiplication of A and B with reduction mod Make use of the fact that With radix 2 127/6 we obtain: r 0 = a 0 b a 1 b a 2 b a 3 b a 4 b a 5 b 1 r 1 = a 0 b 1 + a 1 b a 2 b a 3 b a 4 b a 5 b 2 r 2 = a 0 b 2 + a 1 b 1 + a 2 b a 3 b a 4 b a 5 b 3 r 3 = a 0 b 3 + a 1 b 2 + a 2 b 1 + a 3 b a 4 b a 5 b 4 r 4 = a 0 b 4 + a 1 b 3 + a 2 b 2 + a 3 b 1 + a 4 b a 5 b 5 r 5 = a 0 b 5 + a 1 b 4 + a 2 b 3 + a 3 b 2 + a 4 b 1 + a 5 b 0 26

73 Multiplication Consider multiplication of A and B with reduction mod Make use of the fact that With radix 2 127/6 we obtain: r 0 = a 0 b a 1 b a 2 b a 3 b a 4 b a 5 b 1 r 1 = a 0 b 1 + a 1 b a 2 b a 3 b a 4 b a 5 b 2 r 2 = a 0 b 2 + a 1 b 1 + a 2 b a 3 b a 4 b a 5 b 3 r 3 = a 0 b 3 + a 1 b 2 + a 2 b 1 + a 3 b a 4 b a 5 b 4 r 4 = a 0 b 4 + a 1 b 3 + a 2 b 2 + a 3 b 1 + a 4 b a 5 b 5 r 5 = a 0 b 5 + a 1 b 4 + a 2 b 3 + a 3 b 2 + a 4 b 1 + a 5 b 0 Obviously, we always perform this whole thing 4 in parallel 26

74 Multiplication Consider multiplication of A and B with reduction mod Make use of the fact that With radix 2 127/6 we obtain: r 0 = a 0 b a 1 b a 2 b a 3 b a 4 b a 5 b 1 r 1 = a 0 b 1 + a 1 b a 2 b a 3 b a 4 b a 5 b 2 r 2 = a 0 b 2 + a 1 b 1 + a 2 b a 3 b a 4 b a 5 b 3 r 3 = a 0 b 3 + a 1 b 2 + a 2 b 1 + a 3 b a 4 b a 5 b 4 r 4 = a 0 b 4 + a 1 b 3 + a 2 b 2 + a 3 b 1 + a 4 b a 5 b 5 r 5 = a 0 b 5 + a 1 b 4 + a 2 b 3 + a 3 b 2 + a 4 b 1 + a 5 b 0 Obviously, we always perform this whole thing 4 in parallel Obviously, we specialize squaring 26

75 Multiplication Consider multiplication of A and B with reduction mod Make use of the fact that With radix 2 127/6 we obtain: r 0 = a 0 b a 1 b a 2 b a 3 b a 4 b a 5 b 1 r 1 = a 0 b 1 + a 1 b a 2 b a 3 b a 4 b a 5 b 2 r 2 = a 0 b 2 + a 1 b 1 + a 2 b a 3 b a 4 b a 5 b 3 r 3 = a 0 b 3 + a 1 b 2 + a 2 b 1 + a 3 b a 4 b a 5 b 4 r 4 = a 0 b 4 + a 1 b 3 + a 2 b 2 + a 3 b 1 + a 4 b a 5 b 5 r 5 = a 0 b 5 + a 1 b 4 + a 2 b 3 + a 3 b 2 + a 4 b 1 + a 5 b 0 Obviously, we always perform this whole thing 4 in parallel Obviously, we specialize squaring Obviously, we specialize multiplications by small constants 26

76 The Hadamard transform x y z t Only shuffeling operation in Kummer arithmetic AVX has limited shuffeling across left and right half Plain Hadamard turns out to be expensive 27

77 The Hadamard transform x y z t Permuted and negated Hadamard Only shuffeling operation in Kummer arithmetic AVX has limited shuffeling across left and right half Plain Hadamard turns out to be expensive Allow generalized Hadamard to output permuted vector Self-inverting permutation cleans after two generalized Hadamards 27

78 The Hadamard transform x y z t Permuted and negated Hadamard Only shuffeling operation in Kummer arithmetic AVX has limited shuffeling across left and right half Plain Hadamard turns out to be expensive Allow generalized Hadamard to output permuted vector Self-inverting permutation cleans after two generalized Hadamards Allow generalized Hadamard to negate vector entries Clean negations by multiplication by negated constants 27

79 Arithmetic on the squared Kummer surface x 2 y 2 z 2 x 3 y 3 z 3 H H x x x x t t z z t 2 (A 2 /D 2 ) ( A 2 /C 2 ) (A 2 /B 2 ) t y y (a 2 /b 2 ) y z z z ( a 2 /c 2 ) z y y t y t (a 2 /d 2 ) t H H x x x x t t z z t 3 (A 2 /D 2 ) ( A 2 /C 2 ) (A 2 /B 2 ) t y y (x 1 /y 1 ) y z z z ( x 1 /z 1 ) z y y y t t (x 1 /t 1 ) x 4 y 4 z 4 t 4 x 5 y 5 z 5 t 5 t 28

80 Looking back... Fastest computation units are vector units Choose (H)ECC with efficiently vectorizable formulas 29

81 Looking back... Fastest computation units are vector units Choose (H)ECC with efficiently vectorizable formulas Formulas dictate the scalar multiplication algorithm 29

82 Looking back... Fastest computation units are vector units Choose (H)ECC with efficiently vectorizable formulas Formulas dictate the scalar multiplication algorithm Choose representation of field elements for fast reduction 29

83 Looking back... Fastest computation units are vector units Choose (H)ECC with efficiently vectorizable formulas Formulas dictate the scalar multiplication algorithm Choose representation of field elements for fast reduction Adjust formulas according to fast shuffle instructions 29

84 Looking back... Fastest computation units are vector units Choose (H)ECC with efficiently vectorizable formulas Formulas dictate the scalar multiplication algorithm Choose representation of field elements for fast reduction Adjust formulas according to fast shuffle instructions Optimizations go through all levels of the pyramid! 29

85 Results 128-bit secure, constant-time scalar multiplication arch cycles open g source of software Sandy yes 1 Bernstein Duif Lange Schwabe Yang CHES 2011 Sandy ? no 1 Hamburg Sandy ? no 1 Longa Sica Asiacrypt 2012 Sandy yes 2 Bos Costello Hisil Lauter Eurocrypt 2013 Sandy yes 1 Oliveira López Aranha Rodríguez- Henríquez CHES 2013 Sandy 96000? no 1 Faz-Hernández Longa Sánchez CT- RSA 2014 Sandy 92000? no 1 Faz-Hernández Longa Sánchez July 2014 Sandy yes 2 new (our results) 30

86 Results 128-bit secure, constant-time scalar multiplication arch cycles open g source of software Ivy yes 1 Bernstein Duif Lange Schwabe Yang CHES 2011 Ivy ? yes 1 Costello Hisil Smith Eurocrypt 2014 Ivy yes 2 Bos Costello Hisil Lauter Eurocrypt 2013 Ivy yes 1 Oliveira López Aranha Rodríguez- Henríquez CHES 2013 Ivy 92000? no 1 Faz-Hernández Longa Sánchez CT- RSA 2014 Ivy 89000? no 1 Faz-Hernández Longa Sánchez July 2014 Ivy yes 2 new (our results) 30

87 More results Also optimized for Intel Haswell arch cycles open g source of software Haswell yes 1 Bernstein Duif Lange Schwabe Yang CHES 2011 Haswell yes 2 Bos Costello Hisil Lauter Eurocrypt 2013 Haswell no 1 Oliveira López Aranha Rodríguez-Henríquez CHES 2013 Haswell yes 2 new (our results) 31

88 Even more results Also optimized for ARM Cortex-A8 arch cycles open g source of software A8-slow yes 1 Bernstein Schwabe CHES 2012 A8-slow yes 2 new (our result) A8-fast yes 1 Bernstein Schwabe CHES 2012 A8-fast yes 2 new (our result) 32

89 Resources online Paper: Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Peter Schwabe. Kummer strikes back: new DH speed records. Software: Included in SUPERCOP, subdirectory crypto_scalarmult/kummer/ 33

(which is not the same as: hyperelliptic-curve cryptography and elliptic-curve cryptography)

Hyper-and-elliptic-curve cryptography (which is not the same as: hyperelliptic-curve cryptography and elliptic-curve cryptography) Daniel J. Bernstein University of Illinois at Chicago & Technische Universiteit