Objectives. McBits: fast constant-time code-based cryptography. Set new speed records for public-key cryptography. (to appear at CHES 2013)

Size: px

Start display at page:

Download "Objectives. McBits: fast constant-time code-based cryptography. Set new speed records for public-key cryptography. (to appear at CHES 2013)"

Darcy Henry
5 years ago
Views:

1 McBits: fast constant-time code-based cryptography (to appear at CHES 2013) Objectives Set new speed records for public-key cryptography. D. J. Bernstein University of Illinois at Chicago & Technische Universiteit Eindhoven Joint work with: Tung Chou Technische Universiteit Eindhoven Peter Schwabe Radboud University Nijmegen

2 McBits: fast constant-time code-based cryptography (to appear at CHES 2013) D. J. Bernstein University of Illinois at Chicago & Technische Universiteit Eindhoven Objectives Set new speed records for public-key cryptography. : : : at a high security level. Joint work with: Tung Chou Technische Universiteit Eindhoven Peter Schwabe Radboud University Nijmegen

3 McBits: fast constant-time code-based cryptography (to appear at CHES 2013) D. J. Bernstein University of Illinois at Chicago & Technische Universiteit Eindhoven Objectives Set new speed records for public-key cryptography. : : : at a high security level. : : : including protection against quantum computers. Joint work with: Tung Chou Technische Universiteit Eindhoven Peter Schwabe Radboud University Nijmegen

4 McBits: fast constant-time code-based cryptography (to appear at CHES 2013) D. J. Bernstein University of Illinois at Chicago & Technische Universiteit Eindhoven Joint work with: Tung Chou Technische Universiteit Eindhoven Objectives Set new speed records for public-key cryptography. : : : at a high security level. : : : including protection against quantum computers. : : : including full protection against cache-timing attacks, branch-prediction attacks, etc. Peter Schwabe Radboud University Nijmegen

5 McBits: fast constant-time code-based cryptography (to appear at CHES 2013) D. J. Bernstein University of Illinois at Chicago & Technische Universiteit Eindhoven Joint work with: Tung Chou Technische Universiteit Eindhoven Peter Schwabe Radboud University Nijmegen Objectives Set new speed records for public-key cryptography. : : : at a high security level. : : : including protection against quantum computers. : : : including full protection against cache-timing attacks, branch-prediction attacks, etc. : : : using code-based crypto with a solid track record.

6 McBits: fast constant-time code-based cryptography (to appear at CHES 2013) D. J. Bernstein University of Illinois at Chicago & Technische Universiteit Eindhoven Joint work with: Tung Chou Technische Universiteit Eindhoven Peter Schwabe Radboud University Nijmegen Objectives Set new speed records for public-key cryptography. : : : at a high security level. : : : including protection against quantum computers. : : : including full protection against cache-timing attacks, branch-prediction attacks, etc. : : : using code-based crypto with a solid track record. : : : all of the above at once.

7 tant-time ed cryptography ar at CHES 2013) rnstein y of Illinois at Chicago & he Universiteit Eindhoven rk with: ou he Universiteit Eindhoven hwabe University Nijmegen Objectives Set new speed records for public-key cryptography. : : : at a high security level. : : : including protection against quantum computers. : : : including full protection against cache-timing attacks, branch-prediction attacks, etc. : : : using code-based crypto with a solid track record. : : : all of the above at once. Example Some cy (Intel Co from ben mceliec (2008 Bi gls254 (binary e kumfp12 (hyperell curve25 (conserv mceliec ronald1

8 raphy S 2013) is at Chicago & iteit Eindhoven iteit Eindhoven y Nijmegen Objectives Set new speed records for public-key cryptography. : : : at a high security level. : : : including protection against quantum computers. : : : including full protection against cache-timing attacks, branch-prediction attacks, etc. : : : using code-based crypto with a solid track record. : : : all of the above at once. Examples of the co Some cycle counts (Intel Core i from bench.cr.yp mceliece encrypt (2008 Biswas Send gls254 DH (binary elliptic cur kumfp127g DH (hyperelliptic; Euro curve25519 DH (conservative ellipt mceliece decrypt ronald1024 decry

9 go & hoven hoven n Objectives Set new speed records for public-key cryptography. : : : at a high security level. : : : including protection against quantum computers. : : : including full protection against cache-timing attacks, branch-prediction attacks, etc. : : : using code-based crypto with a solid track record. : : : all of the above at once. Examples of the competition Some cycle counts on h9ivy (Intel Core i5-3210m, Ivy Br from bench.cr.yp.to: mceliece encrypt (2008 Biswas Sendrier, 2 80 ) gls254 DH (binary elliptic curve; CHES kumfp127g DH 1 (hyperelliptic; Eurocrypt 201 curve25519 DH 1 (conservative elliptic curve) mceliece decrypt 12 ronald1024 decrypt 13

10 Objectives Set new speed records for public-key cryptography. : : : at a high security level. : : : including protection against quantum computers. : : : including full protection against cache-timing attacks, branch-prediction attacks, etc. : : : using code-based crypto with a solid track record. : : : all of the above at once. Examples of the competition Some cycle counts on h9ivy (Intel Core i5-3210m, Ivy Bridge) from bench.cr.yp.to: mceliece encrypt (2008 Biswas Sendrier, 2 80 ) gls254 DH (binary elliptic curve; CHES 2013) kumfp127g DH (hyperelliptic; Eurocrypt 2013) curve25519 DH (conservative elliptic curve) mceliece decrypt ronald1024 decrypt

11 es speed records c-key cryptography. high security level. ding protection uantum computers. ding full protection ache-timing attacks, rediction attacks, etc. code-based crypto lid track record. f the above at once. Examples of the competition Some cycle counts on h9ivy (Intel Core i5-3210m, Ivy Bridge) from bench.cr.yp.to: mceliece encrypt (2008 Biswas Sendrier, 2 80 ) gls254 DH (binary elliptic curve; CHES 2013) kumfp127g DH (hyperelliptic; Eurocrypt 2013) curve25519 DH (conservative elliptic curve) mceliece decrypt ronald1024 decrypt New dec (n; t) = (

12 rds tography. ity level. ction omputers. rotection ng attacks, ttacks, etc. ed crypto record. e at once. Examples of the competition Some cycle counts on h9ivy (Intel Core i5-3210m, Ivy Bridge) from bench.cr.yp.to: mceliece encrypt (2008 Biswas Sendrier, 2 80 ) gls254 DH (binary elliptic curve; CHES 2013) kumfp127g DH (hyperelliptic; Eurocrypt 2013) curve25519 DH (conservative elliptic curve) mceliece decrypt ronald1024 decrypt New decoding spe (n; t) = (4096; 41);

13 Examples of the competition Some cycle counts on h9ivy (Intel Core i5-3210m, Ivy Bridge) from bench.cr.yp.to: New decoding speeds (n; t) = (4096; 41); secu, c. mceliece encrypt (2008 Biswas Sendrier, 2 80 ) gls254 DH (binary elliptic curve; CHES 2013) kumfp127g DH (hyperelliptic; Eurocrypt 2013) curve25519 DH (conservative elliptic curve) mceliece decrypt ronald1024 decrypt

14 Examples of the competition Some cycle counts on h9ivy (Intel Core i5-3210m, Ivy Bridge) from bench.cr.yp.to: New decoding speeds (n; t) = (4096; 41); security: mceliece encrypt (2008 Biswas Sendrier, 2 80 ) gls254 DH (binary elliptic curve; CHES 2013) kumfp127g DH (hyperelliptic; Eurocrypt 2013) curve25519 DH (conservative elliptic curve) mceliece decrypt ronald1024 decrypt

15 Examples of the competition Some cycle counts on h9ivy (Intel Core i5-3210m, Ivy Bridge) from bench.cr.yp.to: mceliece encrypt (2008 Biswas Sendrier, 2 80 ) gls254 DH (binary elliptic curve; CHES 2013) kumfp127g DH (hyperelliptic; Eurocrypt 2013) curve25519 DH (conservative elliptic curve) mceliece decrypt ronald1024 decrypt New decoding speeds (n; t) = (4096; 41); security: Ivy Bridge cycles. Talk will focus on this case. (Decryption is slightly slower: includes hash, cipher, MAC.)

16 Examples of the competition Some cycle counts on h9ivy (Intel Core i5-3210m, Ivy Bridge) from bench.cr.yp.to: mceliece encrypt (2008 Biswas Sendrier, 2 80 ) gls254 DH (binary elliptic curve; CHES 2013) kumfp127g DH (hyperelliptic; Eurocrypt 2013) curve25519 DH (conservative elliptic curve) mceliece decrypt ronald1024 decrypt New decoding speeds (n; t) = (4096; 41); security: Ivy Bridge cycles. Talk will focus on this case. (Decryption is slightly slower: includes hash, cipher, MAC.) (n; t) = (2048; 32); 2 80 security: Ivy Bridge cycles.

17 Examples of the competition Some cycle counts on h9ivy (Intel Core i5-3210m, Ivy Bridge) from bench.cr.yp.to: mceliece encrypt (2008 Biswas Sendrier, 2 80 ) gls254 DH (binary elliptic curve; CHES 2013) kumfp127g DH (hyperelliptic; Eurocrypt 2013) curve25519 DH (conservative elliptic curve) mceliece decrypt ronald1024 decrypt New decoding speeds (n; t) = (4096; 41); security: Ivy Bridge cycles. Talk will focus on this case. (Decryption is slightly slower: includes hash, cipher, MAC.) (n; t) = (2048; 32); 2 80 security: Ivy Bridge cycles. All load/store addresses and all branch conditions are public. Eliminates cache-timing attacks etc. Similar improvements for CFS.

18 s of the competition cle counts on h9ivy re i5-3210m, Ivy Bridge) ch.cr.yp.to: e encrypt swas Sendrier, 2 80 ) DH lliptic curve; CHES 2013) 7g DH iptic; Eurocrypt 2013) 519 DH ative elliptic curve) e decrypt decrypt New decoding speeds (n; t) = (4096; 41); security: Ivy Bridge cycles. Talk will focus on this case. (Decryption is slightly slower: includes hash, cipher, MAC.) (n; t) = (2048; 32); 2 80 security: Ivy Bridge cycles. All load/store addresses and all branch conditions are public. Eliminates cache-timing attacks etc. Similar improvements for CFS. Constant The extr to elimin Handle a using on XOR (^)

19 mpetition on h9ivy M, Ivy Bridge).to: rier, 2 80 ) ve; CHES 2013) crypt 2013) ic curve) pt New decoding speeds (n; t) = (4096; 41); security: Ivy Bridge cycles. Talk will focus on this case. (Decryption is slightly slower: includes hash, cipher, MAC.) (n; t) = (2048; 32); 2 80 security: Ivy Bridge cycles. All load/store addresses and all branch conditions are public. Eliminates cache-timing attacks etc. Similar improvements for CFS. Constant-time fana The extremist s ap to eliminate timing Handle all secret d using only bit oper XOR (^), AND (&

20 idge) ) ) New decoding speeds (n; t) = (4096; 41); security: Ivy Bridge cycles. Talk will focus on this case. (Decryption is slightly slower: includes hash, cipher, MAC.) (n; t) = (2048; 32); 2 80 security: Ivy Bridge cycles. All load/store addresses and all branch conditions are public. Eliminates cache-timing attacks etc. Similar improvements for CFS. Constant-time fanaticism The extremist s approach to eliminate timing attacks: Handle all secret data using only bit operations XOR (^), AND (&), etc.

21 New decoding speeds (n; t) = (4096; 41); security: Ivy Bridge cycles. Talk will focus on this case. (Decryption is slightly slower: includes hash, cipher, MAC.) Constant-time fanaticism The extremist s approach to eliminate timing attacks: Handle all secret data using only bit operations XOR (^), AND (&), etc. (n; t) = (2048; 32); 2 80 security: Ivy Bridge cycles. All load/store addresses and all branch conditions are public. Eliminates cache-timing attacks etc. Similar improvements for CFS.

22 New decoding speeds (n; t) = (4096; 41); security: Ivy Bridge cycles. Talk will focus on this case. (Decryption is slightly slower: includes hash, cipher, MAC.) (n; t) = (2048; 32); 2 80 security: Ivy Bridge cycles. Constant-time fanaticism The extremist s approach to eliminate timing attacks: Handle all secret data using only bit operations XOR (^), AND (&), etc. We take this approach. All load/store addresses and all branch conditions are public. Eliminates cache-timing attacks etc. Similar improvements for CFS.

23 New decoding speeds (n; t) = (4096; 41); security: Ivy Bridge cycles. Talk will focus on this case. (Decryption is slightly slower: includes hash, cipher, MAC.) (n; t) = (2048; 32); 2 80 security: Ivy Bridge cycles. All load/store addresses and all branch conditions are public. Eliminates cache-timing attacks etc. Similar improvements for CFS. Constant-time fanaticism The extremist s approach to eliminate timing attacks: Handle all secret data using only bit operations XOR (^), AND (&), etc. We take this approach. How can this be competitive in speed? Are you really simulating field multiplication with hundreds of bit operations instead of simple log tables?

24 oding speeds 4096; 41); security: y Bridge cycles. focus on this case. ion is slightly slower: hash, cipher, MAC.) 2048; 32); 2 80 security: y Bridge cycles. store addresses ranch conditions c. Eliminates ing attacks etc. mprovements for CFS. Constant-time fanaticism The extremist s approach to eliminate timing attacks: Handle all secret data using only bit operations XOR (^), AND (&), etc. We take this approach. How can this be competitive in speed? Are you really simulating field multiplication with hundreds of bit operations instead of simple log tables? Yes, we Not as s On a typ the XOR is actual operatin on vecto

25 eds security: cycles. this case. htly slower: er, MAC.) 2 80 security: cycles. esses ditions tes ks etc. nts for CFS. Constant-time fanaticism The extremist s approach to eliminate timing attacks: Handle all secret data using only bit operations XOR (^), AND (&), etc. We take this approach. How can this be competitive in speed? Are you really simulating field multiplication with hundreds of bit operations instead of simple log tables? Yes, we are. Not as slow as it s On a typical 32-bit the XOR instructio is actually 32-bit X operating in parall on vectors of 32 b

26 rity: : ity: S. Constant-time fanaticism The extremist s approach to eliminate timing attacks: Handle all secret data using only bit operations XOR (^), AND (&), etc. We take this approach. How can this be competitive in speed? Are you really simulating field multiplication with hundreds of bit operations instead of simple log tables? Yes, we are. Not as slow as it sounds! On a typical 32-bit CPU, the XOR instruction is actually 32-bit XOR, operating in parallel on vectors of 32 bits.

27 Constant-time fanaticism The extremist s approach to eliminate timing attacks: Handle all secret data using only bit operations XOR (^), AND (&), etc. We take this approach. Yes, we are. Not as slow as it sounds! On a typical 32-bit CPU, the XOR instruction is actually 32-bit XOR, operating in parallel on vectors of 32 bits. How can this be competitive in speed? Are you really simulating field multiplication with hundreds of bit operations instead of simple log tables?

28 Constant-time fanaticism The extremist s approach to eliminate timing attacks: Handle all secret data using only bit operations XOR (^), AND (&), etc. We take this approach. How can this be competitive in speed? Are you really simulating field multiplication with hundreds of bit operations instead of simple log tables? Yes, we are. Not as slow as it sounds! On a typical 32-bit CPU, the XOR instruction is actually 32-bit XOR, operating in parallel on vectors of 32 bits. Low-end smartphone CPU: 128-bit XOR every cycle. Ivy Bridge: 256-bit XOR every cycle, or three 128-bit XORs.

29 -time fanaticism emist s approach ate timing attacks: ll secret data ly bit operations, AND (&), etc. this approach. n this be ive in speed? really simulating tiplication with of bit operations f simple log tables? Yes, we are. Not as slow as it sounds! On a typical 32-bit CPU, the XOR instruction is actually 32-bit XOR, operating in parallel on vectors of 32 bits. Low-end smartphone CPU: 128-bit XOR every cycle. Ivy Bridge: 256-bit XOR every cycle, or three 128-bit XORs. Not imm that this saves tim multiplic

30 ticism proach attacks: ata ations ), etc. ach. ed? lating with erations og tables? Yes, we are. Not as slow as it sounds! On a typical 32-bit CPU, the XOR instruction is actually 32-bit XOR, operating in parallel on vectors of 32 bits. Low-end smartphone CPU: 128-bit XOR every cycle. Ivy Bridge: 256-bit XOR every cycle, or three 128-bit XORs. Not immediately o that this bitslicing saves time for, e.g multiplication in F

31 Yes, we are. Not as slow as it sounds! On a typical 32-bit CPU, the XOR instruction is actually 32-bit XOR, operating in parallel on vectors of 32 bits. Not immediately obvious that this bitslicing saves time for, e.g., multiplication in F Low-end smartphone CPU: 128-bit XOR every cycle. Ivy Bridge: 256-bit XOR every cycle, or three 128-bit XORs.

32 Yes, we are. Not as slow as it sounds! On a typical 32-bit CPU, the XOR instruction is actually 32-bit XOR, operating in parallel on vectors of 32 bits. Not immediately obvious that this bitslicing saves time for, e.g., multiplication in F Low-end smartphone CPU: 128-bit XOR every cycle. Ivy Bridge: 256-bit XOR every cycle, or three 128-bit XORs.

33 Yes, we are. Not as slow as it sounds! On a typical 32-bit CPU, the XOR instruction is actually 32-bit XOR, operating in parallel on vectors of 32 bits. Not immediately obvious that this bitslicing saves time for, e.g., multiplication in F But quite obvious that it saves time for addition in F Low-end smartphone CPU: 128-bit XOR every cycle. Ivy Bridge: 256-bit XOR every cycle, or three 128-bit XORs.

34 Yes, we are. Not as slow as it sounds! On a typical 32-bit CPU, the XOR instruction is actually 32-bit XOR, operating in parallel on vectors of 32 bits. Low-end smartphone CPU: 128-bit XOR every cycle. Ivy Bridge: 256-bit XOR every cycle, or three 128-bit XORs. Not immediately obvious that this bitslicing saves time for, e.g., multiplication in F But quite obvious that it saves time for addition in F Typical decoding algorithms have add, mult roughly balanced. Coming next: how to save many adds and most mults. Nice synergy with bitslicing.

35 are. low as it sounds! ical 32-bit CPU, instruction ly 32-bit XOR, g in parallel rs of 32 bits. smartphone CPU: OR every cycle. e: OR every cycle, 128-bit XORs. Not immediately obvious that this bitslicing saves time for, e.g., multiplication in F But quite obvious that it saves time for addition in F Typical decoding algorithms have add, mult roughly balanced. Coming next: how to save many adds and most mults. Nice synergy with bitslicing. The add Fix n = Big final is to find of f = c For each compute 41 adds,

36 ounds! CPU, n OR, el its. ne CPU: cycle. cycle, Rs. Not immediately obvious that this bitslicing saves time for, e.g., multiplication in F But quite obvious that it saves time for addition in F Typical decoding algorithms have add, mult roughly balanced. Coming next: how to save many adds and most mults. Nice synergy with bitslicing. The additive FFT Fix n = 4096 = 2 1 Big final decoding is to find all roots of f = c 41 x 41 + For each 2 F 2 12 compute f() by H 41 adds, 41 mults.

37 Not immediately obvious that this bitslicing saves time for, e.g., multiplication in F But quite obvious that it saves time for addition in F Typical decoding algorithms have add, mult roughly balanced. Coming next: how to save many adds and most mults. Nice synergy with bitslicing. The additive FFT Fix n = 4096 = 2 12, t = 41. Big final decoding step is to find all roots in F 2 12 of f = c 41 x c 0 x 0. For each 2 F 2 12, compute f() by Horner s r 41 adds, 41 mults.

38 Not immediately obvious that this bitslicing saves time for, e.g., multiplication in F But quite obvious that it saves time for addition in F Typical decoding algorithms have add, mult roughly balanced. Coming next: how to save many adds and most mults. Nice synergy with bitslicing. The additive FFT Fix n = 4096 = 2 12, t = 41. Big final decoding step is to find all roots in F 2 12 of f = c 41 x c 0 x 0. For each 2 F 2 12, compute f() by Horner s rule: 41 adds, 41 mults.

39 Not immediately obvious that this bitslicing saves time for, e.g., multiplication in F But quite obvious that it saves time for addition in F Typical decoding algorithms have add, mult roughly balanced. Coming next: how to save many adds and most mults. Nice synergy with bitslicing. The additive FFT Fix n = 4096 = 2 12, t = 41. Big final decoding step is to find all roots in F 2 12 of f = c 41 x c 0 x 0. For each 2 F 2 12, compute f() by Horner s rule: 41 adds, 41 mults. Or use Chien search: compute c i g i, c i g 2i, c i g 3i, etc. Cost per point: again 41 adds, 41 mults.

40 Not immediately obvious that this bitslicing saves time for, e.g., multiplication in F But quite obvious that it saves time for addition in F Typical decoding algorithms have add, mult roughly balanced. Coming next: how to save many adds and most mults. Nice synergy with bitslicing. The additive FFT Fix n = 4096 = 2 12, t = 41. Big final decoding step is to find all roots in F 2 12 of f = c 41 x c 0 x 0. For each 2 F 2 12, compute f() by Horner s rule: 41 adds, 41 mults. Or use Chien search: compute c i g i, c i g 2i, c i g 3i, etc. Cost per point: again 41 adds, 41 mults. Our cost: 6.01 adds, 2.09 mults.

41 ediately obvious bitslicing e for, e.g., ation in F e obvious that it e for addition in F ecoding algorithms, mult roughly balanced. next: how to save ds and most mults. ergy with bitslicing. The additive FFT Fix n = 4096 = 2 12, t = 41. Big final decoding step is to find all roots in F 2 12 of f = c 41 x c 0 x 0. For each 2 F 2 12, compute f() by Horner s rule: 41 adds, 41 mults. Or use Chien search: compute c i g i, c i g 2i, c i g 3i, etc. Cost per point: again 41 adds, 41 mults. Our cost: 6.01 adds, 2.09 mults. Asympto normally so Horne Θ(nt) =

42 bvious., that it ition in F lgorithms ghly balanced. to save st mults. bitslicing. The additive FFT Fix n = 4096 = 2 12, t = 41. Big final decoding step is to find all roots in F 2 12 of f = c 41 x c 0 x 0. For each 2 F 2 12, compute f() by Horner s rule: 41 adds, 41 mults. Or use Chien search: compute c i g i, c i g 2i, c i g 3i, etc. Cost per point: again 41 adds, 41 mults. Our cost: 6.01 adds, 2.09 mults. Asymptotics: normally t 2 Θ(n= so Horner s rule co Θ(nt) = Θ(n 2 = lg

43 12. nced. The additive FFT Fix n = 4096 = 2 12, t = 41. Big final decoding step is to find all roots in F 2 12 of f = c 41 x c 0 x 0. For each 2 F 2 12, compute f() by Horner s rule: 41 adds, 41 mults. Asymptotics: normally t 2 Θ(n= lg n), so Horner s rule costs Θ(nt) = Θ(n 2 = lg n). Or use Chien search: compute c i g i, c i g 2i, c i g 3i, etc. Cost per point: again 41 adds, 41 mults. Our cost: 6.01 adds, 2.09 mults.

44 The additive FFT Fix n = 4096 = 2 12, t = 41. Big final decoding step is to find all roots in F 2 12 of f = c 41 x c 0 x 0. Asymptotics: normally t 2 Θ(n= lg n), so Horner s rule costs Θ(nt) = Θ(n 2 = lg n). For each 2 F 2 12, compute f() by Horner s rule: 41 adds, 41 mults. Or use Chien search: compute c i g i, c i g 2i, c i g 3i, etc. Cost per point: again 41 adds, 41 mults. Our cost: 6.01 adds, 2.09 mults.

45 The additive FFT Fix n = 4096 = 2 12, t = 41. Big final decoding step is to find all roots in F 2 12 of f = c 41 x c 0 x 0. For each 2 F 2 12, compute f() by Horner s rule: 41 adds, 41 mults. Or use Chien search: compute c i g i, c i g 2i, c i g 3i, etc. Cost per point: again 41 adds, 41 mults. Asymptotics: normally t 2 Θ(n= lg n), so Horner s rule costs Θ(nt) = Θ(n 2 = lg n). Wait a minute. Didn t we learn in school that FFT evaluates an n-coeff polynomial at n points using n 1+o(1) operations? Isn t this better than n 2 = lg n? Our cost: 6.01 adds, 2.09 mults.

46 itive FFT 4096 = 2 12, t = 41. decoding step all roots in F x c 0 x 0. 2 F 2 12, f() by Horner s rule: 41 mults. hien search: compute 2i, c i g 3i, etc. Cost per ain 41 adds, 41 mults. : 6.01 adds, 2.09 mults. Asymptotics: normally t 2 Θ(n= lg n), so Horner s rule costs Θ(nt) = Θ(n 2 = lg n). Wait a minute. Didn t we learn in school that FFT evaluates an n-coeff polynomial at n points using n 1+o(1) operations? Isn t this better than n 2 = lg n? Standard Want to f = c 0 + at all the Write f Observe f() = f f( ) = f 0 has n evaluate by same Similarly

47 2, t = 41. step in F c 0 x 0., orner s rule: h: compute etc. Cost per ds, 41 mults. ds, 2.09 mults. Asymptotics: normally t 2 Θ(n= lg n), so Horner s rule costs Θ(nt) = Θ(n 2 = lg n). Wait a minute. Didn t we learn in school that FFT evaluates an n-coeff polynomial at n points using n 1+o(1) operations? Isn t this better than n 2 = lg n? Standard radix-2 F Want to evaluate f = c 0 + c 1 x + at all the nth root Write f as f 0 (x 2 ) Observe big overla f() = f 0 ( 2 ) + f( ) = f 0 ( 2 ) f 0 has n=2 coeffs; evaluate at (n=2)n by same idea recur Similarly f 1.

48 ule: te per lts. ults. Asymptotics: normally t 2 Θ(n= lg n), so Horner s rule costs Θ(nt) = Θ(n 2 = lg n). Wait a minute. Didn t we learn in school that FFT evaluates an n-coeff polynomial at n points using n 1+o(1) operations? Isn t this better than n 2 = lg n? Standard radix-2 FFT: Want to evaluate f = c 0 + c 1 x + + c n 1x n at all the nth roots of 1. Write f as f 0 (x 2 ) + xf 1 (x 2 ) Observe big overlap between f() = f 0 ( 2 ) + f 1 ( 2 ), f( ) = f 0 ( 2 ) f 1 ( 2 ). f 0 has n=2 coeffs; evaluate at (n=2)nd roots of by same idea recursively. Similarly f 1.

49 Asymptotics: normally t 2 Θ(n= lg n), so Horner s rule costs Θ(nt) = Θ(n 2 = lg n). Wait a minute. Didn t we learn in school that FFT evaluates an n-coeff polynomial at n points using n 1+o(1) operations? Isn t this better than n 2 = lg n? Standard radix-2 FFT: Want to evaluate f = c 0 + c 1 x + + c n 1x n 1 at all the nth roots of 1. Write f as f 0 (x 2 ) + xf 1 (x 2 ). Observe big overlap between f() = f 0 ( 2 ) + f 1 ( 2 ), f( ) = f 0 ( 2 ) f 1 ( 2 ). f 0 has n=2 coeffs; evaluate at (n=2)nd roots of 1 by same idea recursively. Similarly f 1.

50 tics: t 2 Θ(n= lg n), r s rule costs Θ(n 2 = lg n). inute. e learn in school evaluates ff polynomial nts +o(1) operations? better than n 2 = lg n? Standard radix-2 FFT: Want to evaluate f = c 0 + c 1 x + + c n 1x n 1 at all the nth roots of 1. Write f as f 0 (x 2 ) + xf 1 (x 2 ). Observe big overlap between f() = f 0 ( 2 ) + f 1 ( 2 ), f( ) = f 0 ( 2 ) f 1 ( 2 ). f 0 has n=2 coeffs; evaluate at (n=2)nd roots of 1 by same idea recursively. Similarly f 1. Useless i Standard FFT con 1988 Wa independ additive Still quit 1996 von some im 2010 Ga much be We use G plus som

51 lg n), sts n). school s ial ations? an n 2 = lg n? Standard radix-2 FFT: Want to evaluate f = c 0 + c 1 x + + c n 1x n 1 at all the nth roots of 1. Write f as f 0 (x 2 ) + xf 1 (x 2 ). Observe big overlap between f() = f 0 ( 2 ) + f 1 ( 2 ), f( ) = f 0 ( 2 ) f 1 ( 2 ). f 0 has n=2 coeffs; evaluate at (n=2)nd roots of 1 by same idea recursively. Similarly f 1. Useless in char 2: Standard workarou FFT considered im 1988 Wang Zhu, independently 198 additive FFT in Still quite expensiv 1996 von zur Gath some improvement 2010 Gao Mateer: much better additi We use Gao Mate plus some new imp

52 ? Standard radix-2 FFT: Want to evaluate f = c 0 + c 1 x + + c n 1x n 1 at all the nth roots of 1. Write f as f 0 (x 2 ) + xf 1 (x 2 ). Observe big overlap between f() = f 0 ( 2 ) + f 1 ( 2 ), f( ) = f 0 ( 2 ) f 1 ( 2 ). f 0 has n=2 coeffs; evaluate at (n=2)nd roots of 1 by same idea recursively. Similarly f 1. Useless in char 2: =. Standard workarounds are pa FFT considered impractical Wang Zhu, independently 1989 Cantor: additive FFT in char 2. Still quite expensive von zur Gathen Gerhar some improvements Gao Mateer: much better additive FFT. We use Gao Mateer, plus some new improvement

53 Standard radix-2 FFT: Want to evaluate f = c 0 + c 1 x + + c n 1x n 1 at all the nth roots of 1. Write f as f 0 (x 2 ) + xf 1 (x 2 ). Observe big overlap between f() = f 0 ( 2 ) + f 1 ( 2 ), f( ) = f 0 ( 2 ) f 1 ( 2 ). f 0 has n=2 coeffs; evaluate at (n=2)nd roots of 1 by same idea recursively. Similarly f 1. Useless in char 2: =. Standard workarounds are painful. FFT considered impractical Wang Zhu, independently 1989 Cantor: additive FFT in char 2. Still quite expensive von zur Gathen Gerhard: some improvements Gao Mateer: much better additive FFT. We use Gao Mateer, plus some new improvements.

54 radix-2 FFT: evaluate c 1 x + + c n 1x n 1 nth roots of 1. as f 0 (x 2 ) + xf 1 (x 2 ). big overlap between 0( 2 ) + f 1 ( 2 ), f 0 ( 2 ) f 1 ( 2 ). =2 coeffs; at (n=2)nd roots of 1 idea recursively. f 1. Useless in char 2: =. Standard workarounds are painful. FFT considered impractical Wang Zhu, independently 1989 Cantor: additive FFT in char 2. Still quite expensive von zur Gathen Gerhard: some improvements Gao Mateer: much better additive FFT. We use Gao Mateer, plus some new improvements. Gao and f = c 0 + on a size Main ide f 0 (x 2 + Big over f 0 ( 2 + and f( f 0 ( 2 + Twist Then size-(n=2 Apply sa

55 FT: + c n 1x n 1 s of 1. + xf 1 (x 2 ). p between f 1 ( 2 ), f 1 ( 2 ). d roots of 1 sively. Useless in char 2: =. Standard workarounds are painful. FFT considered impractical Wang Zhu, independently 1989 Cantor: additive FFT in char 2. Still quite expensive von zur Gathen Gerhard: some improvements Gao Mateer: much better additive FFT. We use Gao Mateer, plus some new improvements. Gao and Mateer ev f = c 0 + c 1 x + on a size-n F 2 -line Main idea: Write f f 0 (x 2 + x) + xf 1 ( Big overlap betwee f 0 ( 2 + ) + f 1 ( and f( + 1) = f 0 ( 2 + ) + ( + Twist to ensure Then 2 + is size-(n=2) F 2 -linea Apply same idea re

56 1 Useless in char 2: =. Standard workarounds are painful. FFT considered impractical. Gao and Mateer evaluate f = c 0 + c 1 x + + c n 1x n on a size-n F 2 -linear space Wang Zhu, independently 1989 Cantor: additive FFT in char 2. Still quite expensive von zur Gathen Gerhard: some improvements. Main idea: Write f as f 0 (x 2 + x) + xf 1 (x 2 + x). Big overlap between f() = f 0 ( 2 + ) + f 1 ( 2 + ) and f( + 1) = f 0 ( 2 + ) + ( + 1)f 1 ( Gao Mateer: much better additive FFT. We use Gao Mateer, plus some new improvements. Twist to ensure 1 2 space Then 2 + is a size-(n=2) F 2 -linear space. Apply same idea recursively.

57 Useless in char 2: =. Standard workarounds are painful. FFT considered impractical Wang Zhu, independently 1989 Cantor: additive FFT in char 2. Still quite expensive von zur Gathen Gerhard: some improvements Gao Mateer: much better additive FFT. We use Gao Mateer, plus some new improvements. Gao and Mateer evaluate f = c 0 + c 1 x + + c n 1x n 1 on a size-n F 2 -linear space. Main idea: Write f as f 0 (x 2 + x) + xf 1 (x 2 + x). Big overlap between f() = f 0 ( 2 + ) + f 1 ( 2 + ) and f( + 1) = f 0 ( 2 + ) + ( + 1)f 1 ( 2 + ). Twist to ensure 1 2 space. Then 2 + is a size-(n=2) F 2 -linear space. Apply same idea recursively.

58 n char 2: =. workarounds are painful. sidered impractical. ng Zhu, ently 1989 Cantor: FFT in char 2. e expensive. zur Gathen Gerhard: provements. o Mateer: tter additive FFT. ao Mateer, e new improvements. Gao and Mateer evaluate f = c 0 + c 1 x + + c n 1x n 1 on a size-n F 2 -linear space. Main idea: Write f as f 0 (x 2 + x) + xf 1 (x 2 + x). Big overlap between f() = f 0 ( 2 + ) + f 1 ( 2 + ) and f( + 1) = f 0 ( 2 + ) + ( + 1)f 1 ( 2 + ). Twist to ensure 1 2 space. Then 2 + is a size-(n=2) F 2 -linear space. Apply same idea recursively. We gene f = c 0 + for any t ) severa not all o by simpl For t = 0 For t 2 f f 1 is a c Instead o this cons multiply and com

59 =. nds are painful. practical. 9 Cantor: char 2. e. en Gerhard: s. ve FFT. er, rovements. Gao and Mateer evaluate f = c 0 + c 1 x + + c n 1x n 1 on a size-n F 2 -linear space. Main idea: Write f as f 0 (x 2 + x) + xf 1 (x 2 + x). Big overlap between f() = f 0 ( 2 + ) + f 1 ( 2 + ) and f( + 1) = f 0 ( 2 + ) + ( + 1)f 1 ( 2 + ). Twist to ensure 1 2 space. Then 2 + is a size-(n=2) F 2 -linear space. Apply same idea recursively. We generalize to f = c 0 + c 1 x + for any t < n. ) several optimiza not all of which ar by simply tracking For t = 0: copy c 0 For t 2 f1; 2g: f 1 is a constant. Instead of multiply this constant by ea multiply only by ge and compute subse

60 inful. d: s. Gao and Mateer evaluate f = c 0 + c 1 x + + c n 1x n 1 on a size-n F 2 -linear space. Main idea: Write f as f 0 (x 2 + x) + xf 1 (x 2 + x). Big overlap between f() = f 0 ( 2 + ) + f 1 ( 2 + ) and f( + 1) = f 0 ( 2 + ) + ( + 1)f 1 ( 2 + ). Twist to ensure 1 2 space. Then 2 + is a size-(n=2) F 2 -linear space. Apply same idea recursively. We generalize to f = c 0 + c 1 x + + c t x t for any t < n. ) several optimizations, not all of which are automat by simply tracking zeros. For t = 0: copy c 0. For t 2 f1; 2g: f 1 is a constant. Instead of multiplying this constant by each, multiply only by generators and compute subset sums.

61 Gao and Mateer evaluate f = c 0 + c 1 x + + c n 1x n 1 on a size-n F 2 -linear space. Main idea: Write f as f 0 (x 2 + x) + xf 1 (x 2 + x). Big overlap between f() = f 0 ( 2 + ) + f 1 ( 2 + ) and f( + 1) = f 0 ( 2 + ) + ( + 1)f 1 ( 2 + ). Twist to ensure 1 2 space. Then 2 + is a size-(n=2) F 2 -linear space. Apply same idea recursively. We generalize to f = c 0 + c 1 x + + c t x t for any t < n. ) several optimizations, not all of which are automated by simply tracking zeros. For t = 0: copy c 0. For t 2 f1; 2g: f 1 is a constant. Instead of multiplying this constant by each, multiply only by generators and compute subset sums.

62 Mateer evaluate c 1 x + + c n 1x n 1 -n F 2 -linear space. a: Write f as x) + xf 1 (x 2 + x). lap between f() = ) + f 1 ( 2 + ) + 1) = ) + ( + 1)f 1 ( 2 + ). to ensure 1 2 space. 2 + is a ) F 2 -linear space. me idea recursively. We generalize to f = c 0 + c 1 x + + c t x t for any t < n. ) several optimizations, not all of which are automated by simply tracking zeros. For t = 0: copy c 0. For t 2 f1; 2g: f 1 is a constant. Instead of multiplying this constant by each, multiply only by generators and compute subset sums. Syndrom Initial de s 0 = r 1 + s 1 = r 1 s 2 = r 1..., s t = r 1 r 1 ; r 2 ; : : scaled by Typically mapping Not as s still n 2+

63 aluate + c n 1x n 1 ar space. as x 2 + x). n f() = 2 + ) 1)f 1 ( 2 + ). 1 2 space. a r space. cursively. We generalize to f = c 0 + c 1 x + + c t x t for any t < n. ) several optimizations, not all of which are automated by simply tracking zeros. For t = 0: copy c 0. For t 2 f1; 2g: f 1 is a constant. Instead of multiplying this constant by each, multiply only by generators and compute subset sums. Syndrome comput Initial decoding ste s 0 = r 1 + r 2 + s 1 = r r 2 2 s 2 = r r , s = t r 1 t 1 + r 2 t 2 r 1 ; r 2 ; : : : ; r are r n scaled by Goppa c Typically precompu mapping bits to sy Not as slow as Chi still n 2+o(1) and h

64 1 + ).. We generalize to f = c 0 + c 1 x + + c t x t for any t < n. ) several optimizations, not all of which are automated by simply tracking zeros. For t = 0: copy c 0. For t 2 f1; 2g: f 1 is a constant. Instead of multiplying this constant by each, multiply only by generators and compute subset sums. Syndrome computation Initial decoding step: compu s 0 = r 1 + r r, n s 1 = r r r n s 2 = r r r n..., s = t r 1 t 1 + r 2 t r n r 1 ; r 2 ; : : : ; r are received bi n scaled by Goppa constants. Typically precompute matrix mapping bits to syndrome. Not as slow as Chien search still n 2+o(1) and huge secret

65 We generalize to f = c 0 + c 1 x + + c t x t for any t < n. ) several optimizations, not all of which are automated by simply tracking zeros. For t = 0: copy c 0. For t 2 f1; 2g: f 1 is a constant. Instead of multiplying this constant by each, multiply only by generators and compute subset sums. Syndrome computation Initial decoding step: compute s 0 = r 1 + r r, n s 1 = r r r n, n s 2 = r r r n 2 n,..., s = t r 1 t 1 + r 2 t r n t n. r 1 ; r 2 ; : : : ; r are received bits n scaled by Goppa constants. Typically precompute matrix mapping bits to syndrome. Not as slow as Chien search but still n 2+o(1) and huge secret key.

66 ralize to c 1 x + + c t x t < n. l optimizations, f which are automated y tracking zeros. : copy c 0. 1; 2g: onstant. f multiplying tant by each, only by generators pute subset sums. Syndrome computation Initial decoding step: compute s 0 = r 1 + r r, n s 1 = r r r n, n s 2 = r r r n 2 n,..., s = t r 1 t 1 + r 2 t r n t n. r 1 ; r 2 ; : : : ; r are received bits n scaled by Goppa constants. Typically precompute matrix mapping bits to syndrome. Not as slow as Chien search but still n 2+o(1) and huge secret key. Compare f( 1 ) = f( 2 ) =..., f( ) = n

67 + c t x t tions, e automated zeros.. ing ch, nerators t sums. Syndrome computation Initial decoding step: compute s 0 = r 1 + r r, n s 1 = r r r n, n s 2 = r r r n 2 n,..., s = t r 1 t 1 + r 2 t r n t n. r 1 ; r 2 ; : : : ; r are received bits n scaled by Goppa constants. Typically precompute matrix mapping bits to syndrome. Not as slow as Chien search but still n 2+o(1) and huge secret key. Compare to multip f( 1 ) = c 0 + c 1 1 f( 2 ) = c 0 + c , f( ) = n c 0 + c 1

68 ed Syndrome computation Initial decoding step: compute s 0 = r 1 + r r, n s 1 = r r r n, n s 2 = r r r n 2 n,..., s = t r 1 t 1 + r 2 t r n t n. r 1 ; r 2 ; : : : ; r are received bits n scaled by Goppa constants. Typically precompute matrix mapping bits to syndrome. Not as slow as Chien search but still n 2+o(1) and huge secret key. Compare to multipoint evalu f( 1 ) = c 0 + c c f( 2 ) = c 0 + c c..., f( ) = n c 0 + c n

69 Syndrome computation Initial decoding step: compute s 0 = r 1 + r r, n s 1 = r r r n, n s 2 = r r r n 2 n, Compare to multipoint evaluation: f( 1 ) = c 0 + c c t t 1, f( 2 ) = c 0 + c c t t 2,..., f( ) = n c 0 + c n c t t n...., s = t r 1 t 1 + r 2 t r n t n. r 1 ; r 2 ; : : : ; r are received bits n scaled by Goppa constants. Typically precompute matrix mapping bits to syndrome. Not as slow as Chien search but still n 2+o(1) and huge secret key.

70 Syndrome computation Initial decoding step: compute s 0 = r 1 + r r, n s 1 = r r r n, n s 2 = r r r n 2 n,..., s = t r 1 t 1 + r 2 t r n t n. r 1 ; r 2 ; : : : ; r are received bits n scaled by Goppa constants. Typically precompute matrix mapping bits to syndrome. Not as slow as Chien search but still n 2+o(1) and huge secret key. Compare to multipoint evaluation: f( 1 ) = c 0 + c c t t 1, f( 2 ) = c 0 + c c t t 2,..., f( ) = n c 0 + c n c t t n. Matrix for syndrome computation is transpose of matrix for multipoint evaluation.

71 Syndrome computation Initial decoding step: compute s 0 = r 1 + r r, n s 1 = r r r n, n s 2 = r r r n 2 n,..., s = t r 1 t 1 + r 2 t r n t n. r 1 ; r 2 ; : : : ; r are received bits n scaled by Goppa constants. Typically precompute matrix mapping bits to syndrome. Not as slow as Chien search but still n 2+o(1) and huge secret key. Compare to multipoint evaluation: f( 1 ) = c 0 + c c t t 1, f( 2 ) = c 0 + c c t t 2,..., f( ) = n c 0 + c n c t t n. Matrix for syndrome computation is transpose of matrix for multipoint evaluation. Amazing consequence: syndrome computation is as few ops as multipoint evaluation. Eliminate precomputed matrix.

72 e computation coding step: compute r r n, 1 + r r n n, r r n 2 n, t 1 + r 2 t r n t n. : ; r n are received bits Goppa constants. precompute matrix bits to syndrome. low as Chien search but o(1) and huge secret key. Compare to multipoint evaluation: f( 1 ) = c 0 + c c t t 1, f( 2 ) = c 0 + c c t t 2,..., f( ) = n c 0 + c n c t t n. Matrix for syndrome computation is transpose of matrix for multipoint evaluation. Amazing consequence: syndrome computation is as few ops as multipoint evaluation. Eliminate precomputed matrix. Transpos If a linea compute then reve exchangi compute 1956 Bo independ for Boole 1973 Fid preserves preserves number

73 ation p: compute + r, n + + r n, n + + r n 2 n, + + r n t n. eceived bits onstants. te matrix ndrome. en search but uge secret key. Compare to multipoint evaluation: f( 1 ) = c 0 + c c t t 1, f( 2 ) = c 0 + c c t t 2,..., f( ) = n c 0 + c n c t t n. Matrix for syndrome computation is transpose of matrix for multipoint evaluation. Amazing consequence: syndrome computation is as few ops as multipoint evaluation. Eliminate precomputed matrix. Transposition princ If a linear algorithm computes a matrix then reversing edg exchanging inputs/ computes the tran 1956 Bordewijk; independently 195 for Boolean matric 1973 Fiduccia ana preserves number o preserves number o number of nontrivi

74 te n, 2 n, t n. ts but key. Compare to multipoint evaluation: f( 1 ) = c 0 + c c t t 1, f( 2 ) = c 0 + c c t t 2,..., f( ) = n c 0 + c n c t t n. Matrix for syndrome computation is transpose of matrix for multipoint evaluation. Amazing consequence: syndrome computation is as few ops as multipoint evaluation. Eliminate precomputed matrix. Transposition principle: If a linear algorithm computes a matrix M then reversing edges and exchanging inputs/outputs computes the transpose of M 1956 Bordewijk; independently 1957 Lupanov for Boolean matrices Fiduccia analysis: preserves number of mults; preserves number of adds plu number of nontrivial outputs

75 Compare to multipoint evaluation: f( 1 ) = c 0 + c c t t 1, f( 2 ) = c 0 + c c t t 2,..., f( ) = n c 0 + c n c t t n. Matrix for syndrome computation is transpose of matrix for multipoint evaluation. Amazing consequence: syndrome computation is as few ops as multipoint evaluation. Eliminate precomputed matrix. Transposition principle: If a linear algorithm computes a matrix M then reversing edges and exchanging inputs/outputs computes the transpose of M Bordewijk; independently 1957 Lupanov for Boolean matrices Fiduccia analysis: preserves number of mults; preserves number of adds plus number of nontrivial outputs.

76 to multipoint evaluation: c 0 + c c t t 1, c 0 + c c t t 2, c 0 + c n c t t n. r syndrome computation ose of r multipoint evaluation. consequence: e computation is as few ultipoint evaluation. e precomputed matrix. Transposition principle: If a linear algorithm computes a matrix M then reversing edges and exchanging inputs/outputs computes the transpose of M Bordewijk; independently 1957 Lupanov for Boolean matrices Fiduccia analysis: preserves number of mults; preserves number of adds plus number of nontrivial outputs. We built producin Too man gcc ran

77 oint evaluation: + + c t t 1, + + c t t 2, + + n c t t n. e computation int evaluation. nce: tion is as few valuation. uted matrix. Transposition principle: If a linear algorithm computes a matrix M then reversing edges and exchanging inputs/outputs computes the transpose of M Bordewijk; independently 1957 Lupanov for Boolean matrices Fiduccia analysis: preserves number of mults; preserves number of adds plus number of nontrivial outputs. We built transposi producing C code. Too many variable gcc ran out of me

78 ation: t t 1, t t 2, c t t n. ation tion. few. ix. Transposition principle: If a linear algorithm computes a matrix M then reversing edges and exchanging inputs/outputs computes the transpose of M Bordewijk; independently 1957 Lupanov for Boolean matrices Fiduccia analysis: preserves number of mults; preserves number of adds plus number of nontrivial outputs. We built transposing compil producing C code. Too many variables for m = gcc ran out of memory.

79 Transposition principle: If a linear algorithm computes a matrix M then reversing edges and exchanging inputs/outputs computes the transpose of M. We built transposing compiler producing C code. Too many variables for m = 13; gcc ran out of memory Bordewijk; independently 1957 Lupanov for Boolean matrices Fiduccia analysis: preserves number of mults; preserves number of adds plus number of nontrivial outputs.

80 Transposition principle: If a linear algorithm computes a matrix M then reversing edges and exchanging inputs/outputs computes the transpose of M Bordewijk; independently 1957 Lupanov for Boolean matrices. We built transposing compiler producing C code. Too many variables for m = 13; gcc ran out of memory. Used qhasm register allocator to optimize the variables. Worked, but not very quickly Fiduccia analysis: preserves number of mults; preserves number of adds plus number of nontrivial outputs.

81 Transposition principle: If a linear algorithm computes a matrix M then reversing edges and exchanging inputs/outputs computes the transpose of M Bordewijk; independently 1957 Lupanov for Boolean matrices Fiduccia analysis: preserves number of mults; preserves number of adds plus number of nontrivial outputs. We built transposing compiler producing C code. Too many variables for m = 13; gcc ran out of memory. Used qhasm register allocator to optimize the variables. Worked, but not very quickly. Wrote faster register allocator. Still excessive code size.

82 Transposition principle: If a linear algorithm computes a matrix M then reversing edges and exchanging inputs/outputs computes the transpose of M Bordewijk; independently 1957 Lupanov for Boolean matrices Fiduccia analysis: preserves number of mults; preserves number of adds plus number of nontrivial outputs. We built transposing compiler producing C code. Too many variables for m = 13; gcc ran out of memory. Used qhasm register allocator to optimize the variables. Worked, but not very quickly. Wrote faster register allocator. Still excessive code size. Built new interpreter, allowing some code compression. Still big; still some overhead.

83 ition principle: r algorithm s a matrix M rsing edges and ng inputs/outputs s the transpose of M. rdewijk; ently 1957 Lupanov an matrices. uccia analysis: number of mults; number of adds plus of nontrivial outputs. We built transposing compiler producing C code. Too many variables for m = 13; gcc ran out of memory. Used qhasm register allocator to optimize the variables. Worked, but not very quickly. Wrote faster register allocator. Still excessive code size. Built new interpreter, allowing some code compression. Still big; still some overhead. Better so stared at wrote do with sam Small co Speedup translate to transp Further s merged fi scaling b

84 iple: M es and outputs spose of M. 7 Lupanov es. lysis: f mults; f adds plus al outputs. We built transposing compiler producing C code. Too many variables for m = 13; gcc ran out of memory. Used qhasm register allocator to optimize the variables. Worked, but not very quickly. Wrote faster register allocator. Still excessive code size. Built new interpreter, allowing some code compression. Still big; still some overhead. Better solution: stared at additive wrote down transp with same loops et Small code, no ove Speedups of additi translate easily to transposed algo Further savings: merged first stage scaling by Goppa c

85 .. s We built transposing compiler producing C code. Too many variables for m = 13; gcc ran out of memory. Used qhasm register allocator to optimize the variables. Worked, but not very quickly. Wrote faster register allocator. Still excessive code size. Built new interpreter, allowing some code compression. Still big; still some overhead. Better solution: stared at additive FFT, wrote down transposition with same loops etc. Small code, no overhead. Speedups of additive FFT translate easily to transposed algorithm. Further savings: merged first stage with scaling by Goppa constants.

86 We built transposing compiler producing C code. Too many variables for m = 13; gcc ran out of memory. Used qhasm register allocator to optimize the variables. Worked, but not very quickly. Wrote faster register allocator. Still excessive code size. Built new interpreter, allowing some code compression. Still big; still some overhead. Better solution: stared at additive FFT, wrote down transposition with same loops etc. Small code, no overhead. Speedups of additive FFT translate easily to transposed algorithm. Further savings: merged first stage with scaling by Goppa constants.

87 transposing compiler g C code. y variables for m = 13; out of memory. sm register allocator ize the variables. but not very quickly. ster register allocator. ssive code size. interpreter, some code compression. still some overhead. Better solution: stared at additive FFT, wrote down transposition with same loops etc. Small code, no overhead. Speedups of additive FFT translate easily to transposed algorithm. Further savings: merged first stage with scaling by Goppa constants. Secret p Additive field elem This is n needed i Must ap part of t Same iss Solution Almost d Beneš ne

88 ng compiler s for m = 13; mory. er allocator riables. ery quickly. er allocator. size. er, e compression. overhead. Better solution: stared at additive FFT, wrote down transposition with same loops etc. Small code, no overhead. Speedups of additive FFT translate easily to transposed algorithm. Further savings: merged first stage with scaling by Goppa constants. Secret permutation Additive FFT ) f field elements in a This is not the ord needed in code-bas Must apply a secre part of the secret k Same issue for syn Solution: Batcher Almost done with Beneš network.

McBits: fast constant-time code-based cryptography. (to appear at CHES 2013)

McBits: fast constant-time code-based cryptography (to appear at CHES 2013) D. J. Bernstein University of Illinois at Chicago & Technische Universiteit Eindhoven Joint work with: Tung Chou Technische Universiteit