Hardware implementations of ECC

Hardware implementations of ECC The University of Electro- Communications

Introduction Public- key Cryptography (PKC) The most famous PKC is RSA and ECC Used for key agreement (Diffie- Hellman), digital signatures, public- key encryption Implementation of public- key cryptosystem Slower encryption/decryption (~Mbps) than symmetric key cryptography (AES: ~Gbps) RSA: modular exponentiation, M e ECC: point multiplication, kp on elliptic curve 2/38

Point operations on elliptic curve Point Addition R = P Q Point Doubling y R = 2P P Tangent line Scalar Mult. P Q x kp Binary method, etc. y 2 = x 3 x 1 R R 3/38

Weighted projective coordinate Example of point operations All operations are over GF(p) 4/38

Data Size in Modular Operations Typical key lengths of public- key cryptography RSA w/crt: 1024 bits ECC: ~256 bits RSA: 2048 bits RSA: M e mod n ECC: kp over GF(p) and GF(2 m ) 5/38

# of Modular Multiplications ECC needs more modular multiplications RSA (2048-bit Modular Mult.) x 2048 x 1.5 CRT (1024-bit Modular Mult.) x 1024 x 1.5 x 2 ECC (256-bit Modular Mult.) x 160 x 20 x 1.5 RSA: M e mod n ECC: kp over GF(p) and GF(2 m ) 6/38

Parallelism in high- speed implementation Degree of parallelism RSA = RSA RSA Parallelism in Square and Multiply 2048 x 1 CRT = CRT CRT = CRT CRT CRT CRT Parallelism in CRT 1024 x 1.5 x 1 Parallelism in Square and Multiply 1024 x 1 x 1 ECC ECC = = ECC ECC ECC ECC ECC How much Parallelism in scalar multiplication / Add and Double?? 7/38

Basic hierarchy for ECC Optimization possible in multi- layer Scalar Multiplication Point Addition Point Doubling Finite Field Addition Finite Field Multiplication Finite Field Inversion 8/38

Modular Multiplication = Multiplication Reduction Barrett Reduction (1986): Focus on MSBs Montgomery Reduction (1985): Focus on LSBs Based on n- bit multiplier (n = 8, 16, 32, 64) E.g., CIOS [IEEE Micro KKo c and Kaliski Jr 1996]. Based on bit- serial multiplier Multiplicand x n- bit [CHES Großscha dl 2001] 9/38

Modular Reduction from LSB (binary field) LSB- first bit- serial multiplier (binary fields) X a k-1 b k-1 a 3 b 3 a 2 b 2 a 1 b 1 a 0 b 0 =A(x) =B(x) A(x) B(x) P(x) =a 0 B(x) =P(x) or 0 XOR-based Datapath k d d =a 1 B(x) =P(x) or 0 T(x)... k =a k-1 B(x) =P(x) or 0 Montgomery Multiplier: A(x)B(x)R -1 mod P(x) (R = x k ) 15/11/2007 A(x)B(x)R -k mod P(x) 0 0 0 0 0 0 0 0 - Delay is determined by d - Loop : k/d 10/38

Modular Reduction from MSB (binary field) MSB- first bit- serial multiplier (binary field) X a k-1 b k-1 a 3 a 2 a 1 a 0 b 3 b 2 b 1 b 0 =A(x) =B(x) A(x) B(x) P(x) =a k-1 B(x) =P(x) or 0 XOR-based Datapath k d =a k-2 B(x) =P(x) or 0 d T(x) k.. =a 0 B(x) A(x)B(x) mod P(x) =P(x) or 0 - Delay is determined by d - Loop : k/d 0 15/11/2007 0 0 0 0 0 0 0 A(x)B(x) mod P(x) 11/38

Modular Reduction from LSB (prime field) LSB- first bit- serial multiplier ( prime field ) X x k y k x k-1 y k-1 x 3 y 3 x 2 y 2 x 1 y 1 x 0 y 0 =X =Y X Y p =x 0 Y =p or 0 CSA-based Datapath k d d =x 1 Y =p or 0 T... =x k Y =p or 0 k Montgomery Multiplier: XYR -1 mod p (R = 2 k2 ) 15/11/2007 0 =p or 0 0 0 0 0 0 0 0 - Delay is determined by d - Loop : (k2)/d XYR -1 mod n 12/38

Final reduction for prime field p XYR - 1 mod p Intermediate value, T = ( T x i y mp)/2 < 2p Final reduction: T = T p< p 13/38

Fixed number of additions of p XYR - 1 mod p Intermediate value, T = ( T x i y mp)/2 < 3p T = ( T mp)/2 < 2p 14/38

Modular Reduction from MSB (prime field) MSB- first bit- serial multiplier (prime field) X x k-1 y k-1 x 3 x 2 x 1 x 0 y 3 y 2 y 1 y 0 =X =Y X Y p - =x k-1 y = 2p, p, or 0 CSA-based Datapath k d - =x k-2 y = 2p, p, or 0 d T k.. =x 0 y X Y mod p - = 2p, p, or 0 - Delay is determined by d - Loop : (k2)/d 0 15/11/2007 0 0 0 0 0 0 0 A(x)B(x) mod P(x) 15/38

Division in MSB- first modular multiplication XY mod p Intermediate value, T = ( 2T x i y ) < 3p Division can be escaped with precomputation [Barrett Reduction; IEEE ToC, Knezevic, Vercauteren, Verbauwhede, 2010] 16/38

Barrett Reduction T/p = T/2 k- 1 * 2 k- 1 /p Step4 : q = T/2 k- 1 - - - shift operation (wiring) Step5 : T = ( T - q * 2 k- 1 /p ) - - - precomputation But q = 0, 1, or 2 17/38

Parallelism in Modular Multiplication Both MSB/LSB Modular Multiplication [MELECON, Batina et. al., 2004; CHES, Kaihara and Takagi, 2005] Combine MSB- first and LSB- first MM A(x)B(x)R - 1 mod P(x) = (A 1 (x)r A 0 )B(x)R - 1 mod P(x) = A 1 (x)b(x) mod P(x) MSB- first A 0 (x)b(x)r - 1 mod P(x) LSB- first R = x k/2 18/38

Datapath of Bipartite Modular Multiplication Bipartite bit- serial multiplier (binary field) X a k-1 b k-1 a 3 a 2 a 1 a 0 b 3 b 2 b 1 b 0 =A(x) =B(x) MSB-first =a k-1 B(x) =P(x) or 0 =a k-2 B(x) d =P(x) or 0 LSB-first d X a k-1 b k-1 a 0 a 1 a 2 a 3 b 0 b 1 b 2 b 3 =A(x) =B(x) =a 0 B(x) =P(x) or 0 =a 1 B(x) =P(x) or 0 A(x) B(x) P(x) XOR-based Datapath k d k.. =a 0 B(x) Stop in the middle =P(x) or 0.. =a k-1 B(x) Stop in the =P(x) middle or 0 k T(x) A 0 1 0(x)B(x) 0 0 0 0 0 0mod P(x) A(x) B(x) mod P(x) A 0 0 (x)b(x)r -1 0 mod 0 0 0 0P(x) 0 0 A(x) B(x) 2 -k mod P(x) A(x)B(x)R -1 mod P(x) - Parallelized (x2) - Loop: k/2d A(x)B(x)R -1 mod P(x) 19/38

Great Merits in Bipartite Modular Multiplication Parallelism in Modular Multiplication Applicable to prime field (original proposal) Need to take care (final) reductions Almost double performance, less than double area (F/Fs are shared) 20/38

Architecture for ECC coprocessor HW/SW co- design CPU coprocessor for ECC Speed performance depends on architecture Trade- off between cost and performance Local dedicated memory - > high performance Instruction- set extension, ASIC, etc. 21/38

ECC coprocessor (I) TYPE I: Smallest implementation (baseline) Coprocessor SRAM Main CPU Program Memory Mapped I/O ROM 32-bit data 32-bit instruction Buffer Full DBC FSM IBC IQB Data Bus Instruction Bus MALU 22/38

ECC coprocessor (II) TYPE II: TYPE I µ- coded RAM Coprocessor SRAM Main CPU Program Memory Mapped I/O ROM 32-bit data 32-bit instruction Buffer Full DBC Program µ-coded RAM FSM IBC IQB Data Bus Instruction Bus MALU 23/38

TYPE III: TYPE I CPRAM ECC coprocessor (III) Coprocessor SRAM Main CPU Program Memory Mapped I/O ROM 32-bit data 32-bit instruction Buffer Full DBC FSM IBC IQB Data Bus Instruction Bus MALU Coprocessor RAM (CPRAM) 24/38

ECC coprocessor (IV) TYPE IV: TYPE I CPRAM & µ- code RAM Coprocessor SRAM Main CPU Program Memory Mapped I/O ROM 32-bit data 32-bit instruction Buffer Full DBC Program µ-coded RAM FSM IBC IQB Data Bus Instruction Bus MALU Coprocessor RAM (CPRAM) 25/38

Profiling ECC coprocessor Performance for HECC over GF(2 83 ) The number of cycles is reduced by CPRAM Micro- coded RAM can avoid instruction stalls Type I Type II Type III SW I/O Overhead Datapath SW Datapath µ-coded RAM SW Datapath CPRAM Type IV SW Datapath µ-coded RAM CPRAM Required Clock Cycles [K] 500 400 300 200 100 0 2 859 2 672 0 67 0 67 187 I/O Transfer CPRAM Access Datapath 38 38 67 0 67 TYPE I TYPE II TYPE III TYPE IV System Configuration 26/38

Cost, Performance and Security Trade- offs p Programmable VS hardwired parameters p Curve parameters p Computational sequence: Coordinates, operation form. p Prime p, irreducible polynomial P(x) p Field sizes Performance (latency) Hardwired - compact & fast Programmable: - wider-range of applications - higher-security algorithms Cost (gate area) 27/38

Cost, Performance and Security Trade- offs p Programmable VS hardwired parameters p Curve parameters p Computational sequence: Coordinates, operation form. p Prime p, irreducible polynomial P(x) p Field sizes Performance (latency) Hardwired - compact & fast Fixing parameters Programmable: - wider-range of applications - higher-security algorithms For high-speed applications Cost (gate area) 28/38

Cost, Performance and Security Trade- offs p Programmable VS hardwired parameters p Curve parameters p Computational sequence: Coordinates, operation form. p Prime p, irreducible polynomial P(x) p Field sizes Performance (latency) Hardwired - compact & fast For RFID tags Fixing parameters Programmable: - wider-range of applications - higher-security algorithms Cost (gate area) 29/38

p Programmable high- speed coprocessor p ECC / HECC over GF(2 m ) p Six MALU cores (MALU) High- speed ECC & HECC Main CPU Memory Mapped I/O ROM RAM Data 32 32 Cryptoprocessor DBC Instruction Control µ-coded RAM Program Instruction FSM IQB Instruction Bus IBC Instruction Data Bus RF1 MALU MALU MALU MALU RF2 RF3 Kazuo RF6 SAKIYAMA -1 core -2 core -3 core -6 core 30/38

Parallelism in scalar multiplication p How many instructions executed in parallel? p Operation form AB C and A(BC) D p Check the data dependencies for ECC (and HECC) Number of Clock Cycles 90,000 70,000 50,000 30,000 1xMALU 2xMALU (a) AB C HECC over GF(2 83 ) ECC over GF(2 163 ) ECC_M over GF(2 163 ) 1xMALU (b) A(BD) C 2xMALU 10,000 4xMALU 4xMALU 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Number of Instructions to search ILP (ILPD)

Parallelism in scalar multiplication p How many instructions executed in parallel? p Operation form AB C and A(BC) D p Check the data dependencies for ECC (and HECC) Number of Clock Cycles 90,000 70,000 50,000 30,000 1xMALUb (a) AB C HECC over GF(2 83 ) ECC over GF(2 163 ) ECC_M over GF(2 163 ) (b) A(BD) C 1xMALUb 2xMALUb 2xMALUb 4xMALUb 4xMALUb 10,000 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Number of Instructions to search ILP (ILPD)

Lightweight ECC For RFID p Low- power MALU for RFID tags p Based on MALU p Cost and performance for ECC and HECC p Power estimation with 0.13- mm CMOS library p Hardware sharing in MALU p Modular addition and multiplication use the same hardware (cell) 33/38

Low- power ECC (HECC) for RFID p Synthesis results (area & performance) p MALU hardwired controller p ECC shows better performance (106 ms @ 500kHz, d = 4) p ECC over a composite field is smaller 10000 HECC over GF(2 m ) d = 4 600,000 d = 1 HECC over GF(2 m ) Area [gate] 9000 8000 7000 6000 5000 d = 8 d = 6 d = 4 d = 8 d = 6 d = 4 d = 2 d = 1 ECC over GF((2 m ) 2 ) ECC over GF(2 m ) d = 3 d = 2 d = 1 Performance [cycle] 500,000 400,000 300,000 200,000 100,000 d = 4 d = 6 d = 8 d = 2 d = 4 d = 6 d = 8 ECC over GF((2 m ) 2 ) ECC over GF(2 m ) d = 1 d = 2 d = 3 d = 2 d = 1 4000 50 70 90 110 130 150 170 Field length [bit] d = 4 0 50 70 90 110 130 150 170 Field length [bit] 34/38

Low- power ECC (HECC) for RFID p Synthesis results (power) p Results only for coprocessor (average) p Peak power evaluation is essential p RF interface will consume more power 30 25 Clock Frequency = 500 khz HECC over GF(2 m ) ECC over GF((2 m ) 2 ) ECC over GF(2 m ) d = 4 d = 3 d = 2 25 20 Dynamic Leakage Clock Frequency = 500 khz d = 1 Power [uw] 20 d = 8 d = 6 d = 4 d = 2 d = 1 Power [uw] 15 10 15 d = 8 d = 6 d = 4 d = 2 d = 1 10 50 70 90 110 130 150 170 Field length [bit] 5 0 ECC over GF(2 131 ) d = 4 ECC over GF((2 67 ) 2 ) d = 8 HECC over GF(2 67 ) d = 8 35/38

ECC coprocessor & Side- channel attacks p Four architectures (a) TYPE A: Baseline 32 Coprocessor Instructions MicroBlaze Core OPB Interrupt Data Data Memory Program Memory 32 Coprocessor DBC 32 Coprocessor Instructions MicroBlaze Core OPB Interrupt Data Data Memory Program Memory 32 Coprocessor DBC (b) TYPE B: µ- coded ROM 32-bit Inst. Bus (a) 163-bit Data Bus MALUb RF 32-bit Inst. Bus u-code (b) 163-bit Data Bus MALUb RF (c) TYPE C: HW key register (d) TYPE D: RND for key randomization 32 Coprocessor Instructions MicroBlaze Core MALUb Data Memory Program Memory Interrupt OPB 32-bit Inst. Bus u-code Data 32 Coprocessor DBC 163-bit Data Bus Key (c) RF 32 Coprocessor Instructions MicroBlaze Core Interrupt OPB 32-bit Inst. Bus RNG u-code Key Randomization Data 32 DBC MALUb Data Memory Program Memory Coprocessor 163-bit Data Bus Key (d) RF 36/38

p Trade- offs in ECC hardware p Parallel processing p Operation forms: AB C or A(BD) C p Cost, performance, security Conclusions Scalar Multiplication Cost Point Addition Point Doubling Performance Security Finite Field Addition Finite Field Multiplication Finite Field Inversion 37/38

Thank you for your attentions! Questions? P P y Q x y 2 = x 3 x 1 R R