Hardware implementations of ECC

Similar documents
A Simple Architectural Enhancement for Fast and Flexible Elliptic Curve Cryptography over Binary Finite Fields GF(2 m )


Are standards compliant Elliptic Curve Cryptosystems feasible on RFID?

PUBLIC-KEY cryptography (PKC), a concept introduced

Dual-Field Arithmetic Unit for GF(p) and GF(2 m ) *

Elliptic Curve Cryptography and Security of Embedded Devices

Tripartite Modular Multiplication

Other Public-Key Cryptosystems

Speeding Up Bipartite Modular Multiplication

Efficient Finite Field Multiplication for Isogeny Based Post Quantum Cryptography

Low-Resource and Fast Elliptic Curve Implementations over Binary Edwards Curves

GF(2 m ) arithmetic: summary

A Bit-Serial Unified Multiplier Architecture for Finite Fields GF(p) and GF(2 m )

Arithmetic in Integer Rings and Prime Fields

Efficient Hardware Implementation of Finite Fields with Applications to Cryptography

Hardware Implementation of Elliptic Curve Point Multiplication over GF (2 m ) for ECC protocols

Accelerating AES Using Instruction Set Extensions for Elliptic Curve Cryptography. Stefan Tillich, Johann Großschädl

Definition of a finite group

Efficient and Secure Algorithms for GLV-Based Scalar Multiplication and Their Implementation on GLV-GLS Curves

Other Public-Key Cryptosystems

FPGA-Based Elliptic Curve Cryptography for RFID Tag Using Verilog

Power Consumption Analysis. Arithmetic Level Countermeasures for ECC Coprocessor. Arithmetic Operators for Cryptography.

Arithmetic operators for pairing-based cryptography

Modular Multiplication in GF (p k ) using Lagrange Representation

Faster F p -arithmetic for Cryptographic Pairings on Barreto-Naehrig Curves

An Efficient Multiplier/Divider Design for Elliptic Curve Cryptosystem over GF(2 m ) *

Lecture 8: Sequential Multipliers

Modular Reduction without Pre-Computation for Special Moduli

Finite Fields. SOLUTIONS Network Coding - Prof. Frank H.P. Fitzek

Elliptic Curves I. The first three sections introduce and explain the properties of elliptic curves.

Efficient randomized regular modular exponentiation using combined Montgomery and Barrett multiplications

Mathematical Foundations of Cryptography

8 Elliptic Curve Cryptography

An Optimized Hardware Architecture of Montgomery Multiplication Algorithm

Montgomery Multiplication in GF(2 k ) * **

AN IMPROVED LOW LATENCY SYSTOLIC STRUCTURED GALOIS FIELD MULTIPLIER

EECS150 - Digital Design Lecture 21 - Design Blocks

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates

Compact Ring LWE Cryptoprocessor

Arithmetic Operators for Pairing-Based Cryptography

International Journal of Advanced Computer Technology (IJACT)

HARDWARE REALIZATION OF HIGH SPEED ELLIPTIC CURVE POINT MULTIPLICATION USING PRECOMPUTATION OVER GF(p)

Hardware Implementation of an Elliptic Curve Processor over GF(p)

A Hyperelliptic Curve Crypto Coprocessor for an 8051 Microcontroller

A Multiple Bit Parity Fault Detection Scheme for The Advanced Encryption Standard Galois/ Counter Mode

Mathematics of Cryptography

Optimal Extension Field Inversion in the Frequency Domain

Montgomery Multiplier and Squarer in GF(2 m )

Fast Algorithm in ECC for Wireless Sensor Network

Chapter 4 Finite Fields

Elliptic Curve Cryptography

Formal Fault Analysis of Branch Predictors: Attacking countermeasures of Asymmetric key ciphers

Arithmetic Operators for Pairing-Based Cryptography

A New Bit-Serial Architecture for Field Multiplication Using Polynomial Bases

Efficient Hardware Architecture for Scalar Multiplications on Elliptic Curves over Prime Field

EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs)

Hardware Architectures for Public Key Algorithms Requirements and Solutions for Today and Tomorrow

Chapter 10 Elliptic Curves in Cryptography

Hardware Architectures of Elliptic Curve Based Cryptosystems over Binary Fields

Faster ECC over F 2. (feat. PMULL)

An Elliptic Curve Processor Suitable For RFID Tags

A low-time-complexity and secure dual-field scalar multiplication based on co-z protected NAF

A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte

Side-channel attacks on PKC and countermeasures with contributions from PhD students

Galois Fields and Hardware Design

Arithmétique et Cryptographie Asymétrique

Optimal Use of Montgomery Multiplication on Smart Cards

Novel Implementation of Finite Field Multipliers over GF(2m) for Emerging Cryptographic Applications

Julio López and Ricardo Dahab. Institute of Computing (IC) UNICAMP. April,

Error-free protection of EC point multiplication by modular extension

ISSN (PRINT): , (ONLINE): , VOLUME-5, ISSUE-7,

Proceedings, 13th Symposium on Computer Arithmetic, T. Lang, J.-M. Muller, and N. Takagi, editors, pages , Asilomar, California, July 6-9,

Power Analysis to ECC Using Differential Power between Multiplication and Squaring

Flexible Prime-Field Genus 2 Hyperelliptic Curve Cryptography Processor with Low Power Consumption and Uniform Power Draw

Hardware Implementation of Elliptic Curve Processor over GF (p)

Affine Precomputation with Sole Inversion in Elliptic Curve Cryptography

DESPITE considerable progress in verification of random

Polynomial Interpolation in the Elliptic Curve Cryptosystem

Outline. EECS Components and Design Techniques for Digital Systems. Lec 18 Error Coding. In the real world. Our beautiful digital world.

Hardware Implementation of Elliptic Curve Cryptography over Binary Field

Implementation of ECM Using FPGA devices. ECE646 Dr. Kris Gaj Mohammed Khaleeluddin Hoang Le Ramakrishna Bachimanchi

Public Key Algorithms

Subquadratic space complexity multiplier for a class of binary fields using Toeplitz matrix approach

Hardware Implementation of Efficient Modified Karatsuba Multiplier Used in Elliptic Curves

New Bit-Level Serial GF (2 m ) Multiplication Using Polynomial Basis

Représentation RNS des nombres et calcul de couplages

Logic and Computer Design Fundamentals. Chapter 8 Sequencing and Control

Lecture 8. Sequential Multipliers

FPGA Realization of Low Register Systolic All One-Polynomial Multipliers Over GF (2 m ) and their Applications in Trinomial Multipliers

VLSI Implementation of Double-Base Scalar Multiplication on a Twisted Edwards Curve with an Efficiently Computable Endomorphism

ABHELSINKI UNIVERSITY OF TECHNOLOGY

Montgomery Arithmetic from a Software Perspective *

FPGA accelerated multipliers over binary composite fields constructed via low hamming weight irreducible polynomials

Finite Field Multiplier Architectures for

Analysis of Parallel Montgomery Multiplication in CUDA

Low-Weight Polynomial Form Integers for Efficient Modular Multiplication

Linear Feedback Shift Registers (LFSRs) 4-bit LFSR

Square Always Exponentiation

Discrete logarithm and related schemes

Low complexity bit-parallel GF (2 m ) multiplier for all-one polynomials

Transcription:

Hardware implementations of ECC The University of Electro- Communications

Introduction Public- key Cryptography (PKC) The most famous PKC is RSA and ECC Used for key agreement (Diffie- Hellman), digital signatures, public- key encryption Implementation of public- key cryptosystem Slower encryption/decryption (~Mbps) than symmetric key cryptography (AES: ~Gbps) RSA: modular exponentiation, M e ECC: point multiplication, kp on elliptic curve 2/38

Point operations on elliptic curve Point Addition R = P Q Point Doubling y R = 2P P Tangent line Scalar Mult. P Q x kp Binary method, etc. y 2 = x 3 x 1 R R 3/38

Weighted projective coordinate Example of point operations All operations are over GF(p) 4/38

Data Size in Modular Operations Typical key lengths of public- key cryptography RSA w/crt: 1024 bits ECC: ~256 bits RSA: 2048 bits RSA: M e mod n ECC: kp over GF(p) and GF(2 m ) 5/38

# of Modular Multiplications ECC needs more modular multiplications RSA (2048-bit Modular Mult.) x 2048 x 1.5 CRT (1024-bit Modular Mult.) x 1024 x 1.5 x 2 ECC (256-bit Modular Mult.) x 160 x 20 x 1.5 RSA: M e mod n ECC: kp over GF(p) and GF(2 m ) 6/38

Parallelism in high- speed implementation Degree of parallelism RSA = RSA RSA Parallelism in Square and Multiply 2048 x 1 CRT = CRT CRT = CRT CRT CRT CRT Parallelism in CRT 1024 x 1.5 x 1 Parallelism in Square and Multiply 1024 x 1 x 1 ECC ECC = = ECC ECC ECC ECC ECC How much Parallelism in scalar multiplication / Add and Double?? 7/38

Basic hierarchy for ECC Optimization possible in multi- layer Scalar Multiplication Point Addition Point Doubling Finite Field Addition Finite Field Multiplication Finite Field Inversion 8/38

Modular Multiplication = Multiplication Reduction Barrett Reduction (1986): Focus on MSBs Montgomery Reduction (1985): Focus on LSBs Based on n- bit multiplier (n = 8, 16, 32, 64) E.g., CIOS [IEEE Micro KKo c and Kaliski Jr 1996]. Based on bit- serial multiplier Multiplicand x n- bit [CHES Großscha dl 2001] 9/38

Modular Reduction from LSB (binary field) LSB- first bit- serial multiplier (binary fields) X a k-1 b k-1 a 3 b 3 a 2 b 2 a 1 b 1 a 0 b 0 =A(x) =B(x) A(x) B(x) P(x) =a 0 B(x) =P(x) or 0 XOR-based Datapath k d d =a 1 B(x) =P(x) or 0 T(x)... k =a k-1 B(x) =P(x) or 0 Montgomery Multiplier: A(x)B(x)R -1 mod P(x) (R = x k ) 15/11/2007 A(x)B(x)R -k mod P(x) 0 0 0 0 0 0 0 0 - Delay is determined by d - Loop : k/d 10/38

Modular Reduction from MSB (binary field) MSB- first bit- serial multiplier (binary field) X a k-1 b k-1 a 3 a 2 a 1 a 0 b 3 b 2 b 1 b 0 =A(x) =B(x) A(x) B(x) P(x) =a k-1 B(x) =P(x) or 0 XOR-based Datapath k d =a k-2 B(x) =P(x) or 0 d T(x) k.. =a 0 B(x) A(x)B(x) mod P(x) =P(x) or 0 - Delay is determined by d - Loop : k/d 0 15/11/2007 0 0 0 0 0 0 0 A(x)B(x) mod P(x) 11/38

Modular Reduction from LSB (prime field) LSB- first bit- serial multiplier ( prime field ) X x k y k x k-1 y k-1 x 3 y 3 x 2 y 2 x 1 y 1 x 0 y 0 =X =Y X Y p =x 0 Y =p or 0 CSA-based Datapath k d d =x 1 Y =p or 0 T... =x k Y =p or 0 k Montgomery Multiplier: XYR -1 mod p (R = 2 k2 ) 15/11/2007 0 =p or 0 0 0 0 0 0 0 0 - Delay is determined by d - Loop : (k2)/d XYR -1 mod n 12/38

Final reduction for prime field p XYR - 1 mod p Intermediate value, T = ( T x i y mp)/2 < 2p Final reduction: T = T p< p 13/38

Fixed number of additions of p XYR - 1 mod p Intermediate value, T = ( T x i y mp)/2 < 3p T = ( T mp)/2 < 2p 14/38

Modular Reduction from MSB (prime field) MSB- first bit- serial multiplier (prime field) X x k-1 y k-1 x 3 x 2 x 1 x 0 y 3 y 2 y 1 y 0 =X =Y X Y p - =x k-1 y = 2p, p, or 0 CSA-based Datapath k d - =x k-2 y = 2p, p, or 0 d T k.. =x 0 y X Y mod p - = 2p, p, or 0 - Delay is determined by d - Loop : (k2)/d 0 15/11/2007 0 0 0 0 0 0 0 A(x)B(x) mod P(x) 15/38

Division in MSB- first modular multiplication XY mod p Intermediate value, T = ( 2T x i y ) < 3p Division can be escaped with precomputation [Barrett Reduction; IEEE ToC, Knezevic, Vercauteren, Verbauwhede, 2010] 16/38

Barrett Reduction T/p = T/2 k- 1 * 2 k- 1 /p Step4 : q = T/2 k- 1 - - - shift operation (wiring) Step5 : T = ( T - q * 2 k- 1 /p ) - - - precomputation But q = 0, 1, or 2 17/38

Parallelism in Modular Multiplication Both MSB/LSB Modular Multiplication [MELECON, Batina et. al., 2004; CHES, Kaihara and Takagi, 2005] Combine MSB- first and LSB- first MM A(x)B(x)R - 1 mod P(x) = (A 1 (x)r A 0 )B(x)R - 1 mod P(x) = A 1 (x)b(x) mod P(x) MSB- first A 0 (x)b(x)r - 1 mod P(x) LSB- first R = x k/2 18/38

Datapath of Bipartite Modular Multiplication Bipartite bit- serial multiplier (binary field) X a k-1 b k-1 a 3 a 2 a 1 a 0 b 3 b 2 b 1 b 0 =A(x) =B(x) MSB-first =a k-1 B(x) =P(x) or 0 =a k-2 B(x) d =P(x) or 0 LSB-first d X a k-1 b k-1 a 0 a 1 a 2 a 3 b 0 b 1 b 2 b 3 =A(x) =B(x) =a 0 B(x) =P(x) or 0 =a 1 B(x) =P(x) or 0 A(x) B(x) P(x) XOR-based Datapath k d k.. =a 0 B(x) Stop in the middle =P(x) or 0.. =a k-1 B(x) Stop in the =P(x) middle or 0 k T(x) A 0 1 0(x)B(x) 0 0 0 0 0 0mod P(x) A(x) B(x) mod P(x) A 0 0 (x)b(x)r -1 0 mod 0 0 0 0P(x) 0 0 A(x) B(x) 2 -k mod P(x) A(x)B(x)R -1 mod P(x) - Parallelized (x2) - Loop: k/2d A(x)B(x)R -1 mod P(x) 19/38

Great Merits in Bipartite Modular Multiplication Parallelism in Modular Multiplication Applicable to prime field (original proposal) Need to take care (final) reductions Almost double performance, less than double area (F/Fs are shared) 20/38

Architecture for ECC coprocessor HW/SW co- design CPU coprocessor for ECC Speed performance depends on architecture Trade- off between cost and performance Local dedicated memory - > high performance Instruction- set extension, ASIC, etc. 21/38

ECC coprocessor (I) TYPE I: Smallest implementation (baseline) Coprocessor SRAM Main CPU Program Memory Mapped I/O ROM 32-bit data 32-bit instruction Buffer Full DBC FSM IBC IQB Data Bus Instruction Bus MALU 22/38

ECC coprocessor (II) TYPE II: TYPE I µ- coded RAM Coprocessor SRAM Main CPU Program Memory Mapped I/O ROM 32-bit data 32-bit instruction Buffer Full DBC Program µ-coded RAM FSM IBC IQB Data Bus Instruction Bus MALU 23/38

TYPE III: TYPE I CPRAM ECC coprocessor (III) Coprocessor SRAM Main CPU Program Memory Mapped I/O ROM 32-bit data 32-bit instruction Buffer Full DBC FSM IBC IQB Data Bus Instruction Bus MALU Coprocessor RAM (CPRAM) 24/38

ECC coprocessor (IV) TYPE IV: TYPE I CPRAM & µ- code RAM Coprocessor SRAM Main CPU Program Memory Mapped I/O ROM 32-bit data 32-bit instruction Buffer Full DBC Program µ-coded RAM FSM IBC IQB Data Bus Instruction Bus MALU Coprocessor RAM (CPRAM) 25/38

Profiling ECC coprocessor Performance for HECC over GF(2 83 ) The number of cycles is reduced by CPRAM Micro- coded RAM can avoid instruction stalls Type I Type II Type III SW I/O Overhead Datapath SW Datapath µ-coded RAM SW Datapath CPRAM Type IV SW Datapath µ-coded RAM CPRAM Required Clock Cycles [K] 500 400 300 200 100 0 2 859 2 672 0 67 0 67 187 I/O Transfer CPRAM Access Datapath 38 38 67 0 67 TYPE I TYPE II TYPE III TYPE IV System Configuration 26/38

Cost, Performance and Security Trade- offs p Programmable VS hardwired parameters p Curve parameters p Computational sequence: Coordinates, operation form. p Prime p, irreducible polynomial P(x) p Field sizes Performance (latency) Hardwired - compact & fast Programmable: - wider-range of applications - higher-security algorithms Cost (gate area) 27/38

Cost, Performance and Security Trade- offs p Programmable VS hardwired parameters p Curve parameters p Computational sequence: Coordinates, operation form. p Prime p, irreducible polynomial P(x) p Field sizes Performance (latency) Hardwired - compact & fast Fixing parameters Programmable: - wider-range of applications - higher-security algorithms For high-speed applications Cost (gate area) 28/38

Cost, Performance and Security Trade- offs p Programmable VS hardwired parameters p Curve parameters p Computational sequence: Coordinates, operation form. p Prime p, irreducible polynomial P(x) p Field sizes Performance (latency) Hardwired - compact & fast For RFID tags Fixing parameters Programmable: - wider-range of applications - higher-security algorithms Cost (gate area) 29/38

p Programmable high- speed coprocessor p ECC / HECC over GF(2 m ) p Six MALU cores (MALU) High- speed ECC & HECC Main CPU Memory Mapped I/O ROM RAM Data 32 32 Cryptoprocessor DBC Instruction Control µ-coded RAM Program Instruction FSM IQB Instruction Bus IBC Instruction Data Bus RF1 MALU MALU MALU MALU RF2 RF3 Kazuo RF6 SAKIYAMA -1 core -2 core -3 core -6 core 30/38

Parallelism in scalar multiplication p How many instructions executed in parallel? p Operation form AB C and A(BC) D p Check the data dependencies for ECC (and HECC) Number of Clock Cycles 90,000 70,000 50,000 30,000 1xMALU 2xMALU (a) AB C HECC over GF(2 83 ) ECC over GF(2 163 ) ECC_M over GF(2 163 ) 1xMALU (b) A(BD) C 2xMALU 10,000 4xMALU 4xMALU 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Number of Instructions to search ILP (ILPD)

Parallelism in scalar multiplication p How many instructions executed in parallel? p Operation form AB C and A(BC) D p Check the data dependencies for ECC (and HECC) Number of Clock Cycles 90,000 70,000 50,000 30,000 1xMALUb (a) AB C HECC over GF(2 83 ) ECC over GF(2 163 ) ECC_M over GF(2 163 ) (b) A(BD) C 1xMALUb 2xMALUb 2xMALUb 4xMALUb 4xMALUb 10,000 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Number of Instructions to search ILP (ILPD)

Lightweight ECC For RFID p Low- power MALU for RFID tags p Based on MALU p Cost and performance for ECC and HECC p Power estimation with 0.13- mm CMOS library p Hardware sharing in MALU p Modular addition and multiplication use the same hardware (cell) 33/38

Low- power ECC (HECC) for RFID p Synthesis results (area & performance) p MALU hardwired controller p ECC shows better performance (106 ms @ 500kHz, d = 4) p ECC over a composite field is smaller 10000 HECC over GF(2 m ) d = 4 600,000 d = 1 HECC over GF(2 m ) Area [gate] 9000 8000 7000 6000 5000 d = 8 d = 6 d = 4 d = 8 d = 6 d = 4 d = 2 d = 1 ECC over GF((2 m ) 2 ) ECC over GF(2 m ) d = 3 d = 2 d = 1 Performance [cycle] 500,000 400,000 300,000 200,000 100,000 d = 4 d = 6 d = 8 d = 2 d = 4 d = 6 d = 8 ECC over GF((2 m ) 2 ) ECC over GF(2 m ) d = 1 d = 2 d = 3 d = 2 d = 1 4000 50 70 90 110 130 150 170 Field length [bit] d = 4 0 50 70 90 110 130 150 170 Field length [bit] 34/38

Low- power ECC (HECC) for RFID p Synthesis results (power) p Results only for coprocessor (average) p Peak power evaluation is essential p RF interface will consume more power 30 25 Clock Frequency = 500 khz HECC over GF(2 m ) ECC over GF((2 m ) 2 ) ECC over GF(2 m ) d = 4 d = 3 d = 2 25 20 Dynamic Leakage Clock Frequency = 500 khz d = 1 Power [uw] 20 d = 8 d = 6 d = 4 d = 2 d = 1 Power [uw] 15 10 15 d = 8 d = 6 d = 4 d = 2 d = 1 10 50 70 90 110 130 150 170 Field length [bit] 5 0 ECC over GF(2 131 ) d = 4 ECC over GF((2 67 ) 2 ) d = 8 HECC over GF(2 67 ) d = 8 35/38

ECC coprocessor & Side- channel attacks p Four architectures (a) TYPE A: Baseline 32 Coprocessor Instructions MicroBlaze Core OPB Interrupt Data Data Memory Program Memory 32 Coprocessor DBC 32 Coprocessor Instructions MicroBlaze Core OPB Interrupt Data Data Memory Program Memory 32 Coprocessor DBC (b) TYPE B: µ- coded ROM 32-bit Inst. Bus (a) 163-bit Data Bus MALUb RF 32-bit Inst. Bus u-code (b) 163-bit Data Bus MALUb RF (c) TYPE C: HW key register (d) TYPE D: RND for key randomization 32 Coprocessor Instructions MicroBlaze Core MALUb Data Memory Program Memory Interrupt OPB 32-bit Inst. Bus u-code Data 32 Coprocessor DBC 163-bit Data Bus Key (c) RF 32 Coprocessor Instructions MicroBlaze Core Interrupt OPB 32-bit Inst. Bus RNG u-code Key Randomization Data 32 DBC MALUb Data Memory Program Memory Coprocessor 163-bit Data Bus Key (d) RF 36/38

p Trade- offs in ECC hardware p Parallel processing p Operation forms: AB C or A(BD) C p Cost, performance, security Conclusions Scalar Multiplication Cost Point Addition Point Doubling Performance Security Finite Field Addition Finite Field Multiplication Finite Field Inversion 37/38

Thank you for your attentions! Questions? P P y Q x y 2 = x 3 x 1 R R