A Digit-Serial Systolic Multiplier for Finite Fields GF(2 m )

Similar documents
AN IMPROVED LOW LATENCY SYSTOLIC STRUCTURED GALOIS FIELD MULTIPLIER

A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte

FPGA accelerated multipliers over binary composite fields constructed via low hamming weight irreducible polynomials

GF(2 m ) arithmetic: summary

FPGA Realization of Low Register Systolic All One-Polynomial Multipliers Over GF (2 m ) and their Applications in Trinomial Multipliers

A New Bit-Serial Architecture for Field Multiplication Using Polynomial Bases

Transformation Techniques for Real Time High Speed Implementation of Nonlinear Algorithms

Transition Faults Detection in Bit Parallel Multipliers over GF(2 m )

Low complexity bit-parallel GF (2 m ) multiplier for all-one polynomials

Simplification of Procedure for Decoding Reed- Solomon Codes Using Various Algorithms: An Introductory Survey

THE UNIVERSITY OF MICHIGAN. Faster Static Timing Analysis via Bus Compression

An Area Efficient Enhanced Carry Select Adder

Efficient Hardware Calculation of Inverses in GF (2 8 )

Implementation of Carry Look-Ahead in Domino Logic

Low Energy Digit-serial Architectures for large GF(2 m ) multiplication

ISSN (PRINT): , (ONLINE): , VOLUME-4, ISSUE-10,

IMPLEMENTATION OF PROGRAMMABLE LOGIC DEVICES IN QUANTUM CELLULAR AUTOMATA TECHNOLOGY

An Algorithm for Inversion in GF(2 m ) Suitable for Implementation Using a Polynomial Multiply Instruction on GF(2)

Novel Implementation of Finite Field Multipliers over GF(2m) for Emerging Cryptographic Applications

EECS150 - Digital Design Lecture 21 - Design Blocks

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Design and Comparison of Wallace Multiplier Based on Symmetric Stacking and High speed counters

An Efficient Multiplier/Divider Design for Elliptic Curve Cryptosystem over GF(2 m ) *

Area-Time Optimal Adder with Relative Placement Generator


AREA EFFICIENT MODULAR ADDER/SUBTRACTOR FOR RESIDUE MODULI

ABHELSINKI UNIVERSITY OF TECHNOLOGY

Design and FPGA Implementation of Radix-10 Algorithm for Division with Limited Precision Primitives

Subquadratic Computational Complexity Schemes for Extended Binary Field Multiplication Using Optimal Normal Bases

Scalable Systolic Structure to Realize Arbitrary Reversible Symmetric Functions

STUDY AND IMPLEMENTATION OF MUX BASED FPGA IN QCA TECHNOLOGY

DESIGN OF QUANTIZED FIR FILTER USING COMPENSATING ZEROS

Instruction Set Extensions for Reed-Solomon Encoding and Decoding

Novel Modulo 2 n +1Multipliers

Montgomery Multiplier and Squarer in GF(2 m )

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Cost/Performance Tradeoff of n-select Square Root Implementations

A new class of irreducible pentanomials for polynomial based multipliers in binary fields

VLSI Signal Processing

Subquadratic space complexity multiplier for a class of binary fields using Toeplitz matrix approach

A High-Speed Realization of Chinese Remainder Theorem

PAPER A Low-Complexity Step-by-Step Decoding Algorithm for Binary BCH Codes

International Journal of Advanced Computer Technology (IJACT)

I. INTRODUCTION. CMOS Technology: An Introduction to QCA Technology As an. T. Srinivasa Padmaja, C. M. Sri Priya

Test Pattern Generator for Built-in Self-Test using Spectral Methods

Subquadratic Space Complexity Multiplication over Binary Fields with Dickson Polynomial Representation

Finite Fields. SOLUTIONS Network Coding - Prof. Frank H.P. Fitzek

Low Power, High Speed Parallel Architecture For Cyclic Convolution Based On Fermat Number Transform (FNT)

Outline. EECS Components and Design Techniques for Digital Systems. Lec 18 Error Coding. In the real world. Our beautiful digital world.

DESİGN AND ANALYSİS OF FULL ADDER CİRCUİT USİNG NANOTECHNOLOGY BASED QUANTUM DOT CELLULAR AUTOMATA (QCA)

New Bit-Level Serial GF (2 m ) Multiplication Using Polynomial Basis

FPGA BASED DESIGN OF PARALLEL CRC GENERATION FOR HIGH SPEED APPLICATION

Pipelining and Parallel Processing

Counting Two-State Transition-Tour Sequences

An Effective New CRT Based Reverse Converter for a Novel Moduli Set { 2 2n+1 1, 2 2n+1, 2 2n 1 }

Modular Multiplication in GF (p k ) using Lagrange Representation

Revisiting Finite Field Multiplication Using Dickson Bases

A Suggestion for a Fast Residue Multiplier for a Family of Moduli of the Form (2 n (2 p ± 1))

Analysis and Synthesis of Weighted-Sum Functions

Lecture 8: Sequential Multipliers

Median architecture by accumulative parallel counters

Retiming. delay elements in a circuit without affecting the input/output characteristics of the circuit.

A VLSI Algorithm for Modular Multiplication/Division

Volume 3, No. 1, January 2012 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at

The equivalence of twos-complement addition and the conversion of redundant-binary to twos-complement numbers

Design and Implementation of Efficient Modulo 2 n +1 Adder

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

A New Division Algorithm Based on Lookahead of Partial-Remainder (LAPR) for High-Speed/Low-Power Coding Applications

DESIGN OF LOW POWER-DELAY PRODUCT CARRY LOOK AHEAD ADDER USING MANCHESTER CARRY CHAIN

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates

KEYWORDS: Multiple Valued Logic (MVL), Residue Number System (RNS), Quinary Logic (Q uin), Quinary Full Adder, QFA, Quinary Half Adder, QHA.

VHDL Implementation of Reed Solomon Improved Encoding Algorithm

Vidyalankar S.E. Sem. III [CMPN] Digital Logic Design and Analysis Prelim Question Paper Solution

Dual-Field Arithmetic Unit for GF(p) and GF(2 m ) *

DIGITAL TECHNICS. Dr. Bálint Pődör. Óbuda University, Microelectronics and Technology Institute

Chapter 2 Basic Arithmetic Circuits

Design and Implementation of Carry Tree Adders using Low Power FPGAs

Chapter 5. Cyclic Codes

Design of Low Power, High Speed Parallel Architecture of Cyclic Convolution Based on Fermat Number Transform (FNT)

Global Optimization of Common Subexpressions for Multiplierless Synthesis of Multiple Constant Multiplications

Performance Enhancement of Reversible Binary to Gray Code Converter Circuit using Feynman gate

Factorization of singular integer matrices

FAST FIR ALGORITHM BASED AREA-EFFICIENT PARALLEL FIR DIGITAL FILTER STRUCTURES

Hardware implementations of ECC

Svoboda-Tung Division With No Compensation

Construction of a reconfigurable dynamic logic cell

Are standards compliant Elliptic Curve Cryptosystems feasible on RFID?

A Novel LUT Using Quaternary Logic

High-Speed Polynomial Basis Multipliers Over for Special Pentanomials. José L. Imaña IEEE

Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator

High Speed Time Efficient Reversible ALU Based Logic Gate Structure on Vertex Family

FPGA Implementation of a Predictive Controller

CORDIC, Divider, Square Root

An Optimized Hardware Architecture of Montgomery Multiplication Algorithm

Galois Field Algebra and RAID6. By David Jacob

ARITHMETIC COMBINATIONAL MODULES AND NETWORKS

VLSI Architecture of Euclideanized BM Algorithm for Reed-Solomon Code

VLSI. Faculty. Srikanth

National Taiwan University Taipei, 106 Taiwan 2 Department of Computer Science and Information Engineering

On Random Pattern Testability of Cryptographic VLSI Cores

Transcription:

A Digit-Serial Systolic Multiplier for Finite Fields GF( m ) Chang Hoon Kim, Sang Duk Han, and Chun Pyo Hong Department of Computer and Information Engineering Taegu University 5 Naeri, Jinryang, Kyungsan, Kyungbuk,7-74 KOREA Abstract: In this paper, a digit-serial systolic array is proposed for computing modular multiplication A(x), B(x) mod G(x) in finite fields GF( m ) with the standard basis representation. From the multiplication algorithm in GF( m ), we obtain a new dependence graph and design an efficient digit-serial systolic multiplier. If input data come in continuously, the proposed array can produce multiplication results at a rate of one every [m/l] clock cycles, where L is the selected digit size. The analysis results show that the proposed architecture leads to a reduction of computational delay time and it has much more simple structure than existing digit-serial systolic multiplier. Furthermore, since the new architecture has the features of regularity, modularity, and unidirectional data flow, it is well suited to VLSI implementation with fault-tolerant design. Key-Words : Digit-Serial Architecture, Finite Field Multiplier, Cryptography, Systolic Array, Finite Field Arithmetic, VLSI. Introduction Finite or Galois Field(GF)( m ) have played an important role in many application areas of communication, such as error-correcting code [] and cryptography [9]. Because addition in GF( m ) is bit independent XOR operation, it can be implemented in fast and inexpensive way. On the other hand, multiplication is more complicated and expensive. Futhermore, it is the most common arithmetic operation in GF( m ). Thus, it is desirable to design hardware-efficient multiplier for GF( m ) to meet the real-time requirement with minimum hardware complexity [3]. Many approaches and architectures have been proposed to perform multiplication in GF( m ) [][3][4][5][4]. The multipliers for GF( m ) can be classified into four types: bit-parallel, bit-serial, super-serial and digit-serial architectures. Basically a bit-parallel system reaches much better throughput performance than the others, but it involves much more hardware complexity. To improve the trade-off between throughput performance and hardware complexity, digit-serial architecture have been proposed [3][7][8]. For a digit-serial system, the data words are first partitioned This work was supported by grant No. --5-- from the Basic Research Program of the Korea Science & Engineering Foundation into digits of some bits each, and then processed and transmitted on a digit-by-digit basis. Suppose that the word size is m-bit, the digit size is L-bit, and N = [m/l], then bit-parallel, bit-serial and super-serial systems process the input data at a rate of m-bit, one-bit and less than one-bit per clock cycle respectively, while digit-serial system processes the input data at a rate of L-bit per clock cycle. In other words, digit-serial system will yield output results at a rate of one every N clock cycles, while bit-parallel, bit-serial and super-serial systems will yield output results at a rate of one, m and more than m clock cycles, respectively. If the digit size is chosen appropriately, a digit-serial architecture can meet the throughput requirement of a certain application with minimum hardware. In this paper, we first review the multiplication algorithm in GF( m ) and then derive a new dependence graph(dg). Based on a new DG, we propose an efficient digit-serial systolic multiplier and its architecture leads to a reduction of computational delay time and it shows more simple structure compared to Guo and Wang s architecture [3]. In addition, if the input data come in continuously, the proposed array can produce results at a rate of one per N cycles after an initial delay of 3N clock cycles, which is the same as the previous research [3]. Finally, since the new architecture has the features of

regularity, modularity, and unidirectional data flow, it is well suited to VLSI implementation with fault-tolerant design. The Multiplication Algorithm In this section, we first review the multiplication algorithm [][5]. Let A(x) and B(x) be two elements in GF( m ), G(x) be the primitive polynomial used to generate the field and P(x) be the result of the product A(x)B(x) mod G(x). For each polynomials, the coefficients are the binary digits and. A(x)=a m- x m- +a m- x m- + +a x +a B(x)=b m- x m- +b m- x m- + +b x +b G(x)= x m +g m- x m- +g m- x m- + +g x+g P(x)= p m- x m- +p m- x m- + +p x +p In equation (), each element is a residue mod G(x), and all arithmetic operations are performed by taking the results modulo. As described in [], P(x) can be computed recursively as follows. T (x) = () T i (x) = [T i- (x)x]mod G(x) + A(x)b m-i, where i =,,,m (3) P(x) = T m (x) (4) After m iteration, the result P(x) can be obtained. Defining T i (x) = t i,m- x m- +t i,m- x m- + +t i, x+t i, and substituting it into eqation, it can be derived that t i,k =t i-,m- g k +b m-i a k +t i-,k- for k m- with t i-,- =. () (5) (6) Based on the given algorithm, a DG can be derived as shown in Fig.. Generally, the DG consists of m m basic cells for multiplication in GF( m ). In particular, m=9 in the DG of Fig.. In addition, Fig. represents the architecture of basic cell which consisits of two -input AND gates and one 3-input XOR gate. The cells in the ith low of the array perform the operations of the ith iteration, where each basic cell computes one coefficient. The coefficients of the result P(x) emerge from the bottom low of the array after m iterations. 3 A Digit-Serial Systolic Multiplier Let L be the digit size, N=m/L be an integer, and A i s, B i s, G i s, and P i s ( i N-) be the digits of the coefficients of A(x), B(x), G(x) and P(x) respectively. Each digit consists of L bits such as the digit A i = (a il+l-, a il +L-,,a il+, a il ), and the digits B i, G i and P i are defined similarly. 3. Modification of DG and Basic Cell As the first step for construction of a new systolic array, we combine L L basic cells in Fig. into a new basic cell. Fig. 3 shows the modified DG, where L=3 and N=m/L=3, and Fig. 4 represents the modified circuits of corresponding basic cell. As described in Fig. 3, the modified DG consist of N N basic cells. In the modified DG of Fig. 3, the digits A i, G i enter the (,i)th basic cell, the B i enters the (N-i,N-)th basic cell, and the digit Pi emerges from the (N,i)th basic cell, where i N-. Although the modified DG in Fig. 3 has high regularity, it is impossible to get an one-dimensional signal flow graph (SFG) array. This is because the data flow is bi-directional horizontally in Fig. 3, the DG cannot be projected along the east direction. In other words, the (i,k)th basic cell has to get the (L-) temporary results t il-,kl-, t il-,kl-,., and t il-(l-),kl- from the right neighboring (i,k-)th basic cell. b b a 5 a 4 a 3 g 3 a g a g a g (,8) (,7) (,6) (,5) (,4) (,3) (,) (,) (,) (,8) (,7) (,6) (,5) (,4) (,3) (,) (,) (,) (3,8) (3,7) (3,6) (3,5) (3,4) (3,3) (3,) (3,) (3,) (4,8) (4,7) (4,6) (4,5) (4,4) (4,3) (4,) (4,) (4,) (5,8) (5,7) (5,6) (5,5) (5,4) (5,3) (5,) (5,) (5,) (6,8) (6,7) (6,6) (6,5) (6,4) (6,3) (6,) (6,) (6,) (7,8) (7,7) (7,6) (7,5) (7,4) (7,3) (7,) (7,) (7,) (8,8) (8,7) (8,6) (8,5) (8,4) (8,3) (8,) (8,) (8,) (9,8) (9,7) (9,6) (9,5) (9,4) (9,3) (9,) (9,) (9,) p 8 p 7 p 6 p 5 p 4 p 3 p p p Fig. Dependence graph for multiplication in GF( 9 ) [3] t i-,m- b m- t i,k a k g k t i-,k- Fig. Circuit of (i,k)th cell in Fig. [][3]

b b (,) (,) (,) g 3 g g g a 5 a 4 a 3 a a a (,) (,) (,) (3,) (3,) (3,) p 8 p 7 p 6 p 5 p 4 p 3 p p p Fig. 3 Modified dependence graph of Fig. with L=3 t (i-)l,m- + t il-,kl+ t il-,m- + t il-,kl+ t il-,m- t (i-)l,kl+ akl+ g kl+ a kl+ g kl+ t (i-)l,kl t il,kl+ til,kl+ t il,kl a kl g kl t (i-)l,kl- t il-,kl- t il-,kl- Fig. 4 Circuit of (i,k) basic cell in Fig. 3 [3] We solve such a problem with different way compared to Guo [3]. The principal idea of our approach is that the time for computation of XOR gates numbered by,,...,(l-) with dependency in Fig. 4 is delayed for one clock cycle. In other words, the computation of (L-) XOR gates with depenency are executed in next clock cycle. By applying this idea, we modify the DG of Fig. and the basic cell of Fig. 4 one more time. The modification procedures are summarized as follows. (i) To remove the dependency of (L-) XOR gates in Fig. 4, each rows in Fig. are shifted to right by one column except the first (upper) row. The resulting DG is shown in Fig. 5. (ii) According to Fig. 5, without changing the function of the basic cell, we can modify the structure of the basic cell of Fig. 4. The resulting structure of the basic cell is shown in Fig. 6, where denotes a -bit one-cycle delay element. This is because the computation of (L-) XOR gates with dependency are executed in next clock cycle. As described in Fig. 6, data buses nembered,,..., and (L-) are connected to the inputs of the XOR gates numbered by,,,and (L-) instead of the L- temporary results t il-,kl-, t il-,kl-,, and t il-(l-),kl-, respectively. Also, we add (L-) one bit latches due to the fact that the (L-) XOR gates with dependency have to be computed with the input data of the previous cycle. (iii) As depicted in Fig. 5, since all the data flow are unidirectional, we can make projection for an one dimensional array. By projecting the DG in Fig. 5 a7 a6 g a 5 g 3 g g g a4 a3 a a a 5 (,8) (,7) (,6) (,5) (,4) (,3) (,) (,) (,) (,8) (,7) (,6) (,5) (,4) (,3) (,) (,) (,) (3,8) (3,7) (3,6) (3,5) (3,4) (3,3) (3,) (3,) (3,) (4,8) (4,7) (4,6) (4,5) (4,4) (4,3) (4,) (4,) (4,) (5,8) (5,7) (5,6) (5,5) (5,4) (5,3) (5,) (5,) (5,) (6,8) (6,7) (6,6) (6,5) (6,4) (6,3) (6,) (6,) (6,) (7,8) (7,7) (7,6) (7,5) (7,4) (7,3) (7,) (7,) (7,) : denote boundary of computation areas : denote boundary am ong (L*L) basic cells w ith L=3 b (8,8) (8,7) (8,6) (8,5) (8,4) (8,3) (8,) (8,) (8,) b (9,8) (9,7) (9,6) (9,5) (9,4) (9,3) (9,) (9,) (9,) p 8 p 7 p 6 p 5 p 4 p 3 p p p Fig. 5 Modified dependence graph of Fig.

along the east direction, we finally derive the one dimensional SFG array as shown in Fig. 7. This SFG array consists of N units of identical processing element(pe), which is depicted in Fig. 8. It is controlled by a control sequence of... with length N. According to the projection, the digits A is and G is enter this array in serial form with the most siginificant digit first, while the digits of the coefficients of B(x) should reside inside the array, i.e. B i should be at one PE to be ready for execution. Since the L temporary results t il-,m-, t il-,m-,, and t (i-)l,m- must be broadcasted to all basic cells in the ith row of the DG in Fig. 5 (i.e. all the basic cells in the ith row of the DG need L temporary results for further processing), we add extra L multiplexer and L one-bit latches into each PE of the SFG array. When the control signal is in logic, the L temporary results are latched. In addition, we add extra L two-input AND gates into each PE of the SFG such that L zeros are fed into each row of the DG from the rightmost cell when the control signal is in logic. Finally, while the next new input data set come in the PE and it processes the first computation, the (L-) XOR gates with depenency have to process the last computation with the previous input data set, respectively. Thus, the inputs for p,p,...,p (L-) and b,,...,b (L-) in Fig. 8 have to be connected the inputs of the previous cycle, where a denotes the cell in Fig. and a doted line denotes boundary between the (L-) cells with dependency and the counterparts. Thus, The inputs p,p,...,p (L-) are connected to s,s,...s (L-) and the inputs b,,...,b (L-) have one bit latches, respectively. According to the modification procedure, we get the modified PE as shown in Fig. 9. As depicted in Fig. 9, the architecture of PE is much more simpler than Guo s one [3]. Fig. 6 Modified version of (i,k)th cell of Fig. 4 g a g a g a g 3 a 3 a 4 a 5 PE PE PE p 3 p 6 p p 4 p 7 p p 5 p 8 p b b Fig. 7 One-dimensional SFG array for multiplication in GF( 9 ) with L=3 p in, + + p in, p in, ctrl p' b' s s p' b' + + Fig. 8 Structure of each PE in Fig. 7 p in, + + p in, p in, ctrl p out, + + p out, p out, p out, + + p out, p out, t (i-)l,kl+ t (i-)l,kl akl+ g kl+ a kl+ g kl+ a kl g kl t (i-)l,kl- + + Fig. 9 Modified version of Fig. 8 t (i-)l,m- + t il-,kl+ ' t il-,m- + t il-,kl+ ' t il-,m- t il,kl+ til,kl+ t il,kl t il-,kl- t il-,kl- 3. Cut-Set Systolisation of SFG By applying the cut-set systolisation techniques [] to the SFG of Fig. 7, we get the digit-serial systolic multiplier depicted in Fig.. If the input data come in continuously, this array can produce results at a rate of one per N cycles after an initial delay of 3N clock cycles. The digits P is emerge from the right-hand side of the array in serial form with the most significant digit first. Also we can add extra L muliplexers and L one-bit latches into each PE so that the digits B is of the

coefficients of B(x) may also enter the array in serial form with most significant digit first. Finally the resulting systolic array and the corresponding PE is described in Fig. and Fig. respectively, where the data load operation of B i occurs when the control signal is in logic. g a g a g a g 3 a 3 a 4 a 5 PE PE PE p 3 p 6 p p 4 p 7 p p 5 p 8 p b b Fig. Digit-serial systolic array for multiplication in GF( 9 ) with L=3 g a g a g a g 3 a 3 a 4 a 5 b b PE PE PE p p 3 p 6 p p 4 p 7 p p 5 p 8 Fig. Modified version of Fig., where digits B is enter array in serial form p in, + + p in, p in, + p out, + + p out, p out, using the Synopsis FPGA-Express (version,-fe3.5), in which Altera s FPGA FLEX K family were assumed as target devices. After complete verification of the circuit layout, we extract net-list files from FPGA Express. As the next step, we made a functional simulation from the net-list file using the Mentographics VHDL (ChimSim). Fig. 3 shows the simulation result when the input data come in continuously, where A(x)=(x 8 + x 5 + x 4 + x, x 8 + x 7 + x 6 + x 5 + x 3 +), B(x)=(x 8 + x 6 + x 5 + x +, x 8 + x 7 + x 3 + x + x), and G(x)=(x 9 +x 4 +, x 9 +x 4 +) In this case, we feed two different data set in successively. The simulation results show that P(x) is given as (x 8 + x 5 +, x 7 + x 6 +), which is the same as the analytical solution. In addition, the output comes out at a rate of every 3 cycles after an initial delay of 9 clock cycles. More generally, the output will come out at a rate of every N cycles after an initial delay of 3N clock cycles. In case of larger m (up to 8) with various L, we obtained the same result that the proposed multiplier work in correctly with the same processing rate described in above. Based on the simulation result, we compared the performance of the proposed multiplier with the previous research [3]. Table summarizes the performance comparison results, in which we assume that 3-input XOR gate and 4-input XOR gate are constructed using two and three -input XOR gates respectively [3]. As described in Table, in the proposed systolic multiplier, as a whole, the computation delay time is reduced as much as 3N(T XOR ). In addition, the number of gates is reduced as much as [{N(L-)-}Latchs + (L-)(XOR+AND)], and the number of data bus is also reduced as much as (L-) compared to the Guo s one [3]. This results leads to the conclusion that the proposed digit-serial multiplier gives a reduction in computational delay time with a more simple hardware circuit. + ctrl Fig. Structure of each PE in Fig. 4 Performance Analysis To make performance analysis, the multiplier depicted in Fig. was modeled in VHDL and was synthesized 5 Conclusions In this paper, we proposed an efficient digit-serial systolic multiplier for GF( m ). From the given algorithm and DG, we first obtained a new DG through two steps of modification. In addition, by applying the cut-set systolisation techniques, we finally constructed the SFG and the PE for digit-serial multiplication. Based on the constructed array, we made simulation in various way. The simulation results show that the proposed architecture leads to a reduction of computational

Item Circuit Requirement One PE Complexity Fig. 3 Simulation result Table. Comparison of two digit-serial systolic multipliers for GF( m ) Circuit Guo s Multiplier [3] Proposed Multiplier N L- L- L +L L- L -L+ L- L L PE XOR AND AND XOR XOR3 XOR4 bit Latch N L +L 9L+ L Latency (cycles) 3N 3N L PE bit Latch AND XOR3 bit Latch Maximum T AND + T XOR4 + (L-)T + L(T AND + T XOR3 ) Propagation delay (L-)(T AND + T XOR3 + T ) Number of control signals delay time with more simple hardware structure, compared to the Guo s multiplier [3]. In particular, the structure of each PE is much more simpler than the Guo s one [3]. In addition, since the proposed multiplier shows unidirectional data flow, and is highly regular and modular, it is well suited to VLSI implementation. As described in this paper, the digit-serial multiplier is faster than the bit-serial one, and has lower hardware complexity than the bit-parallel one. In addition, in the digit-serial multiplier architecture, we can make trade-off between throughput performance and hardware complexity. Thus, by choosing the digit size appropriately, a digit-serial architecture can meet the throughput requirement with minimum hardware complexity. References: [] R. E. Blahut, Theory and Practice of Error Control Codes, MA: Addison-Wesley, 983. [] C. L. Wang and J. L. Lin, Systolic Array Implementation of Multipliers for Finite Field GF( m ), IEEE Trans. on Circuits and Syst, Vol. 38, No. 7, pp. 796-8, July. 99. [3] J. H. Guo and C. L. Wang, Digit-Serial Systolic Multiplier for Finite Field GF( m ), IEE Proc.-Comput. Digit. Tech., Vol. 45, No., pp. 43-48, March. 998. [4] L. Song and K. K. Parhi, Efficient Finite Field Serial/Parallel Multiplication. Proc. Int. Conf.

Application Specific Syst. Architectures and Processors, Chicago, IL, pp. 7-8, Aug. 996. [5] S. K. Jain, L. Song, and K. K. Parhi, Efficient Semi-Systolic Architectures for Finite Field Arithmetic. IEEE Trans. on VLSI System, Vol. 6, No., pp. -3, March. 998. [6] S. Y. Kung, VLSI Array Processors, Englewood Cliffs, NJ: Prentice Hall, 988. [7] R. Hartley and P.F. Corbett, Digit-Serial Processing Techniques, IEEE Trans. CAS-37, Vol. 37, No. 6, pp. 77-79, June. 99. [8] K. K. Parhi, Systematic Approach for Design of Digit-Serial Signal Processing Architectures, IEEE Trans., CAS-38, (4), pp. 358-357, April. 99. [9] B. Schneier, Applied Cryptography. Wiley, 996. [] S.Y. Kung, On Supercomputing with Systolic/ Wavefront Array Processors, IEEE Proceedings, Vol. 7, No.7, pp. 867-884, July. 984. [] J. A. Peter, The Designer s Guide to VHDL, Morgan Kaufmann, 996. [] W. F. Lee, VHDL Coding and Logic Synthesis with Synopsis, Academic Press,. [3] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A System Perspective, MA, Addison-Wesley, 985. [4] G. Orlando and C. Paar, A Super-Serial Galois Fields Multiplier for FPGAs and its Application to Public-Key Algorithms, Proc. of the 7 Annual IEEE Symposium on Field-Programmable Computing Machines, pp. 3-39, 999.