A Digit-Serial Systolic Multiplier for Finite Fields GF(2 m )

A Digit-Serial Systolic Multiplier for Finite Fields GF( m ) Chang Hoon Kim, Sang Duk Han, and Chun Pyo Hong Department of Computer and Information Engineering Taegu University 5 Naeri, Jinryang, Kyungsan, Kyungbuk,7-74 KOREA Abstract: In this paper, a digit-serial systolic array is proposed for computing modular multiplication A(x), B(x) mod G(x) in finite fields GF( m ) with the standard basis representation. From the multiplication algorithm in GF( m ), we obtain a new dependence graph and design an efficient digit-serial systolic multiplier. If input data come in continuously, the proposed array can produce multiplication results at a rate of one every [m/l] clock cycles, where L is the selected digit size. The analysis results show that the proposed architecture leads to a reduction of computational delay time and it has much more simple structure than existing digit-serial systolic multiplier. Furthermore, since the new architecture has the features of regularity, modularity, and unidirectional data flow, it is well suited to VLSI implementation with fault-tolerant design. Key-Words : Digit-Serial Architecture, Finite Field Multiplier, Cryptography, Systolic Array, Finite Field Arithmetic, VLSI. Introduction Finite or Galois Field(GF)( m ) have played an important role in many application areas of communication, such as error-correcting code [] and cryptography [9]. Because addition in GF( m ) is bit independent XOR operation, it can be implemented in fast and inexpensive way. On the other hand, multiplication is more complicated and expensive. Futhermore, it is the most common arithmetic operation in GF( m ). Thus, it is desirable to design hardware-efficient multiplier for GF( m ) to meet the real-time requirement with minimum hardware complexity [3]. Many approaches and architectures have been proposed to perform multiplication in GF( m ) [][3][4][5][4]. The multipliers for GF( m ) can be classified into four types: bit-parallel, bit-serial, super-serial and digit-serial architectures. Basically a bit-parallel system reaches much better throughput performance than the others, but it involves much more hardware complexity. To improve the trade-off between throughput performance and hardware complexity, digit-serial architecture have been proposed [3][7][8]. For a digit-serial system, the data words are first partitioned This work was supported by grant No. --5-- from the Basic Research Program of the Korea Science & Engineering Foundation into digits of some bits each, and then processed and transmitted on a digit-by-digit basis. Suppose that the word size is m-bit, the digit size is L-bit, and N = [m/l], then bit-parallel, bit-serial and super-serial systems process the input data at a rate of m-bit, one-bit and less than one-bit per clock cycle respectively, while digit-serial system processes the input data at a rate of L-bit per clock cycle. In other words, digit-serial system will yield output results at a rate of one every N clock cycles, while bit-parallel, bit-serial and super-serial systems will yield output results at a rate of one, m and more than m clock cycles, respectively. If the digit size is chosen appropriately, a digit-serial architecture can meet the throughput requirement of a certain application with minimum hardware. In this paper, we first review the multiplication algorithm in GF( m ) and then derive a new dependence graph(dg). Based on a new DG, we propose an efficient digit-serial systolic multiplier and its architecture leads to a reduction of computational delay time and it shows more simple structure compared to Guo and Wang s architecture [3]. In addition, if the input data come in continuously, the proposed array can produce results at a rate of one per N cycles after an initial delay of 3N clock cycles, which is the same as the previous research [3]. Finally, since the new architecture has the features of

regularity, modularity, and unidirectional data flow, it is well suited to VLSI implementation with fault-tolerant design. The Multiplication Algorithm In this section, we first review the multiplication algorithm [][5]. Let A(x) and B(x) be two elements in GF( m ), G(x) be the primitive polynomial used to generate the field and P(x) be the result of the product A(x)B(x) mod G(x). For each polynomials, the coefficients are the binary digits and. A(x)=a m- x m- +a m- x m- + +a x +a B(x)=b m- x m- +b m- x m- + +b x +b G(x)= x m +g m- x m- +g m- x m- + +g x+g P(x)= p m- x m- +p m- x m- + +p x +p In equation (), each element is a residue mod G(x), and all arithmetic operations are performed by taking the results modulo. As described in [], P(x) can be computed recursively as follows. T (x) = () T i (x) = [T i- (x)x]mod G(x) + A(x)b m-i, where i =,,,m (3) P(x) = T m (x) (4) After m iteration, the result P(x) can be obtained. Defining T i (x) = t i,m- x m- +t i,m- x m- + +t i, x+t i, and substituting it into eqation, it can be derived that t i,k =t i-,m- g k +b m-i a k +t i-,k- for k m- with t i-,- =. () (5) (6) Based on the given algorithm, a DG can be derived as shown in Fig.. Generally, the DG consists of m m basic cells for multiplication in GF( m ). In particular, m=9 in the DG of Fig.. In addition, Fig. represents the architecture of basic cell which consisits of two -input AND gates and one 3-input XOR gate. The cells in the ith low of the array perform the operations of the ith iteration, where each basic cell computes one coefficient. The coefficients of the result P(x) emerge from the bottom low of the array after m iterations. 3 A Digit-Serial Systolic Multiplier Let L be the digit size, N=m/L be an integer, and A i s, B i s, G i s, and P i s ( i N-) be the digits of the coefficients of A(x), B(x), G(x) and P(x) respectively. Each digit consists of L bits such as the digit A i = (a il+l-, a il +L-,,a il+, a il ), and the digits B i, G i and P i are defined similarly. 3. Modification of DG and Basic Cell As the first step for construction of a new systolic array, we combine L L basic cells in Fig. into a new basic cell. Fig. 3 shows the modified DG, where L=3 and N=m/L=3, and Fig. 4 represents the modified circuits of corresponding basic cell. As described in Fig. 3, the modified DG consist of N N basic cells. In the modified DG of Fig. 3, the digits A i, G i enter the (,i)th basic cell, the B i enters the (N-i,N-)th basic cell, and the digit Pi emerges from the (N,i)th basic cell, where i N-. Although the modified DG in Fig. 3 has high regularity, it is impossible to get an one-dimensional signal flow graph (SFG) array. This is because the data flow is bi-directional horizontally in Fig. 3, the DG cannot be projected along the east direction. In other words, the (i,k)th basic cell has to get the (L-) temporary results t il-,kl-, t il-,kl-,., and t il-(l-),kl- from the right neighboring (i,k-)th basic cell. b b a 5 a 4 a 3 g 3 a g a g a g (,8) (,7) (,6) (,5) (,4) (,3) (,) (,) (,) (,8) (,7) (,6) (,5) (,4) (,3) (,) (,) (,) (3,8) (3,7) (3,6) (3,5) (3,4) (3,3) (3,) (3,) (3,) (4,8) (4,7) (4,6) (4,5) (4,4) (4,3) (4,) (4,) (4,) (5,8) (5,7) (5,6) (5,5) (5,4) (5,3) (5,) (5,) (5,) (6,8) (6,7) (6,6) (6,5) (6,4) (6,3) (6,) (6,) (6,) (7,8) (7,7) (7,6) (7,5) (7,4) (7,3) (7,) (7,) (7,) (8,8) (8,7) (8,6) (8,5) (8,4) (8,3) (8,) (8,) (8,) (9,8) (9,7) (9,6) (9,5) (9,4) (9,3) (9,) (9,) (9,) p 8 p 7 p 6 p 5 p 4 p 3 p p p Fig. Dependence graph for multiplication in GF( 9 ) [3] t i-,m- b m- t i,k a k g k t i-,k- Fig. Circuit of (i,k)th cell in Fig. [][3]

b b (,) (,) (,) g 3 g g g a 5 a 4 a 3 a a a (,) (,) (,) (3,) (3,) (3,) p 8 p 7 p 6 p 5 p 4 p 3 p p p Fig. 3 Modified dependence graph of Fig. with L=3 t (i-)l,m- + t il-,kl+ t il-,m- + t il-,kl+ t il-,m- t (i-)l,kl+ akl+ g kl+ a kl+ g kl+ t (i-)l,kl t il,kl+ til,kl+ t il,kl a kl g kl t (i-)l,kl- t il-,kl- t il-,kl- Fig. 4 Circuit of (i,k) basic cell in Fig. 3 [3] We solve such a problem with different way compared to Guo [3]. The principal idea of our approach is that the time for computation of XOR gates numbered by,,...,(l-) with dependency in Fig. 4 is delayed for one clock cycle. In other words, the computation of (L-) XOR gates with depenency are executed in next clock cycle. By applying this idea, we modify the DG of Fig. and the basic cell of Fig. 4 one more time. The modification procedures are summarized as follows. (i) To remove the dependency of (L-) XOR gates in Fig. 4, each rows in Fig. are shifted to right by one column except the first (upper) row. The resulting DG is shown in Fig. 5. (ii) According to Fig. 5, without changing the function of the basic cell, we can modify the structure of the basic cell of Fig. 4. The resulting structure of the basic cell is shown in Fig. 6, where denotes a -bit one-cycle delay element. This is because the computation of (L-) XOR gates with dependency are executed in next clock cycle. As described in Fig. 6, data buses nembered,,..., and (L-) are connected to the inputs of the XOR gates numbered by,,,and (L-) instead of the L- temporary results t il-,kl-, t il-,kl-,, and t il-(l-),kl-, respectively. Also, we add (L-) one bit latches due to the fact that the (L-) XOR gates with dependency have to be computed with the input data of the previous cycle. (iii) As depicted in Fig. 5, since all the data flow are unidirectional, we can make projection for an one dimensional array. By projecting the DG in Fig. 5 a7 a6 g a 5 g 3 g g g a4 a3 a a a 5 (,8) (,7) (,6) (,5) (,4) (,3) (,) (,) (,) (,8) (,7) (,6) (,5) (,4) (,3) (,) (,) (,) (3,8) (3,7) (3,6) (3,5) (3,4) (3,3) (3,) (3,) (3,) (4,8) (4,7) (4,6) (4,5) (4,4) (4,3) (4,) (4,) (4,) (5,8) (5,7) (5,6) (5,5) (5,4) (5,3) (5,) (5,) (5,) (6,8) (6,7) (6,6) (6,5) (6,4) (6,3) (6,) (6,) (6,) (7,8) (7,7) (7,6) (7,5) (7,4) (7,3) (7,) (7,) (7,) : denote boundary of computation areas : denote boundary am ong (L*L) basic cells w ith L=3 b (8,8) (8,7) (8,6) (8,5) (8,4) (8,3) (8,) (8,) (8,) b (9,8) (9,7) (9,6) (9,5) (9,4) (9,3) (9,) (9,) (9,) p 8 p 7 p 6 p 5 p 4 p 3 p p p Fig. 5 Modified dependence graph of Fig.

along the east direction, we finally derive the one dimensional SFG array as shown in Fig. 7. This SFG array consists of N units of identical processing element(pe), which is depicted in Fig. 8. It is controlled by a control sequence of... with length N. According to the projection, the digits A is and G is enter this array in serial form with the most siginificant digit first, while the digits of the coefficients of B(x) should reside inside the array, i.e. B i should be at one PE to be ready for execution. Since the L temporary results t il-,m-, t il-,m-,, and t (i-)l,m- must be broadcasted to all basic cells in the ith row of the DG in Fig. 5 (i.e. all the basic cells in the ith row of the DG need L temporary results for further processing), we add extra L multiplexer and L one-bit latches into each PE of the SFG array. When the control signal is in logic, the L temporary results are latched. In addition, we add extra L two-input AND gates into each PE of the SFG such that L zeros are fed into each row of the DG from the rightmost cell when the control signal is in logic. Finally, while the next new input data set come in the PE and it processes the first computation, the (L-) XOR gates with depenency have to process the last computation with the previous input data set, respectively. Thus, the inputs for p,p,...,p (L-) and b,,...,b (L-) in Fig. 8 have to be connected the inputs of the previous cycle, where a denotes the cell in Fig. and a doted line denotes boundary between the (L-) cells with dependency and the counterparts. Thus, The inputs p,p,...,p (L-) are connected to s,s,...s (L-) and the inputs b,,...,b (L-) have one bit latches, respectively. According to the modification procedure, we get the modified PE as shown in Fig. 9. As depicted in Fig. 9, the architecture of PE is much more simpler than Guo s one [3]. Fig. 6 Modified version of (i,k)th cell of Fig. 4 g a g a g a g 3 a 3 a 4 a 5 PE PE PE p 3 p 6 p p 4 p 7 p p 5 p 8 p b b Fig. 7 One-dimensional SFG array for multiplication in GF( 9 ) with L=3 p in, + + p in, p in, ctrl p' b' s s p' b' + + Fig. 8 Structure of each PE in Fig. 7 p in, + + p in, p in, ctrl p out, + + p out, p out, p out, + + p out, p out, t (i-)l,kl+ t (i-)l,kl akl+ g kl+ a kl+ g kl+ a kl g kl t (i-)l,kl- + + Fig. 9 Modified version of Fig. 8 t (i-)l,m- + t il-,kl+ ' t il-,m- + t il-,kl+ ' t il-,m- t il,kl+ til,kl+ t il,kl t il-,kl- t il-,kl- 3. Cut-Set Systolisation of SFG By applying the cut-set systolisation techniques [] to the SFG of Fig. 7, we get the digit-serial systolic multiplier depicted in Fig.. If the input data come in continuously, this array can produce results at a rate of one per N cycles after an initial delay of 3N clock cycles. The digits P is emerge from the right-hand side of the array in serial form with the most significant digit first. Also we can add extra L muliplexers and L one-bit latches into each PE so that the digits B is of the

coefficients of B(x) may also enter the array in serial form with most significant digit first. Finally the resulting systolic array and the corresponding PE is described in Fig. and Fig. respectively, where the data load operation of B i occurs when the control signal is in logic. g a g a g a g 3 a 3 a 4 a 5 PE PE PE p 3 p 6 p p 4 p 7 p p 5 p 8 p b b Fig. Digit-serial systolic array for multiplication in GF( 9 ) with L=3 g a g a g a g 3 a 3 a 4 a 5 b b PE PE PE p p 3 p 6 p p 4 p 7 p p 5 p 8 Fig. Modified version of Fig., where digits B is enter array in serial form p in, + + p in, p in, + p out, + + p out, p out, using the Synopsis FPGA-Express (version,-fe3.5), in which Altera s FPGA FLEX K family were assumed as target devices. After complete verification of the circuit layout, we extract net-list files from FPGA Express. As the next step, we made a functional simulation from the net-list file using the Mentographics VHDL (ChimSim). Fig. 3 shows the simulation result when the input data come in continuously, where A(x)=(x 8 + x 5 + x 4 + x, x 8 + x 7 + x 6 + x 5 + x 3 +), B(x)=(x 8 + x 6 + x 5 + x +, x 8 + x 7 + x 3 + x + x), and G(x)=(x 9 +x 4 +, x 9 +x 4 +) In this case, we feed two different data set in successively. The simulation results show that P(x) is given as (x 8 + x 5 +, x 7 + x 6 +), which is the same as the analytical solution. In addition, the output comes out at a rate of every 3 cycles after an initial delay of 9 clock cycles. More generally, the output will come out at a rate of every N cycles after an initial delay of 3N clock cycles. In case of larger m (up to 8) with various L, we obtained the same result that the proposed multiplier work in correctly with the same processing rate described in above. Based on the simulation result, we compared the performance of the proposed multiplier with the previous research [3]. Table summarizes the performance comparison results, in which we assume that 3-input XOR gate and 4-input XOR gate are constructed using two and three -input XOR gates respectively [3]. As described in Table, in the proposed systolic multiplier, as a whole, the computation delay time is reduced as much as 3N(T XOR ). In addition, the number of gates is reduced as much as [{N(L-)-}Latchs + (L-)(XOR+AND)], and the number of data bus is also reduced as much as (L-) compared to the Guo s one [3]. This results leads to the conclusion that the proposed digit-serial multiplier gives a reduction in computational delay time with a more simple hardware circuit. + ctrl Fig. Structure of each PE in Fig. 4 Performance Analysis To make performance analysis, the multiplier depicted in Fig. was modeled in VHDL and was synthesized 5 Conclusions In this paper, we proposed an efficient digit-serial systolic multiplier for GF( m ). From the given algorithm and DG, we first obtained a new DG through two steps of modification. In addition, by applying the cut-set systolisation techniques, we finally constructed the SFG and the PE for digit-serial multiplication. Based on the constructed array, we made simulation in various way. The simulation results show that the proposed architecture leads to a reduction of computational

Item Circuit Requirement One PE Complexity Fig. 3 Simulation result Table. Comparison of two digit-serial systolic multipliers for GF( m ) Circuit Guo s Multiplier [3] Proposed Multiplier N L- L- L +L L- L -L+ L- L L PE XOR AND AND XOR XOR3 XOR4 bit Latch N L +L 9L+ L Latency (cycles) 3N 3N L PE bit Latch AND XOR3 bit Latch Maximum T AND + T XOR4 + (L-)T + L(T AND + T XOR3 ) Propagation delay (L-)(T AND + T XOR3 + T ) Number of control signals delay time with more simple hardware structure, compared to the Guo s multiplier [3]. In particular, the structure of each PE is much more simpler than the Guo s one [3]. In addition, since the proposed multiplier shows unidirectional data flow, and is highly regular and modular, it is well suited to VLSI implementation. As described in this paper, the digit-serial multiplier is faster than the bit-serial one, and has lower hardware complexity than the bit-parallel one. In addition, in the digit-serial multiplier architecture, we can make trade-off between throughput performance and hardware complexity. Thus, by choosing the digit size appropriately, a digit-serial architecture can meet the throughput requirement with minimum hardware complexity. References: [] R. E. Blahut, Theory and Practice of Error Control Codes, MA: Addison-Wesley, 983. [] C. L. Wang and J. L. Lin, Systolic Array Implementation of Multipliers for Finite Field GF( m ), IEEE Trans. on Circuits and Syst, Vol. 38, No. 7, pp. 796-8, July. 99. [3] J. H. Guo and C. L. Wang, Digit-Serial Systolic Multiplier for Finite Field GF( m ), IEE Proc.-Comput. Digit. Tech., Vol. 45, No., pp. 43-48, March. 998. [4] L. Song and K. K. Parhi, Efficient Finite Field Serial/Parallel Multiplication. Proc. Int. Conf.

Application Specific Syst. Architectures and Processors, Chicago, IL, pp. 7-8, Aug. 996. [5] S. K. Jain, L. Song, and K. K. Parhi, Efficient Semi-Systolic Architectures for Finite Field Arithmetic. IEEE Trans. on VLSI System, Vol. 6, No., pp. -3, March. 998. [6] S. Y. Kung, VLSI Array Processors, Englewood Cliffs, NJ: Prentice Hall, 988. [7] R. Hartley and P.F. Corbett, Digit-Serial Processing Techniques, IEEE Trans. CAS-37, Vol. 37, No. 6, pp. 77-79, June. 99. [8] K. K. Parhi, Systematic Approach for Design of Digit-Serial Signal Processing Architectures, IEEE Trans., CAS-38, (4), pp. 358-357, April. 99. [9] B. Schneier, Applied Cryptography. Wiley, 996. [] S.Y. Kung, On Supercomputing with Systolic/ Wavefront Array Processors, IEEE Proceedings, Vol. 7, No.7, pp. 867-884, July. 984. [] J. A. Peter, The Designer s Guide to VHDL, Morgan Kaufmann, 996. [] W. F. Lee, VHDL Coding and Logic Synthesis with Synopsis, Academic Press,. [3] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A System Perspective, MA, Addison-Wesley, 985. [4] G. Orlando and C. Paar, A Super-Serial Galois Fields Multiplier for FPGAs and its Application to Public-Key Algorithms, Proc. of the 7 Annual IEEE Symposium on Field-Programmable Computing Machines, pp. 3-39, 999.