Systematic Design of Original and Modified Mastrovito Multipliers for General Irreducible Polynomials

Size: px

Start display at page:

Download "Systematic Design of Original and Modified Mastrovito Multipliers for General Irreducible Polynomials"

Gertrude Hudson
5 years ago
Views:

1 4 IEEE TRANSACTIONS ON COMPUTERS, VOL 50, NO, JULY 001 Systematic Design of Original and Modified Mastrovito Multipliers for General Irreducible Polynomials Tong Zhang, Student Member, IEEE, and Keshab K Parhi, Fellow, IEEE AbstractÐThis paper considers the design of bit-parallel dedicated finite field multipliers using standard basis An explicit algorithm is proposed for efficient construction of Mastrovito product matrix, based on which we present a systematic design of Mastrovito multiplier applicable to GF m generated by an arbitrary irreducible polynomial This design effectively exploits the spatial correlation of elements in Mastrovito product matrix to reduce the complexity Using a similar methodology, we propose a systematic design of modified Mastrovito multiplier, which is suitable for GF m generated by high-hamming weight irreducible polynomials For both original and modified Mastrovito multipliers, the developed multiplier architectures are highly modular, which is desirable for VLSI hardware implementation Applying the proposed algorithm and design approach, we study the Mastrovito multipliers for several special irreducible polynomials, such as trinomial and equally-spaced-polynomial, and the obtained complexity results match the best known results Moreover, we have discovered several new special irreducible polynomials which also lead to low-complexity Mastrovito multipliers Index TermsÐFinite (or Galois) field, standard basis, multiplication, irreducible polynomials, complexity, VLSI architecture, Toeplitz matrix æ 1 INTRODUCTION EFFICIENT hardware implementations of finite (or Galois) field GF m arithmetic units are highly desirable for many applications in error-correcting coding and cryptography [1], [], [], [4] Among the GF m arithmetic operations, multiplication is the most basic and important building block in many applications A number of efficient GF m multiplication approaches and architectures have been proposed in which different basis representations of field elements are used, such as standard basis, dual basis, and normal basis Standard basis is more promising in the sense that it gives designers more freedom on irreducible polynomial selection and hardware optimization For detailed discussions of these three basis representations, readers are referred to [], [5] In this paper, we are interested in the design of bit-parallel GF m multipliers using standard basis The standard basis multiplication involves two steps: polynomial multiplication and modulo reduction Let f x be the irreducible polynomial generating GF m, c x be the product of a x and b x, where a x, b x, c x GF m The finite field multiplication is performed as c x ˆ a x b x mod f x An efficient dedicated bitparallel multiplier was proposed by Mastrovito [] in which a product matrix M is introduced to combine the above two steps together Thus, the multiplication is carried out by c ˆ M b, where c and b represent the coefficient The authors are with the Department of Electrical and Computer Engineering, University of Minnesota, 00 Union Street SE, Minneapolis, MN {tzhang, parhi}@eceumnedu Manuscript received Oct 000; revised Feb 001; accepted Apr 001 For information on obtaining reprints of this article, please send to: tc@computerorg, and reference IEEECS Log Number 110 vectors of c x and b x, respectively Given irreducible polynomial f x, each entry in matrix M is obtained by oring certain coefficients of a x Many entries in matrix M can be computed efficiently by sharing some common items, eg, two entries M i 1 ;j 1 ˆa 0 a a and M i ;j ˆa 0 a a 4 can be computed using three OR gates by sharing the common item a 0 a This method is called subexpression sharing [] Mastrovito multipliers using two special irreducible polynomials, trinomial and equallyspaced-polynomial (ESP), have been studied by many researchers for their low-complexity implementations [8], [9], [10], [11], [1], [1], [14] The essence of all these works is to find an architecture to exploit subexpression sharing efficiently based on the specific irreducible polynomials It has been shown in [8] that Mastrovito multiplier using irreducible trinomial x m x n 1 only requires m 1 OR gates and m AND gates By generalizing the approach of [8], Halbutogullari and KocË [14] discovered that the space complexity of Mastrovito multiplier using irreducible ESP x m x tr x r 1, where t 1 r ˆ m, can be reduced to m r OR gates and m AND gates Furthermore, [14] presents a new formulation of the Mastrovito product matrix for an arbitrary irreducible polynomial x m x nk x n1 1, where the space complexity is given as m AND gates and m 1 m k 1 P jn m 1 j OR gates, Nf0; 1; ;m g However, [14] fails to find a method to explicitly compute the set N, which makes its result less practicable in general In general cases, the complexity of computing product matrix is proportional to the Hamming weight of the irreducible polynomial f x (denoted as pwt) So, the Mastrovito multiplier is good only when low-hamming weight irreducible polynomials are used A modified /01/$1000 ß 001 IEEE

2 ZHANG AND PARHI: SYSTEMATIC DESIGN OF ORIGINAL AND MODIFIED MASTROVITO MULTIPLIERS FOR GENERAL IRREDUCIBLE 5 Mastrovito multiplier was proposed by Song and Parhi [15], which has a complexity proportional to m 1 pwt Its basic idea is to use the complementary irreducible polynomial for the computation of matrix M with appropriate compensation Such a multiplication scheme is efficient when GF m is generated by high-hamming weight irreducible polynomial Generally, for large m, efficient design for both original and modified Mastrovito multipliers becomes rather difficult and a systematic design approach is highly desirable In this paper, we generalize the approach of [8] in a different way compared with [14] We propose a theorem and an algorithm (which can explicitly compute set N in [14]) for the construction of product matrix in the original Mastrovito multiplier, based on which we develop an efficient systematic design of the original Mastrovito multiplier This design effectively exploits subexpression sharing in the computation of product matrix to reduce the complexity Using a similar methodology, we also develop a systematic design of modified Mastrovito multiplier For both original and modified Mastrovito multipliers, explicit algorithms and architectures are presented and the complexities are given in detail Applying our proposed design approach, we study the irreducible trinomial and ESP and the complexity results match the results in [8] and [14] Meanwhile, another computation approach for trinomial case is proposed to make a trade-off between space complexity and delay Moreover, with the aid of the proposed algorithm, we discover several new irreducible polynomials leading to low-complexity original Mastrovito multipliers, which is especially desirable when neither an irreducible trinomial nor an irreducible ESP exists This paper is organized as follows: We introduce the notation of this paper and the fundamentals of finite field and Mastrovito multiplier in Section In Section, we propose a theorem and an algorithm for the construction of product matrix, based on which a systematic design approach for original Mastrovito multiplier is developed In Section 4, using a similar design methodology, we present a systematic design approach for modified Mastrovito multiplier Efficient computation approaches for several special irreducible polynomials are discussed in Section 5 This paper is an extended version of [1] NOTATION AND PRELIMINARIES Since several arithmetic operations pertaining to matrices and vectors will be extensively used throughout this paper, we first introduce the following notation: Column vectors and matrices are represented by small and capital boldfaced characters, respectively Matlab matrix notations are used, eg, Z i; :, Z :;j, and Z i; j represent the ith row vector, jth column vector, and the entry with position i; j in matrix Z, respectively The operations of shift by feeding zero are represented by corresponding arrows, eg, v # Š, U! 1Š, and U # 1Š represent down shift of vector v by two positions, right shift of matrix U by one column, and down shift of matrix U by one row, respectively, which are explicitly given as: v # Š ˆ 0; 0;v 0 ; ;v n Š T U! 1Š ˆ o; U :; 1 ; ; U :;m 1 Š U # 1Š ˆ o; U 1; : T ; ; U m 1; : T Š T ; where o represents the zero column vector Furthermore, we note that the AND and OR gates considered in this paper are all -input gates, whose delays are denoted as T A and T, respectively Finite field GF m contains m elements and can be viewed as an m-dimensional vector space over GF, which only has two elements, 0 and 1 With the standard basis f1;x;x ; ;x m 1 g, the elements of the finite field GF m can be represented as polynomials of degree m 1 as follows: GF m ˆfa x ja x ˆ a m 1 x m 1 a m x m a 1 x a 0 ;a i GF g: Such polynomial representation is generally used for finite field arithmetic operations, where addition is carried out by polynomial addition over GF m using bit-independent OR operations Finite field multiplication using standard basis is carried out by polynomial multiplication and modulo operation Let f x ˆx m f m 1 x m 1 f 1 x 1 be the irreducible polynomial generating GF m, and c x be the product of a x and b x, where a x ;b x ;c x GF m The polynomial multiplication, d x ˆa x b x, can performed as d ˆ A b; where b ˆ b 0 ;b 1 ; ;b m 1 Š T and d ˆ d 0 ;d 1 ; ;d m Š T are the coefficient vectors of b x and d x, respectively, and matrix A is given as a a 1 a a m a m a 0 0 A ˆ a m 1 a m a 1 a 0 : 1 0 a m 1 a a a m 1 a m a m 1 Next, we perform the modulo reduction c x ˆd x mod f x ; which can be expressed as m c x ˆ d k x k mod f x kˆ0 ˆ m 1 kˆ0 m d k x k d k x k mod f x : kˆm For m k m, if we denote x k mod f x as u l x, where l ˆ k m 1, we have u l x ˆ xm mod f x ; l ˆ 1; u l 1 x xmod f x ; l ˆ ; ; ;m 1:

3 IEEE TRANSACTIONS ON COMPUTERS, VOL 50, NO, JULY 001 Since the irreducible polynomial is f x ˆx m f m 1 x m 1 f 1 x f 0 ; where f 0 ˆ 1, we get u 1 j ˆ f j ; 0 j m 1; ( j ˆ u l 1 j 1 u l 1 m 1 f j; 1 j m 1 u l 1 m 1 ; j ˆ 0; u l where u l 1 j and u l j are the coefficients of u l 1 x and u l x, respectively Let u l denote the coefficient vector of u l x, then, using vector notation, () can be rewritten as u l ˆ f; l ˆ 1 u l 1 # 1Š u l 1 4 m 1 f; l m 1; where f ˆ 1;f 1 ; ;f m 1 Š T Let s denote vector d 0 ;d 1 ; ;d m 1 Š T ; which consists of the first m entries of d, we can rewrite () in matrix notation as follows: c ˆ I mm s m 1 d l m 1 u l lˆ1 ˆ I mm ; U d ˆ I mm ; U A b; where I mm represent m m identity matrix and matrix U ˆ u 1 ; u ; ; u m 1 Š Note that all successive columns in matrix U have the recursive relation as in (4) Define M ˆ I mm ; U A: Thus, the GF m multiplication can be carried out by c ˆ M b, which is the well-known Mastrovito multiplication scheme, and the matrix M is called product matrix MASTROVITO MULTIPLIER In this section, we first introduce a theorem and an algorithm pertaining to the construction of product matrix M Then, we develop an approach to compute M by adding a series of Toeplitz matrices, through which subexpression sharing could be extensively exploited to reduce the OR complexity Accordingly, a highly modular multiplier architecture is presented 1 Proposed Theorem and Algorithm In Section, we have shown that product matrix M can be expressed as the product of two independent matrices, I mm ; UŠ and A, and there exists a recursive relation between successive columns in matrix U, as described in (4) Through the following theorem and algorithm, we will see that matrix U also can be constructed by adding a series of Toeplitz matrices Theorem 1 For the computation of matrix U in (5), we can always construct a set Nf0; 1; ;m g and U ˆ F! nš; nn 5 where the Toeplitz matrix F ˆ f; f # 1Š; ; f # m ŠŠ Let the irreducible polynomial be f x ˆx m x k s x k 1 1; where m>k s > >k 1 > 1, the set N in Theorem 1 can be explicitly constructed with the aid of a weighted tree generated based on f x The complete construction approach is described by the following algorithm: Algorithm 1 Input: The parameters of irreducible polynomial: m, k 1 ; ;k s ; Output: set Nf0; 1; ;m g Procedure: 1 Generate a weighted tree D according to the following properties: Each node d j in D has at most s child nodes and each edge has the weight w f m k i ; 1 i sg; Let d 1 denote the root and h d 1 ;d j denote the weight of path from d 1 to d j, where h d 1 ;d 1 ˆ0, we have 8 d j, if 9 r f m k i ; 1 i sg and h d 1 ;d j r <m 1, then d j always has one child node d l and the weight of edge between d j and d l is r; 8d j D, h d 1 ;d j <m 1 Construct multiset Hˆfh d 1 ;d j ; 8d j Dg and set Nˆ;; For 0 j m, do a create multiset S j ˆ;; b 8 h H,ifhˆj, then insert h into S j ; c if js j j mod ˆ1, then insert j into the set N Here, we note that a multiset is like a set, except that repeated elements are allowed, and js j j represents the order of S j A proof of Theorem 1 is given in Appendix A, in which Algorithm 1 is developed From the above algorithm, we know that the least two elements in N are always 0 and (m k s ) and we have jn j k s Example 1 Consider the multiplication of a x ˆx 4 x and b x ˆx x 1 over GF 5 with the underlying irreducible polynomial f x ˆx 5 x 4 x x 1 We have f m k i ; 1 i sg ˆf1; ; g Applying Algorithm 1, we generate the tree D as shown in Fig 1, and get the multiset Hˆf0; 1; ; ; ; ; ; g Thus, we have S 0 ˆf0g )js 0 jˆ1; S ˆf; g )js jˆ; S 1 ˆf1g )js 1 jˆ1; S ˆf; ; ; g )js jˆ4: Since js j mod ˆjS j mod ˆ 0, we get Nˆf0; 1g Therefore, we perform the multiplication as

4 ZHANG AND PARHI: SYSTEMATIC DESIGN OF ORIGINAL AND MODIFIED MASTROVITO MULTIPLIERS FOR GENERAL IRREDUCIBLE Define E j ˆ h i e j ; e j # 1Š; ; e j # m Š : We can write the Toeplitz matrix F in Theorem 1 as P s iˆ0 E k i 1, where k 0 ˆ 0, and have Fig 1 Tree structure h c ˆ M b ˆ I 55 ; i F! nš nn A b ˆ ˆ ˆ 0 ; where F ˆ h i f; f # 1Š; ; f # m Š ˆ : Thus, we have c x ˆ a x b x mod f x ˆx 4 Multiplier Architecture Applying Theorem 1 and Algorithm 1, in this section, we develop an efficient computation approach for the product matrix M and present the corresponding lowcomplexity Mastrovito multiplier architecture In the following, we frequently use two special matrices: the Toeplitz matrix and the upper-triangular Toeplitz matrix It is well-known that the sum of two upper-triangular Toeplitz matrices (or one Toeplitz matrix and one upper-triangular Toeplitz matrix) can be obtained by only computing the sum of two row vectors This property will be used to exploit the subexpression sharing in the construction of product matrix M to reduce the entire OR complexity Given a general irreducible polynomial f x ˆx m x ks x k1 1, we express its coefficient vector as f ˆ e 1 e k1 1 e ks 1, where e i is the m-dimensional ith canonical unit vector: e i ˆ 0; ; 0; 1; 0; ; 0 Š T : {z } {z } i 1 m i U ˆ F! nš ˆs nn E ki 1! nš: iˆ0 nn Moreover, according to (1), we write A in block matrix form as A ˆ A T s ; AT t ŠT ; where A s is an m m lower-triangular Toeplitz matrix and A t is an m 1 m upper-triangular Toeplitz matrix: a a 1 a 0 0 A s ˆ 4 5 ; a m 1 a m a 0 0 a m 1 a a a a A t ˆ 4 5 : a m 1 Substituting (), (8), and (9) into (5), we get M ˆ A s s iˆ0 nn 8 9 E ki 1! nša t : 11 Based on the special structure of E j as defined in (), it can be easily proven that E j! nša t ˆ ~At! nš # j 1 Š; 1 where ~ A t ˆ A T t ; ošt and o is an m-dimensional zero column vector Thus, if we denote P nn ~ A t! nš as S, (11) can be rewritten as M ˆ A s s iˆ0 ˆ A s S s nn ~A t! nš # k i Š S # k i Š : 1 Because each A ~ t! nš is an upper-triangular Toeplitz matrix, we easily know that matrix S is also an uppertriangular Toeplitz matrix with the following form: 0 s m 1 s 1 S ˆ s m and computing S 1; : ˆ ~At 1; :! nš ˆ nn nn A t 1; :! nš 15

5 8 IEEE TRANSACTIONS ON COMPUTERS, VOL 50, NO, JULY 001 Fig General Mastrovito Multiplier architecture is sufficient to completely construct matrix S Since the first entry of A t 1; : is zero, the first n 1 entries of A t 1; :! nš are also zeroes Recalling that n ˆ 0 is the least element in N, we get the OR complexity for computing S 1; : based on (15) is m n 1 ˆ m n 1 m 1 1 nn & nˆ0 nn and if binary tree structure is used, the delay will be dlog jn jet After having obtained S, we use (1) to compute the product matrix M From (10) and (14), we know that the addition of A s and S does not need any OR gates and the matrix A s S, denoted by T, is a Toeplitz matrix Therefore, (1) can be rewritten as M ˆ T S # k 1 Š S # k s Š: 1 In order to compute (1) efficiently, we introduce the following observation: Observation 1 Given an m m upper-triangular Toeplitz matrix S and m m matrix P whose last m l rows form a Toeplitz submatrix If m>jl, then computing the sum of vector P j 1; : and S 1; : is sufficient to construct the sum of P and S # jš and the last m j rows of the sum matrix still form a Toeplitz submatrix Applying Observation 1, we compute M based on (1) with linear tree structure as follows: Algorithm 1 Initially, set Q 0 ˆ T; For 1 i s, construct Q i ˆ Q i 1 S # k i Š by computing Q i k i 1; : ˆQ i 1 k i 1; : S 1;: ; Finally, set M ˆ Q s In the above algorithm, for each i, Q i 1 k i 1; : S 1;: requires m 1 OR gates, so we need s m 1 OR gates to compute M with the delay of st In order to obtain the result c x via c ˆ M b, we also need m AND gates and m m 1 OR gates to complete this matrix-vector multiplication Its delay will be T A dlog met if binary tree structure is used Based on the above computation approach, we develop the dedicated Mastrovito multiplier architecture as shown in Fig Given irreducible polynomial f x ˆx m x ks x k1 1, set N is constructed using Algorithm 1 Product Matrix Module computes matrix M and consists of two blocks: M1 and M Block M1 generates the vector S 1; : by computing P nn A t 1; :! nš Supplied with S 1; :, block M computes the product matrix M using Algorithm M contains (m 1) P j blocks, each one generates one row vector of M Ifj fk i ; 1 i sg, then P j is identical to block B; otherwise, it is identical to block B1 The Matrix-Vector Multiplication Module computes M b and consists of m identical V blocks The block V computes the inner-product of two vectors of length m The total complexity of the proposed original Mastrovito multiplier for general irreducible polynomial is given by OR Complexity: m s 1 m 1 nn m n 1 ; AND Complexity: m, Delay: T A s dlog jn je dlog me T It needs to be pointed out that, for a given irreducible polynomial, there likely exist some common items in the computation of (15) and (1) By sharing these common items, we may further optimize the hardware architecture of product matrix module to some extent Thus, the multiplier architecture shown in Fig may need some corresponding modifications and the above OR complexity value is actually an upper bound for general cases 4 MODIFIED MASTROVITO MULTIPLIER Using a similar design methodology as in Section, in this section, we develop a systematic design approach for the modified Mastrovito multiplier [15], which is preferable if high-hamming weight irreducible polynomial is used 41 Proposed Theorem and Algorithm For a polynomial f x ˆf m x m f m 1 x m 1 f 1 x f 0, we define its complementary polynomial as p x ˆ 1 f m x m 1 f m 1 x m 1 1 f 1 x 1 f 0 :

6 ZHANG AND PARHI: SYSTEMATIC DESIGN OF ORIGINAL AND MODIFIED MASTROVITO MULTIPLIERS FOR GENERAL IRREDUCIBLE 9 According to [15], we have the following theorem to construct matrix U in (5) using the complementary irreducible polynomial: Theorem 41 Given an irreducible polynomial f x ˆx m x ks x k1 1; its complementary polynomial can be written as p x ˆx t m s 1 x t 1 ; where m>t m s 1 > >t 1 > 0 Let p represent the coefficient vector of p x, then we can obtain the matrix U in (5) by adding two matrices V and Q, where V is constructed recursively as V :;i ˆ 8 p; i ˆ 1 >< p # 1Š V m; 1 1 p e 1 ; i ˆ V :;i 1 # 1Š V m; i e 1 >: V m; i V m; i 1 p; i m 1; where e 1 is the first canonical unit vector and Q ˆ w e T 1 V m; :! 1Š ; where w ˆ 1; ; 1 Š T : {z } m In the following, we propose a theorem and an algorithm to construct matrix V in Theorem 41 by adding a series of Toeplitz matrices together Theorem 4 For the computation of matrix V in Theorem 41, we can always find two sets Lf0; 1; ;m g and Jf1; ;m g, and V ˆ P! lš E 1! jš; ll jj where E 1 is defined in () and h i P ˆ p; p # 1Š; ; p # m Š : The sets L and J in Theorem 4 are constructed by the following algorithm: Algorithm 41 Input: The parameters of complementary polynomial: m, t 1 ; ;t m s 1 ; Output: set Lf0; 1; ;m g and Jf1; ;m g Procedure: 1 Generate a weighted tree D according to the following properties: The root d 1 always has one child node d connected by an edge with weight w ˆ 1 Besides d, root d 1 has at most m s 1 other child nodes, where the weight of edge w f m t i ; m t i 1 ; 1 i m s 1 g; 8 d j ˆ d 1, it has at most m s 1 child nodes, where the weight of edge w f m t i ; m t i 1 ; 1 i m s 1 g; Let h d 1 ;d j denote the weight of path from d 1 to d j, where h d 1 ;d 1 ˆ0, we have 8 d j, if 9 r f m t i ; m t i 1 ; 1 i m s 1 g and h d 1 ;d j r <m 1, then d j always has a child node d l and the weight of edge between d j and d l is r; 8d j D, h d 1 ;d j <m 1 Construct a subset, denoted as ~D, of all nodes in D such that it contains each node which is connected with its parent node by an edge with the weight z f m t i 1 ; 1 i m s 1 g; Construct multisets Hˆfh d 1 ;d j ; 8d j Dg and Gˆf1;h d 1 ;d i ; 8 d i ~Dg and sets LˆJ ˆ;; 4 For 0 j m, do a create S j ˆT j ˆ;; b 8 h H,ifh ˆ j, then insert h into S j ; 8 g G,if g ˆ j, then insert g into T j ; c if js j j mod ˆ1, then insert j into the set L; if jt j j mod ˆ1, then insert j into the set J A proof of Theorem 41 is given in Appendix B in which Algorithm 41 is developed Example 41 Consider the construction of matrix U when the high Hamming weight irreducible polynomial f x ˆx x x 5 x x x 1 is being used Its complementary polynomial is p x ˆx 4 We have f m t i ; m t i 1 ; 1 i m s 1 g ˆ f; 4g Applying Algorithm 41, we generate the tree D as shown in Fig From this tree, we get Hˆf0; 1; ; 4; 4; 5g and Gˆf1; 4; 5g Thus, we obtain Lˆf0; 1; ; 5g and Jˆf1; 4; 5g So, matrix V is computed as V ˆ P! lš E 1! jš ll jj ˆ ˆ : Since V ; : ˆ Š, we have V ; :! 1Š ˆ Š and

7 40 IEEE TRANSACTIONS ON COMPUTERS, VOL 50, NO, JULY 001 V A t ˆ P! lš E 1! jš A t ll jj m s 1 ˆ ~A t! lš # t i Š ~A t! jš; jj ll Fig Tree structure Q ˆ w e T 1 V ; :! 1Š ˆ w Š: Finally, we have U ˆ V Q ˆ ˆ : Multiplier Architecture In this section, based on the above theorem and algorithm, we develop an efficient computation approach and corresponding multiplier architecture for the modified Mastrovito multiplication scheme According to (5), (9), and Theorem 41, we have M ˆ I mm ; UŠA ˆ A s U A t ˆ A s V A {z } t Q A t ; 18 {z } M 1 M and the modified Mastrovito multiplication is carried out as c ˆ M b ˆ M 1 b M b: In the following, we develop efficient approaches for computing M 1 and M, respectively Let's begin with M 1 Recall that the complementary polynomial is p x ˆx tm s 1 x t1, thus the matrix P in Theorem 4 can be written as m s 1 P ˆ E ti 1; 19 where E i is defined in () Applying (1) and (19), we have where A ~ t ˆ A T t ; ošt and o is an m-dimensional zero column vector Denote P ~ jj A t! jš and P ~ ll A t! lš as ~S 1 and S ~, respectively, we have m s 1 M 1 ˆ A s ~ S 1 ~S # t i Š: 0 Since each A ~ t! nš is an upper-triangular Toeplitz matrix, we know that both S ~ 1 and S ~ are also upper-triangular Toeplitz matrices, and computing ~S 1 :; 1 ˆ A t 1; :! jš jj ~S :; 1 ˆ ll A t 1; :! lš 1 is sufficient to construct S ~ 1 and S ~ Since the least element in L is always 0, similar to (1), we obtain the OR complexities of computing S ~ 1 and S ~ as m j 1 m min J 1 and jj m l 1 m 1 ll with the delay of dlog jj jet and dlog jljet, respectively Similarly to Algorithm, for the computation of M 1 using (0), we have the following algorithm: Algorithm 4 1 Initially, set Q 0 ˆ A s ~ S 1 ; For 1 i m s 1, construct Q i ˆ Q i 1 ~S # t i Š by computing Q i t i 1; : ˆQ i 1 t i 1; : ~ S 1; : ; Finally, set M 1 ˆ Q m s 1 In the above algorithm, the construction of Toeplitz matrix Q 0 doesn't need any gates For 1 i m s 1, each step needs m 1 OR gates, so the OR complexity and delay of the above algorithm is m s 1 m 1 and m s 1 T, respectively Based on the above discussion, we conclude that the matrix M 1 can be computed with the following procedure: Procedure 41 1 Given a high-hamming weight irreducible polynomial f x ˆx m x ks x k1 1, get its complementary polynomial p x ˆx t m s 1 x t 1 and construct two sets L and J using Algorithm 41; Construct S ~ 1 and S ~ using (1), if we combine L and J together to form a new multiset L, then the complexity of this step is

8 ZHANG AND PARHI: SYSTEMATIC DESIGN OF ORIGINAL AND MODIFIED MASTROVITO MULTIPLIERS FOR GENERAL IRREDUCIBLE 41 Fig 4 General Modified Mastrovito Multiplier architecture OR complexity: ll m l 1 m 1 min J ; Delay: dlog DeT, where D ˆ max jlj; jj j ; Then, use Algorithm 4 to compute matrix M 1 For this step, we have OR complexity: m s 1 m 1 ; Delay: m s 1 T Next, let's consider the computation of M Denote the vector e T 1 V m; :! 1Š as qt From Theorem 41, we know that all row vectors in Q are equal to q T Thus, each row vector in M is equal to q T A t Without loss of generality, suppose q T has d nonzero entries whose positions are r d ; ;r 1, respectively, where r d > >r 1, then we have q T A t ˆ d A t r i ; : : Since A t is an upper-triangular Toeplitz matrix, we have A t r i ; : ˆA t 1; :! r i 1 Š, thus () can be rewritten as q T A t ˆ d A t 1; :! r i 1 Š: Therefore, the computation of M only needs P d iˆ m r i OR gates with the delay of dlog det In the above, we have developed the approaches for computing M 1 and M In order to complete the GF m multiplication, we only need to compute c ˆ M b ˆ M 1 b M b: The matrix-vector multiplication M 1 b requires m m 1 OR and m AND gates with the delay of T A dlog met Since all row vectors in M are identical and the first r 1 elements in each row are zeros, the matrix-vector multiplication M b only requires m r 1 1 OR and m r 1 AND gates with the delay of T A dlog m r 1 et The addition of M 1 b and M b needs m OR gates with the delay of T Based on the above computation approach, we develop the modified Mastrovito multiplier architecture as shown in Fig 4 The set L and J are constructed using Algorithm 41 Product Matrix Module computes the two matrices M 1 and M and consists of four blocks: PM1, PM, PM, and PM4 Blocks PM1 and PM generate the vector ~ S 1; : and ~S 1 1; :, respectively Block PM computes the matrix M 1 using Algorithm 4 PM contains (m 1) P j blocks, each one generates one row vector of M If j ft i ; 1 i m s 1g, then P j is identical to block B; otherwise, it is identical to block B1 Block PM4 computes the vector q T A t, the row vector of matrix M, where r 1 ; ;r d represent the position of the nonzero elements in q T The Matrix-Vector Multiplication & Addition Module computes M 1 b M b In this architecture, the operation of M 1 b and M b can be performed in parallel If their delays are denoted by T A m 1 T and T A m T, the total delay of the modified Mastrovito multiplier will be T A 1 max m 1 ;m T Therefore, the total complexity of modified Mastrovito multiplier is OR complexity: m s m 1 ll m l 1 d m r i min J ; AND complexity: m m r 1, Delay: T A 1 max m 1 ;m T, where multiset L is the combination of L and J, d is the Hamming weight of q T, and r i represents the index of nonzero element of q T, and

9 4 IEEE TRANSACTIONS ON COMPUTERS, VOL 50, NO, JULY 001 m 1 ˆ m s 1 dlog max jlj; jj j e dlog me; m ˆdlog de dlog m r 1 e: We note that, for given high Hamming weight irreducible polynomial, further hardware optimization is possible by sharing common items during computation of M 1 and M So, the above OR complexity result is also an upper bound for general cases 5 SPECIAL IRREDUCIBLE POLYNOMIALS In Section, we presented an efficient computation approach of original Mastrovito multiplication for general cases and pointed out that further simplification can be achieved for specific irreducible polynomials by further exploiting subexpression sharing In this section, we show that by applying our proposed explicit algorithm for the construction of set N, we can easily obtain efficient multiplication schemes with further reduced complexity for several special irreducible polynomials 51 m k s In the following, we show that, for irreducible polynomial f x ˆx m x k s x k 1 1, where m k s, we can reduce the OR complexity of the Mastrovito multiplier by computing S 1; : in (15) using linear tree structure instead of binary tree Since m k s, we easily have Nˆf0;m k s ;m k s 1 ; ;m k 1 g and jn j ˆ s 1 Thus, (15) can be simplified as S 1; : ˆA t 1; : s A t 1; :! m k i Š : 4 Using linear tree structure, we compute S 1; :, according to (4), as follows: Algorithm 51 1 Initially, set v T 0 ˆ A t 1; : ; For 1 i s, compute v T i ˆ v T i 1 A t 1; :! m k i Š; Finally, set S 1; : ˆv T s The OR complexity of the above algorithm is m n 1 m 1 ˆs k i 1 ; nn 5 with the delay of st Compared with using binary tree, the OR complexity doesn't change, but delay increases from dlog s 1 et to st Next, we will prove that, as compensation for the increased delay, the OR complexity of computing M using Algorithm can be reduced from s m 1 to P s m k i by sharing the intermediate results obtained in Algorithm 51 Proof In Algorithm, based on Observation 1, we construct each matrix Q i by only computing Q i k i 1; : ˆQ i 1 k i 1; : S 1; : : Moreover, since the last m k i rows of each matrix Q i form a Toeplitz submatrix, we have Q i 1 k i 1; : ˆQ i 1 k i 1 1; :! k i k i 1 Š: Therefore, based on (), (), and the fact that Q 0 ˆ T ˆ S A s, by induction we have Q i k i 1; : ˆi From (10), we have Thus, jˆ1 S 1; :! k i k j Š S 1; :! k i Š A s k i 1; : : A s k i 1; : ˆ a ki ;a ki 1 ; ;a 1 {z } k i ;a 0 ; 0; ; 0Š; A t 1; : ˆ 0;a m 1 ; ;a ki 1 {z } m k i ;a ki ; ;a 1 Š: 8 A s k i 1; : ˆA t 1; : m k i Š e T k i 1 a 0; 9 where e ki 1 is the k i 1 th canonical unit vector Therefore, (8) becomes Q i k i 1; : ˆi jˆ1 S 1; :! k i k j Š S 1; :! k i Š A t 1; : m k i Š e T k i 1 a 0: 0 Since each Q i k i 1; : is an m-dimensional row vector, the first k i entries of Q i k i 1; : are identical to the last k i entries of Q i k i 1; :! m k i Š From (14), we also know that 8 q m 1, S 1; :! qš is a zero row vector Therefore, according to (0), the last k i entries of Q i k i 1; :! m k i Š are identical to the last k i entries of ~q T i, where ~q T i is defined as ~q T i ˆ i jˆ1 S 1; :! m k j Š A t 1; : : Substituting (4) into (1), we have ~q T i ˆ i jˆ1 s jˆ1 A t 1; :! m k j Š i A t 1; :! m k j m k i Š A t 1; : : 1 Since m k s, we have 8 i; j s, m k i m k j m and A t 1; :! m k j m k i Š is actually a zero row vector So, we can rewrite () as ~q T i ˆ i jˆ1 A t 1; :! m k j Š A t 1; : : We note that ~q T i is identical to v T i, which we have obtained in Algorithm 51 So, the first k i entries of Q i k i 1; : are equal to the last k i entries of the

10 ZHANG AND PARHI: SYSTEMATIC DESIGN OF ORIGINAL AND MODIFIED MASTROVITO MULTIPLIERS FOR GENERAL IRREDUCIBLE 4 intermediate result v T i in Algorithm 51 Thus, in Algorithm, for each 1 i s, we only need to compute the last (m k i ) entries of Q i k i 1; :, which requires m k i OR gates, and the total OR complexity of Algorithm is P s m k i with the delay of st tu Therefore, the entire OR complexity of computing product matrix M is s k i 1 s m k i ˆs m 1 with the delay of st Moreover, the matrix-vector multiplication M b requires m m 1 OR and m AND gates with the delay of T A dlog met Thus, for an irreducible polynomial in which m k s, if the linear tree structure is used to compute S 1; :, the entire complexity of Mastrovito multiplier is OR complexity: m s m 1, AND complexity: m, Delay: T A s dlog me T 5 Trinomial If GF m is generated by irreducible trinomial f x ˆx m x n 1, we have Nˆ l m n ; 0 l m : m n Therefore, we get k ˆb m mn c, (15) and (1) can be simplified as S 1; : ˆk iˆ0 Obviously, computing A t 1; :! i m n Š ; 4 M ˆ A s S S # nš: M n 1; : ˆA s n 1; : S n 1; : S 1; : 5 is sufficient to construct M in (5) According to the complexity results presented in Section, we know that the OR complexities of computing (4) and () are P k m i m n 1 and m 1, respectively In the following, we will see that the above complexity values can be further reduced First, we show that m n, instead of m 1, OR gates are sufficient to complete the computation in () From (9), we have A s n 1; : ˆA t 1; : m n Š e T n 1 a 0 and, since S is an upper-triangular Toeplitz matrix, its n 1 th row can be written as S n 1; : ˆS 1; :! nš k ˆ At 1; :! i m n Š! nš iˆ0 ˆ A t 1; :! nš: So, () can be rewritten as M n 1; : ˆA t 1; : m n Š e T n 1 a 0 A t 1; :! nš S 1; : : Let's consider the addition of A t 1; : m n Š and S 1; : in () Because the last m n entries of A t 1; : m n Š are zeros, we only need to compute the first n entries of A t 1; : m n Š S 1; : Obviously, the first n entries of A t 1; : m n Š S 1; : are equal to the last n entries of A t 1; : S 1;:! m n Š and we have A t 1; : S 1;:! m n Š k ˆ A t 1; : A t 1; :! i m n Š! m n Š ˆ k iˆ0 iˆ0 A t 1; :! i m n Š ˆ S 1; : A t 1; :! k 1 m n Š: A t 1; :! k 1 m n Š Since k ˆb m m nc, we have k 1 m n m 1 Thus, the item A t 1; :! k 1 m n Š is actually a zero vector and the first n entries of A t 1; : m n Š S 1; : are identical to the last n entries of S 1; : Therefore, the addition of A t 1; : m n Š and S 1; : does not need any OR gates Furthermore, in (), the sum of the other two items, e T n 1 a 0 A t 1; :! nš ˆ 0; ; 0;a {z } 0 ;a m 1 ; ;a n 1 Š; {z } n m n has m n nonzero entries Thus, we conclude that the computation of () only needs m n OR gates The OR complexity of computing (4) can be reduced by using either the linear tree or hybrid tree method, which will lead to different trade-off between OR complexity and delay 51 Linear Tree If (4) is computed using linear tree structure as P 0 iˆk A t 1; :! i m n Š, we achieve the lowest OR complexity with the delay increasing linearly with k This approach has been thoroughly studied in [8], in which it was proven that n 1 OR gates are sufficient to compute (4) with the delay of kt We have known that the computation of (5) needs m n OR gates with delay of 1T Thus, the total complexity of the Mastrovito multiplier in [8] was obtained as: OR complexity: m 1, AND complexity: m, Delay: T A k 1 dlog me T, k ˆb m m n c Especially, it's pointed out in [8] that if n ˆ m, then the OR complexity can be further reduced from (m 1) to (m m ), with the delay reduced by 1T 5 Hybrid Tree We have known that using linear tree structure to compute (4) will lead to a very low OR complexity with the delay increasing linearly with k However, when k is very large, the delay may be intolerable for applications requiring high speed In the following, we present another approach for computing (4), where the OR complexity also can be

11 44 IEEE TRANSACTIONS ON COMPUTERS, VOL 50, NO, JULY 001 Fig 5 Hybrid tree computation scheme reduced by exploiting subexpression sharing and the delay increases linearly with blog k 1 c Let the Hamming weight of k 1 be h and blog k 1 c ˆ t h, then k 1 can always be written as k 1 ˆ t h t h 1 t 1, where t h >t h 1 > >t 1 If we denote A t 1; :! i m n Š as A t 1; : i, (4) can be rewritten as k iˆ0 where A t 1; : i w h 1 ˆ A t 1; : i w h 1 1 A t 1; : i iˆ0 w 1 1 iˆw iˆw h A t 1; : i ˆ g T h gt h 1 gt 1 ; w j ˆ h iˆj 8 t i ; 1 j h: 9 We compute the vector g T h using a binary tree B with the height of t h as shown in Fig 5, where each node represents a m-dimensional vector At each depth j, there are th j nodes with the following values: j 1 j 1 1 t h 1 A t 1; : i ; A t 1; : i ; ; A t 1; : i : 40 iˆ0 iˆ j iˆ t h j In (40), all other items can be directly obtained by right shifting the first item P j 1 iˆ0 A t 1; : i by l j columns, where 1 l th j 1 So, only computing the first node at each depth is sufficient to construct the whole binary tree The first node at depth j is computed by adding the first two nodes at depth j 1, where m j 1 m n 1 OR gates are required Thus, the OR complexity of computing g T h is th jˆ1 m j 1 m n 1 ˆ m 1 t h t h 1 m n : 41 Moreover, each vector g T i in (8), where 1 i h 1, can be obtained by right shifting the first node at depth t i in the binary tree B Thus, we compute P 1 iˆh gt i using linear tree L and obtain a hybrid tree to compute (8) as shown in Fig 5 From (8), we know that there are only m 1 w i 1 m n nonzero entries in vector g T i Therefore, it requires h 1 m 1 w i 1 m n ˆ h 1 m 1 h iˆ w i m n 4 OR gates to compute the linear tree L Combining (41) and (4) gives the total OR complexity for the computation of (8) as! t h h 1 m 1 h w i t h 1 m n iˆ and the delay is t h 1 T We have known that computing (5) requires m n OR gates with the delay of 1T Therefore, in trinomial cases, if the above hybrid tree structure is employed to compute (4), the total computation complexity of Mastrovito multiplier will be OR complexity: m t h h 1 m 1 h iˆ w i t h m n ; AND complexity: m, Delay: T A t h dlog me T, where t h ˆblog k 1 c, h is the Hamming weight of k 1, and w i is defined in (9) 5 Pentanomial A polynomial f x ˆx m x ks x k1 1 is called pentanomial if s ˆ For general irreducible pentanomials, it's impossible to write the set N in a simple form, eg, as in the trinomial case, and we have to use the general approach presented in Section to design the product matrix module and perform the possible hardware optimization for the dedicated irreducible pentanomial In the following, we present two special irreducible pentanomials for which the set N has a simple form and the complexity of corresponding Mastrovito multipliers can be easily obtained 51 Special Case 1 For irreducible pentanomials where m k, it follows from the analysis in Section 51 that if linear tree structure is used to compute S 1; :, the total complexity of the Mastrovito multiplier is OR complexity: m m 1, AND complexity: m, Delay: T A dlog me T 5 Special Case For such irreducible pentanomials that m k ˆ k k ˆ k k 1 ˆ r;

12 ZHANG AND PARHI: SYSTEMATIC DESIGN OF ORIGINAL AND MODIFIED MASTROVITO MULTIPLIERS FOR GENERAL IRREDUCIBLE 45 applying Algorithm 1, we have Nˆ n ˆ 4l r; n ˆ 4l 1 r; 0 l d ; 4 4 where d ˆb m c So, (15) can be simplified as where r S 1; : ˆbd 4 c lˆ0 g T! 4l rš ; 44 g T ˆ A t 1; : A t 1; :! rš: 45 The vector g T is computed using m r 1 OR gates with the delay of 1T If we use linear tree to compute (44) as P 0 lˆb d 4 c gt! 4l rš, then, applying the results in [8], it only requires m 4r 1 OR gates with the delay of b d 4 ct Therefore, the total OR complexity of computing S 1; : is m 5r with the delay of b d 4 c 1 T Furthermore, using Algorithm, we need m 1 OR gates to compute M with the delay of T Thus, in this case, the total complexity of the Mastrovito multiplier is given by OR complexity: m m 1 m 5r, AND complexity: m, Delay: T A b d 4 c 4 dlog me T 54 ESP A polynomial f x ˆ1 x r x r x tr x m, where t 1 r ˆ m, is called an ESP (equally-spaced-polynomial) An ESP with r ˆ 1 is usually referred as an AOP (all-onepolynomial) For irreducible ESP, applying Algorithm 1, we have Nˆf0;rg, based on which it can be shown that matrix U in (5) always has the following form: U ˆ L E 1! rš; where E 1 is defined in () and I rr j L ˆ 4 j j I rr O mm 1 r 5; where O ij and I ii represent i j zero matrix and i i identity matrix, respectively Therefore, the product matrix M can be computed as M ˆ A s U A t ˆ A s L A t E 1! rša t : Let Q 1 denote L A t, we have Q 1 ˆ L A t ˆ P T ; P T ; ; P T T ; 4 where r m matrix P ˆ A t 1:r; : Let Q denote A s E 1! rša t, we have Q ˆ A s E 1! rša t ˆ A s ~ A t! rš; 4 where ~ A t ˆ A T t ;ošt Obviously, Q 1 and Q are obtained without any computation We perform the GF m multiplication as c ˆ M b ˆ Q 1 b Q b: According to (4), we know that only computing P b is sufficient to obtain the result of Q 1 b Since P ˆ A t 1:r; :, we have P i; : ˆA t i; : ˆA t 1; :! i 1 Š; which shows that the first i entries in P i; : are zeros Therefore we get the complexity of computing Q 1 b as #ofand ˆ r #ofor ˆ r m i ˆmr r r ; m i 1 ˆmr r r ; and the delay is T A dlog m 1 et From the definition of A s and A t in (10), we know that each row vector in Q 1:m r; : contains r zero entries and each row vector Q m r i; : contains r i zeros, where 1 i r Therefore, in the computation of Q b, the numbers of AND and OR gates are #ofand ˆ m r m r r ˆ m mr r r #ofor ˆ m r 1 m r r m r i m r i 1 ˆ m mr r r m; and its delay is T A dlog met Obviously, Q 1 b and Q b can be computed in parallel At last, we also need m OR gates to add Q 1 b and Q b together to get the final result c Therefore, we get the total complexity as follows: OR complexity: m r, AND complexity: m, Delay: T A 1 dlog me T Here, we note that the above OR complexity result is identical to that obtained in [14] CONCLUSIONS In this paper, we have presented a systematic design approach for Mastrovito multipliers The complexity results are m AND gates and at most P nn m n 1 m 1 OR gates We note that, although the complexity results appear the same as those presented in [14], we propose an explicit algorithm to compute the set N, which makes our design really practical We have extended this design approach to the modified Mastrovito multiplication scheme, which is suitable for high-hamming weight irreducible polynomials For both original and modified Mastrovito multipliers with general irreducible polynomials, the developed computation approach effectively exploits subexpression sharing and the complexity analyses are given in detail The corresponding hardware architectures for both cases are highly modular Meanwhile, in this paper, we have studied several special irreducible polynomials For trinomials and ESPs, the complexity results

13 4 IEEE TRANSACTIONS ON COMPUTERS, VOL 50, NO, JULY 001 match the best known results achieved in [8] and [14] We also present another computation approach for trinomials to provide a trade-off between OR complexity and delay Moreover, several other special irreducible polynomials, which also lead to low-complexity implementation, have been discovered and corresponding complexities are given Finally, we note that, with the explicit algorithms and design procedures, all the proposed efficient design schemes can be easily employed by VLSI automation design tools for dedicated bit-parallel GF m multiplier design APPENDI A PROOF of THEOREM 1 In the following, we prove Theorem 1 in two steps, through which Algorithm 1 is developed: 1) First, we will show that the matrix U is equal to the sum of a series of matrices constructed by the following procedure: Procedure A-1 1 Initially, set i ˆ n ˆ 1 and create a matrix multiset WˆfU 1 g We define U 1 ˆ f; o; ; o Š; {z } m 1 where o is the m-dimensional zero column vector; 8 U l W, shift down its ith column by one position to serve as its i 1 th column: U l :;i 1 ˆU l :;i # 1Š: 8 U l W, if the last entry of U l 's ith column vector, U l m; i, is 1, then {Increase n by 1, create U l 's child matrix U n as U n ˆ o; ; o; f; o; ; oš {z } {z } i m i and insert U n into W (U l is called the parent matrix of U n )}; 4 Increase i by 1, if i ˆ m 1, procedure terminates, else return to Step Using the above procedure, we obtain a multiset W containing N matrices U 1 ; ; U N, where N ˆjWj is the order of W Each element matrix is m by m 1 and has the form: h i U i ˆ o; ; o; f; f # 1Š; ; f # m r {z } i Š : {z } r i m 1 r i In the following, we prove by induction that the sum of all U i sinw is identical to the matrix U: U :;l ˆN U i :;l ; 1 l m 1: When i ˆ 1, from Procedure A-1, we have U 1 :; 1 ˆf and U i :; 1 ˆo, 8 i>1 From (4), we also have U :; 1 ˆf So, we get U :; 1 ˆN Assume, for l 1, we have U :;l ˆN U i :; 1 : U i :;l : A:1 Applying the recursive relation between the successive columns of U as shown in (4), we compute U :;l 1 as follows: U :;l 1 ˆU :;l # 1Š U m; l f N N ˆ U i :;l # 1Š ˆ N U i :;l # 1Š N U i m; l f U i m; l f : A: According to Procedure A-1, we know that there always exist two integers 1 r t N and if i r, then matrix U i is created before the lth iteration in the procedure and U i :;l # 1Š ˆU i :;l 1 ; if r<i<t, then U i is created at the lth iteration, U i :;l # 1Š is zero vector, and U i :;l 1 ˆf; if t i, then U i is created after the lth iteration, both U i :;l # 1Š and U i :;l 1 are zero vectors We also know that each U i m; l ˆ1 corresponds to a matrix created at the lth iteration Thus, we have N N U i :;l # 1Š ˆ r U i :;l 1 N U i :;l 1 ; iˆt ˆ t 1 U i m; l f iˆr 1 U i :;l 1 : Substituting the above two equations into (A), we get r U :;l 1 ˆ U i :;l 1 N U i :;l 1 ˆ N t 1 iˆs 1 U i :;l 1 U i :;l 1 : iˆt A: Thus, we have derived (A) from the assumption (A1) Therefore, we can conclude that the sum of all matrices in W is equal to matrix U ) Next, we introduce another method to construct multiset W with the aid of a weighted tree Without loss of generality, let the irreducible polynomial be f x ˆx m x ks x k1 1; where m>k s > >k 1 > 1 Thus, for 0 <j<m, only when j f m k i ; 1 i sg, the last entry of f # j 1 Š is 1 From Procedure A-1, we have

14 ZHANG AND PARHI: SYSTEMATIC DESIGN OF ORIGINAL AND MODIFIED MASTROVITO MULTIPLIERS FOR GENERAL IRREDUCIBLE 4 U 1 ˆ h i f; f # 1Š; ; f # m Š and, except U 1, each other matrix U i in W can be constructed by shifting right its parent matrix by w columns, where w f m k i ; 1 i sg Thus, we can construct a weighted tree D in which each node d i represents the matrix U i in W and the weight of each edge represents the number of columns by which the parent matrix right shifts to generate its child matrix According to Procedure A-1, it can be shown that this tree is uniquely determined by the following property: Property A-1 1 Each node d j in D has at most s child nodes and each edge has the weight w f m k i ; 1 i sg Let d 1 denote the root and h d 1 ;d j denote the weight of path from d 1 to d j, where h d 1 ;d 1 ˆ0, we have 8 d j, if 9 r f m k i ; 1 i sg and h d 1 ;d j r <m 1, then d j always has a child node d l and the weight of edge between d j and d l is r 8d j D, h d 1 ;d j <m 1 Since U 1 is identical to the matrix F in Theorem 1, we can construct W as WˆfF! hš; 8h Hg; where Hˆfh d 1 ;d i ; 8 d i Dg Obviously, the multiset H may contain some repeated elements, which means W may contain some identical matrices Because here the addition is logic OR, the sum of two identical matrices is actually a zero matrix So, we can remove those repeated elements in pairs from multiset H using the following algorithm: Algorithm A-1 1 Initially, set Nˆ; For 0 j m, do create S j ˆ;, 8 h H,ifhˆj, then insert h into S j If js j j mod ˆ1, then insert j into the set N Using the above algorithm, we construct a new set N f0; 1; ;m g and F! nš ˆ F! hš ˆU: nn hh We note that combining Property A-1 and Algorithm A-1 just produces Algorithm 1 in Section 1 tu APPENDI B PROOF OF THEOREM 4 Similarly to the proof of Theorem 1, we prove Theorem 4 in two steps, through which Algorithm 41 is developed: 1) First, we use the following procedure to construct two matrix multiset W and Z, where the sum of all the matrices in these two multisets is equal to matrix V Procedure B-1 1 Initially, set i ˆ, n 1 ˆ, and n ˆ 1 Create two multisets WˆfW 1 ; W g and ZˆfZ 1 g: W 1 ˆ p; p # 1Š; o; ; o Š; {z } m 1 W ˆ o; p; o; ; o Š; {z } m 1 Z 1 ˆ o; e 1 ; o; ; o Š; {z } m 1 where p represents the coefficient vector of complementary irreducible polynomial p x and e 1 is the first canonical unit vector If W 1 m; 1 ˆ1, then {Increase n 1 by 1, create W 1 's child matrix W n1 ˆ W and insert W n1 into W}; 8 W l Wand 8 Z l Z,do W l :;i 1 ˆW l :;i # 1Š; Z l :;i 1 ˆZ l :;i # 1Š: 4 8 W l W,ifW l m; i 1 ˆ1, then {Increase both n 1 and n by 1, create W l 's child matrix W n1 and as Z n W n1 Z n ˆ o; ; o; p; o; ; oš; {z } {z } i m i ˆ o; ; o; e {z } 1 ; o; ; oš {z } i m i and insert W n1 and Z n into W and Z, respectively}; 5 8 W l W,ifW l m; i ˆ1, then {Increase n 1 by 1, create W l 's child matrix W n1 as W n1 ˆ o; ; o; p; o; ; oš {z } {z } i m i and insert W n1 into W}; Increase i by 1, if i ˆ m 1, procedure terminates, else return to Step Using the above procedure, we get two multisets W and Z Similarly to the proof of Theorem 1, based on the recursive relation of successive column vectors in matrix V as shown in Theorem 41, it can be proven by induction that matrix V is identical to the sum of all matrices in W and Z: V ˆ jwj W i jzj Z j : jˆ1 B:1 ) Next, we introduce another method to construct multiset W and Z with the aid of a weighted tree Since the complementary polynomial is p x ˆx tm s 1 x t1, the last entry of p # j 1 Š or p # j Š is 1 only when j f m t i ; m t i 1 ; 1 i m s 1 g According to Procedure B-1, each child matrix in multiset W can be obtained by right shifting its parent matrix by w f m t i ; m t i 1 ; 1 i m s 1 g columns and the whole construction process begins from two matrices: W 1 and W ˆ W 1! 1Š Therefore, we can construct a weighted tree D in which every node d i represents the matrix W i in W and the weight of each edge represents the number of columns by which the parent matrix right shifts to generate its child matrix According to Procedure B-1, it can be shown that the tree D is uniquely determined by the following property:

48 IEEE TRANSACTIONS ON COMPUTERS, VOL 50, NO, JULY 001 Property B-1 1 The root d 1, representing matrix W 1, always has one child node d (representing W ) connected by an edge with weight w ˆ 1

15 48 IEEE TRANSACTIONS ON COMPUTERS, VOL 50, NO, JULY 001 Property B-1 1 The root d 1, representing matrix W 1, always has one child node d (representing W ) connected by an edge with weight w ˆ 1 Besides d, root d 1 has at most m s 1 other child nodes, where the weight of edge w f m t i ; m t i 1 ; 1 i m s 1 g: 8 d j ˆ d 1, it has at most m s 1 child nodes, where the weight of edge w f m t i ; m t i 1 ; 1 i m s 1 g: Let h d 1 ;d j denote the weight of path from d 1 to d j, where h d 1 ;d 1 ˆ0, we have 8 d j, if 9 r f m t i ; m t i 1 ; 1 i m s 1 g and h d 1 ;d j r <m 1, then d j always has a child node d l and the weight of edge between d j and d l is r 4 8d j D, h d 1 ;d j <m 1 Since W 1 is identical to the matrix P in Theorem 4, we can construct the multiset W as follows: WˆfP! hš; 8h Hg; where Hˆfh d 1 ;d i ; 8 d i Dg Moreover, the above weighted tree D also can be used to represent all the element matrices in multiset Z Let root d 1 represent matrix Z 0 ˆ E 1, each other node d i represents a matrix generated through shifting Z 0 right by h d 1 ;d i columns According to Procedure B-1, if we introduce such a subset of all nodes in D, denoted as ~D, that it contains each node which is connected with its parent node by an edge with the weight z f m t i 1 ; 1 i m s 1 g, we have REFERENCES [1] SA Vanstone and PC Oorschot, An Introduction to Error Correcting Codes with Applications Kluwer, 1989 [] AJ Menezes, Handbook of Applied Cryptography CRC, 199 [] AJ Menezes, Applications of Finite Fields Kluwer Academic, 1994 [4] R Lidl and H Niederreiter, Introduction to Finite Fields and Their Applications Cambridge Univ Press, 1994 [5] D Jungnickel, Finite Fields: Structure and Arithmetics Wissenschaftsverlag, 199 [] ED Mastrovito, ªVLSI Designs for Multiplication over Finite Fields GF m,º Proc Sixth Int'l Conf Applied Algebra, Algebraic Algorithms, and Error-Correcting Codes (AAECC-), pp 9-09, July 1988 [] KK Parhi, VLSI Digital Signal Processing Systems: Design and Implementation John Wiley & Sons, 1999 [8] B Sunar and CË K KocË, ªMastrovito Multiplier for All Trinomials,º IEEE Trans Computers, vol 48, no 5, pp 5-5, May 1999 [9] T Itoh and S Tsujii, ªStructure of Parallel Multipliers for a Class of Finite Fields GF m,º Information and Computations, vol 8, pp 1-40, 1989 [10] MA Hasan, MZ Wang, and VK Bhargava, ªModular Construction of Low Complexity Parallel Multipliers for a Class of Finite Fields GF m,º IEEE Trans Computers, vol 41, no 8, pp 9-91 Aug 199 [11] CË K KocË and B Sunar, ªLow-Complexity Bit-Parallel Canonical and Normal Basis Multipliers for a Class of Finite Fields,º IEEE Trans Computers, vol 4, no, pp 5-5, Mar 1998 [1] H Wu and MA Hasan, ªLow-Complexity Bit-Parallel Multipliers for a Class of Finite Fields,º IEEE Trans Computers, vol 4, no 8, pp 88-88, Aug 1998 [1] A Halbutogullari and CË K KocË, ªMastrovito Multiplier for General Irreducible Polynomials,º Applied Algebra, Algebraic Algorithms and Error-Correcting Codes, pp , 1999 [14] A Halbutogullari and CË K KocË, ªMastrovito Multiplier for General Irreducible Polynomials,º IEEE Trans Computers, vol 49, no 5, pp , May 000 [15] L Song and KK Parhi, ªLow Complexity Modified Mastrovito Multipliers over Finite Fields GF m,º Proc IEEE Int'l Symp Circuits and Systems, vol 1, pp , May 1999 [1] T Zhang and KK Parhi, ªSystematic Design Approach of Mastrovito Multipliers over GF m,º Proc IEEE Workshop Signal Processing Systems (SiPS), pp 50-51, Oct 000 [1] GH Golub and CF Van Loan, Matrix Computations, third ed Johns Hopkins Univ Press, 199 ZˆfE 1! gš; 8g Gg; where Gˆf1;h d 1 ;d i ; 8 d i ~Dg Therefore, (B1) can be rewritten as V ˆ W i Z j : ih jg Similarly to the discussion in Appendix A, using Algorithm A-1, we can remove those repeated elements in pairs from multiset H and G and obtain Lf0; 1; ;m g and Jf1; ;m g, respectively, and V ˆ P! lš E 1! jš: ll jj Summarizing the above approach of constructing L and J, we get Algorithm 41 in Section 41 tu Tong Zhang (S'98) received the BS and MS degrees in electrical engineering from the ian Jiaotong University, ian, China, in 1995 and 1998, respectively He is pursuing the PhD degree in electrical engineering at the University of Minnesota His current research interests include design of VLSI architectures and circuits for digital signal processing and communication systems, with the emphasis on error-correcting coding and finite field arithmetic He is a student member of the IEEE

ZHANG AND PARHI: SYSTEMATIC DESIGN OF ORIGINAL AND MODIFIED MASTROVITO MULTIPLIERS FOR GENERAL IRREDUCIBLE 49 Keshab K Parhi (S'85-M'88-SM'91-F'9) received the BTech, MSEE, and PhD degrees from the

16 ZHANG AND PARHI: SYSTEMATIC DESIGN OF ORIGINAL AND MODIFIED MASTROVITO MULTIPLIERS FOR GENERAL IRREDUCIBLE 49 Keshab K Parhi (S'85-M'88-SM'91-F'9) received the BTech, MSEE, and PhD degrees from the Indian Institute of Technology, Kharagpur (India) (198), the University of Pennsylvania, Philadelphia (1984), and the University of California at Berkeley (1988), respectively He is a distinguished McKnight University Professor of Electrical and Computer Engineering at the University of Minnesota, Minneapolis, where he also holds the Edgar F Johnson Professorship His research interests include all aspects of VLSI implementations of broadband access networks He is currently working on VLSI adaptive digital filters, equalizers and beamformers, error control coders and cryptography architectures, low-power digital systems, and computer arithmetic He has published more than 0 papers in these areas He has authored the text book VLSI Digital Signal Processing Systems (Wiley, 1999) and coedited the reference book Digital Signal Processing for Multimedia Systems (Dekker, 1999) He received the 001 WRG Baker prize paper award from the IEEE, a Golden Jubilee medal from the IEEE Circuits and Systems Society in 1999, a 199 Design Automation Conference best paper award, the 1994 Darlington and the 199 Guillemin-Cauer best paper awards from the IEEE Circuits and Systems Society, the 1991 paper award from the IEEE Signal Processing Society, the 1991 Browder Thompson prize paper award from the IEEE, and the 199 Young Investigator Award of the US National Science Foundation Dr Parhi has served on editorial boards of the IEEE Transactions on Circuits and Systems, the IEEE Transactions on Signal Processing, the IEEE Transactions on Circuits and Sytems-Part II: Analog and Digital Signal Processing, the IEEE Transactions on VLSI Systems, and the IEEE Signal Processing Letters He is an editor of the Journal of VLSI Signal Processing He is the guest editor of a special issue of the IEEE Transactions on Signal Processing, two special issues of the Journal of VLSI Signal Processing and a special issue of the Journal of Analog Integrated Circuits and Signal Processing He served as technical program cochair of the 1995 IEEE Workshop on Signal Processing and the 199 ASAP conference, and is the general chair of the 00 IEEE Workshop on Signal Processing Systems He serves on numerous technical program committees and was a distinguished lecturer of the IEEE Circuits and Systems Society ( ) For further information on this or any computing topic, please visit our Digital Library at

Mastrovito Form of Non-recursive Karatsuba Multiplier for All Trinomials

1 Mastrovito Form of Non-recursive Karatsuba Multiplier for All Trinomials Yin Li, Xingpo Ma, Yu Zhang and Chuanda Qi Abstract We present a new type of bit-parallel non-recursive Karatsuba multiplier over