WHEN the extended binary field GFð2 n Þ is viewed as a

Size: px

Start display at page:

Download "WHEN the extended binary field GFð2 n Þ is viewed as a"

Diana Conley
5 years ago
Views:

1 4 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO., FEBRUARY 7 A New Approach to Subquadratic Space Complexity Parallel Multipliers for Extended Binary Fields Haining Fan and M. Anwar Hasan, Senior Member, IEEE Abstract Based on Toeplitz matrix-vector products and coordinate transformation techniques, we present a new scheme for subquadratic space complexity parallel multiplication in GFð n Þ using the shifted polynomial basis. Both the space complexity and the asymptotic gate delay of the proposed multiplier are better than those of the best existing subquadratic space complexity parallel multipliers. For example, with n being a power of, the space complexity is about percent better, while the asymptotic gate delay is about 33 percent better, respectively. Another advantage of the proposed matrix-vector product approach is that it can also be used to design subquadratic space complexity polynomial, dual, weakly dual, and triangular basis parallel multipliers. To the best of our knowledge, this is the first time that subquadratic space complexity parallel multipliers are proposed for dual, weakly dual, and triangular bases. A recursive design algorithm is also proposed for efficient construction of the proposed subquadratic space complexity multipliers. This design algorithm can be modified for the construction of most of the subquadratic space complexity multipliers previously reported in the literature. Index Terms Finite field, subquadratic space complexity multiplier, coordinate transformation, shifted polynomial basis, Toeplitz matrix. Ç 1 INTRODUCTION WHEN the extended binary field GFð n Þ is viewed as a vector space over GFðÞ, the elements of the field are often represented with respect to a basis. For the purpose of efficient field arithmetic, a number of bases have been proposed in the literature, for example, polynomial, normal, dual, weakly dual, triangular, and shifted polynomial bases. Between the two basic field arithmetic operations, namely, addition and multiplication, it is easier to implement the former using these representations. Multiplication is inherently complex and the existing bit parallel multipliers may be classified into the following two categories on the basis of the space complexity, which is often measured in terms of the number of -input AND and XOR gates: quadratic and subquadratic space complexity multipliers. Multipliers of [5], [6], [7], [], [9], [1], [11], [1], [13], [14], [15], [16], [17], [1], [19], [] belong to the former category and those of [1], [], [3], [4], [5], [6], [7], [], [9], [3], [31], [3] the latter category. The subquadratic space complexity multiplier provides a practical solution for large values of n due to its low space complexity. To this end, the polynomial version of the integer Karatsuba multiplication algorithm [1] has been widely used, e.g., [1], [], [3], [4], [5], [6], [7], [], [9], [3]. These multipliers first perform a multiplication of two binary polynomials, each of degree n 1 or less, and then a modulo reduction operation using the field generating. The authors are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada. hfan@vlsi.uwaterloo.ca, ahasan@ece.uwaterloo.ca. Manuscript received 4 Jan. 6; revised 3 June 6; accepted 4 July 6; published online Dec. 6. For information on obtaining reprints of this article, please send to: tc@computer.org, and reference IEEECS Log Number TC irreducible polynomial. For practical applications, n may not be an even integer. In such cases, in order to apply the Karatsuba algorithm, some slight modifications are often made to the inputs, e.g., padding zeros. The asymptotic gate delay of the Karatsuba algorithm is ð3 log n 1ÞT X þ T A for n ¼ i ði>1þ, where T A and T X are the delay of one -input AND and XOR gates. The Karatsuba algorithm-based multiplier has a low asymptotic space complexity, but its gate delay is three times more than that of the existing fast parallel multipliers [31]. In order to reduce this delay, a hybrid structure has been developed in [5]. Although this method does not improve the asymptotic gate delay, it is effective for some finite fields, e.g., GFð 33 Þ. The Winograd short convolution algorithm may be viewed as a generalization of the original Karatsubabased algorithm [], [31], and [33]. For n ¼ 3 i ði>þ, the algorithm improves the asymptotic gate delay of the Karatsuba algorithm in [1] and results in an asymptotic gate delay of ð4 log 3 n 1ÞT X þ T A ð:5 log n 1ÞT X þ T A. But, the asymptotic space complexity of this method is slightly worse than that of the Karatsuba-based multiplier. The parallel multiplier presented in [3] is based on the Chinese Remainder Theorem CRT) and Montgomery s algorithm. It allows the use of the extension field, for which an irreducible trinomial or special pentanomial does not exist. In summary, all of the above subquadratic space complexity GFð n Þ parallel multipliers are based on the improved polynomial multiplication algorithms and will be referred to as the polynomial-based multipliers in the following. In this paper, we follow a different approach using Toeplitz matrix-vector products and propose a new scheme for the subquadratic space complexity GFð n Þ parallel 1-934/7/$5. ß 7 IEEE Published by the IEEE Computer Society

2 FAN AND HASAN: A NEW APPROACH TO SUBQUADRATIC SPACE COMPLEXITY PARALLEL MULTIPLIERS FOR EXTENDED BINARY FIELDS 5 multiplier. Our scheme takes advantage of a shifted polynomial basis SPB) and applies the coordinate transformation technique of [3] and [4]. The end result is that not only the space complexity of the new multiplier is lower than that of the existing polynomial-based multipliers, e.g., [1] and [31], its asymptotic gate delay is also better, in fact considerably better, than that of these polynomial-based designs. Another advantage of the proposed matrix-vector product approach is that it can also be used to design subquadratic space complexity parallel multipliers using polynomial, dual, weakly dual, or triangular basis. Because of the lack of apparent regularity, hardware implementations of subquadratic space complexity parallel multipliers requires considerable efforts. For large n, efficient design of subquadratic space complexity multipliers is often based on recursive application of the divideand-conquer technique and is not straightforward. Therefore, a systematic design approach is desirable. In this paper, a recursive design algorithm is also proposed for an efficient construction of the proposed subquadratic space complexity multipliers. We have implemented the algorithm using ANSI C. The program generates a set of explicit Boolean statements involving only assignment, AND, and XOR operations. Therefore, it essentially provides a lower level of abstraction, e.g., the gate level description as in VHDL and Verilog. The proposed design algorithm may also be modified for the construction the Karatsuba-based subquadratic space complexity multipliers. The remainder of this paper is organized as follows: In Section, we consider asymptotic complexities for computing Toeplitz matrix-vector products based on two and three-way splits. The proposed multipliers are presented in Section 3. The recursive algorithm for the efficient construction of the proposed subquadratic space complexity multiplier is introduced in Section 4. Finally, concluding remarks are made in Section 5. ASYMPTOTIC COMPLEXITIES OF TOEPLITZ MATRIX-VECTOR PRODUCT In this section, some basic noncommutative matrix-vector multiplication schemes and their asymptotic space and gate delay complexities are presented. Elements of the matrix are in GF ðþ. These schemes will be used to design the proposed parallel multiplier in the next section. Definition 1. An n n Toeplitz matrix is a matrix ðm k;i Þ, where i; k n 1, with the property that m k;i ¼ m k 1;i 1, where 1 i; k n 1. Remark 1. An n n Toeplitz matrix is determined by the n 1 elements of the first row and the first column and adding two n n Toeplitz matrices requires n 1 addition operations..1 Two-Way Split or n ¼ i ði>þ Assume that T is an n n Toeplitz matrix and V an n 1 column vector. Matrix T and vector V can be split as follows: T ¼ T 1 T and V ¼ V ; T T 1 V 1 where T, T 1, and T are ðn=þðn=þ matrices and are individually in Toeplitz form, and V and V 1 are ðn=þ1 column vectors. Now, the following noncommutative formula can be used to compute the Toeplitz matrix-vector product TV []: TV ¼ T 1 T V ¼ P þ P ; ð1þ T T 1 V 1 P 1 þ P where P ¼ðT þ T 1 ÞV 1 ;P 1 ¼ðT 1 þ T ÞV and P ¼ T 1 ðv þ V 1 Þ. Please note that the addition and subtraction are the same in fields of characteristic two. One important implication of 1) is that the product of an n n Toeplitz matrix and an n 1 vector is primarily reduced to three products of matrix and vector of sizes ðn=þðn=þ and ðn=þ1. Remark. A straightforward matrix addition to obtain T þ T 1 and T 1 þ T requires a total of ðn 1Þ XOR gates. Owing to the special structure of the Toeplitz matrix, some terms may be reused in computing T þ T 1 and T 1 þ T. Suppose that we have obtained the summation T þ T 1. Then, we only need to compute the first column of T 1 þ T since the last n= 1 elements in the first row of T 1 þ T also appear in the first column of T þ T 1. Therefore, a total of 3n= 1 XOR gates are required to compute T þ T 1 and T 1 þ T. Let symbols S and D stand for Space and Delay, respectively. We will use S b ðnþ, S b ðnþ, D b ðnþ, and D b ðnþ to denote the number of multiplication and addition operations, the time delays introduced by multiplication and addition operations for the case of n ¼ b i ði>þ, respectively. The following recurrence relations, which describe the algorithm complexities, can be established when this formula is used recursively to compute TV in the case of n ¼ i. S ðþ ¼3; D ðþ ¼1; S ðnþ ¼3S ðn=þ; D ðnþ ¼D ðn=þ; S ðþ ¼5; D ðþ ¼; S ðnþ and ¼3S ðn=þþ3n 1; D ðnþ ¼D ðn=þþ: In order to obtain the explicit complexities of the above recurrence relations, we need the following lemmas. Proofs of these lemmas are simple and not given here. Lemma 1. Let a, b, and i be positive integers. Let n ¼ b i, a 6¼ b, and a 6¼ 1. The solution to the recurrence relations R 1 ¼ e; R n ¼ ar n=b þ cn þ d; is R n ¼ a i e þ bc ð ai b i Þ þ dai ð 1Þ a b a 1 :

3 6 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO., FEBRUARY 7 Lemma. Let b and i be positive integers and n ¼ b i. The solution to the recurrence relations R 1 ¼ ; R n ¼ R n=b þ d; is R n ¼ di ¼ dlog b n. Now, it is easy to obtain the following complexity results for computing TV in the case of n ¼ i ði>þ: S ðnþ ¼nlog 3 ; >< S ðnþ ¼5:5nlog 3 6n þ :5; D ðnþ ¼1; D ðnþ ¼ log n:. Three-Way Split or n ¼ 3 i ði>þ As stated earlier, assume that T is an n n Toeplitz matrix and V is an n 1 column vector. Similarly to the case n ¼ i ði >Þ, we may have a three-way split of matrix T and vector V and obtain the following noncommutative formula which computes the Toeplitz matrix-vector product TV []: 1 T T 1 T TV T 3 T T 1 A V 1 1 P þ P 3 þ P V 1 A P 1 þ P 3 þ P 5 A; T 4 T 3 T V P þ P 4 þ P 5 where T i ð i 4Þ are ðn=3þðn=3þ Toeplitz matrices, and < P ¼ðT þ T 1 þ T ÞV ; P 1 ¼ðT 1 þ T þ T 3 ÞV 1 ; : P ¼ðT þ T 3 þ T 4 ÞV ; < P 3 ¼ T 1 ðv 1 þ V Þ; P 4 ¼ T ðv þ V Þ; : P 5 ¼ T 3 ðv þ V 1 Þ: Based on Remark 1, additions of matrices in ) may require as many as 6ð n 3 1Þ XOR gates. However, by reusing repeated terms, the number of XOR gates can be considerably reduced. To this effect, we state the following lemma: Lemma 3. Matrix additions ðt þ T 1 þ T Þ, ðt 1 þ T þ T 3 Þ, and ðt þ T 3 þ T 4 Þ in ) can be performed using a total of n 1 two-input XOR gates. Proof. Let n ¼ 3m and the first row and the first column of the Toeplitz matrix T be ðt 3m 1 ;t 3m ; ;t Þ and ðt 3m 1 ;t 3m ; ;t 6m Þ T, respectively. There is a one-toone correspondence between Toeplitz matrix T and P polynomial 6m t i x i. Adding two n n Toeplitz matrices requires the same number of XOR gates as adding the corresponding polynomials of degree n. Therefore, we have the following polynomials corresponding to Toeplitz matrices T, T 1, T, T 3, and ðþ m T 4 : q ¼ X t i x i ; m q 1 ¼ X t iþm x i ; m q ¼ X t iþm x i ; m q 3 ¼ X t iþ3m x i ; and q 4 ¼ P m t iþ4m x i, respectively. Now, we can write q þ q 1 þ q ¼ Xm ðt i þ½t mþi þ t mþi ŠÞx i þðt m 1 þ½t m 1 þ t 3m 1 ŠÞx m 1 m þ X ð½t i þ t mþi Šþt mþi Þx i ; i¼m q 1 þ q þ q 3 ¼ Xm ft mþi þ t mþi þ t 3mþi gx i þð½t m 1 þ t 3m 1 Šþt 4m 1 Þx m 1 m þ X ðt mþi þ½t mþi þ t 3m 1 ŠÞx i ; i¼m q þ q 3 þ q 4 ¼ Xm ft mþi þ t 3mþi þ t 4mþi gx i þðt 3m 1 þ t 4m 1 þ t 5m 1 Þx m 1 m þ X ð½t mþi þ t 3mþi Šþt 4mþi Þx i : i¼m In 3), 4), and 5), reuses of terms occur in the following five cases: 1. Term ½t m 1 þ t 3m 1 Š in q þ q 1 þ q also appears in q 1 þ q þ q 3.. In q þ q 1 þ q, summations ½t mþi þ t mþi Šði m Þ and ½t i þ t mþi Š ðm i m Þ are the same. 3. Summations ft mþi þ t mþi þ t 3mþi g ð i m Þ in q 1 þ q þ q 3 are the same as summations ðt i þ t mþi þ t mþi Þðmim Þ in q þ q 1 þ q. 4. Summations ft mþi þ t 3mþi þ t 4mþi g ð i m Þ in q þ q 3 þ q 4 are the same as summations ðt mþi þ t mþi þ t 3mþi Þ ðm i m Þ in q 1 þ q þ q Summations ½t mþi þ t 3mþi Šðmim Þ appear in both q 1 þ q þ q 3 and q þ q 3 þ q 4. Thus, q þ q 1 þ q, q 1 þ q þ q 3, and q þ q 3 þ q 4 can be computed using n 1 XOR gates. tu This lemma is useful in determining the space complexity of the matrix-vector product TV. In order to determine the gate delay, we can consider any of the three summation terms on the right hand side of ), i.e., ð3þ ð4þ ð5þ

4 FAN AND HASAN: A NEW APPROACH TO SUBQUADRATIC SPACE COMPLEXITY PARALLEL MULTIPLIERS FOR EXTENDED BINARY FIELDS 7 P þ P 4 þ P 5 ¼ ½ðT þ T 3 þ T 4 ÞV Š þ ½T ðv þ V ÞþT 3 ðv þ V 1 ÞŠ: S ðn;p j ;p i Þ ¼ h i½h j S ðtþþk j p j t þ l j Šþk i n þ l i ; S ðn;p ¼ h i;p jþ j½h i S ðtþþk i p i t þ l i Šþk j n þ l j ; ð6þ For n ¼ 3, it is easy to see that computing the terms in the square brackets requires a gate delay of T A þ T X. Therefore, P þ P 4 þ P 5 may be obtained with a gate delay of T A þ 3T X. When this result and Lemma 3 are used recursively to compute TV, the following recurrence relations, which describe the algorithm complexities, can be established: S 3 ð3þ ¼6; D 3 ð3þ ¼1; S 3 ðnþ ¼6S 3 ðn=3þ; D 3 ðnþ ¼D 3 ðn=3þ; S 3 ð3þ ¼14; D 3 ð3þ ¼3; S 3 ðnþ and ¼6S 3 ðn=3þþ5n 1; D 3 ðnþ ¼D 3 ðn=3þþ3: After solving these recurrence relations, we obtain the following complexities for computing TV in the case of n ¼ 3 i ði>þ: S 3 ðnþ ¼nlog 3 6 ; >< S 3 ðnþ ¼4 5 nlog 3 6 5n þ 1 5 ; D 3 ðnþ ¼1; D 3 ðnþ ¼3 log 3 n:.3 Dealing with Arbitrary n In the previous two subsections, complexity results for the Toeplitz matrix-vector product are given for b ¼ and 3. Let us denote these two primes as p 1 ¼ and p ¼ 3. It is possible to find corresponding complexities for other small primes, say p 3 ¼ 5, p 4 ¼ 7; ;p w, by transposing [, Theorem 6, p. 17] the polynomial multiplication algorithms in [35] or [6]. Since we have complexity results for more than one prime, the following two questions arise while dealing with an arbitrary n: 1. If p i p j jn, where 1 i<jw, and complexity results for both p i and p j are available, how do we choose a sequence of these to obtain a lower complexity scheme?. If none of the p i s ð1 i wþ are factors of n, how can these complexity results be applied? We first discuss question 1. Let primes p j and p i divide n and ðn; p j ;p i Þ denote the Toeplitz matrix-vector product algorithm of size n that first applies the formula corresponding to p i and then that corresponding to p j. As for the XOR gate complexity, we assume that the following recurrence relations have been established for p i and p j -way splits: S p i ðnþ ¼h i S p i ðn=p i Þþk i n þ l i ; S p j ðnþ ¼h j S p j ðn=p j Þþk j n þ l j : Let n ¼ p i p j t ðt >Þ. Then, the XOR gate complexities of algorithms ðn; p j ;p i Þ and ðn; p i ;p j Þ are described as follows: where S ðtþ denotes the number of XOR gates required to compute the Toeplitz matrix-vector product of size t. Similarly to the above discussion on the XOR gate complexity, it can be easily shown that time and the AND gate complexities of algorithms ðn; p j ;p i Þ and ðn; p i ;p j Þ are the same. Since the XOR gate complexities presented in 6) involve parameters p i and p j, one can make a comparison of XOR gate complexities to select one of ðn; p j ;p i Þ and ðn; p i ;p j Þ. As an example, we now consider the special case of n ¼ 6t. The XOR gate complexities of algorithms ðn; 3; Þ and ðn; ; 3Þ are 63t 4 þ 1 S ðtþ and 66t 7 þ 1 S ðtþ, respectively. Since both algorithms ðn; ; 3Þ and ðn; 3; Þ have the same AND gate complexities and gate delays, algorithm ðn; 3; Þ is preferable. With regard to question, where none of the p i s ð1 i wþ divide n, one may choose one of the following two solutions: The first one is to pad one zero at the end of the vector and to extend the Toeplitz matrix from n n to ðn þ 1Þðn þ 1Þ by inserting zeroes at positions ð;nþ and ðn; Þ. Then, apply the complexity results for p 1 ¼, p ¼ 3, etc. The second one is to delete the first row and the last column of the Toeplitz matrix and the last element of the vector and then apply the complexity results for p 1 ¼, p ¼ 3, etc. The deleted elements are processed separately. 3 NEW SUBQUADRATIC MULTIPLIERS In this section, we will use the above scheme of Toeplitz matrix-vector product to design subquadratic space complexity multipliers. For representing elements of the field GFð n Þ, we first consider a shifted polynomial basis, which can be viewed as a generalization of the standard polynomial basis. Let x be a root of fðuþ and GFð n Þ¼GFðÞ½uŠ=ðfðuÞÞ.A shifted polynomial basis SPB) of GFð n Þ over GFðÞ is defined as follows []: Definition. Let v be an integer and the ordered set M ¼ fx i j i n 1g be a polynomial basis of GFð n Þ over GFðÞ. The ordered set x v M :¼ fx i v j i n 1g is called a shifted polynomial basis with respect to M. 3.1 Formulation Using SPB Let X ¼ðx v ;x vþ1 ; ;x n v 1 Þ T be the column vector of SPB basis elements, A ¼ða ;a 1 ; ;a n 1 Þ T be the coordinate column vector of the field element a ¼ x v P n 1 a ix i, and B, C, and D be defined similarly. For hardware implementation of a GFð n Þ SPB parallel multiplier, one method is to form a binary n n matrix Z, which depends on b and fðuþ, and then perform a matrix-vector product. Namely, the product c ¼ ab may be computed as follows c ¼ Xn 1 a i x i v b ¼ðx v b; ;x 1 b; b; xb; ;x n v 1 bþa ¼ X T ðz ; ;Z n 1 ÞA ¼ X T ZA; ð7þ

5 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO., FEBRUARY 7 where Z i is the coordinate column vector of x i v b with respect to the SPB ð i n 1Þ and Z is an n n matrix. From 7), we have the matrix-vector product C ¼ ZA. However, Z is not generally a Toeplitz matrix. Therefore, the subquadratic scheme presented in the previous section cannot be used directly. In [3] and [4], coordinate transformation techniques were proposed. Using this technique, one may first transform Z into a Toeplitz matrix T, i.e., T ¼ UZ, where U is the transform matrix. Then use the subquadratic scheme to compute the Toeplitz matrixvector product, D ¼ TA: Finally, the result C is obtained by C ¼ U 1 D: In the following, we will apply this idea to an arbitrary irreducible trinomial fðuþ ¼u n þ u k þ 1 ð1 k n 1Þ and a special type of pentanomials fðuþ ¼u n þ u kþ1 þ u k þ u k 1 þ 1 ð1 <k<n 1Þ and present exact expressions of C for GFð n Þ generated by these special types of irreducible polynomials. Please note that, for all practical purposes, one may only need to consider irreducible trinomials and pentanomials since at least one of the two types of irreducible polynomials is known to exist for every value of n in the range 1 <n<11. In fact, there is no known value of n for which an irreducible polynomial of weight w<6 does not exist [34]. Also note that NIST has recommended five finite fields of characteristic two for the ECDSA Elliptic Curve Digital Signature Algorithm) applications: GFð 163 Þ, GFð 33 Þ, GFð 3 Þ, GFð 49 Þ, and GFð 571 Þ, but no irreducible trinomials exist for three degrees, that is, 163, 3, and 571. For these three fields, we have found all pairs of ðn; kþ for which fðuþ ¼u n þ u kþ1 þ u k þ u k 1 þ 1 is irreducible [36]: 163, 67), 163, 69), 163, 71), 163, 9), 163, 94), 163, 96), 3, 4), 3, 133), 3, 15), 3, 59), 571, 14), 571, 3), 571, 341), and 571, 467). We assume that the value of v is equal to k in the definition of SPB for irreducible trinomials fðuþ ¼ u n þ u k þ 1 ð1 k n 1Þ and irreducible pentanomials fðuþ ¼u n þ u kþ1 þ u k þ u k 1 þ 1 ð1 <k<n 1Þ. 3. New SPB Multipliers for General Irreducible Trinomials For irreducible trinomials, we have formed a simple transformation matrix to be used with SPB. This matrix is much simpler than what can be obtained using [3] and [4] and is given as follows: U ¼ I ðn vþðn vþ I vv ðþ ð9þ ; ð1þ where I vv is the v v identity matrix. Lemma 4. Let fðuþ ¼u n þ u v þ 1 ð1 v n 1Þ be an irreducible trinomial and Z and U be matrices defined in 7) and 1). Then, T ¼ UZ is a Toeplitz matrix. Proof. Let g ¼ x j v b ¼ P n 1 g ix i v ð j n Þ. Thus, the jth column of Z in 7), i.e., Z j, is the column vector consisting of the SPB coordinates of element g. Then, column Z jþ1 is the coordinate column vector of element xg ¼ Xn 1 g i x i vþ1 ¼ Xn g i x i vþ1 þ g n 1 ðx v þ 1Þ i¼1 ¼ g n 1 x v þ Xv 1 g i 1 x i v þðg v 1 þ g n 1 Þþ Xn 1 g i 1 x i v : Let bg and cxg be the two elements of GFð n Þ whose SPB coordinates form columns j and j þ 1 of matrix T, respectively. Because of premultiplication of U to Z, the lower n v rows respectively, the upper v rows) of Z become the upper n v rows respectively, the lower v rows) of the resultant matrix T ¼ UZ. Thus, we can write!! bg ¼ Xv 1 g i x i v x n v þ Xn 1 g i x i v x v ¼ n v 1 X i¼n v g v nþi x i þ and cxg ¼ g n 1 x v þ Xv 1 g i 1 x i v i¼1 i¼v n v 1 X i¼ v x n v g vþi x i ; þ ðg v 1 þ g n 1 Þþ Xn 1 g i 1 x i v ¼ g n 1 x n v þ ¼ n v 1 X i¼n vþ1 þðg v 1 þ g n 1 Þx v n v 1 X i¼n vþ1 g v n 1 i x i þ þðg v 1 þ g n 1 Þx v : g v n 1 i x i þ Xn v i¼ vþ1 g v 1þi x i x v n v 1 X i¼ vþ1 g v 1þi x i Careful comparison shows that elements at ði; jþ and ði þ 1;jþ 1Þ of matrix T are the same for the case i; j n. Therefore, T is a Toeplitz matrix. tu Now, we present an example to illustrate the above transformation. Let fx i 4 j i 6g be the SPB for the irreducible trinomial u 7 þ u 4 þ 1. Matrices Z and T are as follows: Z ¼ 1 b 4 þ b b 3 b b 1 b b 6 b 5 b 5 þ b 1 b 4 þ b b 3 b b 1 b b 6 b 6 þ b b 5 þ b 1 b 4 þ b b 3 b b 1 b b þ b 3 b 6 þ b b 5 þ b 1 b 4 þ b b 3 b b 1 ; b 1 b b 6 b 5 b 4 b 3 þ b 6 b þ b 5 B b b 1 b b 6 b 5 b 4 b 3 þ b 6 A b 3 b b 1 b b 6 b 5 b 4

6 FAN AND HASAN: A NEW APPROACH TO SUBQUADRATIC SPACE COMPLEXITY PARALLEL MULTIPLIERS FOR EXTENDED BINARY FIELDS 9 TABLE 1 Comparisons of Some Selected Subquadratic Multipliers for n ¼ b t and T ¼ 1 b 1 b b 6 b 5 b 4 b 3 þ b 6 b þ b 5 b b 1 b b 6 b 5 b 4 b 3 þ b 6 b 3 b b 1 b b 6 b 5 b 4 b 4 þ b b 3 b b 1 b b 6 b 5 : b 5 þ b 1 b 4 þ b b 3 b b 1 b b 6 B b 6 þ b b 5 þ b 1 b 4 þ b b 3 b b 1 b A b þ b 3 b 6 þ b b 5 þ b 1 b 4 þ b b 3 b b 1 It is clear that the transformation from Z to T requires no logic gates and the complexity to form Z was presented in [] as follows: Gate delay ¼ 1T X ; n 1 k 6¼ n; XOR gates ¼ n= k ¼ n: For U as given in 1), it is clear that C is obtained from D ¼ TA with no additional logic gates. Table 1 compares the asymptotic complexity of proposed constructions with those of the existing polynomial-based multipliers for the trinomial fðuþ ¼u n þ u k þ 1 ð1 k< dn=eþ, where n ¼ t or 3 t. Since no irreducible binary trinomial exists for the case jn, we will assume that the multiplication operation is performed in the ring GF ðþ½uš=ðfðuþþ, where fðuþ is a trinomial and n ¼ i ði>þ, so that the discussion of the asymptotic complexity is meaningful. The parallel multiplier presented in [3] is based on the Chinese Remainder Theorem CRT) and Montgomery s algorithm. It allows the use of extension fields for which an irreducible trinomial or special pentanomial does not exist. Please note that the gate delay value given in [1] and [31] is ð3 log nþt X þ T A for the case n ¼ t. But, one T X gate delay may be saved for the case n ¼ since the gate delay for computing expressions in the square brackets of ða 1 x þ a Þðb 1 x þ b Þ¼a 1 b 1 x þ ð½ða 1 þ a Þðb 1 þ b ÞŠ þ½a 1 b 1 þ a b ŠÞx þ a b is T A þ T X. However, this does not improve the asymptotic gate delay since the overlapping occurs when n>. Similarly, one T X gate delay may also be saved for the case of n ¼ 3. We also note that, for the case dn=e <k<n, multipliers presented in [1] and [31] require more XOR gates and delays than values listed in the table since more than two reduction operations are performed [6]. 3.3 New SPB Multipliers for Special Pentanomials fðuþ ¼u n þ u vþ1 þ u v þ u v 1 þ 1 ð1 <v<n 1Þ For this special type of pentanomial, we transform Z into the Toeplitz matrix T via the following lemma. Lemma 5. Let fðuþ ¼u n þ u vþ1 þ u v þ u v 1 þ 1 ð1 <v< n 1Þ be an irreducible pentanomial and Z be the matrix defined in 7). Let matrix U ¼ I ðn vþðn vþ þ J ðn vþðn vþ I vv þ Jvv T ; where J vv is a v v matrix with the single entry ð;v 1Þ ¼ 1 and all remaining entries are. Then, T ¼ UZ is a Toeplitz matrix. Proof. Let g ¼ x j v b ¼ P n 1 g ix i v ð j n Þ. Thus, the jth column of Z in 7), i.e., Z j, is the column vector consisting of the SPB coordinates of element g. Then, column Z jþ1 is the coordinate column vector of element xg ¼ Xn 1 g i x i vþ1 ¼ Xn g i x i vþ1 þ g n 1 ðx v þ x 1 þ 1 þ xþ: Let bg and cxg be the two elements of GFð n Þ whose SPB coordinates form columns j and j þ 1 of matrix T, respectively. Because of premultiplication of U to Z, we can write

7 3 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO., FEBRUARY 7 and bg ¼ Xv g i x i v þðg v 1 þ g Þx 1 x n v þ ðg v þ g n 1 Þþ Xn 1 g i x i v ¼ðg v þ g n 1 Þx v þ Xn 1 þðg þ g v 1 Þx n v 1 ; x v g i x i v þ Xv g i x iþn v cxg ¼ g n 1 x v þ Xv 3 g i x i vþ1 þðg v þ g n 1 þ g n 1 Þx n 1 i¼1 x n v þ ðg v 1 þ g n 1 þ g n Þþðg v þ g n 1 Þx þ Xn 1 g i x i vþ1 ¼ðg v 1 þ g n þ g n 1 Þx v þðg v þ g n 1 Þx vþ1 þ Xn 1 g i x i vþ1 þ Xv g i x iþn vþ1 : x v Careful comparison shows that elements at ði; jþ and ði þ 1;jþ 1Þ of matrix T are the same for the case i; j n. Therefore, T is a Toeplitz matrix. tu Row operations described in the above lemma are as follows: 1. XOR row to row v 1;. XOR row n 1 to row v; 3. place the lower n v rows on the top of the upper v rows. Therefore, instead of computing C ¼ðc ;c 1 ; ;c n 1 Þ T ¼ Zða ;a 1 ; ;a n 1 Þ T ; we compute D ¼ðd ;d 1 ; ;d n 1 Þ T ¼ Tða ;a 1 ; ;a n 1 Þ T first, where c v þ c n 1 i ¼ ; >< c d i ¼ iþv 1 i n v 1; c iþv n n v i n ; c v 1 þ c i ¼ n 1: Then, we obtain coordinates of C from D as follows and it requires only two XOR gates: d n vþi i v ; >< d c i ¼ n 1 þ d n v i ¼ v 1; d n v 1 þ d i ¼ v; d i v v þ 1 i n 1: T ;i ¼ b v i i v 1; b v þ b n 1 i ¼ v; >< b v 1 þ b n þ b n 1 i ¼ v þ 1; b v i þ b nþv 1 i þ b nþv i þ b nþvþ1 i v þ i v; b nþv 1 i þ b nþv i þ b nþvþ1 i þ b nþv i v þ 1 i n ; b v þ b vþ1 þ b vþ þ b vþ1 þ b n 1 i ¼ n 1; T i; ¼ b vþi i n v 1; b v nþi n v i n v ; b þ b v 1 i ¼ n v 1; >< b þ b 1 þ b v i ¼ n v; b v nþi þ b v 1 nþi þ b v nþi n v þ 1 i n 3; þb vþ1 nþi b þ b v 3 þ b v þ b v 1 þ b v i ¼ n ; b 1 þ b v þ b v 1 þ b v þ b v 1 i ¼ n 1: Remark 3. Since some signals may be reused, a total of 3T X delays and no more than 5 n -input XOR gates are required to compute all elements of this Toeplitz matrix. 3.4 Use of Equally Spaced Pentanomials By carefully choosing other types of irreducible pentanomials, it is possible to reduce the space and time complexities for generating the Toeplitz matrix T and for obtaining the final product vector C. For example, consider a very special class of pentanomials of the form fðuþ ¼u 4s þ u 3s þ u s þ u s þ 1 of degree n ¼ 4s with s>and v ¼ s. Such an s-equally-spaced pentanomial is irreducible if s ¼ 5 i ði Þ [1]. These equally-spaced irreducible pentanomials are not that abundant; however, they can reduce the space and time complexities for obtaining T and C. For example, applying the following transformation matrix, the use of such an equally-spaced irreducible pentanomial of degree n requires :75n XOR gates and 1T X delay for T and :5n XOR gates and 1T X delay for C In n þ K! n n ; In n þ KT n n where Kn n is an n n matrix with its entry at ði; i þ n 4 Þ¼1 for i n 4 1 and all remaining entries being. Slightly different complexity results, namely, :75n XOR gates and 1T X delay for T and :5n XOR gates and T X delay for C, are obtained using the following transformation matrix: 1 In 4 n 4 In 4 n In 4 4 B n 4 In 4 n A : 4 In 4 n In 4 4 n 4 For the case 3 <v<ðn 3Þ=, we summarize explicit expressions of the first row and the first column of matrix T below. They are obtained by applying the above transformations on the coordinate expressions of Z in [36]. 3.5 Considerations for Other Bases While there appears to be no scheme known to directly use the well-known Karatsuba algorithm to design the subquadratic space complexity dual, weakly dual, and

8 FAN AND HASAN: A NEW APPROACH TO SUBQUADRATIC SPACE COMPLEXITY PARALLEL MULTIPLIERS FOR EXTENDED BINARY FIELDS 31 triangular basis parallel multipliers [3], [4], [1], and [19], the proposed matrix-vector product approach can be used for these bases. Namely, for a, b GFð n Þ, let a be represented with respect to a polynomial basis and b be by the dual, weakly dual, or triangular basis of the polynomial basis. Then, the product c ¼ ab can be written as a matrix-vector product C ¼ Hða ;a 1 ; ;a n 1 Þ T, where H depends on b and the field generating irreducible polynomial. Explicit expressions of entries of H and the complexity to compute H can be found in the above corresponding references. In these cases, H is not a Toeplitz matrix, but a Hankel matrix, i.e., entries at ði; jþ and ði 1;jþ 1Þ are equal. We may first exchange columns H i and H n 1 i for i<n= and reverse the column vector ða ;a 1 ; ;a n 1 Þ T. Then, perform the Toeplitz matrix-vector product. An alternative is to obtain the Hankel matrixvector product formulae, which are similar to those in Section and have the same asymptotic complexities. It is noted that no coordinate transformation is required for these parallel multipliers, which use the polynomial basis to represent input a and use the corresponding dual, weakly dual, or triangular basis to represent the other input b and the product c. 4 ALGORITHM FOR DESIGNING SUBQUADRATIC SPACE COMPLEXITY PARALLEL GFð n Þ MULTIPLIERS In order to design application-specific circuits, different levels of abstraction may be used to describe the hardware. Normally, a higher abstraction level provides more flexibility and a lower one provides a better performance. Based on the previous sections, we now present a recursive design algorithm for efficient construction of the proposed subquadratic space complexity multipliers. The Toeplitz matrix T is constructed first in the main program, then the recursive procedures are invoked, which output a set of explicit Boolean statements. These expressions involve only assignment, AND, and XOR operations. For example, T ½4Š½Š½Š½Š¼T½3Š½Š½Š½ŠT½3Š½Š½Š½1ŠT½3Š½Š½Š½Š and C½9Š ¼C½9ŠT½4Š½1Š½Š½ŠV½4Š½1Š½Š: Therefore, a lower level abstraction, e.g., the gate level in Verilog HDL, is provided for the design of the multiplier. We note that the proposed design algorithm may also be modified for the construction of the Karatsuba-based subquadratic space complexity multipliers. Algorithm A1: Design algorithm for the subquadratic space complexity multipliers. Input: Field generating irreducible polynomial fðuþ. Output: Program for computing c ¼ ab in GFð n Þ. { Clear the output vector C; Construct Toeplitz matrix T from B; Construct vector V from A; Toeplitz mvpðn; ; ; Þ; Perform the coordinate transformation. } Subprogram: Toeplitz mvpðint : ; flvl; fblk; posþ {/* : The size of the input vector. flvl: The calling level. fblk: The block number of the submatrix. pos: The final position that this mvp will XOR to. */ INT: cblk ¼, clvl ¼ flvl þ 1; IF ð ¼ 1Þ THEN { Print the sentence C½posŠ ¼C½posŠ ðm½flvlš½fblkš½š½šv½flvlš½fblkš½šþ; return;} IF ð ¼ mod Þ THEN{ // Print sentences for computing P submatrix T 1 þ T, which looks like T ½clvlŠ½cblkŠ½iŠ½jŠ ¼T ½flvlŠ½fblkŠ½iŠ½jŠ T ½flvlŠ½fblkŠ½iŠ½j þ =Š; where i; j =; 1 subvector V 1, which looks like V ½clvlŠ½cblkŠ½iŠ ¼V ½flvlŠ½fblkŠ½i þ =Š; where i =; Toeplitz mvpð=; clvl; cblk; posþ; cblk++; // End of computing P // Print sentences for computing P 1 submatrix T 1 þ T ; 1 subvector V ; Toeplitz mvpð=; clvl; cblk; pos þ =Þ; cblk++; // End of computing P 1 // Print sentences for computing P submatrix T 1 ; 1 subvector V þ V 1 ; Print sentences for operations PUSH vector P on the stack; Print sentences for operations Set vector P to ; Toeplitz mvpð=; clvl; cblk; posþ; cblk++; Print sentences for operations XOR P to P 1 and P ; // Please note that vector P is on the stack, // and vector P is in the position of P.; Print sentences for operations POP modified vector P from the stack; // End of computing P } ELSE IF ð ¼ mod 3Þ THEN{ // Sentences for 3jn are similar to those for jn. Omitted. } ELSE {// Padding Print sentences for padding zeroes after the last elements of matrix T s first row and first column, respectively; Print the sentence for padding a zero after the last element of vector V ;

9 3 IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO., FEBRUARY 7 Print the sentence PUSHðC½pos þ ŠÞ; Toeplitz mvpð þ 1; flvl; fblk; posþ; Print the sentence POPðC½pos þ ŠÞ; } } 5 CONCLUSIONS A new scheme for the subquadratic space complexity parallel multiplier has been presented. Both the space complexity and the asymptotic gate delay of the proposed multiplier are lower than those of the best existing subquadratic space complexity parallel multipliers. For practical applications, the hybrid structure developed in [5] may also be modified to reduce the gate delay of the proposed multipliers at the cost of a slight increase in the space complexity. A recursive design algorithm has also been proposed for efficient construction of the proposed subquadratic space complexity multipliers. It may be modified for the construction of the Karatsuba-based subquadratic space complexity multipliers. We note that, although the proposed matrix-vector product approach may be used to design the subquadratic space complexity polynomial, shifted polynomial, dual, weakly dual, and triangular basis parallel multipliers, the gate delay of the SPB multiplier is always equal to or lower than those of other multipliers. For example, if we consider the irreducible trinomial fðuþ ¼u n þ u n 1 þ 1, then generating matrix T requires 1T X gate delays for the SPB, but at least ðlog nþt X for other bases. Finally, although the work presented here are primarily for hardware implementations, our Toeplitz matrix-vector product based approach can also be applied to software implementation using general purpose processors [37]. ACKNOWLEDGMENTS The authors thank the reviewers for their careful reading and useful comments. The work was supported in part by NSERC through grants awarded to Dr. Hasan. REFERENCES [1] A. Karatsuba and Y. Ofman, Multiplication of Multidigit Numbers on Automata, Soviet Physics-Doklady English translation), vol. 7, no. 7, pp , [] S. Winograd, Arithmetic Complexity of Computations. SIAM, 19. [3] M.A. Hasan and V.K. Bhargava, Division and Bit-Serial Multiplication over GF ðq m Þ, IEE Proc.-E, vol. 139, no. 3, pp. 3-36, May 199. [4] M.A. Hasan and V.K. Bhargava, Architecture for Low Complexity Rate-Adaptive Reed-Solomon Encoder, IEEE Trans. Computers, vol. 44, no. 7, pp , July [5] E.D. Mastrovito, VLSI Schemes for Multiplication over Finite Field GFð m Þ, Proc. Sixth Int l Conf. Applied Algebra, Algebraic Algorithms, and Error-Correcting Codes AAECC-6), pp , 19. [6] B. Sunar and C.K. Koc, Mastrovito Multiplier for All Trinomials, IEEE Trans. Computers, vol. 4, no. 5, pp. 5-57, May [7] C.K. Koc and B. Sunar, Low-Complexity Bit-Parallel Canonical and Normal Basis Multipliers for a Class of Finite Fields, IEEE Trans. Computers, vol. 47, no. 3, pp , Mar [] A. Halbutogullari and C.K. Koc, Mastrovito Multiplier for General Irreducible Polynomials, IEEE Trans. Computers, vol. 49, no. 5, pp , May. [9] S.O. Lee, S.W. Jung, C.H. Kim, J. Yoon, J. Koh, and D. Kim, Design of Bit Parallel Multiplier with Lower Time Complexity, Proc. Sixth Int l Conf. Information Security and Cryptology ICICS 3), pp , 3. [1] E. Savas, A.F. Tenca, and C.K. Koc, A Scalable and Unified Multiplier Scheme for Finite Fields GFðpÞ and GFð m Þ, Proc. Cryptographic Hardware and Embedded Systems CHES ), C.K. Koc and C. Paar, eds., pp. 77-9, Aug.. [11] C.-C. Wang, T.-K. Truong, H.-M. Shao, L.J. Deutsch, J.K. Omura, and I.S. Reed, VLSI Schemes for Computing Multiplications and Inverses in GFð m Þ, IEEE Trans. Computers, vol. 34, no., pp , Aug [1] M.A. Hasan, M.Z. Wang, and V.K. Bhargava, Modular Construction of Low Complexity Parallel Multipliers for a Class of Finite Fields GFð m Þ, IEEE Trans. Computers, vol. 41, no., pp , Aug [13] M.A. Hasan, M.Z. Wang, and V.K. Bhargava, A Modified Massey-Omura Parallel Multiplier for a Class of Finite Fields, IEEE Trans. Computers, vol. 4, no. 1, pp. 17-1, Oct [14] A. Reyhani-Masoleh and M.A. Hasan, A New Construction of Massey-Omura Parallel Multiplier over GFð m Þ, IEEE Trans. Computers, vol. 51, no. 5, pp , May. [15] G. Drolet, A New Representation of Elements of Finite Fields GFð m Þ Yielding Small Complexity Arithmetic Circuits, IEEE Trans. Computers, vol. 47, no. 9, pp , Sept [16] R. Katti and J. Brennan, Low Complexity Multiplication in Finite Field Using Ring Representation, IEEE Trans. Computers, vol. 5, no. 4, pp , Apr. 3. [17] C. Negre, Quadrinomial Modular Arithmetic Using Modified Polynomial Basis, Proc. Int l Conf. Information Technology: Coding and Computing ITCC 5), vol. 1, pp , 5. [1] S.T.J. Fenn, M. Benaissa, and D. Taylor, GFð m Þ Multiplication and Division over the Dual Basis, IEEE Trans. Computers, vol. 45, no. 3, pp , Mar [19] H. Wu, M.A. Hasan, and I.F. Blake, New Low-Complexity Bit- Parallel Finite Field Multipliers Using Weakly Dual Bases, IEEE Trans. Computers, vol. 47, no. 11, pp , Nov [] H. Fan and Y. Dai, Fast Bit Parallel GFð n Þ Multiplier for All Trinomials, IEEE Trans. Computers, vol. 54, no. 4, pp , Apr. 5. [1] C. Paar, A New Architecture for a Parallel Finite Field Multiplier with Low Complexity Based on Composite Fields, IEEE Trans. Computers, vol. 45, no. 7, pp , July [] C. Paar, P. Fleischmann, and P. Roelse, Efficient Multiplier Schemes for Galois Fields GFð 4n Þ, IEEE Trans. Computers, vol. 47, no., pp , Feb [3] M. Ernst, M. Jung, F. Madlener, S. Huss, and R. Blumel, A Reconfigurable System on Chip Implementation for Elliptic Curve Cryptography over GFð n Þ, Proc. Cryptographic Hardware and Embedded Systems CHES ), pp , 3. [4] M. Jung, F. Madlener, M. Ernst, and S. Huss, A Reconfigurable Coprocessor for Finite Field Multiplication in GFð n Þ, Proc. IEEE Workshop Heterogeneous Reconfigurable Systems on Chip,. [5] C. Grabbe, M. Bednara, J. Shokrollahi, J. Teich, and J. von zur Gathen, FPGA Designs of Parallel High Performance GFð 33 Þ Multipliers, Proc. Int l Symp. Circuits and Systems ISCAS 3), vol. II, pp. 6-71, 3. [6] A. Weimerskirch and C. Paar, Generalizations of the Karatsuba Algorithm for Efficient Implementations, ruhr-uni-bochum.de/imperia/md/content/texte/kaweb.pdf, 3. [7] F. Rodriguez-Henriquez and C.K. Koc, On Fully Parallel Karatsuba Multipliers for GFð m Þ, Proc. Int l Conf. Computer Science and Technology CST 3), pp , 3. [] A.E. Cohen and K.K. Parhi, Implementation of Scalable Elliptic Curve Cryptosystem Crypto-Accelerators for GFð m Þ, Proc. 13th Asilomar Conf. Signals, Systems, and Computers, vol. 1, pp , Nov. 4. [9] L.S. Cheng, A.M., and T.H. Yeap, Improved FPGA Implementations of Parallel Karatsuba Multiplication over GFð n Þ, Proc. 3rd Biennial Symp. Comm., 6. [3] J. von zur Gathen and J. Shokrollahi, Efficient FPGA-Based Karatsuba Multipliers for Polynomials over F, Proc. 1th Workshop Selected Areas in Cryptography SAC 5), 5.

17th IEEE Symp. Computer Arithmetic ARITH 5), 5. [33] D.J. Bernstein, Multidigit Multiplication for Mathematicians, Advances in Applied Math., to appear. [34] G.

10 FAN AND HASAN: A NEW APPROACH TO SUBQUADRATIC SPACE COMPLEXITY PARALLEL MULTIPLIERS FOR EXTENDED BINARY FIELDS 33 [31] B. Sunar, A Generalized Method for Constructing Subquadratic Complexity GFð k Þ Multipliers, IEEE Trans. Computers, vol. 53, no. 9, pp , Sept. 4. [3] J.-C. Bajard, L. Imbert, and G.A. Jullien, Parallel Montgomery Multiplication in GFð k Þ Using Trinomial Residue Arithmetic, Proc. 17th IEEE Symp. Computer Arithmetic ARITH 5), 5. [33] D.J. Bernstein, Multidigit Multiplication for Mathematicians, Advances in Applied Math., to appear. [34] G. Seroussi, Table of Low-Weight Binary Irreducible Polynomials, Technical Report HPL-9-135, Hewlett-Packard Laboratories, Palo Alto, Calif., HPL html, Aug [35] P.L. Montgomery, Five, Six, and Seven-Term Karatsuba-Like Formulae, IEEE Trans. Computers, vol. 54, no. 3, pp , Mar. 5. [36] H. Fan and M.A. Hasan, Fast Bit Parallel Shifted Polynomial Basis Multipliers in GFð n Þ, IEEE Trans. Circuits and Systems-I: Regular Papers, accepted. [37] H. Fan and M.A. Hasan, Alternative to the Karatsuba Algorithm for Software Implementation of GFð n Þ Multiplication, Technical Report CACR 6-13, Univ. of Waterloo, Waterloo, Canada, May 6. Haining Fan received the BSc degree in computer science and the MSc degree in military communications science, both from the Institute of Communications Engineering, Nanjing, China, in 199 and 1996, respectively. He received the PhD degree in computer science from Tsinghua University in 5. From 1996 to 4, he was with the Institute of China Electronic System Engineering Company. He is currently a postdoctoral fellow in the Department of Electrical and Computer Engineering at the University of Waterloo, Canada. His research interests include finite field arithmetic, computational complexity, and cryptography. M. Anwar Hasan received the BSc degree in electrical and electronic engineering and the MSc degree in computer engineering, both from the Bangladesh University of Engineering and Technology, in 196 and 19, respectively, and the PhD degree in electrical engineering from the University of Victoria in 199. From April to December of 199, he was a postdoctoral fellow at the University of Victoria. In January 1993, he became an assistant professor in the Department of Electrical and Computer Engineering, University of Waterloo. He was promoted to the rank of associate professor with tenure in 199 and then to full professor in. At the University of Waterloo, Dr. Hasan is also a member of the Centre for Applied Cryptographic Research, the Center for Wireless Communications, and the VLSI Research Group. From January to December of 1999, he was on sabbatical with Motorola Labs., Schaumburg, Illinois. Dr. Hasan is a recipient of the Raihan Memorial Gold Medal. At the University of Victoria, he was awarded the President s Research Scholarship four times. At the University of Waterloo, he received a Faculty of Engineering Distinguished Performance Award in and an Outstanding Performance Award in 4. He has served on the program and executive committees of several conferences. From to 4, he was an associate editor of the IEEE Transactions of Computers. He is a licensed professional engineer of Ontario. His current research interests include cryptographic hardware and embedded systems, dependable and secure computing, computer arithmetic and architecture, and computer and network security. He is a senior member of the IEEE.. For more information on this or any other computing topic, please visit our Digital Library at

A New Approach to Subquadratic Space Complexity Parallel Multipliers for Extended Binary Fields

1 A New Approach to Subquadratic Space Complexity Parallel Multipliers for Extended Binary Fields Haining Fan and M. Anwar Hasan Senior Member, IEEE July 19, 2006 Abstract Based on Toeplitz matrix-vector