A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte


Lehigh University
Department of Computer Science and Engineering
Bethlehem, PA 18015

ABSTRACT

Galois field arithmetic is commonly used in Reed-Solomon encoding and decoding. This paper presents the design of a combined 16-bit binary and dual Galois field (GF) multiplier. The multiplier performs either a 16-bit two's complement or unsigned multiplication, or two independent 8-bit GF($2^8$) multiplications in SIMD fashion. The combined multiplier is designed by modifying a conventional binary tree multiplier. It uses a novel wiring methodology to provide two simultaneous GF($2^8$) multiplies with a minor impact on area and delay. Three alternatives for the multiplier design are presented. Area and delay estimates indicate that, compared to a conventional binary tree multiplier, the combined multiplier has roughly 6% more delay and 23% more area.

1. INTRODUCTION

Galois field (GF) arithmetic is a powerful algebraic tool, employed in many encoding techniques [1]. In particular, it is used in Reed-Solomon (R-S) encoding, which frequently provides error correction for wireless communications and compact discs. R-S codes are usually implemented with the 8-bit field GF($2^8$). A brief explanation of some key concepts and operations in GF multiplication follows. Further details can be found in [2].

A GF($2^m$) field is an extension of the field GF(2), whose elements are {0, 1}. The operations defined in GF(2) are addition and multiplication, each performed modulo 2. When GF(2) is extended to GF($2^m$), the result is a vector space of dimension $m$ over GF(2). Elements of GF($2^m$) can thus be represented as $m$-bit binary words. The field is characterized by the irreducible polynomial

$$f(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1 x + f_0, \qquad (1)$$

with $f_i \in$ GF(2).
All $2^m$ elements in the field can be represented by means of the vector basis $\{\alpha^{m-1}, \ldots, \alpha^1, \alpha^0\}$, where $\alpha$ is a root of the irreducible polynomial $f(x)$ and is called a primitive element of the field. This basis allows an element $A \in$ GF($2^m$) to be expressed as

$$A = a_{m-1}\alpha^{m-1} + \cdots + a_1\alpha^1 + a_0\alpha^0, \qquad (2)$$

with $a_i \in$ GF(2). Thus, elements of GF($2^m$) can be associated with polynomials that have coefficients in GF(2), where the bit in the $i$-th position represents the coefficient of $\alpha^i$. For example, in GF($2^4$) the word (1 0 1 0) corresponds to $\alpha^3 + \alpha$.

Addition and multiplication in GF($2^m$) can be viewed as polynomial addition and multiplication modulo $f(x)$. Since the polynomial's coefficients belong to GF(2), operations at the coefficient level are taken modulo 2. Thus, addition in GF($2^m$) is performed by bitwise XORing the $m$ coefficients of the two polynomials being added. Similarly, multiplication is computed as a modulo 2 sum of shifted partial products, where the sum is again computed using bitwise XORing. The result of a GF($2^m$) multiplication is a $(2m-1)$-bit word, which represents a polynomial of degree $2m-2$, in what is called extended form. To achieve closure, the extended form polynomial is reduced modulo $f(x)$. This is equivalent to calculating the remainder of the extended form polynomial divided by $f(x)$.

For a given $m$, all GF($2^m$) fields are isomorphic. Consequently, a particular field can be chosen without affecting the results, if its representation offers computational advantages. Using a fixed $f(x)$, however, is inconvenient when designing hardware that must work with existing applications that use different field representations. Thus, it is useful to allow $f(x)$ to be programmable.

There exists abundant literature on specialized designs for GF multipliers. Few of these designs, however, are oriented towards the use of GF multipliers as part of a programmable digital signal processor (DSP). Consequently, they are designed for a specific field, which makes them less versatile.
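The GF($2^m$) operations described in the introduction can be sketched in software as follows. This is a minimal illustration, not the hardware design of the paper: the function names `clmul`, `poly_mod` and `gf_mul` are hypothetical, field elements are held in Python integers with bit $i$ as the coefficient of $\alpha^i$, and the irreducible polynomial is passed with its leading term included.

```python
def gf_add(a, b):
    """Addition in GF(2^m): bitwise XOR of the coefficient words."""
    return a ^ b


def clmul(a, b):
    """Modulo 2 sum of shifted partial products (extended form result)."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift  # XOR instead of add: no carries
        b >>= 1
        shift += 1
    return result


def poly_mod(p, f):
    """Remainder of p divided by f, with coefficients in GF(2)."""
    deg_f = f.bit_length() - 1
    while p.bit_length() - 1 >= deg_f:
        # Cancel the current leading term of p with a shifted copy of f.
        p ^= f << (p.bit_length() - 1 - deg_f)
    return p


def gf_mul(a, b, f):
    """GF(2^m) product: extended form reduced modulo f(x)."""
    return poly_mod(clmul(a, b), f)


if __name__ == "__main__":
    f4 = 0b10011  # f(x) = x^4 + x + 1
    print(bin(gf_mul(0b1000, 0b1000, f4)))  # alpha^3 * alpha^3 = alpha^6
```

Because `f` is an argument rather than a constant, the same routine works for any field representation, which is the programmability argued for above.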
Initial designs for GF multiplication used a serial approach. Although serial GF multipliers have low hardware requirements, they are very slow. Consequently, several parallel designs for GF multiplication have been introduced:

Mastrovito multipliers use a fixed irreducible polynomial that allows the hardware to be optimized [3],[4]. Although these designs have been used successfully in application-specific implementations, they are less flexible for use in programmable DSPs.

Systolic GF multipliers are very useful when the rate of the input operands is constant at high clock rates [5]. Systolic GF multipliers, however, are not suitable for implementation in programmable DSPs, which have slower clock cycles and irregular input rates.

Parallel GF multipliers use a dedicated functional unit and allow $f(x)$ to be specified. The parallel GF multiplier presented in [6] is conceptually very similar to part of the design presented here. The main difference is that the design from [6] is a dedicated unit, which does not also support conventional binary multiplications. The design from [6] is used in this paper for comparison purposes.

The TMS320C64x DSP provides support for programmable GF multiplications, with $m$ ranging from 1 to 8 [7]. It executes four independent GF($2^8$) multiplies with results ready after four cycles. No details about the actual implementation could be found.

The combined GF multiplier presented in [8],[9] is similar to the designs presented in this paper. That design is based on a Wallace tree multiplier, which has been modified to perform either conventional binary or GF multiplication. Its polynomial reduction introduces a linear delay. A similar design is used in this paper for comparison purposes.

2. COMBINED BINARY AND DUAL GF MULTIPLIER

The multiplier presented in this paper receives two 16-bit inputs, $X$ and $Y$, and produces a 32-bit output $Z$. $X$ and $Y$ represent 16-bit integers, or the concatenation of two independent 8-bit GF($2^8$) elements:

$$X_{high} = X[15 \ldots 8]; \quad X_{low} = X[7 \ldots 0]$$
$$Y_{high} = Y[15 \ldots 8]; \quad Y_{low} = Y[7 \ldots 0]$$

The multiplier also receives two control signals, $f$ and $t$. $f$ is one for fixed point multiplication and zero for GF multiplication. $t$ is one for two's complement multiplication and zero for unsigned multiplication.
When $f$ is zero, $t$ does not affect the GF multiplication. For two's complement and unsigned multiplication, $Z$ is the corresponding 32-bit (signed or unsigned) product. For GF multiplication, $Z$ consists of sixteen leading zeros, followed by $Z^{GF}_{high} = X_{high} \cdot Y_{high}$ and $Z^{GF}_{low} = X_{low} \cdot Y_{low}$, with the products calculated in GF($2^8$). An additional input to the unit is the vector necessary for the polynomial reduction of GF products. Depending on the polynomial reduction method, this is either an 8-bit or a 56-bit word. The polynomial reduction unit is explained in detail in Section 2.3. Figure 1 shows a block diagram of the combined multiplier.

[Fig. 1. Block diagram of the combined multiplier. Blocks: input multiplexers, partial product generation and reduction, carry lookahead adder, two polynomial reducers, output multiplexer.]

2.1. Partial Product Matrix

For both fixed point and GF multiplications, the first step is generating a matrix of partial products. These are calculated by ANDing the corresponding bits of $X$ and $Y$ as $pp_{i,j} = x_i y_j$. Partial products are arranged in rows, with each row shifted $i$ positions to the left, as in Figure 2. Each dot represents the output of an AND gate. The fixed point product $Z = X \cdot Y$ is obtained by adding the resulting partial products.

The partial product matrix is composed of four submatrices, $X_{low} Y_{low}$, $X_{low} Y_{high}$, $X_{high} Y_{low}$ and $X_{high} Y_{high}$, as shown in Figure 2. The upper-right and lower-left submatrices ($X_{low} Y_{low}$ and $X_{high} Y_{high}$) correspond to the partial products to be added for GF multiplications. These partial products are indicated by hollow dots in Figure 2. The partial products in the other two submatrices, indicated by black dots, are set to zero when calculating a GF product, by ANDing the $x_j$ inputs to these submatrices with the control signal $f$. The new inputs are called $X^m_{high}$ and $X^m_{low}$. This extra hardware amounts to only 16 AND gates.
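The role of the control signal $f$ in the partial product matrix can be checked with a small behavioural model. The sketch below is an illustration under stated assumptions, not the paper's netlist: the function names are hypothetical, the masking is applied per partial product rather than on the $x_j$ inputs (functionally equivalent), and only the unsigned fixed point case is modelled. With the two cross submatrices zeroed, a carry-free column sum yields the two extended GF products in disjoint column ranges.

```python
def partial_product_matrix(x, y, f):
    """16x16 matrix pp[i][j] = x_i AND y_j, with cross submatrices gated by f."""
    rows = []
    for i in range(16):
        row = []
        for j in range(16):
            same_half = (i < 8) == (j < 8)   # low*low or high*high
            keep = f | int(same_half)        # cross terms survive only if f == 1
            row.append(((x >> i) & 1) & ((y >> j) & 1) & keep)
        rows.append(row)
    return rows


def add_columns(pp):
    """Fixed point mode: ordinary weighted sum of all partial products."""
    return sum(pp[i][j] << (i + j) for i in range(16) for j in range(16))


def xor_columns(pp):
    """GF mode: modulo 2 sum of each column, so no carries propagate."""
    z = 0
    for i in range(16):
        for j in range(16):
            z ^= pp[i][j] << (i + j)
    return z


def clmul8(a, b):
    """Reference carry-less product of two 8-bit words (15-bit extended form)."""
    r = 0
    for k in range(8):
        if (b >> k) & 1:
            r ^= a << k
    return r


if __name__ == "__main__":
    x, y = 0xA5C3, 0x3C7E
    z = xor_columns(partial_product_matrix(x, y, 0))
    print(hex(z & 0x7FFF), hex(z >> 16))  # extended low and high GF products
```

The low extended product occupies columns 0 to 14 and the high one columns 16 to 30, so the two results never share a column.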
The masking AND gates add only one AND gate delay to the critical path.

2.2. Partial Product Reduction Tree

A tree multiplier was selected to implement the partial product reduction, due to its speed advantage over array multipliers.

[Fig. 2. Partial product matrix.]

The Reduced Area multiplier was chosen because it requires less area than other techniques, places full adders as early as possible in the reduction tree, and minimizes the size of the carry-propagate adder [10]. The layout of the Reduced Area multiplier also offers advantages when performing GF multiplication. The 16-bit Reduced Area multiplier that serves as the base for the combined multiplier uses six reduction stages that have a worst case delay equivalent to six full adders.

GF and fixed point multiplication are similar in concept, but there is an important difference in the partial product reduction. As explained in Section 1, the partial products are added modulo 2 for GF multiplication, which implies that no carries are included in the summation. Two ways to avoid adding carries in the GF multiplication are considered:

1. Modify the full adders and half adders to set the carry output to zero when a GF multiplication is performed. This is the approach taken in [8],[9].

2. Do not modify the adder cells. Instead, modify the wiring of the partial product reduction tree and add a small number of XOR gates, to ensure that no carries are added with partial product bits before the GF product bit for each column is calculated.

With the first alternative, an extra control input to each adder cell prevents it from generating a carry when performing GF multiplication. The extra logic (an AND gate at the carry output of each adder cell) increases the area and delay.

The new wiring scheme presented here as the second alternative allows important savings in terms of area and delay. Each adder cell computes a sum bit that is either relevant or irrelevant to the GF multiplication. A sum bit is relevant if it depends on at least one partial product from either GF submatrix.
A relevant sum bit cannot be an input to an adder that has other inputs that depend on carry signals from other adders. Relevant sum bits have to be added together as early as possible. In practice, there are few places where this restriction modifies the wiring of the adder cells. There are also six columns where only two GF-relevant sum bits are available after the first reduction stage. In these cases, both bits are extracted and added with an XOR gate, to form the expected result without modifying the partial product reduction tree. After the third reduction stage, all the GF-relevant partial products in each column have been added together, and can be extracted to form the two extended results (15-bit vectors), which are sent to the polynomial reducers. The only overhead for this design is six XOR gates and some minor restrictions on the connection of the full adders. The theoretical critical delay path does not increase.

The output of the partial product reduction tree for fixed point multiplication is obtained at the bottom of the tree. It consists of the least significant 7 bits of the final result, and two 25-bit vectors that are added using a carry-propagate adder to yield the remaining bits of the final product [10]. The combined multiplier uses a carry-lookahead adder with a block size of 4 [11].

With the modified full adders, the GF extended result vectors are extracted at the bottom of the tree. With the new wiring technique, each bit of the two extended result vectors is extracted from the partial product reduction tree as soon as it is ready. For the 16-bit combined multiplier, the resulting bits for the GF multiplication are ready after the third reduction stage. In both cases, the extended results are reduced modulo the irreducible polynomial.
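The reason unmodified adder cells can produce the GF result is that the sum output of a full or half adder is the XOR of its inputs. As long as only GF-relevant bits (and no carries) feed a chain of adders, the final sum bit is the modulo 2 sum of the column. The sketch below illustrates this for a single column. It is a simplified model with hypothetical names, and its grouping of bits does not reproduce the Reduced Area layout, so its stage count is not the one reported in the paper.

```python
def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def half_adder(a, b):
    """Return (sum, carry) for two input bits."""
    return a ^ b, a & b


def gf_column_bit(bits):
    """Combine the GF-relevant bits of one column using only sum outputs.

    Carries are collected separately and never re-enter this column's GF
    path, which is the wiring restriction described in the text.
    Returns (gf_bit, carries, stages).
    """
    bits = list(bits)
    carries = []
    stages = 0
    while len(bits) > 1:
        nxt = []
        i = 0
        while len(bits) - i >= 3:            # use full adders first
            s, c = full_adder(bits[i], bits[i + 1], bits[i + 2])
            nxt.append(s)
            carries.append(c)
            i += 3
        if len(bits) - i == 2:               # two bits left: half adder
            s, c = half_adder(bits[i], bits[i + 1])
            nxt.append(s)
            carries.append(c)
        elif len(bits) - i == 1:             # one bit left: pass through
            nxt.append(bits[i])
        bits = nxt
        stages += 1
    return (bits[0] if bits else 0), carries, stages


if __name__ == "__main__":
    column = [1, 0, 1, 1, 0, 1, 1, 0]        # eight GF-relevant partial products
    gf_bit, carries, stages = gf_column_bit(column)
    print(gf_bit, stages)
```

In fixed point mode the collected carries would be routed to the next column as usual, so the same adder cells serve both kinds of multiplication.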
Two polynomial reduction units operate in parallel to reduce the two extended results.

2.3. Polynomial Reduction

When performing a GF multiplication, the two 15-bit results obtained from the reduction tree each represent a polynomial $P(\alpha)$ of degree 14 (in general, degree $2m-2$ for GF($2^m$)). GF multiplication is computed modulo the irreducible polynomial $f(x)$, such that $C(\alpha) = P(\alpha) \bmod f(\alpha)$. $C(\alpha)$ is computed as the remainder of the polynomial division. The example below shows how to compute $C(\alpha)$ using the Linear Polynomial Reduction (LPR) algorithm, with $m = 4$, $P(\alpha) = (p_6\, p_5\, p_4\, p_3\, p_2\, p_1\, p_0)$, and $f(x) = x^4 + x + 1 \rightarrow f(\alpha) = (1\,0\,0\,1\,1)$. Subtraction in GF($2^m$) is also equivalent to bitwise XORing.

$$\begin{array}{rccccccc}
P: & p_6 & p_5 & p_4 & p_3 & p_2 & p_1 & p_0 \\
\oplus & p_6 & p_6 f_3 & p_6 f_2 & p_6 f_1 & p_6 f_0 & & \\ \hline
 & & p'_5 & p'_4 & p'_3 & p'_2 & p_1 & p_0 \\
\oplus & & p'_5 & p'_5 f_3 & p'_5 f_2 & p'_5 f_1 & p'_5 f_0 & \\ \hline
 & & & p''_4 & p''_3 & p''_2 & p''_1 & p_0 \\
\oplus & & & p''_4 & p''_4 f_3 & p''_4 f_2 & p''_4 f_1 & p''_4 f_0 \\ \hline
C: & & & & c_3 & c_2 & c_1 & c_0
\end{array}$$

The hardware implementation of the LPR algorithm is straightforward [12], as shown in Figure 3 for GF($2^4$). It requires $m(m-1)$ AND gates and $m(m-1)$ XOR gates. The worst case delay is $(m-1)$ AND gates and $(m-1)$ XOR gates.

[Fig. 3. Linear polynomial reduction for m = 4.]

The inputs to each polynomial reduction unit are the extended GF result (15 bits) and the 8 least significant bits of the irreducible polynomial. The MSB of $f(\alpha)$ is not needed because it is always 1.

It is important to notice that the LPR algorithm causes a linear delay, as opposed to the logarithmic delay introduced by the partial product reduction tree. However, the combined structure of the multiplier masks this delay. The 25-bit wide carry lookahead adder (CLA), which operates in parallel with the polynomial reduction, is considerably slower. Therefore, the GF-specific part of the design is not on the critical path.

A different implementation, the Parallel Polynomial Reduction (PPR), has been considered to reduce the delay of this stage [6]. The extended result $P(\alpha)$ can be expressed as

$$P(\alpha) = \sum_{i=m}^{2m-2} p_i \alpha^i + \sum_{i=0}^{m-1} p_i \alpha^i, \qquad (3)$$

where, using Equation (2), we can substitute

$$\alpha^i = A_i(\alpha) = \sum_{j=0}^{m-1} a_{i,j}\, \alpha^j, \quad m \le i \le 2m-2. \qquad (4)$$

The $A_i(\alpha)$ are the canonical representations in the field's basis of the field elements $\alpha^i$ that appear in the first summation in Equation (3). The terms $p_i A_i(\alpha)$ can be added to the second summation (which represents one element in canonical form) to produce the final result. The previous example of GF($2^4$) polynomial reduction is now computed as follows. With $f(x) = x^4 + x + 1$, we have $\alpha^4 = (0011)$, $\alpha^5 = (0110)$ and $\alpha^6 = (1100)$, so that

$$\begin{array}{rcccc}
P[3 \ldots 0]: & p_3 & p_2 & p_1 & p_0 \\
p_6 A_6: & p_6 & p_6 & 0 & 0 \\
p_5 A_5: & 0 & p_5 & p_5 & 0 \\
p_4 A_4: & 0 & 0 & p_4 & p_4 \\ \hline
C: & c_3 & c_2 & c_1 & c_0
\end{array}$$

where each column is added modulo 2.

[Fig. 4. Parallel polynomial reduction for m = 4.]

The design presented in this paper uses the pre-calculated canonical representation of the seven GF elements of the form $\alpha^i$, $i = 8, \ldots, 14$.
Each of these seven values $A_i(\alpha)$ is an 8-bit vector. The reduction is performed by adding the corresponding GF element $A_i(\alpha)$ to substitute for each bit above the 8th. The seven 8-bit values to be added are computed as soon as the extended result is ready. The modulo 2 addition of each $A_i(\alpha)$ to the 8 least significant bits of the extended result is done in parallel, in a binary tree configuration, which has logarithmic delay. An implementation is shown for $m = 4$ in Figure 4, where the AND blocks perform $p_i A_i(\alpha)$, and the XOR blocks perform GF addition. The gate count is $m(m-1)$ AND gates and $m(m-1)$ XOR gates, the same as for the LPR. The theoretical worst case delay of the PPR is only 1 AND gate and $\lceil \log_2 m \rceil$ XOR gates.
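The two reduction methods of this section can be compared with a short functional model. This is a sketch under stated assumptions: the names `lpr`, `precompute_a` and `ppr` are hypothetical, `f_low` holds the $m$ least significant bits of $f(x)$ (the leading 1 is implied), and the Python loops model only the arithmetic, not the gate-level timing. The LPR cancels one high-order bit per step, each step depending on the previous one, while the PPR XORs independent precomputed words $A_i$, which hardware can arrange as a tree.

```python
def lpr(p, f_low, m):
    """Linear Polynomial Reduction: cancel bits 2m-2 down to m, one per step."""
    for deg in range(2 * m - 2, m - 1, -1):
        if (p >> deg) & 1:
            # Subtract (XOR) f(x) * x^(deg - m); this clears bit `deg`.
            p ^= (1 << deg) | (f_low << (deg - m))
    return p


def precompute_a(f_low, m):
    """Canonical forms A_i of alpha^i for i = m .. 2m-2 (m - 1 words of m bits)."""
    table = []
    a = f_low                      # alpha^m equals the low part of f(alpha)
    for _ in range(m, 2 * m - 1):
        table.append(a)
        a <<= 1                    # multiply by alpha
        if (a >> m) & 1:           # degree m term appeared: substitute alpha^m
            a ^= (1 << m) | f_low
    return table


def ppr(p, table, m):
    """Parallel Polynomial Reduction: XOR p_i * A_i into the low m bits."""
    result = p & ((1 << m) - 1)
    for k, a_i in enumerate(table):
        if (p >> (m + k)) & 1:     # AND block: p_i selects A_i
            result ^= a_i          # XOR block: GF addition
    return result


if __name__ == "__main__":
    m = 4
    table = precompute_a(0b0011, m)          # f(x) = x^4 + x + 1
    print([format(a, "04b") for a in table])  # A_4, A_5, A_6
    print(lpr(0b1111111, 0b0011, m) == ppr(0b1111111, table, m))
```

For $m = 8$ the table holds seven 8-bit words, which matches the 56-bit reduction vector mentioned in Section 2, whereas the LPR needs only the 8-bit `f_low`.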

[Table 1. Total and normalized delays for each design. Rows: Delay (ns), Norm. Delay. Columns: Designs A through E.]

[Table 2. Total and normalized areas for each design. Rows: Equiv. Gates, Norm. Area. Columns: Designs A through E.]

This polynomial reduction unit requires more registers, seven bytes instead of one for GF($2^8$), but the decrease in delay is considerable. The delay is now comparable to that of the partial product reduction stage, which is important for pipelined designs.

The outputs of the polynomial reduction units are the two 8-bit GF products, $Z^{GF}_{high}$ and $Z^{GF}_{low}$. The only operation left is multiplexing the final output depending on the type of multiplication, GF or fixed point.

3. SYNTHESIS RESULTS

Variations of the combined 16-bit binary and dual GF multiplier were modeled in VHDL and then synthesized. The following units were chosen for modeling and synthesis:

Design A is a regular 16-bit fixed point multiplier [10], which serves as the reference design.

Design B uses modified adder cells in the partial product reduction tree to avoid adding carries. The polynomial reduction stage is performed with the LPR. This design is similar to the one presented in [9].

Design C rewires the reduction tree to avoid adding carry bits before GF results are ready. It uses the LPR.

Design D is the same as Design C, but uses the PPR.

Design E is the regular fixed point multiplier of Design A, plus two dedicated parallel GF multipliers [6].

These designs were synthesized for minimum area with Exemplar Logic's Leonardo toolset and the LCA300K 0.6 micron standard cell library. To minimize the effect of the particular technology, Design A is used as a reference, and all measures are normalized to its area and delay.

Delay figures are shown in Table 1. The worst case delay is smaller for Designs C and D than for Design B, with Designs C and D having roughly 7.5% less delay than B.
The overhead in Designs C and D corresponds to the output multiplexer and the AND gates used to produce $X^m_{high}$ and $X^m_{low}$. In Design B, the extra AND gates in the adders of the partial product reduction tree further increase the delay. The similarity of the results for C and D shows that the delay of the unit depends mostly on the partial product reduction and the carry-propagate adder (CPA). Changing the implementation of the polynomial reduction does not affect the total delay, because the CPA is slower than either polynomial reduction method. This effect is also reflected in Design E. The dedicated parallel GF multiplier requires only 2.9 ns to complete the multiplication, but without pipelining the clock cycle has to be set to the larger fixed point multiplication time.

Table 2 gives area estimates for each design. As expected, the estimates for Designs C and D are identical, since the gate counts for both polynomial reduction stages are the same. The normalized increase over the base multiplier is about 23%. In comparison, Design B requires about 59% more area than the base multiplier. The extra buffers required for supplying the $f$ signal to every adder in the reduction tree increase the area. Although not every adder needs the conditional carry function, the critical path still has six extra AND gates and the area is still larger than for Designs C and D. The area for Design E is clearly larger than for Designs C and D. The estimate given only covers the area invested in gates, but in practice adding dedicated GF multipliers has a much larger cost, in terms of registers, buses, extra control logic, etc.

Regarding area measures, it should be noted that the registers required to store the field's representation (either $f(x)$ for the LPR, or the seven $A_i(\alpha)$ for the PPR) are not included in the gate count. These data can be kept in special-purpose registers in the register file of the DSP.

[Table 3. Area and delay of the polynomial reduction stages. Rows: Area (eq. gates), Delay (ns). Columns: Parallel, Linear.]

Table 3 shows the synthesis results for the linear and parallel polynomial reduction methods. The delay is about 4.67 times smaller for the PPR, for exactly the same gate count.

In pipelined designs, the delay for polynomial reduction can be on the critical delay path. For processors with high clock rates, it may be desirable to pipeline the combined multiplier. One approach is to use a two-stage pipeline. For fixed point multiplication, the first pipeline stage generates partial products and reduces them to sum and carry vectors. The second stage performs the carry-propagate addition and the final output selection. Since the extended result for GF multiplication is available after two AND gate delays plus three full adder delays, and the parallel polynomial reducer has a delay of one AND gate plus three XOR gates, the complete GF multiplication can be performed in a single pipeline stage. With this approach, fixed point multiplication has a two-cycle latency, but two parallel GF multiplications have just a one-cycle latency.

4. CONCLUSIONS

This paper has shown that a DSP's 16-bit fixed point tree multiplier can be easily modified to support two parallel GF($2^8$) multiplications. For GF multiplications, the addition of carries is avoided with the novel connection methodology presented in this paper. This approach requires significantly less area and delay than previous designs, which use additional gates to set the carries to zero. The GF multiply results are ready in extended form after only two AND gate delays plus three full adder delays. The subsequent polynomial reduction of the two extended GF products can be performed in one AND gate delay plus three XOR gate delays by two Parallel Polynomial Reduction units. Adding dual GF($2^8$) multiplication to a 16-bit multiplier increases the delay by about 6% and the gate count by about 23%. A combined multiplier has the advantage of reusing the data buses and control logic of the existing multiplier, which simplifies the implementation.
Acknowledgment

This material is based upon work supported by the National Science Foundation under Grant No. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] S. B. Wicker and V. K. Bhargava, Reed-Solomon Codes and Their Applications, IEEE Press.

[2] R. Lidl and H. Niederreiter, Introduction to Finite Fields and Their Applications, Cambridge Univ. Press, 1994.

[3] E. D. Mastrovito, "VLSI Designs for Multiplications over Finite Fields GF(2^m)," in Proc. Sixth Int'l Conf. Applied Algebra, Algebraic Algorithms, and Error-Correcting Codes (AAECC-6), 1988.

[4] T. Zhang and K. K. Parhi, "Systematic Design of Original and Modified Mastrovito Multipliers for General Irreducible Polynomials," IEEE Transactions on Computers, vol. 50, 2001.

[5] C. Yeh, I. S. Reed, and T. K. Truong, "Systolic Multipliers for Finite Fields GF(2^m)," IEEE Transactions on Computers, vol. C-33, pp. 357, 1984.

[6] L. Gao and K. K. Parhi, "Custom VLSI Design of Efficient Low Latency and Low Power Finite Field Multiplier for Reed-Solomon Codec," in Proc. of 2000 IEEE International Symposium on Circuits and Systems (ISCAS), 2000, vol. IV.

[7] Texas Instruments, TMS320C64x Technical Overview.

[8] W. Drescher, G. Fettweis, and K. Bachmann, "VLSI Architecture for Non-Sequential Inversion over GF(2^m) Using the Euclidean Algorithm," in Int. Conf. on Signal Processing Applications & Technology.

[9] W. Drescher, K. Bachmann, and G. Fettweis, "VLSI Architecture for Datapath Integration of Arithmetic over GF(2^m) on DSPs," in Proc. IEEE ICASSP 97.

[10] K. C. Bickerstaff, M. J. Schulte, and E. Swartzlander, "Parallel Reduced Area Multipliers," Journal of VLSI Signal Processing, vol. 9.

[11] P. Pirsch, Architectures for Digital Signal Processing, Wiley, 1998.

[12] M. Matsumoto and K. Murase, "Multiplier in a Galois Field," U.S. Patent.
