AUTOMATIC GENERATION OF MODULAR MULTIPLIERS FOR FPGA APPLICATIONS

Jean-Luc Beuchat and Jean-Michel Muller, Senior Member, IEEE

LIP Research Report N° 7, LIP/Arénaire, CNRS INRIA ENS Lyon Université Lyon

Jean-Luc Beuchat is with the University of Tsukuba, Japan. J.-M. Muller is with the CNRS, laboratoire LIP, projet Arénaire, France.

Abstract: Since redundant number systems allow for constant time addition, they are often at the heart of modular multipliers designed for public key cryptography (PKC) applications. Indeed, PKC involves large operands (6 to 4 bits) and several researchers proposed carry-save or borrow-save algorithms. However, these number systems do not take advantage of the dedicated carry logic available in modern Field-Programmable Gate Arrays (FPGAs). To overcome this problem, we suggest performing modular multiplication in a high-radix carry-save number system, where a sum bit of the carry-save representation is replaced by a sum word. Two digits are then added by means of a small Carry-Ripple Adder (CRA). Furthermore, we propose an algorithm which selects the best high-radix carry-save representation for a given modulus, and generates a synthesizable VHDL description of the operator.

Index Terms: Modular multiplication, high-radix carry-save number system, FPGA.

I. INTRODUCTION

This paper is devoted to the study of modular multiplication of large operands on Field-Programmable Gate Arrays (FPGAs). This operation is crucial in many public key cryptosystems (e.g. elliptic curve cryptography, XTR, RSA) and various solutions have already been investigated. Since iterative algorithms offer a good trade-off between calculation time and circuit area, they have received considerable attention. Least-significant-digit-first schemes are often based on Montgomery's algorithm [1]. However, that approach requires pre- and post-processing and is of interest when a large number of consecutive modular multiplications is required (e.g. modular exponentiation). In this paper, we will consider a most-significant-digit-first scheme.

A. Horner's Rule-Based Modular Multiplication

In order to compute $\langle XY \rangle_M = XY \bmod M$, where $M$ is an $n$-bit integer such that $2^{n-1} < M < 2^n$, our algorithm is described by an iterative procedure based on the celebrated Horner's rule:

$$\langle XY \rangle_M = \left\langle 2\left(\cdots 2\left(2 x_{r-1} Y + x_{r-2} Y\right) + \cdots\right) + x_0 Y \right\rangle_M,$$

where $X = x_{r-1} x_{r-2} \dots x_1 x_0$ is an unsigned $r$-bit integer and $Y$ is an $n$-bit integer belonging to $\{0, \dots, M-1\}$. This equation can be expressed recursively as follows (we perform a modulo reduction at each step in order to keep an $n$-bit intermediate result):

$$T[i] = 2Q[i+1] + x_i Y, \qquad Q[i] = T[i] \bmod M, \tag{1}$$

where $Q[r] = 0$ and $Q[0] = \langle XY \rangle_M$. Since $Q[i+1]$ and $Y$ are less than or equal to $M-1$, $T[i] < 3M$ and Equation (1) is implemented by means of a left shift, an addition, a comparator, and up to two subtractions to perform the modulo reduction [2].

Since public key cryptography involves large integers, operands are often represented in the carry-save number system, which enables addition in constant time (see for instance [3]). However, due to the redundancy of this representation, comparison requires a conversion to a non-redundant number system. This operation involves carry propagations, thus losing the main advantage of the carry-save representation. Several improvements of the algorithm sketched by Equation (1) have therefore been investigated to avoid comparisons. They are based on the following observation: computing a number $P[i]$ congruent to $Q[i]$ modulo $M$ only requires inspecting a few most significant digits of $T[i]$. In order to avoid an expensive final modular reduction, $P[i]$ should be less than a very small multiple of $M$. The iteration described in this paper returns for instance $\langle XY \rangle_M$ or $\langle XY \rangle_M + M$, and requires at most one subtraction to get the final result. Koç and Hung [4], [5] proposed for instance a carry-save algorithm based on a sign estimation technique.
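The basic recurrence of Equation (1), before any redundant number system is introduced, can be illustrated with a few lines of Python. This is only a minimal sketch on ordinary integers; the function name `horner_modmul` and its argument order are illustrative choices, not part of the paper.

```python
def horner_modmul(x: int, y: int, m: int, r: int) -> int:
    """Return (x * y) mod m by scanning the r bits of x, most significant first.

    Sketch of Equation (1): T[i] = 2 Q[i+1] + x_i Y, Q[i] = T[i] mod M.
    """
    assert 0 <= y < m and 0 <= x < (1 << r)
    q = 0  # Q[r] = 0
    for i in range(r - 1, -1, -1):
        x_i = (x >> i) & 1
        t = 2 * q + x_i * y  # T[i] < 3M because Q[i+1] < M and Y < M
        # A comparison and at most two subtractions reduce T[i] modulo M.
        if t >= m:
            t -= m
        if t >= m:
            t -= m
        q = t  # Q[i]
    return q  # Q[0]
```

The two conditional subtractions correspond to the comparator and the "up to two subtractions" mentioned above; they are exactly the operations that become expensive once the intermediate result is kept in a redundant form.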
At each step, $-M$, $0$, or $M$ is added to $T[i]$ according to a few most significant digits of $Q[i+1]$. Takagi and Yajima [6], [7] applied a similar technique to design signed-digit-based architectures.

When the modulus is known at design time, which is often the case in public key cryptography, another approach consists in building a table $\psi(a) = \langle a 2^{\beta} \rangle_M$ and in defining the following iteration:

$$T[i] = 2P[i+1] + x_i Y, \tag{2}$$
$$P[i] = \psi(T[i] \operatorname{div} 2^{\beta}) + T[i] \bmod 2^{\beta}, \tag{3}$$

where $P[r] = 0$ and $\beta$ is generally chosen equal to $n$ or $n-1$. Since $\psi(T[i] \operatorname{div} 2^{\beta})$ is an $n$-bit number, $P[i]$ and $T[i]$ are respectively $(n+1)$- and $(n+3)$-bit numbers. Therefore, the algorithm sketched by the above equations requires a small table. Carry-save implementations of Equations (2) and (3) have for example been proposed by Jeong and Burleson [8], Kim and Sobelman [9], and Peeters et al. [10]. Since these algorithms depend on the modulus, they seem likely candidates for cryptographic hardware based on FPGAs: the reconfigurability of these devices allows one to optimize the architecture according to some parameters (e.g. the modulus) and to modify the hardware whenever they change.

B. FPGA-Specific Issues

Modern FPGAs are mainly designed for digital signal processing applications involving rather small operands (16 to 32 bits). FPGA manufacturers chose to include dedicated carry logic enabling the implementation of fast carry-ripple adders (CRA) for such operand sizes. Let us study for example the architecture of a Spartan-3 device. Figure 1 describes the simplified architecture of a slice, which is the main logic resource for implementing synchronous and combinatorial circuits. Each slice embeds two 4-input function generators (G-LUT and F-LUT), two storage elements (i.e. flip-flops FFY and FFX), carry logic (CYSELG, CYMUXG, CYSELF, CYMUXF, and CYINIT), arithmetic gates (GAND, FAND, XORG, and XORF), and wide-function multiplexers. Each function generator is implemented by means of a programmable look-up table (LUT).

Fig. 1. Simplified diagram of a slice of a Spartan-3 FPGA.

Recall that a full-adder (FA) cell has two bits $x_i$ and $y_i$, as well as a carry-in bit $c_{in}$ as inputs, and computes a sum bit $s_i$ and a carry-out bit $c_{out}$ such that $2c_{out} + s_i = x_i + y_i + c_{in}$. Let $h_i = x_i \oplus y_i$. Then, we have:

$$s_i = h_i \oplus c_{in}, \tag{4}$$
$$c_{out} = \begin{cases} x_i & \text{if } h_i = 0 \text{ (i.e. } x_i = y_i\text{)}, \\ c_{in} & \text{otherwise.} \end{cases} \tag{5}$$

Assume that the F-LUT function generator computes $h_i$. Then, the XORF gate implements Equation (4), whereas Equation (5) involves three multiplexers (CY0F, CYSELF, and CYMUXF). $s_i$ is either sent to another slice (output X) or stored in a flip-flop (FFX). The G-LUT function generator allows one to implement a second FA cell within the same slice, which thus embeds a 2-bit CRA. Unfortunately, a conventional carry-save adder (CSA) requires twice as many hardware resources: since each slice has a single input carry CIN, it is impossible to implement two FA cells with independent carry-in bits. Therefore, hardware design tools allocate two function generators when they are provided with a VHDL description of Equations (4) and (5). It is of course possible to write VHDL code which explicitly instantiates F-LUT, XORF, and CYMUXF. In this case, note that G-LUT can only be used to implement the control unit or the $\psi(\cdot)$ table of Equation (3). According to experimental results, G-LUT often remains unused. Though reducing the number of LUTs of the design, taking advantage of dedicated logic to describe a CSA leads to a larger operator (Table I). Similar problems arise for instance on all Virtex devices (Xilinx) and Cyclone II FPGAs (Altera). It is therefore interesting to investigate modular multiplication algorithms based on FPGA-friendly number systems.

TABLE I
AREA AND NUMBER OF LUTS OF THREE CARRY-SAVE ITERATION STAGES (SPARTAN-3 FPGA).

Algorithm               | Without carry logic     | With carry logic
                        | n = 32     | n = 64     | n = 32     | n = 64
Jeong and Burleson [8]  | 7 slices   | slices     | 9 slices   | 3 slices
                        | LUTs       | 39 LUTs    | 4 LUTs     | 7 LUTs
Kim and Sobelman [9]    | 74 slices  | 66 slices  | 93 slices  | 88 slices
                        | 39 LUTs    | 68 LUTs    | 3 LUTs     | 49 LUTs
Peeters et al. [10]     | 74 slices  | 6 slices   | 95 slices  | 9 slices
                        | 4 LUTs     | 7 LUTs     | 95 LUTs    | 9 LUTs

C. Our Contribution

We proposed a family of radix-2 algorithms designed for FPGAs embedding 4-input LUTs and dedicated carry logic in [11]. Table II compares our iteration stage against three carry-save schemes. Since these results do not include the conversion from carry-save to unsigned integer which occurs at the end of each multiplication, both area and delay of carry-save operators are underestimated. According to this experiment, our approach is efficient for moduli up to 32 bits. Thus, the aim of this paper is to extend our work to larger moduli.
In order to benefit from dedicated carry logic available in almost all FPGA families, we suggest choosing a high-radix carry-save number system, where each sum bit of the carry-save representation is replaced by a sum word (Section II). Such a number system allows for the design of a modular multiplication algorithm based on small CRAs (Section III). The main originality of our approach is to analyze the modulus in order to select the most efficient high-radix carry-save representation and to automatically generate the VHDL description of the operator (Section IV). Experimental results validate the efficiency of the proposed modular multiplication scheme (Section V). We proposed a preliminary version of this work based on a different iteration in [12].

TABLE II
AREA AND DELAY OF CARRY-SAVE AND RADIX-2 ITERATION STAGES (SPARTAN-3 FPGA).

Algorithm                | n = 16     | n = 32     | n = 64
Jeong and Burleson [8]   | 58 slices  | 7 slices   | 36 slices
                         | 9 ns       | ns         | 4 ns
Kim and Sobelman [9]     | 4 slices   | 79 slices  | 5 slices
                         | 8 ns       | ns         | ns
Peeters et al. [10]      | 5 slices   | 86 slices  | 63 slices
                         | 8 ns       | ns         | ns
Beuchat and Muller [11]  | slices     | 4 slices   | 8 slices
                         | ns         | 4 ns       | ns

II. HIGH-RADIX CARRY-SAVE NUMBERS

A $k$-digit high-radix carry-save number $X$ is denoted by

$$X = (x_{k-1}, \dots, x_0) = \left( \left(x^{(c)}_{k-1}, x^{(s)}_{k-1}\right), \dots, \left(x^{(c)}_0, x^{(s)}_0\right) \right),$$

where the $j$th digit $x_j$ consists of an $n_j$-bit sum word $x^{(s)}_j$ and a carry bit $x^{(c)}_j$ such that $x_j = x^{(s)}_j + x^{(c)}_j 2^{n_j}$. According to this definition, we have:

$$X = x_0 + x_1 2^{n_0} + x_2 2^{n_0+n_1} + \dots + x_{k-1} 2^{n_0+\dots+n_{k-2}} = x^{(s)}_0 + \sum_{i=1}^{k-1} \left( x^{(c)}_{i-1} + x^{(s)}_i \right) 2^{\sum_{j=0}^{i-1} n_j} + x^{(c)}_{k-1} 2^{\sum_{j=0}^{k-1} n_j}.$$

Let us define

$$X^{(s)} = x^{(s)}_0 + x^{(s)}_1 2^{n_0} + \dots + x^{(s)}_{k-1} 2^{n_0+\dots+n_{k-2}}$$

and

$$X^{(c)} = x^{(c)}_0 2^{n_0} + x^{(c)}_1 2^{n_0+n_1} + \dots + x^{(c)}_{k-1} 2^{n_0+\dots+n_{k-1}}.$$

With this notation, we have $X = X^{(s)} + X^{(c)}$. Such a number system has nice properties to deal with large numbers on FPGA:

- its redundancy allows one to perform addition in constant time (the critical path only depends on $\max_{0 \le j \le k-1} n_j$);
- the addition of a sum word $x^{(s)}_j$, a carry bit $x^{(c)}_{j-1}$, and an $n_j$-bit unsigned binary number can be performed by a CRA.

A key observation is that the sum words do not all need to have the same width. This peculiarity will allow us to select a number system according to the modulus to optimize our operators (Section IV). In the following, we will assume that the carry bit of the most significant digit is always equal to zero (the weight of the most significant carry bit is therefore equal to $2^{n_0+n_1+\dots+n_{k-2}}$).

Fig. 2. High-radix carry-save numbers. (a) Encoding of the number X. (b) Encoding of Z = 2X.

Example 1. Figure 2a describes a 4-digit high-radix carry-save number with $n_0 = n_1 = n_2 = 4$ and $n_3 = 3$. According to the first definition of this number system, we have:

$$X = x^{(s)}_0 + \left(x^{(c)}_0 + x^{(s)}_1\right) 2^4 + \left(x^{(c)}_1 + x^{(s)}_2\right) 2^8 + \left(x^{(c)}_2 + x^{(s)}_3\right) 2^{12} = 10 + (1 + 4) 2^4 + (0 + 3) 2^8 + (1 + 7) 2^{12} = 33626.$$

We can also compute $X^{(s)} = 29514$ and $X^{(c)} = 4112$. We obtain $X = X^{(s)} + X^{(c)} = 33626$.

Consider the modular multiplication described by Equations (2) and (3) and assume that both $T[i]$ and $P[i]$ are high-radix carry-save numbers.
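The definition of Section II can be made concrete with a short Python sketch that evaluates a high-radix carry-save number from its digit widths, sum words, and carry bits. The helper names (`hrcs_value`, `hrcs_split`) and the list-based encoding are assumptions made for illustration; the digit values are those of a 4-digit number with widths 4, 4, 4, and 3, as in Example 1.

```python
from itertools import accumulate


def hrcs_value(widths, sums, carries):
    """Value of a high-radix carry-save number.

    widths[j] = n_j, sums[j] = x_j^(s), carries[j] = x_j^(c), with digit 0
    the least significant one.
    """
    offsets = [0] + list(accumulate(widths))[:-1]  # exponent of digit j's weight
    total = 0
    for n_j, s_j, c_j, off in zip(widths, sums, carries, offsets):
        assert 0 <= s_j < (1 << n_j) and c_j in (0, 1)
        total += (s_j + (c_j << n_j)) << off  # x_j = x_j^(s) + x_j^(c) * 2^(n_j)
    return total


def hrcs_split(widths, sums, carries):
    """Return (X^(s), X^(c)), the contributions of sum words and carry bits."""
    zeros = [0] * len(widths)
    return (hrcs_value(widths, sums, zeros), hrcs_value(widths, zeros, carries))


# Illustrative digits: n_0 = n_1 = n_2 = 4 and n_3 = 3, top carry bit fixed to 0.
widths = [4, 4, 4, 3]
sums = [10, 4, 3, 7]
carries = [1, 0, 1, 0]
x = hrcs_value(widths, sums, carries)
x_s, x_c = hrcs_split(widths, sums, carries)
```

Because several (sum word, carry bit) combinations encode the same integer, `hrcs_value` is a many-to-one map; this redundancy is what allows the carries to stay local to each digit.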
Each equation now involves the addition of a high-radix carry-save number and an unsigned integer (a partial product $x_i Y$ or a number $\psi(T[i] \operatorname{div} 2^{\beta})$ stored in the table). Figure 3 describes how to perform these operations: the integer operand is split into $k$ words of respective lengths $n_0, \dots, n_{k-1}$; then, each of these words is added to a sum word and a carry bit by means of an $n_j$-bit CRA.

Fig. 3. Addition of an unsigned binary number and a high-radix carry-save number.

The high-radix carry-save encoding unfortunately has a drawback: shifting an operand modifies its representation. The following example illustrates this problem, which occurs in the computation of $T[i]$ (Equation (2)).

Example 2. Let us consider again the number $X = 33626$, whose format is defined in Figure 2a. By shifting $X$, we obtain $Z = 2X = 67252$ (Figure 2b). However, the least significant sum word is now a 5-bit number and

$$Z = z_0 + z_1 2^{n_0+1} + z_2 2^{n_0+n_1+1} + z_3 2^{n_0+n_1+n_2+1} = 20 + (1 + 4) 2^5 + (0 + 3) 2^9 + (1 + 7) 2^{13} = 67252.$$

According to this example, $2P[i+1]$ and $P[i]$ do not have the same encoding, and the width of the CRA dealing with the least significant digit of $P$ would increase at each step. The solution consists in converting $T[i]$ to the format of $P[i]$. In the following, we describe a modular multiplication algorithm which minimizes the hardware overhead introduced by this conversion.

III. HIGH-RADIX CARRY-SAVE MODULAR MULTIPLICATION

This section describes how to take advantage of a high-radix carry-save number system to perform a modular multiplication. We assume that:

- The modulus $M$ is an $n$-bit number belonging to $\{2^{n-1} + 1, \dots, 2^n - 1\}$.
- $X$ is an $r$-bit unsigned integer.
- $Y$ is an unsigned integer smaller than $M$.
- $\alpha$ is a small integer parameter which will determine the size of the table required for the modulo reduction.
- The most significant sum word of $P[i]$ contains at least $n_{k-1} = 5$ bits if $\alpha = 2$ and $n_{k-1} = 6$ bits if $\alpha = 1$.

These hypotheses guarantee that $P^{(c)}[i]$ is smaller than $2^{n-\alpha}$, i.e. $P^{(c)}[i] = P^{(c)}[i] \bmod 2^{n-\alpha}$ (we fixed the carry bit of the most significant digit of a high-radix carry-save number to zero in Section II), and $P^{(s)}[i]$ is an $(n+2)$-bit number (a proof is given in Appendix).

At each iteration, we compute a high-radix carry-save number $P[i]$ congruent to $2P[i+1] + x_i Y$ modulo $M$. According to our hypotheses, we have:

$$P[i+1] = \left(P^{(s)}[i+1] \operatorname{div} 2^{n-\alpha}\right) 2^{n-\alpha} + P^{(s)}[i+1] \bmod 2^{n-\alpha} + P^{(c)}[i+1]$$
$$\phantom{P[i+1]} = \left(P^{(s)}[i+1] \operatorname{div} 2^{n-\alpha}\right) 2^{n-\alpha} + P^{(s)}[i+1] \bmod 2^{n-\alpha} + P^{(c)}[i+1] \bmod 2^{n-\alpha}.$$

The iteration of our algorithm is slightly different from the one described in Section I. Let us define $k = P^{(s)}[i+1] \operatorname{div} 2^{n-\alpha}$ and write the intermediate result at step $i+1$ as follows:

$$P[i+1] = k 2^{n-\alpha} + P^{(s)}[i+1] \bmod 2^{n-\alpha} + P^{(c)}[i+1].$$

It is worth noticing that, according to our hypotheses, $k$ is a 3- or 4-bit unsigned number for $\alpha = 1$ or $\alpha = 2$ respectively. Thus, a small table addressed by $k$ allows one to efficiently compute a number congruent to $P[i+1]$ modulo $M$:

$$P[i+1] \equiv \langle k 2^{n-\alpha} \rangle_M + P^{(s)}[i+1] \bmod 2^{n-\alpha} + P^{(c)}[i+1] \pmod{M}.$$

Note that, when $\alpha = 1$, the table can be stored within the LUTs of a CRA on Spartan-3 and Virtex FPGAs [].

Since we compute a high-radix carry-save number congruent to $XY$ modulo $M$, a conversion and a final modulo reduction are mandatory. In order to keep the hardware overhead as small as possible, we apply a trick proposed by Peeters et al. [10] in the case of a carry-save implementation. At each step, our algorithm computes:

$$P[i] = x_i Y + 2 \langle k 2^{n-\alpha} \rangle_M + 2 \left( P^{(s)}[i+1] \bmod 2^{n-\alpha} + P^{(c)}[i+1] \right).$$

According to this equation, $P[i]$ is always even when $x_i Y = 0$. Thus, by performing an additional step with $x_{-1} = 0$, we obtain an even number $P[-1]$ congruent to $2XY$ modulo $M$.
Note that $P[-1]/2$ is smaller than or equal to $P[0]$ and easy to compute (a right shift of one position involves only wiring). Furthermore, performing the final modulo correction with $P[-1]/2$ requires fewer hardware resources. Let us define:

$$\psi_{\max} = \begin{cases} \max_{0 \le j < 2^4} \langle j 2^{n-\alpha} \rangle_M & \text{if } \alpha = 2 \text{ and } n_{k-1} \ge 5, \\ \max_{0 \le j < 2^3} \langle j 2^{n-\alpha} \rangle_M & \text{if } \alpha = 1 \text{ and } n_{k-1} \ge 6. \end{cases}$$

$P[-1]/2$ is a high-radix carry-save number equal to $\langle XY \rangle_M$ or $\langle XY \rangle_M + M$, and the final modulo reduction requires at most one subtraction in the following cases (see footnote 1):

- $\alpha = 2$ and $n_{k-1} \ge 5$;
- $\alpha = 1$, $n_{k-1} \ge 6$, and $2^{n-1} + 2^{n_0} + 2^{n_0+n_1} + \dots + 2^{n_0+\dots+n_{k-2}} + \psi_{\max} < 2M$.

A proof of correctness of this modular multiplication scheme, summarized by Algorithm 1, is provided in Appendix. At each iteration, a new intermediate result $P[i]$ is computed in two steps. Let $\tilde{P}[i+1]$ be a high-radix carry-save number such that $\tilde{P}^{(s)}[i+1] = P^{(s)}[i+1] \bmod 2^{n-\alpha}$ and $\tilde{P}^{(c)}[i+1] = P^{(c)}[i+1]$. We first carry out the sum of the partial product $x_i Y$ and $2\tilde{P}[i+1]$ by means of small CRAs: $T[i] = 2\tilde{P}[i+1] + x_i Y$. By shifting the high-radix carry-save number $\tilde{P}[i+1]$, we define a new internal representation for $T[i]$ (Section II). The second step consists in adding $2 \langle k 2^{n-\alpha} \rangle_M$ to $T[i]$, and in converting the result to the format of $P[i+1]$.

Algorithm 1 High-radix carry-save modulo multiplication
Input: An $n$-bit modulus $M$ such that $2^{n-1} < M < 2^n$, an $r$-bit number $X \in \mathbb{N}$, $Y \in \{0, \dots, M-1\}$, and a parameter $\alpha \in \{1, 2\}$. $P[i]$ and $T[i]$ are high-radix carry-save numbers.
Output: $P = \langle XY \rangle_M$.
    $P[r] \leftarrow 0$;
    $x_{-1} \leftarrow 0$;
    for $i$ in $r-1$ downto $-1$ do
        $k \leftarrow P^{(s)}[i+1] \operatorname{div} 2^{n-\alpha}$;
        $T[i] \leftarrow 2 \left( P^{(s)}[i+1] \bmod 2^{n-\alpha} \right) + 2 P^{(c)}[i+1] + x_i Y$;
        $P[i] \leftarrow T[i] + 2 \langle k 2^{n-\alpha} \rangle_M$;
    end for
    $P \leftarrow P[-1]/2$;
    if $P > M$ then
        $P \leftarrow P - M$;
    end if

The main difficulty of the implementation arises from the left shift of the carry bits $P^{(c)}[i+1]$. Since $T[i]$ has a different encoding, it is necessary to perform a conversion. We suggest computing a high-radix carry-save number $U[i]$ which has the same encoding as $P[i+1]$, and is equal to the sum of the carry bits of $T[i]$ and the output of the table.
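The arithmetic behind Algorithm 1 can be checked with a small Python model. This is a sketch under simplifying assumptions, not the hardware operator: the intermediate result is held as an ordinary integer (so the carry part $P^{(c)}$ is always zero and no format conversion is needed), the table entry $\langle k 2^{n-\alpha} \rangle_M$ is recomputed on the fly instead of being read from a LUT, and the final test uses `>=` so that a result equal to the modulus is also reduced. The function name is hypothetical.

```python
def modmul_table_iteration(x: int, y: int, m: int, r: int, alpha: int = 2) -> int:
    """Integer-only model of the iteration P[i] = x_i*Y + 2*psi(k) + 2*(P mod 2^(n-alpha))."""
    n = m.bit_length()
    assert alpha in (1, 2) and n > alpha
    assert 0 <= y < m and 0 <= x < (1 << r)
    shift = n - alpha
    mask = (1 << shift) - 1
    p = 0  # P[r] = 0
    for i in range(r - 1, -2, -1):  # i = r-1, ..., 0, then the extra step i = -1
        x_i = (x >> i) & 1 if i >= 0 else 0  # x_{-1} = 0
        k = p >> shift              # few most significant bits of P[i+1]
        psi = (k << shift) % m      # stands in for the table entry addressed by k
        p = x_i * y + 2 * psi + 2 * (p & mask)
    p //= 2          # P[-1] is even, so this is an exact right shift
    if p >= m:       # at most one subtraction is needed here
        p -= m
    return p
```

In this model $P[-1]/2 = \psi(k) + (P[0] \bmod 2^{n-\alpha})$, which is smaller than $M + 2^{n-\alpha} < 2M$, so a single conditional subtraction is enough. The conditions on $n_{k-1}$ and $\psi_{\max}$ stated above play the same role once the carry bits of the redundant representation are taken into account.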
Therefore, we perform the following operations at each iteration of Algorithm 1:

$$T[i] \leftarrow 2\tilde{P}[i+1] + x_i Y,$$
$$U[i] \leftarrow 2 \langle k 2^{n-\alpha} \rangle_M + T^{(c)}[i], \tag{6}$$
$$P[i] \leftarrow U[i] + T^{(s)}[i].$$

Example 3. Let $n = 16$, $k = 4$, and $n_0 = n_1 = n_2 = n_3 = 4$. The high-radix carry-save number $T[i]$ contains three carry bits of respective weights $2^5$, $2^9$, and $2^{13}$ (recall the constraint introduced in Section II: the carry bit of the most significant digit is always equal to zero). We split $2 \langle k 2^{n-\alpha} \rangle_M$ into four 4-bit words and perform three additions to compute $U[i]$ (Figure 4).

Footnote 1: Note that the algorithm described in our preliminary work [12] does not satisfy this property and the final modular reduction depends on the high-radix number system and the modulus.

Fig. 4. Conversion of T[i] by merging its carry bits with $2 \langle k 2^{n-\alpha} \rangle_M$.

Fig. 5. Cost of the addition of a carry bit (1). In this example, h is equal to five.

IV. CHOICE OF A HIGH-RADIX CARRY-SAVE NUMBER SYSTEM

Let us represent the table involved in the modulo correction by a matrix $\Psi$. Each line $\psi$ of $\Psi$ stores an $n$-bit number $\langle k 2^{n-\alpha} \rangle_M$. In the following, we will have to consider a subset of consecutive columns of $\Psi$. Let $\Psi^{(j+h:j)}$ be the matrix constituted of columns $j$ to $j+h$ of $\Psi$. Each line of $\Psi^{(j+h:j)}$ contains an $(h+1)$-bit number $\psi^{(j+h:j)}$.

Example 4. Let us consider the 16-bit modulus M = 547 and assume that $\alpha = 1$. The matrix $\Psi$ is built from this modulus. According to our notation, we have for instance the submatrix $\Psi^{(6:3)}$, whose lines are the 4-bit numbers $\psi^{(6:3)}$ (Equation (7)).

It is worth noticing that the amount of hardware required to compute $U[i]$ depends on the modulus and the encoding of $P[i]$. For instance, if a column of $\Psi$ contains only zeroes, it can be replaced by a carry bit at no extra cost. We propose an algorithm which selects a high-radix carry-save number system minimizing the hardware overhead introduced by the computation of $U[i]$ (Equation (6)).

Assume that we want to merge $t^{(c)}_w$ with the $j$th column and $t^{(c)}_{w+1}$ with the $(j+h)$th column of $\Psi$, and recall that the carry bits of $T[i]$ are left-shifted compared to those of $P[i]$ and $U[i]$ (Figure 5). Therefore, we compute a digit $u_{w+1}$ such that

$$u^{(c)}_{w+1} 2^{h} + \underbrace{u^{(s)}_{w+1}}_{h \text{ bits}} = 2\psi^{(j+h-2:j)} + 2t^{(c)}_w + \psi^{(j-1:j-1)}. \tag{8}$$

This operation involves at most an $(h-1)$-bit CRA. However, it is unlikely that there are very long strings of consecutive ones in matrix $\Psi$ (see below). Let us denote by $\#\psi^{(j+h-2:j)}$ the length of the longest string of consecutive 1s starting from the least significant bit of $\psi^{(j+h-2:j)}$.
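The quantity $\#\psi$ defined above is simply the number of trailing ones of a bit slice, and the parameter $l$ used in the case analysis is its maximum over all lines of the table. A minimal Python sketch follows; the function names and the convention that column 0 is the least significant bit are assumptions made for this illustration (the paper's own column numbering may differ), and the two table lines are made-up values, not the matrix of Example 4.

```python
def trailing_ones(value: int) -> int:
    """Length of the run of consecutive 1s starting at the least significant bit."""
    count = 0
    while value & 1:
        count += 1
        value >>= 1
    return count


def column_slice(word: int, hi: int, lo: int) -> int:
    """Bits hi..lo (inclusive) of word, assuming bit 0 is the least significant."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def longest_run(table, hi: int, lo: int) -> int:
    """l = maximum of #psi^(hi:lo) over all lines psi of the table."""
    return max(trailing_ones(column_slice(psi, hi, lo)) for psi in table)


# Hypothetical two-line table, used only to exercise the helpers.
example_table = [0b10110110, 0b01111000]
l = longest_run(example_table, hi=6, lo=3)
```

A carry bit added at the bottom of a slice can only ripple through its trailing ones, which is why $l$ bounds the width of the CRA needed for that slice.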
Then, the following cases may occur according to $l = \max \#\psi^{(j+h-2:j)}$:

- If $l = 0$, the $j$th column of $\Psi$ contains only zeroes and can be replaced by $t^{(c)}_w$ at no extra cost. More formally, we have (Figure 6a): $u^{(s)}_{w+1} = 4\psi^{(j+h-2:j+1)} + 2t^{(c)}_w + \psi^{(j-1:j-1)}$ and $u^{(c)}_{w+1} = 0$.
- If $l = h-1$, the addition requires an $(h-1)$-bit CRA which generates an output carry bit $u^{(c)}_{w+1}$ (see Equation (8) and Figure 6b). Since this bit will be added to a few bits of $T^{(s)}[i]$ in order to compute $p^{(s)}_{w+1}[i]$, we raise a flag which indicates this carry propagation.
- When $0 < l < h-1$, an $l$-bit CRA computes the sum and generates an output carry $c_{out}$. If the $(j+l)$th column of $\Psi$ stores only zeroes, we replace it by $c_{out}$ (Figure 6c). Otherwise, we need an OR gate to add $c_{out}$ to the $(j+l)$th column. Since we target FPGA applications, a more efficient solution consists in taking advantage of the dedicated carry logic to perform this operation and we add $t^{(c)}_w$ to $\psi^{(j+l:j)}$ by means of an $(l+1)$-bit CRA (Figure 6d). Note that $u^{(c)}_{w+1}$ is always equal to zero.

Let us try to estimate what values of $l$ we can expect in practice. If we consider that the bits in matrix $\Psi$ can be viewed as random bits, with equal probability for 0 as for 1, then the average value of $l$ will be $N(h)/2^{h-1}$, where $N(h)$ is the sum of the lengths of the strings of ones that start from the rightmost bit in all possible $(h-1)$-bit strings. Since we obviously have $N(2) = 1$ and $N(h) = 2N(h-1) + 1$, we immediately deduce that the average value of $l$ is $1 - 1/2^{h-1}$ (i.e. less than 1).

Fig. 6. Cost of the addition of a carry bit (2). (a) Only zeroes in the jth column of Ψ. (b) l = h-1. (c) l < h-1 and the (j+l)th column of Ψ contains only zeroes. (d) l < h-1.

Example 5 (Example 4 continued). Assume that we want to add carry bits to the 3rd and 8th columns of $\Psi$ (i.e. $j = 3$ and $h = 5$). We have to consider the matrix $\Psi^{(6:3)}$ given by Equation (7) and easily determine that $l = 2$. Thus, we have to examine the third column of $\Psi^{(6:3)}$ in order to compute the width of the CRA. Since this column contains a non-null element, we need an $(l+1)$-bit CRA (see Figure 6d).

Let us now build a directed acyclic graph as follows:

- Each node represents a column of the matrix $\Psi$.
- The weight of the edge between nodes $j$ and $j+h$ is given by the width of the CRA responsible for the addition of $t^{(c)}_w$ and $\psi^{(j+h-2:j)}$ (Equation (8)), as well as the flag which indicates a carry propagation.

A shortest path of this graph defines the high-radix carry-save representation minimizing the hardware overhead introduced by the computation of $U[i]$ for a given modulus. The algorithm requires two parameters to control the size of the CRAs:

- Since we want to perform a modular multiplication by means of small CRAs, we have to provide the algorithm with a constraint on the maximal number of positions $w_{\max}$ between two consecutive carry bits (without this constraint, we would for instance have an edge from the first node to the last node).
- The minimal distance $w_{\min}$ between two consecutive carries should be greater than or equal to two.
It guarantees that the smallest building block is a 2-bit CRA.

Algorithm 2 describes a way to build this graph. After having computed $\Psi$, we have to determine to which columns the most significant carry bit of $T[i]$ can be added. We denote by $j_{\max}$ the upper index. Recall that, when $\alpha = 2$, we assume that the most significant sum word of $P[i]$ contains at least $n_{k-1} = 5$ bits. Thus, we deduce that $j_{\max} = n - 2$ (Figure 7a). The most significant sum word of $U[i]$ is computed as follows:

$$u^{(s)}_{k-1} = 2\psi^{(n:j_{\max})} + 2t^{(c)}_{k-2} + \psi^{(j_{\max}-1:j_{\max}-1)}.$$

When $\alpha = 1$, we have $n_{k-1} \ge 6$ and $j_{\max} = n - 3$ (Figure 7b). It is sometimes possible to relax the constraint on $n_{k-1}$: it suffices that the addition of $t^{(c)}_{k-2}$ to $\psi^{(n:j_{\max})}$ does not generate an output carry (see the proof in Appendix for details). This condition is satisfied if the length of the longest string of consecutive 1s starting from the $j_{\max}$th column of $\Psi$ is smaller than or equal to $n - j_{\max}$.

We have to distinguish three cases to build the graph:

- The first carry bit can be added to any column whose index belongs to {,..., $w_{\max}$ + }. We create an edge of weight zero between the first node and the nodes associated with such columns.
- The most significant carry bit $t^{(c)}_{k-2}$ belongs to the set {$n - w_{\max}$ + ,..., $j_{\max}$}. Let $h \in \{w_{\min}, \dots, w_{\max}\}$. There is therefore an edge between nodes $j$ and $n$ if $j + h \ge n$.
- There is an edge between nodes $j$ and $j + h$, where $h \in \{w_{\min}, \dots, w_{\max}\}$, if $j + h \le j_{\max}$.

The next step consists in finding a shortest path in the graph. In order to minimize the critical path, we suggest removing all edges whose carry propagation flag is set to one. If there is no path between nodes 1 and $n$ in this pruned graph, we have to consider the full graph.

Example 6 (Example 4 continued). Let us apply Algorithm 2 to our 16-bit example. First, we note that adding a carry bit to the 15th column of the matrix does not generate an output carry and we set $j_{\max} = 15$. Then, we build the graph illustrated in Figure 8 according to Algorithm 2.
The c flag on the edge between nodes $j$ and $j + h$ indicates that adding a carry bit to the $j$th column of $\Psi$ generates an output carry bit. After having removed all edges labelled with a c flag and nodes without predecessor or successor, we obtain a pruned graph (Figure 9). Thus, $P[i]$ is a 4-digit word with $n_0 = n_1 = n_2 = 4$ and $n_3 = 6$ (Figure 10). Since $n = 16$ and $\psi_{\max}$ = () = 4497, we have:

$2^{n-1} + 2^{n_0} + 2^{n_0+n_1} + 2^{n_0+n_1+n_2} + \psi_{\max}$ = = 4666 < $2M$,

and $\alpha = 1$ is a valid choice. Note that, if the above inequality is not satisfied, we have to build a new graph with $\alpha = 2$. Recall that there is always a solution for $\alpha = 2$ and $n_{k-1} \ge 5$.

Once the high-radix carry-save representation is known, the automatic generation of a VHDL description of the modulo multiplier is rather straightforward. The computation of $T[i]$ involves $k$ CRAs of respective widths $n_0 + 1$, $n_1$, ..., $n_{k-2}$, and $n - n_{k-2} - \dots - n_0$ (Figure 7). Each edge of the graph encodes the size of the CRA determining a digit of $U[i]$ and the carry propagation flag indicates whether a carry bit $u^{(c)}_j$ is necessary or not. Finally, $k$ CRAs of widths $n_0, \dots, n_{k-1}$ allow one to add $T^{(s)}[i]$ to $U[i]$.

Fig. 7. Cost of the addition of a carry bit (3). (a) Computation of U[i] when α = 2 and $j_{\max} = n - 2$. (b) Computation of U[i] when α = 1 and $j_{\max} = n - 3$. Each panel shows step 1 (computation of T[i]) and step 2 (computation of U[i]).

Fig. 8. Choice of a high-radix carry-save representation for the 16-bit modulus M = 547 (1). Edges are labelled with a number of FAs and, where applicable, a c flag.

Fig. 9. Choice of a high-radix carry-save representation for the 16-bit modulus M = 547 (2). Shaded nodes belong to the shortest path.

Fig. 10. Choice of a high-radix carry-save representation for the 16-bit modulus M = 547 (3). Step 1: computation of U[i]. Step 2: computation of P[i].

Algorithm 2 Selection of a high-radix carry-save number system.
Input: An $n$-bit modulus $M$, $w_{\max}$, and $w_{\min}$ such that $w_{\max} \ge w_{\min}$. A parameter $\alpha \in \{1, 2\}$.
Output:
    Compute the matrix $\Psi$ according to $\alpha$;
    if $\alpha = 2$ then
        $j_{\max} \leftarrow n - 2$;
    else
        $j_{\max} \leftarrow n - 3$;
        $j \leftarrow n$ ;
        while $\#\psi^{(n:j)} \le n - j$ do
            $j_{\max} \leftarrow j$;
            $j \leftarrow j + 1$;
        end while
    end if
    for $j$ = to $w_{\max}$ do
        Create an edge of weight 0 between nodes and $j$;
    end for
    for $j$ = to $j_{\max}$ do
        for $h = w_{\min}$ to $w_{\max}$ do
            if $j + h \ge n$ and $h < w_{\max}$ then
                $l \leftarrow \max \#\psi^{(n:j)}$;
                Create an edge between nodes $j$ and $n$;
                Compute the weight of the edge (see Figure 6);
                $h \leftarrow w_{\max} + 1$ (exit the loop);
            else if $j + h \le j_{\max}$ then
                $l \leftarrow \max \#\psi^{(j+h-2:j)}$;
                Create an edge between nodes $j$ and $j + h$;
                Compute the weight of the edge (see Figure 6);
            end if
        end for
    end for

V. IMPLEMENTATION RESULTS

In order to compare our algorithm against modular multipliers published in the open literature, we wrote a VHDL code generator whose inputs are a modulus $M$ and $w_{\max}$ (maximal number of positions between two consecutive carry bits, see Section IV). Our tool returns a structural VHDL description of a high-radix carry-save multiplier, as well as scripts to automatically place-and-route the design and collect area and timing information. This tool also generates a VHDL description of two architectures proposed by other researchers. The first one, described by Peeters et al. in [10], is summarized by Algorithm 3. Intermediate results are carry-save numbers denoted by $(C[i], S[i])$. At each step, a CRA computes the sum $k$ of the three most significant bits of $C[i+1]$ and $S[i+1]$. This four-bit word addresses a table storing $2 \langle k 2^{n} \rangle_M$, $0 \le k \le 15$. Thanks to an additional iteration with $x_{-1} = 0$, this algorithm returns a carry-save number $(C[-1], S[-1])$ which is smaller than $2M$. Since our multiplier satisfies the same property, conversion to a non-redundant number system is performed with the same operator (see footnote 2).
We will therefore only consider iteration stages in our experiments in order to compare high-radix carry-save and carry-save number systems.

Footnote 2: Our approach further reduces the wiring since high-radix carry-save numbers involve fewer carry bits than carry-save numbers.

Algorithm 3 Peeters et al.'s modulo multiplication [10].
Input: An $n$-bit modulus $M$ such that $2^{n-1} < M < 2^n$, an $r$-bit number $X \in \mathbb{N}$, and $Y \in \{0, \dots, M-1\}$. We assume that $x_{-1} = 0$.
Output: $P = \langle XY \rangle_M$.
    $C[r] \leftarrow 0$; $S[r] \leftarrow 0$;
    for $i$ in $r-1$ downto $-1$ do
        $k \leftarrow C[i+1] \operatorname{div} 2^n + S[i+1] \operatorname{div} 2^n$;
        $T[i] \leftarrow x_i Y + 2 \langle k 2^n \rangle_M$;
        $(C[i], S[i]) \leftarrow \left( 2 (C[i+1] \bmod 2^n) + 2 (S[i+1] \bmod 2^n) \right) + T[i]$;
    end for
    $P \leftarrow (S[-1] + C[-1])/2$;
    if $P > M$ then
        $P \leftarrow P - M$;
    end if

Amanor et al. introduced a carry-save architecture optimized for modular multiplication on FPGAs in [3]. The authors assume that both $M$ and $Y$ are known at design time. This hypothesis allows for the design of an iteration stage embedding a single CSA and a table addressed by $x_i$ and the most significant bits of $C[i+1]$ and $S[i+1]$ (Algorithm 4). They show that the sum of the most significant bits of $C[i+1]$ and $S[i+1]$ is always a two-bit number. Therefore the table only contains eight values. Unfortunately, the authors did not address the final conversion issue. However, since $C[i+1] \operatorname{div} 2^n + S[i+1] \operatorname{div} 2^n \le 3$, we deduce that:

$$C[i+1] + S[i+1] = \left( C[i+1] \operatorname{div} 2^n + S[i+1] \operatorname{div} 2^n \right) 2^n + C[i+1] \bmod 2^n + S[i+1] \bmod 2^n \le 3 \cdot 2^n + 2 (2^n - 1) = 2^{n+2} + 2^n - 2.$$

Since $M$ belongs to $\{2^{n-1} + 1, \dots, 2^n - 1\}$, $C[i] + S[i]$ may be greater than $2M$ and Algorithm 4 requires more hardware resources than our algorithm or Peeters et al.'s scheme to perform a conversion.

Algorithm 4 Amanor et al.'s modulo multiplication [3].
Input: An $n$-bit modulus $M$ such that $2^{n-1} < M < 2^n$, an $r$-bit number $X \in \mathbb{N}$, and $Y \in \mathbb{N}$.
Output: $P = \langle XY \rangle_M$.
    $C[r] \leftarrow 0$; $S[r] \leftarrow 0$;
    for $i$ in $r-1$ downto $0$ do
        $k \leftarrow C[i+1] \operatorname{div} 2^n + S[i+1] \operatorname{div} 2^n$;
        $T[i] \leftarrow \langle 2 k 2^n \rangle_M + x_i Y$;
        $(C[i], S[i]) \leftarrow \left( 2 (C[i+1] \bmod 2^n) + 2 (S[i+1] \bmod 2^n) \right) + T[i]$;
    end for
    $P \leftarrow \langle C[0] + S[0] \rangle_M$;

Figure 11 describes place-and-route results on a Xilinx Spartan-3 FPGA. In these experiments, the moduli are 56-bit randomly generated primes. Compared against Algorithm 3, we observe that:

- Our high-radix carry-save architecture allows us to significantly reduce the number of slices, while only slightly lengthening the critical path.
- At the price of a longer critical path, we are able to further diminish the area by increasing the parameter $w_{\max}$. Note that conversion from (high-radix) carry-save to unsigned binary integer is usually based on pipelined CRAs (see for instance [3]). Depending on the trade-off between area and delay, this operator can be slower than an iteration stage based on (high-radix) carry-save arithmetic.
- The area of our operator is less sensitive to the choice of $M$. This is mainly related to the architecture of Xilinx FPGAs: in most cases, $\alpha = 1$ and each operator embeds a table addressed by three bits.
Since our target FPGA embeds four-input LUTs, this table is embedded within the slices of the adder computing $P[i]$. Since four bits address the table of Algorithm 3, additional LUTs are required. Their number depends on the modulus $M$: if $\psi$ contains null or identical columns, synthesis tools are able to simplify the design.

For the moduli considered in these experiments, high-radix carry-save multipliers have roughly the same area as the operator proposed by Amanor et al. in [13]. Recall that a CSA requires twice the number of slices of a CRA on our target FPGA family. Since the moduli involved in these experiments require only three bits to perform a modulo $M$ reduction, our architecture is mainly based on two CRAs. High-radix carry-save representations enable here the design of a more versatile modular multiplier (both $X$ and $Y$ are inputs) with the same number of slices.

Table III summarizes further results obtained with a Spartan-3 FPGA. We generated prime moduli for each experiment, and reported the interval in which the area and delay ratios between our proposal and Algorithms 3 and 4 lie. These experiments indicate that it is always possible to select a prime number $M$ for which our approach reduces the circuit area without increasing the critical path.

Figure: Simplified diagram of an LE in arithmetic mode (Cyclone II family). The diagram shows two three-input LUTs, the carry chain (LAB carry-in and carry-out), the register with its synchronous load and clear logic, and the row, column, direct link, local, and register chain routing.

Table IV digests experiment results involving an Altera Cyclone II FPGA. The figure above describes a Logic Element (LE), which is the smallest unit of configurable logic in the Cyclone II architecture.
Each LE includes a four-input LUT, a storage element, as well as dedicated carry logic, and operates in normal mode or arithmetic mode. CRAs are based on LEs in arithmetic mode, in which the LUT implements two three-input function generators. It is therefore impossible to store $\psi$ within LUTs of a CRA. This explains why our algorithm leads to slightly smaller area savings for this FPGA family. On Cyclone II devices, CSA operators are significantly faster; however, conversion to a non-redundant number system involves pipelined CRAs. If this operator is based on 32-bit blocks, our high-radix carry-save iteration stage has a slower critical path. In this case, our approach leads to smaller modular multipliers than CSA schemes, without impacting on computation time.

Figure: Area and delay of modular multipliers on a Spartan-3 FPGA. Each of the four panels (a) to (d) plots area [slices] against delay [ns] for $n = 256$ and one value of $w_{max}$, for the proposed algorithm, Algorithm 3, and Algorithm 4. Prime moduli were randomly generated for each experiment.

TABLE III
AREA AND DELAY RATIOS BETWEEN OUR PROPOSAL AND ALGORITHMS 3 AND 4 ON A SPARTAN-3 FPGA. PRIME MODULI WERE RANDOMLY GENERATED FOR EACH EXPERIMENT.

w_max | Area ratio, proposal / Algorithm 3 [10] | Delay ratio, proposal / Algorithm 3 [10] | Area ratio, proposal / Algorithm 4 [13] | Delay ratio, proposal / Algorithm 4 [13]
8 | [.45,.68] | [.89,.45] | [.77,.] | [.,.6]
6 | [.39,.58] | [.97,.49] | [.73,.97] | [.,.84]
8 | [.48,.68] | [.99,.3] | [.89,.8] | [.99,.38]
6 | [.44,.55] | [.99,.3] | [.79,.96] | [.99,.44]
3 | [.43,.53] | [.,.44] | [.77,.9] | [.,.6]
48 | [.4,.5] | [.4,.69] | [.74,.89] | [.,.79]
8 | [.5,.64] | [.98,.5] | [.9,.9] | [.99,.3]
6 | [.48,.56] | [.99,.9] | [.84,.96] | [.99,.3]
3 | [.44,.53] | [.4,.36] | [.78,.94] | [.4,.45]
48 | [.36,.5] | [.,.56] | [.79,.89] | [.8,.63]
8 | [.53,.63] | [.57,.38] | [.9,.] | [.67,.49]
6 | [.48,.55] | [.6,.47] | [.8,.97] | [.8,.59]
4 | [.46,.53] | [.55,.49] | [.83,.98] | [.85,.69]
3 | [.46,.5] | [.65,.46] | [.79,.93] | [.94,.66]
8 | [.54,.63] | [.89,.48] | [.93,.] | [.59,.7]
6 | [.49,.55] | [.94,.45] | [.85,.98] | [.67,.34]
4 | [.48,.53] | [.99,.59] | [.83,.97] | [.7,.38]
3 | [.47,.53] | [.,.68] | [.79,.94] | [.66,.35]

TABLE IV
AREA AND DELAY RATIOS BETWEEN OUR PROPOSAL AND ALGORITHMS 3 AND 4 ON A CYCLONE II FPGA. PRIME MODULI WERE RANDOMLY GENERATED FOR EACH EXPERIMENT.

N | w_max | Area ratio, proposal / Algorithm 3 [10] | Delay ratio, proposal / Algorithm 3 [10] | Area ratio, proposal / Algorithm 4 [13] | Delay ratio, proposal / Algorithm 4 [13]
64 | 8 | [.63,.77] | [.45,.87] | [.4,.36] | [.9,.54]
64 | 6 | [.6,.66] | [.58,.] | [.6,.9] | [.8,.75]
8 | 8 | [.67,.74] | [.4,.87] | [.4,.3] | [.77,.49]
8 | 6 | [.6,.65] | [.63,.4] | [.7,.5] | [.9,.85]
8 | 3 | [.58,.63] | [.85,.73] | [.,.9] | [.8, 3.79]

VI. CONCLUSION

We proposed an algorithm to automatically generate VHDL descriptions of modular multipliers for FPGAs. The main originality of our approach is the selection of an optimal high-radix carry-save encoding of intermediate results according to a given modulus. High-radix carry-save number systems take advantage of dedicated carry logic available in almost all FPGA families and reduce the amount of interconnects. Therefore, our approach allows us to significantly reduce the area of modular multipliers.

ACKNOWLEDGMENT

The authors are grateful to the anonymous referees for their valuable comments.
The work described in this paper has been supported in part by the New Energy and Industrial Technology Development Organization (NEDO), Japan, and by the Swiss National Science Foundation through the Advanced Researchers program while Jean-Luc Beuchat was at École Normale Supérieure de Lyon (grant PA 386). The authors would like to thank F. de Dinechin and J. Detrey for administrating the CAD tool servers on which experiments described in this paper were performed.

REFERENCES

[1] P. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.
[2] G. R. Blakley, "A computer algorithm for calculating the product ab modulo m," IEEE Trans. Comput., vol. C-32, no. 5, 1983.
[3] M. D. Ercegovac and T. Lang, Digital Arithmetic. Morgan Kaufmann, 2004.
[4] C. K. Koç and C. Y. Hung, "Carry-save adders for computing the product AB modulo N," Electronics Letters, vol. 26, no. 13, June 1990.
[5] C. K. Koç and C. Y. Hung, "A fast algorithm for modular reduction," IEE Proceedings: Computers and Digital Techniques, vol. 145, no. 4, pp. 265–271, July 1998.
[6] N. Takagi and S. Yajima, "Modular multiplication hardware algorithms with a redundant representation and their application to RSA cryptosystem," IEEE Trans. Comput., vol. 41, no. 7, July 1992.
[7] N. Takagi, "A radix-4 modular multiplication hardware algorithm for modular exponentiation," IEEE Trans. Comput., vol. 41, no. 8, Aug. 1992.
[8] Y.-J. Jeong and W. P. Burleson, "VLSI array algorithms and architectures for RSA modular multiplication," IEEE Trans. VLSI Syst., vol. 5, no. 2, pp. 211–217, June 1997.
[9] S. Kim and G. E. Sobelman, "Digit-serial modular multiplication using skew-tolerant domino CMOS," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE Computer Society.
[10] E. Peeters, M. Neve, and M. Ciet, "XTR implementation on reconfigurable hardware," in Cryptographic Hardware and Embedded Systems, CHES 2004, ser. Lecture Notes in Computer Science, M. Joye and J.-J. Quisquater, Eds. Springer, 2004.
[11] J.-L. Beuchat and J.-M. Muller, "Modulo m multiplication-addition: Algorithms and FPGA implementation," Electronics Letters, vol. 40, no. 11, May 2004.
[12] R. Beguenane, J.-L. Beuchat, J.-M. Muller, and S. Simard, "Modular multiplication of large integers on FPGA," in Proceedings of the 39th Asilomar Conference on Signals, Systems & Computers. IEEE Signal Processing Society, 2005.
[13] D. N. Amanor, C. Paar, J. Pelzl, V. Bunimov, and M. Schimmler, "Efficient hardware architectures for modular multiplication on FPGAs," in Proceedings of FPL 2005, 2005.

APPENDIX

This Appendix aims at proving the correctness of our algorithm. We proceed in three steps: after establishing a property of the modulo $M$ correction considered in this paper, we show that $P^{(s)}[i]$ is an $(n+2)$-bit number. We conclude by computing a bound on $P[-1]$ which indicates that $P[-1]/2 < 2M$. This proof also provides the reader with all the technical details required to implement the algorithm or an automatic code generator.

A. A Property of Modulo M Correction

The first step consists in establishing a property which will allow us to compute bounds on $P[i]$. Let $\gamma = 2^n - 2^{n-4} = 2^{n-1} + 2^{n-2} + 2^{n-3} + 2^{n-4}$ and $M \in \{2^{n-1}+1, \ldots, 2^n - 1\}$. Then,

$$(k \cdot 2^{n-2}) \bmod M < \gamma, \quad \forall k \in \{0, \ldots, 2^4 - 1\}. \qquad (9)$$

The proof is straightforward if the modulus $M$ is smaller than or equal to $\gamma$. Let us assume now that $M = \gamma + \beta$, where $\beta$ satisfies the following inequality: $1 \le \beta \le 2^{n-4} - 1$. For $k \in \{0, 1, 2, 3\}$, we easily check that $(k \cdot 2^{n-2}) \bmod M = k \cdot 2^{n-2} < \gamma$. Since $2^n \bmod M = 2^n - M$, we obtain:

$$(4 \cdot 2^{n-2}) \bmod M = 2^n - M = 2^{n-4} - \beta < \gamma.$$

Consequently, we have:

$$(5 \cdot 2^{n-2}) \bmod M = 2^{n-2} + 2^{n-4} - \beta < \gamma,$$
$$(6 \cdot 2^{n-2}) \bmod M = 2^{n-1} + 2^{n-4} - \beta < \gamma, \text{ and}$$
$$(7 \cdot 2^{n-2}) \bmod M = 2^{n-1} + 2^{n-2} + 2^{n-4} - \beta < \gamma.$$

For $k = 8$, the following modulo $M$ operation has to be carried out:

$$(8 \cdot 2^{n-2}) \bmod M = (2^n + 2^{n-4} - \beta) \bmod M.$$

Since $M < 2^n + 2^{n-4} - \beta \le 2^n + 2^{n-4} < 2M$, we deduce that:

$$(8 \cdot 2^{n-2}) \bmod M = 2^n + 2^{n-4} - \beta - M = 2^{n-3} - 2\beta.$$

Thus, we have:

$$(9 \cdot 2^{n-2}) \bmod M = 2^{n-2} + 2^{n-3} - 2\beta < \gamma,$$
$$(10 \cdot 2^{n-2}) \bmod M = 2^{n-1} + 2^{n-3} - 2\beta < \gamma, \text{ and}$$
$$(11 \cdot 2^{n-2}) \bmod M = 2^{n-1} + 2^{n-2} + 2^{n-3} - 2\beta < \gamma.$$

A modulo $M$ reduction is again required for $k = 12$.
Since $M < 2^n + 2^{n-3} - 2\beta \le 2^n + 2^{n-3} < 2M$, we obtain:

$$(12 \cdot 2^{n-2}) \bmod M = (2^n + 2^{n-3} - 2\beta) \bmod M = 2^n + 2^{n-3} - 2\beta - M = 2^{n-3} + 2^{n-4} - 3\beta.$$

Since $\beta \ge 1$, we conclude the proof by noting that

$$(13 \cdot 2^{n-2}) \bmod M = 2^{n-2} + 2^{n-3} + 2^{n-4} - 3\beta < \gamma,$$
$$(14 \cdot 2^{n-2}) \bmod M = 2^{n-1} + 2^{n-3} + 2^{n-4} - 3\beta < \gamma, \text{ and}$$
$$(15 \cdot 2^{n-2}) \bmod M = 2^{n-1} + 2^{n-2} + 2^{n-3} + 2^{n-4} - 3\beta < \gamma.$$

B. Width of $P^{(s)}[i]$

Let us prove by induction that $P^{(s)}[i]$ is an $(n+2)$-bit number. Since $P[r] = 0$, we check that $k = 0$, and $T[r-1] = P[r-1] = x_{r-1} Y$, which is an $n$-bit number. This property holds for $i = r - 1$. Assume now that $P^{(s)}[i+1]$ is an $(n+2)$-bit number. We have to consider two cases according to the parameter $\alpha$.

Our hypotheses guarantee that $n_{k-1} \ge 5$ for $\alpha = 2$. Therefore, $2\,(P^{(s)}[i+1] \bmod 2^{n-\alpha})$ contains $k$ sum words of respective widths $n'_0 = n_0 + 1, \ldots, n'_{k-2} = n_{k-2}$, and $n'_{k-1} = n_{k-1} - 4$ (Figure 3a). Let us split the partial product $x_i Y$ into $k$ blocks in order to add it word by word to $2\,(P^{(s)}[i+1] \bmod 2^{n-\alpha})$. We know that

$$\sum_{j=0}^{k-1} n'_j = n - 1.$$

Since $x_i Y$ is an $n$-bit integer, we deduce from the above equation that its most significant word contains $n'_{k-1} + 1 = n_{k-1} - 3$ bits. Therefore the sum of the most significant words of $2\,(P^{(s)}[i+1] \bmod 2^{n-\alpha})$ and $x_i Y$, and a carry bit is bounded by:

$$(2^{n'_{k-1}+1} - 1) + (2^{n'_{k-1}} - 1) + 1 = 2^{n_{k-1}-3} + 2^{n_{k-1}-4} - 1 = 3 \cdot 2^{n_{k-1}-4} - 1,$$

which is an $(n_{k-1} - 2)$-bit number. Therefore, since $\sum_{j=0}^{k-1} n_j = n + 2$, $T^{(s)}[i]$ is an $(n+1)$-bit number. Indeed, we have:

$$(n_{k-1} - 2) + \sum_{j=1}^{k-2} n_j + (n_0 + 1) = \sum_{j=0}^{k-1} n_j - 1 = n + 1.$$

Four most significant bits of $P^{(s)}[i+1]$ address the table responsible for the modulo $M$ correction (Figure 3b). Recall that we have to combine the output of this table and carry bits of $T[i]$ in order to generate a high-radix carry-save number $U[i]$, whose format is the one of $P[i]$. Since $2\,((k \cdot 2^{n-\alpha}) \bmod M)$ is an $(n+1)$-bit number, we split it in $k$ words of respective lengths $n_0, n_1, \ldots, n_{k-2}$, and $(n_{k-1} - 1)$. Consider now the addition of the most significant words of $2\,((k \cdot 2^{n-\alpha}) \bmod M)$ and $T^{(s)}[i]$, and the most significant carry bit of $T[i]$. According to our hypotheses, $n_{k-1} \ge 5$ and this most significant word contains at least 4 bits. Consider the worst case (Figure 3b), where $n_{k-1} = 5$. Equation (9) gives $2\,((k \cdot 2^{n-\alpha}) \bmod M) < 2\gamma = 15 \cdot 2^{n-3}$, so that the most significant word of the table output is smaller than or equal to 14. The addition of the most significant words of $T^{(s)}[i]$ and $U[i]$, and a carry bit never generates an output carry and $P^{(s)}[i]$ is therefore an $(n+2)$-bit number (Figure 3c).

Assume now that $\alpha = 1$. The same approach allows us to show that $T^{(s)}[i]$ is an $(n+1)$-bit word (Figure 4a). According to our hypotheses, the most significant word of $P^{(s)}[i]$ contains at least six bits. Therefore, the weight of the most significant carry bit of $T[i]$ is at most $2^{n-3}$. Since Equation (9) guarantees that $2\,((k \cdot 2^{n-\alpha}) \bmod M) < 2^n + 2^{n-1} + 2^{n-2} + 2^{n-3}$, we deduce that $U[i]$ is an $(n+1)$-bit number (Figure 4b). Note that, for some moduli, we can relax the constraint on $n_{k-1}$: the remainder of the proof will only assume that $U[i]$ is an $(n+1)$-bit number. An automatic code generator can check this condition very easily for a given value of $M$. Since the most significant words of $T^{(s)}[i]$ and $U[i]$ have the same size, their addition may generate an output carry and $P^{(s)}[i]$ is therefore an $(n+2)$-bit number (Figure 4c).

C. Final Modulo M Correction

The last step consists in proving that $P[-1]/2$ is smaller than $2M$. We have again to consider two cases according to $\alpha$.

Assume that $\alpha = 2$ and consider the last iteration (i.e. $i = -1$). Since the partial product $x_{-1} Y$ is equal to zero, we have:

$$T[-1] = 2\,(P^{(s)}[0] \bmod 2^{n-\alpha}) + P^{(c)}[0] \le 2^n - 4.$$

Thus, $P[-1] \le 2^n + 2M - 6$ and $P[-1]/2 \le 2^{n-1} + M - 3$. Since the modulus $M$ is supposed greater than $2^{n-1}$, we know that $P[-1]/2$ is smaller than $2M$.

When $\alpha = 1$, $2\,(P^{(s)}[i+1] \bmod 2^{n-\alpha})$ is smaller than or equal to $2^n - 2$. Recall that the weight of the most significant carry bit of $P^{(c)}[i+1]$ is equal to $2^{n_0 + n_1 + \cdots + n_{k-2}}$ (Section II). Thus,

$$T[-1] \le 2^n + 2^{n_0 + n_1 + \cdots + n_{k-2} + 1}, \quad \text{and} \quad P[-1] \le 2^n + 2^{n_0 + n_1 + \cdots + n_{k-2} + 1} + \psi_{max}.$$

Therefore, $P[-1]/2$ is smaller than $2M$ if

$$2^{n-1} + 2^{n_0 + n_1 + \cdots + n_{k-2}} + \psi_{max}/2 < 2M.$$
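The property stated in Equation (9) can be checked numerically for small operand sizes. The Python sketch below is not part of the original proof: it assumes that Equation (9) reads $(k \cdot 2^{n-2}) \bmod M < \gamma$ with $\gamma = 2^n - 2^{n-4}$, $2^{n-1} < M < 2^n$ and $k \in \{0, \ldots, 15\}$, and the function name `check_property_9` is hypothetical.

```python
def check_property_9(n: int) -> bool:
    """Exhaustively test (k * 2**(n-2)) mod M < gamma for every n-bit modulus M.

    Assumed reading of Equation (9): gamma = 2**n - 2**(n-4), with
    2**(n-1) < M < 2**n and k in {0, ..., 15}. Requires n >= 4.
    """
    gamma = 2**n - 2**(n - 4)
    for modulus in range(2**(n - 1) + 1, 2**n):
        for k in range(16):
            if (k * 2**(n - 2)) % modulus >= gamma:
                return False  # counterexample found
    return True


if __name__ == "__main__":
    for n in range(4, 13):
        print(n, check_property_9(n))
```

An automatic code generator could run a check of this kind for the single modulus it targets, which is much cheaper than the exhaustive loop over all $n$-bit moduli used here.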

Fig. 3. Proof of our algorithm for $\alpha = 2$ (case $n_{k-1} = 5$). a) Computation of $T[i]$. b) Modulo $M$ reduction and conversion. c) Computation of $P[i]$.

Fig. 4. Proof of our algorithm for $\alpha = 1$ (case $n_{k-1} = 6$). a) Computation of $T[i]$. b) Modulo $M$ reduction and conversion. c) Computation of $P[i]$.
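The word-by-word additions illustrated in Figures 3 and 4 rely on the high-radix carry-save principle: each pair of words is added by a small CRA, and the carry-out of a word is stored instead of being propagated to the next word. The Python sketch below only models this addition of two non-redundant operands, not the complete iteration stage. The function name `hrcs_add` and the list `widths` (standing for the word widths $n_0, \ldots, n_{k-1}$, least significant word first) are hypothetical, and both operands are assumed to fit in the total width.

```python
from typing import List, Tuple


def hrcs_add(a: int, b: int, widths: List[int]) -> Tuple[int, int]:
    """Add a and b word by word without propagating carries between words.

    Returns (sum_words, carry_bits) as integers such that
    sum_words + carry_bits == a + b. Assumes a, b < 2**sum(widths).
    """
    sum_words, carry_bits, offset = 0, 0, 0
    for w in widths:  # least significant word first
        mask = (1 << w) - 1
        # Small carry-ripple adder on one pair of w-bit words.
        word = ((a >> offset) & mask) + ((b >> offset) & mask)
        sum_words |= (word & mask) << offset       # w-bit sum word
        carry_bits |= (word >> w) << (offset + w)  # stored carry-out bit
        offset += w
    return sum_words, carry_bits


if __name__ == "__main__":
    s, c = hrcs_add(255, 1, [4, 4])
    print(s, c, s + c)  # prints "240 16 256"
```

With `widths = [1] * n` the sketch degenerates into a conventional carry-save addition of two operands, which is one way to see that larger words trade a longer carry chain for fewer stored carry bits.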

Title: Automatic Generation of Modular Mul Applications. Author(s): Beuchat, Jean-Luc; Muller, Jean-Mic. Citation: IEEE transactions on computers, 57(. Issue Date: 008-1. Text version: publisher. URL: http://hdl.handle.net/41/101169

More information

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs Article Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs E. George Walters III Department of Electrical and Computer Engineering, Penn State Erie,

More information

A VLSI Algorithm for Modular Multiplication/Division

A VLSI Algorithm for Modular Multiplication/Division A VLSI Algorithm for Modular Multiplication/Division Marcelo E. Kaihara and Naofumi Takagi Department of Information Engineering Nagoya University Nagoya, 464-8603, Japan mkaihara@takagi.nuie.nagoya-u.ac.jp

More information

A Hardware-Oriented Method for Evaluating Complex Polynomials

A Hardware-Oriented Method for Evaluating Complex Polynomials A Hardware-Oriented Method for Evaluating Complex Polynomials Miloš D Ercegovac Computer Science Department University of California at Los Angeles Los Angeles, CA 90095, USA milos@csuclaedu Jean-Michel

More information

Arithmetic in Integer Rings and Prime Fields

Arithmetic in Integer Rings and Prime Fields Arithmetic in Integer Rings and Prime Fields A 3 B 3 A 2 B 2 A 1 B 1 A 0 B 0 FA C 3 FA C 2 FA C 1 FA C 0 C 4 S 3 S 2 S 1 S 0 http://koclab.org Çetin Kaya Koç Spring 2018 1 / 71 Contents Arithmetic in Integer

More information

An Optimized Hardware Architecture of Montgomery Multiplication Algorithm

An Optimized Hardware Architecture of Montgomery Multiplication Algorithm An Optimized Hardware Architecture of Montgomery Multiplication Algorithm Miaoqing Huang 1, Kris Gaj 2, Soonhak Kwon 3, and Tarek El-Ghazawi 1 1 The George Washington University, Washington, DC 20052,

More information

Lecture 8: Sequential Multipliers

Lecture 8: Sequential Multipliers Lecture 8: Sequential Multipliers ECE 645 Computer Arithmetic 3/25/08 ECE 645 Computer Arithmetic Lecture Roadmap Sequential Multipliers Unsigned Signed Radix-2 Booth Recoding High-Radix Multiplication

More information

A High-Speed Realization of Chinese Remainder Theorem

A High-Speed Realization of Chinese Remainder Theorem Proceedings of the 2007 WSEAS Int. Conference on Circuits, Systems, Signal and Telecommunications, Gold Coast, Australia, January 17-19, 2007 97 A High-Speed Realization of Chinese Remainder Theorem Shuangching

More information

Novel Modulo 2 n +1Multipliers

Novel Modulo 2 n +1Multipliers Novel Modulo Multipliers H. T. Vergos Computer Engineering and Informatics Dept., University of Patras, 26500 Patras, Greece. vergos@ceid.upatras.gr C. Efstathiou Informatics Dept.,TEI of Athens, 12210

More information

Residue Number Systems Ivor Page 1

Residue Number Systems Ivor Page 1 Residue Number Systems 1 Residue Number Systems Ivor Page 1 7.1 Arithmetic in a modulus system The great speed of arithmetic in Residue Number Systems (RNS) comes from a simple theorem from number theory:

More information

Design and FPGA Implementation of Radix-10 Algorithm for Division with Limited Precision Primitives

Design and FPGA Implementation of Radix-10 Algorithm for Division with Limited Precision Primitives Design and FPGA Implementation of Radix-10 Algorithm for Division with Limited Precision Primitives Miloš D. Ercegovac Computer Science Department Univ. of California at Los Angeles California Robert McIlhenny

More information

Power Consumption Analysis. Arithmetic Level Countermeasures for ECC Coprocessor. Arithmetic Operators for Cryptography.

Power Consumption Analysis. Arithmetic Level Countermeasures for ECC Coprocessor. Arithmetic Operators for Cryptography. Power Consumption Analysis General principle: measure the current I in the circuit Arithmetic Level Countermeasures for ECC Coprocessor Arnaud Tisserand, Thomas Chabrier, Danuta Pamula I V DD circuit traces

More information

Logic and Computer Design Fundamentals. Chapter 5 Arithmetic Functions and Circuits

Logic and Computer Design Fundamentals. Chapter 5 Arithmetic Functions and Circuits Logic and Computer Design Fundamentals Chapter 5 Arithmetic Functions and Circuits Arithmetic functions Operate on binary vectors Use the same subfunction in each bit position Can design functional block

More information

Arithmetic Operators for Pairing-Based Cryptography

Arithmetic Operators for Pairing-Based Cryptography Arithmetic Operators for Pairing-Based Cryptography Jean-Luc Beuchat Laboratory of Cryptography and Information Security Graduate School of Systems and Information Engineering University of Tsukuba 1-1-1

More information

Arithmetic operators for pairing-based cryptography

Arithmetic operators for pairing-based cryptography 7. Kryptotag November 9 th, 2007 Arithmetic operators for pairing-based cryptography Jérémie Detrey Cosec, B-IT, Bonn, Germany jdetrey@bit.uni-bonn.de Joint work with: Jean-Luc Beuchat Nicolas Brisebarre

More information

Lecture 8. Sequential Multipliers

Lecture 8. Sequential Multipliers Lecture 8 Sequential Multipliers Required Reading Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Design Chapter 9, Basic Multiplication Scheme Chapter 10, High-Radix Multipliers Chapter

More information

Chapter 5 Arithmetic Circuits

Chapter 5 Arithmetic Circuits Chapter 5 Arithmetic Circuits SKEE2263 Digital Systems Mun im/ismahani/izam {munim@utm.my,e-izam@utm.my,ismahani@fke.utm.my} February 11, 2016 Table of Contents 1 Iterative Designs 2 Adders 3 High-Speed

More information

A Bit-Serial Unified Multiplier Architecture for Finite Fields GF(p) and GF(2 m )

A Bit-Serial Unified Multiplier Architecture for Finite Fields GF(p) and GF(2 m ) A Bit-Serial Unified Multiplier Architecture for Finite Fields GF(p) and GF(2 m ) Johann Großschädl Graz University of Technology Institute for Applied Information Processing and Communications Inffeldgasse

More information

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering TIMING ANALYSIS Overview Circuits do not respond instantaneously to input changes

More information

FPGA accelerated multipliers over binary composite fields constructed via low hamming weight irreducible polynomials

FPGA accelerated multipliers over binary composite fields constructed via low hamming weight irreducible polynomials FPGA accelerated multipliers over binary composite fields constructed via low hamming weight irreducible polynomials C. Shu, S. Kwon and K. Gaj Abstract: The efficient design of digit-serial multipliers

More information

On A Large-scale Multiplier for Public Key Cryptographic Hardware

On A Large-scale Multiplier for Public Key Cryptographic Hardware 1,a) 1 1 1 1 1 Wallace tree n log n 64 128 Wallace tree,, Wallace tree,, VHDL On A Large-scale Multiplier for Public Key Cryptographic Hardware Masaaki Shirase 1,a) Kimura Keigo 1 Murayama Hiroyuki 1 Kato

More information

Hardware Design I Chap. 4 Representative combinational logic

Hardware Design I Chap. 4 Representative combinational logic Hardware Design I Chap. 4 Representative combinational logic E-mail: shimada@is.naist.jp Already optimized circuits There are many optimized circuits which are well used You can reduce your design workload

More information

EECS150 - Digital Design Lecture 22 - Arithmetic Blocks, Part 1

EECS150 - Digital Design Lecture 22 - Arithmetic Blocks, Part 1 EECS150 - igital esign Lecture 22 - Arithmetic Blocks, Part 1 April 10, 2011 John Wawrzynek Spring 2011 EECS150 - Lec23-arith1 Page 1 Each cell: r i = a i XOR b i XOR c in Carry-ripple Adder Revisited

More information

Subquadratic space complexity multiplier for a class of binary fields using Toeplitz matrix approach

Subquadratic space complexity multiplier for a class of binary fields using Toeplitz matrix approach Subquadratic space complexity multiplier for a class of binary fields using Toeplitz matrix approach M A Hasan 1 and C Negre 2 1 ECE Department and CACR, University of Waterloo, Ontario, Canada 2 Team

More information

VLSI Arithmetic. Lecture 9: Carry-Save and Multi-Operand Addition. Prof. Vojin G. Oklobdzija University of California

VLSI Arithmetic. Lecture 9: Carry-Save and Multi-Operand Addition. Prof. Vojin G. Oklobdzija University of California VLSI Arithmetic Lecture 9: Carry-Save and Multi-Operand Addition Prof. Vojin G. Oklobdzija University of California http://www.ece.ucdavis.edu/acsel Carry-Save Addition* *from Parhami 2 June 18, 2003 Carry-Save

More information

A Suggestion for a Fast Residue Multiplier for a Family of Moduli of the Form (2 n (2 p ± 1))

A Suggestion for a Fast Residue Multiplier for a Family of Moduli of the Form (2 n (2 p ± 1)) The Computer Journal, 47(1), The British Computer Society; all rights reserved A Suggestion for a Fast Residue Multiplier for a Family of Moduli of the Form ( n ( p ± 1)) Ahmad A. Hiasat Electronics Engineering

More information

KEYWORDS: Multiple Valued Logic (MVL), Residue Number System (RNS), Quinary Logic (Q uin), Quinary Full Adder, QFA, Quinary Half Adder, QHA.

KEYWORDS: Multiple Valued Logic (MVL), Residue Number System (RNS), Quinary Logic (Q uin), Quinary Full Adder, QFA, Quinary Half Adder, QHA. GLOBAL JOURNAL OF ADVANCED ENGINEERING TECHNOLOGIES AND SCIENCES DESIGN OF A QUINARY TO RESIDUE NUMBER SYSTEM CONVERTER USING MULTI-LEVELS OF CONVERSION Hassan Amin Osseily Electrical and Electronics Department,

More information

Arithmetic Operators for Pairing-Based Cryptography

Arithmetic Operators for Pairing-Based Cryptography Arithmetic Operators for Pairing-Based Cryptography J.-L. Beuchat 1 N. Brisebarre 2 J. Detrey 3 E. Okamoto 1 1 University of Tsukuba, Japan 2 École Normale Supérieure de Lyon, France 3 Cosec, b-it, Bonn,

More information

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 9. Datapath Design Jacob Abraham Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 October 2, 2017 ECE Department, University of Texas at Austin

More information

Svoboda-Tung Division With No Compensation

Svoboda-Tung Division With No Compensation Svoboda-Tung Division With No Compensation Luis MONTALVO (IEEE Student Member), Alain GUYOT Integrated Systems Design Group, TIMA/INPG 46, Av. Félix Viallet, 38031 Grenoble Cedex, France. E-mail: montalvo@archi.imag.fr

More information

EFFICIENT MULTIOUTPUT CARRY LOOK-AHEAD ADDERS

EFFICIENT MULTIOUTPUT CARRY LOOK-AHEAD ADDERS INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 EFFICIENT MULTIOUTPUT CARRY LOOK-AHEAD ADDERS B. Venkata Sreecharan 1, C. Venkata Sudhakar 2 1 M.TECH (VLSI DESIGN)

More information

SUFFIX PROPERTY OF INVERSE MOD

SUFFIX PROPERTY OF INVERSE MOD IEEE TRANSACTIONS ON COMPUTERS, 2018 1 Algorithms for Inversion mod p k Çetin Kaya Koç, Fellow, IEEE, Abstract This paper describes and analyzes all existing algorithms for computing x = a 1 (mod p k )

More information

CS 140 Lecture 14 Standard Combinational Modules

CS 140 Lecture 14 Standard Combinational Modules CS 14 Lecture 14 Standard Combinational Modules Professor CK Cheng CSE Dept. UC San Diego Some slides from Harris and Harris 1 Part III. Standard Modules A. Interconnect B. Operators. Adders Multiplier

More information

Algorithms (II) Yu Yu. Shanghai Jiaotong University

Algorithms (II) Yu Yu. Shanghai Jiaotong University Algorithms (II) Yu Yu Shanghai Jiaotong University Chapter 1. Algorithms with Numbers Two seemingly similar problems Factoring: Given a number N, express it as a product of its prime factors. Primality:

More information

Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System

Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System G.Suresh, G.Indira Devi, P.Pavankumar Abstract The use of the improved table look up Residue Number System

More information

ARITHMETIC COMBINATIONAL MODULES AND NETWORKS

ARITHMETIC COMBINATIONAL MODULES AND NETWORKS ARITHMETIC COMBINATIONAL MODULES AND NETWORKS 1 SPECIFICATION OF ADDER MODULES FOR POSITIVE INTEGERS HALF-ADDER AND FULL-ADDER MODULES CARRY-RIPPLE AND CARRY-LOOKAHEAD ADDER MODULES NETWORKS OF ADDER MODULES

More information

High Performance GHASH Function for Long Messages

High Performance GHASH Function for Long Messages High Performance GHASH Function for Long Messages Nicolas Méloni 1, Christophe Négre 2 and M. Anwar Hasan 1 1 Department of Electrical and Computer Engineering University of Waterloo, Canada 2 Team DALI/ELIAUS

More information

Modular Multiplication in GF (p k ) using Lagrange Representation

Modular Multiplication in GF (p k ) using Lagrange Representation Modular Multiplication in GF (p k ) using Lagrange Representation Jean-Claude Bajard, Laurent Imbert, and Christophe Nègre Laboratoire d Informatique, de Robotique et de Microélectronique de Montpellier

More information

Tree and Array Multipliers Ivor Page 1

Tree and Array Multipliers Ivor Page 1 Tree and Array Multipliers 1 Tree and Array Multipliers Ivor Page 1 11.1 Tree Multipliers In Figure 1 seven input operands are combined by a tree of CSAs. The final level of the tree is a carry-completion

More information

I. INTRODUCTION. CMOS Technology: An Introduction to QCA Technology As an. T. Srinivasa Padmaja, C. M. Sri Priya

I. INTRODUCTION. CMOS Technology: An Introduction to QCA Technology As an. T. Srinivasa Padmaja, C. M. Sri Priya International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 5 ISSN : 2456-3307 Design and Implementation of Carry Look Ahead Adder

More information

Binary Multipliers. Reading: Study Chapter 3. The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding

Binary Multipliers. Reading: Study Chapter 3. The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding Binary Multipliers The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 2 2 4 6 8 2 4 6 8 3 3 6 9 2 5 8 2 24 27 4 4 8 2 6

More information

MODULAR multiplication with large integers is the main

MODULAR multiplication with large integers is the main 1658 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 5, MAY 2017 A General Digit-Serial Architecture for Montgomery Modular Multiplication Serdar Süer Erdem, Tuğrul Yanık,

More information

DIGITAL TECHNICS. Dr. Bálint Pődör. Óbuda University, Microelectronics and Technology Institute

DIGITAL TECHNICS. Dr. Bálint Pődör. Óbuda University, Microelectronics and Technology Institute DIGITAL TECHNICS Dr. Bálint Pődör Óbuda University, Microelectronics and Technology Institute 4. LECTURE: COMBINATIONAL LOGIC DESIGN: ARITHMETICS (THROUGH EXAMPLES) 2016/2017 COMBINATIONAL LOGIC DESIGN:

More information

Adders, subtractors comparators, multipliers and other ALU elements

Adders, subtractors comparators, multipliers and other ALU elements CSE4: Components and Design Techniques for Digital Systems Adders, subtractors comparators, multipliers and other ALU elements Instructor: Mohsen Imani UC San Diego Slides from: Prof.Tajana Simunic Rosing

More information
