Radix-4 Vectoring CORDIC Algorithm and Architectures. July 1998 Technical Report No: UMA-DAC-98/20


Radix-4 Vectoring CORDIC Algorithm and Architectures

J. Villalba    E. Antelo    J.D. Bruguera    E.L. Zapata

July 1998
Technical Report No: UMA-DAC-98/20

Published in: J. of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 19, no. 2, pp. , July 1998

University of Malaga
Department of Computer Architecture
C. Tecnologico, PO Box 4114, E Malaga, Spain

Radix-4 vectoring CORDIC algorithm and architectures

by J. Villalba (1), E. Antelo (2), J.D. Bruguera (2) and E.L. Zapata (1)

(1) Dept. of Computer Architecture, University of Malaga, Malaga, Spain. julio@ac.uma.es
(2) Dept. Electronica y Computacion, University of Santiago de Compostela, Spain. elisardo@usc.es

MANUSCRIPT FOR PARHI AND TAYLOR ASAP-96 SPECIAL ISSUE

Author responsible for correspondence:
Prof. Julio Villalba Moreno
Dpto. Arquitectura de Computadores, Universidad de Malaga, Complejo Tecnologico
P.O. BOX 4114, Malaga E (SPAIN)
Tfno , fax , julio@ac.uma.es

This work was supported in part by the Ministry of Education and Science (CICYT) of Spain under contract TIC C03-01.

Radix-4 vectoring CORDIC algorithm and architectures

Abstract

In this work we extend the radix-4 CORDIC algorithm to the vectoring mode (the radix-4 CORDIC algorithm was proposed recently by the authors for the rotation mode). The extension to the vectoring mode is not straightforward, since the digit selection function is more complex in the vectoring case than in the rotation case; as in the rotation mode, the scale factor is not constant. Although the radix-4 CORDIC algorithm in vectoring mode has a recurrence similar to that of the radix-4 division algorithm, there are specific issues concerning the vectoring algorithm that demand dedicated study. We present the digit selection for non-redundant and redundant arithmetic (following two different approaches: arithmetic comparisons and table look-up), the computation and compensation of the scale factor, and the implementation of the algorithm (with both types of digit selection) in a word-serial architecture. When compared with conventional radix-2 (redundant and non-redundant) architectures, the radix-4 algorithms present a significant speed-up for angle calculation. For the computation of the magnitude the speed-up is very slight, due to the non-constant scale factor in the radix-4 algorithm.

1 Introduction

The CORDIC algorithm (COordinate Rotation DIgital Computer) was introduced to compute trigonometric functions and generalized to compute linear and hyperbolic functions [1][2]. It is an iterative algorithm suitable for VLSI implementation because it employs only adders and shifters and it has a broad application field. Special attention has been paid by different researchers to the improvement of the algorithm in the last few years, as referenced in [3].

By means of the CORDIC algorithm, a vector (x, y) is rotated by an angle (rotation mode) or it is taken to the coordinate axis (vectoring mode). The algorithm is based on rotations over given elementary angles. The basic iteration or microrotation is:

    x_{i+1} = x_i + \sigma_i 2^{-i} y_i
    y_{i+1} = y_i - \sigma_i 2^{-i} x_i                                  (1)
    z_{i+1} = z_i + \sigma_i \tan^{-1}(2^{-i})

where $(x_0, y_0)$ are the initial coordinates of the vector, the z coordinate accumulates the angle, and $\sigma_i \in \{-1, +1\}$ specifies the direction of each microrotation. Each iteration introduces a scaling over both coordinates given by the expression $k_i = (1 + \sigma_i^2 2^{-2i})^{1/2}$, and thus, after n iterations the final x and y coordinates are scaled by the factor:

    K = \prod_{i=0}^{n-1} (1 + \sigma_i^2 2^{-2i})^{1/2}                 (2)
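As a point of reference, the following is a minimal Python sketch of the radix-2 vectoring iteration built directly from microrotation (1); the driving rule $\sigma_i = \mathrm{sign}(y_i)$, the function name and the test values are illustrative choices of this sketch, not taken from the paper.

    import math

    def radix2_cordic_vectoring(x0, y0, n):
        """Drive y to zero with n radix-2 microrotations (1), choosing
        sigma_i = sign(y_i).  Returns the scaled magnitude in x, the
        accumulated angle in z, and the scale factor K of equation (2)."""
        x, y, z = x0, y0, 0.0
        K = 1.0
        for i in range(n):
            sigma = 1 if y >= 0 else -1                 # sigma_i in {-1, +1}
            x, y = x + sigma * 2.0**(-i) * y, y - sigma * 2.0**(-i) * x
            z += sigma * math.atan(2.0**(-i))
            K *= math.sqrt(1.0 + 2.0**(-2 * i))         # k_i with sigma_i^2 = 1
        return x, z, K

    x, z, K = radix2_cordic_vectoring(0.6, 0.45, 24)
    print(z, math.atan2(0.45, 0.6))       # both close to 0.6435 rad
    print(x / K, math.hypot(0.6, 0.45))   # both close to 0.75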

In radix-2 CORDIC, the scale factor is constant since $|\sigma_i| = 1$ (for a detailed review of CORDIC see [3]). To preserve the magnitude of the vector, this factor must be compensated.

In recent years many computationally intensive applications have been proposed, including matrix computations and computer graphics algorithms that need to compute angles [4][5][6]. In [4][5][7] the angle computation and rotation operation is performed by means of CORDIC modules. It has been shown that the processors for angle calculation and rotation based on CORDIC modules present a significant improvement in performance as compared to the more conventional approach using standard modules like division, multiplication or square root [4].

In the classic radix-2 CORDIC algorithm, roughly only one bit of the result is computed in each of the iterations. In [8], we design a rotator based on a radix-4 CORDIC algorithm (rotation mode only). In that paper we prove that if radix-4 instead of radix-2 is used, the total number of microrotations of the CORDIC algorithm is halved because two bits of the result are computed in each of the iterations. However, the radix-4 CORDIC algorithm presents two drawbacks:

1. The selection of $\sigma_i$ is more complex than in the radix-2 case.

2. The scale factor is not constant, since $\sigma_i$ takes values different from 1 or -1.

As we show in [8], for the rotation mode the selection of $\sigma_i$ only depends on $z_i$ and the resulting digit selection table is simple. The scale factor is computed by combining a look-up table and linear approximations of the scale factor of each microrotation. The resulting architectures prove to be efficient when compared to the conventional radix-2 architectures (both with redundant and non-redundant arithmetic).

In this paper we extend the radix-4 CORDIC algorithm to the vectoring mode (both for redundant and non-redundant representation of the operands). This extension is very interesting for applications that are based on the angle computation and rotation operation. However, the extension is not straightforward due to the selection of $\sigma_i$. In the vectoring case the selection of $\sigma_i$ depends on both the x and y coordinates, and the resulting selection is more complex than in the rotation mode. On the other hand, the scale factor may be computed and compensated in a similar way as in the rotation mode.

The organization of the paper is as follows. First, in Section 2 we present the radix-4 CORDIC vectoring algorithm and demonstrate its convergence.

In Section 3 we deal with the selection of $\sigma_i$. Section 4 is dedicated to the scale factor computation and compensation. In Section 5 we illustrate the implementation of the algorithm in a word-serial architecture. Finally, Section 6 is dedicated to the evaluation and comparison, and Section 7 to the conclusions.

2 Radix-4 CORDIC Algorithm

In this section we develop a radix-4 CORDIC algorithm in the vectoring mode. We prove the convergence and give the precision and number of iterations.

First we perform an extension of the iterative equations of the radix-2 CORDIC algorithm to radix-4 [8]. Basically, we use elementary angles of the form $\tan^{-1}(\sigma_i 4^{-i})$ instead of $\tan^{-1}(2^{-i})$, in such a way that equations (1) become:

    x_{i+1} = x_i + \sigma_i 4^{-i} y_i
    y_{i+1} = y_i - \sigma_i 4^{-i} x_i                                  (3)
    z_{i+1} = z_i + \alpha_i(\sigma_i)

where $\alpha_i(\sigma_i) = \tan^{-1}(\sigma_i 4^{-i})$ and $\sigma_i$ takes values in the digit set $\{-a, \ldots, 0, \ldots, +a\}$, with $2 \le a \le 3$. The number of iterations to achieve n bits of precision is n/2. The rotated coordinates are scaled by the factor

    K = \prod_i (1 + \sigma_i^2 4^{-2i})^{1/2}                           (4)

The scale factor K is not constant, as it depends on the $\sigma_i$ values.

In the vectoring mode, coordinate y is taken to 0 as the iterations progress. Therefore, in each iteration, the $\sigma_i$ value must be selected so that the y coordinate approaches zero as the iterations progress. At the end of the iterations we only have to compensate the x coordinate, as the y coordinate is zero within the precision. Coordinate z does not require any correction.

2.1 Convergence of the radix-4 CORDIC algorithm in the vectoring mode

In order to prove the convergence of the radix-4 CORDIC algorithm we have to prove that variable y approaches zero as the index of the microrotations increases. In order to obtain a set of iterations where the radix-4 CORDIC vectoring may be efficiently performed, we consider the scaled value of $y_i$. That is, we define a new variable $w_i$:

    w_i = 4^i y_i                                                        (5)

This equation introduces a scaling on $y_i$ of the same order as the decrease produced in $y_i$ in each iteration. In this way, we manage to maintain the value of $w_i$ bounded. This simplifies the selection criteria and eliminates possible

imprecisions in the calculation. A similar solution is used in [5][9]. With this change, equations (3) look like:

    x_{i+1} = x_i + \sigma_i 4^{-2i} w_i
    w_{i+1} = 4 (w_i - \sigma_i x_i)                                     (6)
    z_{i+1} = z_i + \alpha_i(\sigma_i)

These equations reduce the number of barrel shifters to only one (see Section 5). Based on these new equations, we may obtain a selection criterion for $\sigma_i$ in each microrotation that guarantees the convergence of the algorithm.

The digit set for $\sigma_i$ is $\{-a, \ldots, 0, \ldots, +a\}$. This radix-4 digit set can be minimally redundant if $a = 2$ or maximally redundant if $a = 3$. As in the case of the radix-4 SRT division [10], it is convenient to define a redundancy factor $\rho = a/3$.

The main drawback of the radix-4 algorithm is the selection of the $\sigma_i$ values. In order to bound the w variable, in each iteration $\sigma_i$ must be a function of the value of both $x_i$ and $w_i$ (see iteration w in equation (6)). In this way, the selection process seems to be more complex than in the case of the radix-4 algorithm in the rotation mode [8], where the $\sigma_i$ values only depend on the z variable in each iteration.

We now propose the selection intervals for $\sigma_i$, that is, if $w_i \in [L_q(x_i), U_q(x_i)]$ then $\sigma_i = q$, with $q \in \{-a, \ldots, 0, \ldots, +a\}$. The selection interval $[L_q(x_i), U_q(x_i)]$ must assure that $w_{i+1}$ is bounded when selecting $\sigma_i = q$. The selection intervals we propose are defined by:

    L_q(x_i) = (q - \rho) x_i                                            (7)

and

    U_q(x_i) = (q + \rho) x_i                                            (8)

These intervals are similar to the selection intervals used in the radix-4 SRT division [10]. This is due to the fact that iteration w is similar to the iteration used in division. The iteration for division has the form $w_{i+1} = 4(w_i - q_i d)$, where $q_i$ is the quotient digit and d is the divisor. However, there are two important differences between the division algorithm and the CORDIC algorithm: first, in division the divisor is constant for all of the iterations, whereas in the CORDIC algorithm the x coordinate takes different values in each one of the iterations. Secondly, if redundant arithmetic is considered, in the division algorithm the divisor is represented in a non-redundant form, but in the CORDIC algorithm the x coordinate is represented in redundant form, leading to a more complex selection function. Although both algorithms are very similar, these two differences impose different constraints on the two algorithms. Consequently, a particular study must be made for the radix-4 CORDIC algorithm.

The intervals given in expressions (7) and (8) have to assure a) the continuity condition [10] and b) the bounding of $w_i$. The continuity condition implies that all of the selection intervals must cover the whole range of $w_i$. The continuity condition is assured if $L_q(x_i) \le U_{q-1}(x_i)$ for all i and q. Based on the definition of $L_q(x_i)$ and $U_{q-1}(x_i)$ (see equations (7) and (8)), the continuity

condition is assured if $\rho \ge 1/2$. As we have defined $\rho = a/3$ and the minimum value of a we can select is $a = 2$, the minimum value for $\rho$ is 2/3, and then the continuity condition is satisfied.

Next, we demonstrate that $w_i$ is bounded in each of the iterations with the selection intervals given in expressions (7) and (8). Without any loss of generality, in what follows we assume that $x_0 \ge 0$ and that $x_0$ and $y_0$ are fractional values, one of them normalized within the interval [0.5, 1). The bound for $w_i$ is:

    |w_i| \le 4 \rho x_i                                                 (9)

Proof: we prove this by induction.

Base case (i = 1). We consider two sets of values that $w_0$ may take in relation to $x_0$: either $|w_0|$ takes values that are lower than or equal to $4 \rho x_0$, or it takes larger values.

a) Let us consider the set of values $|w_0| \le 4 \rho x_0$. Since $|w_0| \le 4 \rho x_0$, it is clear that there exists $q \in \{-a, \ldots, 0, \ldots, +a\}$ that satisfies

    q x_0 - \rho x_0 \le w_0 \le q x_0 + \rho x_0                        (10)

Subtracting $q x_0$, multiplying by 4 and observing (6) we have:

    -4 \rho x_0 \le w_1 \le 4 \rho x_0                                   (11)

Thus, as $x_i$ is an increasing sequence [5], we may write:

    |w_1| \le 4 \rho x_1                                                 (12)

b) We now consider the set of values $|w_0| > 4 \rho x_0$. Let us assume that for this set of values we select $q = 2$; we are going to see that with this selection expression (9) is satisfied and there is still a bound for $w_1$. The worst case occurs when the ratio between $w_0$ and $x_0$ is infinite, that is, when $x_0 = 0$. If $q = 2$ and $x_0 = 0$, in the first iteration (i = 0) we have (see equations (6)):

    x_1 = 2 w_0
    w_1 = 4 w_0

We can thus write that:

    w_1 = 4 w_0 = 2 \cdot 2 w_0 = 2 x_1 \le 4 \rho x_1                   (13)

since $\rho = a/3$ with $a = 2$ or $a = 3$.

Induction hypothesis (i = m-1). We assume as true that $|w_{m-1}| \le 4 \rho x_{m-1}$.

Induction step (i = m): Because of the induction hypothesis, it is true that there is a q satisfying:

    q x_{m-1} - \rho x_{m-1} \le w_{m-1} \le q x_{m-1} + \rho x_{m-1}    (14)

Subtracting $q x_{m-1}$, multiplying by 4 and taking (6) into account, we may write:

    -4 \rho x_{m-1} \le w_m \le 4 \rho x_{m-1}                           (15)

Therefore, as $x_i$ is an increasing sequence [5], we may write:

    |w_m| \le 4 \rho x_m                                                 (16)

Q.E.D.

Now, we prove that $x_i$ is bounded. If we substitute expression (9) in the first equation of (6) we obtain:

    x_{i+1} \le x_i + 4 \rho \sigma_i 4^{-2i} x_i = x_i (1 + 4 \rho \sigma_i 4^{-2i})      (17)

Expressing this inequality as a function of $x_0$ we have:

    x_{i+1} \le x_0 \prod_{k=0}^{i} (1 + 4 \rho \sigma_k 4^{-2k})        (18)

The maximum value of this expression is reached when $\sigma_j = a$ for all $j \le i$ and $\rho = 1$. Therefore, we obtain the expression:

    x_{i+1} \le x_0 \prod_{k=0}^{i} (1 + 12 \cdot 4^{-2k})               (19)

This product is convergent since the corresponding infinite product is also convergent. The infinite product is convergent since it fulfills two conditions: the general term $(1 + 12 \cdot 4^{-2k})$ tends to 1 as k goes to infinity, and the series $u_1 + u_2 + \ldots + u_p + \ldots$ with $u_p = 12 \cdot 4^{-2p}$ is convergent too. Therefore, taking into account expression (9) and the bound of $x_i$, we conclude that $w_i$ is bounded.

In Figure 1 we show the selection intervals for the cases $\rho = 1$ and $\rho = 2/3$. In this Figure we only show the intervals for positive values of $w_i$; the intervals are symmetrical for the negative values of $w_i$. As we can see in this Figure, there is an overlap between the selection intervals. Observe that the overlap is greater in the case $\rho = 1$ than in the case $\rho = 2/3$. This overlap implies that we do not need to make exact comparisons to determine the selection interval. We can determine the suitable interval based on estimations of $w_i$ and $x_i$. This is very useful, both for non-redundant and redundant representations. For example, for redundant carry-save arithmetic we only have to assimilate (convert from redundant to non-redundant representation) a reduced number of most significant bits of $w_i$ and $x_i$ to determine the $\sigma_i$ value. In the next section we obtain the selection function based on estimations.

Precision and number of iterations obtained in radix-4

After p iterations, the angle between the vector $(x_p, y_p)$ and the x axis is (see expressions (5) and (9)):

    \tan^{-1}(y_p / x_p) = \tan^{-1}(4^{-p} w_p / x_p) \le \tan^{-1}(2 \rho \, 2^{-2p+1})      (20)

If $\rho = 2/3$ this value is slightly greater than the value obtained by means of 2p standard radix-2 CORDIC iterations ($\tan^{-1}(2^{-2p+1})$) and slightly less than the value obtained with 2p + 1 iterations ($\tan^{-1}(2^{-2p+2})$). If $\rho = 1$ this value coincides with 2p + 1 standard radix-2 iterations. Therefore, the radix-4 CORDIC algorithm in vectoring mode basically halves the number of microrotations with respect to the standard radix-2 CORDIC algorithm.
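To make the recurrence concrete, the following Python sketch runs (6) for n/2 microrotations. The digit selection is idealized here: $\sigma_i$ is $w_i/x_i$ rounded to the nearest digit and clamped to the digit set, which falls inside the selection intervals (7)-(8); the hardware selection functions of Section 3 use only truncated estimates. The function name and test values are illustrative, and $x_0 > 0$ is assumed.

    import math

    def radix4_vectoring(x0, y0, n_bits, a=2):
        """Radix-4 vectoring with the recurrence (6): w_i = 4^i * y_i is
        driven to zero in n_bits/2 microrotations.  sigma_i is chosen with
        exact arithmetic (w_i/x_i rounded to the nearest digit, clamped to
        the digit set {-a,...,+a}); x0 > 0 is assumed."""
        x, w, z = x0, y0, 0.0          # w_0 = 4^0 * y_0
        K = 1.0                        # accumulated (non-constant) scale factor
        for i in range(n_bits // 2):
            sigma = max(-a, min(a, round(w / x)))
            x, w = x + sigma * 4.0**(-2 * i) * w, 4.0 * (w - sigma * x)
            z += math.atan(sigma * 4.0**(-i))               # alpha_i(sigma_i)
            K *= math.sqrt(1.0 + sigma * sigma * 4.0**(-2 * i))
        return x, z, K

    x, z, K = radix4_vectoring(0.75, 0.4, 32)
    print(z - math.atan2(0.4, 0.75))          # angle error, tiny (cf. bound (20))
    print(x / K - math.hypot(0.75, 0.4))      # magnitude error after compensation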

3 Selection function

In this section we obtain the selection functions for the radix-4 CORDIC algorithm in the vectoring mode. We obtain a selection function that is valid for its hardware implementation in redundant as well as non-redundant arithmetic. To do this, we use two different techniques that produce two different selection functions. The first technique is based on arithmetic comparison and the second one is based on a look-up table. There is no clear difference in terms of area and time between the two techniques, and only a real implementation would tell us which one of them is better. For this reason, we explain both techniques.

3.1 Selection function by arithmetic comparisons

This method is based on comparing coordinate w to a couple of comparison points so that the $\sigma_i$ value is obtained. To reduce the number of comparisons we choose $a = 2$, so that $q \in \{0, \pm 1, \pm 2\}$. On the other hand, the only restrictions on the input data are $|x_0| < 1$, $|w_0| < 1$, and one of them must be normalized. Without any loss of generality we also assume that $x_0 \ge 0$. For clarity in the presentation, in what follows we assume non-redundant arithmetic. The extension to redundant arithmetic is considered at the end of this subsection.

Obtaining fixed comparison points for all the iterations

Let us assume the selection intervals given by $[L_q(x_i), U_q(x_i)]$ (see expressions (7)-(8)). We define $P_i(1)$ as the comparison point used for discriminating between the values $\sigma_i = 0$ and $\sigma_i = 1$, and we define $P_i(2)$ as the comparison point used for discriminating between the values $\sigma_i = 1$ and $\sigma_i = 2$ (we define $P_i(-1)$ and $P_i(-2)$ in a similar way). The comparison points that we have defined must belong to the overlap intervals and be easy to calculate and implement. Two suitable selections for the comparison points are:

    P_i(1) = (1/2) x_i        P_i(2) = (3/2) x_i                         (21)

because they are simple and belong to the overlap intervals (see Figure 1b). However, it is necessary to recalculate the comparison points in each iteration, as they depend on the numerical value of $x_i$.

The alternative we now present makes it only necessary to calculate the comparison points $P_i(1)$ and $P_i(2)$ in a few initial iterations; they remain fixed for the rest of the iterations. We are going to calculate from which iteration the comparison points obtained are valid for the remaining iterations.

As $x_i$ is a sequence of growing terms, the sequences of terms $L_q(x_i)$ and $U_q(x_i)$ are also growing (see expressions (7), (8) and Figure 1b). According to this, for the comparison points of the i-th stage ($P_i(1)$ and $P_i(2)$) to still be valid as comparison points in the remaining iterations, they must belong to the overlap intervals of these iterations. In Figure 2 we see how there is a common overlap area between all the iterations for q = 0 and q = 1. We are going to seek an iteration i such that the comparison points belong to the common overlap area, that is, we seek $P_i(1)$ and $P_i(2)$ such that (see Figure 2):

    L_1(x_\infty) \le P_i(1) \le U_0(x_i)                                (22)

    L_2(x_\infty) \le P_i(2) \le U_1(x_i)                                (23)

(The arguments would be the same for $P_i(-1)$ and $P_i(-2)$.)

Let us analyze equation (22); taking into account expressions (7), (8) and that the value of $P_i(1)$ is $(1/2) x_i$, we may write:

    (1/3) x_\infty \le (1/2) x_i \le (2/3) x_i                           (24)

A top bound for $x_i$ is obtained by making $\sigma_j = 2$ for every $j \ge i$ in the equation on x in (6). If we consider $\sigma_i = 2$ and substitute expression (9) with $\rho = 2/3$ in the equation on x of (6), we obtain:

    x_{i+1} \le x_i + (16/3) 4^{-2i} x_i = x_i (1 + (16/3) 4^{-2i})      (25)

Now, we obtain the value $x_{i+k}$ as a function of $x_i$:

    x_{i+k} \le x_i \prod_{j=i}^{i+k-1} (1 + (16/3) 4^{-2j})             (26)

As shown in Section 2.1, this series is convergent. The first inequality of (24) can be written as

    x_\infty / x_i \le 3/2                                               (27)

For practical implementations it is enough to take $x_{i+k}$ with a large k value instead of $x_\infty$. Therefore, condition (27) can be substituted by

    x_{i+k} / x_i \le 3/2                                                (28)

with k large enough. From expression (26) we can write

    x_{i+k} / x_i \le \prod_{j=i}^{i+k-1} (1 + (16/3) 4^{-2j})           (29)

This series converges very quickly. We have verified that for i = 0 and k = 500 a bound for this product is 8.7, which does not fulfill condition (28), whereas for i = 1 a good bound is 1.4, which does fulfill condition (28).
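As a quick numeric check (under the reconstruction of the factor $(1 + (16/3) 4^{-2j})$ used above), the product in (29) can be evaluated directly in Python:

    def product_bound(i, k=500):
        """Evaluate the right-hand side of (29) for k terms starting at j = i,
        using the factor (1 + (16/3) * 4**(-2j))."""
        p = 1.0
        for j in range(i, i + k):
            p *= 1.0 + (16.0 / 3.0) * 4.0**(-2 * j)
        return p

    print(product_bound(0))   # about 8.6: exceeds 3/2, condition (28) fails for i = 0
    print(product_bound(1))   # about 1.36: below 3/2, condition (28) holds for i = 1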

Therefore, we conclude that the value $P_1(1)$ can be used as a comparison point for the rest of the iterations, as this value is always within the overlap that is produced in the following iterations, that is:

    P_i(1) = P_1(1)   for all i \ge 1                                    (30)

Proceeding in the same way from expression (23), we find that the value $P_2(2)$ may be used as a comparison point for the rest of the iterations, that is:

    P_i(2) = P_2(2)   for all i \ge 2                                    (31)

Therefore, we only have to calculate the comparison points in the first three iterations. From this iteration on, the values calculated as comparison points ($P_2(2)$ and $P_1(1)$) are valid for the next iterations. Figure 3 shows the evolution of the common overlap bound and the location of $P_1(1)$ and $P_2(2)$. As a consequence, the selection function is:

    \sigma_i = +2   if w_i > P_i(2)
    \sigma_i = +1   if P_i(1) < w_i \le P_i(2)
    \sigma_i =  0   if P_i(-1) < w_i \le P_i(1)                          (32)
    \sigma_i = -1   if P_i(-2) < w_i \le P_i(-1)
    \sigma_i = -2   if w_i \le P_i(-2)

where

    P_i(1) = (1/2) x_0  if i = 0,    (1/2) x_1  if i \ge 1
    P_i(2) = (3/2) x_i  if i \le 2,  (3/2) x_2  if i > 2                 (33)

Size of the comparators

As the comparison points depend on the values of the x coordinate, the comparison with these points must be carried out by means of an add/subtract operation. This operation may be carried out in parallel with the shift associated with equations (6) in each iteration. However, we need n-bit additions/subtractions, which increase the hardware and slow down the comparison process (which would be significantly longer than the delay through the shifters). In what follows we prove that it is enough to perform the comparison with a few most significant bits of $P_i(1)$ and $P_i(2)$ for the coefficients to still be correct. This implies that fast adders/subtractors may be used for performing the comparison, saving hardware and obtaining comparison times of the same order as the delay of the shifters.

Reducing the number of bits to be compared

We now calculate the number of most significant bits of $w_i$ and $P_i(q)$ needed to perform the comparison without making errors. We must prevent

the comparison from producing different results for the truncated and non-truncated values of $w_i$ and $P_i(q)$. Let us call $\hat{w}_i$ and $\hat{P}_i(q)$ the truncated values of $w_i$ and $P_i(q)$ using f fractional bits. Let b and c be any values taken by $w_i$ and let $\hat{b}$ and $\hat{c}$ be the corresponding truncated values. Let us assume that the truncated values, with f fractional bits, of the higher limit of the overlap interval $U_q(x_i)$ and of the comparison point $P_i(q+1)$ correspond to the same value (see Figure 4). The values of $\hat{b}$ and $\hat{c}$ are the same as that of $\hat{P}_i(q+1)$ and for them we select $\sigma_i = q$; however, point c is higher than $U_q(x_i)$ and does not permit any selection except $\sigma_i = q + 1$, and thus the decision made for point c with the truncated values is not correct. In order to prevent this situation, it is necessary for the truncated values $\hat{P}_i(q+1)$ and $\hat{U}_q(x_i)$ to be different. Thus, the distance between $P_i(q+1)$ and $U_q(x_i)$ must be higher than the precision we observe for the truncated values of $w_i$ ($2^{-f}$ = distance between two consecutive points of $\hat{w}_i$, see Figure 4). In this way, we always have $\hat{P}_i(q+1) \ne \hat{U}_q(x_i)$. We can express this mathematically as:

    |U_q(x_i) - P_i(q+1)| > 2^{-f}                                       (34)

From Figure 1 we see that the overlap interval is $(1/3) x_i$, and as $x_i$ is a growing function, the amplitude of the intervals also grows (see Figure 3). For the same reason, the extremes of the intervals $L_q(x_i)$ and $U_q(x_i)$ are shifted in the same direction (see Figure 3). Consequently, the smallest of the distances between $P_i(q+1)$ and $U_q(x_i)$ arises in the initial iterations, and depends on the smallest value that $x_i$ may have in these iterations. Due to the normalization employed, the minimum and maximum magnitudes of the initial vector $(x_0, w_0)$ are 0.5 and $\sqrt{2}$. The extremes of the intervals are $U_0(x_i) = (2/3) x_i$ and $U_1(x_i) = (5/3) x_i$ (see Figure 1).

We are going to calculate the number of fractional bits needed. Let us assume that $i > 0$; due to the normalization of the input data, the minimum value of $x_i$ is 0.5. Thus, taking into account that the distance between $P_i(q+1)$ and $U_q(x_i)$ is greater than or equal to $(1/6) x_i$, it must happen that $(1/6) x_i > 2^{-f}$. From this expression we can deduce that $f > 3.5$, that is, we need at least 4 fractional bits. Reasoning in a similar fashion, we find that for $i = 0$ we need at least 5 fractional bits. Summarizing, we need to assimilate 5 fractional bits for $x_0$ and $w_0$ (a total of 7 bits taking into account the two bits of the integer part, including sign) and 4 fractional bits for $x_i$ and $w_i$ ($i \ge 1$) (a total of 8 bits taking into account the four bits of the integer part, including sign). We may now rewrite the selection function (32) as follows:

    \sigma_i = +2   if \hat{w}_i > \hat{P}_i(2)
    \sigma_i = +1   if \hat{P}_i(1) < \hat{w}_i \le \hat{P}_i(2)
    \sigma_i =  0   if \hat{P}_i(-1) < \hat{w}_i \le \hat{P}_i(1)        (35)
    \sigma_i = -1   if \hat{P}_i(-2) < \hat{w}_i \le \hat{P}_i(-1)
    \sigma_i = -2   if \hat{w}_i \le \hat{P}_i(-2)

where $\hat{P}_i(1)$ and $\hat{P}_i(2)$ are the truncated values of $P_i(1)$ and $P_i(2)$ (see expressions (33)).
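A behavioural Python sketch of the truncated selection (35) is given below; the truncation helper and the use of $-P_i(1)$ and $-P_i(2)$ as the negative comparison points are assumptions of this sketch, not details taken from the paper.

    import math

    def truncate(v, f):
        """Truncate v toward minus infinity to f fractional bits."""
        return math.floor(v * 2**f) / 2**f

    def select_sigma(w, p1, p2, f=4):
        """Selection function (35): compare the truncated w_i against the
        truncated comparison points P_i(1), P_i(2) and their negatives."""
        w_hat = truncate(w, f)
        if w_hat > truncate(p2, f):
            return 2
        if w_hat > truncate(p1, f):
            return 1
        if w_hat > truncate(-p1, f):
            return 0
        if w_hat > truncate(-p2, f):
            return -1
        return -2

    # Per (33), after the first iterations the points stay fixed, e.g.
    # sigma = select_sigma(w_i, 0.5 * x1, 1.5 * x2) for every i >= 2.
    print(select_sigma(0.9, 0.5 * 0.7, 1.5 * 0.8))   # returns 1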

Extension to redundant arithmetic

Without any loss of generality, we assume that we use carry-save redundant arithmetic. We truncate $w_i$ with f fractional bits. Since we are not taking into account bits with weight less than $2^{-f}$ for $w_i$, the maximum error is $2^{-f}$ in the sum word and $2^{-f}$ in the carry word, so the total error is $2^{-f+1}$. As this error is positive, the truncated value $\hat{w}_i$ and the real value $w_i$ satisfy:

    \hat{w}_i \le w_i < \hat{w}_i + 2^{-f+1}                             (36)

Now, for the truncated values $\hat{P}_i(q+1)$ and $\hat{U}_q(x_i)$ to be different, the distance between $P_i(q+1)$ and $U_q(x_i)$ must be larger than the precision we observe for the truncated values of $w_i$ in redundant arithmetic: $2^{-f+1}$. Condition (34) is now transformed into:

    |U_q(x_i) - P_i(q+1)| > 2^{-f+1}                                     (37)

Using the same arguments we find that in redundant arithmetic it is necessary to observe one more fractional bit. However, if the input data $(x_0, w_0)$ are in conventional arithmetic, this condition does not have to be applied for i = 0, and thus the number of fractional bits is 5 for all the iterations. Consequently, the selection function we propose in redundant carry-save arithmetic coincides with (35), truncating $P_i(q)$ and $w_i$ with 5 fractional bits.

3.2 Selection function by look-up table

We prefer to develop this method in redundant arithmetic. The non-redundant arithmetic version can be easily obtained from the redundant one and it will be studied later in this section. For the selection process we have to take into account that both $w_i$ and $x_i$ are represented in redundant carry-save form, but due to the overlap between the selection intervals, we can take an estimation of these values to obtain $\sigma_i$. Assume that we assimilate $w_i$ up to the t-th fractional bit, and $x_i$ up to the $\delta$-th fractional bit. We call the assimilated values $\hat{w}_i$ and $\hat{x}_i$ respectively. Therefore we can write that:

    \hat{w}_i \le w_i < \hat{w}_i + 2^{-t+1}    and    \hat{x}_i \le x_i < \hat{x}_i + 2^{-\delta+1}        (38)

Now we have to obtain relations between t and $\delta$ that assure the convergence of the algorithm, that is, the conditions that assure a correct selection of $\sigma_i$. We follow Figure 5 to obtain the suitable values of t and $\delta$. To make a correct selection of $\sigma_i$ from an estimation of $x_i$ ($\hat{x}_i$), the overlap $\Delta_q[\hat{x}_i]$ we have to consider between the intervals q and q - 1 is:

    \Delta_q[\hat{x}_i] = U_{q-1}[\hat{x}_i] - L_q[\hat{x}_i + 2^{-\delta+1}]        (39)

The value of $\Delta_q[\hat{x}_i]$ is the worst-case overlap, dependent only on the value of the estimation of $x_i$. In this way the selection only depends on the assimilated value of $x_i$ and not on the true value. On the other hand, to make a correct selection using an estimation of $w_i$, it is necessary that the overlap between the intervals, $\Delta_q[\hat{x}_i]$, be greater than $2^{-t}$ (this is the same case as the radix-4 SRT division [10]). Therefore, the selection with estimations will be correct if the following condition is satisfied:

    \Delta_q[\hat{x}_i] \ge 2^{-t}                                       (40)

From condition (40) and taking into account equations (7), (8) and (39), we obtain:

    (\rho + q - 1) \hat{x}_i - (q - \rho)(\hat{x}_i + 2^{-\delta+1}) \ge 2^{-t}        (41)

The worst-case condition is obtained for the greatest allowable value of q, that is $q = a = 3\rho$. Then we obtain a new expression:

    (2\rho - 1) \hat{x}_i - 2\rho \, 2^{-\delta+1} \ge 2^{-t}            (42)

The values of $\delta$ and t are constrained by the values of $\rho$ and $\hat{x}_i$. To obtain $\delta$ and t independent of the value of $x_i$, we must consider the worst case in expression (42), that is, we have to take $\hat{x}_i$ as the minimum possible value of $x_i$. We assume that $x_0$ or $y_0$ is normalized in the range [0.5, 1), and then $x_1 \ge 0.5$ [5]. The minimum value of $x_i$ is 0.5 since $x_{i+1} \ge x_i \ge x_1$ [5]. Then we take $\hat{x}_i = 0.5$ in expression (42). The parameter $\rho$ can take the values 2/3 or 1. From condition (42) we obtain suitable values of $\delta$ and t for both cases, $\rho = 2/3$ and $\rho = 1$:

a) $\rho = 2/3$: For this case $a = 2$ and $\sigma_i \in \{-2, -1, 0, +1, +2\}$. As in the CORDIC iteration (see equation (6)), $\sigma_i$ multiplies the value of $x_i$ and $w_i$. This digit set is very interesting since all digits are powers of two, and the multiplication by $\sigma_i$ can be done only by shifting. We obtain that suitable values are $\delta = 5$ and $t = 5$ (actually $t = 4$ could be used, but the resulting digit selection would be very complex).

b) $\rho = 1$: In this case $a = 3$ and then $\sigma_i \in \{-3, -2, -1, 0, +1, +2, +3\}$. The introduction of the value 3 in the digit set implies that an additional adder must be incorporated to perform the multiplication by 3. For this case we obtain $\delta = 4$ and $t = 3$ (as in the previous case, $t = 2$ could be used but this would result in a complex selection function).

As the selection of $\sigma_i$ is in the critical path of our architecture, we make design decisions in order to achieve a reduced critical path time at the cost of more silicon area, so we take $\rho = 1$, which leads to a less complex selection than the case $\rho = 2/3$, since the values of $\delta$ and t are lower in case b (note that for a digit selection table the number of input bits is critical).

We have determined the number of fractional bits of $x_i$ and $w_i$ that must be assimilated ($\delta$ and t). Now we have to determine the number of integer bits to assimilate of both operands, to obtain the total number of bits to be assimilated.
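The estimates in (38) come from assimilating only the most significant bits of the carry-save operands. A small Python sketch of that assimilation, reproducing the error model of (38), follows; representing the sum and carry words as two floats is only an illustration.

    import math

    def assimilate(sum_word, carry_word, frac_bits):
        """Assimilate (add and truncate) the top bits of a carry-save operand.

        Both words are truncated to frac_bits fractional bits before the
        short addition, so the estimate e satisfies
        e <= v < e + 2**(-frac_bits + 1), the error model of equation (38)."""
        scale = 2**frac_bits
        return (math.floor(sum_word * scale) + math.floor(carry_word * scale)) / scale

    # w = 3.3 held in carry-save form as 1.9 + 1.4; estimate with t = 3.
    w_hat = assimilate(1.9, 1.4, 3)
    print(w_hat, 3.3 - w_hat)   # w_hat <= 3.3 and the error is below 2**-2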

The maximum value of $x_i$ ($x_{max}$) is given by [5]: $x_{max} = K_{max} (x_0^2 + w_0^2)^{1/2}$, where $K_{max}$ is the maximum value of the scale factor. The maximum value of the scale factor depends on the value of a. If the scale factor is too large, the range of x is also large and then more integer bits must be assimilated. In order to reduce the value of $K_{max}$ we make the microrotation i = 0 a radix-2 microrotation with $\sigma_i \in \{-1, +1\}$. We can do that since we assume that the maximum angle to be computed is within the interval $[-\pi/2, +\pi/2]$. It can be easily demonstrated that, making the microrotation i = 0 radix-2, this range is covered. In this way, according to equations (2) and (4), and taking the maximum value of $\sigma_i$ in every iteration (that is, $\sigma_0 = 1$ and $\sigma_i = 3$ for all $i \ge 1$), we obtain a scale factor of $K_{max} = 1.80068$, and taking into account the expression for $x_{max}$, we obtain $x_{max} = 2.55$. As the range for the angle is $[-\pi/2, +\pi/2]$, the value of $x_i$ is always positive. Then we have to assimilate six bits of $x_i$ in each iteration: two integer bits (without sign) and four fractional bits.

The maximum value of $w_i$ ($w_{max}$) is easy to obtain, taking into account that $|w_i| \le 4\rho x_i$. As we have selected $\rho = 1$, $w_{max} = 4\rho x_{max} = 10.2$. Then we have to assimilate eight bits of $w_i$: five integer bits and three fractional bits.

Size of the look-up table

The selection of $\sigma_i$ is done by implementing a selection function (usually a look-up table) whose inputs are the assimilated bits of $w_i$ and $x_i$. In this way the look-up table would have a total of 14 input bits. This seems very large and the look-up operation would be too slow.

To reduce the complexity of the look-up table, we make use of the scaling technique. This technique has been widely used for division [10], and consists of the scaling of the dividend and the divisor such that the scaled divisor is within a certain range. The scaling does not affect the result since the quotient only depends on the ratio between the dividend and the divisor, and scaling does not affect this ratio. This idea can be applied to our CORDIC algorithm, that is, the scaling of $w_i$ and $x_i$ does not affect the angle to be computed. To make the implementation simpler we perform the scaling over the assimilated values, and not over the full-length words of $w_i$ and $x_i$. We propose scaling the value of $\hat{x}_i$ to have the scaled value in the interval [0.5, 1). As the range of $\hat{x}_i$ is [0.5, 2.55), the scaling operation involves only right shifts, and then the scaling does not affect the result, since the assimilation error is also reduced. The scaling operation is very simple: if $\hat{x}_i \in [0.5, 1)$ then no scaling is performed, if $\hat{x}_i \in [1, 2)$ then a right shift is performed, and if $\hat{x}_i \in [2, 2.55)$ two right shifts are carried out. The scaling also affects $\hat{w}_i$, and the scaled value is within the range (-4.5, 4.25). We only have to consider 3 bits of $\hat{x}_i$ (since $\hat{x}_i \ge 0.5$, the bit with weight 0.5 is always one) and 7 bits of $\hat{w}_i$. Following a procedure similar to the case of the radix-4 SRT division [10], we have obtained the selection function for $\sigma_i$. In Table 1 we show the selection function to be implemented by means of a look-up table.

    \hat{x}_i (scaled)   | \sigma_i = -3 | \sigma_i = -2 | \sigma_i = -1 | \sigma_i = 0 | \sigma_i = 1 | \sigma_i = 2 | \sigma_i = 3
    ---------------------|---------------|---------------|---------------|--------------|--------------|--------------|-------------
    [0.5000, 0.5625)     | [-11, -8]     | [-7, -6]      | [-5, -3]      | [-2, 0]      | [1, 2]       | [3, 4]       | [5, 10]
    [0.5625, 0.6250)     | [-12, -8]     | [-7, -6]      | [-5, -3]      | [-2, 0]      | [1, 2]       | [3, 5]       | [6, 11]
    [0.6250, 0.6875)     | [-13, -8]     | [-7, -6]      | [-5, -3]      | [-2, 0]      | [1, 2]       | [3, 5]       | [6, 12]
    [0.6875, 0.7500)     | [-14, -8]     | [-7, -6]      | [-5, -3]      | [-2, 0]      | [1, 4]       | [5, 6]       | [7, 13]
    [0.7500, 0.8125)     | [-15, -8]     | [-7, -6]      | [-5, -3]      | [-2, 0]      | [1, 4]       | [5, 6]       | [7, 14]
    [0.8125, 0.8750)     | [-16, -8]     | [-7, -6]      | [-5, -3]      | [-2, 0]      | [1, 4]       | [5, 8]       | [9, 15]
    [0.8750, 0.9375)     | [-17, -13]    | [-12, -6]     | [-5, -3]      | [-2, 0]      | [1, 4]       | [5, 8]       | [9, 16]
    [0.9375, 1.0000)     | [-18, -13]    | [-12, -6]     | [-5, -3]      | [-2, 0]      | [1, 4]       | [5, 8]       | [9, 17]

    Table 1: Selection table for \sigma_i (each entry is the interval of $4\hat{w}_i$, with $\hat{w}_i$ scaled, that selects the corresponding digit).

As we can see in this table, the least significant bit of the scaled value of $\hat{w}_i$ does not affect the selection, and therefore the look-up table will have 9 input bits (3 corresponding to $\hat{x}_i$ and 6 corresponding to $\hat{w}_i$) and 3 output bits.

The look-up table method in non-redundant arithmetic

The non-redundant arithmetic case can be seen as a simplification of the redundant arithmetic case. Now, the maximum assimilation errors in coordinates w and x are $2^{-t}$ and $2^{-\delta}$ respectively, and expression (38) becomes:

    \hat{w}_i \le w_i < \hat{w}_i + 2^{-t}    and    \hat{x}_i \le x_i < \hat{x}_i + 2^{-\delta}        (43)

In this case the overlap is $\Delta_q[\hat{x}_i] \ge 0$ and condition (42) becomes:

    2^{-\delta} \le (2\rho - 1) \hat{x}_i / 2                            (44)

In this case the best option is to choose $\rho = 2/3$, since we avoid the adders needed to work with the value $\sigma_i = 3$, and the size of the table obtained is 9 input bits (6 bits for w and 3 bits for x), which is similar to the redundant case with $\rho = 1$.

4 Scale Factor

If we are interested in the magnitude of the vector, it is necessary to compensate the scale factor. We use the same technique that appears in [8] to solve the non-constant scale factor problem. Basically, after the first n/8 + 1 microrotations we access a table in order to obtain the value of the scale factor, and in the next iterations ($n/8 + 1 \le i \le n/4$) we calculate the scale factor by a shift-and-add operation in each iteration, since for these iterations the scale factor generated in each microrotation, $k_i = (1 + \sigma_i^2 4^{-2i})^{1/2}$, may be approximated by the first two terms of its Taylor series expansion ($k_i \approx 1 + (1/2) \sigma_i^2 4^{-2i}$). Finally, we perform the division by K ($K = \prod k_i$) using the radix-4 CORDIC algorithm in the vectoring mode and linear coordinates (this is a conventional radix-4 division).
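A behavioural Python sketch of this scale-factor evaluation is given below. The contribution of the first n/8 + 1 microrotations, which in hardware comes from a look-up table addressed by their digits, is simply computed exactly here to stand in for that table; the function name and the digit sequence are illustrative (with $\sigma_0$ restricted to $\{-1, +1\}$ as in Section 3.2). The compensation itself is then performed with the linear-coordinate recurrence (45) given next.

    import math

    def scale_factor(sigmas, n):
        """Approximate K = prod k_i as in Section 4: microrotations 0 .. n/8
        (a table look-up in hardware) are computed exactly here, microrotations
        n/8+1 .. n/4 use the two-term Taylor approximation
        k_i ~ 1 + (1/2) sigma_i^2 4^(-2i), and later factors are below the
        working precision."""
        table_part = math.prod(math.sqrt(1.0 + s * s * 4.0**(-2 * i))
                               for i, s in enumerate(sigmas[: n // 8 + 1]))
        taylor_part = math.prod(1.0 + 0.5 * s * s * 4.0**(-2 * i)
                                for i, s in enumerate(sigmas[: n // 4 + 1])
                                if i > n // 8)
        return table_part * taylor_part

    sigmas = [1, 3, -2, 1, 0, 2, -1, 3, 1, 0, -2, 1, 2, 0, 1, 1]  # example digits
    K_exact = math.prod(math.sqrt(1.0 + s * s * 4.0**(-2 * i))
                        for i, s in enumerate(sigmas))
    print(scale_factor(sigmas, 32), K_exact)   # agree to better than 2**-32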

The radix-4 CORDIC equations in linear coordinates are the following:

    x_{i+1} = x_i
    w_{i+1} = 4 (w_i - \sigma_i x_i)                                     (45)
    z_{i+1} = z_i + \sigma_i 4^{-i}

Performing the same analysis as in Section 2, we can test the convergence of the algorithm in linear coordinates. Also, the same selection functions given in Section 3 can be used. After n/2 iterations, we obtain the following value in coordinate z:

    z_{n/2} = z_0 + w_0 / x_0                                            (46)

Therefore, performing a suitable selection of the coordinates, we can carry out the division. In the next section we analyze the hardware requirements for calculating and compensating the scale factor.

5 Architectures

In this section we obtain different architectures that implement the radix-4 CORDIC algorithm in the vectoring mode. There are different architectures that may implement the algorithm: redundant or non-redundant arithmetic; selection with arithmetic comparisons or selection by table; word-serial or pipelined. We illustrate the implementation of the radix-4 vectoring algorithm with two of these architectures. First, we consider a word-serial architecture using selection by the arithmetic comparison method in conventional arithmetic. Then we present a word-serial architecture using selection by look-up table in redundant arithmetic. Finally, some comments are given on the implementation in a pipelined architecture.

5.1 Architecture for the arithmetic comparison method

In this subsection we develop a word-serial architecture in non-redundant arithmetic with selection by arithmetic comparisons. First, we design the hardware to implement the selection function (35), in which the zero-skipping technique is incorporated. Then, we design the data path, explaining the different operations that are carried out over this architecture.

Implementation of the selection function with the zero-skipping technique

If a microrotation obtains $\sigma_i = 0$, then $x_{i+1} = x_i$ and $w_{i+1} = 4 w_i$ (see equations (6)), and the value of $\sigma_{i+1}$ can be obtained directly from $w_i$ (note that $4 w_i$ is no more than a shift of $w_i$). Thus, if in a microrotation we obtain, in parallel, coefficients $\sigma_i$ and $\sigma_{i+1}$ and the first is zero, it is not necessary to carry out microrotation i because the rotation angle is zero, and we can directly proceed to microrotation i + 1. In this way, we skip iteration i, reducing the total number of microrotations.
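The idea can be sketched in Python as follows; the nearest-digit selection used here is only a stand-in for the comparison-based selection (35), and the function names are illustrative.

    def select(w, x, a=2):
        """Illustrative digit selection: nearest digit to w/x, clamped to
        [-a, a] (stands in for the comparison-based selection (35))."""
        return max(-a, min(a, round(w / x)))

    def select_with_zero_skip(w, x):
        """Obtain sigma_i and, in parallel, the digit of the next iteration
        under the assumption sigma_i = 0 (then x_{i+1} = x_i and
        w_{i+1} = 4 w_i, i.e. w_i shifted left by two bits).  The architecture
        uses the second value only when the first one is zero, skipping the
        null microrotation."""
        return select(w, x), select(4.0 * w, x)

    print(select_with_zero_skip(0.05, 0.8))   # (0, 0): two consecutive skips
    print(select_with_zero_skip(0.2, 0.8))    # (0, 1): skip, then sigma = 1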

This technique is called the zero-skipping technique and it was initially developed for the division algorithm [11]. In [8] we used this technique for the rotation mode. In that paper we conclude that a reduction of about 20% in the total number of iterations can be achieved.

In Figure 6 we present the hardware implementation of the selection functions incorporating the zero-skipping technique. Registers P1 and P2 keep the values of the comparison points $\hat{P}_i(1)$ and $\hat{P}_i(2)$. In order to be able to apply the zero-skipping technique, we must carry out a double comparison with the comparison points. On the one hand, we use two 7-bit comparators (basically the hardware necessary for generating the carry $c_7$ of a 7-bit CLA) for comparing the 8 MSBs of $w_i$ to the comparison points in order to obtain $\sigma_i$; on the other hand, we need a twin architecture (indicated with dotted lines in Figure 6) that carries out the comparison with the 8 bits that follow the two most significant bits of $w_i$ for obtaining $\sigma_{i+1}$ (if $\sigma_i = 0$, $w_{i+1} = 4 w_i$ and the two MSBs of $w_i$ are zero). The Control Logic block found in Figure 6 generates the value of $\sigma_i$ from the analysis of the signal $c_7$ of each comparator and the sign of $w_i$. Also, the skip is activated when $\sigma_i = 0$. For low values of n this technique may not be efficient enough. In this case, we eliminate the hardware indicated with dotted lines in Figure 6.

Design of a data path

In this section we analyze in detail a possible word-serial architecture in non-redundant arithmetic. Figure 7 shows the architecture of paths x, w and z. The realization of the vectoring mode is carried out by programming the data paths as a function of the iteration we are in. The basic processes are the calculation of the comparison points, the realization of the radix-4 CORDIC iterations in circular coordinates (equations (6)), the calculation of the scale factor, and finally its compensation (radix-4 CORDIC in linear coordinates). Some of these processes are carried out in parallel, and we now describe how the system works as a function of the iteration. Module A in Figure 7 corresponds to the hardware that implements the selection function with the zero-skipping technique, shown in detail in Figure 6. Table 2 may help in the description we now present; it reflects the operation mode of the paths x, w and z, the function carried out by each path, and the operation performed in Module A for each of the iterations.

1. Iterations i = 0 to i = n/8

The main processes carried out in these iterations are the calculation of the comparison points, the processing of the corresponding radix-4 CORDIC iterations and the generation of the address in the scale factor table corresponding to the angle to be rotated.

    Iteration(s)   | x path mode          | w, z path mode  | x function                                                | w function                          | z function                                   | Module A
    ---------------|----------------------|-----------------|-----------------------------------------------------------|-------------------------------------|----------------------------------------------|--------------------------------
    0 (setup)      | Eval. P_i(2)         | --              | P_0(2) = (3/2) x_0                                        | --                                  | --                                           | Load P1, P2
    0              | CORDIC circular      | CORDIC circular | x_1 = x_0 + \sigma_0 w_0                                  | w_1 = 4(w_0 - \sigma_0 x_0)         | z_1 = z_0 + \alpha_0(\sigma_0)               | Comput. \sigma_0
    1 (setup)      | Eval. P_i(2)         | --              | P_1(2) = (3/2) x_1                                        | --                                  | --                                           | Load P1, P2
    1              | CORDIC circular      | CORDIC circular | x_2 = x_1 + \sigma_1 4^{-2} w_1                           | w_2 = 4(w_1 - \sigma_1 x_1)         | z_2 = z_1 + \alpha_1(\sigma_1)               | Comput. \sigma_1
    2 (setup)      | Eval. P_i(2)         | --              | P_2(2) = (3/2) x_2                                        | --                                  | --                                           | Load P1, P2
    2 to n/8       | CORDIC circular      | CORDIC circular | x_{i+1} = x_i + \sigma_i 4^{-2i} w_i                      | w_{i+1} = 4(w_i - \sigma_i x_i)     | z_{i+1} = z_i + \alpha_i(\sigma_i)           | Comput. \sigma_i
    n/8+1 to n/4   | CORDIC circular      | CORDIC circular | x_{i+1} = x_i + \sigma_i 4^{-2i} w_i                      | w_{i+1} = 4(w_i - \sigma_i x_i)     | z_{i+1} = z_i + \alpha_i(\sigma_i)           | Comput. and store \sigma_i
    n/4+1 to n/2   | Compute scale factor | CORDIC circular | k_{i+1} = k_i (1 + \sigma_j^2 2^{-4j-1}), j = i - n/8     | w_{i+1} = 4(w_i - \sigma_i x_{n/4}) | z_{i+1} = z_i + \alpha_i(\sigma_i)           | Comput. \sigma_i
    n/2 to n**     | CORDIC linear        | CORDIC linear   | x_{r+1} = x_r                                             | w_{r+1} = 4(w_r - \sigma_r K)       | z_{r+1} = z_r + \sigma_r 4^{-r}, r = i - n/2 | Comput. \sigma_r

    ** Only for scale factor compensation purposes.

    Table 2: Operation mode and function of the paths x, w, z and Module A.

The evaluation of the comparison points $\hat{P}_0(1)$, $\hat{P}_0(2)$, $\hat{P}_1(1)$, $\hat{P}_1(2)$ and $\hat{P}_2(2)$ is performed by means of special-purpose iterations using data path x. In the first, third and fifth iterations we program this path for the evaluation of the comparison point $P_i(2)$ (Eval. $P_i(2)$ mode in Table 2). In the second, fourth and from the sixth iteration on, the hardware is programmed for obtaining the radix-4 CORDIC equations in circular coordinates (CORDIC circular mode in Table 2), that is, equations (6) are evaluated. MUX-1 allows obtaining the value $(3/2) x_i$ (leftmost input) and, together with MUX-2, 3 and 4, permits selecting the appropriate input for supporting $\sigma_i = \pm 1, \pm 2$ or $\sigma_{i+1} = \pm 1, \pm 2$ if a zero skip takes place. As we can see in Table 2, it is not necessary to calculate $P_i$ after i = 2, since the comparison point obtained in iteration i = 2 is valid as comparison point for the remaining iterations. The $\sigma_i$ values obtained in these first iterations are also used for addressing the table that stores the different scale factors (see Section 4).

2. Iterations i = n/8+1 to i = n/4

During these iterations, paths x, w and z evaluate equations (6) (radix-4 CORDIC in circular coordinates). Module A calculates $\sigma_i$ ($\sigma_{i+1}$ if a zero skipping occurs) and puts the values $|\sigma_i|$ in a shift register, necessary for the calculation of the scale factor in later iterations.

3. Iterations i = n/4+1 to i = n/2

The main processing carried out during this period is the calculation of the scale factor (over data path x) and the ending of the angle computation (over data path z). Taking into account that the maximum value obtained experimentally for $w_i$ is 5.3, we have that $\sigma_i 4^{-2i} w_i < 2^{-n+1}$ if $i \ge n/4 + 1$. Therefore we obtain, from (6), that $x_{i+1} = x_i$ to the precision considered, and it is not necessary to use path x from this iteration on.

Now, path x is free, and it is used to evaluate the scale factor (Compute scale factor mode in Table 2). To preserve the value $x_{n/4}$ obtained in the previous iterations, we use the auxiliary register RX' in Figure 7. The scale factor produced by the first n/8 + 1 rotations is obtained from the scale factor table, and it is initially loaded onto register RX (see Figure 7). From then on, the approximation $k_i \approx 1 + (1/2) \sigma_i^2 4^{-2i}$ [8] is carried out over path x (see Section 4). At the end of these iterations we have obtained the value of the scale factor K, which is in register RX, and the value of the final angle, which is in register RZ. At that moment we have obtained one of the two results generated by the CORDIC algorithm: the value of the angle (argument of the initial vector). In applications that do not require the evaluation of the magnitude of the vector (for example, angle calculation and rotation [5]), the scale factor table and the process for the calculation of the scale factor carried out in data path x are not necessary.

4. Iterations i = n/2+1 to i = n

These iterations have the aim of compensating the scale factor in those applications that require it. In order to do this, we program paths x, w and z for performing the radix-4 CORDIC algorithm in the vectoring mode in linear coordinates (CORDIC linear mode in Table 2; see also the radix-4 CORDIC equations in linear coordinates (45)). Registers RW and RX were loaded with $x_{n/4}$ and K respectively in iteration $i = n/2$. The value $4^{-i}$ of the equation in z of (45) is directly obtained from the angle table as follows: from iteration $i \ge n/6$ we have that $\tan^{-1}(\sigma_i 4^{-i}) \approx \sigma_i 4^{-i}$, and thus the values $4^{-i}$ with $i \ge n/6$ are already present in the table; we only have to add to the primitive angle table the values $4^{-i}$ with $i < n/6$ (for example, for n = 32 we would have to add six values: $4^0, 4^{-1}, \ldots, 4^{-5}$). Consequently, after iteration i = n we obtain in RZ the value of the magnitude of the initial vector (see equation (46)).

5.2 Architecture for the look-up table method

Figure 9 shows the architecture that implements the radix-4 algorithm using this method with redundant carry-save arithmetic. The considerations for calculating and compensating the scale factor are similar to those of the previous subsection, and they have been omitted to make the understanding of the architecture easier. Since for the selection with table we have considered that $\sigma_i$ may take the value 3, two 4-to-2 adders are needed to form the products $3 x_i$ and $3 w_i$. These values will be needed if $|\sigma_i| = 3$. Two word multiplexers permit computing $|\sigma_i| x_i$ and $|\sigma_i| w_i$. Finally, two 4-to-2 adder/subtracters permit obtaining $x_{i+1}$ and $w_{i+1}$.
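A behavioural sketch of one iteration of this datapath is given below; ordinary floating-point values stand in for the carry-save words, and the function name is an illustrative choice.

    def lookup_datapath_step(x, w, sigma, i):
        """One iteration of the Figure 9 style datapath (behavioural sketch).

        The hardware precomputes 3*x and 3*w with two extra 4-to-2 adders;
        a word multiplexer then picks |sigma|*x and |sigma|*w (|sigma| in
        {0,1,2,3}) and two 4-to-2 adder/subtractors produce x_{i+1} and
        w_{i+1} according to (6)."""
        mag = abs(sigma)
        mult_x = (0.0, x, 2.0 * x, 3.0 * x)[mag]   # 2*x is a wired shift; 3*x needs the extra adder
        mult_w = (0.0, w, 2.0 * w, 3.0 * w)[mag]
        sign = 1.0 if sigma >= 0 else -1.0
        x_next = x + sign * 4.0**(-2 * i) * mult_w   # add/subtract the selected multiple of w
        w_next = 4.0 * (w - sign * mult_x)           # add/subtract the selected multiple of x
        return x_next, w_next

    print(lookup_datapath_step(0.8, 2.1, 3, 1))   # one radix-4 microrotation with sigma = 3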

Figure 8 shows the block diagram of the digit selection network, which is in charge of the selection of $\sigma_i$. Six bits of $x_i$ and eight bits of $w_i$ are assimilated. From the two most significant bits of $\hat{x}_i$ we obtain the suitable shift to perform the scaling. By means of multiplexers we perform the scaling of both $\hat{x}_i$ and $\hat{w}_i$. At the output of the multiplexers we obtain the inputs to the look-up table. The table has 9 input bits and three output bits. A look-up table is needed to store the microrotation angles, a multiplexer to select the suitable value of the microrotation angle depending on the value of $\sigma_i$, and finally a 3-to-2 adder/subtractor performs the iteration to obtain $z_{i+1}$. We use a 3-to-2 adder/subtractor since the microrotation angle has a non-redundant representation.

5.3 Pipelined architectures

For a pipeline, the crucial points are the hardware cost, the latency and the throughput (related to the cycle time). For radix-4 CORDIC vectoring, the selection of $\sigma_i$ must be implemented in each of the iterations. As we have seen in previous sections, the selection (using arithmetic comparison or table) is more complex (in time and in hardware) than in the case of the radix-2 algorithm. The problems with the pipelined implementation are two-fold:

1. Since in each microrotation the shift to be performed is known in advance, hardwired shifts are used. Thus, in this architecture there is no overlap between the digit selection and the shift operation, and the complex digit selection is fully in the critical path.

2. The replication of the hardware associated with each microrotation. This implies replicating the hardware for digit selection. Furthermore, in the case of the selection with table, two additional adders are needed in each microrotation to perform the multiplication by 3.

Due to these factors, it seems that a fully pipelined radix-4 CORDIC vectoring implementation needs additional research to be efficient.

6 Evaluation and comparison

In this section we compare the word-serial architectures proposed in the previous sections with each other and with the architectures proposed in [5] and [9] in the case of using redundant arithmetic, and with a conventional radix-2 architecture (with n iterations) when non-redundant arithmetic is used.

To carry out the evaluation we express the delay and area of each hardware element in terms of the delay and area of one full adder ($t_{fa}$ and $a_{fa}$). We have used the reference technology and library (ES2-ECPD10 Standard Cells Library, 1 um double metal CMOS [12]) for the hardware elements which do not

have recognized delays in terms of the delay of one full adder. The delays and areas assumed for the different hardware elements are shown in Table 3. For the 4-to-2 carry-save adder/subtracter we have assumed the implementation given in [13], which possesses the same delay as a 4-to-2 carry-save adder. We have taken into account the delay introduced by the buffers, which are necessary for control signals that are heavily loaded.

    Element                                  | Delay (t_fa)        | Area (a_fa)
    -----------------------------------------|---------------------|------------
    Buffer (t_buf)                           | --                  | --
    2-to-1 mux (t_2-1mux)                    | --                  | n
    3-to-1 mux (t_3-1mux)                    | --                  | n
    4-to-1 mux (t_4-1mux)                    | 0.5                 | n
    6-to-1 mux (t_6-1mux)                    | --                  | n
    Register (t_reg)                         | 1.5                 | n
    3-to-2 csa (t_3-2csa)                    | 1                   | n
    4-to-2 csa (t_4-2csa)                    | --                  | n
    7-CLA* (t_7-CLA)                         | --                  | --
    Ripple adder (t_ripple)                  | 0.83 n              | n
    Constant-width carry skip** (t_cwcs)     | 13 (15)             | 45 (87)
    CLA (t_cla-n)                            | ceil(log2(n))       | 2 n
    Barrel shifter, r levels (t_bs)          | 0.5 ceil(log2(r))   | n log2(r)

    * Only the logic to obtain c_7.
    ** Values for n = 32 and n = 64 (the latter in parentheses).

    Table 3: Delays and areas assumed for the hardware elements.

We would like to emphasize that a true comparison between different implementations is possible only if an actual implementation is considered and logic-level simulations are carried out. Therefore, we present a rough, first-order approximation comparison based on Table 3. Nevertheless, we claim that it can express the general trend between different designs.

In [5] a radix-2 CORDIC architecture is proposed with on-line redundant arithmetic, but a word-serial architecture is also considered. In that work the $\sigma_i$ values can be in $\{-1, 0, +1\}$, resulting in a non-constant scale factor. This architecture is especially interesting when we are only interested in calculating the angle. In [9] a radix-2 CORDIC architecture in redundant arithmetic is proposed with a constant scale factor. In this case, the most significant bits of w are used to estimate the $\sigma_i$ values (truncated to t fractional bits). Due to the estimation error it is necessary to repeat some microrotations to assure convergence. To be exact, it is necessary to repeat one microrotation every t - 1 iterations during the first n/2 microrotations. For i > n/2, the value $\sigma_i = 0$ is allowed, and only one repetition is necessary. The conventional radix-2 architecture with non-redundant arithmetic can be found in [3].

Basically, the word-serial version of the three architectures has the same


More information

Lecture 8: Sequential Multipliers

Lecture 8: Sequential Multipliers Lecture 8: Sequential Multipliers ECE 645 Computer Arithmetic 3/25/08 ECE 645 Computer Arithmetic Lecture Roadmap Sequential Multipliers Unsigned Signed Radix-2 Booth Recoding High-Radix Multiplication

More information

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) Contents 1 Vector Spaces 1 1.1 The Formal Denition of a Vector Space.................................. 1 1.2 Subspaces...................................................

More information

A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte

A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER Jesus Garcia and Michael J. Schulte Lehigh University Department of Computer Science and Engineering Bethlehem, PA 15 ABSTRACT Galois field arithmetic

More information

Part II Addition / Subtraction

Part II Addition / Subtraction Part II Addition / Subtraction Parts Chapters I. Number Representation 1. 2. 3. 4. Numbers and Arithmetic Representing Signed Numbers Redundant Number Systems Residue Number Systems Elementary Operations

More information

Carry Look Ahead Adders

Carry Look Ahead Adders Carry Look Ahead Adders Lesson Objectives: The objectives of this lesson are to learn about: 1. Carry Look Ahead Adder circuit. 2. Binary Parallel Adder/Subtractor circuit. 3. BCD adder circuit. 4. Binary

More information

CORDIC, Divider, Square Root

CORDIC, Divider, Square Root 4// EE6B: VLSI Signal Processing CORDIC, Divider, Square Root Prof. Dejan Marković ee6b@gmail.com Iterative algorithms CORDIC Division Square root Lecture Overview Topics covered include Algorithms and

More information

The goal differs from prime factorization. Prime factorization would initialize all divisors to be prime numbers instead of integers*

The goal differs from prime factorization. Prime factorization would initialize all divisors to be prime numbers instead of integers* Quantum Algorithm Processor For Finding Exact Divisors Professor J R Burger Summary Wiring diagrams are given for a quantum algorithm processor in CMOS to compute, in parallel, all divisors of an n-bit

More information

Binary addition example worked out

Binary addition example worked out Binary addition example worked out Some terms are given here Exercise: what are these numbers equivalent to in decimal? The initial carry in is implicitly 0 1 1 1 0 (Carries) 1 0 1 1 (Augend) + 1 1 1 0

More information

Binary Multipliers. Reading: Study Chapter 3. The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding

Binary Multipliers. Reading: Study Chapter 3. The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding Binary Multipliers The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 2 2 4 6 8 2 4 6 8 3 3 6 9 2 5 8 2 24 27 4 4 8 2 6

More information

Chapter 1: Solutions to Exercises

Chapter 1: Solutions to Exercises 1 DIGITAL ARITHMETIC Miloš D. Ercegovac and Tomás Lang Morgan Kaufmann Publishers, an imprint of Elsevier, c 2004 Exercise 1.1 (a) 1. 9 bits since 2 8 297 2 9 2. 3 radix-8 digits since 8 2 297 8 3 3. 3

More information

VLSI Arithmetic. Lecture 9: Carry-Save and Multi-Operand Addition. Prof. Vojin G. Oklobdzija University of California

VLSI Arithmetic. Lecture 9: Carry-Save and Multi-Operand Addition. Prof. Vojin G. Oklobdzija University of California VLSI Arithmetic Lecture 9: Carry-Save and Multi-Operand Addition Prof. Vojin G. Oklobdzija University of California http://www.ece.ucdavis.edu/acsel Carry-Save Addition* *from Parhami 2 June 18, 2003 Carry-Save

More information

A 32-bit Decimal Floating-Point Logarithmic Converter

A 32-bit Decimal Floating-Point Logarithmic Converter A 3-bit Decimal Floating-Point Logarithmic Converter Dongdong Chen 1, Yu Zhang 1, Younhee Choi 1, Moon Ho Lee, Seok-Bum Ko 1, Department of Electrical and Computer Engineering, University of Saskatchewan

More information

ECE380 Digital Logic. Positional representation

ECE380 Digital Logic. Positional representation ECE380 Digital Logic Number Representation and Arithmetic Circuits: Number Representation and Unsigned Addition Dr. D. J. Jackson Lecture 16-1 Positional representation First consider integers Begin with

More information

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs Article Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs E. George Walters III Department of Electrical and Computer Engineering, Penn State Erie,

More information

DIGITAL TECHNICS. Dr. Bálint Pődör. Óbuda University, Microelectronics and Technology Institute

DIGITAL TECHNICS. Dr. Bálint Pődör. Óbuda University, Microelectronics and Technology Institute DIGITAL TECHNICS Dr. Bálint Pődör Óbuda University, Microelectronics and Technology Institute 4. LECTURE: COMBINATIONAL LOGIC DESIGN: ARITHMETICS (THROUGH EXAMPLES) 2016/2017 COMBINATIONAL LOGIC DESIGN:

More information

1 Matrices and Systems of Linear Equations

1 Matrices and Systems of Linear Equations Linear Algebra (part ) : Matrices and Systems of Linear Equations (by Evan Dummit, 207, v 260) Contents Matrices and Systems of Linear Equations Systems of Linear Equations Elimination, Matrix Formulation

More information

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1> Chapter 5 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 5 Chapter 5 :: Topics Introduction Arithmetic Circuits umber Systems Sequential Building

More information

Research Article Implementation of Special Function Unit for Vertex Shader Processor Using Hybrid Number System

Research Article Implementation of Special Function Unit for Vertex Shader Processor Using Hybrid Number System Computer Networks and Communications, Article ID 890354, 7 pages http://dx.doi.org/10.1155/2014/890354 Research Article Implementation of Special Function Unit for Vertex Shader Processor Using Hybrid

More information

Linear Algebra (part 1) : Matrices and Systems of Linear Equations (by Evan Dummit, 2016, v. 2.02)

Linear Algebra (part 1) : Matrices and Systems of Linear Equations (by Evan Dummit, 2016, v. 2.02) Linear Algebra (part ) : Matrices and Systems of Linear Equations (by Evan Dummit, 206, v 202) Contents 2 Matrices and Systems of Linear Equations 2 Systems of Linear Equations 2 Elimination, Matrix Formulation

More information

CSE140: Components and Design Techniques for Digital Systems. Decoders, adders, comparators, multipliers and other ALU elements. Tajana Simunic Rosing

CSE140: Components and Design Techniques for Digital Systems. Decoders, adders, comparators, multipliers and other ALU elements. Tajana Simunic Rosing CSE4: Components and Design Techniques for Digital Systems Decoders, adders, comparators, multipliers and other ALU elements Tajana Simunic Rosing Mux, Demux Encoder, Decoder 2 Transmission Gate: Mux/Tristate

More information

Design of Sequential Circuits

Design of Sequential Circuits Design of Sequential Circuits Seven Steps: Construct a state diagram (showing contents of flip flop and inputs with next state) Assign letter variables to each flip flop and each input and output variable

More information

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research)

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

More information

What s the Deal? MULTIPLICATION. Time to multiply

What s the Deal? MULTIPLICATION. Time to multiply What s the Deal? MULTIPLICATION Time to multiply Multiplying two numbers requires a multiply Luckily, in binary that s just an AND gate! 0*0=0, 0*1=0, 1*0=0, 1*1=1 Generate a bunch of partial products

More information

Tree and Array Multipliers Ivor Page 1

Tree and Array Multipliers Ivor Page 1 Tree and Array Multipliers 1 Tree and Array Multipliers Ivor Page 1 11.1 Tree Multipliers In Figure 1 seven input operands are combined by a tree of CSAs. The final level of the tree is a carry-completion

More information

3. Combinational Circuit Design

3. Combinational Circuit Design CSEE 3827: Fundamentals of Computer Systems, Spring 2 3. Combinational Circuit Design Prof. Martha Kim (martha@cs.columbia.edu) Web: http://www.cs.columbia.edu/~martha/courses/3827/sp/ Outline (H&H 2.8,

More information

Lecture 18: Datapath Functional Units

Lecture 18: Datapath Functional Units Lecture 8: Datapath Functional Unit Outline Comparator Shifter Multi-input Adder Multiplier 8: Datapath Functional Unit CMOS VLSI Deign 4th Ed. 2 Comparator 0 detector: A = 00 000 detector: A = Equality

More information

Residue Number Systems Ivor Page 1

Residue Number Systems Ivor Page 1 Residue Number Systems 1 Residue Number Systems Ivor Page 1 7.1 Arithmetic in a modulus system The great speed of arithmetic in Residue Number Systems (RNS) comes from a simple theorem from number theory:

More information

Contents. 2.1 Vectors in R n. Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v. 2.50) 2 Vector Spaces

Contents. 2.1 Vectors in R n. Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v. 2.50) 2 Vector Spaces Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v 250) Contents 2 Vector Spaces 1 21 Vectors in R n 1 22 The Formal Denition of a Vector Space 4 23 Subspaces 6 24 Linear Combinations and

More information

Introduction and mathematical preliminaries

Introduction and mathematical preliminaries Chapter Introduction and mathematical preliminaries Contents. Motivation..................................2 Finite-digit arithmetic.......................... 2.3 Errors in numerical calculations.....................

More information

DIGIT-SERIAL ARITHMETIC

DIGIT-SERIAL ARITHMETIC DIGIT-SERIAL ARITHMETIC 1 Modes of operation:lsdf and MSDF Algorithm and implementation models LSDF arithmetic MSDF: Online arithmetic TIMING PARAMETERS 2 radix-r number system: conventional and redundant

More information

Part II Addition / Subtraction

Part II Addition / Subtraction Part II Addition / Subtraction Parts Chapters I. Number Representation 1. 2. 3. 4. Numbers and Arithmetic Representing Signed Numbers Redundant Number Systems Residue Number Systems Elementary Operations

More information

On-line Algorithms for Computing Exponentials and Logarithms. Asger Munk Nielsen. Dept. of Mathematics and Computer Science

On-line Algorithms for Computing Exponentials and Logarithms. Asger Munk Nielsen. Dept. of Mathematics and Computer Science On-line Algorithms for Computing Exponentials and Logarithms Asger Munk Nielsen Dept. of Mathematics and Computer Science Odense University, Denmark asger@imada.ou.dk Jean-Michel Muller Laboratoire LIP

More information

A Suggestion for a Fast Residue Multiplier for a Family of Moduli of the Form (2 n (2 p ± 1))

A Suggestion for a Fast Residue Multiplier for a Family of Moduli of the Form (2 n (2 p ± 1)) The Computer Journal, 47(1), The British Computer Society; all rights reserved A Suggestion for a Fast Residue Multiplier for a Family of Moduli of the Form ( n ( p ± 1)) Ahmad A. Hiasat Electronics Engineering

More information

L8/9: Arithmetic Structures

L8/9: Arithmetic Structures L8/9: Arithmetic Structures Acknowledgements: Materials in this lecture are courtesy of the following sources and are used with permission. Rex Min Kevin Atkinson Prof. Randy Katz (Unified Microelectronics

More information

DIVIDER IMPLEMENTATION

DIVIDER IMPLEMENTATION c n = cn-= DAIL LLAOCCA CLab@OU DIVID IPLTATIO The division of two unsigned integer numbers A (where A is the dividend and the divisor), results in a quotient and a residue. These quantities are related

More information

Adders, subtractors comparators, multipliers and other ALU elements

Adders, subtractors comparators, multipliers and other ALU elements CSE4: Components and Design Techniques for Digital Systems Adders, subtractors comparators, multipliers and other ALU elements Adders 2 Circuit Delay Transistors have instrinsic resistance and capacitance

More information

Area-Time Optimal Adder with Relative Placement Generator

Area-Time Optimal Adder with Relative Placement Generator Area-Time Optimal Adder with Relative Placement Generator Abstract: This paper presents the design of a generator, for the production of area-time-optimal adders. A unique feature of this generator is

More information

Logic and Computer Design Fundamentals. Chapter 5 Arithmetic Functions and Circuits

Logic and Computer Design Fundamentals. Chapter 5 Arithmetic Functions and Circuits Logic and Computer Design Fundamentals Chapter 5 Arithmetic Functions and Circuits Arithmetic functions Operate on binary vectors Use the same subfunction in each bit position Can design functional block

More information

Remainders. We learned how to multiply and divide in elementary

Remainders. We learned how to multiply and divide in elementary Remainders We learned how to multiply and divide in elementary school. As adults we perform division mostly by pressing the key on a calculator. This key supplies the quotient. In numerical analysis and

More information

Gravitational potential energy *

Gravitational potential energy * OpenStax-CNX module: m15090 1 Gravitational potential energy * Sunil Kumar Singh This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 2.0 The concept of potential

More information

Stochastic dominance with imprecise information

Stochastic dominance with imprecise information Stochastic dominance with imprecise information Ignacio Montes, Enrique Miranda, Susana Montes University of Oviedo, Dep. of Statistics and Operations Research. Abstract Stochastic dominance, which is

More information

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic. Digital Integrated Circuits A Design Perspective Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Arithmetic Circuits January, 2003 1 A Generic Digital Processor MEMORY INPUT-OUTPUT CONTROL DATAPATH

More information

SRT Division and the Pentium FDIV Bug (draft lecture notes, CSCI P415)

SRT Division and the Pentium FDIV Bug (draft lecture notes, CSCI P415) SRT Division and the Pentium FDIV Bug (draft lecture notes, CSCI P4) Steven D. Johnson September 0, 000 Abstract This talk explains the widely publicized design error in a 994 issue of the Intel Corp.

More information

Tunable Floating-Point for Energy Efficient Accelerators

Tunable Floating-Point for Energy Efficient Accelerators Tunable Floating-Point for Energy Efficient Accelerators Alberto Nannarelli DTU Compute, Technical University of Denmark 25 th IEEE Symposium on Computer Arithmetic A. Nannarelli (DTU Compute) Tunable

More information

Divisor matrices and magic sequences

Divisor matrices and magic sequences Discrete Mathematics 250 (2002) 125 135 www.elsevier.com/locate/disc Divisor matrices and magic sequences R.H. Jeurissen Mathematical Institute, University of Nijmegen, Toernooiveld, 6525 ED Nijmegen,

More information

REDUNDANT TRINOMIALS FOR FINITE FIELDS OF CHARACTERISTIC 2

REDUNDANT TRINOMIALS FOR FINITE FIELDS OF CHARACTERISTIC 2 REDUNDANT TRINOMIALS FOR FINITE FIELDS OF CHARACTERISTIC 2 CHRISTOPHE DOCHE Abstract. In this paper we introduce so-called redundant trinomials to represent elements of nite elds of characteristic 2. The

More information

VLSI Design. [Adapted from Rabaey s Digital Integrated Circuits, 2002, J. Rabaey et al.] ECE 4121 VLSI DEsign.1

VLSI Design. [Adapted from Rabaey s Digital Integrated Circuits, 2002, J. Rabaey et al.] ECE 4121 VLSI DEsign.1 VLSI Design Adder Design [Adapted from Rabaey s Digital Integrated Circuits, 2002, J. Rabaey et al.] ECE 4121 VLSI DEsign.1 Major Components of a Computer Processor Devices Control Memory Input Datapath

More information

`First Come, First Served' can be unstable! Thomas I. Seidman. Department of Mathematics and Statistics. University of Maryland Baltimore County

`First Come, First Served' can be unstable! Thomas I. Seidman. Department of Mathematics and Statistics. University of Maryland Baltimore County revision2: 9/4/'93 `First Come, First Served' can be unstable! Thomas I. Seidman Department of Mathematics and Statistics University of Maryland Baltimore County Baltimore, MD 21228, USA e-mail: hseidman@math.umbc.edui

More information

14:332:231 DIGITAL LOGIC DESIGN. Why Binary Number System?

14:332:231 DIGITAL LOGIC DESIGN. Why Binary Number System? :33:3 DIGITAL LOGIC DESIGN Ivan Marsic, Rutgers University Electrical & Computer Engineering Fall 3 Lecture #: Binary Number System Complement Number Representation X Y Why Binary Number System? Because

More information

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman.

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman. SP esign Lecture 7 Unfolding cont. & Folding r. Fredrik Edman fredrik.edman@eit.lth.se Unfolding Unfolding creates a program with more than one iteration, J=unfolding factor Unfolding is a structured way

More information

Number representation

Number representation Number representation A number can be represented in binary in many ways. The most common number types to be represented are: Integers, positive integers one-complement, two-complement, sign-magnitude

More information

ISSN (PRINT): , (ONLINE): , VOLUME-4, ISSUE-10,

ISSN (PRINT): , (ONLINE): , VOLUME-4, ISSUE-10, A NOVEL DOMINO LOGIC DESIGN FOR EMBEDDED APPLICATION Dr.K.Sujatha Associate Professor, Department of Computer science and Engineering, Sri Krishna College of Engineering and Technology, Coimbatore, Tamilnadu,

More information

Lecture 12: Datapath Functional Units

Lecture 12: Datapath Functional Units Lecture 2: Datapath Functional Unit Slide courtey of Deming Chen Slide baed on the initial et from David Harri CMOS VLSI Deign Outline Comparator Shifter Multi-input Adder Multiplier Reading:.3-4;.8-9

More information

Overview. Arithmetic circuits. Binary half adder. Binary full adder. Last lecture PLDs ROMs Tristates Design examples

Overview. Arithmetic circuits. Binary half adder. Binary full adder. Last lecture PLDs ROMs Tristates Design examples Overview rithmetic circuits Last lecture PLDs ROMs Tristates Design examples Today dders Ripple-carry Carry-lookahead Carry-select The conclusion of combinational logic!!! General-purpose building blocks

More information

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 19: Adder Design [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN 411 L19

More information

Roots of Unity, Cyclotomic Polynomials and Applications

Roots of Unity, Cyclotomic Polynomials and Applications Swiss Mathematical Olympiad smo osm Roots of Unity, Cyclotomic Polynomials and Applications The task to be done here is to give an introduction to the topics in the title. This paper is neither complete

More information

Combinational Logic Design Arithmetic Functions and Circuits

Combinational Logic Design Arithmetic Functions and Circuits Combinational Logic Design Arithmetic Functions and Circuits Overview Binary Addition Half Adder Full Adder Ripple Carry Adder Carry Look-ahead Adder Binary Subtraction Binary Subtractor Binary Adder-Subtractor

More information

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic. Digital Integrated Circuits A Design Perspective Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Arithmetic Circuits January, 2003 1 A Generic Digital Processor MEM ORY INPUT-OUTPUT CONTROL DATAPATH

More information

A VLSI Algorithm for Modular Multiplication/Division

A VLSI Algorithm for Modular Multiplication/Division A VLSI Algorithm for Modular Multiplication/Division Marcelo E. Kaihara and Naofumi Takagi Department of Information Engineering Nagoya University Nagoya, 464-8603, Japan mkaihara@takagi.nuie.nagoya-u.ac.jp

More information

14:332:231 DIGITAL LOGIC DESIGN. 2 s-complement Representation

14:332:231 DIGITAL LOGIC DESIGN. 2 s-complement Representation 4:332:23 DIGITAL LOGIC DESIGN Ivan Marsic, Rutgers University Electrical & Computer Engineering Fall 203 Lecture #3: Addition, Subtraction, Multiplication, and Division 2 s-complement Representation RECALL

More information

EECS150. Arithmetic Circuits

EECS150. Arithmetic Circuits EE5 ection 8 Arithmetic ircuits Fall 2 Arithmetic ircuits Excellent Examples of ombinational Logic Design Time vs. pace Trade-offs Doing things fast may require more logic and thus more space Example:

More information

UNSIGNED BINARY NUMBERS DIGITAL ELECTRONICS SYSTEM DESIGN WHAT ABOUT NEGATIVE NUMBERS? BINARY ADDITION 11/9/2018

UNSIGNED BINARY NUMBERS DIGITAL ELECTRONICS SYSTEM DESIGN WHAT ABOUT NEGATIVE NUMBERS? BINARY ADDITION 11/9/2018 DIGITAL ELECTRONICS SYSTEM DESIGN LL 2018 PROFS. IRIS BAHAR & ROD BERESFORD NOVEMBER 9, 2018 LECTURE 19: BINARY ADDITION, UNSIGNED BINARY NUMBERS For the binary number b n-1 b n-2 b 1 b 0. b -1 b -2 b

More information

EECS150 - Digital Design Lecture 21 - Design Blocks

EECS150 - Digital Design Lecture 21 - Design Blocks EECS150 - Digital Design Lecture 21 - Design Blocks April 3, 2012 John Wawrzynek Spring 2012 EECS150 - Lec21-db3 Page 1 Fixed Shifters / Rotators fixed shifters hardwire the shift amount into the circuit.

More information

Squared Error. > Steady State Error Iteration

Squared Error. > Steady State Error Iteration The Logarithmic Number System for Strength Reduction in Adaptive Filtering John R. Sacha and Mary Jane Irwin Computer Science and Engineering Department The Pennsylvania State University University Park,

More information

Multiplication Ivor Page 1

Multiplication Ivor Page 1 Multiplication 1 Multiplication Ivor Page 1 10.1 Shift/Add Multiplication Algorithms We will adopt the notation, a Multiplicand a k 1 a k 2 a 1 a 0 b Multiplier x k 1 x k 2 x 1 x 0 p Product p 2k 1 p 2k

More information

Upper and Lower Bounds on the Number of Faults. a System Can Withstand Without Repairs. Cambridge, MA 02139

Upper and Lower Bounds on the Number of Faults. a System Can Withstand Without Repairs. Cambridge, MA 02139 Upper and Lower Bounds on the Number of Faults a System Can Withstand Without Repairs Michel Goemans y Nancy Lynch z Isaac Saias x Laboratory for Computer Science Massachusetts Institute of Technology

More information

Neri Merhav. and. Vasudev Bhaskaran. Abstract. A method is developed and proposed to eciently implement spatial domain ltering

Neri Merhav. and. Vasudev Bhaskaran. Abstract. A method is developed and proposed to eciently implement spatial domain ltering A Fast Algorithm for DCT Domain Filtering Neri Merhav HP Israel Science Center and Vasudev Bhaskaran Computer Systems Laboratory y Keywords: DCT domain ltering, data compression. Abstract A method is developed

More information

Combinational Logic. By : Ali Mustafa

Combinational Logic. By : Ali Mustafa Combinational Logic By : Ali Mustafa Contents Adder Subtractor Multiplier Comparator Decoder Encoder Multiplexer How to Analyze any combinational circuit like this? Analysis Procedure To obtain the output

More information

1 RN(1/y) Ulp Accurate, Monotonic

1 RN(1/y) Ulp Accurate, Monotonic URL: http://www.elsevier.nl/locate/entcs/volume24.html 29 pages Analysis of Reciprocal and Square Root Reciprocal Instructions in the AMD K6-2 Implementation of 3DNow! Cristina Iordache and David W. Matula

More information

A High-Speed Realization of Chinese Remainder Theorem

A High-Speed Realization of Chinese Remainder Theorem Proceedings of the 2007 WSEAS Int. Conference on Circuits, Systems, Signal and Telecommunications, Gold Coast, Australia, January 17-19, 2007 97 A High-Speed Realization of Chinese Remainder Theorem Shuangching

More information

Low Power and Low Complexity Shift-and-Add Based Computations

Low Power and Low Complexity Shift-and-Add Based Computations Linköping Studies in Science and Technology Dissertations, No. 2 Low Power and Low Complexity Shift-and-Add Based Computations Kenny Johansson Department of Electrical Engineering Linköping University,

More information