Unit 1 - Computer Arithmetic

Size: px

Start display at page:

Download "Unit 1 - Computer Arithmetic"

Zoe Tate
5 years ago
Views:

1 FIXD-POINT (FX) ARITHMTIC Unit 1 - Comuter Arithmetic INTGR NUMBRS n bit number: b n 1 b n 2 b 0 Decimal Value Range of values UNSIGND n 1 SIGND D = b i 2 i D = 2 n 1 b n 1 + b i 2 i n 2 i=0 i=0 [0, 2 n 1] [ 2 n 1, 2 n 1 1] FIXD POINT RPRSNTATION Tyical reresentation [n ]: n bit number with fractional bits: b n 1 b n 2 b 0. b 1 b 2 b n Decimal Value Range of values Dynamic Range Resolution (1 LSB) Dynamic Range: UNSIGND n 1 D = b i 2 i i= SIGND n 2 D = 2 n 1 b n 1 + b i 2 i [ 0 2, 2n 1 2 ] = [0, 2 n 2 ] [ 2n 1 2, 2n ] = [ 2 n 1, 2 n 1 2 ] 2 n 2 2 = 2 n 2 n = 2 n 1 (db) = 20 log 10 (2 n 1) (db) = 20 log 10 (2 n 1 ) 2 2 Dynamic Range = largest abs. value smallest nonzero abs. value i= Unsigned numbers: Range of Values Dynamic Range(dB) = 20 log 10 (Dynamic Range) Signed numbers: Range of Values xamles: UNSIGND SIGND FX Format Range Dynamic Range (db) Resolution [8 7] [0, ] [12 8] [0, ] [16 10] [0, ] [8 7] [-1, ] [12 8] [-8, ] [16 10] [-64, ] Instructor: Daniel Llamocca

2 c 9 =0 c 8 =0 c 7 =0 c 6 =0 c 5 =1 c 4 =1 c 3 =1 c 2 =1 c 1 =0 c 10 =1 c 9 =1 c 8 =1 c 7 =0 c 6 =1 c 5 =1 c 4 =1 c 3 =0 c 2 =0 c 1 =0 c 9 =1 c 8 =1 c 7 =0 c 6 =0 c 5 =0 c 4 =1 c 3 =0 c 2 =0 c 1 =0 c 8 =0 c 7 =0 c 6 =0 c 5 =0 c 4 =0 c 3 =0 c 2 =0 c 1 =0 c 8 =1 c 7 =1 c 6 =1 c 5 =1 c 4 =0 c 3 =1 c 2 =0 c 1 =0 b 7 =0 b 6 =0 b 5 =0 b 4 =1 b 3 =1 b 2 =1 b 1 =1 b 0 =0 c 6 =1 c 5 =1 c 4 =1 c 3 =0 c 2 =0 c 1 =1 c 10 =0 c 9 =0 c 8 =0 c 7 =1 c 6 =0 c 5 =0 c 4 =0 c 3 =0 c 2 =0 c 1 =0 FIXD-POINT ADDITION/SUBTRACTION Addition of two numbers reresented in the format [n ]: + A 2 ± B 2 = (A ± B) 2 We erform integer addition/subtraction of A and B. We just need to interret the result differently by lacing the fractional oint where it belongs. Notice that the hardware is the same as that of integer addition/subtraction. +1 When adding/subtracting numbers with different formats [n ] and [m ], we first need to align the fractional oint so that we use a format for both numbers: it could be [n ], [m ], [n + ], [m + ]. This is done by zero-adding and sign-extending where necessary. In the figure below, the format selected for both numbers is [m ], while the result is in the format [m + 1 ]. + m- m- m-+1 Imortant: The result of the addition/subtraction requires an extra bit in the worst-case scenario. In order to correctly comute it in fixed-oint arithmetic, we need to sign-extend (by one bit) the oerators rior to addition/subtraction. m- + m-+1 Multi-oerand Addition: N numbers of format [n ]: The total number of bits is given by : n + log 2 N (this can be demonstrated by an adder tree). Notice that the number of fractional bits does not change (it remains ), only the integer bits increase by log 2 N, i.e., the number of integer bits become n + log 2 N. xamles: Calculate the result of the additions and subtractions for the following fixed-oint numbers. UNSIGND SIGND Unsigned: Signed: Instructor: Daniel Llamocca

3 FIXD-POINT MULTIPLICATION Unsigned multilication Multilication of two signed numbers reresented with different formats [n ], [m ]: x m- (A 2 ) (B 2 ) = (A B) 2. We can erform integer multilication of A and B and then lace the fractional oint where it belongs. The format of the multilication result is given by [n + m + ]. There is no need to align the fractional oint of the inut quantities. Secial case: m = n, = (A 2 ) (B 2 ) = (A B) 2 2. Here, the format of the multilication result is given by [2n 2]. Multilication rocedure for unsigned integer numbers: n+m-- + a 3 a 2 a 1 a 0 x b 3 b 2 b 1 b 0 a 3 b 0 a 2 b 0 a 1 b 0 a 0 b 0 a 3 b 1 a 2 b 1 a 1 b 1 a 0 b 1 a 3 b 2 a 2 b 2 a 1 b 2 a 0 b 2 a 3 b 3 a 2 b 3 a 1 b 3 a 0 b xamle: when multilying, we treat the numbers as integers. Only when we get the result, we lace the fractional oint where it belongs = x 6.5 = x = Signed Multilication: We first tae the absolute value of the oerands. Then, if at least one of the oerands was negative, we need to change the sign of the result. We then lace the fractional oint where it belongs. xamles: x x x x x x x x x x x x Instructor: Daniel Llamocca

4 FIXD-POINT DIVISION Unsigned integer division: The division of two unsigned integer numbers A B (where A is the dividend and B the divisor), results in a quotient Q and a remainder R, where A = B Q + R. Most divider architectures outut Q and R. 15 Q Q ALGORITHM B 101 R end For n-bits dividend (A) and m-bits divisor (B): The largest value for Q is 2 n 1 (by using B = 1). The smallest value for Q is 0. So, we use n bits for Q. The remainder R is a value between 0 and B 1. Thus, at most we use m bits for R. If A = 0, B 0, then Q = R = 0. If B = 0, we have a division by zero. The result is undetermined. In comuter arithmetic, integer division usually means getting Q = A B. Unsigned FX Division: A f B f We first need to align the numbers so that they have the same number of fractional bits, then divide them treating them as integers. The quotient will be integer, while the remainder will have the same number of fractional bits as A f. A f is in the format [na a]. B f is in the format [nb b]. We wor with a b. If a < b, just aend 0 s to A f so that a = b. Ste 1: For a b, we align the fractional oints and then get the integer numbers A and B, which result from: A = A f 2 a B = B f 2 a Ste 2: Integer division: A B = A f B f The numbers A and B are related by the formula: A = B Q + R, where Q and R are the quotient and remainder of the integer division of A and B. Note that Q is also the quotient of A f B f. Ste 3: To get the correct remainder of A f B f, we re-write the revious equation: Then: Q f = Q, R f = R 2 a xamle: A ,011 11,1 R B A A f 2 a = (B f 2 a ) Q + R A f = B f Q + (R 2 a ) R = 0 for i = n-1 downto 0 left shift R (inut = a i ) if R B Ste 1: Alignment, a = ,011 = 1010,011 11,1 11,100 = Ste 2: Integer Division = 11100(10) Q = 10, R = Ste 3: Get actual remainder: R 2 a R f = 11,011 Verification: 1010,011 = 11,1(10) + 11,011, Q f = 10, R f = 11,011 Adding recision to Qf (quotient of A f B f ): The revious rocedure only gets Q as an integer. What if we want to get the division result with x number of fractional bits? To do so, after alignment, we aend x zeros to A f 2 a and erform integer division. A = A f 2 a 2 x B = B f 2 a A f 2 a+x = (B f 2 a ) Q + R A f = B f (Q 2 x ) + (R 2 a x ) Then: Q f = Q 2 x, R f = R 2 a x q i else q i = 0 end = 1, R R-B xamle: 1010,011 11,1 Ste 1: Alignment, a = 3 Ste 2: Aend x = 2 zeros with x = 2 bits of recision 1010,011 = 1010,011 11,1 11,100 = = Instructor: Daniel Llamocca

5 Ste 3: Integer Division Ste 4: Get actual remainder and quotient: Q f = Q 2 x, R f = R 2 a x = 11100(1011) , Q = 1011, R = Q f = 10,11, R f = 0,11000 Verification: 1010,01100 = 11,1(10,11) + 0,11000 Signed FX Division: In this case (as in the multilication), we first tae the absolute value of the oerators A and B. If only one of the oerators is negative, the result of A B requires a sign change. Note that once the correct quotient Q f with fractional bits is available, getting R f with the correct sign is not very useful. xamles: Get the division result (with x = 4 fractional bits) for the following signed fixed-oint numbers: : To ositive (numerator and denominator), alignment, and then to unsigned: a = 4: = Aend x = 4 zeros: Unsigned integer Division: Q = , R = 100 Qf = (x = 4) Final result (2C): = (this is reresented as a signed number) : To ositive (numerator and denominator), alignment, and then to unsigned, a = 5: Aend x = 4 zeros: Unsigned integer Division: Q = 1111, R = 101 Qf = (x = 4) Final result (2C): = (this is reresented as a signed number) : To ositive (numerator), alignment, and then to unsigned, a = 4: = Aend x = 4 zeros: Unsigned integer Division: Q = 10100, R = Qf = (x = 4) Qf here is reresented as an unsigned number Final result (2C): = 2C(01.01) = = : To ositive (denominator), alignment, and then to unsigned, a = 5: = Aend x = 4 zeros: Unsigned integer Division: Q = 111, R = 1110 Qf = (x = 4) Final result (2C): = 2C(0.0111) = Instructor: Daniel Llamocca

6 ARITHMTIC FX UNITS. TRUNCATION/ROUNDING/SATURATION ARITHMTIC FX UNITS They are essentially the same as those integer arithmetic units. The main difference is that we need to now where to lace the fractional oint. The design must ee trac of the FX format at every oint of the architecture. For examle, in FX division, we need first to erform alignment and aend x zeros for a desired recision; this incurs in slightly extra hardware. One benefit of FX reresentation is that we can erform truncation, rounding and saturation on the inut, intermediate, and outut values. These oerations usually require extra hardware resources. Truncation This is a useful oeration when less hardware is required in subsequent oerations. However this comes at the exense of less accuracy. To assess the effect of truncation, use PSNR (db) or MS with resect to a double floating oint result or with resect to the original [n ] format. Truncation is usually meant to be truncation of the fractional art. However, we can also - truncate the integer art (cho off MSBs). This is not recommended as it might render the number unusable. Rounding This oeration allows for hardware savings in subsequent oerations at the exense of reduced accuracy. But it is more accurate than simle truncation. However, it requires extra hardware to deal with the rounding oeration. For the number b n 1 b n 2 b 0. b 1 b 2 b, if we want to cho bits (LSB ortion), we use the b 1 bit to determine whether to round. If the b 1 = 0, we just truncate. If b 1 = 1, we need to add 1 to the LSB of the truncated result Saturation This is helful when we need to restrict the number of integer bits. Here, we are ased to reduce the number of integer bits by. Simle truncation chos off the integer art by bits; this might comletely modify the number and render it totally unusable. Instead, in n- saturation, we aly the following rules: If all the + 1 MSBs of the initial number are identical, that means that choing by bits does not change the number at all, so we just discard the MSBs. - If the + 1 MSBs are not identical, choing by bits does change the number. Thus, here, if the MSB of the initial number is 1, the resulting (n )-bit number will be 2 n 1 = 10 0 (largest negative number). If the MSB is 0, the resulting (n )-bit number will be 2 n 1 2 = (largest ositive number). xamles: Reresent the following signed FX numbers in the signed fixed-oint format: [8 7]. You can use rounding or truncation for the fractional art. For the integer art, use saturation. 1, : To reresent this number in the format [8 7], we ee the integer bit, and we can only truncate or round the last LSB: After truncation: 1, After rounding: 1, = 1, , : Here, we need to get rid of on MSB and two LSBs. Let s use rounding (to the next bit). Saturation in this case amounts to truncation of the MSB, as the number won t change if we remove the MSB. After rounding: 11, = 11, After saturation: 1, , : Here, we need to get rid of two MSB and two LSBs. Saturation: Since the three MSBs are not the same and the MSB=1 we need to relace the number by the largest negative number (in absolute terms) in the format [8 7]: 1, , : Here, we need to get rid of two MSB and three LSBs. Saturation: Since the three MSBs are not identical and the MSB=0, we need to relace the number by the largest ositive number in the format [8 7]: 0, Instructor: Daniel Llamocca

7 FLOATING-POINT (FP) ARITHMTIC FLOATING POINT RPRSNTATION There are many ways to reresent floating numbers. A common way is: X = ±significand 2 e e significand m xonent e: Signed integer. It is common to encode this field using a bias: e + bias. This facilitates zero detection (e + bias = 0). Note that the exonent of the actual number is always e regardless of the bias (the bias is just for encoding). e [ 2 1, ] Significand: Unsigned fixed-oint number. Usually normalized to a articular format, e.g.: [0, 1), [1,2). Format (unsigned): [m ]. Range: [0, 2m 1 m ] = [0, 2 2m 2 ], = m If = 0 Significand [0,1 2 ] = [0,1) If = m Significand [0, 2 m 1]. Integer significand. Another common reresentation of the significand is using = 1 and setting that bit (the MSB) to 1. Here, the range of the significand would be [0, ], but since the integer bit is 1, the values start from 1, which result in the following significand range: [1, ]. This is a oular normalization, as it allows us to dro the MSB in the encoding. I-754 STANDARD The reresentation is as follows: X = ±1. f 2 e sign bit biased exonent e+bias Significand: Unsigned FX number. The reresentation is normalized to s = 1. f, where f is the mantissa. There is always an integer bit 1 (called hidden 1) in the reresentation of the significand, so we do not need to indicate in the encoding. Thus, we only use f the mantissa in the significant field. Significand range: [1,2 2 ] = [1,2) Biased exonent: Unsigned integer with bits. bias = Thus, ex = e + bias e = ex bias. We just subtract the bias from the exonent field in order to get the exonent value e. ex = e + bias [0, 2 1 ]. ex is reresented as un unsigned integer number with bits. The bias maes sure that ex 0. Also note that e [ , 2 1 ]. The I-754 standard reserves the following cases: i) ex = 2 1 (e = 2 1 ) to reresent secial numbers (NaN and ± ), and ii) ex = 0 to reresent the zero and the denormalized numbers. The remaining cases are called ordinary numbers. significand f Ordinary numbers: biased exonent e+bias [1,2-2] significand Range of e: [ , 2 1 1]. largest exonent Max number:largest significand 2 f max = = (2 2 ) Min. number: smallest significand 2 min = = Dynamic Range = max min = (2 2 ) = (2 2 ) Dynamic Range (db) = 20 log 10 {(2 2 ) } smallest exonent Plus/minus Infinite: ± biased exonent significand e+bias = 2-1 f=0 The ex field is a string of 1 s. This is a secial case where ex = 2 1. (e = 2 1 ) ± = ±2 2 1 Not a Number: NaN biased exonent e+bias = 2-1 significand f 0 The ex field is a strings of 1 s. ex = 2 1. This is a secial case where ex = 2 1 (e = 2 1 ). The only difference with ± is that f is a nonzero number. 7 Instructor: Daniel Llamocca

8 Zero: biased exonent significand Zero cannot be reresented with a normalized significand s = since X = ±1. f 2 e cannot be e+bias = 0 f=0 zero. Thus, a secial code must be assigned to it, where s = and ex = 0. very single bit (excet for the sign) is zero. There are two reresentations for zero. The number zero is a secial case of the denormalized numbers, where s = 0. f (see below). Denormalized numbers: The imlementation of these numbers is otional in the standard (excet for the zero). Certain small values that are not reresentable as normalized numbers (and are rounded to zero), can be reresented more recisely with denormals. This is a graceful underflow rovision, which leads to hardware overhead. biased exonent e+bias = 0 significand f 0 These numbers have the ex field equal to zero. The tricy art is that e is set to (not , as the e + bias formula states). The significand is reresented as s = 0. f. Thus, the floating oint number is X = ±0. f These numbers can reresent numbers lower (in absolute value) than min (the number zero is a secial case). Why is e not ? Note that the smallest ordinary number is The largest denormalized number with e = is: = (1 2 ) The largest denormalized number with e = is: = (1 2 ) By icing e = , the ga between the largest denormalized number and the smallest ordinary number is smaller. Though this secification maes the formula e + bias = 0 inconsistent, it hels to imrove accuracy. Deiction of the range of values: Underflow region (or denormal numbers) Overflow region The I (revision of I ) standard defines several reresentations: half (16 bits, =5, =10), single (32 bits, = 8, = 23) and double (64 bits, = 11, = 52). There is also quadrule recision (128 bits) and octule recision (256 bits). You can define your own reresentation by selecting a articular number of bits for the exonent and significand. The table lists various arameters for half, single and double FP arithmetic (ordinary numbers): Ordinary numbers xonent Dynamic Significand Significand Range of e Bias Min Max bits () Range (db) range bits () Half 2 14 ( ) [ 14,15] db [1, ] 10 Single ( ) [ 126,127] db [1, ] 23 Double ( ) [ 1022,1023] db [1, ] 52 Rules for arithmetic oerations: Ordinary number (+ ) = ±0 Ordinary number (0) = ± (+ ) Ordinary number = ± NaN + Ordinary number = NaN (0) (0) = NaN (± ) (± ) = NaN (0) (± ) = NaN ( ) + ( ) = NaN xamles: F43D962 (single): e + bias = = 232 e = = 105 Significand = = X = = FAD5 (single): e + bias = = 0 Denormal number e = 126 Significand = = X = = Instructor: Daniel Llamocca

9 c 12 =1 c 11 =1 c 10 =1 c 9 =1 c 8 =1 c 7 =0 c 6 =1 c 5 =0 c 4 =1 c 3 =1 c 2 =0 c 1 =1 ADDITION/SUBTRACTION b 1 = ±s 1 2 e 1, s 1 = 1. f 1 b 2 = ±s 2 2 e 2, s 1 = 1. f 2 b 1 + b 2 = ±s 1 2 e 1 ± s 2 2 e 2 If e 1 e 2, we simly shift s 2 to the right by e 1 e 2 bits. This ste is referred to as alignment shift. s 2 2 e 2 = s 2 2 e 2 1 e e 1 2 b 1 + b 2 = ±s 1 2 e 1 ± s 2 2 e 2 1 e e 1 = (±s 2 1 ± s 2 2 e ) 2 1 e e 1 = s 2 e 2 b 1 b 2 = ±s 1 2 e 1 s 2 2 e 1 e 2 2 e 1 = (±s 1 s 2 2 e 1 e 2 ) 2 e 1 = s 2 e Normalization: Once the oerators are aligned, we can add. The result might not be in the format 1. f, so we need to discard the leading 0 s of the result and sto when a leading 1 is found. Then, we must adjust e 1 roerly, this results in e. For examle, for addition, when the two oerands have similar signs, the resulting significand is in the range [1,4), thus a single bit right shift is needed on the significant to comensate. Then, we adjust e 1 by adding 1 to it (or by left shifting everything by 1 bit). When the two oerands have different signs, the resulting significand might be lower than 1 (e.g.: ) and we need to first discard the leading zeros and then right shift until we get 1. f. We then adjust e 1 by adding the same number as the number of shifts to the right on the significand. Note that overflow/underflow can occur during the addition ste as well as due to normalization. xamle: s 3 = (±s 1 ± s 2 2 e 1 e2 ) = First, discard the leading zeros: s 3 = Normalization: right shift 1 bit: s = s = Now that we have the normalized significand s, we need to adjust the exonent e 1 by adding 1 to it: e = e 1 + 1: (s ) 2 e 1+1 = s 2 e = e 1+1 xamle: b 1 = , b 2 = b = b 1 + b 2 = = ( ) = To get this result, we convert the oerands to the 2C reresentation (you can also do unsigned subtraction if the result is ositive). Here, the result is ositive. Finally, we erform normalization: b = b 1 + b 2 = ( ) 2 5 = ( ) = xamle (Subtraction): b 1 = , b 2 = b = b 1 b 2 = = ( ) 2 5 To subtract, we convert to 2C reresentation: R = = = Here, the result is negative. So, we get the absolute value ( R = 2C(1.0111) = ) and lace the negative sign on the final result: b = b 1 b 2 = (0.1001) 2 5 xamles: X = 50DAD000 D0FAD000: 50DAD000: e + bias = = 161 e = = 34 Significand = DAD000 = D0FAD000: e + bias = = 161 e = = 34 Significand = D0FAD000 = X = (unsigned addition) X = = e + bias = = 162 = X = = 516AD Instructor: Daniel Llamocca

10 X = 60A C2F97000: 60A10000: e + bias = = 193 e = = 66 Significand = A10000 = C2F97000: e + bias = = 133 e = = 6 Significand = C2F97000 = X = X = Reresenting the division by 2 60 requires more than + 1 = 24 bits. Thus, we can aroximate the 2 nd oerand with 0. X = X = = 60A10000 MULTIPLICATION b 1 = ±s 1 2 e 1, b 2 = ±s 2 2 e 2 b 1 b 2 = (±s 1 2 e 1 ) (±s2 2 e 2 ) = ±(s1 s 2 )2 e 1+e 2 Note that s = (s 1 s 2 ) [1,4). This means that the result might require normalization. xamle: b 1 = , b 2 = , b = b 1 b 2 = ( ) 2 6 = (10,0001) 2 6, Normalization of the result: b = (10, ) 2 7 = (1,00001) 2 7. Note: If the multilication requires more bits than allowed by the reresentation (32, 64 bits), we have to do truncation or rounding. It is also ossible that overflow/underflow might occur due to large/small exonents and/or multilication of large/small numbers. xamles: X = 7A09D300 0BF000: 7A09D300: e + bias = = 244 e = = 117 Significand = A09D300 = BF000: e + bias = = 23 e = = 104 Significand = BF000 = X = X = = = e + bias = = 141 = X = = 4680A35F (four bits were truncated) X = 0B09A000 8FACC000: 0B092000: e + bias = = 22 e = = 105 Significand = B = FACC000: e + bias = = 31 e = = 96 Significand = FAC000 = X = = = e + bias = = 74 < 0 Here, there is underflow (not even denormalized numbers different than zero can reresent it). Then X 0. X = = Instructor: Daniel Llamocca

11 DIVISION b 1 = ±s 1 2 e 1, b 2 = ±s 2 2 e 2 b 1 b 2 = ±s 12 e1 ±s 2 2 e 2 = ± s 1 s 2 2 e 1 e 2 Note that s = ( s 1 s 2 ) (1/2,2). This means that the result might require normalization. xamle: b 1 = , b 2 = b = b = : unsigned division, here we can include as many fractional bits as we want. With x = 4 (and a = 0) we have: = 10101(1011) Q f = 1,0101, R f = 00,0011 If the result is not normalized, we need to normalized it. In this examle, we do not need to do this. b = b = xamle: X = : : e + bias = = 146 e = = 19 Significand = = : e + bias = = 128 e = = 1 Significand = BF000 = X = Alignment: = = Aend x = 8 zeros: Integer division Q = , R = Qf = Thus: X = = = = e + bias = = 145 = X = = 489B Instructor: Daniel Llamocca

12 DUAL FIXD-POINT (DFX) ARITHMTIC Floating oint (FP) and fixed oint (FX) arithmetic are the standard numerical reresentations. Floating oint features a large dynamic range at the exense of a large resource usage. On the other hand, fixed oint requires fewer resources, but it delivers a low dynamic range. Dual fixed oint (DFX) arithmetic is an alternative reresentation that overcomes the limitations of FX (it greatly imroves dynamic range) without the high resource comlexity of floating oint. n bit Dual Fixed Point (DFX) number: exonent (), signed significand (X) with n 1 bits. xonent : It selects between two scalings for the significand X. Thus, there are two ossible cases for a DFX number D: D = { num0: X. 2 0, if = 0 num1: X. 2 1, if = 1, 0 > 1 num0: It has 0 fractional bits. num1: It has 1 fractional bits. Notation of a DFX number: n_0_1 or [n 0 1 ]. XPONNT SIGND SIGNIFICANT: n-1 bits X n bits BOUNDARY VALU The scaling (value of ) is determined by the boundary value B: = { 0 (num0), B D < B 1 (num1), D < B and D B Boundary value B: Defined as the next incremental value after the maximum ositive number of num0: Range of num0 = 2(n 1) 1 2 to 2(n 1) 1 1 n B = 2(n 1) LSB = 2(n 1) B = 2n num0: 0 0 If we have a num0 number with n bits, it seems that we could convert it into a num1 number with n bits (by truncating 0 1 LSBs and by sign-extending the MSB of the significand). However, we cannot do this, as the number will not be in the num1 range (D < B and D B). To convert a num0 number that does not fit with n bits (but with n + x bits, format (n + x)_ 0 _ 1 ) into a num1 number that might fit with n bits (format n_ 0 _ 1 ), we need to discard the 0 1 LSBs and then sign-extend the significand by 0 1 x bits. xamle: Format 16_7_3 to 14_7_3: x = 2, 0 1 = 4 num0 number in format 16_7_3: Convert to 14_7_3: it has to be num1: (num0 would not wor as we will lose MSBs) RANGS FOR num0 and num1: num0 range: There is smaller sacing (2 0 ) between consecutive numbers; thus, they are more accurate. num1 range: There is larger sacing (2 1 ) between consecutive numbers; thus, they are less accurate. num1 Range num0 Range 0 Range for num0: [ 2 n 0 2, 2 n ] Range for num1: [ 2 n 1 2, 2 n ] [ 2 n 0 2, 2 n ] 12 Instructor: Daniel Llamocca

13 DYNAMIC RANG It is defined as the ratio between the largest absolute value and the smallest nonzero absolute value. Unsigned numbers with n bits: Signed numbers (2 s comlement): [n ] Dual fixed oint (DFX) numbers: n_0_1 2 n 1 = 2 1 n 1 Dynamic Range = 20 log 10 (2 n 1) db 2 n 1 2 = 2 n 1 Dynamic Range = 20 log 10 (2 n 1 ) db 2 n = 2 0 n Dynamic Range = 20 log 10 (2 n ) db FX TO DFX CONVRSION What if we have [nin in] (signed) and we want to convert to n_0_1? We first try with num0 (since it is more accurate). For num0 we need: B D < B, B = 2 n 0 2. Note that B 2 0 = B = A quic way to chec this is by aligning the two formats and then comaring the nin in (n 1 0) + 1 MSBs of the FX format with the MSB (of the significand) of the DFX number. If they are all the same, that means that the number is a num0. niin in [nin in] n num0 0 If the number is not a num0, we try with num1: niin in [nin in] num1 1 n If num1 does not wor either, it means there is overflow and we need more than n bits. Note: This rocedure always starts by checing if the number is num0. A common mistae is to start by checing if the number is num1. The rocedure only wors for continuous ranges, which is not the case of num1. To directly verify that a DFX number is a num1, see if we can dro off enough MSBs so as to mae it a num0. If the integer art is affected, then the number is a num1; otherwise it is a num0. xamles: Signed [8 4] to 7_4_3. We first chec for num0, then for num1. None of these wor, so there is overflow (need more bits). 8 bits 8 bits [8 4] [8 4] _4_3, num _4_3, num bits 7 bits Signed [8 4] to 8_4_3. We first chec for num0, which does not wor. Then, we chec for num1, which wors. 8 bits 8 bits [8 4] [8 4] _4_3, num _4_3, num bits 13 Instructor: Daniel Llamocca

14 xamles Convert the following signed fixed-oint numbers in format [12 8] to Dual Fixed Point Format 12_8_4:.: To DFX 12_8_4 (num0): = 6 A.CD: To DFX 12_8_4 (num0): = not a num0! To DFX 12_8_4 (num1): = FAC C.1B: To DFX 12_8_4 (num0): = 41B 8.B9: To DFX 12_8_4 (num0): = not a num0! To DFX 12_8_4 (num1): = F8B 3.0A: To DFX 12_8_4 (num0): = 10A DUAL FIXD-POINT ADDITION Here, we add two DFX numbers A and B with n bits. To do this, we need to follow this rocedure: Pre-scaling: We get rid of the exonent () bit. Then, we align the two (n 1)-bit significands (signed FX numbers). Here, alignment might require discarding 0 1 fractional bits (truncation) and sign-extending MSBs by 0 1 bits. Imroving DFX Adder accuracy: We can save the 0 1 truncated LSBs. In the ost-scaler, we might be able to shift in the truncated LSBs. This only wors when A and B have different exonents. Fixed-oint addition: This is a simle addition of two (n 1)-bit significands. Note that in the addition result can have u to n bits. The format would be either [n 0 ] or [n 1 ]. Post-scaling: The n-bit FX result must be converted to n-bit DFX. In some cases, there can be overflow when the result cannot fit as a num1 in DFX. xamles: Addition of numbers in format 8_3_2. The result should be in the same format. Addition of two num0 numbers: FX addition requires 9 bits. When this FX number is converted to DFX, we notice that it cannot be reresented as a num0 (unless we change the DFX format to 9_3_2); it will have to be a num1. DFX FX FX DFX: num0 DFX: num Addition of two num1 numbers. Addition results in overflow (we need 9 bits to reresent the result). Here, the result in format 8_3_2 has to include an overflow flag. We can recover from this if we include a carry out (c n ) bit in the circuitry. DFX FX FX Overflow! DFX: num0 DFX: num Instructor: Daniel Llamocca

15 Addition of a num0 and a num1 number. Addition does not result in overflow, num0 suffices to reresent the addition in the 8_3_2 format. Here, there are two ways: add zero the LSB or retrieve the saved bit. DFX FX FX DFX: num0 OR DFX: num Subtraction of two num1 numbers. The oeration does not result in overflow, num0 suffices to reresent the final result in the 8_3_2 format. DFX FX FX FX bits 7 bits 7 bits DFX: num0 8 bits DUAL FIXD-POINT MULTIPLICATION Here, we add two DFX numbers with n bits. To do this, we get rid of the exonent () bit, and then we only need to multily two (n 1)-bit significands in fixed oint arithmetic. Then, we convert the FX result into the DFX number. Unlie DFX addition, there is no re-scaling. We just multily the two signed FX numbers. The result will have 2n-2 bits. We need a Post-Scaling stage that converts this FX result into the DFX format. xamles: DFX multilication of numbers in [8 3 2] format. DFX x DFX x x Signed FX x Unsigned FX Unsigned integer x x Signed FX x Unsigned FX Unsigned integer x Signed FX Signed FX Mult. Result Signed FX Mult. Result To DFX 8_3_2 (num0): = To DFX 8_3_2 (num0): = Instructor: Daniel Llamocca

Computer arithmetic. Intensive Computation. Annalisa Massini 2017/2018

Computer arithmetic. Intensive Computation. Annalisa Massini 2017/2018 Comuter arithmetic Intensive Comutation Annalisa Massini 7/8 Intensive Comutation - 7/8 References Comuter Architecture - A Quantitative Aroach Hennessy Patterson Aendix J Intensive Comutation - 7/8 3