Binary Floating-Point Numbers

Richard Wiggins

1 Binary Floating-Point Numbers
Format: S | exponent E | significand M
$F = (-1)^S \cdot M \cdot \beta^E$
Significand M: a pure fraction in $[0, 1 - \mathrm{ulp}]$, or in $[1, 2)$ for $\beta = 2$

2 Normalized form
The significand has no leading zeros
  maximum number of significant digits
  easy comparison
Normalization for $\beta = 2^k$: at least one nonzero bit in the first k positions
$M_{\min} = 1/\beta$
$M_{\max} = 1 - \mathrm{ulp}$

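3 Biased exponent
$E = E_{true} + \mathrm{bias}$
$-2^{e-1} \le E_{true} \le 2^{e-1} - 1$ for an e-bit exponent field
If $\mathrm{bias} = 2^{e-1}$, then $0 \le E \le 2^e - 1$
Excess-$2^{e-1}$ code: the two's complement code with the sign bit inverted
Zero: M = 0, E = 0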
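
The relation between the excess code and two's complement stated on this slide can be checked with a short sketch. The function names and the 7-bit field width used in the usage note are illustrative assumptions, not part of the slides.

```python
def to_excess(e_true: int, e_bits: int) -> int:
    """Biased exponent E = E_true + bias, with bias = 2^(e-1)."""
    bias = 1 << (e_bits - 1)
    if not -bias <= e_true <= bias - 1:
        raise ValueError("true exponent out of range for this field width")
    return e_true + bias


def to_twos_complement(e_true: int, e_bits: int) -> int:
    """e-bit two's complement pattern of the true exponent."""
    return e_true & ((1 << e_bits) - 1)


def flip_sign_bit(pattern: int, e_bits: int) -> int:
    """Invert the most significant bit of an e-bit pattern."""
    return pattern ^ (1 << (e_bits - 1))
```

For a 7-bit field, `to_excess(-64, 7)` gives 0 and `to_excess(63, 7)` gives 127, and in both cases the result equals `flip_sign_bit(to_twos_complement(e, 7), 7)`.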

4 Comparing floating-point numbers
Biased exponent: exponents can be compared as if they were unsigned numbers
Biased exponent + normalization: compare the exponents first, then the significands
Result: floating-point numbers can be compared as if they were integers in signed-magnitude representation
  S | exponent E | significand M
  sign | magnitude
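
A minimal sketch of this comparison rule, assuming IEEE single precision as the concrete format (the slide itself is format-independent). The helper names are hypothetical, and NaN inputs are outside the scope of the sketch.

```python
import struct


def bits_of(x: float) -> int:
    """32-bit pattern of x stored as an IEEE single-precision number."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def signed_magnitude_key(bits: int) -> int:
    """Read the pattern as a signed-magnitude integer: sign bit plus 31-bit magnitude."""
    sign = bits >> 31
    magnitude = bits & 0x7FFFFFFF
    return -magnitude if sign else magnitude


def less_than(a: float, b: float) -> bool:
    """Compare two floats using integer comparison only (NaN not handled)."""
    return signed_magnitude_key(bits_of(a)) < signed_magnitude_key(bits_of(b))
```

Because the exponent sits above the significand and is biased, a larger magnitude always yields a larger 31-bit integer, so `less_than(-2.0, -1.5)` and `less_than(1.5, 2.0)` both hold. Note that +0 and -0 map to the same key.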

5 Range
Positive floating-point numbers: $M_{\min} \cdot \beta^{E_{\min}} \le F^+ \le M_{\max} \cdot \beta^{E_{\max}}$
$E > E_{\max}$: exponent overflow
$E < E_{\min}$: exponent underflow
The range of $F^+$ equals the range of $-F^-$

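6 IBM
Format: S | exponent E | significand M, with $\beta = 16$
$F = (-1)^S \cdot M \cdot 16^{E-64}$
$F^+_{\min} = 16^{-1} \times 16^{-64} \approx 5.4 \times 10^{-79}$
$F^+_{\max} = (1 - 16^{-6}) \times 16^{63} \approx 7.2 \times 10^{75}$ (six hexadecimal significand digits, as in the short format)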
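
As a rough illustration of the IBM format, the sketch below decodes a 32-bit word. It assumes the System/360 short layout (1 sign bit, 7-bit excess-64 exponent, 24-bit fraction); the field widths are not legible in the slide, and the function name is hypothetical.

```python
from fractions import Fraction


def decode_ibm_short(word: int) -> Fraction:
    """Value of a 32-bit word read as (-1)^S * M * 16^(E-64).

    Assumed layout: bit 31 = S, bits 30..24 = E, bits 23..0 = M (six hex digits).
    """
    sign = (word >> 31) & 1
    exponent = (word >> 24) & 0x7F
    fraction = word & 0xFFFFFF
    m = Fraction(fraction, 16**6)               # pure fraction 0.hhhhhh
    value = m * Fraction(16) ** (exponent - 64)
    return -value if sign else value
```

Under these assumptions, `0x41100000` decodes to 1, `0x00100000` to the smallest normalized positive value $16^{-65}$, and `0x7FFFFFFF` to the largest positive value $(1 - 16^{-6}) \times 16^{63}$.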

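7 DEC/VAX
Format: S | exponent E | significand f, with $\beta = 2$
$F = (-1)^S \cdot 0.1f \cdot 2^{E-128}$
E = 0 is reserved for zero
$F^+_{\min} = (0.1)_2 \times 2^{1-128} = 2^{-128}$
$F^+_{\max} = (1 - 2^{-24}) \times 2^{255-128} = (1 - 2^{-24}) \times 2^{127}$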

8 Floating-point operations
Multiplication ($F_3 = F_1 \times F_2$)
Sign bit: $S_3 = S_1 \oplus S_2$
Exponent: $E_3 = E_1 + E_2 - \mathrm{bias}$
  $E_3 > E_{\max}$: overflow
  $E_3 < E_{\min}$: underflow
Significand: $1/\beta \le M_i < 1$, so $1/\beta^2 \le M_1 M_2 < 1$
Postnormalization: if $M_3 = M_1 M_2 < 1/\beta$, shift $M_3$ left by one digit and decrease $E_3$ (may cause underflow)
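
The multiplication steps can be sketched as follows. This is only an illustration under stated assumptions: operands are `(S, E, M)` tuples with `M` an exact `Fraction` in $[1/\beta, 1)$, the default bias and exponent limits are picked arbitrarily, zero operands are not handled, and the product significand is not rounded to a fixed width.

```python
from fractions import Fraction


def fp_multiply(a, b, beta=2, bias=128, e_min=0, e_max=255):
    """Multiply two normalized operands given as (S, E, M) tuples."""
    s1, e1, m1 = a
    s2, e2, m2 = b
    s3 = s1 ^ s2                      # sign: exclusive OR
    e3 = e1 + e2 - bias               # both exponents carry the bias once
    m3 = m1 * m2                      # lies in [1/beta^2, 1)
    if m3 < Fraction(1, beta):        # postnormalization: one digit at most
        m3 *= beta
        e3 -= 1
    if e3 > e_max:
        raise OverflowError("exponent overflow")
    if e3 < e_min:
        raise ArithmeticError("exponent underflow")
    return s3, e3, m3
```

For example, with the defaults, `(0, 130, Fraction(3, 4))` represents 3 and `(1, 129, Fraction(1, 2))` represents -1; their product needs one left shift and comes out as `(1, 130, Fraction(3, 4))`, that is, -3.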

9 Division ($F_3 = F_1 / F_2$)
Exponent: $E_3 = E_1 - E_2 + \mathrm{bias}$
Significand: $1/\beta \le M_i < 1$, so $1/\beta < M_1 / M_2 < \beta$
Postnormalization: if $M_3 = M_1 / M_2 \ge 1$, shift $M_3$ right by one digit and increase $E_3$ (may cause overflow)
$M_2 = 0$: division by zero
$M_1 = M_2 = 0$: undefined

10 Addition/Subtraction
Shift the significand of the smaller operand right by $|E_1 - E_2|$ base-$\beta$ positions
Addition: $1/\beta \le M < 2$; if $M \ge 1$, postnormalization is needed
Subtraction: $0 \le M < 1$; if $M < 1/\beta$, postnormalization is needed

11 Example
$F_1 = (0.100000)_{16} \times 16^3$, $F_2 = (0.\mathrm{FFFFFF})_{16} \times 16^2$

Without a guard digit:
  F1                  0.100000   x 16^3
  F2 aligned          0.0FFFFF   x 16^3
  F1 - F2             0.000001   x 16^3
  postnormalization   0.100000   x 16^-2

With one guard digit:
  F1                  0.100000 0 x 16^3
  F2 aligned          0.0FFFFF F x 16^3
  F1 - F2             0.000000 1 x 16^3
  postnormalization   0.100000   x 16^-3
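
The effect of the guard digit in this example can be reproduced with a small sketch. It assumes six-digit hexadecimal significands stored as integers (0.100000 is `0x100000`), a first operand that is at least as large as the second, and truncation of digits shifted out beyond the guard positions; `subtract_hex` is a hypothetical helper name.

```python
def subtract_hex(m1, e1, m2, e2, digits=6, guard=0):
    """Return (significand, exponent) of m1*16^e1 - m2*16^e2, with e1 >= e2.

    Significands are `digits`-digit hex fractions held as integers.
    `guard` extra digits are kept during alignment; the rest are dropped.
    """
    shift = e1 - e2
    a = m1 << (4 * guard)
    b = (m2 << (4 * guard)) >> (4 * shift)   # align the smaller operand
    diff = a - b
    if diff == 0:
        return 0, 0
    e = e1
    total = digits + guard
    while diff < 16 ** (total - 1):          # postnormalize: shift left by digits
        diff <<= 4
        e -= 1
    return diff >> (4 * guard), e            # drop the guard positions
```

`subtract_hex(0x100000, 3, 0xFFFFFF, 2)` yields `(0x100000, -2)`, which is off by a factor of 16, while `subtract_hex(0x100000, 3, 0xFFFFFF, 2, guard=1)` yields the correct `(0x100000, -3)`.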

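12 IEEE floating-point standard
Single precision
Format: S (1 bit) | exponent E (8 bits) | fraction f (23 bits)
$F = (-1)^S \times 1.f \times 2^{E-127}$ ($1 \le E \le 254$)
$F^+_{\max} = (1.11\ldots1)_2 \times 2^{254-127} = (2 - 2^{-23}) \times 2^{127} \approx 3.4 \times 10^{38}$
$F^+_{\min} = 1.0 \times 2^{1-127} = 2^{-126} \approx 1.2 \times 10^{-38}$
Precision = 24 bits (23 fraction bits plus the hidden leading 1)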

13 Reserved exponent values
E = 255 is reserved for $\pm\infty$ and NaN
E = 0 is reserved for zero and denormalized numbers
  zero: E = 0, f = 0
  denormalized numbers: E = 0, $f \ne 0$
  $F = (-1)^S \times 0.f \times 2^{-126}$
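
The single-precision rules, including the reserved exponent values, can be put together in a short decoder. This is a sketch for illustration; `decode_single` and `hardware_decode` are hypothetical names, and the latter only cross-checks the result against Python's own interpretation via `struct`.

```python
import math
import struct
from fractions import Fraction


def decode_single(bits: int) -> float:
    """Decode a 32-bit IEEE single-precision pattern field by field."""
    s = (bits >> 31) & 1
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    sign = -1.0 if s else 1.0
    if e == 255:                                  # reserved: infinity or NaN
        return math.nan if f else sign * math.inf
    if e == 0:                                    # zero or denormalized: 0.f * 2^-126
        significand = Fraction(f, 2**23)
        scale = Fraction(2) ** -126
    else:                                         # normalized: 1.f * 2^(E-127)
        significand = 1 + Fraction(f, 2**23)
        scale = Fraction(2) ** (e - 127)
    return sign * float(significand * scale)


def hardware_decode(bits: int) -> float:
    """Reference value obtained by reinterpreting the same 32 bits."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

For instance, `0x7F7FFFFF` decodes to $(2 - 2^{-23}) \times 2^{127}$, `0x00800000` to $2^{-126}$, and `0x00000001` to the smallest denormalized value $2^{-149}$.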

14 Double precision
Format: S (1 bit) | exponent E (11 bits) | fraction f (52 bits)
$F = (-1)^S \times 1.f \times 2^{E-1023}$

15 Round-off schemes
Truncation
[Figure: Trunc(x) plotted against x]

16 Rounding errors for the truncation scheme with d = 2 (d: number of extra digits)

  Number   Trunc(x)   Error
  X.00     X           0
  X.01     X          -1/4
  X.10     X          -1/2
  X.11     X          -3/4

bias = average error = (0 - 1/4 - 1/2 - 3/4)/4 = -3/8

17 Round-to-nearest
[Figure: Round-to-nearest(x) plotted against x]

18 Rounding errors for the round-to-nearest scheme with d = 2

  Number   Round-to-nearest(x)   Error
  X.00     X                      0
  X.01     X                     -1/4
  X.10     X + 1                 +1/2
  X.11     X + 1                 +1/4

bias = (0 - 1/4 + 1/2 + 1/4)/4 = 1/8

19 Round-to-nearest-even
[Figure: Round-to-nearest-even(x) plotted against x]

20 Rounding errors for the round-to-nearest-even scheme with d = 2

  Number   Round-to-nearest-even(x)   Error
  X0.00    X0.                         0
  X0.01    X0.                        -1/4
  X0.10    X0.                        -1/2
  X0.11    X1.                        +1/4
  X1.00    X1.                         0
  X1.01    X1.                        -1/4
  X1.10    X1. + 1                    +1/2
  X1.11    X1. + 1                    +1/4

bias = 0
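
The three bias values in the tables can be recomputed by enumerating every bit pattern. In this sketch a number is modeled as a non-negative integer `n` whose lowest `d` bits are the extra digits, so its value in units of the last retained position is `n / 2**d`; all function names are illustrative.

```python
from fractions import Fraction


def truncate(n: int, d: int) -> int:
    """Drop the d extra bits."""
    return n >> d


def round_to_nearest(n: int, d: int) -> int:
    """Add half an ulp and truncate; ties round up."""
    return (n + (1 << (d - 1))) >> d


def round_to_nearest_even(n: int, d: int) -> int:
    """Round to nearest; a tie goes to the neighbor with an even last bit."""
    kept, extra = n >> d, n & ((1 << d) - 1)
    half = 1 << (d - 1)
    if extra > half or (extra == half and kept & 1):
        kept += 1
    return kept


def bias(scheme, d: int, kept_bits: int = 1) -> Fraction:
    """Average of (rounded - exact) over all patterns with kept_bits + d bits."""
    count = 1 << (kept_bits + d)
    total = sum(scheme(n, d) - Fraction(n, 1 << d) for n in range(count))
    return total / count
```

With `d = 2`, `bias(truncate, 2)` is -3/8, `bias(round_to_nearest, 2)` is 1/8, and `bias(round_to_nearest_even, 2)` is 0, matching the three tables.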

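21 ROM rounding
Avoids the add operation
The l - 1 low-order bits of the retained significand and the MSB of the extra d bits form an l-bit address into a $2^l \times (l - 1)$ ROM
The ROM outputs the l - 1 rounded bits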

22 Round-to-nearest using ROM
[Figure: ROM(x) plotted against x]

23 Rounding errors for the ROM rounding scheme with l = 3 and d = 1

  Number   Round-to-nearest(x)   Error
  X00.0    X00.                   0
  X00.1    X01.                  +1/2
  X01.0    X01.                   0
  X01.1    X10.                  +1/2
  X10.0    X10.                   0
  X10.1    X11.                  +1/2
  X11.0    X11.                   0
  X11.1    X11.                  -1/2

$\mathrm{bias} = \frac{1}{2}\left[\left(\frac{1}{2}\right)^d - \left(\frac{1}{2}\right)^{l-1}\right]$
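
A table-driven sketch of ROM rounding, consistent with the l = 3, d = 1 table above: the ROM rounds to nearest except when the l - 1 addressed bits are all ones, where it truncates so that no carry has to propagate. The function names and the integer encoding of numbers (lowest `d` bits are the extra bits) are assumptions made for the illustration.

```python
from fractions import Fraction


def build_rom(l: int) -> list:
    """2^l entries of l-1 bits; address = (l-1 low significand bits, MSB of extra bits)."""
    all_ones = (1 << (l - 1)) - 1
    rom = []
    for address in range(1 << l):
        low_bits, round_bit = address >> 1, address & 1
        if round_bit and low_bits != all_ones:   # round up only if no carry out
            low_bits += 1
        rom.append(low_bits)
    return rom


def rom_round(n: int, d: int, rom: list, l: int) -> int:
    """Round n (with d extra bits) by a single ROM lookup, no adder."""
    kept = n >> d
    msb_extra = (n >> (d - 1)) & 1
    low_mask = (1 << (l - 1)) - 1
    address = ((kept & low_mask) << 1) | msb_extra
    return (kept & ~low_mask) | rom[address]


def rom_bias(l: int, d: int) -> Fraction:
    """Average rounding error over all patterns of l-1 retained bits and d extra bits."""
    rom = build_rom(l)
    count = 1 << (l - 1 + d)
    total = sum(rom_round(n, d, rom, l) - Fraction(n, 1 << d) for n in range(count))
    return total / count
```

`build_rom(3)` gives `[0, 1, 1, 2, 2, 3, 3, 3]`, and `rom_bias(3, 1)` evaluates to 1/8, in agreement with the bias formula.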

24 Guard Digits
Multiplication
  1.f x 1.f: at most one shift right
  0.f x 0.f: at most one shift left
  one guard digit (G) for postnormalization
Division
  0.f / 0.f: at most one shift right
  1.f / 1.f: at most one shift left
  one guard digit (G) for postnormalization
Round-to-nearest needs one more digit, R (round)

25 Round-to-nearest-even needs an additional digit, S (sticky)
S indicates whether all the additional digits are zero

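26 Addition/Subtraction
No shift for the alignment
  [Worked example; operand bits lost in extraction. Rows: A, B aligned, A - B, postnormalization]
One-bit shift for the alignment: one guard bit
  [Worked example; operand bits lost in extraction. Rows: A, B aligned, A - B, postnormalization]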

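27 Shift by two or more bits for the alignment
  [Worked example; operand bits lost in extraction. Rows: A, B aligned, A - B, postnormalization]
Postnormalization needs at most a one-bit shift, but one guard bit is not sufficient
  [Worked example; operand bits lost in extraction. Rows: A, B aligned, A - B, postnormalization]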

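28 Use a sticky bit
  [Worked example; operand bits lost in extraction. Rows: A, B aligned, A - B, postnormalization]
  [Worked example with G and S columns. Rows: A, B aligned, A - B, postnormalization]
Good for the truncation scheme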

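29 The round-to-nearest(-even) scheme requires an additional bit
  [Worked example; operand bits lost in extraction. Rows: A, B aligned, A - B, postnormalization, round-to-nearest]
  [Worked example with G and S columns. Rows: A, B aligned, A - B, postnormalization, round-to-nearest]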

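30 Use a round bit
  [Worked example; operand bits lost in extraction. Rows: A, B aligned, A - B, postnormalization, round-to-nearest]
  [Worked example with G, R and S columns. Rows: A, B aligned, A - B, postnormalization, round-to-nearest]
The round-to-nearest-even scheme can use the S bit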

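31 If no postnormalization is required (no shift left), the G bit serves as the R bit, and
$S_{new} = R + S_{old}$ (+ denotes logical OR)
Round-to-nearest-even: add ulp to the significand if
$R S + R \bar{S} L = R (S + L) = 1$, where L is the least significant bit of the significand
  [Worked example with G, R and S columns. Rows: A, B aligned, A - B, then R and S before rounding and the result after round-to-nearest]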
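
The rounding condition on this slide can be sketched with integer bit operations. The function names are hypothetical, bits are passed as 0 or 1, and a carry out of the significand after adding ulp (which would require renormalization) is not handled here.

```python
def merge_when_no_left_shift(g: int, r: int, s: int):
    """No left shift: G takes the role of R, and the old R is ORed into S."""
    return g, r | s


def round_nearest_even(significand: int, r: int, s: int) -> int:
    """Add one ulp when R (S + L) = 1, where L is the significand's last bit."""
    l = significand & 1
    if r & (s | l):
        significand += 1
    return significand
```

For example, `round_nearest_even(0b1010, 1, 0)` leaves the significand unchanged (a tie with an even last bit), whereas `round_nearest_even(0b1011, 1, 0)` rounds up to `0b1100`.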
