On relative errors of floating-point operations: optimal bounds and applications

On relative errors of floating-point operations: optimal bonds and applications Clade-Pierre Jeannerod, Siegfried M. Rmp To cite this version: Clade-Pierre Jeannerod, Siegfried M. Rmp. On relative errors of floating-point operations: optimal bonds and applications. Mathematics of Comptation, American Mathematical Society, 2018, 87, pp.803-819. <10.1090/mcom/3234>. <hal-00934443v4> HAL Id: hal-00934443 https://hal.inria.fr/hal-00934443v4 Sbmitted on 3 Nov 2016 HAL is a mlti-disciplinary open access archive for the deposit and dissemination of scientific research docments, whether they are pblished or not. The docments may come from teaching and research instittions in France or abroad, or from pblic or private research centers. L archive overte plridisciplinaire HAL, est destinée a dépôt et à la diffsion de docments scientifiqes de nivea recherche, pbliés o non, émanant des établissements d enseignement et de recherche français o étrangers, des laboratoires pblics o privés.

ON RELATIVE ERRORS OF FLOATING-POINT OPERATIONS: OPTIMAL BOUNDS AND APPLICATIONS CLAUDE-PIERRE JEANNEROD AND SIEGFRIED M. RUMP Abstract. Ronding error analyses of nmerical algorithms are most often carried ot via repeated applications of the so-called standard models of floating-point arithmetic. Given a rond-to-nearest fnction fl and barring nderflow and overflow, sch models bond the relative errors E 1 (t) = t fl(t) / t and E 2 (t) = t fl(t) / fl(t) by the nit rondoff. This paper investigates the possibility and the seflness of refining these bonds, both in the case of an arbitrary real t and in the case where t is the exact reslt of an arithmetic operation on some floating-point nmbers. We show that E 1 (t) and E 2 (t) are optimally bonded by /(1 + ) and, respectively, when t is real or, nder mild assmptions on the base and the precision, when t = x ± y or t = xy with x, y two floating-point nmbers. We prove that while this remains tre for division in base β > 2, smaller, attainable bonds can be derived for both division in base β = 2 and sqare root. This set of optimal bonds is then applied to the ronding error analysis of varios nmerical algorithms: in all cases, we obtain significantly shorter proofs of the best-known error bonds for sch algorithms, and/or improvements on these bonds themselves. 1. Introdction Given two integers β, p 2, let F be the associated set of floating-point nmbers having base β, precision p, and no restriction on the exponent range: F = {0} {M β e : M, e Z, β p 1 M < β p }. Let also fl : R F denote any rond-to-nearest fnction, sch that (1.1) t fl(t) = min t f, t R. f F In particlar, no specific tie-breaking strategy is assmed for the fnction fl. Two relative errors can then be defined, depending on whether the exact vale or the ronded vale is sed to divide the absolte error in (1.1): the error relative to t is E 1 (t) = while the error relative to fl(t) is E 2 (t) = t fl(t) t t fl(t) fl(t) if t 0, if fl(t) 0. (In each case the relative error may be defined to be zero if the denominator is zero: the exponent range being nbonded, fl(t) = 0 implies t = 0, so in both cases a zero denominator means that no error occrs when ronding t.) November 3, 2016. 2010 Mathematics Sbject Classification. Primary 65G50. 1

2 C.-P. JEANNEROD AND S. M. RUMP For ronding error analysis prposes, the most commonly sed bonds are E 1 (t) and E 2 (t), where = 1 2 β1 p is the nit rondoff associated with fl and F, and sch bonds are typically handled via the so-called standard models fl(t) = t(1 + δ 1 ) = t/(1 + δ 2 ), δ 1, δ 2 ; see [5, pp. 38 39]. It is also known that the bond on the first relative error can be refined slightly [12, p. 232], 1 so that (1.2) E 1 (t) and E 2 (t). 1 + These worst-case bonds hold for any real nmber t and any ronding fnction fl. Frthermore, they can be regarded as optimal in the sense that each of them is attained for some pair (t, fl) with t expressed in terms of β and p: since 1 + is exactly halfway between the two consective elements 1 and 1 + 2 of F, we have E 1 (1 + ) = /(1 + ) and, assming frther that fl ronds ties to even, E 2 (1 + ) =. In this paper, we investigate the possibility and the seflness of refining the bonds in (1.2) when t is not jst an arbitrary real nmber bt the exact reslt of an operation on some floating-point nmber(s), that is, when t = x op y, x, y F, op {+,,, /} : F F R as well as t = x. Or first contribtion is to establish optimal bonds on both E 1 and E 2 for each of these five basic operations, as shown in Table 1. Table 1. Optimal relative error bonds for varios inpts t. t bond on E 1 (t) bond on E 2 (t) real nmber x ± y xy x/y 1+ 1+ 1+ { 2 2 if β = 2, { 2 2 1+ 2 2 if β = 2, if β > 2 if β > 2 1+ x 1 1 1+2 1 + 2 1 As we shall see later in the paper, each of these bonds is attained for some explicit inpt vales in F and ronding fnctions fl, possibly nder some mild (necessary and sfficient) conditions on β and p. Specifically, for addition, sbtraction, and mltiplication the condition for optimality is that β is even, and in the case of mltiplication in base 2 it is that 2 p + 1 is not a Fermat prime. In most practical sitations sch conditions are satisfied and ths the general bonds in (1.2) remain 1 The refined bond E1 (t) /(1 + ) already appears in [18, p. 74] and, in some special cases, in [2] for β = 2 and [6] for β even.

OPTIMAL BOUNDS ON RELATIVE ERRORS, WITH APPLICATIONS 3 the best ones for floating-point addition, sbtraction, and mltiplication; as Table 1 shows, this is also the case of division in any base larger than 2. In contrast, for division in base 2 and for sqare root, the general bond /(1+) 2 on E 1 can be decreased frther to 2 2 and to 1 (1+2) 1/2 3 2 2, respectively; likewise, the general bond on E 2 can be decreased to ( 2 2 )/(1 + 2 2 ) 3 2 and (1 + 2) 1/2 1 1 2 2. Or second contribtion is to show that in the context of ronding error analysis of nmerical algorithms, applying these optimal bonds in a systematic way leads to simpler and sharper bonds, and/or to more direct proofs of existing ones. In particlar, this allows s to establish the following three new error bonds: (1.3) (1.4) For the smmation of n floating-point nmbers x 1,..., x n (sing n 1 floating-point additions and any ordering), we show that the reslting floating-point approximation ŝ satisfies ŝ n n 1 x i e i (n 1) 1 + n x i, where e i denotes the absolte error of the ith floating-point addition. For the smmation of n real nmbers x 1,..., x n, we show that by first ronding each x i into fl(x i ) and then smming the fl(x i ) in any order, the reslting approximation ŝ satisfies ŝ n x i n n 1 n d i + e i ζ n x i, ζ n < n, where the d i are given by d i = x i fl(x i ) and the e i are as before. For the Eclidean norm of a vector of n floating-point nmbers x 1,..., x n we show that smming the sqares in any order and then taking the sqare root of the reslt yields a floating-point nmber r sch that ( n ) 1/2 (1.5) r = x 2 i (1 + ɛ), ɛ (n/2 + 1). Note that each of these bonds holds withot any restriction on n. The bonds in (1.3) and (1.4) improve pon the best previos ones, from [10], in two ways: they are sharper and apply not only to the absolte error of ŝ bt also to the sm of the absolte local errors. Frthermore, we will see that the new bond in (1.3) implies the one in (1.4) almost immediately; in other words, sing (1.3) allows s to recover the constant n established in [10, Proposition 4.1] for sms of n reals (and ths n-dimensional inner prodcts as well), bt in a mch more direct way. Finally, the bond in (1.5) nicely replaces the expression (n/2 + 1) + O( 2 ) that wold reslt from sing the sboptimal bond E 1 ( x). Besides the evalation of sms and norms, and to illstrate frther the benefits of applying the refined bonds in Table 1, we provide for other typical examples of ronding error analysis. These examples deal with small arithmetic expressions, minors of tridiagonal matrices, Cholesky factorization, and complex floating-point mltiplication. We shall see that in each case the existing error bond can be either replaced by a simpler and sharper one, or recovered via a significantly shorter proof. Notation and assmptions. All or reslts hold nder the cstomary assmption that β, p 2. Frthermore, following [5] (and nless stated otherwise) we

4 C.-P. JEANNEROD AND S. M. RUMP shall ignore nderflow and overflow by assming that the exponent range of F is nbonded. Finally, the common tool sed to establish all the error bonds in Table 1 is the fnction fp : R F 0 from [16], called nit in the first place (fp) and defined as follows: fp(0) = 0 and, if t R\{0}, fp(t) is the largest integer power of β sch that fp(t) t. Hence, in particlar, for t nonzero, (1.6a) and for any t, (1.6b) fp(t) t < βfp(t), fp(t) fl(t) βfp(t). Otline. This paper is organized as follows. We begin in Section 2 by recalling how to derive the bonds in (1.2) and by completely characterizing their attainability. Section 3 then gives proofs for all the bonds annonced in Table 1 together with explicit expressions of inpt vales at which these bonds are attained. A first application of these reslts is described in Section 4, where we establish the new bonds for smmation shown in (1.3) and (1.4). We conclde in Section 5 with the derivation of the bond (1.5) together with the analysis of for other examples. 2. Optimal error bonds when ronding a real nmber Using the fnction fp, the bonds E i (t) for i = 1, 2 are easily derived as follows. First, recall from (1.6) that t = 0 belongs to the right-open interval [fp(t), β fp(t)), which contains (β 1)β p 1 eqally-spaced elements of F. The distance between two sch consective elements is ths β fp(t) fp(t) (β 1)β = 2 fp(t), p 1 and ronding to nearest implies that the absolte error is bonded as (2.1) t fl(t) fp(t). This bond is sharp and the vales at which it is attained are the midpoints of F, that is, the rational nmbers lying exactly halfway between two consective elements of F. Dividing both sides of (2.1) by either t or fl(t) gives (2.2) E 1 (t) fp(t) t and E 2 (t) fp(t) fl(t), and by sing the lower bonds in (1.6) we arrive at the classical bonds E 1 (t) and E 2 (t). As noted in [5, Theorem 2.2], we have in fact the strict ineqality E 1 (t) <, since t cannot be at the same time a midpoint and eqal to its fp; on the other hand, the derivation above shows that the bond on E 2 (t) is attained if and only if t is a midpoint sch that fl(t) = fp(t). Let s now refine the bond E 1 (t) <. As shown in [12, p. 232], all we need for this is a lower bond on t slightly sharper than the one in (1.6): by definition of ronding to nearest, t fl(t) t f for all f F, so taking in particlar f = sign(t)fp(t) gives t fl(t) t fp(t); in other words, (2.3) fp(t) + t fl(t) t and, becase of (2.1), eqality occrs if and only if t (1 + )fp(t). Ths, by applying (2.3) and then (2.1) to the definition of E 1, we find that E 1 (t) t fl(t) fp(t) + t fl(t) 1 +.

OPTIMAL BOUNDS ON RELATIVE ERRORS, WITH APPLICATIONS 5 Frthermore, de to the conditions of attainability of (2.1) and (2.3) given above, this bond on E 1 (t) is attained if and only if t is a midpoint sch that t (1 + )fp(t), that is, if and only if t = (1 + )fp(t). We smmarize or discssion in the theorem below. Althogh the bonds given there already appear in [18, p. 74] and [12, p. 232] for E 1, and in [5, Theorem 2.3] for E 2, the characterization of their attainability does not seem to have been reported elsewhere. Theorem 2.1. If t R\{0}, then E 1 (t) 1 + and E 2 (t). Frthermore, the bond on E 1 is attained if and only if t = (1 + )fp(t), and the bond on E 2 is attained if and only if t = (1 + )fp(t) and fl(t) = fp(t). 3. Optimal error bonds for floating-point operations We establish here optimal bonds on both E 1 and E 2 for the operations of addition, sbtraction, mltiplication, fsed mltiply-add, division, and sqare root. 3.1. Addition, sbtraction, and fsed mltiply-add. When t has the form t = x + y with x, y in F, we show in the theorem below that the general bonds E 1 (t) 1+ and E 2(t) given in (1.2) remain optimal nless the basis β is odd. This extends the analysis done by Holm [6], who considers only the first relative error and assmes implicitly that the basis is even. Theorem 3.1. Let β, p 2 and t = x + y with x, y F. The bonds in (1.2) are optimal if and only if β is even. Frthermore, when β is even, they are attained for (x, y) = (1, ) and ronding to nearest even. Proof. If β is odd, then = 1 2 β1 p = β 1 i=0 2 β i p with β 1 2 {1,..., β 1}, so and 1 + have infinite expansions in base β. Hence x + y cannot have the form ±(1 + )β e with e Z and ths, by Theorem 2.1, eqality never occrs in the bonds E 1 (x + y) /(1 + ) and E 2 (x + y). If β is even, then is in F. Conseqently, we can take (x, y) = (1, ), which gives E 1 (x + y) = /(1 + ) and, for ties ronded to even, E 2 (x + y) =. Since 1 + 2 belongs to F for any β and since 1 + = (1 + 2) = 1 1 +, the theorem above extends immediately to sbtraction as well as to higher-level operations encompassing addition, like the fsed mltiply-add (x, y, z) fl(xy +z). 3.2. Mltiplication. When t = xy with x, y F, the theorem below shows that the sitation is more sbtle than for addition: althogh the condition that β is even remains necessary for the optimality of the bonds E 1 (t) /(1 + ) and E 2 (t), it is sfficient only when β 2; when β = 2, optimality trns ot to be eqivalent to 2 p + 1 being composite. Theorem 3.2. Let β, p 2 and t = xy with x, y F. We then have the following, depending on the vale of the base β: If β = 2, then the bonds in (1.2) are optimal if and only if 2 p + 1 is not prime; frthermore, if D is a non-trivial divisor of 2 p +1, then these bonds are attained for ( ) (2 + 2)fp(D) D (x, y) =, D fp(d)

6 C.-P. JEANNEROD AND S. M. RUMP and ronding to nearest even. If β > 2, then the bonds in (1.2) are optimal if and only if β is even, and when the latter is tre these bonds are attained for and ronding to nearest even. (x, y) = (2 + 2, 2 1 ) Before proving this reslt, note that the attainability of the bond E 1 (xy) /(1 + ) has been observed in [6, p. 10] in the particlar case (β, p) = (2, 5), by taking x and y in the form shown above with D = 11. Proof. By Theorem 2.1, the optimality of the bonds in (1.2) when t = xy is eqivalent to the existence of a pair (x, y) F F sch that xy = (1 + )fp(xy). Assme first that β > 2. Similarly to Theorem 3.1, a necessary condition for optimality is that β be even. Frthermore, if β is even, then both 2 + 2 and 2 1 are in F, and since their prodct eqals 1+, it sffices to take (x, y) = (2+2, 2 1 ) to show that the bonds in (1.2) are optimal. Let s now consider the case β = 2. Since fp(t 2 e ) = fp(t) 2 e for all (t, e) R Z, we can assme with no loss of generality that 1 x, y < 2. This implies that fp(xy) {1, 2} and, since x, y {1, 1 + 2, 1 + 4,...} and > 0, that the prodct xy cannot be eqal to 1 +. Hence, optimality is eqivalent to the existence of x, y F [1, 2) sch that xy = 2 + 2, that is, eqivalent to the existence of integers X, Y sch that (3.1) XY = (2 p + 1) 2 p 1 and 2 p 1 X, Y < 2 p. If 2 p + 1 is prime, then either X or Y mst be larger than 2 p, so (3.1) has no soltion. If 2 p +1 is composite, one can constrct a soltion (X 0, Y 0 ) to (3.1) as follows. Let D denote a non-trivial divisor of 2 p +1, and let X 0 = 2p +1 D fp(d) and Y 0 = D 2p 1 fp(d). Clearly, X 0 is an integer and the prodct X 0 Y 0 has the desired shape. Ths, it remains to check that Y 0 Z and that both X 0 and Y 0 are in the range [2 p 1, 2 p ). Since 2 p + 1 is odd, D mst be odd too, which implies that (3.2) fp(d) + 1 D < 2fp(D) and D < 2 p. Conseqently, fp(d) 2 p 1, so that fp(d) divides 2 p 1 and Y 0 is an integer. Frthermore, (3.2) leads to 2 p 1 < 2p +1 2 < X 0 (2 p + 1)(1 1 D ) < 2p and 2 p 1 < Y 0 < 2 p, so that X 0 and Y 0 satisfy the range constraint in (3.1). Finally, mltiplying X 0 and Y 0 by 2 = 2 1 p gives x 0 = (2 + 2)fp(D)/D and y 0 = D/fp(D) in F [1, 2) and sch that x 0 y 0 = 2 + 2 = (1 + )fp(x 0 y 0 ). In the rest of this section, we show that the optimality condition 2 p + 1 is not prime arising in Theorem 3.2 for radix two is in fact satisfied in most practical sitations. First of all, if p is not a power of two, this condition is well known to hold [3, Exercise 18, 4], since p can be factored as p = mq with q odd and then (3.3) 2 p + 1 = (2 m + 1) ( 2 p m 2 p 2m + 2 m + 1 ). The binary basic formats specified by the IEEE 754-2008 standard being sch that p {24, 53, 113}, they fall into this category. More precisely, since here p is either odd or a mltiple of three, we dedce from (3.3) explicit divisors of 2 p + 1 and ths explicit pairs (x, y) F 2 for which the bonds in Theorem 3.2 are attained:

OPTIMAL BOUNDS ON RELATIVE ERRORS, WITH APPLICATIONS 7 If p is odd, then 3 divides 2 p + 1 and we can take (x, y) = ( 4+4 3, 3 2) ; If p 0 mod 3, then 2 p + 1 can be factored as (2 p/3 + 1)(2 2p/3 2 p/3 + 1), so we can take (x, y) = (2 2 1/3 + 2 2/3, 1 + 1/3 ). In fact, the sfficient condition p is not a power of two is satisfied not only by those basic formats bt also by all the binary interchange formats of IEEE 754-2008, for which either p {11, 24, 53, 113} or (3.4) p = k d + 13 with d = 4 log 2 k, k = 32j, j N 5, and where denotes ronding to a nearest integer. (A proof of the fact that (3.4) implies that p is not a power of two is deferred to Appendix A.) Assme now that p is a power of two. In this case 2 p + 1 is not prime if and only if it is not a prime nmber of the form F l = 2 2l + 1, called a Fermat prime. Crrently, the only known Fermat primes are F 0, F 1, F 2, F 3, F 4 and, on the other hand, F l is known to be composite for all 5 l 32; see for example [11, 17]. To smmarize, the only vales of p for which the optimality condition 2 p + 1 is not prime can fail to be satisfied are 2, 4, 8, 16 and p = 2 l 2 33 8.6 10 9, and none of these vales corresponds to an IEEE format. 3.3. Division. This section focses on the largest possible relative errors committed when ronding x/y with x, y nonzero elements of F. As the theorem below shows, the general bonds E 1 (t) 1+ and E 2(t) given in (1.2) can be refined frther for base 2, bt remain optimal for all other bases. Note that nlike for addition or mltiplication, optimality is achieved withot any extra assmption on the parity of β or the nmber theoretic properties of p. Theorem 3.3. Let β, p 2 and let x, y F be nonzero. Then { 2 2 if β = 2, E 1 (x/y) 1+ if β > 2, and { 2 2 E 2 (x/y) 1+ 2 if β = 2, 2 if β > 2. The bonds for β = 2 are attained at (x, y) = (1, 1 ) and, assming ties are ronded to even, the bonds for β > 2 are attained at (x, y) = (2 + 2, 2). Proof. When β > 2 the bonds are the general ones given in (1.2), and the fact they are attained for division follows immediately from 2 + 2 F. The rest of the proof is ths devoted to the case β = 2. Let t = x/y. Since t cannot be a midpoint [13, 9], we have fl( t) = fl(t) and fl(t 2 e ) = fl(t) 2 e, e Z, regardless of the tie-breaking rle of fl. Conseqently, we can assme 1 t < 2 and x, y > 0. When t = 1 both E 1 and E 2 are zero, so we are left with handling t sch that 1 < t < 2. The lower bond on t implies x > y, which for x and y in F is eqivalent to x y + 2 fp(y). Hence, sing y (2 2)fp(y), (3.5) t 1 + 2 fp(y) y 1 + 1 = 1 1.

8 C.-P. JEANNEROD AND S. M. RUMP Since 1/(1 ) is strictly larger than the midpoint 1 +, it follows that (3.6) fl(t) {1 + 2, 1 + 4,...}. Assme for now that p 3. (For simplicity, the case β = p = 2 is handled separately at the end of the proof.) If fl(t) 1+4 then t 1+3, so that E 1 (t) fp(t)/t /(1+3) < 2 2 for p 3. Similarly, E 2 (t) /(1 + 4) < ( 2 2 )/(1 + 2 2 ) for p 3. If fl(t) = 1 + 2 then, recalling (3.5) and the fact that t is not a midpoint, (3.7) 1 t < 1 + 3. 1 We now distingish between the following two sb-cases, depending on how t compares to fl(t) = 1 + 2: If t 1 + 2, then E 1 (t) = (1 + 2)/t 1 and E 2 (t) = 1 t/(1 + 2), so that the first ineqality in (3.7) gives immediately the desired bonds, which are attained only when t = 1/(1 ); since both 1 and 1 are in F when β = 2, this vale of t is obtained for (x, y) = (1, 1 ). If t > 1 + 2, then E 1 (t) = 1 (1 + 2)/t and E 2 (t) = t/(1 + 2) 1. The bond t < 1 + 3 from (3.7) then gives immediately E 1 (t) < /(1 + 3), which is less than 2 2 for p 3. For E 2 (t), however, sing t < 1 + 3 is not enogh and we show first how to replace this bond by the slightly sharper one (3.8) t 2 + 7 2 + = 1 + 3 3 2 2 + O( 3 ). The range of t implies y +2 fp(y) < x < y +6 fp(y). Note that y cannot be eqal to (2 2)fp(y), for otherwise 2fp(y) < x < (2+4)fp(y), ths contradicting the fact that x F. Therefore, y (2 4)fp(y) and x = y + 4 fp(y). Writing y = (1 + 2k)fp(y) with k a nonnegative integer, we dedce that t = 1 + 4 1 + 2k, so t < 1 + 3 is eqivalent to k > 2 p 1 /3, that is, k (2 p 1 + 1)/3 for k is an integer. Hence 2k (1 + 2)/3 and (3.8) follows. Recalling that E 2 (t) = t/(1 + 2) 1 and applying (3.8), we arrive at E 2 (t) 2(1 ) (2+)(1+2), which is less than 22 1+ 2 for p 3. 2 Assme now that p = 2. We have = 1/4 and, assming with no loss of generality that fp(x) = 1, we see that x {1, 3/2} and that y is either fp(y) or 3/2 fp(y). This yields for possibilities for t = x/y, all leading to E 1 (t) = E 2 (t) = 0 except for x = 1 and y = 3/2 fp(y). In this case, the constraint 1 < t < 2 implies frther y = 3/4 = 1. It follows that t = 1/(1 ), from which we dedce fl(t) = 1 + 2. The annonced bonds ths hold also in the case β = p = 2, and since 1 and 1 are in F, they are attained for (x, y) = (1, 1 ).

OPTIMAL BOUNDS ON RELATIVE ERRORS, WITH APPLICATIONS 9 3.4. Sqare root. Finally, we show how to refine frther the bonds E 1 (t) /(1 + ) and E 2 (t) in the special case where t = x for some positive floating-point nmber x, thereby establishing the optimal bonds in the last row of Table 1. This reslt is independent of any specific property of the base and the precision, and holds for any tie-breaking strategy. Theorem 3.4. Let β, p 2 and let x F be positive. Then E 1 ( 1 x) 1 and E 2 ( x) 1 + 2 1, 1 + 2 and these bonds are attained only for x = (1 + 2)β 2e with e Z. Proof. Let t = x. Writing x = µβ 2e with e Z and µ F [1, β 2 ), we see that t = µ β e with µ {1, 1 + 2, 1 + 4,...}. If µ = 1 then E 1 (t) = E 2 (t) = 0, so we are left with the following two cases. If µ = 1 + 2 then we dedce from 1 < 1 + 2 < 1 + that fl(t) = β e and, conseqently, that E 1 (t) = 1 1/ 1 + 2 and E 2 (t) = 1 + 2 1. If µ 1 + 4 then, recalling that µ < β 2, we have 1 + 4 β e t < β e+1 and fp(t) = β e. This implies that E 1 (t) fp(t)/t / 1 + 4 =: ϕ and it can be checked that ϕ < 1 1/ 1 + 2 for 1/2. Frthermore, sing 1 + 4 > 1 + gives fl(t) (1 + 2)β e and then E 2 (t) /(1 + 2) < 1 + 2 1. 4. Application to smmation 4.1. Sms of floating-point nmbers. Consider first the evalation of the sm of n floating-point nmbers. Here x 1,..., x n F are given and we assme that an approximation ŝ F to the exact sm s = x 1 + + x n is prodced after n 1 floating-point additions, sing any evalation order. 2 Each of these additions takes a pair of floating-point nmbers, say (a, b) F 2, and retrns fl(a + b), ths committing the local error e := a + b fl(a + b). When evalating the sm as above, n 1 sch errors can occr and, denoting their set as {e 1,..., e n 1 }, it is easy to see that (4.1) s = e 1 + + e n 1 + ŝ; see for example [5, p. 81]. The theorem below shows how to bond the sm of the e i and, therefore, ŝ s as well. This bond is slightly sharper than the one given in [10, Proposition 3.1] and can be established in the same way, sing /(1+) instead of to bond the relative ronding error committed by each floating-point addition. (For completeness a detailed proof is presented in Appendix B; note that as in [10] this bond holds even if nderflow occrs.) Theorem 4.1. For x 1,..., x n F, any order of evalation of the sm s = n x i prodces an approximation ŝ sch that the ronding errors e 1,..., e n 1 satisfy n 1 n ŝ s e i σ n 1 x i, σ n := n 1 +. 2 For example, for n = 4 possible evalation orders inclde ((x1 + x 2 ) + x 3 ) + x 4 and ((x 4 + x 3 ) + x 2 ) + x 1 and (x 1 + x 2 ) + (x 3 + x 4 ).

10 C.-P. JEANNEROD AND S. M. RUMP A direct conseqence of this reslt is a sharper and simpler bond for the following compensated smmation scheme (see [14] and the references therein): ŝ comp = fl(ŝ + ê), ê := a floating-point evalation of e 1 + + e n 1. Since each e i belongs to F and can be compted exactly sing for example Knth s TwoSm algorithm [12, p. 236], we can apply Theorem 4.1 twice and, writing e = n 1 e i, we dedce that n (4.2) ê e σ n 1 σ n 2 x i. Since s = ŝ + e, we have also (4.3) ŝ comp s fl(ŝ + ê) (ŝ + ê) + ê e ŝ + ê + ê e 1 + ( 1 + s + 1 + ) ê e. 1 + Then, combining (4.2) and (4.3) and sing the fact that (1 + 1+ )( 1+ )2 2 1+ 2 we arrive at (4.4) ŝ comp s 1 + s + (n 1)(n 2) 2 n 1 + 2 x i. The bond in (4.4) holds withot any restriction on n and regardless of the orderings sed for adding the x i and adding the e i. Frthermore, it is slightly sharper than the bond s + (n 1) 2 2 n (1 (n 1)) 2 x i given in [14, Proposition 4.5] in the special case of recrsive compensated smmation. 4.2. Sms of real nmbers. We now trn to the case where x 1,..., x n are in R instead of F. An approximation ŝ F to s = x 1 + + x n R is obtained in two steps, by first ronding each x i into fl(x i ) and then evalating fl(x 1 ) + + fl(x n ) as above. A typical example is the comptation of inner prodcts, where each x i is the exact prodct of two given floating-point nmbers. Writing d i = x i fl(x i ) for the errors de to ronding the data and, as before, e i for the errors de to the n 1 additions, we dedce that the exact and compted sms are now related as (4.5) s = d 1 + + d n + e 1 + + e n 1 + ŝ. By combining (1.2) and Theorem 4.1 we obtain almost immediately the following bond on the sm of the d i and the e i. Theorem 4.2. For x 1,..., x n R, any order of evalation of the sm n fl(x i) prodces an approximation ŝ to the sm s = n x i sch that the ronding errors d 1,..., d n and e 1,..., e n 1 satisfy n n 1 n (1 + 2)n 2 ŝ s d i + e i ζ n x i, ζ n := (1 + ) 2 < n. Proof. The lower bond follows from (4.5). To establish the pper bond, let v = /(1 + ). Theorem 4.1 gives i<n e i (n 1)v i n fl(x i) and, on the other hand, (1.2) implies that d i v x i and fl(x i ) (1 + v) x i. Hence

OPTIMAL BOUNDS ON RELATIVE ERRORS, WITH APPLICATIONS 11 i n d i + i<n e i (v + (n 1)v(1 + v)) i n x i, and it is easily checked that v +(n 1)v(1+v) simplifies to ((1+2)n 2 )/(1+) 2, which is less than n. Ths, the theorem above gives in particlar the bond n n ŝ x i ζn x i. This bond has a slightly smaller constant than the one given in [10, Proposition 4.1] and, perhaps more importantly, its proof is significantly shorter and avoids a tedios indction and fp-based case distinctions. Frthermore, the applications mentioned in [10] obviosly benefit directly from this new bond: for inner prodcts, matrix-vector prodcts, and matrix-matrix prodcts, the constants n obtained in [10, Theorem 4.2 and p. 343] can all be replaced by ζ n. 5. Other application examples 5.1. Example 1: Small arithmetic expressions. Ronding error analyses as those done in [5] typically involve bonds on θ n, where θ n is an expression of the form θ n = n (1 + δ i) 1 with δ i a relative error term associated with a single floating-point operation. Using the classical bond δ i it is easily checked that for all n, (5.1) n θ n (1 + ) n 1. Note that only the pper bond has the form n + O( 2 ), that is, contains terms nonlinear in. From (5.1) it follows that θ n (1 + ) n 1 and, assming n < 1, this bond itself is sally bonded above by the classical fraction γ n = n/(1 n). Using the refined bond E 1 (t) /(1 + ) from (1.2), we can replace δ i by δ i /(1 + ), which immediately leads to the refined enclosre (5.2) n 1 + θ n ( 1 + 1 + ) n 1. Althogh the pper bond in (5.2) still has O( 2 ) terms in general, it is bonded by n as long as n 3. Conseqently, by jst systematically sing E 1 (t) /(1 + ) instead of E 1 (t), we can replace γ n = n + O( 2 ) by n in every error analysis where θ n appears with n 3. For example, when evalating ab + cd in the sal way as r = fl(fl(ab) + fl(cd)), we have r = (ab(1 + δ 1 ) + cd(1 + δ 2 ))(1 + δ 3 ) with δ i /(1 + ). Hence r = ab(1 + θ 2 ) + cd(1 + θ 2), θ 2, θ 2 2, so the sal γ 2 is indeed replaced by 2. Note that once we have replaced E 1 (t) by E 1 (t) /(1+), we obtain that term 2 immediately, withot having to resort to a sophisticated fp-based argment as the one introdced in [1, pp. 1470 1471]. Similar examples inclde the evalation of small arithmetic expressions like the prodct x 1 x 2 x 3 x 4 (sing any parenthesization) or the sms ((x 1 +x 2 )+x 3 )+x 4 and ((x 1 +x 2 )+(x 3 +x 4 ))+((x 5 +x 6 )+(x 7 +x 8 )) (sing these specific parenthesizations); in each case the forward error bond classically involves γ 3, which we now replace by 3.

12 C.-P. JEANNEROD AND S. M. RUMP 5.2. Example 2: Leading principal minors of a tridiagonal matrix. Consider the tridiagonal matrix d 1 e 1 c 2 d 2 e 2 A =......... F n n,...... en 1 and let µ 1, µ 2,..., µ n be the seqence of its n leading principal minors. Writing µ 1 = 0 and µ 0 = 1, those minors are ths defined by the linear recrrence c n d n µ k = d k µ k 1 c k e k 1 µ k 2, 1 k n. Using the sal bond E 1 (t) and barring nderflow and overflow, Wilkinson shows in [19, 3] that the evalation of this recrrence prodces floating-point nmbers µ 1,..., µ n sch that µ k = d k (1 + ɛ k ) µ k 1 c k (1 + ɛ k)e k 1 (1 + ɛ k) µ k 2, where (1 ) 2 1 ɛ k (1 + ) 2 1 and (1 ) 3/2 1 ɛ k, ɛ k (1 + )3/2 1. In other words, the compted µ k are the leading principal minors of a nearby tridiagonal matrix A + A = [a ij (1 + δ ij )] that satisfies (5.3) 2 < δ ii 2 + 2 and 3 2 < δ ij 3 2 + O(2 ) if i j. Notice that the terms 2 and O( 2 ) come exclsively from the pper bonds on ɛ k, ɛ k, ɛ k. By sing the refined bond E 1(t) /(1 + ) from (1.2) instead of jst E 1 (t), these pper bonds are straightforwardly improved to ɛ k (1 + 1+ )2 1 < 2 and ɛ k, ɛ k (1 + 1+ )3/2 1 < 3 2. Conseqently, Wilkinson s bonds in (5.3) can be replaced by the following more concise and slightly sharper ones: δ ii < 2 and δ ij < 3 2 if i j. 5.3. Example 3: Eclidean norm of an n-dimensional vector. Given a vector [x 1,..., x n ] T F n, let its norm r = x 2 1 + + x2 n be evalated in floating point in the sal way: form the sqares fl(x 2 i ), sm them p in any order into ŝ, and retrn r = fl ( ŝ ). By applying the sal bond E 1 (t), all we can say is ŝ = ( n x2 i )(1 + θ n) with θ n as in (5.1), and r = ŝ (1 + δ) with δ. Conseqently, r = r(1 + ɛ), where ɛ = 1 + θ n (1 + δ) 1 satisfies (1 ) n/2+1 1 ɛ (1 + ) n/2+1 1. Althogh the lower bond has absolte vale at most (n/2+1) see for example [4, p. 42], the pper bond is strictly larger than this, so that (5.4) (n/2 + 1) ɛ (n/2 + 1) + O( 2 ). To avoid the O( 2 ) term above, we can se the refined bond E 1 (t) /(1 + ), which says δ /(1 + ), together with the improved bond for inner prodcts from [10], which says θ n n. Indeed, from these two bonds we dedce that ɛ

OPTIMAL BOUNDS ON RELATIVE ERRORS, WITH APPLICATIONS 13 is pper bonded by 1 + n (1 + /(1 + )) 1, and the latter qantity is easily checked to be at most (n/2 + 1). Ths, recalling the lower bond in (5.4), we conclde that (5.5) ɛ (n/2 + 1). In particlar, evalating the hypotense x 2 1 + x2 2 in floating-point prodces a relative error of at most 2; this improves over the classical bond 2 + O( 2 ), stated for example in [7, p. 225]. Of corse, the bond in (5.5) also applies when scaling by integer powers of the base is introdced to avoid nderflow and overflow. 5.4. Example 4: Cholesky factorization. We consider A F n n symmetric and its trianglarization in floating-point arithmetic sing the classical Cholesky algorithm. If the algorithm rns to completion, then by sing the bonds E i (t), i = 1, 2, the traditional ronding error analysis concldes that the compted factor R satisfies R T R = A + A with A γ n+1 R T R ; see for example [5, Theorem 10.3]. Here γ n+1 = (n+1) 1 (n+1) has the form (n + 1) + O( 2 ) and reqires n + 1 < 1. It was shown in [15] that both the qadratic term in and the restriction on n can be removed, reslting in the improved backward error bond A (n + 1) R T R. In the proof of [15, Theorem 4.4], one of the ingredients sed to sppress the O( 2 ) term is the following property: ( (5.6) a F 0 and b = fl ( a )) b 2 a 2b 2. In [15] it is shown that this property may not hold if only the bond E 2 (t) is assmed, and that in this case all we can say is (2 + 2 )b 2 b 2 a 2b 2. Frthermore, a proof of (5.6) is given, which is abot 10 lines long and based on a fp-based case analysis; see [15, p. 692]. Instead, or optimal bond on E 2 for sqare root provides a direct proof: Theorem 3.4 gives b(1 + δ) = a with δ 1 + 2 1; hence b 2 a = b 2 2δ + δ 2 and 2δ + δ 2 (2 + δ ) δ ( 1 + 2 + 1)( 1 + 2 1) = 2, from which (5.6) follows immediately. 5.5. Example 5: Complex mltiplication with an FMA. Given a, b, c, d F, consider the complex prodct z = (a + ib)(c + id). Varios approximations ẑ = R + i Î to z can be obtained, depending on how R = ac bd and I = ad + bc are evalated in floating-point. It was shown in [1] that the conventional way, which ses 4 mltiplications and 2 additions, gives ẑ = z(1 + ɛ) with ɛ C sch that ɛ < 5, and that the constant 5 is, at least in base 2, best possible. Assme now that an FMA is available, so that we compte, say, R = fl(ac fl(bd)) and Î = fl(ad + fl(bc)).

14 C.-P. JEANNEROD AND S. M. RUMP For this algorithm and its variants 3 it was shown in [8] that the bond 5 can be redced frther to 2, and that the latter is essentially optimal. The fact that 2 is an pper bond is established in [8, Theorem 3.1] with a rather long proof. As we shall see in the paragraph below, a mch more direct proof follows from simply applying the refined bond E 1 (t) /(1 + ) in a systematic way. Denoting by δ 1,..., δ 4 the for ronding errors involved, we have R = (ac bd(1 + δ 1 ))(1 + δ 2 ) = R + Rδ 2 bdδ 1 (1 + δ 2 ) and, similarly, Î = I + Iδ 4 + bcδ 3 (1 + δ 4 ). Now let λ, µ R 0 be sch that δ 2, δ 4 λ and δ 1 (1 + δ 2 ), δ 3 (1 + δ 4 ) µ. This implies that R R λ R + µ bd and I Î λ I + µ bc, from which we dedce (5.7) z ẑ 2 = (R R) 2 + (I Î)2 λ 2 z 2 + 2λµA + µ 2 B, where A = R bd + I bc and B = (bd) 2 + (bc) 2. It trns ot that (5.8) A, B z 2. For B, this bond simply follows from the eqality z 2 = (ac) 2 +(bd) 2 +(ad) 2 +(bc) 2. For A, define π = abcd and notice that A = π (bd) 2 + π + (bc) 2 is eqal to either B or ±(2π + (bc) 2 (bd) 2 ); frthermore, in the latter case we have A 2 π + (bc) 2 (bd) 2 min { (ac) 2 + (bd) 2, (ad) 2 + (bc) 2} + max { (bc) 2, (bd) 2} z 2. Ths, combining (5.7) and (5.8), z ẑ (λ + µ) z. Since the refined bond E 1 (t) /(1 + ) implies δ i /(1 + ) for all i, we can take λ = /(1 + ) and µ = /(1 + ) (1 + /(1 + )), which are both less than. Hence, barring nderflow and overflow and since z = 0 implies ẑ = 0, we conclde that ẑ = z(1 + ɛ), ɛ 2+32 (1+) 2 < 2. Note that 2+32 (1+) has the form 2 2 + O( 4 ) as 0. Ths, or approach 2 not only yields a shorter and more direct proof of the bond 2 of [8], bt it also improves on that bond. Appendix A. Proof that p as in (3.4) is not a power of two If j {5, 6, 7}, then p {144, 175, 206} and is not a power of two. Assme now that j 8. Writing d = 4m + i for integers m, i with 0 i 3, we have 4m + i 1 2 4 log 2 k 4m + i + 1 2. If i 0, this implies 2 m+1/8 k 2 m+7/8 and then 2 m+1/8 4m + 10 p 2 m+7/8 4m + 12. 3 There are three other ways to insert the innermost ronding fl, all giving the same error as the one developed here.

OPTIMAL BOUNDS ON RELATIVE ERRORS, WITH APPLICATIONS 15 Since the assmption j 8 implies k 2 8 and ths m 8, it follows that 2 m < p < 2 m+1. Conseqently, p cannot be a power of two when i 0. On the other hand, when i = 0, we see that p = 32j 4m + 13 mst be odd, and ths cannot be a power of two neither. Appendix B. Proof of Theorem 4.1 The lower bond follows from (4.1) and the triangle ineqality. For the pper bond, the proof is by indction on n, the case n = 1 being trivial (since then there is no ronding error at all). For n 2, we assme that the reslt is tre p to n 1, and we fix one evalation order in dimension n. The approximation ŝ obtained with this order has the form ŝ = fl(ŝ 1 + ŝ 2 ), where ŝ j is the reslt of a floating-point evalation of s j = i I j x i for j = 1, 2 and with {I 1, I 2 } a partition of the set I = {1, 2,..., n}. For j = 1, 2 let n j be the cardinality of I j and let e (1) j,..., e (nj 1) j be the ronding errors committed when evalating s j, so that s j = i<n j e (i) j + ŝ j. Conseqently, e i = δ + 1 + 2, δ = ŝ 1 + ŝ 2 fl(ŝ 1 + ŝ 2 ). i<n i<n i<n 1 e (i) i<n 2 e (i) Since 1 n j < n, the indctive assmption leads to e i δ + ) ((n 1 1) s 1 + (n 2 1) s 2, 1 + where s j = i I j x i for j = 1, 2. Since n = n 1 + n 2 and n x i = s 1 + s 2, it remains to check that (B.1) δ 1 + (n 2 s 1 + n 1 s 2 ). To do so, we note that (1.2) and [10, Lemma 2.2] imply that { } δ min ŝ 1, ŝ 2, 1 + ŝ 1 + ŝ 2, and we consider the following three cases. Assme first that s 2 /(1 + ) s 1. Then s 2 s 1 and, sing δ ŝ 2, we obtain δ ŝ 2 s 2 + s 2 i<n 2 e (i) 2 + s 2 (n 2 1) 1 + s 2 + 1 + s 1 n 2 1 + s 1 and (B.1) ths follows. Second, when s 1 /(1+) s 2 we proceed similarly, simply swapping the indices 1 and 2. Third, when /(1+) s 1 < s 2 and /(1+) s 2 < s 1, we have δ /(1 + ) ŝ 1 + ŝ 2 with ŝ 1 + ŝ 2 ŝ 1 s 1 + s 1 + ŝ 2 s 2 + s 2 and ŝ j s j (n j 1)/(1 + ) s j (n j 1) s k for (j, k) {(1, 2), (2, 1)}. Hence (B.1) follows in this third case as well, ths completing the proof of Theorem 4.1.

16 C.-P. JEANNEROD AND S. M. RUMP References [1] R. Brent, C. Percival, and P. Zimmermann, Error bonds on complex floating-point mltiplication, Math. Comp. 76 (2007), 1469 1481. [2] T. J. Dekker, A floating-point techniqe for extending the available precision, Nmer. Math. 18 (1971/72), 224 242. [3] R. L. Graham, D. E. Knth, and O. Patashnik, Concrete Mathematics: A Fondation for Compter Science, 2nd ed., Addison-Wesley, Reading, MA, 1994. [4] G. H. Hardy, J. E. Littlewood, and G. Pólya, Ineqalities, 2nd ed., Cambridge University Press, Cambridge, UK, 1952. [5] N. J. Higham, Accracy and Stability of Nmerical Algorithms, 2nd ed., Society for Indstrial and Applied Mathematics (SIAM), Philadelphia, PA, 2002. [6] J. E. Holm, Floating-Point Arithmetic and Program Correctness Proofs, Ph.D. thesis, Cornell University, Ithaca, NY, Agst 1980, pp. vii+133. [7] T. E. Hll, T. F. Fairgrieve, and P. T. P. Tang, Implementing complex elementary fnctions sing exception handling, ACM Trans. Math. Software 20 (1994), no. 2, 215 244. [8] C.-P. Jeannerod, P. Kornerp, N. Lovet, and J.-M. Mller, Error bonds on complex floating-point mltiplication with an FMA, Math. Comp. (pblished electronically), 2016. [9] C.-P. Jeannerod, N. Lovet, J.-M. Mller, and A. Panhalex, Midpoints and exact points of some algebraic fnctions in floating-point arithmetic, IEEE Trans. Compt. 60 (2011), no. 2, 228 241. [10] C.-P. Jeannerod and S. M. Rmp, Improved error bonds for inner prodcts in floating-point arithmetic, SIAM J. Matrix Anal. Appl. 34 (2013), no. 2, 338 344. [11] W. Keller, Prime factors k 2 n + 1 of Fermat nmbers F m and complete factoring stats, Agst 2016, web page available at http://www.prothsearch.net/fermat.html. [12] D. E. Knth, The Art of Compter Programming, Volme 2, Seminmerical Algorithms, 3rd ed., Addison-Wesley, Reading, MA, 1998. [13] P. W. Markstein, Comptation of elementary fnctions on the IBM RISC System/6000 processor, IBM Jornal of Research and Development 34 (1990), no. 1, 111 119. [14] T. Ogita, S. M. Rmp, and S. Oishi, Accrate sm and dot prodct, SIAM J. Sci. Compt. 26 (2005), no. 6, 1955 1988. [15] S. M. Rmp and C.-P. Jeannerod, Improved backward error bonds for LU and Cholesky factorizations, SIAM J. Matrix Anal. Appl. 35 (2014), no. 2, 684 698. [16] S. M. Rmp, T. Ogita, and S. Oishi, Accrate floating-point smmation part I: Faithfl ronding, SIAM J. Sci. Compt. 31 (2008), no. 1, 189 224. [17] N. J. A. Sloane and D. W. Wilson, Seqence A019434, The On-Line Encyclopedia of Integer Seqences, pblished electronically at http://oeis.org. [18] P. H. Sterbenz, Floating-Point Comptation, Prentice-Hall, 1974. [19] J. H. Wilkinson, Error analysis of floating-point comptation, Nmer. Math. 2 (1960), 319 340. Inria and Université de Lyon, laboratoire LIP (CNRS, ENS de Lyon, Inria, UCBL), 46 allée d Italie 69364 Lyon cedex 07, France E-mail address: clade-pierre.jeannerod@inria.fr Hambrg University of Technology, Schwarzenbergstraße 95, Hambrg 21071, Germany, and Faclty of Science and Engineering, Waseda University, 3-4-1 Okbo, Shinjkk, Tokyo 169-8555, Japan E-mail address: rmp@thh.de