Automated design of floating-point logarithm functions on integer processors

Size: px

Start display at page:

Download "Automated design of floating-point logarithm functions on integer processors"

Theresa Singleton
5 years ago
Views:

1 23rd IEEE Symposium on Computer Arithmetic Santa Clara, CA, USA, July 2016 Automated design of floating-point logarithm functions on integer processors Guillaume Revy (presented by Florent de Dinechin) Univ. Perpignan Via Domitia, DALI project-team Univ. Montpellier 2, LIRMM, UMR 5506 CNRS, LIRMM, UMR 5506 G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 1/22

2 Summary Context and objectives Automated synthesis of floating-point function implementation particular case of logarithm functions log b (x) work done within the french ANR MetaLibm project ( FLIP software library low latency and correctly-rounded implementation VLIW integer processor of the ST200 family no exception handling Achievements 1. Unified range reduction for log b (x) implementation 2. Correctly-rounded log 2 (x), log(x), and log 10 (x) for the binary32 format 200 cycles on the ST231 G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 2/22

3 Problem statement b R > 1 (p,e min,e max ) {RN,RU,RD,RZ} G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 3/22

4 Problem statement x = m 2 e, x > 0 b R > 1 (p,e min,e max ) {RN,RU,RD,RZ} integer arithmetic r = (log b (x)) input/output: ±0, ±, NaN, subnormal and normal numbers G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 3/22

5 Problem statement x = m 2 e, x > 0 b R > 1 (p,e min,e max ) {RN,RU,RD,RZ} integer arithmetic r = (log b (x)) input/output: ±0, ±, NaN, subnormal and normal numbers Our focus: IEEE-754 binary32 implementations on the ST231 b {2,exp(1),10} 4-issue VLIW 32-bit integer processor (p,emin,e max ) = (24, 126,127) 1-cycle ALU / 3-cycle MUL = RN 64-bit arithmetic emulated in software no underflow nor overflow high latency memory accesses G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 3/22

6 Outline of the talk 1. Unified evaluation scheme for log b (x) 2. Error analysis 3. Implementation and results 4. Concluding remarks and future work G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 4/22

7 Unified evaluation scheme for log b (x) Outline of the talk 1. Unified evaluation scheme for log b (x) 2. Error analysis 3. Implementation and results 4. Concluding remarks and future work G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 5/22

8 Unified evaluation scheme for log b (x) Logarithm range reduction Let x be a non-negative normal floating-point number, with x 1 log b (x) = 2 d u RN(log b (x)) = ( 1) sign(u) 2 d RN( u ) with u [1,2) and d {e min,,e max } G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 6/22

9 Unified evaluation scheme for log b (x) Logarithm range reduction Let x be a non-negative normal floating-point number, with x 1 log b (x) = 2 d u RN(log b (x)) = ( 1) sign(u) 2 d RN( u ) with u [1,2) and d {e min,,e max } x = 2 e m log b (x) = log b (2) e + log b (m), with m [1,2] catastrophic cancellation when e = 1 G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 6/22

10 Unified evaluation scheme for log b (x) Logarithm range reduction Let x be a non-negative normal floating-point number, with x 1 log b (x) = 2 d u RN(log b (x)) = ( 1) sign(u) 2 d RN( u ) with u [1,2) and d {e min,,e max } x = 2 e m log b (x) = log b (2) e + log b (m), with m [1,2] catastrophic cancellation when e = 1 Range reduction: define τ = 1 if m 1.5, 0 otherwise log b (x) = log b (2) (e + τ) + log b (m/2 τ ), with m/2 τ [0.75,1.5] no branch instruction G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 6/22

11 Unified evaluation scheme for log b (x) Logarithm range reduction Let x be a non-negative normal floating-point number, with x 1 log b (x) = 2 d u RN(log b (x)) = ( 1) sign(u) 2 d RN( u ) with u [1,2) and d {e min,,e max } x = 2 e m log b (x) = log b (2) e + log b (m), with m [1,2] catastrophic cancellation when e = 1 Range reduction: define τ = 1 if m 1.5, 0 otherwise log b (x) = log b (2) (e + τ) + log b (m/2 τ ), with m/2 τ [0.75,1.5] }{{} u 2 d no branch instruction G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 6/22

12 Unified evaluation scheme for log b (x) Evaluation scheme: log b (2) (e + τ) Let t = m/2 τ 1 [ 0.25,0.5], the objective is to evaluate log b (x) = log b (2) (e + τ) + log b (1 + t) G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 7/22

13 Unified evaluation scheme for log b (x) Evaluation scheme: log b (2) (e + τ) Let t = m/2 τ 1 [ 0.25,0.5], the objective is to evaluate log b (x) = log b (2) (e + τ) + log b (1 + t) log b (2) known at compile time: for example log 10 (2) = log 10 (2) = G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 7/22

14 Unified evaluation scheme for log b (x) Evaluation scheme: log b (2) (e + τ) Let t = m/2 τ 1 [ 0.25,0.5], the objective is to evaluate log b (x) = log b (2) (e + τ) + log b (1 + t) log b (2) known at compile time: for example log 10 (2) = log 10 (2) = stored as a signed integer ( ) 2 on k bits, associated with an implicit scaling factor = 2 30 G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 7/22

15 Unified evaluation scheme for log b (x) Evaluation scheme: log b (2) (e + τ) Let t = m/2 τ 1 [ 0.25,0.5], the objective is to evaluate log b (x) = log b (2) (e + τ) + log b (1 + t) log b (2) known at compile time: for example log 10 (2) = log 10 (2) = stored as a signed integer ( ) 2 on k bits, associated with an implicit scaling factor = 2 30 Hence, rewrite log b (2) as log b (2) = 2 µ ϕ, with 1 ϕ < 2 scaling done statically at synthesis-time G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 7/22

16 Unified evaluation scheme for log b (x) Evaluation scheme: log b (2) (e + τ) Let t = m/2 τ 1 [ 0.25,0.5], the objective is to evaluate log b (x) = log b (2) (e + τ) + log b (1 + t) log b (2) known at compile time: for example log 10 (2) = log 10 (2) = stored as a signed integer ( ) 2 on k bits, associated with an implicit scaling factor = 2 30 Hence, rewrite log b (2) as log b (2) = 2 µ ϕ, with 1 ϕ < 2 scaling done statically at synthesis-time The formula becomes log b (x) = 2 µ (ϕ (e + τ) + log b (1 + t)/2 µ) G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 7/22

17 Unified evaluation scheme for log b (x) Evaluation scheme: log b (m/2 τ ) Table lookup-based method Tang (1990) and Gal (1991) tabulated values of log(x) combined with small degree polynomial evaluation requires storing tabulated values on memory Polynomial evaluation-based method Schulte and Swartzlander (1993) univariate polynomial to compte log 2 (x) our approach: particular bivariate polynomial G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 8/22

18 Unified evaluation scheme for log b (x) Polynomial approximation The formula so far log b (x) = 2 µ (ϕ (e + τ) + log b (1 + t)/2 µ) G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 9/22

19 Unified evaluation scheme for log b (x) Polynomial approximation The formula so far log b (x) = 2 µ (ϕ (e + τ) + log b (1 + t)/2 µ) For example, when (b,µ) = (10, 2) log 10 (1 + ( ))/2 µ = log 10 ( )/2 µ = G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 9/22

20 Unified evaluation scheme for log b (x) Polynomial approximation The formula so far log b (x) = 2 µ (ϕ (e + τ) + log b (1 + t)/2 µ) For example, when (b,µ) = (10, 2) log 10 (1 + ( ))/2 µ = log 10 ( )/2 µ = when e = 0, we need to compute log 10 (1 + t)/2 µ on a large number of bits to get sufficient accuracy after renormalization G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 9/22

21 Unified evaluation scheme for log b (x) Polynomial approximation The formula so far log b (x) = 2 µ (ϕ (e + τ) + log b (1 + t)/2 µ) For example, when (b,µ) = (10, 2) log 10 (1 + ( ))/2 µ = log 10 ( )/2 µ = when e = 0, we need to compute log 10 (1 + t)/2 µ on a large number of bits to get sufficient accuracy after renormalization Hence rewrite t = 2 δ h, where h 0.5 (δ is the leading zero count on t, and h is t shifted by δ bits ) log b (1 + t) 2 µ = 2 δ h log b(1 + t) 2 µ t with log b (1 + t) 2 µ (1,4) t scaling done dynamically at evaluation-time G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 9/22

22 Unified evaluation scheme for log b (x) Evaluation scheme Objective: compute RN(log b (x)), with ( log b (x) = 2 µ ϕ (e + τ) + 2 δ h log b(1 + t) ) 2 µ t G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 10/22

23 Unified evaluation scheme for log b (x) Evaluation scheme Objective: compute RN(log b (x)), with ( log b (x) = 2 µ ϕ (e + τ) + 2 δ h log b(1 + t) ) 2 µ t Main step and d = c + µ ϕ (e + τ) + 2 δ h log b(1 + t) 2 µ t }{{} u 2 c G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 10/22

24 Unified evaluation scheme for log b (x) Evaluation scheme Objective: compute RN(log b (x)), with ( log b (x) = 2 µ ϕ (e + τ) + 2 δ h log b(1 + t) ) 2 µ t Main step and d = c + µ ϕ (e + τ) + 2 δ h log b(1 + t) 2 µ t }{{} u 2 c The value u is approximated by a finite-precision value û, such that u û < ε. G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 10/22

25 Error analysis Outline of the talk 1. Unified evaluation scheme for log b (x) 2. Error analysis 3. Implementation and results 4. Concluding remarks and future work G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 11/22

26 Error analysis Three sources of error Objective: approximate u by a finite precision value û ( u = 2 c ϕ (e + τ) + 2 δ h log b(1 + t) ) 2 µ t G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 12/22

27 Error analysis Three sources of error Objective: approximate u by a finite precision value û ( u = 2 c ϕ (e + τ) + 2 δ h log b(1 + t) ) 2 µ t ϕ ˆϕ 2 2 k ( 2 c ˆϕ (e + τ) + 2 δ h log b(1 + t) ) 2 µ t G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 12/22

28 Error analysis Three sources of error Objective: approximate u by a finite precision value û ( u = 2 c ϕ (e + τ) + 2 δ h log b(1 + t) ) 2 µ t ϕ ˆϕ 2 2 k ( 2 c ˆϕ (e + τ) + 2 δ h log b(1 + t) ) 2 µ t α(a) ( ) 2 c ˆϕ (e + τ) + 2 δ h a(t) G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 12/22

29 Error analysis Three sources of error Objective: approximate u by a finite precision value û ( u = 2 c ϕ (e + τ) + 2 δ h log b(1 + t) ) 2 µ t ϕ ˆϕ 2 2 k ( 2 c ˆϕ (e + τ) + 2 δ h log b(1 + t) ) 2 µ t α(a) ( 2 c ( û = 2 c ) ˆϕ (e + τ) + 2 δ h a(t) ρ(p ) ˆϕ (e + τ) 2 δ h a(t) }{{} bivariate polynomial ) G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 12/22

30 Error analysis Sufficient error bounds By triangular inequality u û 2 6 k + (2 2 3 p ) α(a) + ρ(p ) To compute û such that u û < ε, we have to ensure 2 6 k + (2 2 3 p ) α(a) + ρ(p ) < ε G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 13/22

31 Error analysis Sufficient error bounds By triangular inequality u û 2 6 k + (2 2 3 p ) α(a) + ρ(p ) To compute û such that u û < ε, we have to ensure 2 6 k + (2 2 3 p ) α(a) + ρ(p ) < ε More particularly, we have to compute a polynomial approximant a(t) such that α(a) < ε 26 k p G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 13/22

32 Error analysis Sufficient error bounds By triangular inequality u û 2 6 k + (2 2 3 p ) α(a) + ρ(p ) To compute û such that u û < ε, we have to ensure 2 6 k + (2 2 3 p ) α(a) + ρ(p ) < ε More particularly, we have to compute a polynomial approximant a(t) such that α(a) θ with θ < ε 26 k p G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 13/22

33 Error analysis Sufficient error bounds By triangular inequality u û 2 6 k + (2 2 3 p ) α(a) + ρ(p ) To compute û such that u û < ε, we have to ensure 2 6 k + (2 2 3 p ) α(a) + ρ(p ) < ε More particularly, we have to compute a polynomial approximant a(t) such that α(a) θ with θ < ε 26 k p And finally, we have to find a finite-precision evaluation program P ρ(p ) < ε 2 6 k (2 2 3 p ) θ G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 13/22

34 Error analysis Computation of the required error bound Objective: compute û such that u û < ε to ensure correct rounding Table s Maker Dilemma exhaustive searching feasible for the binary32 floating-point format (with p = 24) G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 14/22

35 Error analysis Computation of the required error bound Objective: compute û such that u û < ε to ensure correct rounding Table s Maker Dilemma exhaustive searching feasible for the binary32 floating-point format (with p = 24) Exhaustive search smallest distance between log b (x ) and the middle of two floating-point numbers in precision p, for all floating-point inputs x breakpoints in precision p d ˆv v log b (x ) G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 14/22

36 Error analysis Computation of the required error bound Objective: compute û such that u û < ε to ensure correct rounding Table s Maker Dilemma exhaustive searching feasible for the binary32 floating-point format (with p = 24) Exhaustive search smallest distance between log b (x ) and the middle of two floating-point numbers in precision p, for all floating-point inputs x b 2 exp(1) 10 error bound: log 2 (ε) for example: x = log(x) = }{{} 5 58 bits G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 14/22

37 Implementation and results Outline of the talk 1. Unified evaluation scheme for log b (x) 2. Error analysis 3. Implementation and results 4. Concluding remarks and future work G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 15/22

38 Implementation and results Determining word-length k and polynomial degree Computation of a polynomial approximant a(t), such that α(a) < ε 26 k ε 26 k =, since p = p The word-length k must be chosen such that this bound remains non-negative b 2 exp(1) 10 error bound: log 2 (ε) value of k G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 16/22

39 Implementation and results Determining word-length k and polynomial degree Computation of a polynomial approximant a(t), such that α(a) < ε 26 k ε 26 k =, since p = p The word-length k must be chosen such that this bound remains non-negative b 2 exp(1) 10 error bound: log 2 (ε) value of k For log(x), only three inputs required more accuracy than 2 56 we first ignore these inputs at synthesis-time then we check if the implementation returns the correct answer for these inputs interest: reducing the word-length k to 64 G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 16/22

40 Implementation and results Design space exploration: polynomial evaluator Given ε, the synthesis process is the following 1. determine the smallest degree of a(t) so as the error bound is satisfied b 2 exp(1) 10 error bound: log 2 (ε) smallest degree of a (22) build polynomial approximant and compute the approximation error bound 3. build evaluation program (CGPE), and check its evaluation error (Gappa) explore Horner, Estrin, and other parallel schemes if no program can be found, increase the polynomial degree G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 17/22

41 Implementation and results Correctly-rounded binary32 implementations Latency (# cycles) Horner Horner-2 Horner log 2 (x) log(x) log 10 (x) G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 18/22

42 Implementation and results Faithful binary32/64 implementations Latency (# cycles) Horner Horner-2 Horner-4 Estrin Estrin rule exposes much more ILP than the other schemes register spilling in binary log 2 (x) log(x) log 10 (x) log 2 (x) log(x) log 10 (x) binary32 binary64 G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 19/22

43 Implementation and results Faithful binary32/64 implementations 5 4 IPC (# instructions per cycle) Horner Horner-2 Horner-4 Estrin Estrin rule exposes much more ILP than the other schemes register spilling in binary64 0 log 2 (x) log(x) log 10 (x) log 2 (x) log(x) log 10 (x) binary32 binary64 G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 19/22

44 Concluding remarks and future work Outline of the talk 1. Unified evaluation scheme for log b (x) 2. Error analysis 3. Implementation and results 4. Concluding remarks and future work G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 20/22

45 Concluding remarks and future work Conclusion remarks and future work Work done so far Automated design of floating-point implementation of logarithm functions unified range reduction polynomial evaluation-based algorithm Correctly-rounded binary32 implementation in 200 cycles on the ST231 FDLibm over SoftFloat was cycles FDLibm over FLIP was about 1000 cycles Extension to faithful binary32/64 implementations Future work is threefold Evaluate the performance of this approach on other architectures Extend this approach to other transcendental functions, like exp b (x) Study the impact on performances of the polynomial evaluation scheme G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 21/22

46 23rd IEEE Symposium on Computer Arithmetic Santa Clara, CA, USA, July 2016 Automated design of floating-point logarithm functions on integer processors Guillaume Revy (presented by Florent de Dinechin) Univ. Perpignan Via Domitia, DALI project-team Univ. Montpellier 2, LIRMM, UMR 5506 CNRS, LIRMM, UMR 5506 G. Revy (DALI UPVD/LIRMM, UM2, CNRS) Automated design of floating-point logarithm functions on integer processors 22/22

Automated design of floating-point logarithm functions on integer processors

Automated design of floating-point logarithm functions on integer processors Guillaume Revy To cite this version: Guillaume Revy. Automated design of floating-point logarithm functions on integer processors.