ASAP 05
Table-Based Polynomials for Fast Hardware Function Evaluation
Jérémie Detrey, Florent de Dinechin
Projet Arénaire, LIP (UMR CNRS / ENS Lyon / UCB Lyon / INRIA 5668)
http://www.ens-lyon.fr/lip/arenaire/
Overview
- Context
- The HOTBM method
- Results
- Conclusion
Jérémie Detrey, Florent de Dinechin Table-Based Polynomials for Fast Hardware Function Evaluation 1 / 34
Context
Context: function evaluation
- fixed-point elementary functions: sin(x), cos(x), log(x), $e^x$, ...
- signal or image processing
- neural networks
- dedicated computations
- logarithmic number system: $\log_2(1 + 2^x)$ and $\log_2(1 - 2^x)$, ...
[Figure: operator taking X on $w_I$ bits and returning f(x) on $w_O$ bits]
- usually $w_I = w_O$ and $8 \le w_I, w_O \le 32$
Order 0: direct look-up table
- tabulate all the possible values: f(0), f(1), ..., $f(2^{w_I} - 2)$, $f(2^{w_I} - 1)$
[Figure: X ($w_I$ bits) addresses a single table whose output is f(x) on $w_O$ bits]
- very short critical path: only 1 table look-up
- huge look-up table: $w_O \cdot 2^{w_I}$ bits
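To make the table-size formula concrete, here is a minimal Python sketch of a direct look-up table. The function names `direct_table` and `table_size_bits` and the choice of $\log_2(1 + x)$ on [0, 1) are illustrative assumptions, not part of the slides.

```python
import math

def direct_table(f, w_in, w_out):
    """Tabulate f on [0, 1) for every w_in-bit input.

    Each entry is rounded to w_out fractional bits and stored as an integer.
    (Hypothetical helper, for illustration only.)
    """
    scale_out = 1 << w_out
    return [round(f(i / (1 << w_in)) * scale_out) for i in range(1 << w_in)]

def table_size_bits(w_in, w_out):
    """Storage cost of the direct table: w_out * 2**w_in bits."""
    return w_out * (1 << w_in)

# Example: an 8-bit in, 8-bit out table for log2(1 + x).
table = direct_table(lambda x: math.log2(1 + x), w_in=8, w_out=8)
```

With these definitions, `table_size_bits(16, 16)` is already 2^20 bits, and `table_size_bits(32, 32)` is 2^37 bits (16 GiB), which illustrates why direct tabulation stops being practical well before 32 bits.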
Order 1: lookup-multiply method
- piecewise linear approximation
[Figure: X ($w_I$ bits) is split into A and B; A addresses the tables $K_0(A)$ and $K_1(A)$, each $w_O + g$ bits wide; $K_1(A)$ is multiplied by B and added to $K_0(A)$; the result is f(x) on $w_O$ bits]
- smaller tables
- longer critical path: 1 table look-up, 1 multiplication and 1 addition
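The lookup-multiply scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coefficients are taken from the chord through the endpoints of each sub-interval (not the coefficients an actual design would use), the tables hold floats rather than $w_O + g$-bit words, and the names `build_linear_tables` and `eval_linear` are hypothetical.

```python
import math

def build_linear_tables(f, alpha):
    """One (K0, K1) pair per sub-interval, indexed by the alpha MSBs of X.

    Assumption: K0 is f at the left edge and K1 the chord slope.
    """
    n = 1 << alpha
    k0, k1 = [], []
    for a in range(n):
        lo, hi = a / n, (a + 1) / n
        k0.append(f(lo))
        k1.append((f(hi) - f(lo)) * n)  # slope over an interval of width 1/n
    return k0, k1

def eval_linear(x_int, w_in, alpha, k0, k1):
    """1 table look-up (two tables read in parallel), 1 multiply, 1 add."""
    beta = w_in - alpha
    a = x_int >> beta                 # A: the alpha most significant bits
    b = x_int & ((1 << beta) - 1)     # B: the beta least significant bits
    dx = b / (1 << w_in)              # offset of X inside the sub-interval
    return k0[a] + k1[a] * dx

f = lambda x: math.log2(1 + x)
k0, k1 = build_linear_tables(f, alpha=4)
max_err = max(abs(eval_linear(x, 12, 4, k0, k1) - f(x / 4096)) for x in range(4096))
```

Here two 16-entry tables replace the 4096-entry direct table, at the price of a multiplier and an adder on the critical path.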
Order 1: bipartite table method [2]
- tabulate the product in a table of offsets (TO)
[Figure: X ($w_I$ bits) is split into A and B; A addresses TIV(A); $A_0$ and B address $TO(A_0, B)$; both outputs are $w_O + g$ bits wide and are added to give f(x) on $w_O$ bits]
- shorter critical path: 1 table look-up and 1 addition
- slightly larger tables
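A rough Python sketch of the bipartite idea follows. It assumes that $A_0$ is the `alpha0` most significant bits of A, that the slope stored in TO is the derivative at the centre of the $A_0$ interval, and that tables hold floats; the helper names and the explicit derivative argument `df` are illustrative choices, not taken from the slides.

```python
import math

def build_bipartite(f, df, alpha, alpha0, beta):
    """Build TIV (indexed by A) and TO (indexed by A0 and B)."""
    w = alpha + beta
    mid = ((1 << beta) - 1) / (1 << (w + 1))      # centre of the B range
    tiv = [f(a / (1 << alpha) + mid) for a in range(1 << alpha)]
    to = []
    for a0 in range(1 << alpha0):
        slope = df((a0 + 0.5) / (1 << alpha0))    # slope at centre of the A0 interval
        to.append([slope * (b / (1 << w) - mid) for b in range(1 << beta)])
    return tiv, to

def eval_bipartite(x_int, alpha, alpha0, beta, tiv, to):
    """Two table look-ups in parallel, then a single addition (no multiplier)."""
    a = x_int >> beta
    b = x_int & ((1 << beta) - 1)
    a0 = a >> (alpha - alpha0)
    return tiv[a] + to[a0][b]

f = lambda x: math.log2(1 + x)
df = lambda x: 1.0 / ((1 + x) * math.log(2))
tiv, to = build_bipartite(f, df, alpha=8, alpha0=4, beta=4)
max_err = max(abs(eval_bipartite(x, 8, 4, 4, tiv, to) - f(x / 4096))
              for x in range(4096))
```

In this 12-bit example the two tables hold 256 entries each, compared with 4096 entries for direct tabulation.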
Order 1: multipartite table method [14,11,4]
- split the linear offset (TO) as a sum of several offsets ($TO_i$'s)
[Figure: X ($w_I$ bits) is split into A and B; A addresses TIV; B is split into $B_0$, $B_1$, $B_2$; each pair ($A_i$, $B_i$) addresses a table $TO_i$ placed between two XOR stages; the TIV and $TO_i$ outputs, each $w_O + g$ bits wide, are summed to give f(x) on $w_O$ bits]
- critical path: 2 XOR stages, 1 table look-up and $\log_2(n)$ additions
- much smaller tables, but an adder tree
Order 2: SMSO method [6]
- split the order-1 term as the sum of a small product and an offset
[Figure: X ($w_I$ bits) is split into A and B; A addresses TIV; $A_0$ addresses a table TS whose output is multiplied by $B_0$; ($A_1$, $B_1$) and ($A_2$, $B_2$) address the offset tables $TO_1$ and $TO_2$, with XOR stages; the outputs, $w_O + g_0$ or $w_O + g_1$ bits wide, are summed to give f(x) on $w_O$ bits]
- critical path: 1 table look-up, 1 rectangular multiplication and 2 additions
- needs a multiplier, but smaller tables
Higher order methods
- Horner evaluation
- interleaved memory interpolators: Lewis
- partial product arrays: Hassler and Takagi
- specialized squaring unit: Piñeiro, Bruguera and Muller
- this work
Objectives
- higher order approximation for larger precisions and smaller tables
- accurate error analysis to help the optimization of the hardware cost
- split large operators into smaller ones for architectural exploration
The HOTBM method (Higher-Order Table-Based Method)
Polynomial approximation
[Figure: plot of f(x) on [0, 1], with the input range cut into sub-intervals at 1/8, 2/8, ..., 7/8]
- input word decomposition: $X = A + 2^{-\alpha} B = .a_1 a_2 \cdots a_\alpha b_1 b_2 \cdots b_\beta$, where A holds the $\alpha$ most significant bits of the $w_I$-bit input and B the $\beta$ least significant bits
- piecewise order-n minimax polynomial approximation: $f(x) \approx P(A)(B) = \sum_{k=0}^{n} K_k(A) \cdot (2^{-\alpha} B)^k$
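The decomposition $X = A + 2^{-\alpha} B$ and the sum $\sum_k K_k(A) \cdot (2^{-\alpha} B)^k$ can be illustrated with the Python sketch below. Note the substitution: computing minimax coefficients needs a Remez-type algorithm, so this sketch uses Taylor coefficients of $e^x$ instead, which are easy to write down but less accurate than minimax. All function names are hypothetical.

```python
import math

def split_input(x_int, w_in, alpha):
    """Split the w_in-bit word X into A (alpha MSBs) and B (beta LSBs)."""
    beta = w_in - alpha
    return x_int >> beta, x_int & ((1 << beta) - 1)

def taylor_tables_exp(alpha, order):
    """K_k(A) = e^A / k!, a Taylor stand-in for the minimax coefficients."""
    return [[math.exp(a / (1 << alpha)) / math.factorial(k)
             for k in range(order + 1)]
            for a in range(1 << alpha)]

def eval_poly(x_int, w_in, alpha, tables):
    """Evaluate sum_k K_k(A) * (2^-alpha * B)^k for the given input word."""
    a, b = split_input(x_int, w_in, alpha)
    t = b / (1 << w_in)   # 2^-alpha * B, with B = b / 2^beta in [0, 1)
    return sum(coef * t ** k for k, coef in enumerate(tables[a]))

tables = taylor_tables_exp(alpha=4, order=2)
max_err = max(abs(eval_poly(x, 12, 4, tables) - math.exp(x / 4096))
              for x in range(4096))
```

With 16 sub-intervals and order 2, the Taylor remainder bounds the error by $e \cdot 2^{-12} / 6$; raising the order reduces the error without adding sub-intervals, which is the motivation for higher-order methods.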
Polynomial approximation: architecture
[Figure: X ($w_I$ bits) is split into A and B; the terms $K_0(A)$, $K_1(A) \cdot 2^{-\alpha} B$, $K_2(A) \cdot (2^{-\alpha} B)^2$, ..., $K_n(A) \cdot (2^{-\alpha} B)^n$ are each computed on $w_O + g$ bits, then summed to give f(x) on $w_O$ bits]
- architectural choices to implement each term $T_k(A, B) = K_k(A) \cdot (2^{-\alpha} B)^k$
Computing the terms: exploiting symmetry
- each term $T_k(A, B)$ is symmetric with respect to the middle of each sub-interval:
- when k is even, $T_k(A, -B) = T_k(A, B)$
- when k is odd, $T_k(A, -B) = -T_k(A, B)$
[Figure: for each parity, plot of $T_k(A, B)$ over one sub-interval, with B < 0 on the left half and B > 0 on the right half; the bit $b_1$ of B distinguishes the two halves]
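The symmetry can be checked with a small Python sketch. It assumes one common folding scheme, which the slides do not spell out: B is measured from the middle of the sub-interval, $b_1$ (the most significant bit of B) acts as a sign, and the remaining $\beta - 1$ bits are XORed with the inverted $b_1$ to index a half-size table. Exact fractions are used so the comparison is not blurred by rounding; all names are hypothetical.

```python
from fractions import Fraction

def centred_offset(b, beta):
    """Offset of the beta-bit word b from the middle of its range (exact)."""
    return Fraction(2 * b - ((1 << beta) - 1), 2)

def fold(b, beta):
    """Return (sign, index): the sign comes from b1, the index has beta-1 bits."""
    mask = (1 << (beta - 1)) - 1
    b1 = b >> (beta - 1)
    low = b & mask
    index = low if b1 else low ^ mask   # XOR the low bits with the inverted b1
    return (1 if b1 else -1), index

def half_power_table(k, beta):
    """Table of |offset|^k for the positive half only: 2^(beta-1) entries."""
    return [Fraction(2 * i + 1, 2) ** k for i in range(1 << (beta - 1))]

def power_term(b, k, beta, table):
    """Rebuild offset^k from the half table: negate only when k is odd."""
    sign, index = fold(b, beta)
    value = table[index]
    return sign * value if k % 2 else value
```

Each table therefore needs one address bit less, at the cost of a row of XOR gates on the input and, for odd k, on the output.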
Computing the terms: simple look-up table
- tabulate all the possible values
[Figure: A and B (with the bit $b_1$ handled separately) address a single table returning $T_k(A, B)$]
Computing the terms: power-and-multiply
- compute $S_k = B^k$ with a powering unit
- split $S_k$, which is $k(\beta - 1)$ bits wide, into several sub-words $S_{k,1}, \ldots, S_{k,m_k}$ of sizes $\sigma_{k,1}, \ldots, \sigma_{k,m_k}$
- compute the product $K_k(A) \times S_k$ as the sum of all the sub-products $K_k(A) \times S_{k,j}$:
- the most significant ones implemented as actual multipliers
- the least significant ones implemented as look-up tables
- exploit symmetry for each of those sub-products
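The splitting of the product into sub-products relies on a simple identity, checked here on integers in Python. The helper names and the example sizes are assumptions made for illustration; the sketch does not model which sub-products become multipliers and which become tables.

```python
def split_word(s, sizes):
    """Split integer s into sub-words, most significant first.

    sizes lists the width of each sub-word in bits; s must fit in sum(sizes)
    bits. Returns a list of (sub_word, shift) pairs, where shift is the
    weight of the sub-word's least significant bit.
    """
    parts = []
    shift = sum(sizes)
    for size in sizes:
        shift -= size
        parts.append(((s >> shift) & ((1 << size) - 1), shift))
    return parts

def product_by_parts(k_coef, s, sizes):
    """K * S computed as the sum of the shifted sub-products K * S_j."""
    return sum((k_coef * part) << shift for part, shift in split_word(s, sizes))
```

For instance, `split_word(0b101101110, [3, 3, 3])` yields the sub-words 5, 5 and 6 with shifts 6, 3 and 0, and `product_by_parts(173, 0b101101110, [3, 3, 3])` equals `173 * 0b101101110`. Each sub-product only involves a narrow operand, which is what allows a small multiplier or a small table per sub-product.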
Computing the terms: power-and-multiply
[Figure: B, with the bit $b_1$ driving XOR stages for symmetry, feeds a powering unit computing $B^k$, whose output is split into $S_{k,1}, S_{k,2}, \ldots, S_{k,m_k}$; A addresses $K_k(A)$; the sub-products $K_k(A) \times S_{k,1}$, ..., $K_k(A) \times S_{k,m_k}$ are computed separately]
Computing the terms: powering unit
- implemented as a look-up table
- implemented as a sum of partial products
[Figure: partial-product array computing $S_k$ from B]
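For k = 2, a powering unit built as a sum of partial products can be sketched with the textbook squarer identity $B^2 = \sum_i b_i 2^{2i} + \sum_{i<j} b_i b_j 2^{i+j+1}$. The slides do not give the exact arrangement of the array, so the Python below is only an illustration of the principle, with a hypothetical function name.

```python
def square_partial_products(b, nbits):
    """List the non-zero partial products of b*b as integers (powers of two).

    Uses b_i * b_i = b_i for the diagonal and merges the two symmetric
    cross products b_i * b_j into a single term of doubled weight.
    """
    bits = [(b >> i) & 1 for i in range(nbits)]
    products = []
    for i in range(nbits):
        if bits[i]:
            products.append(1 << (2 * i))          # diagonal term b_i
        for j in range(i + 1, nbits):
            if bits[i] & bits[j]:
                products.append(1 << (i + j + 1))  # 2 * b_i * b_j
    return products
```

Summing the list returns `b * b`. The array has about half as many terms as a general multiplier of the same width, which is one reason a dedicated powering unit can be cheaper than a multiplier.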
Degrading accuracy
[Figure: alignment of the terms contributing to f(x): $T_0(A) = K_0(A)$; $T_1(A, B) = K_1(A) \cdot 2^{-\alpha} B$, built from the sub-products $K_1(A) \times S_{1,1}$, $K_1(A) \times S_{1,2}$ and $K_1(A) \times S_{1,3}$, offset by $\alpha$ bits; $T_2(A, B) = K_2(A) \cdot (2^{-\alpha} B)^2$, built from $K_2(A) \times S_{2,1}$ and $K_2(A) \times S_{2,2}$, offset by $2\alpha$ bits]
- some of the terms are more accurate than others
- we can save area by using fewer bits to compute the most accurate tables
Degrading accuracy: global architecture
- each term $T_k$ is computed using only:
- $A_k$, the $\alpha_k$ most significant bits of A
- $B_k$, the $\beta_k$ most significant bits of B
[Figure: the $w_I$-bit input word, with $A_k$ ($\alpha_k$ bits) taken from A and $B_k$ ($\beta_k$ bits) taken from B]
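Extracting $A_k$ and $B_k$ amounts to dropping low-order bits of A and B, as in the following Python sketch. The function names are hypothetical, and the symmetry handling of $b_1$ is ignored here for simplicity.

```python
def term_inputs(x_int, w_in, alpha, alpha_k, beta_k):
    """Return (A_k, B_k): the alpha_k MSBs of A and the beta_k MSBs of B."""
    beta = w_in - alpha
    a = x_int >> beta                  # A: alpha most significant bits of X
    b = x_int & ((1 << beta) - 1)      # B: beta least significant bits of X
    return a >> (alpha - alpha_k), b >> (beta - beta_k)

def table_entries(alpha_k, beta_k):
    """Number of entries of a table addressed by A_k and B_k together."""
    return 1 << (alpha_k + beta_k)
```

For example, `term_inputs(0b110110101101, 12, 6, 4, 3)` returns `(0b1101, 0b101)`, and a table addressed by those 7 bits has 128 entries instead of the 4096 needed with the full A and B.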
Degrading accuracy: global architecture
[Figure: X ($w_I$ bits) is split into A and B; $T_0(A_0)$, $T_1(A_1, B_1)$, $T_2(A_2, B_2)$, ..., $T_n(A_n, B_n)$ are each computed on $w_O + g$ bits, then summed to give f(x) on $w_O$ bits]
Degrading accuracy: power-and-multiply terms
- only the $\lambda_k$ most significant bits of $B_k^k$ are used for $S_k$
- each sub-product $K_k(A_k) \times S_{k,j}$ is computed using only $A_{k,j}$, the $\alpha_{k,j}$ most significant bits of $A_k$
Degrading accuracy: power-and-multiply terms
[Figure: $B_k$, with the bit $b_1$ driving XOR stages, feeds the powering unit computing $B_k^k$, whose output is split into $S_{k,1}, S_{k,2}, \ldots, S_{k,m_k}$; each sub-word $A_{k,j}$ of $A_k$ addresses $K_k(A_{k,j})$, which is multiplied by $S_{k,j}$]
Degrading accuracy: ad-hoc powering units
- each ad-hoc powering unit is truncated to $\mu_k$ bits
[Figure: truncated partial-product array computing $S_k$ from $B_k$]
Error analysis
- every error entailed by the operator is accurately bounded:
- minimax error
- method errors
- rounding errors
- we can easily compute g, the number of guard bits required to ensure faithful rounding (last-bit accuracy)
- a trial-and-error method is then applied to decrease g
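The computation of g can be sketched in Python under a simplified error model, which is an assumption of this example rather than the exact analysis behind the slides: the total error must stay strictly below one unit in the last place ($2^{-w_O}$), each of the `n_terms` summed values is rounded to nearest on $w_O + g$ bits, and the final rounding to $w_O$ bits costs at most half a unit in the last place. The function name and parameters are hypothetical.

```python
def guard_bits(eps_approx, n_terms, w_out, g_max=64):
    """Smallest g such that, under the simplified model,

        eps_approx + n_terms * 2^-(w_out+g+1) + 2^-(w_out+1) < 2^-w_out.

    Returns None when no g up to g_max works, i.e. when the approximation
    and method error alone already uses up the available half ulp.
    """
    ulp = 2.0 ** -w_out
    for g in range(g_max + 1):
        rounding = n_terms * 2.0 ** -(w_out + g + 1)
        if eps_approx + rounding + ulp / 2 < ulp:
            return g
    return None
```

For example, `guard_bits(2**-18, 3, 16)` returns 3. Such a bound is a worst case, which is why a trial-and-error step can often decrease g further.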
Results
Results: area estimations for $\log_2(1 + x)$
[Figure: operator area (in slices, 0 to 3000; right axis: FPGA area ratio, 10% to 50%) versus input/output precision $w_I = w_O$ (12 to 32 bits), with curves for SMSO and for orders 2, 3 and 4]
- as expected, exponential growth
- order 2 up to 24 bits, order 3 up to 28 bits, order 4 up to 32 bits
Results: area estimations for sin x
[Figure: operator area (in slices, 0 to 3000; right axis: FPGA area ratio, 10% to 50%) versus input/output precision $w_I = w_O$ (12 to 32 bits), with curves for SMSO and for orders 2, 3 and 4]
Results: delay estimations for $\log_2(1 + x)$
[Figure: operator delay (in ns, 20 to 50) versus input/output precision $w_I = w_O$ (12 to 32 bits), with curves for SMSO and for orders 2, 3 and 4]
- latency increases with the order
Results: delay estimations for sin x
[Figure: operator delay (in ns, 20 to 50) versus input/output precision $w_I = w_O$ (12 to 32 bits), with curves for SMSO and for orders 2, 3 and 4]
Conclusion
Contribution
- a novel function approximation method:
- arbitrary order: smaller tables
- optimized powering units
- small multipliers: shorter critical path, and can benefit from recent FPGA technologies (Virtex-II)
- highly parameterizable design, adaptable to various metrics
- accurate approximation and rounding error analysis
- targeted to precisions up to 32 bits
Future work
- improve the parameter space exploration heuristic following user-specified criteria
- adapt this method to ASIC (different metric, architectural choices, ...)
- take advantage of the accurate error analysis method to finely tune the tables
- work in progress: library of parameterizable floating-point operators for elementary functions:
- logarithm
- exponential
Thank you for your attention
- more information: http://www.ens-lyon.fr/lip/arenaire/
- CVS repository: http://lipforge.ens-lyon.fr/www/hotbm/
Questions?