DIGIT-SERIAL ARITHMETIC

Similar documents
DIVISION BY DIGIT RECURRENCE

Chapter 5: Solutions to Exercises

Chapter 6: Solutions to Exercises

VLSI Arithmetic. Lecture 9: Carry-Save and Multi-Operand Addition. Prof. Vojin G. Oklobdzija University of California

Chapter 1: Solutions to Exercises

Design and FPGA Implementation of Radix-10 Algorithm for Division with Limited Precision Primitives

Chapter 5 Arithmetic Circuits

A Hardware-Oriented Method for Evaluating Complex Polynomials

Computer Architecture 10. Fast Adders

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017

Part II Addition / Subtraction

Lecture 8: Sequential Multipliers

ARITHMETIC COMBINATIONAL MODULES AND NETWORKS

EECS150 - Digital Design Lecture 24 - Arithmetic Blocks, Part 2 + Shifters

Part VI Function Evaluation

Chapter 2 Basic Arithmetic Circuits

Adders, subtractors comparators, multipliers and other ALU elements

Part II Addition / Subtraction

LOGIC CIRCUITS. Basic Experiment and Design of Electronics

Complement Arithmetic

Chapter 8: Solutions to Exercises

CMP 334: Seventh Class

Adders, subtractors comparators, multipliers and other ALU elements

Dual-Field Arithmetic Unit for GF(p) and GF(2 m ) *

LOGIC CIRCUITS. Basic Experiment and Design of Electronics. Ho Kyung Kim, Ph.D.

Cost/Performance Tradeoff of n-select Square Root Implementations

Logic and Computer Design Fundamentals. Chapter 5 Arithmetic Functions and Circuits

A VLSI Algorithm for Modular Multiplication/Division

Binary Multipliers. Reading: Study Chapter 3. The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding

Fundamentals of Digital Design

Carry Look Ahead Adders

CSE 140 Spring 2017: Final Solutions (Total 50 Points)

Design of Sequential Circuits

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

What s the Deal? MULTIPLICATION. Time to multiply

Review for Final Exam

Lecture 11. Advanced Dividers

Multiplication Ivor Page 1

Hardware Design I Chap. 4 Representative combinational logic

Lecture 8. Sequential Multipliers

Integer Multipliers 1

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Unit 7 Sequential Circuits (Flip Flop, Registers)

Design and Implementation of a Radix-4 Complex Division Unit with Prescaling

EECS Components and Design Techniques for Digital Systems. FSMs 9/11/2007

We are here. Assembly Language. Processors Arithmetic Logic Units. Finite State Machines. Circuits Gates. Transistors

Proposal to Improve Data Format Conversions for a Hybrid Number System Processor

ALU (3) - Division Algorithms

CS 140 Lecture 14 Standard Combinational Modules

Tunable Floating-Point for Energy Efficient Accelerators

Low Power and Low Complexity Shift-and-Add Based Computations

Digital Logic: Boolean Algebra and Gates. Textbook Chapter 3

Tree and Array Multipliers Ivor Page 1

Menu. 7-Segment LED. Misc. 7-Segment LED MSI Components >MUX >Adders Memory Devices >D-FF, RAM, ROM Computer/Microprocessor >GCPU

Cost/Performance Tradeoffs:

Analysis and Synthesis of Weighted-Sum Functions


COE 202: Digital Logic Design Sequential Circuits Part 4. Dr. Ahmad Almulhem ahmadsm AT kfupm Phone: Office:

The goal differs from prime factorization. Prime factorization would initialize all divisors to be prime numbers instead of integers*

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Logic Design II (17.342) Spring Lecture Outline

Proposal to Improve Data Format Conversions for a Hybrid Number System Processor

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman.

Residue Number Systems Ivor Page 1

ELEN Electronique numérique

Linear Feedback Shift Registers (LFSRs) 4-bit LFSR

EECS150 - Digital Design Lecture 21 - Design Blocks

A Suggestion for a Fast Residue Multiplier for a Family of Moduli of the Form (2 n (2 p ± 1))


CPE100: Digital Logic Design I

Digital Logic. CS211 Computer Architecture. l Topics. l Transistors (Design & Types) l Logic Gates. l Combinational Circuits.

Radix-4 Vectoring CORDIC Algorithm and Architectures. July 1998 Technical Report No: UMA-DAC-98/20

CprE 281: Digital Logic

Arithme(c logic units and memory

Unit II Chapter 4:- Digital Logic Contents 4.1 Introduction... 4

Spiral 2-1. Datapath Components: Counters Adders Design Example: Crosswalk Controller

Low Power Division and Square Root

ECE380 Digital Logic. Positional representation

SIR C.R.REDDY COLLEGE OF ENGINEERING ELURU DIGITAL INTEGRATED CIRCUITS (DIC) LABORATORY MANUAL III / IV B.E. (ECE) : I - SEMESTER

CPS 104 Computer Organization and Programming Lecture 11: Gates, Buses, Latches. Robert Wagner

EXPERIMENT Bit Binary Sequential Multiplier

ALU A functional unit

Number representation

A 32-bit Decimal Floating-Point Logarithmic Converter

IT T35 Digital system desigm y - ii /s - iii

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research)

DIVIDER IMPLEMENTATION

Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System

ELEC516 Digital VLSI System Design and Design Automation (spring, 2010) Assignment 4 Reference solution

1 Short adders. t total_ripple8 = t first + 6*t middle + t last = 4t p + 6*2t p + 2t p = 18t p

Lab 3 Revisited. Zener diodes IAP 2008 Lecture 4 1

CprE 281: Digital Logic

Issues on Timing and Clocking

Parallel Multipliers. Dr. Shoab Khan

COMPUTERS ORGANIZATION 2ND YEAR COMPUTE SCIENCE MANAGEMENT ENGINEERING UNIT 3 - ARITMETHIC-LOGIC UNIT JOSÉ GARCÍA RODRÍGUEZ JOSÉ ANTONIO SERRA PÉREZ

University of Toronto Faculty of Applied Science and Engineering Edward S. Rogers Sr. Department of Electrical and Computer Engineering

Vidyalankar S.E. Sem. III [CMPN] Digital Logic Design and Analysis Prelim Question Paper Solution

King Fahd University of Petroleum and Minerals College of Computer Science and Engineering Computer Engineering Department

Combinatorial Logic Design Multiplexers and ALUs CS 64: Computer Organization and Design Logic Lecture #13

Transcription:

DIGIT-SERIAL ARITHMETIC 1 Modes of operation:lsdf and MSDF Algorithm and implementation models LSDF arithmetic MSDF: Online arithmetic

TIMING PARAMETERS 2 radix-r number system: conventional and redundant Serial signal: a numerical input or output with one digit per clock cycle For an n digit signal, the clock cycles are numbered from 1 to n Timing characteristics of a serial operation determined by two parameters: initial delay δ: additional number of operand digits required to determine the first result digit execution time T n ; the time between the first digit of the operand and the last digit of the result T n = δ + n + 1

3 cycle: 0 1 2 3 4 5 6 7 8 9 10 11 12 input compute output δ = 0 T 12 = 1 + 12 (a) cycle: -3-2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 input compute output δ =3 T 12 = 3+1+12 (b) Figure 9.1: Timing characteristics of serial operation with n = 12. (a) With δ = 0. (b) With δ = 3.

LSDF AND MSDF 4 1. Least-significant digit first (LSDF) mode (right-to-left mode) x = n 1 x ir i i=0 2. Most-significant digit first (MSDF) mode (left-to-right mode) also known as online arithmetic initial delay called online delay x = n i=1 x ir i

ALGORITHM MODEL 5 Operands x and y, result z: n radix-r digits In cycle j the result digit z j+1 is computed Cycles labeled δ,..., 0, 1,..., n In cycle j receive the operand digits x j+1+δ and y j+1+δ, and output z j x[j], y[j] and z[j] are the numerical values of the corresponding signals when their representation consists of the first j + δ digits for the operands, and j digits for the result. In iteration j x[j + 1] = (x[j], x j+1+δ ) y[j + 1] = (y[j], y j+1+δ ) z j+1 = F(w[j], x[j], x j+1+δ, y[j], y j+1+δ, z[j]) z[j + 1] = (z[j, ], z j+1 ) w[j + 1] = G(w[j], x[j], x j+1+δ, y[j], y j+1+δ, z[j], z j+1 )

Cycle j j+1 6 Input x j+1+δ y j+1+δ x[j+1] x j+2+δ y j+2+δ x[j+2] Compute z j+1 z j+2 Output z j z j+1 (a) x j+1+δ y j+1+δ X; Y x[j] y[j] z j+1 F; G z z j j+1 Z z[j] w[j+1] W w[j] (residual) z j (b) digit-serial digit-parallel

7 Table 9.1: Initial delay (δ) Operation LSDF MSDF Addition 0 2 (r = 2) 1 (r 4) Multiplication 0 3 (r = 2) 2 (r = 4) (only MS half) n Division 2n 4 Square root 2n 4 Max/min n 0

8 a) LSDF mode n-digit addition: Cycle: 0 1 2... LSD MSD Inputs: x x x x x x x x x Output: x x x x x x x x x n by n --> 2n multiplication: Inputs: Output: LSD MSD x x x x x x x x x x x x x x x x x x x x x x x x x x x ----------------- MS half b) MSDF mode Cycle: -2-1 0 1 2... n-digit operation: MSD LSD Inputs: x x x x x x x x x Output: --- online delay = 2 x x x x x x x x x Figure 9.3: LSDF and MSDF modes.

COMPOSITE ALGORITHM: EXAMPLE 9 Givens method for matrix triangularization Rotation factors: c = The online delay of the network x x2 + y 2 s = y x2 + y 2 rot = δ1 + δ2 + δ3 + δ4 = 13 Execution time (latency): D rot = rot + n + 4

10 x h y h Operation: δ1 (on-line delay) OLSQ a i i = h-δ1 1 OLSQ b i δ1 Squaring δ2 OLADD Addition (p-bit shift register) f k k=i δ2 1 =p =p δ3 OLSQRT Square root δ4 OLDIV g p p=k δ3 1 OLDIV δ4 Division c q q=p δ4 1 s q (a) x,y a,b f g c, s δ1 δ2 δ3 δ4 (b) Figure 9.4: Online computation of rotation factors: (a) Network. (b) Timing diagram.

LSDF ARITHMETIC: ADDITION/SUBTRACTION 11 The cycle delay t LSDFadd k = t CPA(k) + t FF The total time for n-bit addition T LSDFadd n = ( n k + 1)t LSDFadd k The cost: one k-bit CPA, one flip-flop, and one k-bit output register

Operand X: Operand Y: x z i Result Z: i y k z i z k k-bit k k i-1 k CPA result digit register SUB 12 (a) c carry/borrow FF (initialize to 0 if ADD 1 if SUB) Operand Y: Operand X: Result Z: y i.0 x i,0 z i,0 z z i-1,0 y i,1 x i,1 y i,2 x i,2 4-bit ADDER z i,1 z i,2 z z z i-1,1 z i-1,2 y i,3 x i,3 z i,3 z z i-1,3 SUB carry/borrow FF (initialize to 0 if ADD 1 if SUB) c

LSDF MULTIPLICATION 13 For radix-2 and 2 s complement representation: 1. Serial-serial (LSDF-SS) multiplier, both operands used in digit-serial form. 2. Serial-parallel ( LSDF-SP) multiplier, one operand first converted to parallel form Operation cannot be completed during the input of operands

SERIAL-SERIAL MULTIPLIER 14 Define internal state (residual) w[j] = 2 (j+1) (x[j] y[j] p[j]) where x[j] = j i=0 x i 2 i and similarly for y[j] and p[j]. Both operands used in serial form; the recurrence is w[j + 1] = 2 (j+2) (x[j + 1] y[j + 1] p[j + 1]) = 2 (j+2) ((x[j] + x j+1 2 j+1 )(y[j] + y j+1 2 j+1 ) (p[j] + p j+1 2 j+1 )) = 2 1 (w[j] + y[j + 1]x j+1 + x[j]y j+1 p j+1 ) This can be expressed as v[j] = w[j] + y[j + 1]x j+1 + x[j]y j+1 and w[j + 1] = 2 1 v[j] p j+1 = v[j]mod 2

Position Cycle 8 7 6 5 4 3 2 1 0 0 y 0 x 0 1 x 0 y 1 y 1 x 1 y 0 x 1 2 x 1 y 2 x 0 y 2 y 2 x 2 y 1 x 2 y 0 x 2 3 x 2 y 3 x 1 y 3 x 0 y 3 y 3 x 3 y 2 x 3 y 1 x 3 y 0 x 3 4 x 3 y 4 x 2 y 4 x 1 y 4 x 0 y 4 y 4 x 4 y 3 x 4 y 2 x 4 y 1 x 4 y 0 x 4 15

16 (shift-register for load control in left-append registers not shown) x j+2 x j+1 LA-Reg X y j+2 LA-Reg Y y j+1 x j+1 n w[j+1]: x[j] n SELECTOR n x[j] n shifted WC Reg WC [4:2] ADDER n x j+1 n+1 y j+1 n shifted WS Reg WS y[j+1] n SELECTOR n sign(x) sign(y) FA y[j+1] cycle n LS bits p j+1 carry-out w[j]: n n SA MS bits (register control signals not shown) SA - Serial adder Figure 9.6: Serial-serial 2 s complement radix-2 multiplier.

cont. 17 The total execution time T SSMULT = 2nt cyc The delay of the critical path in a cycle t cyc = t SEL + t 4 2CSA + t FF Cost: one n-bit [4:2] adder, 5 n-bit registers, and gates to form multiples

SERIAL-PARALLEL MULTIPLIER 18 One of the operands is a constant, One possibility: perform operation in 3n cycles Phase 1: Serial input and conversion of one operand to parallel form; Phase 2: Serial-parallel processing and output of the LS half of the product. Phase 3: Serial output of the MS half of the product. The critical path in a cycle t cyc = t SEL + t CSA + t FF The delay of the LSDF-SP multiplier T SPrnd = 3n t cyc

(shifted -in in Phase 1 or constant) x j (used serially in Phase 2) y j n n shifted WC Reg WC Shift-Reg X x SELECTOR n n [3:2] ADDER n+1 w[j+1] n x { 0, x, x } shifted WS Reg WS sign(y) HA cycle n in Phase 2 (shifted -out in Phase 2) LS bits carry-out 19 n w[j] n SA (shifted -out in Phase 3) MS bits (register control signals not shown) SA - Serial adder Phase 1: shift-in operand X (n cycles) Phase 2: serial-parallel carry-save multiplication (n cycles) shifted sum and carry bit-vectors loaded bit-parallel Phase 3: MS bits obtained using bit-serial adder SA operating on bits shifted out of WC and WS shift-registers (n cycles) Figure 9.7: 3-phase serial multiplier.

MSDF: ONLINE ARITHMETIC 20 Online arithmetic algorithms operate in a digit-serial MSDF mode To compute the first digit of the result, δ + 1 digits of the input operands needed Thereafter, for each new digit of the operands, an extra digit of the result obtained The online delay δ typically a small integer, e.g., 1 to 4. Cycle -2-1 0 1 2 Input x 1 x 2 x 3 x 4 x 5 Compute z 1 z 2 z 3 Output z 1 z 2 δ = 2 Figure 9.8: Timing in online arithmetic.

REDUNDANT REPRESENTATION 21 The left-to-right mode of computation requires redundancy Both symmetric { a,..., a} and asymmetric {b,..., c}) digit-sets used in online arithmetic Over-redundant digit-sets also useful Examples of radix-4 redundant digit sets: {-1,0,1,2,3} (asymmetric, minimally-redundant), {-2,-1,0,1,2} (symmetric, minimally-redundant), and {-5,-4,-3,-2,-1,0,1,2,3,4,5} (symmetric, over-redundant) Heterogeneous representations to optimize the implementation Conversion to conventional representation: use on-the-fly conversion

ADDITION/SUBTRACTION 22 The online addition/subtraction algorithm: the serialization of a redundant addition (carry-save or signed-digit) Radix r > 2 and (t j+1, w j+2 ) = where x j, y j, z j { a,..., a}. (0, x j+2 + y j+2 ) if x j+2 + y j+2 a 1 (1, x j+2 + y j+2 r) if x j+2 + y j+2 a ( 1, x j+2 + y j+2 + r) if x j+2 + y j+2 a z j+1 = w j+1 + t j+1

EXAMPLE OF ONLINE ADDITION (r = 4, a = 3) 23 Operands x = (.12 330 1) The result z = (1. 10 1221). y = (.2 1 3322) j x j+2 y j+2 t j+1 w j+2 w j+1 z j+1 z j -1 1 2 1-1 0* 1 0* 0 2-1 0 1-1 -1 1 1-3 -3-1 -2 1 0-1 2 3 3 1 2-2 -1 0 3 0 2 0 2 2 2-1 4-1 2 0 1 2 2 2 5 0 0 0 0 1 1 2 6 0 0 0 0 0 0 1 * latches initialized to 0.

RADIX r > 2 ONLINE ADDER 24 x j y j x j+1 y j+1 x j+2 y j+2 x j+2 y j+2 TW TW TW TW t j-1 w j t j w j+1 t j+1 w j+2 t j+2 t j+1 w j+2 latch w j+1 SUM SUM SUM SUM z j+1 z j z j+1 z j+2 z j (a) (b) Figure 9.9: (a) A segment of radix-r > 2 signed-digit parallel adder. (b) Radix-r > 2 online adder. All latches cleared at start.

RADIX-2 ONLINE ADDER 25 Digit-parallel radix-2 signed-digit adder converted into a radix-2 online adder with online delay δ = 2 x + x - y + y - j+1 j+1 j+1 j+1 x + x - j+2 j+2 y + y - j+2 j+2 x + y + y - j+3 x- j+3 j+3 j+3 x + j+3 x- j+3 y + y - j+3 j+3 FA FA FA FA h j g j+1 h j+1 g j+2 h j+2 g j+3 hj+3 h j+2 g j+3 g j+2 FA FA FA FA t j w j+1 t j+1 w j+2 t j+2 w j+3 t j+3 w j+2 latch z + z + z - z + z - j+1 z - j+1 j+2 j+2 j+3 j+3 z - j+1 = t j+1 z + = w j+1 j+1 (output latches) (a) (b) z - j z + j Figure 9.10: (a) A segment of radix-2 signed-digit parallel adder. (b) Online adder.

cont. 26 The cycle time is t cyc = 2t FA + t FF The operation time T OLADD 2 = (2 + n + 1)t cyc The cost 2 FAs and 5 FFs. To reduce the cycle time, pipeline the two stages: reduces the cycle time by one t FA ; increases online delay to δ = 3

EXAMPLE OF RADIX-2 ONLINE ADDITION 27 x = (.010 1110 1) y = (.10 101 1 10) z = (1. 10100 101) j x j+3 y j+3 x + j+3x j+3 y j+3y + j+3 h j+2 g j+3 g j+2 t j+1 w j+2 z j+1z + j+1 z j -2 0 1 00 10 1 10 00* 0 1 - - - -1 1 0 10 00 1 10 10 0 0 10-0 0-1 00 01 0 01 10 1 1 01 1 1-1 0 01 00 0 10 01 1 1 11-1 2 1 1 10 10 1 00 10 0 0 10 0 3 1-1 10 01 0 11 00 0 1 00 1 4 0-1 00 01 0 01 11 1 0 11 0 5-1 0 01 00 0 10 01 1 1 01 0 6 0 0 00 00 0 00 10 1 1 11-1 7 0 0 00 00 0 00 00 0 0 10 0 8 0 0 00 00 0 00 00 0 0 00 1 * g latches initialized to 00.

METHOD FOR DEVELOPING ONLINE ALGORITHMS 28 Part 1: development of the residual recurrence w[j + 1] = G(w[j], x[j], x j+1+δ, y[j], y j+1+δ, z[j], z j+1 ) for δ j n 1 where x[j] = j+δ x ir i, y[j] = j+δ i=1 y ir i, z[j] = j z ir i i=1 i=1 are the online forms of the operands and the result Part 2: the result digit selection z j+1 = F(w[j], x[j], x j+1+δ, y[j], y j+1+δ, z[j])

Part 1: RESIDUAL AND ITS RECURRENCE 29 Step 1. Describe the online operation by the error bound after j digits Step 2 Transform expression to use only multiplication by r (shift), addition/subtraction, multiplication by a single digit f(x[j], y[j]) z[j] < r j B < G(f(x[j], y[j]) z[j]) < B where G is the required transformation and B and B are the transformed bounds Example: division error expression x[j]/y[j] z[j] < r j transformed into x[j] z[j] y[j] < r j y[j]

cont. 30 Step 3 Define a scaled residual with the bound w[j] = r j (G(f(x[j], y[j]) z[j])) ω = r j B < w[j] < r j B = ω and initial condition w[ δ] = 0. ω and ω are the actual bounds determined in Step 6 Step 4 Determine a recurrence on w[j] w[j+1] = rw[j]+r j+1 (G(f(x[j+1], y[j+1]) z[j+1]) G(f(x[j], y[j]) z[j])) Step 5 Decompose recurrence so that H 1 is independent of z j+1 w[j + 1] = rw[j] + H 1 + H 2 (z j+1 ) = v[j] + H 2 (z j+1 )

cont. 31 Step 6 Determine the bounds of w[j + 1] in terms of H 1 and H 2 ω = rω + max(h1) + H2(a) resulting in Similarly, ω = max(h 1) + H 2 (a) r 1 ω = min(h 1) + H 2 ( a) r 1

Part 2a: SELECTION FUNCTION WITH SELECTION CONSTANTS 32 z j+1 = k if m k v[j] < m k+1 where v[j] is an estimate of v[j] obtained by truncating the redundant representation of v[j] to t fractional bits. Selection constants need to satisfy max( L k ) m k min(ûk 1) where [ L k, Ûk] is the selection interval of the estimate v[j] Step 7 Determine [ L k, Ûk] First, determine [L k, U k ] for v[j] ω = U k + H 2 (k) ω = L k + H 2 (k) Substituting ω and ω the selection intervals for v[j] is, U k = max(h 1)+H 2 (a) r 1 H 2 (k) L k = min(h 1)+H 2 ( a) r 1 H 2 (k)

cont. 33 Now restrict the intervals because of the use of the estimate v[j] e min v[j] v[j] e max producing the error-restricted selection interval [L k, Uk] with Uk = U k e max L k = L k + e min The errors are For carry-save representation e max = 2 t+1 ulp and e min = 0. For signed-digit representation e max = 2 t ulp and e min = (2 t ulp). U k 1 = Uk 1 + 2 t t = L k t L k where x t and x t indicate x values truncated to t fractional bits.

34 L k U k-1 2 -t - possible choices for m k 2 -t+1 v[j] L k U * k-1 U k-1 L * k (the ticks on the v[j] line represent the estimate v[j]) Figure 9.11: The choices of selection constant m k.

cont. 35 Step 8 Determine t and δ. To determine m k, we need min(ûk 1) max( L k ) 0 This relation between t and δ is used to choose suitable values. Step 9 Determine the selection constants m k and the range of v[j] as rω + min(h 1 ) e max t v[j] rω + max(h 1 ) + e min t

Part 2b: SELECTION BY ROUNDING 36 In algorithms using a higher radix (r > 4) w[j + 1] = rw[j] + H 1 + H 2 (z j+1 ) = v[j] + H 2 (z j+1 ) In the rounding method, the result digit is obtained as z j+1 = v[j] + 1 2 with v[j] < r 1 2 to avoid over-redundant output digit. w[j + 1] = v[j] + H 2 ( v[j] + 1 2 ) For CS form of t fractional bits, the estimate error e max = 2 t+1 ulp When ˆv[j] = m k 2 t it must be possible to choose z j+1 = k 1 m k 2 t + e max = 2k 1 2 + 2 t Ûk 1

GENERIC FORM OF EXECUTION AND IMPLEMENTATION. 37 Execution: n + δ iterations of the recurrence, each one clock cycle Iterations (cycles) labeled from δ to n 1 One digit of each input introduced during cycles δ to n 1 δ and digits value 0 thereafter Result digits 0 for cycles δ to 1 and z 1 is produced in cycle 0 Result digit z j is output in cycle j (one extra cycle to output z n )

cont. 38 The actions in cycle j: Input x j+1+δ and y j+1+δ. Update x[j+1] = (x[j], x j+1+δ ) and y[j+1] = (y[j], y j+1+δ ) by appending the input digits. Compute v[j] = rw[j] + H 1 Determine z j+1 using the selection function. Update z[j + 1] = (z[j], z j+1+δ ) by appending the result digits. Compute the next residual w[j + 1] = v[j] + H 2 (z j+1 ) Output result digit z j

IMPLEMENTATION 39 Similar structure of algorithms all implemented with same basic components, such as (i) registers to store operands, results, and residual vectors; (ii) multiplication of vector by digit; (iii) append units to append a new digit to a vector; (iv) Two-operand and multioperand redundant adders, such as signed digit adders, [3:2] carry-save adders and their generalization to [4:2] and [5:2] adders; (v) converters from redundant representations (i.e., signed digit and carry save) to conventional representations; (vi) carry-propagate adders of limited precision (3 to 6 bits) to produce estimates of the residual functions; and (vii) digit-selection schemes to obtain output digits.

cont. 40 Online algorithm implementation similar to implementation of digit-recurrence algorithms Algorithms and implementations developed for most of basic arithmetic operations and for certain composite operations Larger set of operations possible than with LSDF approach

DIGIT-SLICE ORGANIZATION 41 x j+1+δ y j+1+δ * 1*** 2 m z j+1 z j digit slice ** * paths for appending input digits ** left-shifted bits of the residual *** the width of the MS slice depends on the selection function Figure 9.12: A typical digit-slice organization of online arithmetic unit

ONLINE MULTIPLICATION 42 Online forms The error bound at cycle j The residual x[j] = j+δ x ir i, y[j] = j+δ i=1 with the bound w[j] < ω The residual recurrence y ir i, p[j] = j p ir i i=1 i=1 x[j] y[j] p[j] < r j w[j] = r j (x[j] y[j] p[j]) w[j + 1] = rw[j] + (x[j]y j+1+δ + y[j + 1]x j+1+δ )r δ p j+1 = v[j] p j+1

SELECTION FUNCTION 43 Decomposition Bound Selection intervals H 1 = (x[j]y j+1+δ + y[j + 1]x j+1+δ )r δ H 2 = p j+1 ω = ω = ω = ρ(1 2r δ ) U k = ρ(1 2r δ ) + k L k = ρ(1 2r δ ) + k With carry-save representation for w[j] and v[j], the grid-restricted intervals are U k = ρ(1 2r δ ) + k 2 t t L k = ρ(1 2r δ ) + k t The expression to determine t and δ: resulting in ρ(1 2r δ ) + k 1 2 t t ρ(1 2r δ ) + k t 0 ρ(1 2r δ ) t 2 1 (1 + 2 t )

cont. 44 Several examples of relations between r, ρ, t, and δ Radix ρ t δ 2 1 2 3 4 1 2 2 2/3 3 3 8 2/3 2 3

RADIX-2 ONLINE MULTIPLICATION 45 δ = 3 and t = 2 Selection constants m k s obtained from where L k m k Ûk 1 U k = 1 2 2 + k 2 2 2 = k + 2 1 L k = 1 + 2 2 + k 2 = k 3 2 2 Since Ûk 1 = k 2 1 and L k = k 3 2 2, m k = k 2 1 is acceptable. The selection constants are Range of v[j] is m 0 = 2 1, m 1 = 2 1 2 v[j] 7/4 The selection function SELM( v[j]) is p j+1 = SELM( v[j]) = 1 if 1/2 v[j] 7/4 0 if 1/2 v[j] 1/4 1 if 2 v[j] 3/4

IMPLEMENTATION OF SELECTION FUNCTION 46 Estimate ˆv represented by (v 1, v 0, v 1, v 2 ) Product digit p j+1 = (pp, pn) with the code p j+1 pp pn 1 1 0 0 0 0-1 0 1 Switching expressions: pp = v 1(v 0 v 1 ) pn = v 1 (v 0 v 1)

ˆv v 1 v 0 v 1 v 2 p j+1 7/4 01.11 1 6/4 01.10 1 5/4 01.01 1 1 01.00 1 3/4 00.11 1 1/2 00.10 1 1/4 00.01 0 0 00.00 0-1/4 11.11 0-1/2 11.10 0-3/4 11.01-1 -1 11.00-1 -5/4 10.11-1 -6/4 10.10-1 -7/4 10.01-1 -2 10.00-1 47

48 1. [Initialize] x[ 3] = y[ 3] = w[ 3] = 0 for j = 3, 2, 1 x[j + 1] CA(x[j], x j+4 ); y[j + 1] CA(y[j], y j+4 ) v[j] = 2w[j] + (x[j]y j+4 + y[j + 1]x j+4 )2 3 w[j + 1] v[j] end for 2. [Recurrence] for j = 0...n 1 x[j + 1] CA(x[j], x j+4 ); y[j + 1] CA(y[j], y j+4 ) v[j] = 2w[j] + (x[j]y j+4 + y[j + 1]x j+4 )2 3 p j+1 = SELM( v[j]); w[j + 1] v[j] p j+1 P out p j+1 end for Figure 9.13: Radix-2 online multiplication algorithm.

predecessor on-line unit (shift-register for load control in right-append registers not shown) predecessor on-line unit 49 x j+5 LX CA-Reg X y j+5 LY CA-Reg Y n x[j] n x[j] x j+4 y j+4 n y[j+1] n y[j+1] y j+4 SELECTOR x j+4 SELECTOR n n n+2 [4:2] ADDER v[j1] n+2 c x =1 if x j+4 < 0 c y =1 if y j+4 < 0 V 4 4 SELM 3 4 v n-2 n-2 p j+1 M wired shift left Pout 3 2w[j+1] p j Reg WS Reg WC v[j] vs -1 vs 0. vs 1 vs 2 vs 3 vs 4..... vc -1 vc 0. vc 1 vc 2 vc 3 vc 4.... n+2 n+2 2w[j] V block produces estimate of v M block performs subtraction of p j+1 (register control signals not shown) (a) v -1 v 0. v 1 v estimate of v[j] 2 v * o v 1. v 2 vs vs 3 4.... 2w[j+1] vc 3 vc 4... v * o = v 0 XOR p j+1 (b) Figure 9.14: (a) Implementation of radix-2 online multiplier. (b) Calculation of 2w[j + 1].

EXAMPLE OF RADIX-2 ONLINE MULTIPLICATION 50 Operands: x = (.110 110 11) y = (.101 1 1110) j x j+4 y j+4 x[j + 1] y[j + 1] v[j] p j+1 w[j + 1] -3 1 1.1.1 00.0001 0 00.0001-2 1 0.11.10 00.00110 0 00.00110-1 0 1.110.101 00.011110 0 00.011110 0-1 -1.1011.1001 00.1100011 1 11.1100011 1 1-1.10111.10001 11.10000111 0 11.10000111 2 0 1.101110.100011 11.001001010-1 00.001001010 3-1 1.1011011.1000111 00.0100111101 0 00.0100111101 4 1 0.10110111.10001110 00.10110000010 1 11.10110000010 5 0 0.10110111.10001110 11.0110000010-1 00.0110000010 6 0 0.10110111.10001110 00.110000010 1 11.110000010 7 0 0.10110111.10001110 11.10000010 0 11.10000010

cont. 51 Computed product: p = (.10 101 110) The exact double precision product p = (.0110010110000010) The absolute error wrt to the exact product truncated to 8 bits: p p tr = 2 8 Note: p[8] + w[8]2 8 = p

ONLINE DIVISION 52 Online forms x[j] = j+δ x ir i, y[j] = j+δ i=1 y ir i, q[j] = j q ir i i=1 i=1 Error bound at cycle j x[j] q[j]d[j] < d[j]r j Residual w[j] = r j (x[j] q[j]d[j]) w[j] < ω d[j] Residual recurrence w[j + 1] = rw[j] + x j+1+δ r δ q[j]d j+1+δ r δ d[j + 1]q j+1 = v[j] d[j + 1]q j+1

RADIX-2 ONLINE DIVISION 53 δ = 4 and t = 3 Selection intervals and selection constants minû0 = Û0[d[j + 1] = 1/2] = 2 1 2 3 + 0 2 3 = 2 2 max L 1 = L 1 [d[j + 1] = 1] = 1 + 2 3 + 1 = 2 3 resulting in m 1 = 2 2 minû 1 = Û 1[d[j + 1] = 1] = 1 2 3 1 2 3 = 2 2 max L 0 = L 0 [d[j + 1] = 1/2] = 2 1 + 2 3 = 3 2 3 so that m 0 = 2 2. q j+1 = SELD( v[j]) = 1 if 1/4 v[j] 15/8 0 if 1/4 v[j] 1/8 1 if 2 v[j] 1/2

RADIX-2 ONLINE DIVISION: ALGORITHM 54 1. [Initialize] x[ 4] = d[ 4] = w[ 4] = q[0] = 0 for j = 4,..., 1 d[j + 1] CA(d[j], d j+5 ) v[j] = 2w[j] + x j+5 2 4 w[j + 1] v[j] end for 2. [Recurrence] for j = 0...n 1 d[j + 1] CA(d[j], d j+5 ) v[j] = 2w[j] + x j+5 2 4 q[j]d j+5 2 4 q j+1 = SELD( v[j]); w[j + 1] v[j] q j+1 d[j + 1] q[j + 1] CA(q[j], q j+1 ) Q out q j+1 end for

(shift-register for load control in right-append registers not shown) predecessor on-line unit 55 q j+1 CA-Reg Q d j+6 LD CA-Reg D x j+6 q s x j+5 LX (single digit) predecessor on-line unit n q[j] n d j+5 d j+5 SELECTOR U u all 6 bits wide n 5 q[j] v[j] [3:2] ADDER ws q j+1 wc n d[j+1] n d[j+1] SELECTOR n c d = 1 if (d j+5 > 0 and q>0) or (d j+5 < 0 and q<0) 5 SELD 4 V v [3:2] ADDER c q = 1 if (q j+1 > 0 and d>0) or (q j+1 < 0 and d<0) q j+1 wired left shift Qout 2w[j+1] q j (register control signals not shown) Reg WS Reg WC n+2 n+2 2w[j] ws wc Figure 9.16: Block diagram of radix-2 online divider.

REDUCTION OF DIGIT-SLICES 56 Selection valid if (t fractional bits) p + h = n + δ p 2h + δ t p = 2n + δ + t 3 Total number of bit-slices: ib + p, ib - no. integer bits For example, the number of bit-slices for 32-bit radix-2 online multiplication is 2 32 + 3 + 2 2 + = 2 + 23 = 25 3 compared to 34 in implementation without slice reduction.

57 δ not implemented ib t n p (a) after left shift: p x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x e x x x x x e e x x x x x e e x x x x e e e (b) Figure 9.17: Reduction of bit-slices in implementation.

MULTI-OPERATION AND COMPOSITE ONLINE ALGORITHMS 58 To reduce the overall online delay of a group of operations - combine several operations into a single multi-operation online algorithm Example: x 2 + y 2 + z 2 Inputs in [1/2,1), output in [1/4, 3) Online delay δ ss = 0 when the output digit is over-redundant. Online delay (3+2+2=7) of the corresponding network

ALGORITHM FOR SUM OF SQUARES 59 1. [Initialize] w[0] = x[0] = y[0] = z[0] = 0 2. [Recurrence] for j = 0...n 1 v[j] = 2w[j] + (2x[j] + x j 2 j )x j + (2y[j] + y j 2 j )y j + (2z[j] + z j 2 j )z j w[j + 1] csfract(v[j]) s j+1 csint(v[j]) x[j + 1] (x[j], x j+1 ); y[j + 1] (y[j], y j+1 ); z[j + 1] (z[j], z j+1 ) S out s j+1 end for Figure 9.18: Radix-2 online sum of squares algorithm.

serial parallel x j+1 APPEND y j+1 APPEND z j+1 APPEND x[j+1] w[j+1] x[j] y[j] z[j] 60 2w[j] WS WC MUL/ APPEND MUL/ APPEND MUL/ APPEND [5:2] ADDER s j+1 CPA Sout s j in {0,...,8} w[j+1] APPEND implements x[j+1]=x[j]+x j+1 2 -j-1 δ ss = 0 x. x x x x x x x x x x. x x x x x x x x x x. x x x x x x x x x x. x x x x x x x x x x. x x x x x x x x x csint csfrac max(csint) = 8 (a) 2w[j] MUL/APPEND implements 2x[j]x j+1 +x 2 j+1 2-j-1 (2x[j]x j+1 +x 2 j+1 2-j-1 ) (2y[j]y j+1 +y 2 j+1 2-j-1 ) (2z[j]z j+1 +z 2 j+1 2-j-1 ) Note: the fractional portion of the 5-2 CSA produces at most three carries

COMPOSITE ALGORITHM 61 d = (x 2 + y 2 + z 2 ) Overall online delay of 5 A network of standard online modules: online delay of 11

62 y j+5 x serial j+5 z j+5 parallel APPEND APPEND APPEND x[j+1] w[j+1] x[j] y[j] z[j] Operation: WS WC MUL/ APPEND MUL/ APPEND MUL/ APPEND Sum of squares 2w[j] [5:2] ADDER CPA s j+6 w[j+1] Sout s j+5 {0,...8} R[j+1] On-the-Fly converter CONVERT d j+1 RS RC R[j] d d APPEND u=-(2 d[j]d j+1 +d 2 j+1 2-j-1 ) Square root dsel [3:2] ADDER d j+1 {-1,0,1} Dout R[j+1] d j Figure 9.20: Composite scheme for computing d = (x 2 + y 2 + z 2 ).

ONLINE IMPLEMENTATION OF RECURSIVE FILTER 63 IIR filter y(k) = a 1 y(k 1) + a 2 y(k 2) + bx(k) Conventional parallel arithmetic time to obtain y[k]: T CONV = 6t module. t module 6t FA rate of filter computation: R CONV 1/(4 6t FA ) LSDF serial arithmetic time to obtain y[k]: T LSDF = nt FA. rate of filter computation: R LSDF 1/(n t FA ) Online arithmetic Multioperation modules of type vu + w, online delay of 4 cycle time t M 3t FA Throughput independent of working precision but not the number of online units Rate: R OL = 1/( iter t M ) 1/(12t FA )

64 x[k] MUL ADD y[k] a 1 b y[k-1] ADD MUL (a) MUL a 2 y[k-2] coefficients x or y M1 M2 M3 M4 M5 y (Multiply) (Multiply-add) ([4:2] adder) ([4:2] adder) (CPA) CS form (b) CYCLE: k k+1 k+2 k+3 k+4 k+5 Module: M1 bx[k] R a 2 y[k-2] R a 1 y[k-1] R bx[k+1] R M2 M3 (bx[k] R ) +bx[k] L (a 2 y[k-2] R )+ a 2 y[k-2] L (a 1 y[k-1] R )+ a 1 y[k-1] L (bx[k]) + (a 2 y[k-2]) M4 (bx[k] + a 2 y[k-2])+ (a 1 y[k-1]) M5 y[k] (c) Figure 9.21: Conventional implementation of second-order IIR filter: (a) Filter. (b) 5-stage pipeline. (c) Timing diagram.

y[k-2] y[k-1] 65 MODULE M δ=2 δ=3 δ=3 x[k] b vu vu+w a 2 a 1 c d vu+w y[k] (a) x[k-2] M y[k-2] x[i] PARALLEL/SERIAL CONVERTER + DEMUX x[k-1] x[k] x[k+1] M M M y[k-1] y[k] y[k+1] SERIAL/PARALLEL CONVERTER + MUX y[j] Array for n=16 and iter =4 (b) y[k-2] y[k-1] from other modules x[k] c d y[k] (c) Figure 9.22: Online implementation of second-order IIR filter.