How fast can we calculate?

Similar documents
Residue Number Systems Ivor Page 1

Adders, subtractors comparators, multipliers and other ALU elements

Binary addition by hand. Adding two bits

Hardware Design I Chap. 4 Representative combinational logic

Adders, subtractors comparators, multipliers and other ALU elements

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Logic and Computer Design Fundamentals. Chapter 5 Arithmetic Functions and Circuits

ARITHMETIC COMBINATIONAL MODULES AND NETWORKS

We are here. Assembly Language. Processors Arithmetic Logic Units. Finite State Machines. Circuits Gates. Transistors

CMP 334: Seventh Class

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Overview. Arithmetic circuits. Binary half adder. Binary full adder. Last lecture PLDs ROMs Tristates Design examples

Midterm Exam Two is scheduled on April 8 in class. On March 27 I will help you prepare Midterm Exam Two.

CS 140 Lecture 14 Standard Combinational Modules

Combinational Logic. By : Ali Mustafa

Lecture 12: Adders, Sequential Circuits

Chapter 4. Combinational: Circuits with logic gates whose outputs depend on the present combination of the inputs. elements. Dr.

Chapter 5 Arithmetic Circuits

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

1 Short adders. t total_ripple8 = t first + 6*t middle + t last = 4t p + 6*2t p + 2t p = 18t p

CSE140: Components and Design Techniques for Digital Systems. Decoders, adders, comparators, multipliers and other ALU elements. Tajana Simunic Rosing

Fundamentals of Digital Design

Hw 6 due Thursday, Nov 3, 5pm No lab this week

Sample Marking Scheme

Systems I: Computer Organization and Architecture

14:332:231 DIGITAL LOGIC DESIGN

VLSI Design. [Adapted from Rabaey s Digital Integrated Circuits, 2002, J. Rabaey et al.] ECE 4121 VLSI DEsign.1

Lecture 13: Sequential Circuits

ISSN (PRINT): , (ONLINE): , VOLUME-4, ISSUE-10,

Computer Architecture 10. Residue Number Systems

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Digital Logic

ECE 545 Digital System Design with VHDL Lecture 1. Digital Logic Refresher Part A Combinational Logic Building Blocks

Tree and Array Multipliers Ivor Page 1

Floating Point Representation and Digital Logic. Lecture 11 CS301

Class Website:

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design

Computer Architecture 10. Fast Adders

Looking at a two binary digit sum shows what we need to extend addition to multiple binary digits.

Lecture 5: Arithmetic

CprE 281: Digital Logic

ELEN Electronique numérique

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

Lecture 12: Adders, Sequential Circuits

CSE477 VLSI Digital Circuits Fall Lecture 20: Adder Design

DIGITAL TECHNICS. Dr. Bálint Pődör. Óbuda University, Microelectronics and Technology Institute

A High-Speed Realization of Chinese Remainder Theorem

Carry Look Ahead Adders

UNIT II COMBINATIONAL CIRCUITS:

Sample Test Paper - I

KEYWORDS: Multiple Valued Logic (MVL), Residue Number System (RNS), Quinary Logic (Q uin), Quinary Full Adder, QFA, Quinary Half Adder, QHA.

Numbers and Arithmetic

UNSIGNED BINARY NUMBERS DIGITAL ELECTRONICS SYSTEM DESIGN WHAT ABOUT NEGATIVE NUMBERS? BINARY ADDITION 11/9/2018

Chapter 7 Logic Circuits

Digital Techniques. Figure 1: Block diagram of digital computer. Processor or Arithmetic logic unit ALU. Control Unit. Storage or memory unit

Part II Addition / Subtraction

COE 202: Digital Logic Design Combinational Circuits Part 2. Dr. Ahmad Almulhem ahmadsm AT kfupm Phone: Office:

Number representation

Digital System Design Combinational Logic. Assoc. Prof. Pradondet Nilagupta

Digital Logic. CS211 Computer Architecture. l Topics. l Transistors (Design & Types) l Logic Gates. l Combinational Circuits.

ECE380 Digital Logic. Positional representation

Part II Addition / Subtraction

Hakim Weatherspoon CS 3410 Computer Science Cornell University

14:332:231 DIGITAL LOGIC DESIGN. Why Binary Number System?

Logic Design II (17.342) Spring Lecture Outline

CPE100: Digital Logic Design I

Digital Systems Roberto Muscedere Images 2013 Pearson Education Inc. 1

Total Time = 90 Minutes, Total Marks = 100. Total /10 /25 /20 /10 /15 /20

Combinational Logic Design Arithmetic Functions and Circuits

Algorithms (II) Yu Yu. Shanghai Jiaotong University

A VLSI Algorithm for Modular Multiplication/Division

Binary Multipliers. Reading: Study Chapter 3. The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding

Numbers and Arithmetic

ECE 2300 Digital Logic & Computer Organization

Binary addition example worked out

CSE 20 DISCRETE MATH. Fall

Cost/Performance Tradeoff of n-select Square Root Implementations

Arithmetic Building Blocks

Bit-Sliced Design. EECS 141 F01 Arithmetic Circuits. A Generic Digital Processor. Full-Adder. The Binary Adder

CprE 281: Digital Logic

ELCT201: DIGITAL LOGIC DESIGN

This Unit: Arithmetic. CIS 371 Computer Organization and Design. Pre-Class Exercise. Readings

Outline. policies for the first part. with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014

Experience in Factoring Large Integers Using Quadratic Sieve

Computer organization

Lecture 8: Sequential Multipliers

EECS150. Arithmetic Circuits

EECS150 - Digital Design Lecture 21 - Design Blocks

The goal differs from prime factorization. Prime factorization would initialize all divisors to be prime numbers instead of integers*

What s the Deal? MULTIPLICATION. Time to multiply

The Design Procedure. Output Equation Determination - Derive output equations from the state table

University of Toronto Faculty of Applied Science and Engineering Edward S. Rogers Sr. Department of Electrical and Computer Engineering

ww.padasalai.net

Lecture 3 Review on Digital Logic (Part 2)

ASSC = Automatic Sequence-Controlled Calculator (= Harvard Mark I) Howard Aitken, 1943

Latches. October 13, 2003 Latches 1

Chapter 1: Solutions to Exercises

Review for Test 1 : Ch1 5

Introduction to Digital Logic Missouri S&T University CPE 2210 Subtractors

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering

Lecture 4. Adders. Computer Systems Laboratory Stanford University

Transcription:

November 30, 2013

A touch of History The Colossus Computers developed at Bletchley Park in England during WW2 were probably the first programmable computers. Information about these machines has only been released in the last 15 years. The first model was working by December 1943. Prior to release of information about Colossus, University of Manchester s Small-Scale Experimental Machine (SSEM) had generally ben recognized as world s first electronic computer that ran a stored program - an event that occurred on June 21st 1948. However the SSEM was not regarded as a fully-fledged computer, more a proof of concept that led to the Manchester Mark 1 Computer.

A touch of History Letter from F.C. Williams, Tom Kilburn, Manchester University, UK The capacity of the store is at present only 32 words, each of 31 binary digits, to hold instructions, data and working. Hence only simple arithmetic routines devised to test the machine can be run. Examples of problems that have been carried out are: 1 Long division by the standard process. (For (2 30 1)/31, this took 1.5 seconds, the quotient being given to 39 significant binary figures of which the 13 least significant, to the left of the binary point, were zero, since 31 is a factor of 2 30 1.) 2 H.C.F. (GCD) by the standard process. (For 314,159,265 and 271,828,183, which are co-prime, approximately 0.5 second.) 3 Factorizing an integer. For (3) the method was deliberately chosen to give a long run the result of which could be easily checked. Thus the highest proper factor of 2 18 was found by trying in a single routine every integer from 2 18 1 downward, the necessary divisions being done not by long division, but by the primitive process of repeated subtraction of the divisor. Thus about 130,000 numbers were tested, involving some 3.5 million operations. The correct answer was obtained in a 52-minute run. The instruction table in the machine contained 17 entries.

Instruction Execution Times Here are measured instruction execution times on a 1.86 GHz MAC Book Air and on the UTD 2.93 GHz Lab Machines. 1.86 GHz Mac Book Air UTD 2.93 GHz Lab PCs INST N Time Factor Time Factor int add 122 ps 1 60pS 1 long add 122 ps 1 60pS 1 int mult 122 ps 1 60pS 1 long multi 133 ps 1 60pS 1 int div 350 ps 3 330pS 5.5 long div 1600 ps 14 330pS 5.5

Fastest in the World From breaking the German Enigma and Lorenz codes during the second world war, to the physics of nuclear reactions, to almost all chemistry and physics today, and to weather prediction, science continues to press computer architects for faster computers. The Chinese Tianhe-2, or Milky Way-2, can perform 54.9 petaflops per second. That is 54.9 times 10 15 floating point operations per second. Tianhe-2 has 16,000 nodes, each with two Intel Xeon IvyBridge processors and three Xeon Phi processors for a combined total of 3,120,000 computing cores.

Fastest in Texas On January 7, 2013, TACC s new cluster, Stampede, went into production. Stampede comprised 6400 nodes, 102400 cpu cores, 205 TB total memory, 14 PB total and 1.6 PB local storage. The bulk of the cluster consisted of 160 racks of primary compute nodes, each with dual Xeon E5-2680 8-core processors, Xeon Phi coprocessor, and 32 GB ram.[13] The cluster also contained 16 nodes with 32 cores and 1 TB ram each, 128 standard compute nodes with Nvidia Kepler K20 GPUs, and other nodes for I/O (to a Lustre filesystem), login, and cluster management. Stampede was the 7th fastest supercomputer on the November 2012 Top500 list. Stampede is listed as having a peak performance of 10 Peta Flops

Fastest in the USA When complete, Titan at Oak Ridge National Laboratory will have a theoretical peak performance of more than 20 petaflops, or more than 20,000 trillion calculations per second. This will enable researchers across the scientific arena, from materials to climate change to astrophysics, to acquire unparalleled accuracy in their simulations and achieve research breakthroughs more rapidly than ever before.

Titan slowed by sequestration Oak Ridge s budget was cut by about 7%in 2013, nearly $100M. But while the lab s budget is down, its electric bill isn t. Oak Ridge s machines suck up as much juice as a small town, 25 megawatts: For every megawatt of power we use a year, it costs us about a million dollars, and that s real, cold hard cash, says Jeff Nichols, the lab s associate director for computing. Money will also have to come from plans for the future. Researchers had hoped to replace Titan with another superfast machine in 2017. The automatic cuts are also hurting the researchers who use Titan, like Sally Ellingson, a graduate student at the University of Tennessee. She s currently teaching Titan to screen chemical compounds that could be used as drugs to treat diseases. Recently I ran a job where I tested over 4 million compounds in just a couple hours, she says.

Adders and Multipliers In this lecture I will restrict our discussion to the design of a single adder or multiplier. I will not discuss methods of reducing the number of partial products in a multiply, high-radix multiplication, or pipelining.

Binary Arithmetic Computers use the binary number system. Unsigned binary integers come in k-bit units, where k = 32 or k = 64 for most purposes today. We can easily create huge numbers by combining strings of these integers. For now, let s stick to k-bit unsigned integers. A 6 bit unsigned binary integer 100101 2 has decimal value 2 5 + 2 2 + 2 0 = 32 + 4 + 1 = 37 10. Note the subscripts, used to indicate the radix or base of the number. In general, the value of a k-bit unsigned binary number is expressed as follows: k 1 value = d i 2 i We can form this sum in any radix, but we will use decimals. i=0

Range The Range of a k-bit unsigned binary system is [0, 2 k 1]. For 32 bits, the range is [0, 2 32 1] = [0, 4294967295 10 ] The smallest and largest numbers are represented by the bit patterns 000...000 and 111...111. If we add 1 to the largest value in the range, the result is zero. We represent the behavior on a number wheel: 2 1 0 k 2-1

Negative Values The Two s Complement number system is used almost exclusively to represent negative values. The range of a k-bit 2 s complement number system is [ 2 k 1, 2 k 1 1]. The positive range is one smaller than the negative range. This imbalance is the consequence of having an even number of bit settings (2 k ) and the need to use one of them to represent zero. Here is the number wheel: k-1 2 2-1 1-2 k-1 0-2 k-1 +1-1 -2

Negative Values The sign bit is the most significant bit. All negative values have sign bits equal to 1. All non-negative values have sign bits of 0. In an 8-bit 2 s complement system the representations look like this: Decimal Binary Comment 127 0111 1111 2 7 1 126 0111 1110 2 0000 0010 1 0000 0001 0 0000 0000 all zeros 1 1111 1111 all ones 2 1111 1110 3 1111 1101 127 1000 0001 128 1000 0000 2 7

Negation The representations of positive values matches the representations of their unsigned counterparts. Negation is achieved by taking the 2 s complement. Here is the mathematical definition: x = 2 k x where x is the unsigned value of the k-bit field x. Notice that 2 k requires k + 1 bits, one larger than the k-bit field of the number system. For example, if k = 8 bits and x = 0000 1111 then x = 1 0000 0000 0000 1111 = 1111 0001. If x = 1111 0000 then x = 1 0000 0000 1111 0000 = 0001 0000. A simpler method of negation is to complement all the bits and add 1: If x = 1111 0000 then x = 0000 1111 + 1 = 0001 0000

Overflow The most negative 2 s complement value has no positive counterpart. If I try to take its 2 s complement I obtain the same value as I started with: If x = 1000 0000 then x = 1 0000 0000 1000 0000 = 1000 0000. When the result of any arithmetic operation exceeds the range of the number system we say that an overflow has occurred. Adding 1 to the largest positive value gives the most negative value. etc.

Limited Range The effects of limited range are to make addition and subtraction modulo operations. In a k-bit unsigned system, a + b becomes (a + b) mod 2 k A consequence is that when any sequence of additions, subtractions, and multiplications occurs on unsigned values, the answer is always correct in the least significant k bits, no matter whether overflow occurs or not.

Unsigned or Negative? A consequence of the choice of 2 s complement to represent negative values is that the k bit pattern used to represent any number may equally be thought of as representing an unsigned or a signed value. For example, if k = 8-bits, 1000 1111 represents 113 10 if we think of the system as 2 s complement, or +143 10 if we think of the system as unsigned. We can do any arithmetic operations on these k-bit values while maintaining either view of the meaning of the bit patters. The arithmetic units of modern computers operate exactly the same irrespective of the typing of the integers. The ONLY difference is that some computers have an overflow flag that is set when a signed operation exceed the 2 s complement range of the number system.

Ripple Carry Adder Circuits The simplest parallel adder circuit is the Ripple-Carry, or Carry-Propagate Adder. For a k-bit adder, k Full Adder circuits are used: y 2 x 2 y 1 x 1 y 0 x 0... FA FA FA c 3 c 2 c 1 c in z 2 z 1 z 0

Ripple Carry Adder Circuits The above diagram illustrates the right-most (or least significant) 3 stages of the adder. Two binary values, X = x k 1, x k 1,, x 1, x 0 and Y = y k 1, y k 1,, y 1, y 0 are added. Carry signals flow from right to left between the stages. The sum is represented by Z = z k 1, z k 1,, z 1, z 0 The Full Adders can be broken down into Half-Adders and interstage OR gates: x i y i x i-1 y i-1 HA HA c i+1 c s c i c s c i-1 HA HA c s c s z i z i-1

Ripple Carry Adder Circuits Here is the truth table of a Half Adder: x y s c 0 0 0 0 0 1 1 0 1 0 1 0 1 1 0 1 The output s is the Ex-Or function and c is the And function of the x and y inputs.

Ripple Carry Adder Circuits There are many ways to implement the Full-Adder circuit, one of the most commonly used in CMOS integrated circuits is the 28T circuit, named for its 28 CMOS transistors. I only show this to stress the complexity of these circuits. X y X X y c in X y c in y c in sum y c in y X X y X c in y X c out

Speed of the Ripple-Carry Adder Modern computer designs tend to base the clock rate on the slowest, or worst-case, time of any instruction. If we add two k-bit binary numbers, the worst case time occurs when the carry ripples from the least-significant to the most-significant ends of the adder. Let s assume that a Full-Adder has delay from the carry signal c i to c i+1 of 20 picoseconds. The total carry-propagation time is then 20 k picoseconds. For a 64 bit adder, the time would be about 1310 psec, resulting in a maximum clock speed of 10 12 /1310 = 0.76 ghz.

Using Actual Times to Completion Consider the actual times for completion of the addition. Here we add two arbitrary 16-bit numbers. 0 1 1 0 0 1 1 1 0 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 a g p p p g g g a p g g g p p g 0 1 4 3 2 1 1 1 0 2 1 1 1 3 2 1 Each column is labeled with a, g, or p. A g stage generates a carry. It consumes any carry coming into its right hand side. We start the propagation time at each g stage. A p stage propagates a carry. It adds one unit of delay to the timing of the input carry to its right. An a stage annihilates an incoming carry. No further propagation of its input carry occurs. The timings of the output carries of the stages are given in the bottom row above.

Using Actual Times to Completion A Carry Chain is defined as the number of stages between and including a g stage and the next g or a stage to its left. The longest carry chain has length 4 in the above example. The theoretical mean carry chain length if a very large number of pairs of random k-bit values are added is very close to 2.0. The theoretical mean of the longest carry chain in each those additions is about log 2 k These results suggest that an adder could be designed to report when it was finished instead of waiting for the worst case carry propagation time. The CDC 6600 (first supercomputer introduced in the 1960 s) used such a scheme, but the logic of the control circuitry is complicated by variable-time operations, so most modern computers do not use this idea.

Using Actual Times to Completion Here are the results of an experiment in which 100,000 pairs of random 16 bit values were added. It shows the number of times that the carry chain and the longest carry chain were equal to Length. Length carry chain longest carry chain 1 449698 1033 2 211799 8591 3 99376 25787 4 47046 28229 5 21939 18393 6 10080 9472 7 4650 4560 mean carry chain length = 1.87 8 2145 2138 9 986 985 mean of the longest carry chain = 3.27 10 476 476 11 202 202 12 84 84 13 31 31 14 11 11 15 7 7

Carry Lookahead Adders This topic is non-trivial, so only a sketch is given. In each column of the adder we produce two signals, p i and g i, with the meanings given above. p i = x i y i Exclusive OR g i = x i.y i And From these propagate and generate signals all the carry signals may be produced: c i = g i 1 + c i 1 p i 1 = g i 1 + (g i 2 + c i 2 p i 2 )p i 1... = g i 1 + g i 2 p i 1 + c i 2 p i 2 p i 1 = g i 1 + g i 2 p i 1 + g i 3 p i 2 p i 1 + g i 4 p i 3 p i 2 p i 1 + c i 4 p i 4 p i 3 p i 2 p i 1

CLA-4 block In practice we must limit the fan-in of the gates to 3 or 4, depending on the technology. If the limit is 4 inputs then the largest CLA block that we can have is a CLA-4. It produces signals c 3, c 2, and c 1 from p 2, p 1, p 0, g 2, g 1, g 0 and c in : c 3 = g 2 + g 1 p 2 + g 0 p 2 p 1 + c in p 2 p 1 p 0 c 2 = g 1 + g 0 p 1 + c in p 1 p 0 c 1 = g 0 + c in p 0

Carry Lookahead Adders Each carry signal is a simple sum of products of p i and g i signals. Each product term is the logical And of p i and g i signals and these product terms are Or d together. Here is an And-Or-Invert circuit that produces c 3. Nand gates would be used in practice. The delay through these circuits is 2D, where D is the nominal gate delay. g 2 g 1 p 2 g 0 p 1 p 2 c 3 c in p 0 p 1 p 2

CLA-4 block Here is the right-most CLA block, showing the 3 carry signals that it generates: y 4 x 4 y 3 x 3 y 2 x 2 y 1 x 1 y 0 x 0 p 4 g 4 p 3 g 3 p 2 g 2 p 1 p 0 g 1 g 0 c 5 c 3 c 2 CLA-4 c 1 c in c 4 Where does c 4 come from?

Group G and P signals We add group generate and propagate signals to the CLA block: G 0 = g 3 + g 2 p 3 + g 1 p 3 p 2 + g 0 p 3 p 2 p 1 P 0 = p 3 p 2 p 1 p 0 And add a second layer of CLA-4 blocks.

Group G and P signals y15 x 15 y14x 14 y13x 13 y12x 12 y11x 11 y10x 10 y9 x 9 y8 x 8 y7 x 7 y6 x 6 y5 x 5 y4 x 4 y3 x 3 y2 x 2 y1 x 1 y0 x 0 p p g p g p g c 15 c14 c 13 CLA-4 g c 12 p p g c 11 g p 8 p 9 g 9 c 10 c 9 CLA-4 g 8 c8 p 7 p 6 g 7 g 6 p 4 p 5 g 5 c 7 c 6 c 5 CLA-4 g 4 c 4 p 3 p 2 g 3 g 2 p 0 p 1 g 1 c 3 c 2 c 1 CLA-4 g 0 c in P3 G3 P2 G2 P1 G1 c 12 c 8 c 4 P0 G0 P 2 G 2 P 0 P 3 G 3 P 1 G 1 c 1 G 0 c 3 c 2 CLA-4 PP3 GG 3

Tree of CLA blocks The process can be continued to generate a tree of CLA blocks. X23 Y23 X0Y0 c23 c22 CLA-3 c20 c19 CLA-3 c17 c16 CLA-3 c14 c13 CLA-3 c11 c10 CLA-3 c8 c7 CLA-3 c5 c4 CLA-3 c2 c1 CLA-3 p,g P,G Cin c21 CLA-2 c15 CLA-3 c12 c6 CLA-3 c3 P,G c18 CLA-3 c9 P,G Xi Yi not used Si (not shown) MFA Ci pi gi 24-bit adder based on a tree of CLA-3 blocks

The Longest Delay Path X23 Y23 X0Y0 c23 c22 CLA-3 c20 c19 CLA-3 c17 c16 CLA-3 c14 c13 CLA-3 c11 c10 CLA-3 c8 c7 CLA-3 c5 c4 CLA-3 c2 c1 CLA-3 p,g P,G Cin c21 CLA-2 c15 CLA-3 c12 c6 CLA-3 c3 P,G c18 CLA-3 c9 P,G Xi Yi not used Si (not shown) MFA Ci pi gi The Longest Path is Shown in bold

Time to add in a Tree of CLA-4 blocks There are h = log 4 k levels of CLA blocks. The delay time will be: Produce p,g signals in MFAs 2D Pass down through h levels of CLA blocks 2Dh Pass up through h 1 levels of CLA blocks 2D(h 1) Compute final sum in MFAs 3D Total time to add 5D + (4log 4 k 2)D For a 64-bit adder the time is 15D, as opposed to 129D with a ripple-carry adder. D is the nominal gate delay. The cost of the speedup is a 77% increase in the number of transistors in the adder.

Binary Signed Digit Number System Consider a number system with k-digits, each of which is selected from the set { 1, 0, 1}. We write the digit set as {1, 0, 1} The system is redundant, meaning that all values, other than zero, have multiple representations: 5 10 = 101 2 = 111 2 = 1011 2 = 1101 2 etc. We use the subscript 2 to denote numbers in this system. Two bits are required in order to represent each digit. The usual convention is as follows: Value of digit d i d + i d i 0 0 0 1 0 1 1 1 0

Conversion to Binary Signed Digit Numbers A number X can be thought of as two bit vectors: If X = 1100111 then X + = 1000100 and X = 0100011. Conversion from 2 s complement binary to BSD is simple and takes constant time: a k -1 a k-2 a k-3 a k-4 a i a 1 a 0 A 0 x + k-2 x + k-3 x + k-4 x + i x + 1 x + 0 X + x - k -1 0 0 0 0 0 0 X -

Conversion from Binary Signed Digit Numbers Conversion back to 2 s complement binary requires a subtraction: A = X + X and is performed using a CLA adder in log k time. There is no point in converting 2 s complement binary numbers to BSD to do single additions or subtractions since the conversion time from BSD to binary is equal to the time to add or subtract 2 s complement binary numbers. But when many additions or subtractions are required, as in a multiply, the conversion time is worthwhile. BSD adders take constant time by eliminating carry propagation.

Carry-Free Addition with BSD The adder for two BSD numbers, X and Y, comprises two stages. In the first stage of each column input digits x i and y i are combined to form an intermediate sum s i and an intermediate carry c i+1. These are all BSD values. In the second stage, the sum bit s i is combined with the carry-in from the previous stage, c i, to generate the result bit z i : x i+1 y i+1 x i y i x i-1 y i-1 e i+1 e i Stage 1 Si+1 s i S i-1 c i+1 c i Stage 2 z i+1 z i

Carry-Free Addition with BSD When x i + y i = 1 or x i + y i = 1 redundancy enables us to choose the values of the intermediate sum and carry so that the second stage can absorb the carry from the position to the right. Consider the following columns: The results can equally be col i j x i 1 1 y i 0 0 col i j c i+1, s i = 0,1 c j+1, s j = 0, 1 or or c i+1, s i = 1, 1 c j+1, s j = 1, 1

Carry-Free Addition with BSD The second stage sum, z i, must fit into the BSD digit range {1, 0, 1}. If a carry into stage i from the right is 1, we must choose the result, c i+1, s i = 1, 1 for column i. If a carry into stage j is 1, we must choose the result, c j+1, s j = 1, 1 for column j. But we cannot know the exact value of that carry-in.

Carry-Free Addition with BSD We can, however, bound the carry-in to one of two ranges, c i {0, 1} or c i {1, 0}. The signal e i tells which range applies: x i 1 y i 1 range of c i e i 1 1 {0, 1} 0 1 0 {0, 1} 0 0 1 {0, 1} 0 1 1 0 0 0 0 0 0 1 1 0 0 0 1 {0, 1} 1 1 0 {0, 1} 1 1 1 {0, 1} 1

Carry-Free Addition with BSD Below is a table of possible intermediate values for the 9 values of x i and y i. Inputs Intermediate Results Intermediate Results for c i {0, 1} for c i {0, 1} x i y i c i+1 s i c i+1 s i 1 1 1 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 0 0 1 1 1 1 1 1 0 1 0

Longest Delay Path in a BSD Adder The longest paths through the adder are shown in bold x i+1 y i+1 x i y i x i-1 y i-1 e i+1 e i Stage 1 s i c i+1 c i Stage 2 z i+1 z i Addition is a constant time process.

BSD Addition Example col = 9 8 7 6 5 4 3 2 1 0 0 1 1 1 0 1 0 1 1 1 X = 333 0 0 1 1 1 1 0 1 0 1 Y = 43 1 1 0 0 1 0 0 1 0 0 e 0 1 0 0 1 0 0 0 1 0 s 1 1 1 0 1 0 1 0 1 0 c 1 0 1 0 0 0 1 0 0 0 Z = 376 In column 1, x 1 = 1, y 1 = 0. The results could be c 2, s 1 = 0, 1 or c 2, s 1 = 1, 1 c 2, s 1 = 0, 1 is chosen because c 1 {0, 1} In column 5, x 5 = 0, y 5 = 1. The results could be c 6, s 5 = 0, 1 or c 6, s 5 = 1, 1 c 6, s 5 = 0, 1 is chosen because c 5 {0, 1} In column 8, x 8 = 1, y 8 = 0. The results could be c 9, s 8 = 0, 1 or c 9, s 8 = 1, 1 c 9, s 8 = 1, 1is chosen because c 8 {0, 1} No choice is available or needed in any other column

Multiplication using BSD In the simplest k-bit multiplier k partial products must be added. A tree of BSD adders can perform the additions in log 2 k time. Partial Products p 0 + Tree of BSD Adders p 1 p 2 + + + p 3 p 4 p 5 p k-2 p k-1 +.. + + Log k 2 + + CLA Adder BSD Product Binary Product

Residue Number System A Residue Number System (RNS) enables constant time addition, subtraction, and multiplication. Its great speed comes from a simple theorem from number theory: x y mod m = (x mod m y mod m) mod m where the binary operator {+,, }. For example (x + y) mod m = ((x mod m) + (y mod m)) mod m and (x y) mod m = ((x mod m) (y mod m)) mod m

Residue Number System Each RNS is based on a set of co-prime moduli: We will use RNS(8,7,5,3) in our examples. A decimal number x, if it lies within the range of the RNS, has a unique RNS representation given by its residues modulo the RNS moduli: Conversion may be carried out by parallel processes, one for each modulus: x RNS(8,7,5,3) = (x mod 8, x mod 7, x mod 5, x mod 3) RNS(8,7,5,3) has range [0, (8 7 5 3) 1] = [0, 839]. Larger ranges are obtained by adding more moduli. See later.

RNS Addition Example 123 10 = (3, 4, 3, 0) RNS(8,7,5,3), 431 10 = (7, 4, 1, 2) RNS(8,7,5,3) The sum in RNS is ((3+7) mod 8, (4+4) mod 7, (3+1) mod 5, (0+2) mod 3) = (2, 1, 4, 2) To check the result, let s convert the decimal answer to RNS: (123 + 431) = 554 10 = (2, 1, 4, 2) RNS(8,7,5,3)

RNS Multiplication Example 23 10 = (7, 2, 3, 2) RNS(8,7,5,3), 33 10 = (1, 5, 3, 0) RNS(8,7,5,3) The product in RNS is ((7 1) mod 8, (2 5) mod 7, (3 3) mod 5, (2 0) mod 3) = (7, 3, 4, 0) The decimal answer is 23 33 = 759 10 = (7, 3, 4, 0) RNS(8,7,5,3)

RNS Speed The potential for high speed arithmetic comes from the ability to calculate the result in each column independent of other columns, and therefore in parallel. There is no carry propagation between the columns. In each column the residues will most likely be represented by unsigned binary integers. If a particular column requires p bits to store its residues, a p bit adder (or a table-lookup scheme) will be needed. The speed of that column adder is determined by p. The speed of RNS arithmetic is therefore (partly) determined by the largest column width.

Negative Values RNS(3, 2) can represent 6 values. This range can be used to represent any interval of 6 consecutive integers on the number line. The interval [0, 5] is an obvious choice. RNS(3, 2) can also represent the range [ 3, 2]. This done with the following mapping: Decimal RNS -3 (0,1) -2 (1,0) -1 (2,1) 0 (0,0) 1 (1,1) 2 (2,0) Each negative value is represented by the complement of its magnitude. To complement (r 1, r 2 ), subtract each residue from its modulus. If r i = 0, its complement is also 0. The negative of (1,1) is (2,1), and the negative of (2,0) is (1,0).

Conversion from 2 s Complement Binary We convert the magnitude to RNS and then take its complement if the 2 s complement value was negative. Conversion of the magnitude is achieved via a table of RNS values of the powers of 2. If we want to convert 32 bit 2 s complement values, we would need a table of the RNS equivalents of 1, 2, 4, 8,..., 2 31. For each one in the input binary number we look up its RNS representation in the table and add that value to a total using the RNS parallel adder system. In the worst case 32 additions would take place. Conversion from RNS back to 2 s complement binary is a little tricky and is beyond our allotted time.

Using Low-Cost Moduli Say we need an RNS with a range of at least that of a 16 bit binary field, [0, 65,535]. Low-cost moduli have the following form: RNS(2 a k 2, 2 a k 2 1, 2 a k 3 1,, 2 a 1 1, 2 a 0 1) (Two values 2 a 1 and 2 b 1 are co-prime if and only if a and b are co-prime.) We form a simple sequence and build on it as follows: RNS Range RNS(2 3, 2 3 1, 2 2 1) 168 RNS(2 4, 2 4 1, 2 3 1) 1680 RNS(2 5, 2 5 1, 2 3 1, 2 2 1) 20,832 RNS(2 5, 2 5 1, 2 4 1, 2 3 1) 104,160 The final system, RNS(32, 31, 15, 7), has range 1.59 times what is needed. Its widest column is 5 bits, but the speed of arithmetic is greatly helped by using low-cost moduli.

Disadvantages of RNS Division and comparison are difficult operations in RNS. If both cases numbers are usually converted to a Mixed Radix System to perform these tasks. Fortunately almost all signal processing is done with multiplication and addition. RNS has been used in high-speed radar system.