HIGH-SPEED MULTI OPERAND ADDITION UTILIZING FLAG BITS VIBHUTI DAVE DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING


HIGH-SPEED MULTI OPERAND ADDITION UTILIZING FLAG BITS

BY
VIBHUTI DAVE
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering in the Graduate College of the Illinois Institute of Technology

Approved
Adviser
Co-Adviser

Chicago, Illinois
May 2007

ACKNOWLEDGEMENT

I would like to thank my mentor Dr. Erdal Oruklu for his constant support and undue faith in me. I highly appreciate the time he has invested during my research and for the completion of this dissertation. This dissertation would not have been possible without Dr. Jafar Saniie, my advisor, and his attempts to challenge me throughout my academic program, encouraging me when I was successful and pushing me to do better when I fell short. I would also like to thank Dr. Dimitrios Velenis and Dr. James Stine for their constructive criticism about my work and for helping me to perform better. A special thanks to the committee members for their support and time.

TABLE OF CONTENTS

ACKNOWLEDGEMENT
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1. INTRODUCTION
    Motivation
    Goals
    Structure of Thesis

CHAPTER 2. DESIGN CRITERIA AND IMPLICATIONS
    Arithmetic Operations and Units
    Circuit and Layout Design Techniques
    Automated Circuit Synthesis and Optimization
    Circuit Complexity and Performance Measures
    Summary

CHAPTER 3. ADDER DESIGNS
    1-Bit Adders
    Carry Propagate Adders
    Carry Select Adders
    Carry Skip Adders
    Carry Save Adders
    Parallel Prefix Adders
    Summary

CHAPTER 4. LOGICAL EFFORT
    Delay in a Logic Gate
    Multistage Logic Networks
    Choosing the Best Number of Stages
    Summary of the Method
    Summary

CHAPTER 5. FLAGGED PREFIX ADDITION
    Background Theory of Flagged Prefix Addition
    Implementation of a Flagged Prefix Adder
    Modifications to a Prefix Adder
    Delay Performance of a Flagged Prefix Adder
    Fixed Point Arithmetic Applications
    Summary

CHAPTER 6. THREE-INPUT ADDITION
    Carry-Save Adders
    Multi-Operand Adders
    Flag Logic Computation
    Constant Addition
    Three-Input Addition
    Gate Count
    Summary

CHAPTER 7. ANALYSIS AND SIMULATION
    Logical Effort
    Simulation Results
    Summary

CHAPTER 8. CONCLUSIONS AND FUTURE WORK

BIBLIOGRAPHY

LIST OF TABLES

Logical Effort for Inputs of Static CMOS Gates
Estimates of Parasitic Delay for Logic Gates
Best Number of Stages for Path Efforts
Summary of Terms and Equations for Logical Effort
Selection Table for a Flagged Prefix Adder
Flag and Carry Logic Based on Third Operand
Flag Logic Utilizing Carry from the Prefix Tree
Minimum Flag Logic Gates
Logic Gate Combinations
Gate Count for All Adder Implementations
Logical Effort and Path Delays for Adder Blocks
Logical Effort and Path Delays for Gates within Adder Blocks
Logical Effort Estimates for Conventional Adder Designs
Logical Effort Estimates for Flagged Adder Designs
Logical Effort Estimates for Three-Input Flagged Adder Designs
Post Layout Estimates for Conventional Adder Architectures
Post Layout Estimates for Flagged Adder Architectures
Post Layout Estimates for Enhanced Adder Architectures with Constant Addition
Post Layout Estimates for Three-Input Adder Architectures

LIST OF FIGURES

(m,k) Counter
Symbol and Logic for Half Adder
Symbol and Logic for Full Adder
Ripple Carry Adder
m-bit Conditional Sum Adder
Carry-Select Adder
Carry-Skip Block
Carry-Skip Adder
Optimal Size for Carry-Skip Adder
Carry-Save Adder
Parallel Prefix Adder
Logic and Symbol for Pre-Processing Gates
Logic and Symbol for Prefix Tree Gates
Logic and Symbol for Post-Processing Gates
Brent-Kung Prefix Tree
Ladner-Fischer Prefix Tree
Kogge-Stone Prefix Tree
Electrical Effort vs. Delay
Dual Adder Design
Flagged Prefix Adder
Flagged Inversion Cells
Output Cell Logic for the Flagged Prefix Adder
Block Diagram of a Sign Magnitude Adder
Symbol and Schematic of Carry-Save Adder
Four Operand Carry-Propagate Adder Array
Four Operand Carry-Save Adder Array with Final CPA
Typical Array Adder Structure for Multi-Operand Addition
Flag Inversion Cells for Constant Addition
Flag Inversion Cells for Three-Input Addition
Full Adder used within Carry-Skip, Carry-Select, and Carry-Save Adders
One Bit FIC for Three-Input Addition
Dot Diagram of a Carry-Save Adder
Area Results for Conventional Adder Designs
Delay Results for Conventional Adder Designs
Power Results for Conventional Adder Designs
Area Results for Enhanced Adder Designs
Delay Results for Enhanced Adder Designs
Power Results for Enhanced Adder Designs
Area Results for Three-Input Adder Designs
Delay Results for Three-Input Adder Designs
Power Results for Three-Input Adder Designs
Area Results for 16-bit Designs
Delay Results for 16-bit Designs
Power Results for 16-bit Designs
Area Results for 32-bit Designs
Delay Results for 32-bit Designs
Power Results for 32-bit Designs
Area Results for 64-bit Designs
Delay Results for 64-bit Designs
Power Results for 64-bit Designs

ABSTRACT

The goal of this research is to design arithmetic circuits that meet the challenges faced by computer architects during the design of high-performance embedded systems. The focus is narrowed down to addition algorithms and the design of high-speed adder architectures. Addition is one of the most basic operations performed in all computing units, including microprocessors and digital signal processors. It is also a basic unit utilized in various complicated algorithms of multiplication and division. Various adder architectures for binary addition have been investigated in view of a wide range of performance characteristics, which include delay, area, power, the size of the input operands, and the number of input operands. Efficient implementation of an adder circuit usually revolves around reducing the cost to propagate the carry between successive bit positions. The problem of carry propagation is eliminated by expressing addition as a prefix computation. The resulting adder circuits are called parallel prefix adders. Based on the advantages posed by the prefix scheme, a qualitative evaluation of three different existing prefix adders (Brent-Kung, Ladner-Fischer, and Kogge-Stone) has been performed. A technique to enhance the functionality of these basic designs has also been investigated, enabling their utilization in integer arithmetic. The technique has been named after the new set of intermediate outputs that are generated as part of the algorithm, called the flag bits. This technique targets a new design criterion: the number of results obtained at the output. An algorithm based on flag bits has been proposed to achieve three-operand addition, which has proven to have a favorable overall performance with respect to conventional multi-operand addition schemes.

CHAPTER 1
INTRODUCTION

1.1 Motivation

Besides technological scaling, advances in the field of computer architecture have also contributed to the exponential growth in performance of digital computer hardware. The flip side of the rising processor performance is an unprecedented increase in hardware and software complexity. Increasing complexity leads to high development costs, difficulty with testability and verifiability, and less adaptability. The challenge in front of computer designers is therefore to opt for simpler, robust, and easily certifiable circuits. Computer arithmetic plays a key role here, aiding computer architects with this challenge. It is one of the oldest sub-fields of computer architecture. The bulk of hardware in earlier computers resided in the accumulator and other arithmetic/logic circuits. Successful operation of computer arithmetic circuits was taken for granted, and high performance of these circuits has been routinely expected. This context has been changing for several reasons. First, at very high clock rates, the interfaces between arithmetic circuits and the rest of the processor become critical. Arithmetic circuits can no longer be designed and verified in isolation. Rather, an integrated design optimization is required. Second, optimizing arithmetic circuits to meet the design goals by taking advantage of the strengths of new technologies, and making them tolerant to the weaknesses, requires a re-examination of existing design paradigms. Finally, incorporation of higher-level arithmetic primitives into hardware makes the design, optimization, and verification efforts highly complex and interrelated.

The core of every microprocessor, digital signal processor (DSP), and data-processing application-specific integrated circuit (ASIC) is its datapath. With respect to the most important design criteria (critical delay, chip size, and power dissipation), the datapath is a crucial circuit component. The datapath comprises various arithmetic units, such as comparators, adders, and multipliers [43]. The basis of every complex arithmetic operation is binary addition. Hence, it can be concluded that binary addition is one of the most important arithmetic operations. The hardware implementation of an adder becomes even more critical due to the expensive carry-propagation step, the evaluation time of which is dependent on the operand word length. The efficient implementation of the addition operation in an integrated circuit is a key problem in VLSI design [58]. Productivity in ASIC design is constantly improved by the use of cell-based design techniques such as standard cells, gate arrays, and field-programmable gate arrays (FPGA), and by low-level and high-level hardware synthesis [13]. This calls for adder architectures that result in efficient cell-based circuit realizations which can easily be synthesized. Furthermore, they should provide enough flexibility to accommodate custom timing and area constraints as well as to allow the implementation of customized adders.

1.2 Goals

The following goals have been formulated for this research:

- Establish the performance criteria for an adder design, which include speed, area, power, size of input operands, number of input operands, and number of useful results obtained at the output.

- Evaluate the performance of conventional adder architectures and compare them, with a focus on circuit implementation.
- Study the mathematics behind Flagged Prefix Adders [8] and evaluate the performance of this adder design with different prefix trees.
- Derive the logic to increase the number of input operands a prefix adder can add.
- Design a three-input binary adder utilizing the concept of flag bits, where the third operand is application-independent. The hardware that needs to be incorporated within a binary adder will be adder-independent.
- Evaluate the performance of the proposed design with respect to various adder architectures in terms of delay, area, and power.
- Obtain delay and area estimates from synthesis and verify the results by applying Logical Effort [50] to the proposed design.

1.3 Structure of the Thesis

As a starting point, the basic design criteria are discussed and their implications established in Chapter 2. Optimization techniques that are utilized for the final adder design are presented. It is substantiated why cell-based combinational adders and their synthesis are important in VLSI design. Chapter 3 introduces conventional adder architectures, which include the Carry-Propagate, Carry-Select, Carry-Skip, and Parallel-Prefix adders [31]. The Carry-Save adder, which is a multi-operand adder, is also introduced. It is this adder architecture that is used as a benchmark to evaluate the performance of the technique proposed in this thesis.

Chapter 4 gives a brief introduction to the method of logical effort and its application to digital circuits. This forms the basis of verification of the results obtained via synthesis. The theory of flagged prefix addition [6] is introduced in Chapter 5, which also discusses the applications of flagged prefix adders and the advantages presented by these designs. Chapter 6 extends the theory of flagged prefix addition to enhance the functionality of a binary adder to three-input addition, so that the number of input operands is no longer limited to two. The necessary hardware implementation is derived, initially considering that the third input is a constant. The resulting adder designs are called Enhanced Flagged Binary Adders (EFBA) [17]. This is followed by a hardware optimization to enable three-input addition independent of whether the third operand is a constant or a variable. The final adder designs are referred to as Three-Input Flagged Prefix Adders (TIFPA). Chapter 7 investigates the performance of conventional, flagged prefix, EFBA, and TIFPA designs theoretically as well as based on simulation results. The method of logical effort is applied to all the designs to check the synthesis results against the analytical results. Conclusions are drawn and potential for future work is presented in Chapter 8.

CHAPTER 2
DESIGN CRITERIA AND IMPLICATIONS

This chapter formulates the motivation for the research presented in this thesis by focusing on questions like:

- Why is the efficient implementation of binary adders important?
- What will be the key layout design technologies in the future, and why do cell-based design techniques, such as standard cells, gain more and more importance?
- Why is hardware synthesis becoming a key issue in VLSI design?
- How can area, delay, and power measurements of combinational circuits be estimated and optimized?
- How can the performance and complexity of adder circuits be modeled by taking into account architectural, circuit, layout, and technology aspects?

This chapter summarizes the techniques utilized during the design and implementation of digital logic circuits, along with their advantages and disadvantages.

2.1 Arithmetic Operations and Units

The tasks of a VLSI chip are the processing of data and the control of internal or external system components. This is typically done by algorithms which are based on logic and arithmetic operations on data items [10]. Applications of arithmetic operations in integrated circuits are manifold. Microprocessors and DSPs typically contain adders and multipliers in their datapath. Special circuit units for fast division and square-root operations are sometimes included as well. Adders, incrementers/decrementers, and comparators are often used for address calculation and flag generation purposes in controllers. ASICs use arithmetic units for the same purposes. Depending on their application, they may even require dedicated circuit components for special arithmetic

operators, such as for finite field arithmetic used in cryptography, error correction coding, and signal processing. Some of the basic arithmetic operations are listed below [58]:

- Shift/extension operations
- Equality and magnitude comparison
- Incrementation/decrementation
- Negation
- Addition/subtraction
- Multiplication
- Division
- Square root
- Exponentiation
- Logarithmic functions
- Trigonometric functions

For trigonometric and logarithmic functions as well as for exponentiation, various iterative algorithms exist which make use of simpler arithmetic operations. Multiplication, division, and square root can be performed using serial or parallel methods. In both methods, the computation is reduced to a sequence of conditional additions/subtractions and shift operations. Existing speed-up techniques try to reduce the number of required addition/subtraction operations to improve their speed. Subtraction corresponds to the addition of a negated operand. The addition of two n-bit binary numbers can be regarded as an elementary operation. The algorithm for negation of a number depends on the chosen number representation [42] and is usually accomplished by bit inversion and incrementation.
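As a minimal sketch of the last two points, the following Python fragment negates an n-bit word by bit inversion followed by incrementation and then performs subtraction as addition of the negated operand. It assumes a two's complement representation, which the text does not fix; the function names `negate` and `subtract` and the word-length parameter `n` are illustrative only.

```python
def negate(x: int, n: int) -> int:
    """Negate an n-bit word, assuming two's complement: invert, then increment."""
    mask = (1 << n) - 1          # n ones, e.g. 0xFF for n = 8
    inverted = x ^ mask          # bit inversion
    return (inverted + 1) & mask # incrementation, truncated to n bits


def subtract(a: int, b: int, n: int) -> int:
    """Compute (a - b) mod 2**n as the addition of a negated operand."""
    mask = (1 << n) - 1
    return (a + negate(b, n)) & mask


if __name__ == "__main__":
    print(negate(5, 8))       # bit pattern of -5 in 8 bits
    print(subtract(9, 5, 8))  # 9 - 5
```

In hardware the increment is usually folded into the adder as a carry-in of 1, so the negation costs only the inverters; the sketch keeps the two steps separate for readability.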

Increment and decrement operations are simplified additions with one input operand being constantly 1 or -1. Equality and magnitude comparison operations can also be regarded as simplified additions in which only some of the respective addition flags, but no sum bits, are used as outputs [58]. This short overview shows that addition is the key arithmetic operation, on which most other operations are based. Its implementation in hardware is therefore crucial for the efficient realization of almost every arithmetic unit in VLSI, in terms of circuit size, computation delay, and power consumption. Addition is a prefix problem [34], which means that each result bit depends on all input bits of equal or lower magnitude. Propagation of a carry signal from each bit position to all higher bit positions is necessary. Carry-propagate adders [31] perform this operation immediately. The required carry propagation from the least to the most significant bit results in a considerable circuit delay, which is a function of the word length of the input operands [58]. The most efficient way to speed up addition is to avoid carry propagation, thus saving the carries for later processing. This allows the addition of two or more numbers in a very short time, but yields results in a redundant number representation [20]. The redundant representation forms the basis of carry-save adders [20]. They play an important role in the efficient implementation of multi-operand addition circuits. They are very fast and their structure is simple, but the potential for further optimization is minimal. The binary carry-propagate adder is therefore one of the most often used and most crucial building blocks in digital VLSI design. Various well-known methods exist for speeding up carry propagation in adders, offering very different performance

characteristics, advantages, and disadvantages. Instances of adder architectures targeting important design criteria are listed below.

- Delay: Kogge-Stone parallel prefix adder [32]
- Area: Carry-ripple adder [44]
- Power: 2-level carry-skip adder [7]
- Size of input operands: Brent-Kung parallel prefix adder [5]
- Number of input operands: Carry-save multi-operand adder [42]
- Number of results at output: Flagged parallel-prefix adder [8]

The performance measure of each algorithm is also dependent on the design techniques that are employed to create each circuit. The next section presents the circuit and layout design techniques.

2.2 Circuit and Layout Design Techniques

IC fabrication technologies can be classified into full-custom, semi-custom, and programmable ICs. Further distinctions are made with respect to circuit design techniques and layout design techniques, which are strongly related [58].

Layout-Based Design Techniques. In layout-based design techniques, dedicated full-custom layout is drawn manually for circuits designed at the transistor level. The initial design effort is very high, but maximum circuit performance and layout efficiency are achieved. Full-custom cells are entirely designed by hand for dedicated high-performance units, e.g., arithmetic units. The tiled-layout technique can be used to simplify, automate, and parameterize the layout task. For reuse purposes, the circuits and layouts are often collected in libraries together with automatic generators. Mega-cells are full-custom cells for universal functions which need no parameterization. Macro-cells are

used for large circuit components with regular structure that need word-length parameterization. Datapaths are usually realized in a bit-sliced layout style, which allows parameterization of word length and concatenation of arbitrary datapath elements for logic, arithmetic, and storage functions. Since adders are too small to be implemented as macro-cells, they are usually realized as datapath elements.

Cell-Based Design Techniques. At a higher level of abstraction, arbitrary circuits can be composed from elementary logic gates and storage elements contained in a library of pre-designed cells. The layout is automatically composed from corresponding layout cells using dedicated layout strategies, depending on the IC technology used. Cell-based design techniques are used in standard-cell, gate-array, sea-of-gates, and field-programmable gate-array (FPGA) technologies. The design of logic circuits does not differ considerably among the different cell-based IC technologies. Circuits are obtained from schematic entry, behavioral synthesis, or circuit generators (structural synthesis). Due to the required generic properties of the cells, more conventional logic styles have to be used for their circuit implementation [58]. The advantages of cell-based design techniques lie in their universal usage, automated synthesis and layout generation for arbitrary circuits, portability between tools and libraries, high design productivity, high reliability, and high flexibility in floorplanning. This comes at the price of lower circuit performance with respect to speed and area. Cell-based design techniques are mainly used for the implementation of random logic and custom circuits for which no appropriate library components are available and custom implementation proves costly. Cell-based design techniques are widely used in the ASIC design community.

Standard Cells

Standard cells [22] [49] represent the highest-performance cell-based technology. The layout of the cells is full-custom, which mandates full-custom fabrication of the wafers. This in turn enables the combination of standard cells with custom-layout components on the same die. For layout generation, the standard cells are placed in rows and connected through intermediate routing channels. With the increasing number of routing layers and the over-the-cell routing capabilities in modern process technologies, the layout density of standard cells gets close to the density obtained from full-custom layout. The remaining drawback is the restricted use of high-performance circuit techniques [58].

Gate-Arrays and Sea-of-Gates

On gate-arrays and sea-of-gates, pre-processed wafers with unconnected circuit elements are used. Thus, only the metallization used for the interconnect is customized, resulting in lower production costs and faster turnaround times. Circuit performance and layout flexibility are lower than for standard cells, which in particular decreases the implementation efficiency of regular structures such as macro-cells.

FPGA Cells

Field-programmable gate-arrays (FPGA) [29] are electronically programmable generic ICs. They are organized as an array of logic blocks and routing channels, and the configuration is stored in a static memory or programmed, e.g., using anti-fuses. Again, a library of logic cells and macros allows flexible and efficient design of arbitrary circuits. Turnaround times are very fast, making FPGAs the ideal solution for rapid prototyping. On the other hand, low circuit performance, limited circuit complexity, and high die costs

severely limit their area of application [29]. The following section gives a synopsis of the method employed for the completion of this research.

2.3 Automated Circuit Synthesis and Optimization

Circuit synthesis denotes the automated generation of logic networks from behavioral descriptions at an arbitrary level. Synthesis is becoming a key issue in VLSI design for many reasons. Increasing circuit complexities, shorter development times, as well as efficient and flexible usage of cell and component libraries can only be handled with the aid of powerful design automation tools. Arithmetic synthesis addresses the efficient mapping of arithmetic functions onto existing arithmetic components and logic gates.

High-Level Synthesis. High-level synthesis, or behavioral synthesis, allows the translation of algorithmic or behavioral descriptions at a high abstraction level down to a Register Transfer Level (RTL) representation, which can be processed further by low-level synthesis tools. High-level arithmetic synthesis makes use of arithmetic transformations in order to optimize hardware usage under given performance criteria. Thereby, arithmetic library components are regarded as resources for implementing the basic arithmetic operations [58].

Low-Level Synthesis. Low-level synthesis, or logic synthesis, translates an RTL specification into a generic logic network. For random logic, synthesis is achieved by establishing the logic equations for all outputs and implementing them in a logic network.

Data-Path Synthesis. Efficient arithmetic circuits contain very specific structures of large logic depth and high factorization degree. Their direct synthesis from logic equations is not feasible. Therefore, parameterized netlist generators using dedicated

algorithms are used instead. Most synthesis tools include generators for the basic arithmetic functions, such as comparators, incrementers, adders, and multipliers. For other important operations, such as squaring and division, usually no generators are provided and thus synthesis of efficient circuitry is not available. Also, the performance of the commonly used architectures varies considerably, which often leads to sub-optimal cell-based circuit implementations [58].

Optimization of Combinational Circuits. The optimization of combinational circuits connotes the automated minimization of a logic netlist with respect to delay, area, and power dissipation measures of the resulting circuit, and the technology mapping. The applied algorithms are very powerful for the optimization of random logic by performing steps like flattening, logic minimization [29], timing-driven factorization, and technology mapping. However, the potential for optimization is limited for networks with large logic depth and high factorization degree, especially arithmetic circuits. Therefore, only local logic minimization is possible, leaving the global circuit architecture basically unchanged. Thus the realization of well-performing arithmetic circuits relies more on efficient datapath synthesis than on simple logic optimization.

Hardware Description Languages. Hardware description languages (HDL) allow the specification of hardware at different levels of abstraction, serving as entry points to hardware synthesis. Verilog is one of the most widely used and powerful languages. It enables the description of circuits at the behavioral and structural level. Synthesis of arithmetic units is initiated by using standard arithmetic operator symbols in the Verilog code [45], for which the corresponding built-in netlist generators are called by the synthesis tool. The advantages of utilizing Verilog over schematic entry lie in the

possibility of behavioral hardware description, the parameterizability of circuits, and the portability of code. Now that it has been established how to design the schematics, the next section focuses on the design criteria for VLSI circuits.

2.4 Circuit Complexity and Performance Measures

One important aspect in design automation is the complexity and performance estimation of a circuit. At a higher design level, this is achieved by using characterization information of the high-level components to be used and by complexity estimation of the interconnect [58]. At gate level, however, estimation is more difficult and less accurate because circuit size and performance strongly depend on the gate-level synthesis results and on the physical cell arrangement and routing. Estimates of the expected area, speed, and power dissipation for a compiled cell-based circuit can be found as a function of the operand word length.

Area. Silicon area on a VLSI chip is taken up by the active circuit elements and their interconnections. In cell-based design techniques, the following criteria for area modeling can be formulated [19]:

- Total circuit complexity ($GE_{total}$) can be measured by the number of gate equivalents. (1 GE = one 2-input NAND gate = 4 MOSFETs)
- Circuit area ($A_{circuit}$) is occupied by logic cells and inter-cell wiring. ($A_{circuit} = A_{cells} + A_{wiring}$)
- Total cell area ($A_{cells}$) is proportional to the number of transistors or $GE_{total}$ contained in a circuit. This number is influenced by technology mapping. ($A_{cells} \propto GE_{total}$)
- Wiring area ($A_{wiring}$) is proportional to the total wire length. ($A_{wiring} \propto L_{total}$)

- Total wire length ($L_{total}$) can be estimated from the number of nodes and the average wire length of a node, or more accurately from the sum of cell fan-out and the average wire length of cell-to-cell connections. The wire lengths also depend on circuit size, circuit connectivity, and layout topology. ($L_{total} \propto FO_{total}$)
- Cell fan-out ($FO$) is the number of cell inputs a cell output is driving. Fan-in is the number of inputs to a cell, which for many combinational gates is proportional to the size of the cell. Since the sum of the cell fan-out ($FO_{total}$) of a circuit is equivalent to the sum of the cell fan-in, it is also proportional to the circuit size. ($FO_{total} \propto GE_{total}$)

Delay. Propagation delay in a circuit is determined by the cell and interconnection delays on the critical path. Individual cell and node values are relevant for path delays. Critical path evaluation is done by static timing analysis, which involves graph-based search algorithms. Timings are also dependent on temperature, voltage, and process parameters [54].

- Maximum delay ($t_{crit\text{-}path}$) of a circuit is equal to the sum of cell inertial delays, cell output ramp delays, and wire delays on the critical path. ($t_{crit\text{-}path} = \sum_{c \in crit\text{-}path} (t_{cell} + t_{ramp}) + \sum_{w \in crit\text{-}path} t_{wire}$)
- Cell delay ($t_{cell}$) depends on the transistor-level circuit implementation and the complexity of a cell. All simple gates have comparable delays. Complex gates usually contain tree-like circuit and transistor arrangements, resulting in logarithmic delay-to-area dependencies. ($t_{cell} \propto \log(A_{cell})$)

- Ramp delay ($t_{ramp}$) is the time it takes for a cell output to drive the attached capacitive load, which is made up of interconnect and cell input loads. The ramp delay depends linearly on the capacitive load attached, which in turn depends linearly on the fan-out of the cell. ($t_{ramp} \propto FO_{cell}$)
- Wire delay ($t_{wire}$) is the RC delay of a wire, which depends on the wire length. RC delays are negligible compared to cell and ramp delays for small circuits such as adders. ($t_{wire} = 0$)
- A rough delay estimation is possible by considering the sizes and, with a small weighting factor, the fan-out of the cells on the critical path. ($t_{crit\text{-}path} \propto \sum_{c \in crit\text{-}path} (\log(A_{cell}) + FO_{cell})$)

Power. An increasingly important parameter for VLSI circuits is power dissipation. Peak power is a problem with respect to circuit reliability which, however, can be dealt with by careful design. On the other hand, average power dissipation is becoming a crucial design constraint in many modern applications, such as high-performance microprocessors and portable applications, due to heat removal problems and power budget limitations. The following principles hold for average power dissipation in synchronous CMOS circuits [30]:

- Total power ($P_{total}$) in CMOS circuits is dominated by the dynamic switching of circuit elements, whereas dynamic short-circuit currents and static leakage are of less importance. Thus, power dissipation can be assumed proportional to the total capacitance to be switched, the square of the supply voltage, the

clock frequency, and the switching activity $\alpha$ in a circuit. ($P_{total} = \frac{1}{2} \cdot C_{total} \cdot V_{dd}^2 \cdot f_{clk} \cdot \alpha$)
- Total capacitance ($C_{total}$) in a CMOS circuit is the sum of the capacitances from transistor gates, sources, and drains and from wiring. Thus, the total capacitance is proportional to the number of transistors and the amount of wiring, both of which are roughly proportional to circuit size. ($C_{total} \propto GE_{total}$)
- Supply voltage ($V_{dd}$) and clock frequency ($f_{clk}$) can be regarded as constant within a circuit and therefore are not relevant in our circuit comparisons. ($V_{dd}, f_{clk} = \text{constant}$)
- The switching activity factor ($\alpha$) gives a measure for the number of transient nodes per clock cycle and depends on input patterns and circuit characteristics. In many cases, input patterns to datapaths and arithmetic units are assumed to be random, which results in an average transition activity of 50% on all inputs. Signal propagation through several levels of combinational logic may decrease or increase transition activities, depending on circuit structure. ($\alpha = \text{constant}$)
- For arithmetic units with constant input switching activity, power dissipation is approximately proportional to circuit size. ($P_{total} \propto GE_{total}$)

2.5 Summary

Arithmetic units belong to the basic and most crucial building blocks in many integrated circuits, and their performance depends on the efficient hardware implementation of the underlying arithmetic operations. Advances in computer-aided design as well as the ever-growing design productivity demands tend to prefer cell-based

design techniques and hardware synthesis, also for arithmetic components. The following chapter will discuss several adder designs that form the basis of such arithmetic components.

CHAPTER 3
ADDER DESIGNS

Addition is the most common arithmetic operation and also serves as the building block for synthesizing all other operations. Within digital computers, addition is performed extensively, both in explicitly specified computation steps and as a part of implicit ones dictated by indexing and other forms of address arithmetic [43]. In simple ALUs, due to the lack of dedicated hardware for fast multiplication and division, these latter operations are performed as sequences of additions. Subtraction is normally performed by negating the subtrahend and adding the result to the minuend. This is quite natural, given that an adder must handle signed numbers anyway. This chapter introduces the basic principles and circuit structures used for the addition of single bits and of two or more binary numbers. Binary carry-propagate addition is formulated as a prefix problem, and the fundamental algorithms and speed-up techniques for the efficient solution of this problem are described.

3.1 1-Bit Adders

As the basic combinational addition structure, a 1-bit adder computes the sum of m input bits of the same magnitude. It is also called an (m,k) counter (see Fig. 3.1) because it counts the number of 1s at the m inputs $(a_{m-1}, a_{m-2}, \ldots, a_0)$ and outputs a k-bit sum $(s_{k-1}, s_{k-2}, \ldots, s_0)$, where $k = \lceil \log_2(m+1) \rceil$. The arithmetic equation representing the same is as follows [58]:

\[ \sum_{j=0}^{k-1} 2^j s_j = \sum_{i=0}^{m-1} a_i \tag{3.1} \]
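Equation 3.1 can be illustrated with a short behavioral sketch in Python. This is only a model of the counting function, not a gate-level description; the function name `counter` and the convention of returning the sum bits with $s_0$ first are assumptions made for this example.

```python
def counter(bits):
    """Behavioral (m,k) counter: count the 1s among m input bits.

    Returns the k sum bits [s_0, s_1, ..., s_{k-1}] (least significant first),
    where k = ceil(log2(m + 1)).
    """
    m = len(bits)
    k = m.bit_length()  # equals ceil(log2(m + 1)) for every m >= 0
    total = sum(bits)   # right-hand side of Eq. 3.1
    return [(total >> j) & 1 for j in range(k)]


if __name__ == "__main__":
    s = counter([1, 0, 1, 1, 0, 1, 1])  # a (7,3) counter with five 1s at the input
    # Check Eq. 3.1: the weighted sum of the outputs equals the number of 1s.
    print(s, sum(bit << j for j, bit in enumerate(s)))
```

The half adder and full adder discussed next are the (2,2) and (3,2) instances of this function.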

Figure 3.1. (m,k) counter (inputs $a_{m-1}, \ldots, a_0$; outputs $s_{k-1}, \ldots, s_0$)

Half-Adder (2,2)-Counter. The half adder is a (2,2) counter. The more significant sum bit is called the carry-out, $c_{out}$, because it carries an overflow to the next higher bit position. Figure 3.2 depicts the logic symbol and the circuit implementation of the half adder. The corresponding arithmetic and logic equations are given in equations 3.2 and 3.3, respectively.

$$\begin{aligned} 2c_{out} + s &= a + b \\ s &= (a + b) \bmod 2 \\ c_{out} &= (a + b) \,\mathrm{div}\, 2 = \tfrac{1}{2}(a + b - s) \end{aligned} \tag{3.2}$$

$$\begin{aligned} s &= a \oplus b \\ c_{out} &= a \cdot b \end{aligned} \tag{3.3}$$
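The agreement between the arithmetic form (equation 3.2) and the logic form (equation 3.3) can be checked exhaustively. The sketch below is a minimal software model; the function name `half_adder` is chosen for this example only.

```python
def half_adder(a, b):
    """Logic equations 3.3: returns (c_out, s) for single-bit inputs."""
    s = a ^ b          # sum bit: XOR
    c_out = a & b      # carry-out: AND
    return c_out, s


# Exhaustive check of the arithmetic equations 3.2 over all four input pairs.
for a in (0, 1):
    for b in (0, 1):
        c_out, s = half_adder(a, b)
        assert 2 * c_out + s == a + b
        assert s == (a + b) % 2
        assert c_out == (a + b) // 2
```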

Figure 3.2. Symbol and Logic Circuit for Half Adder

Full Adder (3,2)-Counter. The full adder is a (3,2) counter. The third input is called $c_{in}$ since it represents the carry bit from a lower bit position. Important internal signals within a full adder are the bit generate and the bit propagate signals, represented by g and p respectively. The generate signal indicates whether a carry signal, 0 or 1, will be generated within the full adder. The propagate signal indicates whether the carry-in at the input of the full adder will be propagated unchanged to the carry-out of the full adder. The arithmetic and logic equations for all the signals encountered within a full adder are given in equations 3.4 and 3.5, respectively [58].

$$\begin{aligned} 2c_{out} + s &= a + b + c_{in} \\ s &= (a + b + c_{in}) \bmod 2 \\ c_{out} &= (a + b + c_{in}) \,\mathrm{div}\, 2 = \tfrac{1}{2}(a + b + c_{in} - s) \end{aligned} \tag{3.4}$$

$$\begin{aligned} g &= a \cdot b \\ p &= a \oplus b \\ s &= a \oplus b \oplus c_{in} = p \oplus c_{in} \\ c_{out} &= a b + a c_{in} + b c_{in} = g + p\, c_{in} \end{aligned} \tag{3.5}$$
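The two forms of the carry-out in equation 3.5 (the majority form and the generate/propagate form) can be compared directly in a small software model. The function name `full_adder` is an assumption of this sketch.

```python
def full_adder(a, b, c_in):
    """Logic equations 3.5: returns (c_out, s) using generate and propagate."""
    g = a & b                  # bit generate
    p = a ^ b                  # bit propagate
    s = p ^ c_in               # sum bit
    c_out = g | (p & c_in)     # carry-out in generate/propagate form
    return c_out, s


# Exhaustive check against equation 3.4 and against the majority form.
for a in (0, 1):
    for b in (0, 1):
        for c_in in (0, 1):
            c_out, s = full_adder(a, b, c_in)
            assert 2 * c_out + s == a + b + c_in
            assert c_out == (a & b) | (a & c_in) | (b & c_in)
```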

Figure 3.3 depicts the logic symbol and the implementation circuit for the full adder design used in this thesis.

Figure 3.3. Symbol and Logic Circuit for Full Adder

3.2 Carry Propagate Adders

A carry-propagate adder adds two n-bit operands, $A = (a_{n-1}, a_{n-2}, \ldots, a_0)$ and $B = (b_{n-1}, b_{n-2}, \ldots, b_0)$, and an optional carry-in, $c_{in}$, by performing carry propagation. The result is an irredundant (n+1)-bit number consisting of the n-bit sum $S = (s_{n-1}, s_{n-2}, \ldots, s_0)$ and a carry-out, $c_{out}$. Equation 3.6 gives the arithmetic equations for a conventional carry-propagate adder. The logic equations are given in equation 3.7 [58].

$$\begin{aligned} 2^{n} c_{out} + S &= A + B + c_{in} \\ 2^{n} c_{out} + \sum_{i=0}^{n-1} 2^{i} s_i &= \sum_{i=0}^{n-1} 2^{i} a_i + \sum_{i=0}^{n-1} 2^{i} b_i + c_{in} \end{aligned} \tag{3.6}$$

$$\begin{aligned} g_i &= a_i b_i \\ p_i &= a_i \oplus b_i \\ s_i &= p_i \oplus c_i \\ c_{i+1} &= g_i + p_i c_i \, ; \quad i = 0, 1, \ldots, n-1 \end{aligned} \tag{3.7}$$
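The recurrence in equation 3.7 can be evaluated bit by bit, which is a software analogue of chaining full adders. The following is a minimal sketch; the helper names `to_bits` and `from_bits` and the LSB-first ordering are assumptions of this example.

```python
def to_bits(x, n):
    """Integer to n bits, LSB first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """LSB-first bit list back to an integer."""
    return sum(b << i for i, b in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Evaluate equations 3.7 serially; returns (sum bits, c_out)."""
    c = c_in
    s_bits = []
    for a, b in zip(a_bits, b_bits):
        g, p = a & b, a ^ b
        s_bits.append(p ^ c)   # s_i = p_i xor c_i
        c = g | (p & c)        # c_{i+1} = g_i + p_i c_i
    return s_bits, c


n = 4
s_bits, c_out = ripple_carry_add(to_bits(11, n), to_bits(6, n), c_in=1)
# Equation 3.6: 2^n * c_out + S = A + B + c_in
assert (c_out << n) + from_bits(s_bits) == 11 + 6 + 1
```

The loop carries a single dependency from one iteration to the next, which is the carry chain that the speed-up techniques in the rest of this chapter try to shorten.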

Note that in equation 3.7, $c_0 = c_{in}$ and $c_n = c_{out}$. The carry-propagate adder can be implemented as a combinational circuit using n full adders connected in series (see Fig. 3.4); it is then called a ripple-carry adder.

Figure 3.4. Ripple Carry Adder

The following expression gives the latency of an n-bit ripple-carry adder [42]:

$$T_{ripple\text{-}add} = T_{FA}(a, b \rightarrow c_{i+1}) + (n-2)\, T_{FA}(c_i \rightarrow c_{i+1}) + T_{FA}(c_i \rightarrow s_i) \tag{3.8}$$

where $T_{FA}(\text{input} \rightarrow \text{output})$ represents the latency of a full adder on the path between its specified input and output. As an approximation, the latency of a ripple-carry adder is $n\,T_{FA}$.

3.3 Carry Select Adders

One of the earliest logarithmic-time adder designs is based on the conditional-sum addition algorithm. In this scheme, blocks of bits are added in two ways: assuming an incoming carry of 0 or of 1, with the correct outputs selected later as the block's true carry-in becomes known. This is one of the speed-up techniques used to reduce the carry-propagation latency seen in the ripple-carry adder. With each level of selection, the number of known output bits doubles, leading to a logarithmic number of levels and thus logarithmic-time addition, as opposed to the linear-time addition of a carry-propagate adder. An analysis of this scheme is presented next.

The basic problem faced in speeding up carry propagation is the fast processing of a late carry input. Since this carry-in can have only two values, 0 and 1, the two possible addition results can be pre-computed and selected afterwards by the late carry-in in small and constant time. The n-bit adder is broken down into groups of m bits. Each group of m bits is added using what are called m-bit conditional adders. An m-bit conditional adder is shown in Figure 3.5 [20].

Figure 3.5. m-bit Conditional Sum Adder (two m-bit adders, one with carry-in 1 and one with carry-in 0, producing $(C_m^1, S^1)$ and $(C_m^0, S^0)$)

Equation 3.9 [20] gives a mathematical representation of a conditional sum adder:

$$\begin{aligned} (C_m^{0}, S^{0}) &= \mathrm{ADD}(A, B, c_0 = 0) \\ (C_m^{1}, S^{1}) &= \mathrm{ADD}(A, B, c_0 = 1) \end{aligned} \tag{3.9}$$

Here, A, B and S are all m-bit vectors. $C_m^0$ represents the m-bit vector of all the carry outputs assuming $c_0 = 0$. Similarly, $C_m^1$ represents the m-bit vector of all the carry outputs assuming $c_0 = 1$. $S^0$ and $S^1$ represent the m-bit vectors consisting of all the sum bits assuming $c_0 = 0$ and $c_0 = 1$, respectively. Such m-bit conditional adders are then combined as shown in Figure 3.6 to obtain a carry-select adder.
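The select-after-precompute idea of equation 3.9 can be sketched behaviorally on integers. The function names `add_block` and `carry_select_add` are hypothetical, and for simplicity every group (including the least significant one) computes both conditional results, whereas the adder of Figure 3.6 uses a plain adder for the first group.

```python
def add_block(a, b, c0, m):
    """m-bit ADD(A, B, c0) of equation 3.9; returns (carry_out, sum)."""
    total = a + b + c0
    return total >> m, total & ((1 << m) - 1)


def carry_select_add(a, b, n, m, c_in=0):
    """Behavioral n-bit carry-select addition with m-bit groups."""
    assert n % m == 0, "sketch assumes n is a multiple of m"
    mask = (1 << m) - 1
    carry, result = c_in, 0
    for k in range(0, n, m):
        a_k, b_k = (a >> k) & mask, (b >> k) & mask
        c0, s0 = add_block(a_k, b_k, 0, m)   # result assuming carry-in 0
        c1, s1 = add_block(a_k, b_k, 1, m)   # result assuming carry-in 1
        # Multiplexer: the late group carry-in only selects a result.
        carry, s = (c1, s1) if carry else (c0, s0)
        result |= s << k
    return carry, result


c_out, s = carry_select_add(0xFFFF, 0x0001, n=16, m=4)
assert (c_out, s) == (1, 0)
```

In hardware the two `add_block` evaluations of every group run concurrently, so the group carry-in is only on the multiplexer path.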

Figure 3.6. Carry-Select Adder (16-bit example: three 4-bit conditional adders and one 4-bit adder for the least significant group)

From Figures 3.5 and 3.6, it can be deduced that each carry-select adder is composed of an initial full adder (ifa) at the LSB position and a series of full adders (bfa) that generate two sets of results, one for each possible value of the carry signal. Each group within the carry-select adder also consists of an initial full adder (bifa) at the group LSB and a final carry generator (bcg) at the group MSB. $c_{pb}$ and $c_{tb}$ denote the carry-out of the previous and the current block, respectively. The logic equations for a carry-select adder are given in equations 3.10 through 3.13 [58].

ifa:

$$c_{tb} = a_0 b_0 + a_0 c_0 + b_0 c_0 \tag{3.10}$$

bifa:

$$\begin{aligned} g_i &= a_i b_i \\ p_i &= a_i \oplus b_i \\ c_{i+1}^{0} &= g_i \\ c_{i+1}^{1} &= g_i + p_i \\ s_i^{0} &= p_i \\ s_i^{1} &= \overline{p_i} \\ s_i &= \overline{c_{pb}}\, s_i^{0} + c_{pb}\, s_i^{1} \end{aligned} \tag{3.11}$$

bfa:

$$\begin{aligned} g_i &= a_i b_i \\ p_i &= a_i \oplus b_i \\ c_{i+1}^{0} &= g_i + p_i c_i^{0} \\ c_{i+1}^{1} &= g_i + p_i c_i^{1} \\ s_i^{0} &= p_i \oplus c_i^{0} \\ s_i^{1} &= p_i \oplus c_i^{1} \\ s_i &= \overline{c_{pb}}\, s_i^{0} + c_{pb}\, s_i^{1} \end{aligned} \tag{3.12}$$

bcg:

$$c_{tb} = c_{i+1}^{0} + c_{pb}\, c_{i+1}^{1} \tag{3.13}$$

The delay of the carry-select adder is therefore given by equation 3.14 [42]:

$$T_{carry\text{-}sel} = m\,T_{FA} + \left(\frac{n}{m} - 1\right) T_{MUX} \tag{3.14}$$

$T_{MUX}$ represents the delay of a multiplexer. The advantage of a carry-select adder is that carries ripple through only one m-bit group; the remaining delay is that of the n/m − 1 multiplexers, and applying the selection recursively, as in conditional-sum addition, yields logarithmic instead of linear delay growth. However, it also has a high hardware overhead since it requires double the number of carry-propagate adders. In addition, it requires (n/m − 1) sets of multiplexers.

3.4 Carry Skip Adder

The carry-skip adder is obtained by a modification of the ripple-carry adder. The objective is to reduce the worst-case delay by reducing the number of full adder cells through which the carry has to propagate. To achieve this, the adder is divided into groups of m bits, and the carry into group j+1 is determined by one of the following two conditions [20]:

1. The carry is propagated by group j. That is, the carry-out of group j is equal to the carry-in of that group. This situation occurs only when the sum of the m-bit inputs to that group is equal to $2^m - 1$.

2. The carry is not propagated by the group (that is, it is generated or killed inside the group).

Consequently, to reduce the length of the propagation of the carry, a skip network is provided for each group of m bits so that when a carry is propagated by this group, the skip network makes the carry bypass the group. The m-bit block is shown in Figure 3.7, and a network of these modules implementing an n-bit adder is indicated in Figure 3.8. The analytical and logical analysis is presented next.

Figure 3.7. Carry-Skip Block

Figure 3.8. Carry-Skip Adder (16-bit example built from four 4-bit CSK blocks)

Carry computation for a single bit position, $c_{i+1} = \overline{p_i}\, g_i + p_i c_i$, can be reformulated for a group of bits as in equation 3.15:

$$c_{i+1} = \overline{P_{i:k}}\; c'_{i+1} + P_{i:k}\, c_k \tag{3.15}$$

where $P_{i:k}$ denotes the group propagate of the carry-propagate adder, acts as the select signal in the multiplexer structure, and is given by $P_{i:k} = \bigwedge_{j=k}^{i} p_j$. $c'_{i+1}$ is the carry-out of the partial carry-propagate adder. Two cases can be distinguished [58]:

$P_{i:k} = 0$: The carry $c'_{i+1}$ is generated within the carry-propagate adder and selected by the multiplexer as carry-out $c_{i+1}$. The carry-in $c_k$ does not propagate through the carry-propagate adder to the carry-out $c'_{i+1}$.

$P_{i:k} = 1$: The carry-in $c_k$ propagates through the carry-propagate adder to $c'_{i+1}$ but is not selected by the multiplexer. It skips the carry-propagate adder and is directly selected as the carry-out $c_{i+1}$ instead.

Thus the combinational path from the carry-in through the carry-propagate adder is never activated. In other words, the slow carry-chain path from the carry-in to the carry-out through the carry-propagate adder is broken by either the adder itself or the multiplexer. The resulting carry-skip addition block is therefore a regular carry-propagate adder with a small and constant delay from $c_k$ to $c_{i+1}$, which is why it speeds up carry propagation.

Note that the multiplexer in this circuit is logically redundant, i.e., the signals $c'_{i+1}$ and $c_{i+1}$ are logically equivalent and differ only in signal delays. The carry-in $c_k$ has a reconvergent fan-out. This inherent logic redundancy results in a false longest path which leads from the carry-in through the carry-propagate adder to the carry-out. This poses a problem in automatic logic optimization and static timing analysis. Redundancy removal techniques exist which are based on duplication of the carry chain in the carry-propagate adder: one carry chain computes the carry-out without a carry-in, while the other takes the carry-in for calculation of the sum bits. This scheme, however, requires a considerable amount of additional logic compared to the redundant carry-skip scheme.
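The logical redundancy of the skip multiplexer can be confirmed by brute force on a small block. The sketch below models one carry-skip block per equation 3.15; the function name `skip_block` and the 4-bit block size used in the check are assumptions of this example.

```python
from itertools import product


def skip_block(a_bits, b_bits, c_k):
    """One carry-skip block; returns (c_prime, c_out, P)."""
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    g = [a & b for a, b in zip(a_bits, b_bits)]
    c = c_k
    for g_j, p_j in zip(g, p):
        c = g_j | (p_j & c)          # ripple through the partial adder
    c_prime = c                      # c'_{i+1}: carry-out of the partial adder
    P = int(all(p))                  # group propagate P_{i:k}
    c_out = c_k if P else c_prime    # skip multiplexer, equation 3.15
    return c_prime, c_out, P


# For every 4-bit input combination the rippled and the skipped carry agree,
# so the multiplexer changes only the delay, never the logic value.
for bits in product((0, 1), repeat=9):
    a_bits, b_bits, c_k = bits[0:4], bits[4:8], bits[8]
    c_prime, c_out, _ = skip_block(a_bits, b_bits, c_k)
    assert c_prime == c_out
```

This equivalence is what makes the path through the partial adder a false path for timing analysis: it exists structurally but is never the one that determines the output.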

The worst-case delay of a carry-skip adder, from Figures 3.7 and 3.8, can be formulated as [7]

$$T_{CSK} = m\,t_c + t_{mux} + \left(\frac{n}{m} - 2\right) t_{mux} + (m-1)\,t_c + t_s \tag{3.16}$$

Here, $T_{CSK}$ represents the delay of the carry-skip adder, $t_{mux}$ is the propagation delay of the multiplexer, $t_c$ is the delay from the inputs of the full adder to the carry output, and $t_s$ is the delay from the inputs to the sum output. m is the size of the groups into which the adder is divided and n is the size of the input operands. As can be seen from equation 3.16, the delay of the adder depends on the group size m. The group size that minimizes the delay can be found by differentiating equation 3.16 with respect to m and setting the derivative to zero. The optimal value of m is [26]

$$m_{opt} = \sqrt{\frac{t_{mux}\, n}{2\, t_c}} \tag{3.17}$$

This analysis assumes that all groups are of the same size. However, equal-size groups do not produce minimum delay. This is because, for instance, carries generated in the first group have to traverse more skip networks to get to the last group than carries generated in some internal group. To determine the worst case, the delays of all propagation chains need to be compared. A particular chain is initiated in group i and terminates in group j, with $j \geq i$, being propagated by the $j - i - 1$ groups in between. Consequently, if group i has size $m_i$,

$$T_{CSK} = \max_{i,j}\big( (m_i + m_j - 1)\, t_c + (j - i - 1)\, t_{mux} \big) + t_{mux} + t_s \tag{3.18}$$

with $\sum_i m_i = n$. The worst-case delay can therefore be reduced by reducing the size of the groups close to the beginning and end, as illustrated in Figure 3.9.
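Equations 3.16 and 3.17 can be evaluated numerically to see the trade-off in the group size. The sketch below assumes unit delays ($t_c = t_{mux} = t_s = 1$), which are illustrative values and not measured gate delays; the function names are chosen for this example.

```python
from math import sqrt


def t_csk(n, m, t_c=1.0, t_mux=1.0, t_s=1.0):
    """Worst-case carry-skip delay for equal group size m (equation 3.16)."""
    return m * t_c + t_mux + (n / m - 2) * t_mux + (m - 1) * t_c + t_s


def m_opt(n, t_c=1.0, t_mux=1.0):
    """Delay-optimal equal group size (equation 3.17)."""
    return sqrt(t_mux * n / (2 * t_c))


n = 32
best = m_opt(n)                      # 4.0 for unit delays
for m in (2, 4, 8, 16):
    print(m, t_csk(n, m))
# Groups smaller or larger than the optimum are both slower.
assert t_csk(n, 4) < t_csk(n, 2) and t_csk(n, 4) < t_csk(n, 8)
```

Small groups lengthen the chain of multiplexers and large groups lengthen the ripple inside the first and last group, which is why the optimum grows only with the square root of n.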

Figure 3.9. Optimal Group Sizes for Carry-Skip Adder (group size m versus group index, for M groups)

3.5 Carry Save Adder

A row of binary full adders can be viewed as a mechanism to reduce three numbers to two numbers, rather than as one to reduce two numbers to their sum. Figure 3.10 shows the relationship of a ripple-carry adder for the latter reduction and a carry-save adder for the former. The dotted lines represent the flow of the carry signals within a ripple-carry adder. In the carry-save adder, the carry signals are separate outputs which are then passed on, along with the S bits, to a subsequent carry-propagate adder that generates the final sum [15].

Figure 3.10. Carry-Save Adder

The basic idea is to perform an addition of three binary vectors using an array of one-bit adders without propagating the carries. The sum is a redundant n-digit carry-save

number, consisting of two binary numbers S (sum bits) and C (carry bits). A carry-save adder accepts three binary input operands or, alternatively, one binary and one carry-save operand. It has a constant delay independent of n. Mathematically [58],

$$\begin{aligned} 2C + S &= A_0 + A_1 + A_2 \\ \sum_{i=1}^{n} 2^{i} c_i + \sum_{i=0}^{n-1} 2^{i} s_i &= \sum_{j=0}^{2} \sum_{i=0}^{n-1} 2^{i} a_{j,i} \\ 2 c_{i+1} + s_i &= \sum_{j=0}^{2} a_{j,i} \end{aligned} \tag{3.19}$$

where $i = 0, 1, \ldots, n-1$.

3.6 Parallel Prefix Adders

The addition of two binary numbers can be formulated as a prefix problem. The corresponding parallel-prefix algorithms can be used for speeding up binary addition and for illustrating and understanding various addition principles. This section introduces a mathematical and visual formalism for prefix problems and algorithms.

Prefix Problems. In a prefix problem, n outputs ($y_{n-1}, y_{n-2}, \ldots, y_0$) are computed from n inputs ($x_{n-1}, x_{n-2}, \ldots, x_0$) using an arbitrary associative operator $\bullet$ as follows [21]:

$$\begin{aligned} y_0 &= x_0 \\ y_1 &= x_1 \bullet x_0 \\ y_2 &= x_2 \bullet x_1 \bullet x_0 \\ &\;\;\vdots \\ y_{n-1} &= x_{n-1} \bullet x_{n-2} \bullet \cdots \bullet x_1 \bullet x_0 \end{aligned} \tag{3.20}$$

The problem can also be formulated recursively:

$$\begin{aligned} y_0 &= x_0 \\ y_i &= x_i \bullet y_{i-1} \, ; \quad i = 1, 2, \ldots, n-1 \end{aligned} \tag{3.21}$$
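The recursive formulation of equation 3.21 maps directly onto a loop. The following sketch is generic in the operator; the function name `serial_prefix` is an assumption, and string concatenation is used in the second check only because it is associative but not commutative, which makes the operand order visible.

```python
from operator import add


def serial_prefix(x, op):
    """Serial prefix computation, equation 3.21: y_i = x_i op y_{i-1}."""
    y = [x[0]]
    for i in range(1, len(x)):
        y.append(op(x[i], y[i - 1]))
    return y


# Running sums as a simple prefix problem.
assert serial_prefix([3, 1, 4, 1, 5], add) == [3, 4, 8, 9, 14]

# Concatenation shows that y_i combines x_i down to x_0, in that order.
assert serial_prefix(list("abc"), lambda hi, lo: hi + lo) == ["a", "ba", "cba"]
```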

In other words, in a prefix problem, every output depends on all inputs of equal or lower magnitude, and every input influences all outputs of equal or higher magnitude.

Due to the associativity of the prefix operator, the individual operations can be carried out in any order. In particular, sequences of operations can be grouped in order to solve the prefix problem partially and in parallel for groups (i.e., sequences) of input bits ($x_i, x_{i-1}, \ldots, x_k$), resulting in the group variables $Y_{i:k}$. At higher levels, sequences of group variables can again be evaluated, yielding m levels of intermediate group variables $Y_{i:k}^{l}$, where $Y_{i:k}^{l}$ denotes the prefix result of bits ($x_i, x_{i-1}, \ldots, x_k$) at level l. The group variables of the last level m must cover all bits from i to 0 ($Y_{i:0}^{m}$) and therefore represent the results of the prefix problem [58].

$$\begin{aligned} Y_{i:i}^{0} &= x_i \\ Y_{i:k}^{l} &= Y_{i:j+1}^{l-1} \bullet Y_{j:k}^{l-1} \, ; \quad k \leq j \leq i \, ; \; l = 1, 2, \ldots, m \\ y_i &= Y_{i:0}^{m} \, ; \quad i = 0, 1, \ldots, n-1 \end{aligned} \tag{3.22}$$

Note that for j = i, the group variable $Y_{i:k}^{l-1}$ is unchanged (i.e., $Y_{i:k}^{l} = Y_{i:k}^{l-1}$). Since prefix problems describe a combinational input-to-output relationship, they can be solved by logic networks, which will be the major focus in the following text.

Various serial and parallel algorithms exist for solving prefix problems, depending on the bit grouping properties in equation 3.22. They result in very different size and delay performance measures when mapped onto a logic network.

Prefix Algorithms. Two categories of prefix algorithms can be distinguished: the serial-prefix and the tree-prefix algorithms. Tree-prefix algorithms include parallelism for calculation speed-up, and therefore form the category of parallel-prefix algorithms.
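The level-by-level grouping of equation 3.22 can be sketched with one particular choice of grouping, in which the span covered by every group variable doubles at each level. This doubling pattern (the arrangement used by Kogge-Stone style trees) is an assumption of the example and only one of many valid groupings; the function name `parallel_prefix` is likewise hypothetical.

```python
def parallel_prefix(x, op):
    """Level-wise prefix computation; returns (outputs, number of levels)."""
    y = list(x)                  # level 0: Y_{i:i} = x_i
    d, levels = 1, 0
    while d < len(y):
        # All operations within one level are independent of each other.
        # y[i] covers the upper bits, y[i - d] the adjacent lower bits.
        # Positions with i < d already cover bit 0 and pass through unchanged.
        y = [op(y[i], y[i - d]) if i >= d else y[i] for i in range(len(y))]
        d *= 2
        levels += 1
    return y, levels


# Concatenation is associative but not commutative, so the result also
# confirms that operands are combined in the order upper-then-lower.
y, levels = parallel_prefix(list("abcdefgh"), lambda hi, lo: hi + lo)
assert y[7] == "hgfedcba" and y[2] == "cba"
assert levels == 3               # log2(8) levels instead of 7 serial steps
```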

Equation 3.21 represents a serial algorithm for solving the prefix problem. The serial-prefix algorithm needs a minimal number of binary operations but is inherently slow (O(n)). According to equation 3.20, all outputs can be computed separately and in parallel. By arranging the operations in a tree structure, the computation time for each output can be reduced to O(log n). However, the overall number of operations to be evaluated, and with it the hardware cost, grows with O(n²) if individual evaluation trees are used for each output. As a trade-off, the individual output evaluation trees can be merged (i.e., common sub-expressions can be shared) to a certain degree according to different tree-prefix algorithms, reducing the area complexity to O(n log n) or even O(n). Binary addition is presented as a prefix computation next.

The prefix problem of binary carry-propagate addition computes the generation and propagation of carry signals. The intermediate prefix variables can have three different values, i.e., generate a carry 0 (or kill a carry 1), generate a carry 1, and propagate the carry-in. The variables are coded by two bits, group generate $G_{i:k}^{l}$ and group propagate $P_{i:k}^{l}$. The generate/propagate signals form a signal pair $Y_{i:k}^{l} = (G_{i:k}^{l}, P_{i:k}^{l})$ at level l. The initial prefix signal pairs $(G_{i:i}^{0}, P_{i:i}^{0})$, corresponding to the bit generate $g_i$ and bit propagate $p_i$ signals, have to be computed from the addition input operands in a pre-processing step. The prefix signal pairs of level l are then calculated from the signals of level l−1 by an arbitrary prefix algorithm using the binary operation [58]

$$\begin{aligned} (G_{i:k}^{l}, P_{i:k}^{l}) &= (G_{i:j+1}^{l-1}, P_{i:j+1}^{l-1}) \bullet (G_{j:k}^{l-1}, P_{j:k}^{l-1}) \\ &= \left( G_{i:j+1}^{l-1} + P_{i:j+1}^{l-1}\, G_{j:k}^{l-1},\; P_{i:j+1}^{l-1}\, P_{j:k}^{l-1} \right) \end{aligned} \tag{3.23}$$

In the prefix tree, there are n columns, corresponding to the number of input bits. The gates performing the $\bullet$ operation that work in parallel are arranged in the same row, and gates connected in series are placed in consecutive rows. Thus, the number of rows m corresponds to the number of binary operations to be evaluated in series. The outputs of row l are the group variables $Y_{i:k}^{l} = (G_{i:k}^{l}, P_{i:k}^{l})$. The generate/propagate signals from the last prefix stage $(G_{i:0}^{m}, P_{i:0}^{m})$ are used to compute the carry signals $c_i$. The sum bits $s_i$ are finally obtained in a post-processing step. A parallel prefix adder can therefore be represented as shown in Figure 3.11.

Figure 3.11. Parallel Prefix Adder (pre-processing of $A_{n-1:0}$, $B_{n-1:0}$ into $g_{n-1:0}$, $p_{n-1:0}$; prefix carry tree producing $c_{n-1:0}$; post-processing producing $S_{n-1:0}$)

Combining equations 3.22 and 3.23 yields the following generate-propagate-based addition prefix formalism [58]:

$$\begin{aligned} g_i &= a_i b_i \\ p_i &= a_i \oplus b_i \, ; \quad i = 0, 1, \ldots, n-1 \end{aligned} \tag{3.24}$$
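The three stages of Figure 3.11 can be sketched in software using the operator of equation 3.23. The sketch makes two assumptions that are not taken from the text: the carry-in is fixed to 0, and the prefix tree uses a span-doubling arrangement (as in Kogge-Stone style trees) rather than any specific tree discussed in this thesis. The function names are hypothetical.

```python
def gp_op(hi, lo):
    """Prefix operator of equation 3.23 on (G, P) pairs."""
    G_hi, P_hi = hi
    G_lo, P_lo = lo
    return (G_hi | (P_hi & G_lo), P_hi & P_lo)


def prefix_add(a, b, n):
    """n-bit addition with c_in = 0; returns (c_out, sum)."""
    # Pre-processing: bit generate and bit propagate (equation 3.24).
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    # Prefix carry tree: after the last level, Y[i] = (G_{i:0}, P_{i:0}).
    Y = list(zip(g, p))
    d = 1
    while d < n:
        Y = [gp_op(Y[i], Y[i - d]) if i >= d else Y[i] for i in range(n)]
        d *= 2
    # With c_in = 0 the carry into position i + 1 is G_{i:0}.
    c = [0] + [G for G, _ in Y]
    # Post-processing: s_i = p_i xor c_i.
    s = sum((p[i] ^ c[i]) << i for i in range(n))
    return c[n], s


c_out, s = prefix_add(0b101101, 0b011011, 6)
assert (c_out << 6) + s == 0b101101 + 0b011011
```

Only the middle stage depends on the chosen prefix algorithm; the pre-processing and post-processing stages stay the same when a different tree is substituted.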
