Cost/Performance Tradeoffs:

Similar documents
6.004 Computation Structures Spring 2009

8. Design Tradeoffs x Computation Structures Part 1 Digital Circuits. Copyright 2015 MIT EECS

8. Design Tradeoffs x Computation Structures Part 1 Digital Circuits. Copyright 2015 MIT EECS

Tree and Array Multipliers Ivor Page 1

Computer Architecture 10. Fast Adders

Lecture 8: Sequential Multipliers

Binary Multipliers. Reading: Study Chapter 3. The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding

Datapath Component Tradeoffs

Carry Look Ahead Adders

Chapter 5 Arithmetic Circuits

Logic Design II (17.342) Spring Lecture Outline

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering

Midterm Exam Two is scheduled on April 8 in class. On March 27 I will help you prepare Midterm Exam Two.

What s the Deal? MULTIPLICATION. Time to multiply

DIGITAL TECHNICS. Dr. Bálint Pődör. Óbuda University, Microelectronics and Technology Institute

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Logic. Combinational. inputs. outputs. the result. system can

COMBINATIONAL LOGIC FUNCTIONS

Digital Logic: Boolean Algebra and Gates. Textbook Chapter 3

Logic and Computer Design Fundamentals. Chapter 5 Arithmetic Functions and Circuits

A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte

EECS150 - Digital Design Lecture 24 - Arithmetic Blocks, Part 2 + Shifters

Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier

Serial Parallel Multiplier Design in Quantum-dot Cellular Automata

CPE100: Digital Logic Design I

COMPUTERS ORGANIZATION 2ND YEAR COMPUTE SCIENCE MANAGEMENT ENGINEERING UNIT 3 - ARITMETHIC-LOGIC UNIT JOSÉ GARCÍA RODRÍGUEZ JOSÉ ANTONIO SERRA PÉREZ

Lecture 8. Sequential Multipliers

Systems I: Computer Organization and Architecture

Lecture 22 Chapters 3 Logic Circuits Part 1

Design of Sequential Circuits

CS/COE 1501 cs.pitt.edu/~bill/1501/ Integer Multiplication

Novel Bit Adder Using Arithmetic Logic Unit of QCA Technology

Logic and Computer Design Fundamentals. Chapter 8 Sequencing and Control

EECS150 - Digital Design Lecture 11 - Shifters & Counters. Register Summary

Design of Arithmetic Logic Unit (ALU) using Modified QCA Adder

LOGIC CIRCUITS. Basic Experiment and Design of Electronics

Counters. We ll look at different kinds of counters and discuss how to build them

Chapter 4. Combinational: Circuits with logic gates whose outputs depend on the present combination of the inputs. elements. Dr.

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Digital Logic

EECS150 - Digital Design Lecture 25 Shifters and Counters. Recap

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Ch 7. Finite State Machines. VII - Finite State Machines Contemporary Logic Design 1

10/12/2016. An FSM with No Inputs Moves from State to State. ECE 120: Introduction to Computing. Eventually, the States Form a Loop

LOGIC CIRCUITS. Basic Experiment and Design of Electronics. Ho Kyung Kim, Ph.D.

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017

Computers also need devices capable of Storing data and information Performing mathematical operations on such data

Design at the Register Transfer Level

Combinational Logic Design Combinational Functions and Circuits

Binary addition by hand. Adding two bits

Arithmetic Circuits-2

Models for representing sequential circuits

Arithmetic Circuits How to add and subtract using combinational logic Setting flags Adding faster

Latches. October 13, 2003 Latches 1

CS470: Computer Architecture. AMD Quad Core

Computer Architecture. ESE 345 Computer Architecture. Design Process. CA: Design process

ELEN Electronique numérique

211: Computer Architecture Summer 2016

CS 140 Lecture 14 Standard Combinational Modules

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Hardware Design I Chap. 4 Representative combinational logic

VLSI Signal Processing

Chapter 2 Basic Arithmetic Circuits

CprE 281: Digital Logic

14:332:231 DIGITAL LOGIC DESIGN

CSE370 HW6 Solutions (Winter 2010)

Arithmetic Circuits Didn t I learn how to do addition in the second grade? UNC courses aren t what they used to be...

Cost/Performance Tradeoff of n-select Square Root Implementations

Combinational Logic. Mantıksal Tasarım BBM231. section instructor: Ufuk Çelikcan

Digital Design for Multiplication

Median architecture by accumulative parallel counters

EE141- Spring 2004 Digital Integrated Circuits

Adders, subtractors comparators, multipliers and other ALU elements

Adders, subtractors comparators, multipliers and other ALU elements

Class Website:

EECS150 - Digital Design Lecture 18 - Counters

EECS150 - Digital Design Lecture 18 - Counters

Introduction to Digital Logic

EE40 Lec 15. Logic Synthesis and Sequential Logic Circuits

DSP Configurations. responded with: thus the system function for this filter would be

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design

EECS150 - Digital Design Lecture 23 - FSMs & Counters

Fundamentals of Digital Design

Arithmetic Circuits-2

14:332:231 DIGITAL LOGIC DESIGN. 2 s-complement Representation

Logic BIST. Sungho Kang Yonsei University

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

EECS150 - Digital Design Lecture 17 - Sequential Circuits 3 (Counters)

Review. EECS Components and Design Techniques for Digital Systems. Lec 18 Arithmetic II (Multiplication) Computer Number Systems

DE58/DC58 LOGIC DESIGN DEC 2014

UNIT III Design of Combinational Logic Circuits. Department of Computer Science SRM UNIVERSITY

課程名稱 : 數位邏輯設計 P-1/ /6/11

Memory, Latches, & Registers

EECS150 - Digital Design Lecture 16 Counters. Announcements

Multiplication of signed-operands

ELCT201: DIGITAL LOGIC DESIGN

Decoding A Counter. svbitec.wordpress.com 1

CS/EE 181a 2008/09 Lecture 4

CMP 334: Seventh Class

Sample Test Paper - I

Transcription:

Cost/Performance Tradeoffs: a case study Digital Systems Architecture I. L10 - Multipliers 1

Binary Multiplication x a b n bits n bits EASY PROBLEM: design combinational circuit to multiply tiny (1-, 2-, 3-bit) operands... 2n bits since (2 n -1) 2 < 2 2n HARD PROBLEM: design circuit to multiply BIG (32-bit, 64-bit) numbers We can make big multipliers out of little ones! Engineering Principle: Exploit STRUCTURE in in problem. L10 - Multipliers 2

Making a 2n-bit multiplier using n-bit multipliers Given n-bit multipliers: a H a L a x b = ab n bits n bits 2n bits x b H b L Synthesize 2n-bit multipliers: x ab 4n bits a 2n bits b 2n bits a L b L a L b H a H b L a H b H ab L10 - Multipliers 3

Our Basis: n=1: minimalist starting point Multiplying two 1-bit numbers is pretty simple: a x b = 0 ab Of course, we could start with optimized combinational multipliers for larger operands; e.g. a 1 a 0 b 1 b 0 2 2 2-bit Multiplier 4 c 3 c 2 c 1 c 0 the logic gets more complex, but some optimizations are possible... L10 - Multipliers 4

2n-bit by 2n-bit multiplication: Our induction step: 1. Divide multiplicands into n-bit pieces 2. Form 2n-bit partial products, using n-bit by n-bit multipliers. 3. Align appropriately 4. Add. a L b H REGROUP partial products - 2 additions rather than 3! a H a L x b H b a H b H a L b L L a H b L Induction: we can use the same structuring principle to build a 4n-bit multiplier from our newly-constructed 2n-bit ones... a b L10 - Multipliers 5

Brick Wall view of partial products Making 4n-bit multipliers from n-bit ones: 2 induction steps x a 3 a2 a1 a0 b 3 b 2 b 1 b 0 a 2 b 3 a 3 b 3 a 2 b 2 a 0 b 3 a 1 b 3 a 0 b 2 a 1 b 2 a 0 b 1 a 1 b 1 a 0 b 0 a 3 b 2 a 2 b 1 a 3 b 1 a 2 b 0 a 3 b 0 a 1 b 0 L10 - Multipliers 6

Multiplier Cookbook: Chapter 1 Given problem: x a 3 a2 a1 a0 b 3 b 2 b 1 b 0 Subassemblies: Partial Products Adders Step 1: Form (& arrange) Partial Products: a 0 b 3 a 1 b 3 a 0 b 2 a 2 b 3 a 1 b 2 a 0 b 1 a 3 b 3 a 2 b 2 a 1 b 1 a 0 b 0 a 3 b 2 a 2 b 1 a 1 b 0 MULT ADD a 3 b 1 a 2 b 0 a 3 b 0 Step 2: Sum L10 - Multipliers 7

Performance/Cost Analysis "Order Of" notation: "g(n) is of order f(n)" g(n) = Θ (f(n)) g(n) = Θ(f(n)) if there exist C 2 C 1 > 0, such that for all but finitely many integral n 0 c 1 f(n) g(n) c 2 f(n) Example: n 2 +2n+3 = Θ(n 2 ) since n 2 (n 2 +2n+3) 2n 2 "almost always" g(n) = O(f(n)) Θ(...) implies both inequalities; O(...) implies only the second. Partial Products: Things to Add: Adder Width: Hardware Cost: n 2 2n-2 n? = = = = Θ(n 2 ) Θ(n) Θ(n) Θ(n 2 ) Latency: Ο(n 2 )?? L10 - Multipliers 8

Observations: a 0 b 3 a 1 b 3 a 0 b 2 a 2 b 3 a 1 b 2 a 0 b 1 a 3 b 3 a 2 b 2 a 1 b 1 a 0 b 0 MULT ADD Θ(n 2 ) partial products. Θ(n 2 ) full adders. Hmmm. a 3 b 2 a 2 b 1 a 1 b 0 a 3 b 1 a 2 b 0 a 3 b 0 L10 - Multipliers 9

Repackaging Function Engineering Principle #2: #2: Put Put the the Solution where where the the Problem is. is. Θ(n 2 ) partial products. Θ(n 2 ) full adders. MULT ADD ab 0 3 ab 1 3 ab 0 2 ab 2 3 ab 1 2 ab 0 1 ab 3 3 ab 2 2 1 1 ab ab 3 2 ab 2 1 ab 1 0 ab 3 1 ab 2 0 ab 3 0 ab 0 0 How about n 2 blocks, each doing a little multiplication and a little addition? L10 - Multipliers 10

b 3 b 2 ab 0 3 b 1 ab 1 3 ab 0 2 b 0 ab 2 3 ab 1 2 ab 0 1 ab 3 3 ab 2 2 ab 1 1 ab 0 0 ab 3 2 ab 2 1 ab 1 0 a 0 ab 3 1 ab 2 0 a 1 ab 3 0 a 2 a 3 C i+1 A i FA B i Goal: Array of Identical Multiplier Cells C i C k+2 S k+1 S k Single "brick" of brick-wall array... Forms partial product Adds to accumulating sum along with carry S k+1 S k Necessary Component: Full Adder Takes 2 addend bits plus carry bit. Produces sum and carry output bits. b i a j C k (A+B) i CASCADE to form an n-bit adder. L10 - Multipliers 11

Design of 1-bit multiplier "Brick": b 3 b 2 ab 0 3 ab 1 3 ab 0 2 ab 2 3 ab 1 2 ab 0 1 ab 3 3 ab 2 2 1 1 ab ab 3 2 ab 2 1 ab 1 0 ab 3 1 ab 2 0 ab 3 0 a 2 a 3 ab 0 0 b 1 b 0 a 1 a 0 Brick design: AND gate forms 1x1 product 2-bit sum propagates from top to bottom Carry propagates to left Array Layout: operand bits bused diagonally Carry bits propagate right-to-left Sum bits propagate down C k+2 S k+1 0 FA FA S k+1 S k S k b i a j C k L10 - Multipliers 12

Latency revisited Here s our combinational multiplier: a 3 a 2 a 1 a 0 a0 b3 a1 b3 a2 b3 a3 b3 3 2 2 1 b 3 1 2 0 1 2 2 a1 b1 a0 b0 3 1 a3 b0 0 2 2 0 1 0 b 2 b 1 b 0 What s its propagation delay? Naive (but valid) bound: O(n) addtions O(n) time for each addition Hence O(n 2 ) time required On closer inspection: Propagation only toward left, bottom Hence longest path bounded by length + width of array: O(n+n) = O(n)! L10 - Multipliers 13

Combinational Multiplier: Multiplier Cookbook: Chapter 2 a 3 + a 2 a 1 a 0 a0 b3 a1 b3 a2 b3 a3 b3 3 2 2 1 b 3 1 2 0 1 2 2 a1 b1 a0 b0 3 1 a3 b0 0 2 2 0 1 0 b 2 b 1 b 0 Hardware for n by n bits: Latency: Throughput: Θ(n 2 ) Θ(n) Θ(1/n) L10 - Multipliers 14

Combinational Multiplier: best bang for the buck? Suppose we have LOTS of multiplications. Can we do better from a cost/performance standpoint? PIPELINING L10 - Multipliers 15

The Pipelining Bandwagon... where do I get on? WE HAVE: Pipeline rules - "well formed pipelines" Plenty of registers Demand for higher throughput. What do we do? Where do we define stages? a 3 a 2 a 0 b 3 b a 1 a0 b3 a1 b3 a0 b2 a2 b3 a1 b2 a0 b1 2 b 1 a3 b3 a2 b2 a1 b1 a0 b0 a3 b2 a2 b1 a1 b0 a3 b1 a2 b0 a3 b0 b 0 L10 - Multipliers 16

Stupid Pipeline Tricks a0 b3 a2 b3 a1 b3 a1 b2 a0 b2 a0 b1 gottreak that long carry chain! a3 b3 a2 b2 a1 b1 a0 b0 a3 b2 a2 b1 a1 b0 a3 b1 a3 b0 a2 b0 Stages: Θ(n) Clock Period: Θ(n) Hardware cost for n by n bits: Θ(n 2 ) Latency: Θ(n 2 ) Throughput: Θ( 1/n ) L10 - Multipliers 17

Even Stupider Pipeline Tricks a3 b3 a2 b3 a3 b2 a1 b3 a2 b2 a0 b3 a0 b2 1 2 0 1 a1 b1 2 1 a3 b1 3 0 2 0 1 0 0 0 WORSE idea: Doesn t break long combinational paths NOT a well-formed pipeline...... different register counts on alternative paths... data crosses stage boundaries in both directions! Back to basics: what s the point of pipelining, anyhow? L10 - Multipliers 18

Breaking O(n) combinational paths LONG PATHS go down, to left: Break array into diagonal slices Segment every long combinational path a 3 a 2 a3 b3 a 0 b 3 b 2 a a0 b 1 3 a0 b2 a1 b3 a0 b1 a1 b2 a0 b0 a2 b3 a1 b1 a2 b2 a1 b0 a3 b2 a3 b1 a3 b0 2 1 2 0 b 1 b 0 GOAL: Θ (n) stages; Θ (1) clock period! L10 - Multipliers 19

Multiplier Cookbook: Chapter 3 Stages: Clock Period: Hardware cost for n by n bits: Latency: Well-formed pipeline (careful!) Constant (high!) throughput, independently of operand size.... but suppose we don t need the throughput? a 3 Θ (n) Θ (1) Θ (n 2 ) Θ (n) Throughput: Θ (1) a 2 a3 b3 a 0 b 3 b 2 a a0 b 1 3 a0 b2 a1 b3 a0 b1 a1 b2 a0 b0 a2 b3 a1 b1 a2 b2 a1 b0 a3 b2 a3 b1 a3 b0 2 1 2 0 b 1 b 0 L10 - Multipliers 20

Moving down the cost curve... Suppose we have INFREQUENT multiplications... pipelining doesn t help us. Can we do better from a cost/performance standpoint? Hmmm, are all these extras really needed? a 3 a 0 b 2 a 1 a 0 b 3 a 0 b 2 a 2 a 1 b 3 a 0 b 1 a 1 b 2 a 0 b 0 a 2 b 3 a 1 b 1 a 2 b 2 a 1 b 0 a 3 b 3 a 2 b 1 a 3 b 2 a 2 b 0 a 3 b 1 3 0 b 1 b 0 L10 - Multipliers 21

Multiplier Cookbook: Chapter 4 a i b 3 b 2 b 1 ai b3 ai b2 ai b1 ai b0 b 0 Sequential Multiplier: Re-uses a single n-bit slice to emulate each pipeline stage a operand entered serially Lots of details to be filled in... Stages: Clock Period: Hardware cost for n by n bits: Latency: Throughput: 1 Θ (1) (constant!) Θ (n) Θ (n) Θ (1/n) L10 - Multipliers 22

(Ridiculous?) Extremes Dept... Cost minimization: how far can we go? a i b 3 b 2 b 1 i 3 i 2 i 1 i 0 b 0 Suppose we want to minimize hardware (at any cost) Consider bit-serial! Form and add 1-bit partial product per clock Reuse single brick for each bit b j of slice; Re-use slice for each bit of a operand L10 - Multipliers 23

Multiplier Cookbook: Chapter 5 Bit Serial multiplier: Re-uses a single brick to emulate an n-bit slice both operands entered serially O(n 2 ) clock cycles required Needs additional storage (typically from existing registers) a i b 3 b 2 b 1 ai b3 i 2 i 1 i 0 b 0 Stages: Clock Period: Hardware cost for n by n bits: Latency: Throughput: Θ (1 n ) Θ (1) (constant!) Θ (1) +? Θ (n 2 ) Θ (1/n 2 ) L10 - Multipliers 24

Summary: Scheme: $ Latency Thruput Combinational Θ(n 2 ) Θ(n) Θ(1/n) N-pipe Θ(n 2 ) Θ(n) Θ(1) Slice-serial Θ(n) Θ(n) Θ(1/n) Bit-serial Θ(1) * Θ(n 2 ) Θ(1/n 2 ) Lots more multiplier technology: fast adders, Booth Encoding, column compression,... L10 - Multipliers 25