Pipelining and Parallel Processing

Similar documents
Pipelining and Parallel Processing

Pipelining and Parallel Processing

VLSI Signal Processing

PIPELINING AND PARALLEL PROCESSING. UNIT 4 Real time Signal Processing

DSP Design Lecture 5. Dr. Fredrik Edman.

Serial Parallel Multiplier Design in Quantum-dot Cellular Automata

Retiming. delay elements in a circuit without affecting the input/output characteristics of the circuit.

Pipelined and Parallel Recursive and Adaptive Filters

Issues on Timing and Clocking

Combinational Logic. Lan-Da Van ( 范倫達 ), Ph. D. Department of Computer Science National Chiao Tung University Taiwan, R.O.C.

Transformation Techniques for Real Time High Speed Implementation of Nonlinear Algorithms

Combinational Logic. Lan-Da Van ( 范倫達 ), Ph. D. Department of Computer Science National Chiao Tung University Taiwan, R.O.C.

L15: Custom and ASIC VLSI Integration

Chapter 4. Sequential Logic Circuits

Lecture 7: Logic design. Combinational logic circuits

Chaper 4: Retiming (Tái định thì) GV: Hoàng Trang

FAST FIR ALGORITHM BASED AREA-EFFICIENT PARALLEL FIR DIGITAL FILTER STRUCTURES

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman.

Fast Fir Algorithm Based Area- Efficient Parallel Fir Digital Filter Structures

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Sequential Logic Circuits

A Digit-Serial Systolic Multiplier for Finite Fields GF(2 m )

Discrete-Time Systems

DSP Configurations. responded with: thus the system function for this filter would be

EE371 - Advanced VLSI Circuit Design

CPE100: Digital Logic Design I

Parallel Multipliers. Dr. Shoab Khan

CSE241 VLSI Digital Circuits Winter Lecture 07: Timing II

Datapath Component Tradeoffs

Area-Time Optimal Adder with Relative Placement Generator

Forward and Reverse Converters and Moduli Set Selection in Signed-Digit Residue Number Systems

AREA EFFICIENT LINEAR-PHASE FIR DIGITAL FILTER STRUCTURES

Chapter 8. Low-Power VLSI Design Methodology

ELEC516 Digital VLSI System Design and Design Automation (spring, 2010) Assignment 4 Reference solution

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

Lecture 19 IIR Filters

8. Design Tradeoffs x Computation Structures Part 1 Digital Circuits. Copyright 2015 MIT EECS

Fault Modeling. 李昆忠 Kuen-Jong Lee. Dept. of Electrical Engineering National Cheng-Kung University Tainan, Taiwan. VLSI Testing Class

8. Design Tradeoffs x Computation Structures Part 1 Digital Circuits. Copyright 2015 MIT EECS

A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte

Signal Flow Graphs. Roger Woods Programmable Systems Lab ECIT, Queen s University Belfast

Design and Study of Enhanced Parallel FIR Filter Using Various Adders for 16 Bit Length

School of EECS Seoul National University

LECTURE 28. Analyzing digital computation at a very low level! The Latch Pipelined Datapath Control Signals Concept of State

Lecture 5: DC & Transient Response

Measurement of Electrical Resistance and Ohm s Law

9/18/2008 GMU, ECE 680 Physical VLSI Design

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design

GMU, ECE 680 Physical VLSI Design

EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture. Rajeevan Amirtharajah University of California, Davis

Boolean Algebra and Logic Gates

CMPEN 411 VLSI Digital Circuits Spring 2012 Lecture 17: Dynamic Sequential Circuits And Timing Issues

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

GMU, ECE 680 Physical VLSI Design 1

Chapter 5 CMOS Logic Gate Design

LOGIC CIRCUITS. Basic Experiment and Design of Electronics. Ho Kyung Kim, Ph.D.

Computer Architecture 10. Fast Adders

R13 SET - 1

FPGA Implementation of a Predictive Controller

Lecture 9: Clocking, Clock Skew, Clock Jitter, Clock Distribution and some FM

Timing Issues. Digital Integrated Circuits A Design Perspective. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolić. January 2003

Professor Fearing EECS150/Problem Set Solution Fall 2013 Due at 10 am, Thu. Oct. 3 (homework box under stairs)

Lecture 10, ATIK. Data converters 3

Testability. Shaahin Hessabi. Sharif University of Technology. Adapted from the presentation prepared by book authors.

Design for Manufacturability and Power Estimation. Physical issues verification (DSM)

A Practical Application of Wave-Pipelining Theory on a Adaptive Differential Pulse Code Modulation Coder-Decoder Design

FPGA IMPLEMENTATION OF BASIC ADDER CIRCUITS USING REVERSIBLE LOGIC GATES

Lecture 3 Review on Digital Logic (Part 2)

ECE 407 Computer Aided Design for Electronic Systems. Simulation. Instructor: Maria K. Michael. Overview

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

Novel Bit Adder Using Arithmetic Logic Unit of QCA Technology

Implementation of Discrete-Time Systems

Implementation of Clock Network Based on Clock Mesh

Memory Elements I. CS31 Pascal Van Hentenryck. CS031 Lecture 6 Page 1

Salphasic Distribution of Clock Signals Vernon L. Chi UNC Chapel Hill Department of Computer Science Microelectronic Systems Laboratory

Errata of K Introduction to VLSI Systems: A Logic, Circuit, and System Perspective

ARMADILLO: a Multi-Purpose Cryptographic Primitive Dedicated to Hardware

EECS 312: Digital Integrated Circuits Final Exam Solutions 23 April 2009

LOGIC CIRCUITS. Basic Experiment and Design of Electronics

Sequential Circuits. Circuits with state. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. L06-1

A Novel Low Power 1-bit Full Adder with CMOS Transmission-gate Architecture for Portable Applications

DESIGN OF LOW POWER-DELAY PRODUCT CARRY LOOK AHEAD ADDER USING MANCHESTER CARRY CHAIN

Lecture 27: Latches. Final presentations May 8, 1-5pm, BWRC Final reports due May 7 Final exam, Monday, May :30pm, 241 Cory

S.Y. Diploma : Sem. III [CO/CM/IF/CD/CW] Digital Techniques

DHANALAKSHMI COLLEGE OF ENGINEERING DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING EC2314- DIGITAL SIGNAL PROCESSING UNIT I INTRODUCTION PART A

Cost/Performance Tradeoffs:

Low Latency Architectures of a Comparator for Binary Signed Digits in a 28-nm CMOS Technology

Minimizing Clock Latency Range in Robust Clock Tree Synthesis

Design of Arithmetic Logic Unit (ALU) using Modified QCA Adder

Clock Strategy. VLSI System Design NCKUEE-KJLEE

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Chapter 3. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 3 <1>

ALU A functional unit

4.0 Update Algorithms For Linear Closed-Loop Systems

Datapath&Design&Approaches. Increasing&the&Initiation&Rate

Introduction to Digital Logic

Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic. November Digital Integrated Circuits 2nd Sequential Circuits

Sequential Logic. Handouts: Lecture Slides Spring /27/01. L06 Sequential Logic 1

VU Signal and Image Processing

Transcription:

Pipelining and Parallel Processing ( 范倫達 ), Ph. D. Department of Computer Science National Chiao Tung University Taiwan, R.O.C. Fall, 010 ldvan@cs.nctu.edu.tw http://www.cs.nctu.edu.tw/~ldvan/

Outlines Introduction Pipelining of FIR Digital Filter Parallel Processing Pipelining and Parallel Processing for Low Power Conclusions VLSI-DSP-3-

Pipelining Reduce the critical path Introduction Increase the clock speed or sample speed Reduce power consumption Parallel processing Not reduce the critical path Not increase clock speed, but increase sample speed Reduce power consumption VLSI-DSP-3-3

A 3-tap FIR Filter Direct-form structure y( n) ax( n) bx( n 1) cx( n ) T T T sample M A f sample T M 1 T A VLSI-DSP-3-4

Outlines Introduction Pipelining of FIR Digital Filter Parallel Processing Pipelining and Parallel Processing for Low Power Conclusions VLSI-DSP-3-5

Pipelining Pipelining and Parallel Concept Introduce pipelining latches along the datapath Parallel processing Duplicate the hardware T A VLSI-DSP-3-6

Pipelining FIR Filter Critical path T A +T M -->T A +T M VLSI-DSP-3-7

Drawbacks Pipelining (1/) Increase number of delay elements (registers/latches) in the critical path Increase latency Clock period limitation: critical path may be between An input and a latch A latch and an output Two Latches An input and an output Pipelining latches can only be placed across any feed-forward cutset of the graph VLSI-DSP-3-8

Pipelining (/) Cutset: A cutset is a set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint. Feed-forward cutset: A cutset is called a feed-forward cutset if the data move in the forward direction on all the edges of the cutset. VLSI-DSP-3-9

Example 3..1 4 u.t. Error! u.t. VLSI-DSP-3-10

Transposition Theorem Reversing the direction of all edges in a given SFG and interchanging the input and output ports preserve the functionality of the system. VLSI-DSP-3-11

Data-Broadcast Structure Direct Form II Critical path is reduced to (T M +T A ). VLSI-DSP-3-1

Fine-Gain Pipelining Let T M =10 u.t., T A = u.t., and the desired clock period=6 u.t. Break the MULTIPLIER into smaller units with processing time of 6 and 4 units. VLSI-DSP-3-13

Outlines Introduction Pipelining of FIR Digital Filter Parallel Processing Pipelining and Parallel Processing for Low Power Conclusions VLSI-DSP-3-14

Parallel Processing Parallel processing and pipelining are dual If a computation can be pipelined, it can also be processed in parallel. Convert a single-input single-output (SISO) system to multiple-input multiple-output (MIMO) system via parallelism VLSI-DSP-3-15

VLSI-DSP-3-16 Parallel Processing of 3-Tap FIR Filter (1/) ) (3 1) (3 ) (3 ) (3 1) (3 ) (3 1) (3 1) (3 ) (3 1) (3 ) (3 ) (3 k cx k bx k ax k y k cx k bx k ax k y k cx k bx k ax k y ) ( 1) ( ) ( ) ( n cx n bx n ax n y ) ( 3 1 1 A M clk sample iter T T T L T T

Parallel Processing of 3-Tap FIR Filter (/) How about direct form II? VLSI-DSP-3-17

Complete Parallel Processing System Critical path has remained unchanged. But the iteration period is reduced. VLSI-DSP-3-18

S/P and P/S Converter Edge Trigger! Edge Trigger! VLSI-DSP-3-19

Why Parallel Processing? Parallel leads to duplicating many copies of hardware, and the cost increases! Why use? Answer lies in the fact that the fundamental limit to pipelining is at I/O bottlenecks, referred to as Communication Bound, composed of I/O pad delay and the wire delay. Parallel Transmission VLSI-DSP-3-0

Combined Fine-Grain Pipelining and Parallel Processing T iter T 1 LM sample T clk 1 6 ( T M T A ) VLSI-DSP-3-1

Outlines Introduction Pipelining of FIR Digital Filter Parallel Processing Pipelining and Parallel Processing for Low Power Conclusions VLSI-DSP-3-

Underlying Low Power Concept Propagation delay T pd Power consumption P C C k(v total V 0 V charge 0 0 Vt ) f P Sequential filter seq C total V 0 f, T seq C k(v V charge 0 0 Vt ), f 1 T seq VLSI-DSP-3-3

Pipelining for Low-Power (1/) M-level pipelined system Critical path-->1/m, capacitance to be charged in a single clock cycle-->1/m If the clock frequency is maintained, the power supply can be reduced to V 0 (0<<1) VLSI-DSP-3-4

Pipelining for Low-Power (/) Power consumption P pip C total β V 0 f β P seq Propagation delay T seq C k(v V charge 0 0 Vt ), T pip Ccharge V M k( V V ) 0 t 0 Let T seq =T pip M( V0 V ) ( V V ) t 0 t get VLSI-DSP-3-5

Example 3.4.1 (1/) Consider an original 3-tap FIR filter and its fine-grain pipeline version shown in the following figures. Assume T M =10 ut, T A = ut, V t =0.6V, V o =5V, and C M =5C A. In fine-grain pipeline filter, the multiplier is broken into parts, m1 and m with computation time of 6 u.t. and 4 u.t. respectively, with capacitance 3 times and times that of an adder, respectively. (a) What is the supply voltage of the pipelined filter if the clock period remains unchanged? (b) What is the power consumption of the pipelined filter as a percentage of the original filter? VLSI-DSP-3-6

Example 3.4.1 (/) Solution: Original : Fine Grain : C (5 0.6) 3.0165V (5 36.4% 0.6) m1 6C m C 0.6033 or 0.039 (infeasible) V pip Ratio C charge charge C M C C A C A A 3C A VLSI-DSP-3-7

Comparison System Sequential FIR (Original) Pipelined FIR (Without reducing Vo) Pipelined FIR (With reducing Vo) Power (Ref) P Ref P Ref 0.364P Ref Clock Period (u.t.) Sample Period (u.t.) 1 ut 6 ut 1 ut 1 ut 6 ut 1 ut Thinking Again! VLSI-DSP-3-8

Parallel Processing for Low-Power L-parallel system Since maintaining the same sample rate, clock period is increased to LT seq This means that C charge is charged in LT seq, and the power supply can be reduced to V 0 VLSI-DSP-3-9

Parallel Processing for Low-Power Power consumption P par Propagation delay T seq LT seq =T par f ( LCtotal )( V0) L C k(v V charge 0 0 Vt ), T par P seq CchargeV k( V V ) 0 0 t L( V0 Vt ) ( V0 Vt ) get VLSI-DSP-3-30

Example 3.4. (1/) Consider a 4-tap FIR filter shown in Fig. 3.18(a) and its - parallel version in 3.18(b). The two architectures are operated at the sample period 9 u.t. Assume T M =8, T A =1, V t =0.45V, V o =3.3V, C M =8C A (a) What is the supply voltage of the -parallel filter? (b) What is the power consumption of the -parallel filter as a percentage of the original filter? Solution: Original Parallel 9(3.3 0.45) Vpar Ratio : : 0.6589.1743V C charge C or charge C 43.41% M C 0.08 C M A C A 5 (3.3 0.45) 10C A VLSI-DSP-3-31

Example 3.4. (/) x(k) x(k+1) VLSI-DSP-3-3

Example 3.4.3 (1/) A more efficient structure than the previous one is depicted in Fig. 3.18(c). (a) What is the supply voltage of the efficient -parallel filter? (b) What is the power consumption of the efficient -parallel filter as a percentage of the original filter? Solution: Original New - Parallel : C 9(3.3 0.45) V pip Ratio : C 0.745 or 0.05 (infeasible).45857v P P par seq charge C 55C 1 (3.3 0.45) A M charge 35C C V A V A C 0 0 9C M 4C 1 f s f A s A 1C 43.6% A VLSI-DSP-3-33

Example 3.4.3 (/) VLSI-DSP-3-34

Combining Pipelining and Parallel Processing T Parallel-pipelined structure seq C k(v T pp LT seq V charge 0 0 Vt ), ML Ccharge V M k( V V ) ( V0 Vt ) ( V0 Vt ) M=L=, V 0 =5V, V t =0.6V-->=0.4, =0.16 T pp 0 t 0 VLSI-DSP-3-35

Conclusions Methodologies of pipelining 3-tap FIR filter Methodologies of parallel processing for 3-tap FIR filter Methodologies of using pipelining and parallel processing for low power demonstration. Pipelining and parallel processing of recursive digital filters using look-ahead techniques are addressed in Chapter 10. VLSI-DSP-3-36