Pipelining and Parallel Processing ( 范倫達 ), Ph. D. Department of Computer Science National Chiao Tung University Taiwan, R.O.C. Fall, 010 ldvan@cs.nctu.edu.tw http://www.cs.nctu.edu.tw/~ldvan/
Outlines Introduction Pipelining of FIR Digital Filter Parallel Processing Pipelining and Parallel Processing for Low Power Conclusions VLSI-DSP-3-
Pipelining Reduce the critical path Introduction Increase the clock speed or sample speed Reduce power consumption Parallel processing Not reduce the critical path Not increase clock speed, but increase sample speed Reduce power consumption VLSI-DSP-3-3
A 3-tap FIR Filter Direct-form structure y( n) ax( n) bx( n 1) cx( n ) T T T sample M A f sample T M 1 T A VLSI-DSP-3-4
Outlines Introduction Pipelining of FIR Digital Filter Parallel Processing Pipelining and Parallel Processing for Low Power Conclusions VLSI-DSP-3-5
Pipelining Pipelining and Parallel Concept Introduce pipelining latches along the datapath Parallel processing Duplicate the hardware T A VLSI-DSP-3-6
Pipelining FIR Filter Critical path T A +T M -->T A +T M VLSI-DSP-3-7
Drawbacks Pipelining (1/) Increase number of delay elements (registers/latches) in the critical path Increase latency Clock period limitation: critical path may be between An input and a latch A latch and an output Two Latches An input and an output Pipelining latches can only be placed across any feed-forward cutset of the graph VLSI-DSP-3-8
Pipelining (/) Cutset: A cutset is a set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint. Feed-forward cutset: A cutset is called a feed-forward cutset if the data move in the forward direction on all the edges of the cutset. VLSI-DSP-3-9
Example 3..1 4 u.t. Error! u.t. VLSI-DSP-3-10
Transposition Theorem Reversing the direction of all edges in a given SFG and interchanging the input and output ports preserve the functionality of the system. VLSI-DSP-3-11
Data-Broadcast Structure Direct Form II Critical path is reduced to (T M +T A ). VLSI-DSP-3-1
Fine-Gain Pipelining Let T M =10 u.t., T A = u.t., and the desired clock period=6 u.t. Break the MULTIPLIER into smaller units with processing time of 6 and 4 units. VLSI-DSP-3-13
Outlines Introduction Pipelining of FIR Digital Filter Parallel Processing Pipelining and Parallel Processing for Low Power Conclusions VLSI-DSP-3-14
Parallel Processing Parallel processing and pipelining are dual If a computation can be pipelined, it can also be processed in parallel. Convert a single-input single-output (SISO) system to multiple-input multiple-output (MIMO) system via parallelism VLSI-DSP-3-15
VLSI-DSP-3-16 Parallel Processing of 3-Tap FIR Filter (1/) ) (3 1) (3 ) (3 ) (3 1) (3 ) (3 1) (3 1) (3 ) (3 1) (3 ) (3 ) (3 k cx k bx k ax k y k cx k bx k ax k y k cx k bx k ax k y ) ( 1) ( ) ( ) ( n cx n bx n ax n y ) ( 3 1 1 A M clk sample iter T T T L T T
Parallel Processing of 3-Tap FIR Filter (/) How about direct form II? VLSI-DSP-3-17
Complete Parallel Processing System Critical path has remained unchanged. But the iteration period is reduced. VLSI-DSP-3-18
S/P and P/S Converter Edge Trigger! Edge Trigger! VLSI-DSP-3-19
Why Parallel Processing? Parallel leads to duplicating many copies of hardware, and the cost increases! Why use? Answer lies in the fact that the fundamental limit to pipelining is at I/O bottlenecks, referred to as Communication Bound, composed of I/O pad delay and the wire delay. Parallel Transmission VLSI-DSP-3-0
Combined Fine-Grain Pipelining and Parallel Processing T iter T 1 LM sample T clk 1 6 ( T M T A ) VLSI-DSP-3-1
Outlines Introduction Pipelining of FIR Digital Filter Parallel Processing Pipelining and Parallel Processing for Low Power Conclusions VLSI-DSP-3-
Underlying Low Power Concept Propagation delay T pd Power consumption P C C k(v total V 0 V charge 0 0 Vt ) f P Sequential filter seq C total V 0 f, T seq C k(v V charge 0 0 Vt ), f 1 T seq VLSI-DSP-3-3
Pipelining for Low-Power (1/) M-level pipelined system Critical path-->1/m, capacitance to be charged in a single clock cycle-->1/m If the clock frequency is maintained, the power supply can be reduced to V 0 (0<<1) VLSI-DSP-3-4
Pipelining for Low-Power (/) Power consumption P pip C total β V 0 f β P seq Propagation delay T seq C k(v V charge 0 0 Vt ), T pip Ccharge V M k( V V ) 0 t 0 Let T seq =T pip M( V0 V ) ( V V ) t 0 t get VLSI-DSP-3-5
Example 3.4.1 (1/) Consider an original 3-tap FIR filter and its fine-grain pipeline version shown in the following figures. Assume T M =10 ut, T A = ut, V t =0.6V, V o =5V, and C M =5C A. In fine-grain pipeline filter, the multiplier is broken into parts, m1 and m with computation time of 6 u.t. and 4 u.t. respectively, with capacitance 3 times and times that of an adder, respectively. (a) What is the supply voltage of the pipelined filter if the clock period remains unchanged? (b) What is the power consumption of the pipelined filter as a percentage of the original filter? VLSI-DSP-3-6
Example 3.4.1 (/) Solution: Original : Fine Grain : C (5 0.6) 3.0165V (5 36.4% 0.6) m1 6C m C 0.6033 or 0.039 (infeasible) V pip Ratio C charge charge C M C C A C A A 3C A VLSI-DSP-3-7
Comparison System Sequential FIR (Original) Pipelined FIR (Without reducing Vo) Pipelined FIR (With reducing Vo) Power (Ref) P Ref P Ref 0.364P Ref Clock Period (u.t.) Sample Period (u.t.) 1 ut 6 ut 1 ut 1 ut 6 ut 1 ut Thinking Again! VLSI-DSP-3-8
Parallel Processing for Low-Power L-parallel system Since maintaining the same sample rate, clock period is increased to LT seq This means that C charge is charged in LT seq, and the power supply can be reduced to V 0 VLSI-DSP-3-9
Parallel Processing for Low-Power Power consumption P par Propagation delay T seq LT seq =T par f ( LCtotal )( V0) L C k(v V charge 0 0 Vt ), T par P seq CchargeV k( V V ) 0 0 t L( V0 Vt ) ( V0 Vt ) get VLSI-DSP-3-30
Example 3.4. (1/) Consider a 4-tap FIR filter shown in Fig. 3.18(a) and its - parallel version in 3.18(b). The two architectures are operated at the sample period 9 u.t. Assume T M =8, T A =1, V t =0.45V, V o =3.3V, C M =8C A (a) What is the supply voltage of the -parallel filter? (b) What is the power consumption of the -parallel filter as a percentage of the original filter? Solution: Original Parallel 9(3.3 0.45) Vpar Ratio : : 0.6589.1743V C charge C or charge C 43.41% M C 0.08 C M A C A 5 (3.3 0.45) 10C A VLSI-DSP-3-31
Example 3.4. (/) x(k) x(k+1) VLSI-DSP-3-3
Example 3.4.3 (1/) A more efficient structure than the previous one is depicted in Fig. 3.18(c). (a) What is the supply voltage of the efficient -parallel filter? (b) What is the power consumption of the efficient -parallel filter as a percentage of the original filter? Solution: Original New - Parallel : C 9(3.3 0.45) V pip Ratio : C 0.745 or 0.05 (infeasible).45857v P P par seq charge C 55C 1 (3.3 0.45) A M charge 35C C V A V A C 0 0 9C M 4C 1 f s f A s A 1C 43.6% A VLSI-DSP-3-33
Example 3.4.3 (/) VLSI-DSP-3-34
Combining Pipelining and Parallel Processing T Parallel-pipelined structure seq C k(v T pp LT seq V charge 0 0 Vt ), ML Ccharge V M k( V V ) ( V0 Vt ) ( V0 Vt ) M=L=, V 0 =5V, V t =0.6V-->=0.4, =0.16 T pp 0 t 0 VLSI-DSP-3-35
Conclusions Methodologies of pipelining 3-tap FIR filter Methodologies of parallel processing for 3-tap FIR filter Methodologies of using pipelining and parallel processing for low power demonstration. Pipelining and parallel processing of recursive digital filters using look-ahead techniques are addressed in Chapter 10. VLSI-DSP-3-36