Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator


1 Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator
F. Mehdipour, Hiroaki Honda*, H. Kataoka, K. Inoue and K. Murakami
Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan
*Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan
farhad@c.csce.kyushu-u.ac.jp

2 Agenda
Introduction; SFQ-LSRDP General Architecture; The Design Procedure and Tool Chain; Input/Output Nodes Placement; Area Minimization; Experimental Results; Conclusions

3 CREST-JST SFQ-RDP Project (2006~): A low-power, high-performance reconfigurable processor based on single-flux quantum circuits
Yokohama National Univ. (Prof. N. Yoshikawa et al.): SFQ-FPU chip, cell library
Nagoya Univ. (Prof. A. Fujimaki et al.): SFQ-RDP chip, cell library, and wiring
Kyushu Univ. (Prof. K. Murakami et al.): architecture, compiler and applications
Nagoya Univ. (Prof. N. Takagi (Leader) et al.): CAD for logic design and arithmetic circuits
Superconducting Research Lab. (SRL) (Dr. S. Nagasawa et al.): SFQ process
Joint target: SFQ-LSRDP

4 Goals
Discovering appropriate scientific applications; developing compiler tools; developing performance analysis tools; designing and implementing the SFQ-LSRDP architecture considering the features and limitations of SFQ circuits.

5 How a reconfigurable processor works
[Figure: the application code in main memory is divided into non-critical code, executed on the GPP, and computation-intensive (critical) code, executed on the LSRDP (rows of PEs separated by ORNs).]

6 Single-flux quantum (SFQ) against CMOS
Main issues of CMOS in implementing a large accelerator: high electric power consumption; high heat radiation; difficulties in high-density packing.
SFQ features: high-speed switching and signal transmission; low power consumption; compact implementation (smaller area); suitable for pipeline processing of data streams.
[Figure labels: single flux quantum; superconducting loop; Josephson junction.]

7 Outline of the large-scale reconfigurable data-path (LSRDP) processor
[Figure: block diagram showing the GPP, the LSRDP (rows of PEs separated by ORNs), streaming buffers (SB), SMAC, scratchpad memory and main memory. ORN: Operand Routing Network.]
Reconfigurable data-path components: a matrix of a large number of floating-point Functional Units (FUs); a reconfigurable Operand Routing Network (ORN); dynamic reconfiguration facilities; Streaming Buffers (SB) for I/O ports.
Features: handling data flow graphs (DFGs) extracted from scientific applications; pipelined execution; burst transfer of rearranged input/output data from/to memory; reduced number of memory accesses (alleviating the memory wall problem).

8 SFQ-LSRDP General Architecture

9 LSRDP architecture
[Figure: array of processing elements between the input ports and the output ports; labels include MUL and Node 15.]
Processing Elements (PE): each PE includes two components and offers four functionalities.
FU (Functional Unit): implements basic 64-bit double-precision floating-point operations, namely ADD/SUB and MUL.
TU (Transfer Unit): a routing resource for transferring data between non-consecutive rows.

10 PE structures
[Figure: FU/TU arrangements of the three PE variants.]
PE basic arch.: 3 inputs / 2 outputs. PE arch. I: 4 inputs / 3 outputs. PE arch. II: 3 inputs / 3 outputs.

11 Layout types: Type I
[Figure: an array of width W and height H whose rows of PEs are separated by ORNs. Legend: A: ADD/SUB; M: MUL; T: Transfer Unit.]
Each PE implements ADD/SUB and MUL. Flexible, but consumes a lot of resources.

12 Layout types: Type II
[Figure: an array of width W and height H whose rows of PEs are separated by ORNs. Legend: A: ADD/SUB; M: MUL; T: Transfer Unit.]
Each PE implements ADD/SUB or MUL.

13 Maximum connection length (MCL): definition
MCL: the maximum horizontal distance between two connected PEs located in two subsequent rows.
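The MCL definition can be illustrated with a minimal Python sketch. The placement format (a dictionary from node name to a `(row, column)` pair) and the function name are assumptions made for illustration only; they are not taken from the slides.

```python
def max_connection_length(placement, edges):
    """Return the largest horizontal distance over all connections.

    placement: dict mapping node -> (row, column)   (hypothetical format)
    edges:     iterable of (source, destination) pairs, each joining
               PEs placed in two subsequent rows
    """
    mcl = 0
    for src, dst in edges:
        src_row, src_col = placement[src]
        dst_row, dst_col = placement[dst]
        if dst_row - src_row != 1:
            raise ValueError(f"{src}->{dst} does not join subsequent rows")
        mcl = max(mcl, abs(dst_col - src_col))
    return mcl


placement = {"a": (0, 0), "b": (0, 3), "c": (1, 1)}
edges = [("a", "c"), ("b", "c")]
print(max_connection_length(placement, edges))  # 2
```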

14 An ORN structure
[Figure: two rows of T and FPU units connected through an ORN built from CB and ½CB cells and T2 elements (T2: 2-bit shift register).]
The ORN consists of 2-bit shift registers and 1-by-2 and 2-by-2 crossbar switches.
A. Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08, 2008.

15 Dynamic reconfiguration architecture
Three bit-stream lines are used for dynamic reconfiguration of: the immediate registers (64-bit) in each PE; the selector bits of the muxes selecting the input data of the FUs; the crossbar switches in the ORNs.

16 What should be decided during the design procedure
Width and height? The number of I/O ports? Maximum Connection Length (MCL)? ORN size and structure? Layout: FU types (ADD/SUB and MUL)? Reconfiguration mechanism (PE, ORN, immediate data)? On-chip memory configuration?

17 The Design Procedure and Tool Chain

18 Compiler and design flow
DFGs are manually generated. DFG mapping results are employed for: analyzing LSRDP architecture statistics (a quantitative approach); generating LSRDP configuration bit-streams.

19 Benchmark applications
Finite difference method calculation of second-order partial differential equations: 1-dim. heat equation (Heat); 1-dim. vibration equation (Vibration); 2-dim. Poisson equation (Poisson).
Quantum chemistry application: recursive parts of the Electron Repulsion Integral calculation (ERI-Rec).
Types of operations in the calculations: ADD/SUB and MUL.

20 DFG extraction: heat equation
1-dim. heat equation for $T(x,t)$: $\frac{\partial T(x,t)}{\partial t} = A \frac{\partial^2 T(x,t)}{\partial x^2}$ ($A$ is const.)
Calculation by the Finite Difference Method (FDM): $T(x_i, t_{j+1}) = D \cdot T(x_i, t_j) + B \cdot [T(x_{i+1}, t_j) + T(x_{i-1}, t_j)]$
Basic DFG: inputs T(i-1,j), T(i,j) and T(i+1,j); two additions and two multiplications by the constants D and B; output T(i,j+1).
The basic DFG can be extended in the horizontal and vertical directions to make a larger DFG.
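As a sketch of what the basic DFG computes, the following Python function applies one FDM time step, reading the slide's update as T(x_i, t_j+1) = D*T(x_i, t_j) + B*(T(x_i+1, t_j) + T(x_i-1, t_j)). The choice D = 1 - 2r and B = r with r = A*dt/dx^2 is the standard explicit scheme and is an assumption here, as are the fixed boundary values and the function name.

```python
def heat_step(T, D, B):
    """Apply the basic DFG to every interior grid point for one time step.

    Boundary values are kept fixed (an assumption made for this sketch).
    """
    new_T = list(T)
    for i in range(1, len(T) - 1):
        neighbours = T[i - 1] + T[i + 1]      # first ADD
        new_T[i] = D * T[i] + B * neighbours  # two MULs and the second ADD
    return new_T


r = 0.25               # assumed value of A*dt/dx^2
D, B = 1 - 2 * r, r    # assumed constants of the explicit scheme
T0 = [0.0, 0.0, 1.0, 0.0, 0.0]
print(heat_step(T0, D, B))  # [0.0, 0.25, 0.5, 0.25, 0.0]
```

Each interior point uses two additions and two multiplications, matching the operation types listed for the benchmarks (ADD/SUB and MUL).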

21 A sample DFG (Heat)
Inputs: 32; Outputs: 16; Operations: 721; Immediates: 364

22 DFG mapping flow
Inputs: DFG; LSRDP architecture description.
Modified mapping flow: placing DFG nodes on LSRDP; placing IO nodes; re-placing DFG nodes on LSRDP (considering IO node positions); routing connections; re-placing output nodes; routing Inp/Out connections.
Output: configuration file.
[Figure annotation: longest connections, MCL = 2.]

23 Placing Input/Output Nodes

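24 Fan-out based I/O nodes placement
$n_i$: the number of children of input node $i$; $C_{i,1}, C_{i,2}, C_{i,3}, \dots, C_{i,n_i}$: the locations of the children; $X$: the location of input node $i$.
Total Connection Length: $TCL = |C_{i,1} - X| + |C_{i,2} - X| + \dots + |C_{i,n_i} - X|$
Objective: minimize TCL.
$n_i = 1$: $X = C_{i,1}$; $n_i = 2$: $C_{i,1} \le X \le C_{i,2}$; $n_i = 3$: $X = C_{i,2}$; $n_i \ge 2$: $X = C_{i,j}$, $j = 2, \dots, n_i - 1$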
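A sum of absolute distances is minimized at a median of the child locations, which agrees with the cases listed on the slide for one, two and three children. The sketch below uses that fact; the function names and the choice of the lower median are assumptions made for illustration, not the authors' implementation.

```python
def total_connection_length(children, x):
    """TCL: sum of horizontal distances from position x to every child."""
    return sum(abs(c - x) for c in children)


def best_input_position(children):
    """Return a TCL-minimizing position: the (lower) median child location."""
    ordered = sorted(children)
    return ordered[(len(ordered) - 1) // 2]


children = [2, 5, 9]
x = best_input_position(children)
print(x, total_connection_length(children, x))  # 5 7
```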

25 One main reason for the large MCL
Input ports are far from each other.

26 Proximity-factor based placement
The proximity factor indicates how close to each other a pair of input ports should be placed: for a pair of input nodes, the more common descendants they have and the closer those descendants are, the higher the proximity factor.
$S_{i,j}$: the set of common descendants of input nodes $i$ and $j$.
$D_{k,i}$ ($= D_{k,j}$): the distance of common descendant node $k$ from input nodes $i$ and $j$ (equal to the ASAP execution level of the node).
$P = [p_{i,j}]_{n \times n}$, with $p_{i,j} = p_{j,i} = \sum_{k \in S_{i,j}} \frac{1}{D_{k,i}}$ for $i \ne j$.
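The following Python sketch computes proximity factors for a small DFG. It assumes the factor for a pair of inputs is the sum of 1/D over their common descendants, with D the ASAP level of the descendant; this is one reading of the flattened slide formula, not a confirmed definition. The graph format (a dictionary from node to its list of parents) and all function names are hypothetical.

```python
from itertools import combinations


def asap_levels(parents, inputs):
    """Inputs sit at level 0; any other node is one level below its deepest parent."""
    level = {i: 0 for i in inputs}

    def visit(node):
        if node not in level:
            level[node] = 1 + max(visit(p) for p in parents[node])
        return level[node]

    for node in parents:
        visit(node)
    return level


def descendants(parents, source):
    """All nodes reachable from `source` by following parent-to-child edges."""
    reached = set()
    changed = True
    while changed:
        changed = False
        for node, preds in parents.items():
            if node not in reached and any(p == source or p in reached for p in preds):
                reached.add(node)
                changed = True
    return reached


def proximity_factors(parents, inputs):
    """Assumed form: p[i, j] = sum of 1 / ASAP level over common descendants."""
    level = asap_levels(parents, inputs)
    desc = {i: descendants(parents, i) for i in inputs}
    return {
        (i, j): sum(1.0 / level[k] for k in desc[i] & desc[j])
        for i, j in combinations(inputs, 2)
    }


# Hypothetical DFG: n4 = f(I1, I2), n5 = g(n4, I3)
parents = {"n4": ["I1", "I2"], "n5": ["n4", "I3"]}
print(proximity_factors(parents, ["I1", "I2", "I3"]))
# {('I1', 'I2'): 1.5, ('I1', 'I3'): 0.5, ('I2', 'I3'): 0.5}
```

In this toy graph I1 and I2 share a descendant at level 1, so their factor is the largest and they would be placed next to each other.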

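27 Proximity factor: example
[Figure: a DFG with input nodes I1, I2 and I3, the common-descendant sets $S_{1,2}$, $S_{1,3}$ and $S_{2,3}$, and the resulting proximity factors $p_{1,2}$, $p_{1,3}$ and $p_{2,3}$.]
Input nodes I1 and I2 should be located closer to each other than to I3.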

28 Input nodes placement alg.: Example
$C(l) = \sum_{i=l+1}^{r-1} \frac{p_{ij}}{i-l}$, $C(r) = \sum_{i=l+1}^{r-1} \frac{p_{ij}}{r-i}$
if C(l) > C(r): l = l+1, L[l] = j; else: r = r+1, L[r] = j
[Figure: a row of input port positions N/2-3, ..., N/2+3. Step 1: placing the 1st input node with the highest proximity factor at the central position. Step 2: placing the 2nd input node with the highest proximity factor.]

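29 Input ports placement alg.: Example
Placing the i-th input node: the nodes placed so far occupy positions N/2-K (left end, l) to N/2+M (right end, r). C(l) and C(r) are compared, and node i is placed at the left end or at the right end accordingly.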

30 Area Minimization

31 Estimating the area of a PE
Assumptions: Area(FU) = Area(ADD/SUB) = Area(MUL); Area(TU) = Area(MUX) ≈ 0.1 x Area(FU).
PE basic arch.: Layout I: Area(PE) = 2.1 x Area(FU); Layout II: Area(PE) = 1.1 x Area(FU).
PE arch. I: Layout I: Area(PE) = 2.2 x Area(FU); Layout II: Area(PE) = 1.2 x Area(FU).
PE arch. II: Layout I: Area(PE) = 2.2 x Area(FU); Layout II: Area(PE) = 1.2 x Area(FU).
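The per-PE coefficients above can be turned into a rough estimate for a whole array. The sketch below only tabulates the slide's numbers; treating the array as width x height identical PEs, as well as the key and function names, are assumptions made for illustration.

```python
# Area(PE) in units of Area(FU), as quoted on the slide.
PE_AREA_IN_FU = {
    ("basic", "I"): 2.1, ("basic", "II"): 1.1,
    ("arch_I", "I"): 2.2, ("arch_I", "II"): 1.2,
    ("arch_II", "I"): 2.2, ("arch_II", "II"): 1.2,
}


def array_pe_area(arch, layout, width, height):
    """Total PE area of a width x height array, in units of Area(FU)."""
    return PE_AREA_IN_FU[(arch, layout)] * width * height


for layout in ("I", "II"):
    area = array_pe_area("basic", layout, width=8, height=4)
    print(f"Layout {layout}: {area:.1f} x Area(FU)")
```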

32 Estimating the ORN area: PE basic arch.
PE basic arch. (3-inps/2-outs), shown for MCL = 1.
Number of rows = 1.5 x W; number of columns = 4 x MCL.
Area(ORN) = 1.5 x W x (4 x MCL) x Area(CB)
W: the number of PEs in an RDP row.

33 Estimating the ORN area: PE arch. I
PE arch. I (4-inps/3-outs), shown for MCL = 1.
Number of rows = 2 x W; number of columns = 6 x MCL + 2.
Area(ORN) = 2 x W x (6 x MCL + 2) x Area(CB)

34 Estimating the ORN area: PE arch. II
PE arch. II (3-inps/3-outs), shown for MCL = 2.
Number of rows = 1.5 x W; number of columns = 4 x MCL + 1.
Area(ORN) = 1.5 x W x (4 x MCL + 1) x Area(CB)
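The three ORN area formulas (basic arch., arch. I and arch. II) share the form rows x columns x Area(CB), so they can be compared side by side. This is a minimal sketch: the dictionary keys and function name are hypothetical, and the result is expressed in units of Area(CB).

```python
# (row factor, column factor, column offset):
# rows = row_factor * W, columns = column_factor * MCL + column_offset
ORN_SHAPE = {
    "basic": (1.5, 4, 0),    # 1.5 x W x (4 x MCL)
    "arch_I": (2.0, 6, 2),   # 2 x W x (6 x MCL + 2)
    "arch_II": (1.5, 4, 1),  # 1.5 x W x (4 x MCL + 1)
}


def orn_area_in_cb(arch, width, mcl):
    """ORN area in units of Area(CB) for one RDP row of `width` PEs."""
    row_factor, col_factor, col_offset = ORN_SHAPE[arch]
    return row_factor * width * (col_factor * mcl + col_offset)


for arch in ORN_SHAPE:
    print(arch, orn_area_in_cb(arch, width=8, mcl=2))
# basic 96.0
# arch_I 224.0
# arch_II 108.0
```

Because the column count grows linearly with MCL in every variant, reducing the MCL during placement reduces the ORN area directly.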

35 A modified connection length measurement
New measurement technique for the net length, with $d_h$ the horizontal and $d_v$ the vertical distance between source (src) and destination (dest):
initial C.L. = $d_h$; modified C.L. = $d_h / d_v$
Example: dest1: C.L.(previous) = 3, C.L.(new) = 3; dest2: C.L.(previous) = 3, C.L.(new) = 1.
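The two measures can be compared with a short Python sketch. Positions are given as `(row, column)` pairs, which is an assumed representation; the destination coordinates below are chosen so that the numbers match the slide's example (previous C.L. of 3 for both destinations, new C.L. of 3 and 1).

```python
def connection_length(src, dst, modified=False):
    """Initial measure: d_h. Modified measure: d_h / d_v."""
    d_v = abs(dst[0] - src[0])  # vertical distance in rows
    d_h = abs(dst[1] - src[1])  # horizontal distance in columns
    if not modified:
        return d_h
    if d_v == 0:
        raise ValueError("source and destination must be in different rows")
    return d_h / d_v


src = (0, 0)
dest1 = (1, 3)  # one row below the source
dest2 = (3, 3)  # three rows below the source
print(connection_length(src, dest1), connection_length(src, dest1, modified=True))  # 3 3.0
print(connection_length(src, dest2), connection_length(src, dest2, modified=True))  # 3 1.0
```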

36 A modified connection length measurement: example
[Figure: a node with two candidate parents, with the candidate positions annotated by their $d_h$ and $d_h/d_v$ values.]
Parent 1 is chosen when C.L. is measured as $d_h$: MCL = 2.
Parent 2 is chosen when C.L. is measured as $d_h/d_v$: MCL = 1.

37 MCL minimization: using an MCL threshold
A maximum threshold is assumed for the MCL. During the placement process, for each C.L. larger than the threshold, the vertical distance is increased: $d_v \leftarrow d_v + \lfloor CL / MCL\_Threshold \rfloor$.
[Figure label: PE with the min. C.L. to the source.]
Example: max. permitted length = 2; $d_h = 3$ exceeds it and $d_v = 1$, so $d_v = d_v + \lfloor 3/2 \rfloor = d_v + 1 = 2$.
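A minimal sketch of the threshold rule, following the slide's worked example in which the vertical distance grows by the integer part of C.L. divided by the threshold. The function and parameter names are hypothetical.

```python
def stretched_vertical_distance(d_h, d_v, mcl_threshold):
    """Push the destination further down when the horizontal length is too long.

    If d_h does not exceed the threshold, d_v is returned unchanged;
    otherwise d_v grows by floor(d_h / mcl_threshold).
    """
    if d_h <= mcl_threshold:
        return d_v
    return d_v + d_h // mcl_threshold


# Slide example: max. permitted length = 2, d_h = 3, d_v = 1
print(stretched_vertical_distance(d_h=3, d_v=1, mcl_threshold=2))  # 2
```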

38 Basic placement and routing vs. integrated placement and routing
Both flows take the DFG and the LSRDP architecture description as inputs and produce a final map.
Basic placement and routing flow: placing input nodes; placing operational and output nodes; routing nets; routing IO nets.
Integrated placement and routing flow: placing input nodes using the PF-based alg.; placing operational nodes and routing nets (node by node); placing output nodes; routing output nets.

39 Experimental Results

40 Specifications of the benchmark DFGs
[Table columns: DFG; # of nodes; # of inputs; # of outputs; # of pure ops; max. inp. nodes fan-out; max. fan-out. Rows: Heat-8x, Heat-8x, Heat-16x, Poisson-3x, Vibration-4x, Vibration-8x, ERI (four DFGs), Max.]

41 Evaluation results for various architectures: MCL and ORN sizes
[Table: MCL and overall ORN size (x CB) for PE basic arch., PE arch. I and PE arch. II, under Layout-I and Layout-II, each with S1 and S2.]
S1: fan-out based I/O nodes placement, connection length measured as $l_h$. S2: proximity-factor based I/O nodes placement, connection length measured as $l_{hv}$.
S2 results in smaller MCL and ORN size for both layout types.

42 Evaluation results for various architectures: no. of utilized PEs
[Table: overall no. of PEs (x PE) for PE basic arch., PE arch. I and PE arch. II, under Layout-I and Layout-II, each with S1 and S2.]
By using $l_{hv}$, a larger number of RDP rows is utilized, so a larger number of PEs is employed for S2.

43 Evaluation results for various architectures: overall LSRDP area (KJJ)
[Table: overall LSRDP area (x KJJ) for basic PE arch. (3-inps/2-outs), PE arch. I (4-inps/3-outs) and PE arch. II (3-inps/3-outs), under Layout-I and Layout-II, each with S1 and S2.]
S2 results in a smaller overall area in terms of KJJ for both layout types. Layout II results in a smaller area. PE arch. II gives a smaller area.

44 A sample ORN implementation
[Figure: block diagram of a high-frequency test bench. Blocks: ladder, input shift register, circuit under test, output shift register. Signals: clkin_hf, clkin_lfin, clkin_lfout, data_in, data_out.]
[Figure: photograph of a chip with a 1-to-3 ORN prototype, showing the test bench, ladder, input shift register, circuit under test and output shift register.]

45 Conclusions
SFQ-LSRDP is a basic core of a high-performance, low-power computer. Data Flow Graphs (DFGs) extracted from scientific applications are mapped onto the LSRDP. The LSRDP micro-architecture is designed based on the characteristics of the DFGs via a quantitative approach. LSRDP is promising for resolving issues originating from CMOS technology as well as for achieving remarkable performance.
Acknowledgement: This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).

46 Thanks for your attention! Any questions?
