Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units

Anoop Bhagyanath and Klaus Schneider
Embedded Systems Chair, University of Kaiserslautern
ACSD 2017

Outline
1. Motivation
2. Queue-based Code Generation
3. Mapping to SMT
4. Preliminary Results
5. Future Work

Motivation
Instruction-Level Parallelism (ILP)
[Figure labels: expression tree; dataflow graph; (2 Regs); VLIW (4R 2W ports). Step counts shown: 3 steps, 5 steps, 4 steps.]

Conventional Architectures
- ILP restricted due to limited number of registers and ports in register file
  - compiler spills variables to main memory
  - number of instructions packed into a VLIW word
- increasing number of registers is difficult
  - instruction format encoding
  - register file wiring

Exposed Datapath Architectures
- Sync Control Async Dataflow (SCAD)
  - grid of processing units (PUs)
  - FIFO buffers (queues) at inputs and outputs of PUs
- compiler also moves values from one PU to another: bypass registers
- although bypassing is used, the code generators still use register mappings
- examples: TTA, MOVE-PRO, TRIPS, WaveScalar, STA, FlexCore, etc.

SCAD Architecture
[Figure: PUs with input (I) and output (O) buffers, driven by move instructions]
- move instruction bus (MIB) fills address slots
- PU fires if enough data is available at the input buffer heads
- data transport network (DTN) fills data slots
- application-specific
  - arbitrary functionality in PUs
  - interconnect choice

Queue-based Code Generation

Code Generation for Queue Machine
[Figure, four animation steps: DAG with nodes z1, y1, y2 and leaves x1, x2, x3, x4.]
Executed per step:
1. x1, x2, x3, x4
2. y1
3. y2
4. z1

Computation Overhead for Basic Blocks
[Figure panels: given DAG; levelized DAG; planar DAG; level-planar DAG; queue program. Nodes labeled y1, y2, +, D, S.]
Queue program: load,1; load,2; add 2; dup 2; dup 1; swap; dup 1; mul 1; add 1; store y1; store y2

Depth-First Traversal
Current compilers
- order nodes by depth-first traversal
- minimize register usage
Optimal code for expression trees
- Sethi-Ullman algorithm
- polynomial time
Optimal code for directed acyclic graphs (DAGs)
- proved to be NP-complete
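The register-minimizing ordering for expression trees can be sketched in a few lines. The Python below is a minimal illustration of Sethi-Ullman labeling, assuming the load/store variant in which every leaf occupies one register and every inner node is binary; the names `Node`, `su_label`, and `evaluation_order` are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def su_label(node: Node) -> int:
    """Registers needed to evaluate the subtree without spilling."""
    if node.left is None and node.right is None:
        return 1  # assumption: each leaf is loaded into one register
    left, right = su_label(node.left), su_label(node.right)
    # Equal demands need one extra register to hold the first result.
    return left + 1 if left == right else max(left, right)


def evaluation_order(node: Node) -> List[str]:
    """Depth-first order that evaluates the more demanding child first."""
    if node.left is None and node.right is None:
        return [node.name]
    first, second = node.left, node.right
    if su_label(second) > su_label(first):
        first, second = second, first
    return evaluation_order(first) + evaluation_order(second) + [node.name]


# Right-leaning chain: r = a + (b + (c * d))
m = Node("m", Node("c"), Node("d"))
s = Node("s", Node("b"), m)
chain = Node("r", Node("a"), s)

# Balanced tree: z1 = (x1 * x2) + (x3 * x4)
balanced = Node(
    "z1",
    Node("y1", Node("x1"), Node("x2")),
    Node("y2", Node("x3"), Node("x4")),
)

print(su_label(chain), evaluation_order(chain))
print(su_label(balanced))
```

The chain needs only two registers because the deeper operand is always evaluated first, while the balanced tree needs three. Recomputing labels inside `evaluation_order` keeps the sketch short at the cost of repeated work.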

DFT in Queue Machine
[Figure, three animation steps: the same DAG with nodes z1, y1, y2 and leaves x1, x2, x3, x4.]
Executed per step:
1. x1, x2
2. y1
3. x3, x4: wrong order to execute y2!
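The difference between the two traversal orders can be replayed on a single FIFO queue. The Python sketch below assumes a DAG in which y1 consumes x1 and x2, y2 consumes x3 and x4, and z1 consumes y1 and y2; the leaf names x1 and x2, the dictionary `DAG`, and the function `runs_on_queue` are assumptions made for illustration only.

```python
from collections import deque
from typing import List

# Assumed DAG: each operation maps to its (left, right) operands.
DAG = {
    "y1": ("x1", "x2"),
    "y2": ("x3", "x4"),
    "z1": ("y1", "y2"),
}
LEAVES = {"x1", "x2", "x3", "x4"}


def runs_on_queue(order: List[str]) -> bool:
    """Return True if every operation finds its operands at the queue head."""
    queue = deque()
    for node in order:
        if node in LEAVES:
            queue.append(node)  # a leaf value is enqueued at the tail
            continue
        left, right = DAG[node]
        if len(queue) < 2 or (queue[0], queue[1]) != (left, right):
            return False  # operands are not at the head in the needed order
        queue.popleft()
        queue.popleft()
        queue.append(node)  # the result is enqueued at the tail
    return True


level_order = ["x1", "x2", "x3", "x4", "y1", "y2", "z1"]
depth_first = ["x1", "x2", "y1", "x3", "x4", "y2", "z1"]

print(runs_on_queue(level_order))
print(runs_on_queue(depth_first))
```

In the depth-first order, y1 is already at the head of the queue when y2 is due, so x3 and x4 cannot be consumed first. The level order keeps producers and consumers aligned with FIFO discipline.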

Queue code to SCAD code (MBMV)

Queue instruction    Corresponding SCAD move instructions
(load,1)             [->inp1; load->opc; 1->cps]
(add 1)              [out->inp1; out->inp2; add->opc; 1->cps]
(dup 2)              [out->inp1; dup->opc; 2->cps]
(swap)               [out->inp1; out->inp2; swap->opc]
(store y1)           [y1->inp1; out->inp2; store->opc]

Is the SCAD code optimal?

Reduced Computation Overhead for SCAD
[Figure panels: given DAG; DAG with PU assigned]
Queue machine: one queue
- single total order of all nodes

Reduced Computation Overhead for SCAD
[Figure panels: given DAG; DAG with PU assigned]
SCAD machine: multiple queues
- multiple partial orders of nodes
- less computation overhead
SAT-based SCAD code generation (MEMOCODE)
- resource-optimal
- at most 4 PUs for basic blocks of up to 15 instructions

Mapping to SMT

Problem statement
Given a basic block (in three-address SSA code), a SCAD machine with p universal PUs and 1 load-store unit, and a desired execution time t, determine whether the basic block can be executed on the SCAD machine in time t without any computation overhead.

Relations
- $\alpha_{i,j}$: variable $x_i$ is assigned to PU $j$
- $\theta_{i,j}$: variable $x_i$ is scheduled in time slot $j$

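Constraints
Binary values
\[
\bigwedge_{i=0}^{n-1} \bigwedge_{j=0}^{p} 0 \le \alpha_{i,j} \le 1
\quad\text{and}\quad
\bigwedge_{i=0}^{n-1} \bigwedge_{j=0}^{t-1} 0 \le \theta_{i,j} \le 1 \tag{1}
\]
Schedule exactly once
\[
\bigwedge_{i=0}^{n-1} \sum_{j=0}^{t-1} \theta_{i,j} = 1 \tag{2}
\]
Unique PU assignment
\[
\bigwedge_{i=0}^{n-1} \sum_{j=0}^{p} \alpha_{i,j} = 1 \tag{3}
\]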

Constraints...
Data dependency
For each node $x_i$, $\tau_i$ is the time slot in which the node is scheduled:
\[
\tau_i = \sum_{j=0}^{t-1} j \cdot \theta_{i,j}
\]
For every instruction $x_t = x_l \odot x_r$:
\[
\tau_t - \tau_l \ge \delta_l \;\wedge\; \tau_t - \tau_r \ge \delta_r \tag{4}
\]
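Constraints (1) to (4) can be checked mechanically for a candidate assignment. The Python sketch below is not the SMT encoding used in the talk; it only evaluates the four constraints on given 0/1 matrices. It assumes that δ denotes the latency of the producing node, that `alpha` has p + 1 columns (one per PU including the load-store unit, in an arbitrary order), and that instructions are given as index triples; `tau` and `check_schedule` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[int]]


def tau(theta_row: Sequence[int]) -> int:
    """Time slot of a node: sum over j of j * theta[i][j]."""
    return sum(j * bit for j, bit in enumerate(theta_row))


def check_schedule(
    theta: Matrix,
    alpha: Matrix,
    instrs: List[Tuple[int, int, int]],  # (target, left, right) node indices
    delta: Sequence[int],                # assumed: latency of each node
) -> bool:
    # (1) every entry of theta and alpha is binary
    binary = all(bit in (0, 1) for m in (theta, alpha) for row in m for bit in row)
    # (2) every node is scheduled in exactly one time slot
    once = all(sum(row) == 1 for row in theta)
    # (3) every node is assigned to exactly one PU
    unique_pu = all(sum(row) == 1 for row in alpha)
    if not (binary and once and unique_pu):
        return False
    times = [tau(row) for row in theta]
    # (4) a node starts only after both operands have finished
    return all(
        times[t] - times[l] >= delta[l] and times[t] - times[r] >= delta[r]
        for t, l, r in instrs
    )


# Tiny example: x2 = x0 op x1, two time slots, unit latency.
instrs = [(2, 0, 1)]
delta = [1, 1, 1]
alpha = [[1, 0], [1, 0], [0, 1]]
theta_ok = [[1, 0], [1, 0], [0, 1]]   # x0, x1 in slot 0; x2 in slot 1
theta_bad = [[1, 0], [0, 1], [0, 1]]  # x1 and x2 share slot 1

print(check_schedule(theta_ok, alpha, instrs, delta))
print(check_schedule(theta_bad, alpha, instrs, delta))
```

The second schedule fails only constraint (4): x2 would start in the same slot in which its operand x1 is produced.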

Constraints...
Ordering variables in buffers
For every pair of instructions $x_t = x_l \odot x_r$ and $x_{t'} = x_{l'} \odot x_{r'}$:

Buffer constraint
\[
\beta_{t,t'} \Longrightarrow \Big( (\tau_t < \tau_{t'}) \rightarrow \big( (\beta_{l,l'} \Longrightarrow \tau_l \le \tau_{l'}) \wedge (\beta_{r,r'} \Longrightarrow \tau_r \le \tau_{r'}) \big) \Big) \tag{5}
\]

Preliminary Results
Performance
- unit latency for all nodes in the basic block

Preliminary Results...
Performance
- 90% cache hit probability
- hit: 1 cycle
- miss: 10 cycles
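For orientation, the mean access latency implied by these numbers can be worked out directly. The symbol $\delta_{\text{mem}}$ is introduced here only for illustration, and whether the experiments use this mean or draw a latency per access is not stated on the slide.

\[
E[\delta_{\text{mem}}] = 0.9 \cdot 1 + 0.1 \cdot 10 = 1.9 \text{ cycles}
\]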

Future Work
- analyze buffer sizes
- hardness of optimal code generation problem
- efficient heuristics

Thank You! Questions?