Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units

Anoop Bhagyanath and Klaus Schneider
Embedded Systems Chair, University of Kaiserslautern
ACSD 2017

Outline
1. Motivation
2. Queue-based Code Generation
3. Mapping to SMT
4. Preliminary Results
5. Future Work

Motivation
Instruction-Level Parallelism (ILP)
[Figure: panel labels are expression tree, dataflow graph, (2 Regs), and VLIW (4R 2W ports); the panels show ld and st nodes and inputs x3, x4; step counts shown: 3 steps, 5 steps, 4 steps.]

Conventional Architectures
- ILP is restricted by the limited number of registers and ports in the register file
  - the compiler spills variables to main memory
  - number of instructions packed into a VLIW word
- increasing the number of registers is difficult
  - instruction format encoding
  - register file wiring

Exposed Datapath Architectures
Synchronous Control Asynchronous Dataflow (SCAD)
- grid of processing units (PUs)
- FIFO buffers (queues) at inputs and outputs of PUs
- the compiler also moves values from one PU to another, bypassing registers
- although bypassing is used, code generators still use register mappings
- examples of exposed datapath architectures: TTA, MOVE-PRO, TRIPS, WaveScalar, STA, FlexCore, etc.

SCAD Architecture
[Figure: PUs with input (I) and output (O) buffers, fed by move instructions.]
- the move instruction bus (MIB) fills address slots
- a PU fires if enough data is available at its input buffer heads
- the data transport network (DTN) fills data slots
- application-specific: arbitrary functionality in PUs, choice of interconnect

Queue-based Code Generation
Code Generation for Queue Machine
[Figure, four animation frames: a DAG with nodes z1, y1, y2, x3, x4. The "executed" column reads, frame by frame: ", , x3, x4"; y1; y2; z1.]
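The frames above appear to execute the graph level by level, inputs first and the root last. A minimal Python sketch of that idea for a plain expression tree follows. The tree shape, the operators, and the names `level_order_program` and `run_on_queue` are assumptions made for illustration; the slides do not give the operations behind y1, y2, and z1.

```python
from collections import deque
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def level_order_program(tree):
    """Emit a queue program: deepest level first, left to right.

    A tree is either a leaf name (str) or a tuple (op, left, right).
    """
    levels, frontier = [], [tree]
    while frontier:
        levels.append(frontier)
        # Children of all inner nodes of this level form the next level.
        frontier = [c for n in frontier if isinstance(n, tuple) for c in n[1:]]
    program = []
    for level in reversed(levels):
        for node in level:
            if isinstance(node, tuple):
                program.append(("op", node[0]))
            else:
                program.append(("load", node))
    return program

def run_on_queue(program, inputs):
    """Operands leave at the queue head, results enter at the tail."""
    queue = deque()
    for kind, arg in program:
        if kind == "load":
            queue.append(inputs[arg])
        else:
            left, right = queue.popleft(), queue.popleft()
            queue.append(OPS[arg](left, right))
    return queue.popleft()

# Hypothetical stand-in for the slide's graph: z1 = (x1 + x2) - (x3 * x4).
TREE = ("-", ("+", "x1", "x2"), ("*", "x3", "x4"))
INPUTS = {"x1": 1, "x2": 2, "x3": 3, "x4": 4}
```

With these inputs the program loads x1 to x4, then executes the two level-1 nodes, then the root, mirroring the order in the "executed" column.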

Computation Overhead for Basic Blocks
Transformation steps: given DAG, levelized DAG, planar DAG, level-planar DAG, queue program
[Figure: the DAG with outputs y1 and y2 at each step; inserted dup nodes are marked D and swap nodes S.]
Queue program: load,1; load,2; add 2; dup 2; dup 1; swap; dup 1; mul 1; add 1; store y1; store y2
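The dup and swap instructions in the queue program are the computation overhead: they only reorder or copy queue entries. The following Python sketch shows one plausible reading of the instruction set, in which the trailing number is the count of result copies appended to the queue. That reading, the function names, and the example program are assumptions; the program is a small hypothetical one, not the program from the slide.

```python
from collections import deque
import operator

BINARY = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
OVERHEAD = {"dup", "swap"}  # instructions that compute nothing new

def run_queue_program(program, memory):
    """Run a queue program and return the values written by store."""
    queue, stored = deque(), {}
    for instr in program:
        op = instr[0]
        if op == "load":                      # ("load", name, copies)
            _, name, copies = instr
            queue.extend([memory[name]] * copies)
        elif op in BINARY:                    # ("add", copies)
            left, right = queue.popleft(), queue.popleft()
            queue.extend([BINARY[op](left, right)] * instr[1])
        elif op == "dup":                     # ("dup", copies)
            queue.extend([queue.popleft()] * instr[1])
        elif op == "swap":                    # ("swap",)
            first, second = queue.popleft(), queue.popleft()
            queue.extend([second, first])
        elif op == "store":                   # ("store", name)
            stored[instr[1]] = queue.popleft()
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stored

def overhead(program):
    """Count the instructions that exist only to fix the queue order."""
    return sum(1 for instr in program if instr[0] in OVERHEAD)

# Hypothetical block: y1 = a * (a - b), y2 = a - b.
PROGRAM = [
    ("load", "a", 2), ("load", "b", 1),
    ("dup", 1),       # rotate one copy of a behind b
    ("sub", 2),       # a - b, needed twice
    ("mul", 1),       # a * (a - b)
    ("swap",),        # bring y1 in front of y2
    ("store", "y1"), ("store", "y2"),
]
```

In this example two of the eight instructions are overhead, which is the cost the later slides try to remove by using several queues.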

Depth-First Traversal
Current compilers
- order nodes by depth-first traversal
- minimize register usage
Optimal code for expression trees
- Sethi-Ullman algorithm
- polynomial time
Optimal code for directed acyclic graphs (DAGs)
- proved to be NP-complete
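As a reminder of how the Sethi-Ullman labeling works, here is a short Python sketch of the common simplified variant in which every leaf needs one register (a load-store machine). The tuple encoding of trees and the name `registers_needed` are illustrative assumptions, not taken from the slides.

```python
def registers_needed(tree):
    """Sethi-Ullman label: registers needed to evaluate a tree without spills.

    A tree is either a leaf name (str) or a tuple (op, left, right).
    Simplified variant: every leaf is loaded into one register.
    """
    if not isinstance(tree, tuple):
        return 1
    _, left, right = tree
    need_left = registers_needed(left)
    need_right = registers_needed(right)
    if need_left == need_right:
        # One register must hold the first result while the second subtree runs.
        return need_left + 1
    # Evaluate the more demanding subtree first; its register cost dominates.
    return max(need_left, need_right)

BALANCED = ("-", ("+", "x1", "x2"), ("*", "x3", "x4"))
LEFT_DEEP = ("+", ("+", ("+", "a", "b"), "c"), "d")
```

A depth-first code generator visits the subtree with the larger label first, which is what keeps register usage minimal for trees.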

DFT in Queue Machine
[Figure, three animation frames: the same DAG with nodes z1, y1, y2, x3, x4. The "executed" column reads, frame by frame: ","; y1; "x3, x4".]
Wrong order to execute y2!
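The failure can be reproduced with a few lines of Python: a depth-first (post-order) instruction sequence is correct for a stack or register machine, but on a queue the result y1 sits at the head when y2 fires, so y2 consumes the wrong operands. The tree, operators, and function names below are hypothetical stand-ins for the slide's graph.

```python
from collections import deque
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def post_order(tree):
    """Depth-first order: left subtree, right subtree, then the node."""
    if not isinstance(tree, tuple):
        return [("load", tree)]
    op, left, right = tree
    return post_order(left) + post_order(right) + [("op", op)]

def run_on_queue(program, inputs):
    """Queue machine: operands from the head, results to the tail."""
    queue = deque()
    for kind, arg in program:
        if kind == "load":
            queue.append(inputs[arg])
        else:
            left, right = queue.popleft(), queue.popleft()
            queue.append(OPS[arg](left, right))
    return queue.popleft()

def evaluate(tree, inputs):
    """Reference semantics of the tree, independent of any machine."""
    if not isinstance(tree, tuple):
        return inputs[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, inputs), evaluate(right, inputs))

# Hypothetical graph: z1 = y1 - y2 with y1 = x1 + x2 and y2 = x3 * x4.
TREE = ("-", ("+", "x1", "x2"), ("*", "x3", "x4"))
INPUTS = {"x1": 1, "x2": 2, "x3": 3, "x4": 4}
```

After y1 is computed, the queue holds y1, x3, x4, so the multiplication pairs y1 with x3 instead of x3 with x4.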

Queue code to SCAD code (MBMV)
Queue instruction    Corresponding SCAD move instructions
(load,1)             [->inp1; load->opc; 1->cps]
(add 1)              [out->inp1; out->inp2; add->opc; 1->cps]
(dup 2)              [out->inp1; dup->opc; 2->cps]
(swap)               [out->inp1; out->inp2; swap->opc]
(store y1)           [y1->inp1; out->inp2; store->opc]
Is the SCAD code optimal?
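The table reads like a per-instruction template expansion. A hedged Python sketch of such a translator is given below. It covers the add, dup, swap, and store rows only, because the source operand of the load row is missing in the transcript. The function name `to_scad_moves`, the tuple encoding, and the extension of the add row to other binary opcodes are assumptions.

```python
def to_scad_moves(instr):
    """Expand one queue instruction into a list of SCAD move strings.

    instr is a tuple such as ("add", 1), ("dup", 2), ("swap",),
    or ("store", "y1"), following the rows of the table.
    """
    op, *args = instr
    if op == "swap":
        return ["out->inp1", "out->inp2", "swap->opc"]
    if op == "dup":
        (copies,) = args
        return ["out->inp1", "dup->opc", f"{copies}->cps"]
    if op == "store":
        (address,) = args
        return [f"{address}->inp1", "out->inp2", "store->opc"]
    # Assumption: every other opcode is binary and follows the add row.
    (copies,) = args
    return ["out->inp1", "out->inp2", f"{op}->opc", f"{copies}->cps"]

def translate(program):
    """Translate a whole queue program, one move list per instruction."""
    return [to_scad_moves(instr) for instr in program]
```

Because every operand is taken from `out` of the single PU, this translation keeps the single total order of the queue machine, which is why the slide asks whether the resulting SCAD code is optimal.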

Reduced Computation Overhead for SCAD
[Figure: the given DAG with outputs y1 and y2, and the same DAG with a PU assigned to each node.]
Queue machine: one queue
- single total order of all nodes
SCAD machine: multiple queues
- multiple partial orders of nodes
- less computation overhead
SAT-based SCAD code generation (MEMOCODE)
- resource-optimal
- at most 4 PUs for basic blocks of up to 15 instructions

Mapping to SMT
Problem statement
Given
- a basic block (in three-address SSA code),
- a SCAD machine with p universal PUs and 1 load-store unit,
- a desired execution time t,
determine whether the basic block can be executed on the SCAD machine in time t without any computation overhead.
Relations
- $\alpha_{i,j}$: variable $x_i$ is assigned to PU $j$
- $\theta_{i,j}$: variable $x_i$ is scheduled in timeslot $j$

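Constraints
Binary values
$\bigwedge_{i=0}^{n-1} \bigwedge_{j=0}^{p} 0 \le \alpha_{i,j} \le 1$ and $\bigwedge_{i=0}^{n-1} \bigwedge_{j=0}^{t-1} 0 \le \theta_{i,j} \le 1$ (1)
Schedule exactly once
$\bigwedge_{i=0}^{n-1} \sum_{j=0}^{t-1} \theta_{i,j} = 1$ (2)
Unique PU assignment
$\bigwedge_{i=0}^{n-1} \sum_{j=0}^{p} \alpha_{i,j} = 1$ (3)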

Constraints (continued)
Data dependency
For each node $x_i$, $\tau_i$ is the time slot in which the node is scheduled:
$\tau_i = \sum_{j=0}^{t-1} j \cdot \theta_{i,j}$
For every instruction $x_t = x_l \odot x_r$:
$\tau_t - \tau_l \ge \delta_l \;\wedge\; \tau_t - \tau_r \ge \delta_r$ (4)
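To make the encoding concrete, the sketch below checks a candidate $\theta$ matrix against the binary-value, schedule-exactly-once, and data-dependency constraints in plain Python, without an SMT solver. It assumes that $\delta_l$ is the latency of the node producing the operand, and it leaves out the PU assignment $\alpha$. The function names and the three-node example block are hypothetical.

```python
def timeslot(theta_row):
    """tau_i = sum over j of j * theta[i][j]; the slot index for a one-hot row."""
    return sum(j * bit for j, bit in enumerate(theta_row))

def check_schedule(theta, operands, delta):
    """Check constraints (1), (2), and (4) for a candidate schedule.

    theta[i][j] is 1 if node i runs in timeslot j, else 0.
    operands maps a node index t to the indices (l, r) of its operands.
    delta[i] is the assumed latency of node i.
    """
    binary = all(bit in (0, 1) for row in theta for bit in row)   # (1)
    once = all(sum(row) == 1 for row in theta)                    # (2)
    tau = [timeslot(row) for row in theta]
    dependencies = all(
        tau[t] - tau[s] >= delta[s]                               # (4)
        for t, sources in operands.items()
        for s in sources
    )
    return binary and once and dependencies

# Hypothetical block: x2 = x0 op x1, unit latency, two timeslots.
OPERANDS = {2: (0, 1)}
DELTA = [1, 1, 1]
GOOD = [[1, 0], [1, 0], [0, 1]]   # x0, x1 in slot 0; x2 in slot 1
BAD = [[1, 0], [1, 0], [1, 0]]    # x2 scheduled together with its operands
```

An SMT solver searches for such a matrix instead of checking a given one, but the constraints it asserts are the same arithmetic.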

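Constraints (continued)
Ordering variables in buffers
For every pair of instructions $x_t = x_l \odot x_r$ and $x_{t'} = x_{l'} \odot x_{r'}$:
Buffer constraint
$\beta_{t,t'} \rightarrow \Big( (\tau_t < \tau_{t'}) \rightarrow (\beta_{l,l'} \rightarrow \tau_l \le \tau_{l'}) \wedge (\beta_{r,r'} \rightarrow \tau_r \le \tau_{r'}) \Big) \wedge \Big( (\tau_{t'} < \tau_t) \rightarrow (\beta_{l,l'} \rightarrow \tau_{l'} \le \tau_l) \wedge (\beta_{r,r'} \rightarrow \tau_{r'} \le \tau_r) \Big)$ (5)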

Preliminary Results
Performance
- unit latency for all nodes in the basic block

Preliminary Results (continued)
Performance
- 90% cache hit probability
- hit: 1 cycle
- miss: 10 cycles
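Under this cache model the average load latency follows directly from the three numbers on the slide. The short Python sketch below computes it, assuming that every load hits independently with the stated probability; the function name `expected_latency` is illustrative, and the slide itself does not quote the resulting figure.

```python
def expected_latency(hit_probability, hit_cycles, miss_cycles):
    """Mean latency of one load under a two-outcome hit/miss model."""
    miss_probability = 1 - hit_probability
    return hit_probability * hit_cycles + miss_probability * miss_cycles

# Values from the slide: 90% hits at 1 cycle, misses at 10 cycles.
AVERAGE = expected_latency(0.9, 1, 10)
```

With these values the mean is about 1.9 cycles per load, almost twice the unit latency assumed on the previous slide.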

Future Work
- analyze buffer sizes
- hardness of the optimal code generation problem
- efficient heuristics

Thank You! Questions?