Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units
Anoop Bhagyanath and Klaus Schneider
Embedded Systems Chair, University of Kaiserslautern
ACSD
Outline
1. Motivation
2. Queue-based Code Generation
3. Mapping to SMT
4. Preliminary Results
5. Future Work
Motivation
Instruction Level Parallelism (ILP)
Figure: panels labeled expression tree, dataflow graph, (2 Regs), and VLIW (4R 2W ports), built from ld, st, and operand nodes including x3 and x4, with step counts 3 steps, 5 steps, and 4 steps.
Conventional Architectures
- ILP restricted due to limited number of registers and ports in register file
  - compiler spills variables to main memory
  - number of instructions packed into a VLIW word
- increasing number of registers is difficult
  - instruction format encoding
  - register file wiring
Exposed Datapath Architectures
- Sync Control Async Dataflow (SCAD)
  - grid of processing units (PUs)
  - FIFO buffers (queues) at inputs and outputs of PUs
- compiler also moves values from one PU to another: bypass registers
- although bypassing is used, the code generators still use register mappings
- examples: TTA, MOVE-PRO, TRIPS, WaveScalar, STA, FlexCore, etc.
SCAD Architecture
Figure: PUs with input (I) and output (O) buffers, driven by move instructions.
- move instruction bus (MIB) fills address slots
- data transport network (DTN) fills data slots
- PU fires if enough data available at input buffer heads
- application-specific
  - any arbitrary functionality in PUs
  - interconnect choice
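The firing rule on this slide can be illustrated with a minimal sketch. It assumes a PU with two input FIFO buffers and one output buffer, where each buffer entry has an address slot (reserved when a move instruction is issued) and a data slot (filled when the value arrives). The class and method names (`InputBuffer`, `ProcessingUnit`, `reserve`, `deliver`, `try_fire`) are hypothetical and not taken from the slides; the MIB and DTN are reduced to plain method calls.

```python
from collections import deque
import operator


class InputBuffer:
    """FIFO of [source, value] entries; value stays None until data arrives."""

    def __init__(self):
        self.entries = deque()

    def reserve(self, source):
        # Stands in for the MIB filling an address slot.
        self.entries.append([source, None])

    def deliver(self, source, value):
        # Stands in for the DTN filling the oldest empty slot reserved for `source`.
        for entry in self.entries:
            if entry[0] == source and entry[1] is None:
                entry[1] = value
                return True
        return False

    def head_ready(self):
        return bool(self.entries) and self.entries[0][1] is not None

    def pop(self):
        return self.entries.popleft()[1]


class ProcessingUnit:
    """Hypothetical two-input PU that fires only when both buffer heads hold data."""

    def __init__(self, op):
        self.op = op
        self.inp1 = InputBuffer()
        self.inp2 = InputBuffer()
        self.out = deque()

    def try_fire(self):
        if self.inp1.head_ready() and self.inp2.head_ready():
            self.out.append(self.op(self.inp1.pop(), self.inp2.pop()))
            return True
        return False


pu = ProcessingUnit(operator.sub)
pu.inp1.reserve("A")
pu.inp2.reserve("B")
pu.inp1.reserve("C")
pu.inp2.reserve("B")

pu.inp1.deliver("C", 7)       # arrives early, but is not at the buffer head
fired_early = pu.try_fire()   # head of inp1 is still waiting for "A"

pu.inp1.deliver("A", 10)
pu.inp2.deliver("B", 4)
pu.inp2.deliver("B", 2)

results = []
while pu.try_fire():
    results.append(pu.out.popleft())
print(fired_early, results)
```

In this sketch a value that arrives out of order waits in its reserved slot, so the order in which move instructions are issued, not the order in which data arrives, determines which operands are paired.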
Queue-based Code Generation
Code Generation for Queue Machine
Figure: expression tree with nodes z1, y1, y2, x3, x4 (among others), stepped through four frames. Executed in each frame:
1. …, …, x3, x4
2. y1
3. y2
4. z1
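The execution order in these frames (operands first, then y1 and y2, then z1) matches a level-order traversal of the expression tree, deepest level first. A minimal sketch of that idea follows. The operators (`sub`, `mul`), the leaf names x1 and x2, and the function names are assumptions made for illustration; the slides do not show them.

```python
from collections import deque
import operator

# Hypothetical tree: z1 = y1 * y2, y1 = x1 - x2, y2 = x3 - x4.
TREE = {
    "z1": ("mul", "y1", "y2"),
    "y1": ("sub", "x1", "x2"),
    "y2": ("sub", "x3", "x4"),
}
OPS = {"mul": operator.mul, "sub": operator.sub}


def level_order_program(tree, root):
    """Order nodes level by level, deepest level first, left to right."""
    levels = []
    frontier = [root]
    while frontier:
        levels.append(frontier)
        frontier = [child
                    for node in frontier if node in tree
                    for child in tree[node][1:]]
    return [node for level in reversed(levels) for node in level]


def run_queue_program(program, tree, inputs):
    """Leaves enqueue their value; operations dequeue two operands from the
    head of the queue and enqueue the result at the tail."""
    queue = deque()
    for node in program:
        if node in tree:
            op = tree[node][0]
            left = queue.popleft()
            right = queue.popleft()
            queue.append(OPS[op](left, right))
        else:
            queue.append(inputs[node])
    return queue.popleft()


program = level_order_program(TREE, "z1")
inputs = {"x1": 9, "x2": 4, "x3": 7, "x4": 5}
result = run_queue_program(program, TREE, inputs)
print(program, result)
```

No operand addresses appear in the program: each operation implicitly takes its operands from the queue head, which is why the node order alone has to guarantee that the right values are there.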
Computation Overhead for Basic Blocks
Figure: four panels labeled given DAG, levelized DAG, planar DAG, and level-planar DAG, each computing y1 and y2; the transformed DAGs contain additional D and S nodes. The last panel is paired with the queue program below.
Queue program:
- load,1
- load,2
- add 2
- dup 2
- dup 1
- swap
- dup 1
- mul 1
- add 1
- store y1
- store y2
Depth-First Traversal
Current compilers
- order nodes by depth-first traversal
- minimize register usage
- optimal code for expression trees
  - Sethi-Ullman algorithm
  - polynomial time
- optimal code for directed acyclic graphs (DAGs)
  - proved to be NP-complete
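As a reminder of how the Sethi-Ullman labeling works on expression trees, here is a minimal sketch. It assumes a load/store machine in which every leaf needs one register, and it represents a tree as either a leaf name or a tuple `(op, left, right)`; that representation and the function names are choices made for this example.

```python
def sethi_ullman(node):
    """Registers needed to evaluate the tree without spilling.

    A node is a leaf name (str) or a tuple (op, left, right).
    Assumption: every leaf is loaded into a register, so a leaf needs 1.
    """
    if isinstance(node, str):
        return 1
    _, left, right = node
    need_left = sethi_ullman(left)
    need_right = sethi_ullman(right)
    # Equal needs cost one extra register to hold the first result.
    return max(need_left, need_right) if need_left != need_right else need_left + 1


def evaluation_order(node, out=None):
    """Depth-first order that evaluates the more demanding subtree first.

    Labels are recomputed at every node for brevity; a real implementation
    would cache them.
    """
    out = [] if out is None else out
    if isinstance(node, str):
        out.append(node)
        return out
    op, left, right = node
    if sethi_ullman(left) >= sethi_ullman(right):
        first, second = left, right
    else:
        first, second = right, left
    evaluation_order(first, out)
    evaluation_order(second, out)
    out.append(op)
    return out


balanced = ("mul", ("sub", "x1", "x2"), ("sub", "x3", "x4"))
skewed = ("add", "a", ("mul", "b", ("sub", "c", "d")))
print(sethi_ullman(balanced), sethi_ullman(skewed), evaluation_order(skewed))
```

The skewed tree needs only two registers because the deeper operand is always evaluated first, which is the sense in which depth-first ordering minimizes register usage on trees.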
DFT in Queue Machine
Figure: the same expression tree (nodes z1, y1, y2, x3, x4, among others), executed in depth-first order over three frames. Executed in each frame:
1. …, …
2. y1
3. x3, x4
Wrong order to execute y2!
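The failure in the last frame can be reproduced with a small simulation. The sketch below assumes the hypothetical expression (x1 - x2) * (x3 - x4); the operators and the leaves x1 and x2 are not given on the slides. After y1 is computed depth-first, it sits at the queue head in front of x3 and x4, so the operation for y2 dequeues y1 and x3 instead of x3 and x4.

```python
from collections import deque
import operator

OPS = {"sub": operator.sub, "mul": operator.mul}


def run_on_queue(program, inputs):
    """Each step is a leaf name to enqueue, or an operator name that
    dequeues two operands from the head and enqueues the result."""
    queue = deque()
    for step in program:
        if step in OPS:
            left = queue.popleft()
            right = queue.popleft()
            queue.append(OPS[step](left, right))
        else:
            queue.append(inputs[step])
    return queue.popleft()


inputs = {"x1": 9, "x2": 4, "x3": 7, "x4": 5}

# Depth-first (post-order): x1 x2 y1 x3 x4 y2 z1
depth_first = ["x1", "x2", "sub", "x3", "x4", "sub", "mul"]
# Level-order: x1 x2 x3 x4 y1 y2 z1
level_order = ["x1", "x2", "x3", "x4", "sub", "sub", "mul"]

expected = (inputs["x1"] - inputs["x2"]) * (inputs["x3"] - inputs["x4"])
print(expected, run_on_queue(level_order, inputs), run_on_queue(depth_first, inputs))
```

Both programs contain the same instructions; only the order differs, and only the level-order program pairs each operation with its intended operands.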
Queue code to SCAD code (MBMV)
Queue Instruction    Corresponding SCAD Move Instructions
..
(load,1)             [->inp1; load->opc; 1->cps]
..
(add 1)              [out->inp1; out->inp2; add->opc; 1->cps]
..
(dup 2)              [out->inp1; dup->opc; 2->cps]
..
(swap)               [out->inp1; out->inp2; swap->opc]
..
(store y1)           [y1->inp1; out->inp2; store->opc]
..
Is the SCAD code optimal?
Reduced Computation Overhead for SCAD
Figure: the given DAG (computing y1 and y2) next to the same DAG with a PU assigned to each node.
- Queue machine: one queue
  - single total order of all nodes
- SCAD machine: multiple queues
  - multiple partial orders of nodes
  - less computation overhead
- SAT-based SCAD code generation (MEMOCODE)
  - resource-optimal
  - at most 4 PUs for basic blocks of up to 15 instructions
Mapping to SMT
Problem statement
Given
- a basic block (in three-address SSA code),
- a SCAD machine with p universal PUs and 1 load-store unit,
- a desired execution time t:
determine if the basic block can be executed on the SCAD machine in time t without any computation overhead.
Relations
- $\alpha_{i,j}$: variable $x_i$ is assigned to PU $j$
- $\theta_{i,j}$: variable $x_i$ is scheduled in time slot $j$
Constraints
Binary values: for all $0 \le i \le n-1$,
$$0 \le \alpha_{i,j} \le 1 \ \text{for } 0 \le j \le p \quad \text{and} \quad 0 \le \theta_{i,j} \le 1 \ \text{for } 0 \le j \le t-1 \qquad (1)$$
Schedule exactly once: for all $0 \le i \le n-1$,
$$\sum_{j=0}^{t-1} \theta_{i,j} = 1 \qquad (2)$$
Unique PU assignment: for all $0 \le i \le n-1$,
$$\sum_{j=0}^{p} \alpha_{i,j} = 1 \qquad (3)$$
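Constraints (continued)
Data dependency
For each node $x_i$, $\tau_i$ is the time slot in which the node is scheduled:
$$\tau_i = \sum_{j=0}^{t-1} j \cdot \theta_{i,j}$$
For every instruction $x_t$ with operands $x_l$ and $x_r$:
$$\tau_t - \tau_l \ge \delta_l \;\wedge\; \tau_t - \tau_r \ge \delta_r \qquad (4)$$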
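To make constraints (1) to (4) concrete, the sketch below checks a candidate schedule against them and searches exhaustively for the smallest feasible t on a tiny basic block. It is not the SMT encoding used by the authors: it assumes that (4) reads $\tau_t - \tau_l \ge \delta_l$ and $\tau_t - \tau_r \ge \delta_r$ with $\delta$ as a per-node latency, it ignores the buffer-ordering constraint and any resource limits, and all function names are hypothetical.

```python
from itertools import product


def tau(theta_row):
    # tau_i = sum over j of j * theta[i][j]
    return sum(j * bit for j, bit in enumerate(theta_row))


def satisfies(theta, alpha, instrs, delta):
    """Check constraints (1)-(4) for 0/1 matrices theta (n x t) and alpha (n x (p+1)).

    instrs is a list of (target, left, right) variable indices.
    """
    binary = all(bit in (0, 1) for row in theta + alpha for bit in row)    # (1)
    scheduled_once = all(sum(row) == 1 for row in theta)                   # (2)
    unique_pu = all(sum(row) == 1 for row in alpha)                        # (3)
    dependencies = all(
        tau(theta[t]) - tau(theta[l]) >= delta[l]
        and tau(theta[t]) - tau(theta[r]) >= delta[r]
        for t, l, r in instrs
    )                                                                      # (4)
    return binary and scheduled_once and unique_pu and dependencies


def one_hot(index, width):
    return [1 if j == index else 0 for j in range(width)]


def min_time(n, p, instrs, delta, max_t=6):
    """Smallest t (up to max_t) for which some schedule satisfies (1)-(4)."""
    for t in range(1, max_t + 1):
        for slots in product(range(t), repeat=n):
            theta = [one_hot(s, t) for s in slots]
            # Placeholder assignment: every variable on PU 0, since this
            # sketch has no constraint that distinguishes PUs.
            alpha = [one_hot(0, p + 1) for _ in range(n)]
            if satisfies(theta, alpha, instrs, delta):
                return t, list(slots)
    return None


# x4 = f(x0, x1), x5 = f(x2, x3), x6 = f(x4, x5); unit latency everywhere.
instrs = [(4, 0, 1), (5, 2, 3), (6, 4, 5)]
delta = [1] * 7
best = min_time(n=7, p=2, instrs=instrs, delta=delta)
print(best)
```

The exhaustive search is only practical for a handful of variables; it is meant to show what a satisfying assignment of $\theta$ looks like, which is the question an SMT solver answers for a given t.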
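Constraints (continued)
Ordering variables in buffers
For every pair of instructions, $x_t$ with operands $x_l$ and $x_r$, and $x_{t'}$ with operands $x_{l'}$ and $x_{r'}$:
Buffer constraint (5) is stated over three ordering predicates: $\beta_{t,t'}$, defined by $\tau_t < \tau_{t'}$, and $\beta_{l,l'}$ and $\beta_{r,r'}$, which compare $\tau_l$ with $\tau_{l'}$ and $\tau_r$ with $\tau_{r'}$.
Figure: operand nodes $x_l$, $x_{l'}$, $x_r$, $x_{r'}$ feeding $x_t$ and $x_{t'}$.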
Preliminary Results
Performance (first plot)
- unit latency for all nodes in the basic block
Performance (second plot)
- 90% cache hit probability
- 1 cycle: hit
- 10 cycles: miss
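As a worked figure for the second setting, and assuming each memory access independently hits with probability 0.9, the expected access latency $\bar{\delta}$ (a symbol introduced here for illustration) follows directly from the stated parameters:

```latex
\bar{\delta} = 0.9 \cdot 1 + 0.1 \cdot 10 = 1.9 \text{ cycles}
```

So on average a memory access costs almost twice the unit latency assumed in the first setting.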
Future Work
- analyze buffer sizes
- hardness of optimal code generation problem
- efficient heuristics
Thank You! Questions?