Performance Evaluation of a Reconfigurable Instruction Set Processor

Similar documents
A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor

Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator

CPSC 3300 Spring 2017 Exam 2

CMP N 301 Computer Architecture. Appendix C

ICS 233 Computer Architecture & Assembly Language

CSCI-564 Advanced Computer Architecture

COVER SHEET: Problem#: Points

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

Computer Engineering Department. CC 311- Computer Architecture. Chapter 4. The Processor: Datapath and Control. Single Cycle

CPU DESIGN The Single-Cycle Implementation

CMP 338: Third Class

SELECTION OF ARX MODELS ESTIMATED BY THE PENALIZED WEIGHTED LEAST SQUARES METHOD

Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

Computer Architecture. ECE 361 Lecture 5: The Design Process & ALU Design. 361 design.1

Computer Architecture

ECE 3401 Lecture 23. Pipeline Design. State Table for 2-Cycle Instructions. Control Unit. ISA: Instruction Specifications (for reference)

Lecture 13: Sequential Circuits, FSM

Lecture: Pipelining Basics

Processor Design & ALU Design

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units

A Simple Architectural Enhancement for Fast and Flexible Elliptic Curve Cryptography over Binary Finite Fields GF(2 m )

Implementing the Controller. Harvard-Style Datapath for DLX

Project Two RISC Processor Implementation ECE 485

Vector Lane Threading

FPGA Implementation of a Predictive Controller

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

ww.padasalai.net

Simple Instruction-Pipelining. Pipelined Harvard Datapath

3. (2) What is the difference between fixed and hybrid instructions?

On the Design of a Hardware-Software Architecture for Acceleration of SVM s Training Phase

Simple Instruction-Pipelining. Pipelined Harvard Datapath

Arithmetic and Logic Unit First Part

Performance, Power & Energy

Computer Architecture

Evaluating Overheads of Multi-bit Soft Error Protection Techniques at Hardware Level Sponsored by SRC and Freescale under SRC task number 2042

MULTIRELATIONAL MODELS OF LAZY, MONODIC TREE, AND PROBABILISTIC KLEENE ALGEBRAS

2. Accelerated Computations

Performance Metrics & Architectural Adaptivity. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

L07-L09 recap: Fundamental lesson(s)!

EE115C Winter 2017 Digital Electronic Circuits. Lecture 19: Timing Analysis

Visualization of the Hollowness in Unit Particles of Allophane and Imogolite

MICROPROCESSOR REPORT. THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE

CMP 334: Seventh Class

九州大学学術情報リポジトリ Kyushu University Institutional Repository. Ohara, Kenji.

Instruction Set Extensions for Reed-Solomon Encoding and Decoding

Embedded Systems Design: Optimization Challenges. Paul Pop Embedded Systems Lab (ESLAB) Linköping University, Sweden

Lecture 34: Portable Systems Technology Background Professor Randy H. Katz Computer Science 252 Fall 1995

LRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation

Power Consumption Analysis. Arithmetic Level Countermeasures for ECC Coprocessor. Arithmetic Operators for Cryptography.

INFORMATION THEORETICAL APPROACHES IN GAME THEORY

[2] Predicting the direction of a branch is not enough. What else is necessary?

Temperature-aware Task Partitioning for Real-Time Scheduling in Embedded Systems

Table of Content. Chapter 11 Dedicated Microprocessors Page 1 of 25

Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

Microprocessor Power Analysis by Labeled Simulation

Novel Devices and Circuits for Computing

[2] Predicting the direction of a branch is not enough. What else is necessary?

1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished?

Chapter 8. Low-Power VLSI Design Methodology

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

Simple Instruction-Pipelining (cont.) Pipelining Jumps

CHAPTER log 2 64 = 6 lines/mux or decoder 9-2.* C = C 8 V = C 8 C * 9-4.* (Errata: Delete 1 after problem number) 9-5.

This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example

Lecture 13: Sequential Circuits, FSM

School of EECS Seoul National University

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements.

EEC 216 Lecture #2: Metrics and Logic Level Power Estimation. Rajeevan Amirtharajah University of California, Davis

Lecture 6: Logical Effort

For smaller NRE cost For faster time to market For smaller high-volume manufacturing cost For higher performance

Disturbance-Estimation-Based Robust Control for Ropeless Linear Elevator System

THE UNIVERSITY OF MICHIGAN. Faster Static Timing Analysis via Bus Compression

High Performance Computing

Logic BIST. Sungho Kang Yonsei University

4. (3) What do we mean when we say something is an N-operand machine?

Scalable Store-Load Forwarding via Store Queue Index Prediction

EE 447 VLSI Design. Lecture 5: Logical Effort

Accurate Estimation of Cache-Related Preemption Delay

Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator

A Second Datapath Example YH16

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

CprE 281: Digital Logic

Computer Architecture. ESE 345 Computer Architecture. Design Process. CA: Design process

ELECTROMAGNETIC FAULT INJECTION: TOWARDS A FAULT MODEL ON A 32-BIT MICROCONTROLLER

ブドウ属植物の光合成速度に及ぼす光強度の影響

Analog Computation in Flash Memory for Datacenter-scale AI Inference in a Small Chip

EE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining

Fall 2008 CSE Qualifying Exam. September 13, 2008

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University

Dynamic Voltage and Frequency Scaling Under a Precise Energy Model Considering Variable and Fixed Components of the System Power Dissipation

CSE Computer Architecture I

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Formal Verification of Systems-on-Chip

Transposition Mechanism for Sparse Matrices on Vector Processors

L16: Power Dissipation in Digital Systems. L16: Spring 2007 Introductory Digital Systems Laboratory

Influence of Pore Size and Surface Functionality of Activated Carbons on Adsorption Behaviors of Indole and Amylase

Stochastic Dynamic Thermal Management: A Markovian Decision-based Approach. Hwisung Jung, Massoud Pedram

CMU Introduction to Computer Architecture, Spring 2015 HW 2: ISA Tradeoffs, Microprogramming and Pipelining

Unit 6: Branch Prediction

Today. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of

From Sequential Circuits to Real Computers

Transcription:

九州大学学術情報リポジトリ Kyushu University Institutional Repository Performance Evaluation of a Reconfigurable Instruction Set Processor Mehdipour, Farhad Faculty of Information Science and Electrical Engineering, Kyushu University Noori, Hamid Faculty of Information Science and Electrical Engineering, Kyushu University Honda, Hiroaki Faculty of Information Science and Electrical Engineering, Kyushu University Inoue, Koji Faculty of Information Science and Electrical Engineering, Kyushu University 他 https://doi.org/10.15017/14879 出版情報 :SLRC プレゼンテーション, pp.1-, 2008-11-24 バージョン :published 権利関係 :

Performance Evaluation of a Reconfigurable Instruction Set Processor Farhad Mehdipour, H. Noori, H. Honda, K. Inoue, K. Murakami Faculty of Information Science and Electrical Engineering, g, KYUSHU UNIVERSITY, Fukuoka, JAPAN {farhad@c.csce.kyushu-u.ac.jp}

Outline Reconfigurable Instructions Set Processors A Combined Analytical and Simulation-Based Model (CAnSO) Model Extraction and Calibration Basic Model Definitions Speedup Formulations Simplification and Calibration Experiments Experimental Setup Model Validation Design Space Exploration Using CAnSO Effects of Modifications Conclusions and Future Work 2/XXXII

Designing Embedded Systems Embedded d Microprocessors Application-Specific Integrated Circuits (ASICs) Application-Specific p Instruction set Processors (ASIPs) Extensible Processors 3/XXXII

Extensible Processors Mechanism Acceleration by using CFU a hardware is augmented to the base processor Executes hot portions of applications 4/XXXII

Extensible Processors Base processor (BP)'s fixed instruction set + Custom Instructions Goals Improving the performance and energy efficiency Maintaining compatibility and flexibility CPU LD/ST: Load / Store CFU: Custom Functional Unit Instruction Dispatcher + & x LD/ST CFU1 CFU2 Register File 5/XXXII

Custom Instructions Instruction set customization hardware/software partitioning (Identifying critical segments in applications) Custom Instructions (CIs) are extracted from critical segments of an application and executed on a Custom Functional Unit (CFU) Critical segments: Most frequently executed (Hot) portions of the applications A CI can be represented as a DFG 6/XXXII

Extensible Processors Drawbacks: Lack of flexibility Long time and cost of designing and verifying Many issues associated with designing a new processor from scratch: longer time-to-market to market and significant NRE (Non-Recurring Engineering) costs Solution Using a Reconfigurable Functional Unit (RFU) instead of fixed architecture CFU 7/XXXII

Reconfigurable Processors Microprocessor Reconfigurable Logic Reconfigurable Processor 8/XXXII

Processor coupling Processor RFU Coprocessor Memory Bridge Attached Processor Loose Coupling (PRISM-I) 9/XXXII

Reconfigurable e Instruction Set Processors s (RISPs) Adding and generating custom instructions after fabrication Using a reconfigurable FU(RFU) instead of custom FU CFU: Custom Functional Unit RFU: Reconfigurable Functional Unit CPU Instruction Dispatcher Config Mem + & x LD/ST CFU1 RFU CFU2 Register File 10/XXXII

How a RISP Works 400680 subiu $25,$25,1 400688 lbu $13,0($7) 400690 lbu $2,0($4) 400698 sll $2,$2,0x18 4006a0 sra $14,$2,0x18 4006a8 addiu $4,$4,1 4006b0 srl $8,$2,0x1c 4006b8 sll $2,$8,0x2 4006c0 addu $2,$2,$25 4006c8 lw $2,0($2) 4006d0 xori $13,$13,1 4006d8 addu $10,$10,$2 400680 subiu $25,$25,1 400698 sll $2,$2,0x18 4006a0 sra $14,$2,0x18 400688 lbu $13,0($7) Baseline Processor RISP RAC...... ALU... Register File Configuration Memory 4006e0 bgez $10,4006f0 y... A Hot Basic Block GPP: General Purpose Processor RAC=RFU: Reconfigurable Accelerator 11/XXXII

RISP Benefits and Drawbacks Benefits Specialized datapath Shared hardware Higher Speedup Less power consumption Drawbacks More area Difficult to use 12/XXXII

Performance Evaluation of a RISP Performance evaluation of a RISP challenges designing of a RISP architecture optimizing an existing arch. for an objective function For a designer obtaining optimum system configuration is desirable a performance analysis in terms of the performance metrics (speedup, area and so on) is required Performance evaluation models Structural models: includes empirical studies based on measurements and simulations of the target system Analytical models: incorporates a system (usually simplified) structure to obtain mathematically solvable models 13/XXXII

Fraction of Dynamic Instructions s in Applications % 100 90 80 70 60 50 40 30 20 10 0 BP Portion RAC Portion adpcm(d (dec) blowfish(e (enc) blowfish(d (dec) crc cjp jpeg djped dijks kstra lame patri tricia qsort rijndael(e (enc) rijndael(d (dec) sha sus usan Avera rage the RAC is responsible for executing almost 30% of dynamic instructions of applications in average 14/XXXII

Model Extraction and Utilization 15/XXXII

General e Template e of a RISP 16/XXXII

Basic Model Definitions Base Processor an in-order general five-stage RISC processor RAC a coarse-grained tightly-coupled reconfigurable hardware CIs are indexed for direct accessing of the configuration bit-stream The content of all registers are sent to the RAC (Shared RF) Controlling configurations Hardware-based: starting address of CI and index to the config. Mem. is stored in a CAM for quick retrieval Software-based: starting address of a CI is replaced with a special instruction Memory accesses Control instructions 17/XXXII

Single and Continuous Executions Continuous Execution Single Execution 18/XXXII

Latency of execution of CIi instructions on the BP Execution time on the BP Overall Speedup p Speedup Formulation nci i τ BP Ο i i= frac = 1 ntcc Total no. of executions of CIi fbp nci n i cc τ BP Οi i = 1 = n tcc Fraction of instructions executing on BP = 1 Execution time on the RAC frac n = tcc so n τ CI i ntcc + ψ ( θ, τ ) BP Οi 1 i= nci ψ ( θ, τ ) = ( θ ( + ) ) + ( + ) + ( ) ij τ RAC τovh τ RAC τovh θij 1 τ RAC i= 1 j S i j Ci ( ( ) frequency of jth occurrence of CIi Latency of RAC and the overhead reconfiguration time 19/XXXII

The Effect of CI Length Large CIs Including more instructions than the no. of available resources in the RAC Temporal Partitioning Dividing larger CIs to a number of smaller CIs L = { k k {1,..., nci }, lk > nfu } k L = Οi pk, mk L m = m i P = { p k k L, p k l = n ( ( 1,...,11 ), ( 1,...,11 ),..., ( 1,...,1 ) ), θ k L = m k L θ k L = θ k L θ k L = 1, = θ k FU S i L = { 1,... m i }, S i L = S i C ' i L =, C' i L = Ci } 20/XXXII

Side-Effects Control Instructions the rate of miss-predicted branches might be reduced higher speedup Instruction ti Cache Misses no need for fetching instructions belonging to the CIs access and miss rates to instruction cache are reduced BP fraction reduces speedup increases no. of penalty cycles for branch misspredictions/cache misses variation in branch/cache miss- predictions/misses s o n = tcc n δ + DI n τ i + ψ ( θ, tcc xm p xm Ο BP i x = { b, i} i = 1 τ ) 21/XXXII

RF s Input/Output Ports Register file is shared between BP and RAC Additional clock cycles for reading/writing from/to the RF no. of CIi s inputs no. of RF s read ports τ OVH = τ OVH + max(0, Δ i CI Δ Δ reg reg ) + max(0, no. of RF s write ports i CI reg reg ) 22/XXXII

The Assumed RAC Architecture FU FU FU FU............ 23/XXXII

RAC s Delay All FUs in the RAC implement similar operations Each mux receives all outputs of the FUs in upper rows and Outputs from its adjacent FUs at the same row w h h h 1 = { 0,1 w} τ RAC τ FU + τ, k,..., i= 1 i= 1 k MUX i n CI ψ ( θ, τ ) = τ i= 1 j S i j C i ( θ ( + ) ) + ( + ) + ( ) ) ij τ RAC τovh τ RAC τovh θij 1 RAC 24/XXXII

Simplification and Calibration Control instructions are not supported Reduction in instruction cache accesses as well as cache misses average reduction in access to i-cache is almost 17% average i-cache miss rate is almost 3%. % 35 30 25 20 15 10 5 0 Reduction in i-cache accesses Reduction in i-cache misses Average i-cache Accesses: 17% Misses: 3% adpcm(dec) blowfish(enc) blowfish(dec) crc cjpeg djped dijkstra lame patricia qsort rijndael(enc) rijndael(dec) sha susan Average 25/XXXII

Simplification and Calibration 100 90 80 70 60 % 50 40 30 20 10 0 Single Executions Single Executions after Partitioning CIs adpcm(dec) blowfish(enc) blowfish(dec) crc cjpeg djped dijkstra lame patricia Fraction of Single & Continuous Executions qsort rijndael(enc) rijndael(dec) sha susan Average the ratio of single to continuous execution 43% s o = n tcc ndi + + i Ο = w ntcc δ im pim τ BP i ψ θ, τ RACh τovh i 1 ψ * n CI ( w θ, τ RACh + τovh ) = Ο i α τ OVH i= 1 w + τ h RAC 26/XXXII

Experimental Setup Fourteen applications of Mibench automotive, security, consumer, network, telecommunication CIs (DFGs) are extracted from applications Simplescalar s cycle-accurate simulator is extended to simulate a reconfigurable instruction set processor Model Establishment simulating all applications collecting required information model simplification and calibration ~ 4 hours to completion on a PC: Dual Core, Intel 6600@2400Mhz, 2GB RAM 27/XXXII

ge Model Validation 240 220 Cycle-accurate simulation CAnSO 200 Uncalibrated CAnSO 180 160 140 Average variation= 2% c) sha susan Average rijndael(dec) ort Speedup x 100 120 100 adpcm(dec) blowfish(enc) blowfish(dec) crc cjpeg djped dijkstra lame patricia qsor rijndael(enc) Average variation= 22% 28/XXXII

Design Space Exploration Using CAnSO The design of a RAC including different components entails a multitude of design parameters Examining 100 design points using 14 applications: Simulation: 17 days CAnSO: 4 hours Using CAnSO, re-simulation is not needed d after establishing the model 29/XXXII

Using CAnSO for Design Space Exploration of the RAC Increasing the width of RAC increases speedup dikjstra blowfish(enc) crc 130 130 120 Spee edup x 100 130 120 110 100 90 80 70 60 1 2 3 4 Heigth 5 6 7 qsort 1 7 4 Width Spe eedup x 100 120 110 100 90 80 70 60 1 2 3 Heigth 4 5 6 7 1 3 7 5 Width Width> 6: no more speedup is achievable rijndael(enc) Speedup x 100 110 100 90 80 70 60 1 2 Heigth 3 4 5 6 7 2 3 4 5 6 7 Width 1 susan Speedup x 100 110 105 100 95 90 85 80 1 2 3 4 5 Heigth 6 7 7 5 3 Width 1 Speedup x 100 120 110 100 90 80 70 1 2 3 4 5 6 Heigth 7 5 3 Width 1 Speedup x 100 180 160 140 120 100 80 60 1 2 3 4 5 Heigth 6 7 7 5 3 Width 1 the small heights very low speedup Height> 5: RAC s longer critical path delay speedup declines 30/XXXII

Effect of Modifications Applying modification to the design - Small time is required for repeating the simulation - Each iteration of the CAnSO takes less than a minute Speed dup x 100 180 160 140 120 Simulation (8-read/4-write port RF) CAnSO (8-read/4-write port RF) Simulation (4-read/2-write port RF) CAnSO (4-read/2-write port RF) Simulation (2-read/1-write port RF) CAnSO (2-read/1-write port RF) 100 adpcm(dec) blowfish(enc) blowfish(dec) crc cjpeg djped dijkstra lame patricia qsort rijndael(enc) rijndael(dec) sha susan Average 31/XXXII

CONCLUSION Reconfigurable instruction set processors A combined analytical and simulation-based model (CAnSO) Suitable for exploring a large design space for the accelerator Sufficient flexibility in a rapid evaluation of modified target architectures Substantially reduce the design or optimization time while preserving a reasonable accuracy Proves less than 2% variation in evaluation results Uncalibrated CAnSO depicts 22% difference in average Future work: Expanding CAnSO to support control instructions Considering more complicated RAC architectures 32/XXXII