Mapping Sparse Matrix-Vector Multiplication on FPGAs

1 Mapping Sparse Matrix-Vector Multiplication on FPGAs
Junqing Sun (1), Gregory Peterson (1), Olaf Storaasli (2)
(1) University of Tennessee, Knoxville; (2) Oak Ridge National Laboratory
July 20, 2007

2 Outline
- Introduction
- Sparse matrix storage format
- Basic design
- Design for floating point
- Implementation results
- Performance analysis

3 Introduction
Sparse Matrix-Vector Multiplication (SpMxV): y = Ab
SpMxV on CPUs
- Inefficient
- Optimization algorithms: performance depends on matrix structure
SpMxV on FPGAs
- High throughput achieved for FPGA kernels
- System performance affected by I/O and other overheads

4 Sparse Matrix Storage Formats
CRS
- Widely used
- An example (arrays for matrix A):
    Val: 2, -3, -1, 6, 1, 9, 5, 8, 6
    Col: 0, 2, 1, 4, 0, 1, 0, 1, 3
    Len: 2, 2, 1, 1, 3
Row blocked CRS (RBCRS)
- Compatible with the CRS format
- Lower I/O requirements
[Figure: A partitioned into blocks (labels such as A10, A11, A21 and a zero block), multiplied by x to give y0, y1, y2]
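The Val/Col/Len arrays on this slide are enough to carry out the multiplication in software. The following is a minimal sketch, not the authors' hardware design: the function name `spmv_crs` and the input vector `b` are hypothetical, chosen only for illustration. It assumes `Len` holds the number of nonzeros in each row and that the matrix is 5 x 5, which is consistent with the largest column index (4) and the five `Len` entries.

```python
def spmv_crs(val, col, length, b):
    """Compute y = A b for A stored as CRS-style arrays.

    val    : nonzero values, row by row
    col    : column index of each nonzero
    length : number of nonzeros in each row (the slide's "Len" array)
    """
    y = []
    k = 0  # running position in val/col
    for n in length:
        acc = 0
        for _ in range(n):
            acc += val[k] * b[col[k]]
            k += 1
        y.append(acc)
    return y


# Arrays copied from the slide.
val = [2, -3, -1, 6, 1, 9, 5, 8, 6]
col = [0, 2, 1, 4, 0, 1, 0, 1, 3]
length = [2, 2, 1, 1, 3]

# Hypothetical input vector, not taken from the slides.
b = [1, 2, 3, 4, 5]

y = spmv_crs(val, col, length, b)
print(y)  # [-7, 28, 1, 18, 45]
```

Each row is a short multiply-accumulate chain over its own nonzeros, which is the operation a PE has to perform in hardware.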

5 Basic Design
[Figure: SpMxV block diagram. Components shown: application program, matrix storage, matrix manager, PEs (each with FIFO 1, FIFO 2, multiplier, ACC circuit, vector b and row ID; signals valid, col, val, stall), result mux, summation circuit, result controller, result BRAM]
Properties
- Deeply pipelined
- Common CRS format
- Data-flow controlled architecture
- Independent PEs
- Arbitrary matrix size

6 Design for Floating Points - Accumulation Circuit -
[Figure: partial summation circuit built around a pipelined adder, with a data-flow table showing input, wrong output and correct output]
Problems in the data flow
- Values are not accumulated to a single value (2, 3, 4, 5, 7 for the first row)
- Different rows are summed together (3 + 8 => 11)
* The adder has a pipeline of 5 stages
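The first problem can be reproduced with a small software model of a pipelined adder whose output is fed back as one of its inputs. This is a sketch under stated assumptions, not the circuit from the slides: the function `pipelined_accumulate` is hypothetical, it accepts one new value per cycle, and the adder is modelled as a 5-entry queue of in-flight sums. The input row 1..6 is also an assumption; it was chosen because it leaves the partial sums 2, 3, 4, 5, 7 quoted on the slide.

```python
from collections import deque


def pipelined_accumulate(values, latency=5):
    """Model an accumulator built from an adder with `latency` pipeline stages.

    Each cycle, the sum leaving the pipeline is added to the new input and
    re-enters the pipeline. The result is `latency` interleaved partial
    sums rather than a single total.
    """
    pipe = deque([0] * latency)  # sums currently inside the adder pipeline
    for v in values:
        feedback = pipe.popleft()   # sum emerging from the last stage
        pipe.append(feedback + v)   # new sum enters the first stage
    return list(pipe)


# Hypothetical first row with six products, one arriving per cycle.
partials = pipelined_accumulate([1, 2, 3, 4, 5, 6])
print(partials)       # [2, 3, 4, 5, 7]
print(sum(partials))  # 21
```

The row total (21) exists only as the sum of the five values still in the pipeline, which is why a further reduction stage is required after the accumulator.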

7 Design for Floating Points - Adder Tree -
[Figure: adder tree using pipelined adders. Inputs r0 to r3 arrive through FIFO 0 to FIFO 3; Wen and Row ID pass through shifters; outputs Dout, Wen and Row go to the result BRAM]
[Figure: data flow for adder tree, Level 0 to Level 3]
* Final values are automatically captured by the wen signal
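As a software analogy for the adder tree, the partial sums left in an accumulator pipeline can be reduced pairwise, level by level. The sketch below is an illustration only: `adder_tree_sum` is a hypothetical helper, it assumes a non-empty input, and it ignores the FIFOs, shifters and wen signalling shown in the figure.

```python
def adder_tree_sum(partials):
    """Reduce a non-empty list of partial sums pairwise.

    Returns the total and the number of tree levels used. An odd element
    at any level is passed through to the next level unchanged.
    """
    level = list(partials)
    depth = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


# Partial sums quoted on the accumulation-circuit slide for the first row.
total, depth = adder_tree_sum([2, 3, 4, 5, 7])
print(total, depth)  # 21 3
```

In hardware every level of such a tree is a separate pipelined adder, which is the cost the reduced summation circuit aims to avoid.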

8 Design for Floating Points - Summation Circuit -
[Figure: reduced summation circuit. Result controller with Wen and Row ID passed through shifters; outputs Dout, Wen and Row]
[Figure: data flow for reduced summation circuit, Level 0 to Level 3]
* Buffers are used in place of expensive adders
* Lower throughput
* Longer latency

9 Design for Floating Points - Accumulation Circuit -
Comparison of adder tree and summation circuit
Design / Number of Adders / Latency: Adder Tree, Summation Circuit 4 55
Trade-off labels on the slide: Lower Cost, Higher Performance

10 Implementation Results
Characteristics (8 PEs) on XC2VP70-7

Design                  64-bit Integer   Single FP    Double FP
Achievable frequency    175 MHz          200 MHz      165 MHz
Slices                  8282 (25%)       (31%)        (72%)
BRAMs                   36 (10%)         50 (15%)     92 (28%)
MULT18X18               128 (39%)        32 (9%)      128 (39%)

64-bit vs. 32/64-bit mixed integer design

Design                  32/64-bit Mixed  64-bit
                        32 bit 32 bit X 64 bit
Achievable frequency    183 MHz          175 MHz
Slices                  3475 (10%)       8282 (25%)
BRAMs                   20 (6%)          36 (10%)
MULT18X18               32 (9%)          128 (39%)
Multiplier latency      4 cycles         6 cycles
Required I/O bandwidth  8.8 GB/s         14 GB/s

11 Performance Modeling
Required data movements: $n_{IO}$, given in terms of $n_{nz}$, $p$, $N$ and $M$
Required floating-point operations: $n_{flop} = 2 n_{nz}$
PEs' floating-point capability: $F = 2 \, n_{pe} \cdot freq$
Computation time: $T_{comp} = \dfrac{\text{Total floating-point operations}}{F} = \dfrac{2 n_{nz}}{F}$

12 Performance Modeling
Execution time:
$T = \max(T_{comp}, T_{IO}) + T_{init} + T_{syn} + T_{overhead} = \max\left(\dfrac{2 n_{nz}}{F}, \dfrac{n_{nz} \, (val_{width} + col_{width})}{B_{IO}}\right) + T_{init} + T_{syn} + T_{overhead}$
Performance bounded by I/O bandwidth for the double-precision floating-point design:
$F = \dfrac{2 n_{nz}}{T} \le \dfrac{2 n_{nz}}{10 n_{nz} / B} = \dfrac{B}{5}$
Maximum PEs allowed by I/O bandwidth:
$N_{PE} = \dfrac{B / 5}{2 \cdot frequency} = \dfrac{B}{10 \cdot frequency}$
Cray XD1: 1 PE is needed, 200 Mflops peak performance!
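The I/O-bound part of this model is easy to evaluate numerically. The sketch below is not from the slides; the function names are hypothetical. It assumes each nonzero costs 10 bytes of input for the double-precision and 64-bit designs (an 8-byte value plus a 2-byte column index, inferred from the 10 n_nz / B term) and 6 bytes for the 32/64-bit mixed design (an inference from the 8.8 GB/s figure). Under those assumptions it reproduces the "Required I/O Bandwidth" entries in the implementation table.

```python
def required_bandwidth(n_pe, freq_hz, bytes_per_nonzero):
    """Bytes per second needed to feed n_pe PEs one nonzero per cycle each."""
    return n_pe * freq_hz * bytes_per_nonzero


def io_bound_flops(bandwidth, bytes_per_nonzero=10):
    """Upper bound on flop/s: two operations per nonzero delivered (B/5 for 10 bytes)."""
    return 2 * bandwidth / bytes_per_nonzero


def max_pes(bandwidth, freq_hz, bytes_per_nonzero=10):
    """Largest PE count the bandwidth can keep busy: B / (10 * frequency) for 10 bytes."""
    return bandwidth / (bytes_per_nonzero * freq_hz)


def execution_time(n_nz, flops, bandwidth, bytes_per_nonzero=10, t_fixed=0.0):
    """T = max(T_comp, T_IO) plus fixed init/sync/overhead time in seconds."""
    t_comp = 2 * n_nz / flops
    t_io = n_nz * bytes_per_nonzero / bandwidth
    return max(t_comp, t_io) + t_fixed


# 8 PEs at the frequencies reported in the implementation table.
bw_64bit = required_bandwidth(8, 175_000_000, 10)  # 14.0e9 bytes/s
bw_mixed = required_bandwidth(8, 183_000_000, 6)   # 8.784e9 bytes/s
print(bw_64bit / 1e9, bw_mixed / 1e9)

# With exactly that bandwidth, the I/O bound equals the compute peak 2 * n_pe * freq.
print(io_bound_flops(bw_64bit), 2 * 8 * 175_000_000)
print(max_pes(bw_64bit, 175_000_000))
```

When the available bandwidth is below `required_bandwidth`, `execution_time` is dominated by the I/O term, which matches the deck's conclusion that performance is limited by I/O bandwidth.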

13 Test Matrices
Columns: ID, Matrix, Area, Size (N), Nonzeros (N_nz), Sparsity (%)

ID  Matrix    Area                    Size (N)
1   Crystk02  FEM Crystal
2   Crystk03  FEM Crystal
3   stat96v1  Linear programming      5995 x
4   nasasrb   Structure analysis
5   raefsky4  Buckling problem
6   ex11      3D steady flow
7   rim       FEM fluid mechanics
8   goodwin   FEM fluid mechanics
9   dbic1     Linear programming
10  rail4284  Railways

* All of these matrices come from Tim Davis's University of Florida matrix collection

14 Simulation Results for XC2VP70
[Figure: overhead percentage for each test matrix]
[Figure: percentage of achievable performance for each test matrix]

15 Kernel Performance
[Figure: speedup over a 2.8 GHz Pentium 4 for each test matrix]
* Clock-cycle-accurate simulation results!
* Depends less on sparse structure!

16 Conclusions
- Architecture and performance modeling for SpMxV on FPGAs
- Big performance difference for different data formats
- Up to 20X speedup over a 2.8 GHz Pentium 4 for the XC2VP70
- Depends less on sparse structure than CPUs
- Performance limited by I/O bandwidth

17 Acknowledgement
This project is supported by the University of Tennessee Science Alliance and the ORNL Laboratory Director's Research and Development program. We would also like to thank Richard Barrett of ORNL for useful discussions on sparse matrices.

18 Questions? Thanks!

A Deep Convolutional Neural Network Based on Nested Residue Number System

A Deep Convolutional Neural Network Based on Nested Residue Number System A Deep Convolutional Neural Network Based on Nested Residue Number System Hiroki Nakahara Tsutomu Sasao Ehime University, Japan Meiji University, Japan Outline Background Deep convolutional neural network

More information

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using

More information

ALU A functional unit

ALU A functional unit ALU A functional unit that performs arithmetic operations such as ADD, SUB, MPY logical operations such as AND, OR, XOR, NOT on given data types: 8-,16-,32-, or 64-bit values A n-1 A n-2... A 1 A 0 B n-1

More information

COVER SHEET: Problem#: Points

COVER SHEET: Problem#: Points EEL 4712 Midterm 3 Spring 2017 VERSION 1 Name: UFID: Sign here to give permission for your test to be returned in class, where others might see your score: IMPORTANT: Please be neat and write (or draw)

More information

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating linear algebra computations with hybrid GPU-multicore systems. Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)

More information

Transposition Mechanism for Sparse Matrices on Vector Processors

Transposition Mechanism for Sparse Matrices on Vector Processors Transposition Mechanism for Sparse Matrices on Vector Processors Pyrrhos Stathis Stamatis Vassiliadis Sorin Cotofana Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands

More information

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks Yufei Ma, Yu Cao, Sarma Vrudhula,

More information

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 9. Datapath Design Jacob Abraham Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 October 2, 2017 ECE Department, University of Texas at Austin

More information

Novel Devices and Circuits for Computing

Novel Devices and Circuits for Computing Novel Devices and Circuits for Computing UCSB 594BB Winter 2013 Lecture 4: Resistive switching: Logic Class Outline Material Implication logic Stochastic computing Reconfigurable logic Material Implication

More information

Simple sparse matrices we have seen so far include diagonal matrices and tridiagonal matrices, but these are not the only ones.

Simple sparse matrices we have seen so far include diagonal matrices and tridiagonal matrices, but these are not the only ones. A matrix is sparse if most of its entries are zero. Simple sparse matrices we have seen so far include diagonal matrices and tridiagonal matrices, but these are not the only ones. In fact sparse matrices

More information

FPGA Implementation of a Predictive Controller

FPGA Implementation of a Predictive Controller FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan

More information

ERLANGEN REGIONAL COMPUTING CENTER

ERLANGEN REGIONAL COMPUTING CENTER ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,

More information

A Mathematical Solution to. by Utilizing Soft Edge Flip Flops

A Mathematical Solution to. by Utilizing Soft Edge Flip Flops A Mathematical Solution to Power Optimal Pipeline Design by Utilizing Soft Edge Flip Flops M. Ghasemazar, B. Amelifard, M. Pedram University of Southern California Department of Electrical Engineering

More information

CS 700: Quantitative Methods & Experimental Design in Computer Science

CS 700: Quantitative Methods & Experimental Design in Computer Science CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,

More information

ICS 233 Computer Architecture & Assembly Language

ICS 233 Computer Architecture & Assembly Language ICS 233 Computer Architecture & Assembly Language Assignment 6 Solution 1. Identify all of the RAW data dependencies in the following code. Which dependencies are data hazards that will be resolved by

More information

Cost/Performance Tradeoffs:

Cost/Performance Tradeoffs: Cost/Performance Tradeoffs: a case study Digital Systems Architecture I. L10 - Multipliers 1 Binary Multiplication x a b n bits n bits EASY PROBLEM: design combinational circuit to multiply tiny (1-, 2-,

More information

Proposal to Improve Data Format Conversions for a Hybrid Number System Processor

Proposal to Improve Data Format Conversions for a Hybrid Number System Processor Proposal to Improve Data Format Conversions for a Hybrid Number System Processor LUCIAN JURCA, DANIEL-IOAN CURIAC, AUREL GONTEAN, FLORIN ALEXA Department of Applied Electronics, Department of Automation

More information

Measurement & Performance

Measurement & Performance Measurement & Performance Timers Performance measures Time-based metrics Rate-based metrics Benchmarking Amdahl s law Topics 2 Page The Nature of Time real (i.e. wall clock) time = User Time: time spent

More information

Measurement & Performance

Measurement & Performance Measurement & Performance Topics Timers Performance measures Time-based metrics Rate-based metrics Benchmarking Amdahl s law 2 The Nature of Time real (i.e. wall clock) time = User Time: time spent executing

More information

Table 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): )

Table 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): ) ENHANCING PERFORMANCE OF TALL-SKINNY QR FACTORIZATION USING FPGAS Abid Rafique, Nachiket Kapre and George A. Constantinides Electrical and Electronic Engineering Department Imperial College London London,

More information

Exploiting In-Memory Processing Capabilities for Density Functional Theory Applications

Exploiting In-Memory Processing Capabilities for Density Functional Theory Applications Exploiting In-Memory Processing Capabilities for Density Functional Theory Applications 2016 Aug 23 P. F. Baumeister, T. Hater, D. Pleiter H. Boettiger, T. Maurer, J. R. Brunheroto Contributors IBM R&D

More information

A Gray Code Based Time-to-Digital Converter Architecture and its FPGA Implementation

A Gray Code Based Time-to-Digital Converter Architecture and its FPGA Implementation A Gray Code Based Time-to-Digital Converter Architecture and its FPGA Implementation Congbing Li Haruo Kobayashi Gunma University Gunma University Kobayashi Lab Outline Research Objective & Background

More information

A Simple Architectural Enhancement for Fast and Flexible Elliptic Curve Cryptography over Binary Finite Fields GF(2 m )

A Simple Architectural Enhancement for Fast and Flexible Elliptic Curve Cryptography over Binary Finite Fields GF(2 m ) A Simple Architectural Enhancement for Fast and Flexible Elliptic Curve Cryptography over Binary Finite Fields GF(2 m ) Stefan Tillich, Johann Großschädl Institute for Applied Information Processing and

More information

EECS150 - Digital Design Lecture 21 - Design Blocks

EECS150 - Digital Design Lecture 21 - Design Blocks EECS150 - Digital Design Lecture 21 - Design Blocks April 3, 2012 John Wawrzynek Spring 2012 EECS150 - Lec21-db3 Page 1 Fixed Shifters / Rotators fixed shifters hardwire the shift amount into the circuit.

More information

Hardware Design I Chap. 4 Representative combinational logic

Hardware Design I Chap. 4 Representative combinational logic Hardware Design I Chap. 4 Representative combinational logic E-mail: shimada@is.naist.jp Already optimized circuits There are many optimized circuits which are well used You can reduce your design workload

More information

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.

More information

Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier

Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier Espen Stenersen Master of Science in Electronics Submission date: June 2008 Supervisor: Per Gunnar Kjeldsberg, IET Co-supervisor: Torstein

More information

What s the Deal? MULTIPLICATION. Time to multiply

What s the Deal? MULTIPLICATION. Time to multiply What s the Deal? MULTIPLICATION Time to multiply Multiplying two numbers requires a multiply Luckily, in binary that s just an AND gate! 0*0=0, 0*1=0, 1*0=0, 1*1=1 Generate a bunch of partial products

More information

A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method

A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method Jee Choi 1, Aparna Chandramowlishwaran 3, Kamesh Madduri 4, and Richard Vuduc 2 1 ECE, Georgia Tech 2 CSE, Georgia

More information

Molecular Dynamics Simulations

Molecular Dynamics Simulations MDGRAPE-3 chip: A 165- Gflops application-specific LSI for Molecular Dynamics Simulations Makoto Taiji High-Performance Biocomputing Research Team Genomic Sciences Center, RIKEN Molecular Dynamics Simulations

More information

CS470: Computer Architecture. AMD Quad Core

CS470: Computer Architecture. AMD Quad Core CS470: Computer Architecture Yashwant K. Malaiya, Professor malaiya@cs.colostate.edu AMD Quad Core 1 Architecture Layers Building blocks Gates, flip-flops Functional bocks: Combinational, Sequential Instruction

More information

CMPEN 411 VLSI Digital Circuits Spring Lecture 21: Shifters, Decoders, Muxes

CMPEN 411 VLSI Digital Circuits Spring Lecture 21: Shifters, Decoders, Muxes CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 21: Shifters, Decoders, Muxes [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN

More information

Radiation Induced Multi bit Upsets in Xilinx SRAM Based FPGAs

Radiation Induced Multi bit Upsets in Xilinx SRAM Based FPGAs LA-UR-05-6725 Radiation Induced Multi bit Upsets in Xilinx SRAM Based FPGAs Heather Quinn, Paul Graham, Jim Krone, and Michael Caffrey Los Alamos National Laboratory Sana Rezgui and Carl Carmichael Xilinx

More information

Menu. 7-Segment LED. Misc. 7-Segment LED MSI Components >MUX >Adders Memory Devices >D-FF, RAM, ROM Computer/Microprocessor >GCPU

Menu. 7-Segment LED. Misc. 7-Segment LED MSI Components >MUX >Adders Memory Devices >D-FF, RAM, ROM Computer/Microprocessor >GCPU Menu 7-Segment LED MSI Components >MUX >Adders Memory Devices >D-FF, RAM, ROM Computer/Microprocessor >GCPU Look into my... 1 7-Segment LED a b c h GND c g b d f a e h Show 7-segment LED in LogicWorks,

More information

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2) INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder

More information

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs Article Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs E. George Walters III Department of Electrical and Computer Engineering, Penn State Erie,

More information

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic. Digital Integrated Circuits A Design Perspective Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Arithmetic Circuits January, 2003 1 A Generic Digital Processor MEM ORY INPUT-OUTPUT CONTROL DATAPATH

More information

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse

More information

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic. Digital Integrated Circuits A Design Perspective Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Arithmetic Circuits January, 2003 1 A Generic Digital Processor MEMORY INPUT-OUTPUT CONTROL DATAPATH

More information

Today. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of

Today. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of ESE532: System-on-a-Chip Architecture Day 20: November 8, 2017 Energy Today Energy Today s bottleneck What drives Efficiency of Processors, FPGAs, accelerators How does parallelism impact energy? 1 2 Message

More information

Outline. policies for the first part. with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014

Outline. policies for the first part. with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014 Outline 1 midterm exam on Friday 11 July 2014 policies for the first part 2 questions with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014 Intro

More information

Skew-Tolerant Circuit Design

Skew-Tolerant Circuit Design Skew-Tolerant Circuit Design David Harris David_Harris@hmc.edu December, 2000 Harvey Mudd College Claremont, CA Outline Introduction Skew-Tolerant Circuits Traditional Domino Circuits Skew-Tolerant Domino

More information

A FPGA Implementation of Large Restricted Boltzmann Machines. Charles Lo. Supervisor: Paul Chow April 2010

A FPGA Implementation of Large Restricted Boltzmann Machines. Charles Lo. Supervisor: Paul Chow April 2010 A FPGA Implementation of Large Restricted Boltzmann Machines by Charles Lo Supervisor: Paul Chow April 2010 Abstract A FPGA Implementation of Large Restricted Boltzmann Machines Charles Lo Engineering

More information

Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI

Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI Charles Lo and Paul Chow {locharl1, pc}@eecg.toronto.edu Department of Electrical and Computer Engineering

More information

Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System

Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System G.Suresh, G.Indira Devi, P.Pavankumar Abstract The use of the improved table look up Residue Number System

More information

ab initio Electronic Structure Calculations

ab initio Electronic Structure Calculations ab initio Electronic Structure Calculations New scalability frontiers using the BG/L Supercomputer C. Bekas, A. Curioni and W. Andreoni IBM, Zurich Research Laboratory Rueschlikon 8803, Switzerland ab

More information

Communication avoiding parallel algorithms for dense matrix factorizations

Communication avoiding parallel algorithms for dense matrix factorizations Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013

More information

A High Throughput FPGA-Based Implementation of the Lanczos Method for the Symmetric Extremal Eigenvalue Problem

A High Throughput FPGA-Based Implementation of the Lanczos Method for the Symmetric Extremal Eigenvalue Problem A High Throughput FPGA-Based Implementation of the Lanczos Method for the Symmetric Extremal Eigenvalue Problem Abid Rafique, Nachiket Kapre, and George A. Constantinides Electrical and Electronic Engineering,

More information

BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power

BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power James C. Hoe Department of ECE Carnegie Mellon niversity Eric S. Chung, et al., Single chip Heterogeneous Computing:

More information

Overview: Parallelisation via Pipelining

Overview: Parallelisation via Pipelining Overview: Parallelisation via Pipelining three type of pipelines adding numbers (type ) performance analysis of pipelines insertion sort (type ) linear system back substitution (type ) Ref: chapter : Wilkinson

More information

Efficient Implementation of High- Energy Physics Processing Using Modified Jet Reconstruction and Active Size Partitioning

Efficient Implementation of High- Energy Physics Processing Using Modified Jet Reconstruction and Active Size Partitioning Efficient Implementation of High- Energy Physics Processing Using Modified Jet Reconstruction and Active Size Partitioning Tony Gregerson University of Wisconsin Tony Gregerson, U. Wisconsin, 13 April

More information

An Approximate Parallel Multiplier with Deterministic Errors for Ultra-High Speed Integrated Optical Circuits

An Approximate Parallel Multiplier with Deterministic Errors for Ultra-High Speed Integrated Optical Circuits An Approximate Parallel Multiplier with Deterministic Errors for Ultra-High Speed Integrated Optical Circuits Jun Shiomi 1, Tohru Ishihara 1, Hidetoshi Onodera 1, Akihiko Shinya 2, Masaya Notomi 2 1 Graduate

More information

Proposal to Improve Data Format Conversions for a Hybrid Number System Processor

Proposal to Improve Data Format Conversions for a Hybrid Number System Processor Proceedings of the 11th WSEAS International Conference on COMPUTERS, Agios Nikolaos, Crete Island, Greece, July 6-8, 007 653 Proposal to Improve Data Format Conversions for a Hybrid Number System Processor

More information

Parallel Multipliers. Dr. Shoab Khan

Parallel Multipliers. Dr. Shoab Khan Parallel Multipliers Dr. Shoab Khan String Property 7=111=8-1=1001 31= 1 1 1 1 1 =32-1 Or 1 0 0 0 0 1=32-1=31 Replace string of 1s in multiplier with In a string when ever we have the least significant

More information

Combinatorial Logic Design Multiplexers and ALUs CS 64: Computer Organization and Design Logic Lecture #13

Combinatorial Logic Design Multiplexers and ALUs CS 64: Computer Organization and Design Logic Lecture #13 Combinatorial Logic Design Multiplexers and ALUs CS 64: Computer Organization and Design Logic Lecture #13 Ziad Matni Dept. of Computer Science, UCSB Administrative Re: Midterm Exam #2 Graded! 5/22/18

More information

Integer Factorisation on the AP1000

Integer Factorisation on the AP1000 Integer Factorisation on the AP000 Craig Eldershaw Mathematics Department University of Queensland St Lucia Queensland 07 cs9@student.uq.edu.au Richard P. Brent Computer Sciences Laboratory Australian

More information

EECS150 - Digital Design Lecture 11 - Shifters & Counters. Register Summary

EECS150 - Digital Design Lecture 11 - Shifters & Counters. Register Summary EECS50 - Digital Design Lecture - Shifters & Counters February 24, 2003 John Wawrzynek Spring 2005 EECS50 - Lec-counters Page Register Summary All registers (this semester) based on Flip-flops: q 3 q 2

More information

L16: Power Dissipation in Digital Systems. L16: Spring 2007 Introductory Digital Systems Laboratory

L16: Power Dissipation in Digital Systems. L16: Spring 2007 Introductory Digital Systems Laboratory L16: Power Dissipation in Digital Systems 1 Problem #1: Power Dissipation/Heat Power (Watts) 100000 10000 1000 100 10 1 0.1 4004 80088080 8085 808686 386 486 Pentium proc 18KW 5KW 1.5KW 500W 1971 1974

More information

Lecture 2: Metrics to Evaluate Systems

Lecture 2: Metrics to Evaluate Systems Lecture 2: Metrics to Evaluate Systems Topics: Metrics: power, reliability, cost, benchmark suites, performance equation, summarizing performance with AM, GM, HM Sign up for the class mailing list! Video

More information

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO TIM DAVIS, PROFESSOR, CSE, TEXAS

More information

Spiral 2-1. Datapath Components: Counters Adders Design Example: Crosswalk Controller

Spiral 2-1. Datapath Components: Counters Adders Design Example: Crosswalk Controller 2-. piral 2- Datapath Components: Counters s Design Example: Crosswalk Controller 2-.2 piral Content Mapping piral Theory Combinational Design equential Design ystem Level Design Implementation and Tools

More information

Scalable and Power-Efficient Data Mining Kernels

Scalable and Power-Efficient Data Mining Kernels Scalable and Power-Efficient Data Mining Kernels Alok Choudhary, John G. Searle Professor Dept. of Electrical Engineering and Computer Science and Professor, Kellogg School of Management Director of the

More information

Parallel Sparse Matrix Vector Multiplication (PSC 4.3)

Parallel Sparse Matrix Vector Multiplication (PSC 4.3) Parallel Sparse Matrix Vector Multiplication (PSC 4.) original slides by Rob Bisseling, Universiteit Utrecht, accompanying the textbook Parallel Scientific Computation adapted for the lecture HPC Algorithms

More information

Logic BIST. Sungho Kang Yonsei University

Logic BIST. Sungho Kang Yonsei University Logic BIST Sungho Kang Yonsei University Outline Introduction Basics Issues Weighted Random Pattern Generation BIST Architectures Deterministic BIST Conclusion 2 Built In Self Test Test/ Normal Input Pattern

More information

Goals for Performance Lecture

Goals for Performance Lecture Goals for Performance Lecture Understand performance, speedup, throughput, latency Relationship between cycle time, cycles/instruction (CPI), number of instructions (the performance equation) Amdahl s

More information

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu Performance Metrics for Computer Systems CASS 2018 Lavanya Ramapantulu Eight Great Ideas in Computer Architecture Design for Moore s Law Use abstraction to simplify design Make the common case fast Performance

More information

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation
