NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

Similar documents
Introduction The Nature of High-Performance Computation

Retiming. delay elements in a circuit without affecting the input/output characteristics of the circuit.

Processor Design & ALU Design

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

Instruction Set Extensions for Reed-Solomon Encoding and Decoding

CMP N 301 Computer Architecture. Appendix C

Numbering Systems. Computational Platforms. Scaling and Round-off Noise. Special Purpose. here that is dedicated architecture

CS470: Computer Architecture. AMD Quad Core

CSCI-564 Advanced Computer Architecture

COVER SHEET: Problem#: Points

Enrico Nardelli Logic Circuits and Computer Architecture

DHANALAKSHMI COLLEGE OF ENGINEERING DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING EC2314- DIGITAL SIGNAL PROCESSING UNIT I INTRODUCTION PART A

CMPEN 411 VLSI Digital Circuits Spring Lecture 21: Shifters, Decoders, Muxes

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018

VLSI Signal Processing

Adders, subtractors comparators, multipliers and other ALU elements

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

CS 700: Quantitative Methods & Experimental Design in Computer Science

CMP 338: Third Class

Lecture: Pipelining Basics

CMP 334: Seventh Class

Chapter 7. Sequential Circuits Registers, Counters, RAM

DSP Configurations. responded with: thus the system function for this filter would be

Project Two RISC Processor Implementation ECE 485

From Sequential Circuits to Real Computers

Adders, subtractors comparators, multipliers and other ALU elements

DIGITAL TECHNICS. Dr. Bálint Pődör. Óbuda University, Microelectronics and Technology Institute

ALU A functional unit

EECS 579: Logic and Fault Simulation. Simulation

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design

Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

Introduction to CMOS VLSI Design Lecture 1: Introduction

CprE 281: Digital Logic

Logic and Computer Design Fundamentals. Chapter 8 Sequencing and Control

A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

CSE370: Introduction to Digital Design

Design of Sequential Circuits

Lecture 13: Sequential Circuits, FSM

ww.padasalai.net

ENEE350 Lecture Notes-Weeks 14 and 15

Question Bank. UNIT 1 Part-A

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

ECE290 Fall 2012 Lecture 22. Dr. Zbigniew Kalbarczyk

CPSC 3300 Spring 2017 Exam 2

CSE140: Components and Design Techniques for Digital Systems. Decoders, adders, comparators, multipliers and other ALU elements. Tajana Simunic Rosing

System Data Bus (8-bit) Data Buffer. Internal Data Bus (8-bit) 8-bit register (R) 3-bit address 16-bit register pair (P) 2-bit address

EECS150 - Digital Design Lecture 21 - Design Blocks

ICS 233 Computer Architecture & Assembly Language

Literature Review on Multiplier Accumulation Unit by Using Hybrid Adder

Lecture 8: Sequential Multipliers

Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations!

Novel Devices and Circuits for Computing

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman.

Exploiting In-Memory Processing Capabilities for Density Functional Theory Applications

Verilog HDL:Digital Design and Modeling. Chapter 11. Additional Design Examples. Additional Figures

LOGIC CIRCUITS. Basic Experiment and Design of Electronics. Ho Kyung Kim, Ph.D.

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks

Compiling Techniques

Pipelining and Parallel Processing

Special Nodes for Interface

DFT & Fast Fourier Transform PART-A. 7. Calculate the number of multiplications needed in the calculation of DFT and FFT with 64 point sequence.

EECS150 - Digital Design Lecture 11 - Shifters & Counters. Register Summary

L07-L09 recap: Fundamental lesson(s)!

Vector Lane Threading

ALU (3) - Division Algorithms

2 k Factorial Designs Raj Jain

ECE 407 Computer Aided Design for Electronic Systems. Simulation. Instructor: Maria K. Michael. Overview

Logic. Basic Logic Functions. Switches in series (AND) Truth Tables. Switches in Parallel (OR) Alternative view for OR

Today. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of

Mark Redekopp, All rights reserved. Lecture 1 Slides. Intro Number Systems Logic Functions

CPS 104 Computer Organization and Programming Lecture 11: Gates, Buses, Latches. Robert Wagner

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017

[2] Predicting the direction of a branch is not enough. What else is necessary?

Computer Engineering Department. CC 311- Computer Architecture. Chapter 4. The Processor: Datapath and Control. Single Cycle

Proposal to Improve Data Format Conversions for a Hybrid Number System Processor

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Performance, Power & Energy

The Digital Logic Level

EE 354 Fall 2013 Lecture 10 The Sampling Process and Evaluation of Difference Equations

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Goals for Performance Lecture

LOGIC CIRCUITS. Basic Experiment and Design of Electronics

From Sequential Circuits to Real Computers

Chapter 4. Combinational: Circuits with logic gates whose outputs depend on the present combination of the inputs. elements. Dr.

Logic BIST. Sungho Kang Yonsei University

Fall 2008 CSE Qualifying Exam. September 13, 2008

Marwan Burelle. Parallel and Concurrent Programming. Introduction and Foundation

Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator

Modeling Computation and Communication Performance of Parallel Scientific Applications: A Case Study of the IBM SP2

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications

Evaluating Overheads of Multi-bit Soft Error Protection Techniques at Hardware Level Sponsored by SRC and Freescale under SRC task number 2042

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements.

Models of Computation

Microprocessor Power Analysis by Labeled Simulation

HYPERCUBE ALGORITHMS FOR IMAGE PROCESSING AND PATTERN RECOGNITION SANJAY RANKA SARTAJ SAHNI Sanjay Ranka and Sartaj Sahni

Anton: A Specialized ASIC for Molecular Dynamics

Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System

Lecture 22 Chapters 3 Logic Circuits Part 1

R13 SET - 1

Transcription:

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using highly pipelined execution of simple instructions and efficient compilers. CISC Complex-instruction-set NCU EE -- DSP VLSI Design. Tsung-Han Tsai 2

A fast on-chip multiplier An instruction cycle is generally one or two clock cycles long. Several functional units that perform several parallel operations. The functional units typically include ALU, two or more address generator, etc. Several on-chip memory units used to store instructions, data, or look-up table. Large numbers of temporary registers are used of store values used for long periods of time. Several on-chip system buses to increase memory bandwidth. Support for special addressing modes, especially modulo and bit-reversed addressing. NCU EE -- DSP VLSI Design. Tsung-Han Tsai 3

Classification Von Neumann architecture Havard architecture TMS32010 Each MAC requires at least two cycles since the two multiplier operands must be fetched sequentially The performance is limited by the single bus 5 MOPs, i.e., 200ns instruction cycle time TMS320C25 & C50 Single-cycle MAC Harvard architecture is used C50 is an enhanced version of C25 with twice throughput NCU EE -- DSP VLSI Design. Tsung-Han Tsai 4

(cont d) TMS320C30 Multi-bus 320-bit floating-point DSP It runs at 50MHz and can perform up to eight operations per instruction cycle, which corresponds to 200 MFLOPS(clock is divides by 2 internally) Motorola DSP 56001 and 56002 Triple-bus Harvard architecture Two clock cycles are required per instruction It runs at 40MHz, which corresponds to 50ns instruction cycle time. 56002 is an enhanced, software compatible, version of 56001 56002 can perform up to six operations simultaneously, which corresponds to 120MOPS@40MHz.(VLIW) VLIW architecture A parallel architecture that uses multiple, independent functional units and packages multiple operations into one very long instruction. NCU EE -- DSP VLSI Design. Tsung-Han Tsai 5

(cont d) Motorola 96001 and 96002 A floating-point version of the 56001 Ideal DSP architecture Processing elements Storage elements Interconnection networks Control NCU EE -- DSP VLSI Design. Tsung-Han Tsai 6

Multi-processors (cont d) Tightly coupled system: all processor share the same common main memory Loosely coupled system: the main memory is distributed among the processors, although the processors shares the same memory space Multi-computers Memory Interconnection Network PE Interconnection Network PE PE PE PE M PE M PE M PE M Multi-processor Architure Multi-computer Architure NCU EE -- DSP VLSI Design. Tsung-Han Tsai 7

Message-based architectures SISD: classical serial computer MISD: impractical (cont d) SIMD: suitable for applications with high parallelism, such as image processing MIMD: complex design Interconnection topologies Crossbar: complete interconnection between all processor induce high cost Single-bus: inefficient Packet switching vs. circuit switching NCU EE -- DSP VLSI Design. Tsung-Han Tsai 8

(cont d) Shared-memory architecture Memory bandwidth bottleneck Imbalance between computation capacity and communication bandwidth T pe 2NT m??? Reducing the memory cycle time Interleaving memory Cycle time is reduced by a factor of K. if choosing K=2N, a good balance is obtained. Reducing communication Broadcasting All PE operate on the same input data Hence, only one memory read cycle is needed Interprocessor communication Cache memory NCU EE -- DSP VLSI Design. Tsung-Han Tsai 9

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 10

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 11

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 12

S/N contribution 1 bit 6dB Ex: 12 bits 72dB 16 bits 96dB NCU EE -- DSP VLSI Design. Tsung-Han Tsai 13

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 14

(cont d) NCU EE -- DSP VLSI Design. Tsung-Han Tsai 15

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 16

(cont d) NCU EE -- DSP VLSI Design. Tsung-Han Tsai 17

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 18

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 19

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 20

Immediate: The operand is part of the instruction Direct: A memory address is part of the instruction Indirect: An auxiliary register contains the memory address NCU EE -- DSP VLSI Design. Tsung-Han Tsai 21

Modulo mode: Used to implement circular buffers, queues, or delay lines, such as those in FIR filters Indexed: An index is added to an address before the memory access Bit reversed: Used to compute FFTs NCU EE -- DSP VLSI Design. Tsung-Han Tsai 22

Is it there? Is it visible? Reservation tables Time stationary vs. Data stationary code vs. interlocking NCU EE -- DSP VLSI Design. Tsung-Han Tsai 23

3 distinct programming models for pipelining: Interlocking Time stationary coding Data stationary coding NCU EE -- DSP VLSI Design. Tsung-Han Tsai 24

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 25

There may not be sufficient time between instruction fetches to decode a branch instruction before the next instruction is fetched If the program address space is large, the destination address may not fit in an instruction word, so a second fetch from the instruction memory may be required. Alternatives are paging or PC-relative addressing In the case of conditional branching, the fetch of the next instruction cannot occur before the condition codes in the ALU can be tested NCU EE -- DSP VLSI Design. Tsung-Han Tsai 26

Multi-cycle branch instructions Delayed branches Low overhead looping NCU EE -- DSP VLSI Design. Tsung-Han Tsai 27

Use a disjoint set of registers for each task The control structure of the tasks is tied together however NCU EE -- DSP VLSI Design. Tsung-Han Tsai 28

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 29

(cont d) Programmer (or compiler) sees multiple parallel processors that share memory without contention The assembly language is unaffected by the amount of pipelining It would help to have software to automate task partitioning and synchronization NCU EE -- DSP VLSI Design. Tsung-Han Tsai 30

Support mechanisms currently in place DMA (use excess memory cycles) Hardware controlled wait states (allows for transparent contention resolution) Bit test-and-set operations (allows shared data structures) Dual external busses (allows private and shared memory) Chip-level simulators (allows system simulation) NCU EE -- DSP VLSI Design. Tsung-Han Tsai 31

Cover the architecture design in block level Elements Multiplier / Accumulator ALU Shifter DAG Program Sequencer Memory Communication (DMA) NCU EE -- DSP VLSI Design. Tsung-Han Tsai 32

(cont d) System Selection & Design Hardware system consideration Selection Procedure Selecting Hardware High-level symbols design Component organization Bit-level design NCU EE -- DSP VLSI Design. Tsung-Han Tsai 33

General scheme of data-address generator NCU EE -- DSP VLSI Design. Tsung-Han Tsai 34

(cont d) The ADSP-1410 Data-Address Generator NCU EE -- DSP VLSI Design. Tsung-Han Tsai 35

Instruction sequencer and its relationship to other DSP system component NCU EE -- DSP VLSI Design. Tsung-Han Tsai 36

(cont d) Simple sequencer control structure A then B If W then A else B While W do A NCU EE -- DSP VLSI Design. Tsung-Han Tsai 37

(cont d) Functional organization of a powerful program sequencer (ADSP-2100) Next (program) - address Multiplexer NCU EE -- DSP VLSI Design. Tsung-Han Tsai 38

(cont d) Next address selection Condition logic NCU EE -- DSP VLSI Design. Tsung-Han Tsai 39

(cont d) A word-slice micro-programmable program sequencer (ADSP-1401) NCU EE -- DSP VLSI Design. Tsung-Han Tsai 40

Micro-coded system Overview NCU EE -- DSP VLSI Design. Tsung-Han Tsai 41

Micro-coded system (cont d) (cont d) Micro-code word broken up into sections, each an instruction to an individual component NCU EE -- DSP VLSI Design. Tsung-Han Tsai 42

(cont d) Fast FIR filter structure employing micro-programmed parallel computing architecture NCU EE -- DSP VLSI Design. Tsung-Han Tsai 43

(cont d) Design Example: FIR filter chip Hardware implementation of an FIR chip NCU EE -- DSP VLSI Design. Tsung-Han Tsai 44

Design Example: FIR filter chip (cont d) (cont d) FIR filter memory organization for coefficients and data NCU EE -- DSP VLSI Design. Tsung-Han Tsai 45

(cont d) Am29500 array processor, a high-performance microprogrammed complex signal processor NCU EE -- DSP VLSI Design. Tsung-Han Tsai 46

(cont d) Throughput implementation by custom paralleling AR = AR + BR WR - BI WI AI = AI + BI WR + BR WI BR = AR -BR WR + BI WI BI = AI - BI WR - BR WI The system requires Input of A and B, two complex or 4 real number inputs Four multiplications: BR WR, BI WI, BR WI, BI WR Four summations of the form (a±b±c), or six simple summations Scaling down by two to prevent overflow Output of A and B, two complex (4 real) outputs NCU EE -- DSP VLSI Design. Tsung-Han Tsai 47

(cont d) Effect on throughput and resource utilization of paralleled components in a FFT spectrum analyzer Allocation NCU EE -- DSP VLSI Design. Tsung-Han Tsai 48

(cont d) Scheduling NCU EE -- DSP VLSI Design. Tsung-Han Tsai 49

Block diagram of TMS32020 NCU EE -- DSP VLSI Design. Tsung-Han Tsai 50

(cont d) Arithmetic-logic-unit block diagram NCU EE -- DSP VLSI Design. Tsung-Han Tsai 51

(cont d) Multiplier-accumulator section of ADSP-2100 NCU EE -- DSP VLSI Design. Tsung-Han Tsai 52

(cont d) Block diagram of ADSP-2100 s shifter NCU EE -- DSP VLSI Design. Tsung-Han Tsai 53

(cont d) Data-Address-Generator block diagram NCU EE -- DSP VLSI Design. Tsung-Han Tsai 54

(cont d) Block diagram of ADSP-2100 s program sequencer NCU EE -- DSP VLSI Design. Tsung-Han Tsai 55