LRADNN: High-Throughput and Energy-Efficient Deep Neural Network Accelerator using Low Rank Approximation


LRADNN: High-Throughput and Energy-Efficient Deep Neural Network Accelerator using Low Rank Approximation. Jingyang Zhu (1), Zhiliang Qian (2), and Chi-Ying Tsui (1). (1) The Hong Kong University of Science and Technology, Hong Kong; (2) Shanghai Jiao Tong University, Shanghai, China. IEEE/ACM ASP-DAC 2016, 28 Jan. 2016, Macao.

Outline Introduction Related work Motivation of LRADNN Low rank approximation (LRA) predictor LRADNN: hardware architecture Experiment results Conclusion

Deep neural network (DNN) and hardware acceleration. Layer-wise organization with hierarchical feature extraction. [Figure: input layer, 1st hidden layer, 2nd hidden layer, output layer.]

Deep neural network (DNN) and hardware acceleration. Layer-wise organization with hierarchical feature extraction. Hardware acceleration platforms: CPU clusters (Google Brain), GPU clusters (AlexNet), ASIC (IBM TrueNorth).

Outline Introduction Related work Motivation of LRADNN Low rank approximation (LRA) predictor LRADNN: hardware architecture Experiment results Conclusion

Related work. DianNao: a general accelerator for DNNs, but with large energy consumption in memory access. AxNN: an energy-efficient accelerator that approximates resilient neurons, but it only reduces power in the datapath. [Figures: energy consumption breakdown in DianNao; AxNN datapath.]

Outline Introduction Related work Motivation of LRADNN Low rank approximation (LRA) predictor LRADNN: hardware architecture Experiment results Conclusion

Motivation: sparsity in DNN. There is intrinsic sparsity in DNNs (it avoids overfitting and gives better feature extraction). Sparsity: the proportion of inactive neurons (activation = 0). [Figure: sparsity in a DNN targeted for MNIST for the 1st to 4th hidden layers; y-axis from 86% to 95%.]

Motivation: sparsity in DNN (cont.). There is intrinsic sparsity in DNNs (it avoids overfitting and gives better feature extraction); sparsity is the proportion of inactive neurons (activation = 0). Conventional computation: $a_i^{l+1} = \sigma\left(\sum_{j=1}^{s_l} W_{ij}^{l}\, a_j^{l}\right)$ is evaluated for every output neuron $i \in [1, s_{l+1}]$, regardless of the activeness of the neurons. Arithmetic ops per neuron: $s_l$ multiplications, $s_l - 1$ additions, 1 nonlinear operation. Memory accesses per neuron: $2 s_l$ (weights and activations).
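
As a reference point, here is a minimal NumPy sketch of this conventional dense feedforward; the layer sizes and the ReLU nonlinearity are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def dense_layer(W, a_in, sigma=lambda x: np.maximum(x, 0.0)):
    """Conventional feedforward for one layer: every output neuron is
    evaluated, whether or not its activation ends up being zero.
    Per output neuron: s_l multiplies, s_l - 1 adds, 1 nonlinearity,
    and 2*s_l memory reads (one weight row plus the input activations)."""
    return sigma(W @ a_in)

# Illustrative sizes (s_l = 784 inputs, s_{l+1} = 1000 outputs).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(1000, 784))
a = rng.random(784)
a_next = dense_layer(W, a)
print("fraction of inactive neurons:", np.mean(a_next == 0.0))
```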

Motivation: bypass unnecessary operations. Dynamically bypass the inactive neurons. [Figure: inactive vs. active neurons between layer l-1 and layer l.] This calls for a dedicated predictor of neuron activeness that is both simple (few arithmetic operations involved) and accurate (small degradation in prediction accuracy).

Outline Introduction Related work Motivation of LRADNN Low rank approximation (LRA) predictor LRADNN: hardware architecture Experiment results Conclusion

Low rank approximation (LRA) predictor [1]. Approximate the synaptic weights $W$ by a low-rank matrix $\widetilde{W}$: $\min_{\widetilde{W}} \|W - \widetilde{W}\|_F$ s.t. $\mathrm{rank}(\widetilde{W}) = r$, where the rank $r$ is a hyper-parameter trading prediction accuracy against computation complexity. There is a nice closed-form solution based on the SVD of $W$: take $U = U_r$ and $V = \Sigma_r V_r^T$ (the top-$r$ singular vectors and values), so the predictor weights are $\widetilde{W} = UV$. [Figure: the LRA predictor (U, V) produces an activeness bit vector for layer l; only the corresponding rows of the full weight matrix W are then used between layer l-1 and layer l.]
[1] Davis A., Arel I. Low-Rank Approximations for Conditional Feedforward Computation in Deep Neural Networks. arXiv preprint arXiv:1312.4461, 2013.
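
A minimal NumPy sketch of building the rank-r predictor factors from the trained weights (variable names are my own; the truncated SVD is the closed-form solution mentioned above). The predictor costs roughly r(s_l + s_{l+1}) multiplies per layer instead of the s_l * s_{l+1} of the full product.

```python
import numpy as np

def lra_factors(W, r):
    """Best rank-r approximation of W (Frobenius norm) via truncated SVD:
    U = U_r, V = Sigma_r * V_r^T, so that U @ V approximates W."""
    Uw, S, Vt = np.linalg.svd(W, full_matrices=False)
    U = Uw[:, :r]
    V = np.diag(S[:r]) @ Vt[:r, :]
    return U, V

def predict_active(U, V, a_in):
    """Predict which output neurons will be active (pre-activation > 0)
    using the cheap rank-r product instead of the full weight matrix."""
    return (U @ (V @ a_in)) > 0
```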

Feedforward pass with LRA. The LRA predictor is incorporated into the offline training (backpropagation) to improve the prediction performance.
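
Putting the predictor and the bypass together, here is a sketch of the LRA feedforward for one layer (my own illustrative code, assuming a ReLU-style activation so that "inactive" means exactly zero): only the rows of W selected by the predictor are touched, which is what saves both arithmetic operations and weight-memory accesses.

```python
import numpy as np

def lra_feedforward_layer(W, U, V, a_in, sigma=lambda x: np.maximum(x, 0.0)):
    """Feedforward with the LRA predictor: run the cheap rank-r prediction
    first, then evaluate only the neurons predicted to be active."""
    active = (U @ (V @ a_in)) > 0               # rank-r activeness prediction
    a_out = np.zeros(W.shape[0])
    a_out[active] = sigma(W[active, :] @ a_in)  # full computation only where needed
    return a_out
```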

Outline Introduction Related work Motivation of LRADNN Low rank approximation (LRA) predictor LRADNN: hardware architecture Experiment results Conclusion

LRADNN: hardware architecture for the LRA predictor. Top view of the architecture: a 5-stage pipeline evaluating $a_i^{l+1} = \sigma\left(\sum_{j=1}^{s_l} W_{ij}^{l}\, a_j^{l}\right)$: (1) address calculation, (2) memory fetch, (3) multiplication (multiplier array), (4) addition (adder tree), (5) nonlinear operation and write-back. [Block diagram: FSM controller, memory address calculation, memory bank, activation address calculation, activation register bank.]

Status controller. Three calculation states: V computation, U computation, and W computation for the hidden layers, plus a W computation for the output layer (which needs no nonlinear operation). Data dependencies between the states stall the pipeline for 3 clock cycles. [State diagram: Idle, V Stall, U Stall, Hid W, Out W Stall; transition labels "V finishes", "U finishes", "hidden W finishes", "output W finishes", "only output layers".] The predictor is $p^{(l)} = U^{(l)} V^{(l)} z^{(l)}$ (thresholded at 0) and the next activations are $a^{(l+1)} = \mathrm{LRA}(p^{(l)}, a^{(l)})$. [Pipeline diagram: ADDR, MEM, MUL, ADD, NON/WB.]

Address calculation: memory organization. The memory word width equals the parallelism p of the accelerator, so the m x n weight matrix is stored row by row (W(1,:), W(2,:), ...), p weights per word. For a specified weight $W_{ij}$: $\mathrm{addr}(W_{ij}) = i \cdot \lceil n/p \rceil + \lfloor j/p \rfloor$, where $\lceil n/p \rceil$ is the number of word lines for one output neuron and $j_{blk} = \lfloor j/p \rfloor$ is the column block index.
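
A small sketch of this address mapping (my reconstruction of the formula above; indices are assumed to be 0-based):

```python
import math

def weight_address(i, j, n, p):
    """Word address of weight W[i][j] when an m x n weight matrix is packed
    row by row, p weights per memory word.
    words_per_neuron = ceil(n / p) word lines hold one output neuron's row;
    j_blk = j // p is the column block inside that row."""
    words_per_neuron = math.ceil(n / p)
    j_blk = j // p
    return i * words_per_neuron + j_blk

# Example: n = 784 inputs, parallelism p = 32 -> 25 word lines per neuron.
print(weight_address(i=2, j=70, n=784, p=32))  # 2 * 25 + 2 = 52
```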

Memory access: activation register banks. Two physical register banks (bank 0 and bank 1) are interleaved to form two logical registers: the input activations and the output activations. [Block diagram: the input-activation register bank feeds a dispatch mux array (offsets +0, +1, ..., +p-1, selection control $j_{blk} \cdot p$) that drives the block data out to the MULT stage.]

Memory access: activation register banks (cont.). Besides the output activations, the LRA scheme needs a small amount of extra register storage (about 5%): tmp1 = $V^{(l)} z^{(l)}$, tmp2 = $U^{(l)}\,$tmp1, and the predictor $p^{(l)} = (\text{tmp2} > 0)$. [Block diagram: output-activation register bank, read-out of activation i, write demux to the WB stage for accumulation.]

Logical variable | Physical location | Size (depth x width)
Tmp1             | Tmp1 reg          | rank x FP width
Tmp2             | Output act        | # acts x FP width
Predictor        | Predictor reg     | # acts x 1
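
A back-of-the-envelope sketch of the extra storage the table implies (my own helper; tmp2 reuses the existing output-activation bank, and the 20-bit internal word is an assumption based on the Q7.12 format listed later):

```python
def lra_register_overhead_bits(rank, n_acts, fp_width):
    """Extra register bits added by the LRA scheme: tmp1 holds the rank-r
    intermediate V*z (rank x FP width) and the predictor holds one bit per
    output activation (n_acts x 1); tmp2 reuses the output-activation bank."""
    return rank * fp_width + n_acts * 1

# Illustrative numbers: max rank 128, 1024 activations, 20-bit fixed point.
print(lra_register_overhead_bits(128, 1024, 20), "bits")
```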

Computational stages. Parallel multiplication: p multipliers, which determine the parallelism of the accelerator. Merging operation: an adder tree of depth $\log_2(p)$ (# adders at level 1: p >> 2). Nonlinear operation: ReLU and dropout; ReLU outputs either 0 or x, selected by the MSB (sign bit) of x.
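
A behavioural sketch of the merge and nonlinearity steps (a software stand-in only: a plain pairwise reduction and a sign-bit mux, not necessarily the exact adder arrangement or word width of the RTL):

```python
def adder_tree_sum(products):
    """Pairwise (tree) reduction of the p partial products, standing in for
    the log2(p)-deep hardware adder tree."""
    vals = list(products)
    while len(vals) > 1:
        pairs = [vals[k] + vals[k + 1] for k in range(0, len(vals) - 1, 2)]
        vals = pairs + ([vals[-1]] if len(vals) % 2 else [])
    return vals[0]

def relu_by_sign_bit(x, width=20):
    """ReLU on a two's-complement fixed-point value: the MSB (sign bit)
    selects between 0 and x, mirroring the mux in the NON/WB stage."""
    msb = (x >> (width - 1)) & 1
    return 0 if msb else x
```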

Active neurons search: behavior level. Priority-encoder-based search over the predictor bits; a search miss costs a 1-clock-cycle penalty. [Figure: priority encoder search with a search-miss example.]

Active neurons search: hardware level. Priority-encoder-based search over a scanning window of s predictor bits, with the higher priority assigned to the LSB (the neuron closest to the current index i). A fixed-priority arbiter turns the window's request bits (r_0 ... r_{s-1}) into a one-hot grant (g_0 ... g_{s-1}), and a one-hot-to-binary decoder converts the grant into the next active neuron index (e.g. i+2 in the figure); if no bit in the window is set, the search misses. [Figure: predictor bits, current neuron i, fixed-priority arbiter, decoder, search miss.]
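
A behavioural sketch of the windowed search (illustrative; in hardware this is the fixed-priority arbiter plus the one-hot-to-binary decoder described above):

```python
def next_active_neuron(predictor, i, s):
    """Scan a window of s predictor bits starting at neuron i and return the
    index of the first predicted-active neuron (the LSB side, i.e. the neuron
    closest to i, has the highest priority). Returns None on a search miss,
    which costs one extra clock cycle."""
    for offset, bit in enumerate(predictor[i:i + s]):  # fixed-priority arbiter
        if bit:
            return i + offset                          # one-hot -> binary decode
    return None                                        # search miss

# Example: predictor bits 1,0,1,0,0,1,1; start at neuron 1 with window s = 4.
print(next_active_neuron([1, 0, 1, 0, 0, 1, 1], i=1, s=4))  # -> 2
```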

Outline Introduction Related work Motivation of LRADNN Low rank approximation (LRA) predictor LRADNN: hardware architecture Experiment results Conclusion

Simulation setup. Behavior-level simulation in MATLAB: offline training and fixed-point simulation. RTL implementation in Verilog. Technology node: TSMC 65nm LP standard cells; memory model: HP CACTI 6.5. Behavior, pre-synthesis, and post-synthesis simulations are compared, and power is evaluated from the post-synthesis simulation (extracted switching activities). Three real benchmarks are tested on LRADNN: MNIST, Caltech101, and SVHN.

Summary of the implementation. Micro-architecture parameters:
# multipliers (parallelism): 32
Depth of the activation register bank: 1024
Layer index width (max layer no.): 3
Fixed point for the internal data path: Q7.12
Fixed point for W memory: Q2.11
Fixed point for U, V memory: Q5.8
W memory size: 3.5 MB
U, V memory size: 448 KB
Scanning window size: 16
# V calculation registers (max rank): 128

Area and timing results. Memory is dominant for both area and timing.

                   | LRADNN (direct) | LRADNN
Total area (mm^2)  | 51.94           | 64.18
Critical path (ns) | 8.94            | 9.03

The area overhead of LRADNN over the direct design is caused by the U, V memory: 448 KB x 2 / 3.5 MB = 25%. The timing overhead (about a 1% increase) is caused by the extra MUX selection among the U, V, and W memory data.

Training results on real applications. Prediction loss: test error rate of the LRA feedforward minus the test error rate of the plain feedforward.

                              | Caltech 101 silhouettes | MNIST               | SVHN
Architecture                  | 784-1023-101            | 784-1000-600-400-10 | 1023-1000-700-400-150-10
# connections                 | 0.90M                   | 1.63M               | 2.06M
Rank                          | 50                      | 50-35-25            | 100-70-50-25
Prediction loss (fixed point) | -0.09%                  | 0.07%               | -0.93%

Theoretical lower bounds for accelerators. Number of cycles: $n_{cyc} = \dfrac{\text{number of synaptic connections in the DNN}}{\text{parallelism}} = \dfrac{\sum_{l=1}^{L-1} s_{l+1}\,(1 + s_l)}{\text{parallelism}}$. Power consumption (considering only memory access power): $\dfrac{E_{FF}}{t_{FF}} = \dfrac{E_{MEM}\, n_{cyc}}{T_{cyc}\, n_{cyc}} = \dfrac{E_{MEM}}{T_{cyc}}$, where $E_{FF}$ is the energy consumed during the feedforward pass, $t_{FF}$ the elapsed time of the feedforward pass, $n_{cyc}$ the number of ideal cycles, and $E_{MEM}$ the memory read energy per access.
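
A quick check of the cycle lower bound for the benchmark networks (architectures from the training-results slide, parallelism 32 from the implementation summary); the results agree with the "Ideal" cycle counts quoted on the next slide.

```python
def ideal_cycles(layer_sizes, parallelism):
    """Lower bound on cycles: total synaptic connections (weights plus
    biases) divided by the number of multipliers working in parallel."""
    connections = sum(layer_sizes[l + 1] * (1 + layer_sizes[l])
                      for l in range(len(layer_sizes) - 1))
    return connections / parallelism

for name, arch in [("Caltech101", [784, 1023, 101]),
                   ("MNIST", [784, 1000, 600, 400, 10]),
                   ("SVHN", [1023, 1000, 700, 400, 150, 10])]:
    print(name, round(ideal_cycles(arch, 32)))  # 28327, 50938, 64586
```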

Timing and power results on real applications. Power consumption: average of post-synthesis simulations over the first 10 test samples.

Number of cycles:
                        | Ideal | LRADNN (direct) | LRADNN
Caltech 101 silhouettes | 28327 | 29840           | 23141
MNIST                   | 50938 | 52971           | 30105
SVHN                    | 64586 | 66245           | 49371

Power consumption (mW) / energy consumption (mJ):
                        | Ideal         | LRADNN (direct) | LRADNN
Caltech 101 silhouettes | 517.76 / 0.15 | 551.61 / 0.16   | 487.88 / 0.11
MNIST                   | 517.76 / 0.26 | 557.98 / 0.30   | 459.73 / 0.14
SVHN                    | 517.76 / 0.33 | 561.42 / 0.37   | 438.37 / 0.22

Scalability for high parallelism of LRADNN. At high parallelism the hardware (multipliers) is not fully utilized because of the memory alignment: an m x n matrix must be padded up to multiples of p in both dimensions, giving a utilization of roughly $U = \dfrac{m\,n}{\lceil m/p \rceil\, p \cdot \lceil n/p \rceil\, p}$. [Figure: hardware utilization under different parallelisms (16, 32, 64, 128, 256, 512) for Caltech 101 silhouettes, MNIST, and SVHN.]
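
Under this padded-matrix reading of the utilization formula (my interpretation of the slide, so the exact expression may differ), here is a small sketch of how utilization falls off as the parallelism p grows:

```python
import math

def utilization(m, n, p):
    """Fraction of multiplier work that is useful when an m x n matrix is
    padded up to multiples of the word width p in both dimensions."""
    padded = (math.ceil(m / p) * p) * (math.ceil(n / p) * p)
    return (m * n) / padded

# Example: a 1000 x 784 weight matrix at the parallelisms from the chart.
for p in [16, 32, 64, 128, 256, 512]:
    print(p, round(utilization(1000, 784, p), 3))
```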

Outline Introduction Related work Motivation of LRADNN Low rank approximation (LRA) predictor LRADNN: hardware architecture Experiment results Conclusion

Conclusion. A general hardware accelerator, LRADNN, is proposed for DNNs: a time- and power-saving accelerator based on LRA, achieving a 31% ~ 53% energy reduction and a 22% ~ 43% throughput increase. It compares favourably with the existing work:

                                | AxNN          | LRADNN
Prediction loss                 | 0.5%          | < 0.1%
Energy improvement (w/o memory) | 1.14x ~ 1.92x | 1.18x ~ 1.61x
Energy improvement (w/ memory)  | N.A.          | 1.45x ~ 2.13x

Thank you Q & A