Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks


2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. Yufei Ma, Yu Cao, Sarma Vrudhula, Jae-sun Seo. School of Electrical, Computer and Energy Engineering; School of Computing, Informatics, Decision Systems Engineering. Arizona State University, Tempe, USA. 2017/02/22.

Outline: Overview of CNN Algorithm and Accelerator; Convolution Loop Optimization; Design Objectives of CNN Accelerator; Loop Optimization in Related Works; Proposed CNN Accelerator; Experimental Results; Conclusion.

Convolutional Neural Networks (CNN)
The dominant approach for recognition and detection tasks. CNNs require a large number of operations (>1G) and parameters (>50M), show high variability in the sizes of different convolution layers, and are evolving rapidly toward more layers (from a few to >1000) to achieve higher accuracy. Typical pipeline: Input Image, Convolution (Conv.) + Activation, Pooling (Subsampling), Convolution + Activation, ..., Fully-connected (Inner Product).

General CNN Acceleration System
Three levels of hardware accelerator hierarchy: external memory (DRAM, ~GB), on-chip buffers (RAM, ~MB), and registers with processing engine (PE) arrays. Pixels and weights flow from external memory through the on-chip buffers into the register and PE arrays. Our contribution: an in-depth analysis of convolution loop optimization techniques, and a CNN accelerator with low communication and high performance.

Outline: Overview of CNN Algorithm and Accelerator; Convolution Loop Optimization (Loop Unrolling, Loop Tiling, Loop Interchange); Design Objectives of CNN Accelerator; Loop Optimization in Related Works; Proposed CNN Accelerator; Experimental Results; Conclusion.

Convolution Parameters and Loops
Convolution takes Nif input feature maps (Nix × Niy each) and Nof kernel maps (Nkx × Nky × Nif each) and produces Nof output feature maps (Nox × Noy each). The computation is a nest of four loops:
Loop-1: MAC within a kernel window of Nkx × Nky.
Loop-2: across the Nif input feature maps.
Loop-3: scan within one input feature map along Nix × Niy.
Loop-4: across the Nof output feature maps.
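As a concrete reference, the four loops can be written directly as a loop nest. Below is a minimal Python/NumPy sketch that follows the slide's loop numbering and naming; stride 1 and no padding are assumed for brevity (these are illustration choices, not the paper's fixed configuration):

```python
import numpy as np

def conv_loops(pixels, weights):
    """Direct 4-loop convolution (stride 1, no padding).

    pixels:  (Nif, Niy, Nix) input feature maps
    weights: (Nof, Nif, Nky, Nkx) kernel maps
    returns: (Nof, Noy, Nox) output feature maps
    """
    Nif, Niy, Nix = pixels.shape
    Nof, _, Nky, Nkx = weights.shape
    Noy, Nox = Niy - Nky + 1, Nix - Nkx + 1
    out = np.zeros((Nof, Noy, Nox))
    for of in range(Nof):                      # Loop-4: across output feature maps
        for oy in range(Noy):                  # Loop-3: scan within one feature map
            for ox in range(Nox):
                for i_f in range(Nif):         # Loop-2: across input feature maps
                    for ky in range(Nky):      # Loop-1: MAC within kernel window
                        for kx in range(Nkx):
                            out[of, oy, ox] += (pixels[i_f, oy + ky, ox + kx]
                                                * weights[of, i_f, ky, kx])
    return out
```

The unrolling, tiling, and interchange choices discussed in the following slides are all transformations of this one nest.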

Loop Optimization Techniques
Loop Unrolling: parallel computation of convolution MACs; determines the register arrays and PE architecture. Loop Tiling: increases data locality; determines the on-chip buffer sizes. Loop Interchange: the computation order of the four convolution loops.

Parameters and Design Variables
N*: convolution dimensions; T*: loop tiling design variables; P*: loop unrolling design variables. Each unrolling/tiling variable is bounded by 1 ≤ P* ≤ T* ≤ N*, e.g. 1 ≤ Pif ≤ Tif ≤ Nif.

Loop Unrolling: Unroll Loop-1
Pkx × Pky pixel/weight pairs from one kernel window are multiplied in parallel (Pkx × Pky multipliers), summed by an adder tree, and accumulated into a single DFF accumulator.

Loop Unrolling: Unroll Loop-2
Pif pixel/weight pairs, one from each of Pif input feature maps, are multiplied in parallel (Pif multipliers), summed by an adder tree, and accumulated into a single DFF accumulator.

Loop Unrolling: Unroll Loop-3
Pix × Piy pixels from one input feature map are each multiplied by the same weight (Pix × Piy multipliers) and accumulated into Pix × Piy separate DFF accumulators. The weight is reused Pix × Piy times.

Loop Unrolling: Unroll Loop-4
One pixel is multiplied by Pof different kernel weights (Pof multipliers) and accumulated into Pof separate DFF accumulators. The pixel is reused Pof times.

Loop Optimization Techniques
Loop Unrolling (P*): the total number of parallel MACs (multipliers) is Pm = Pkx × Pky × Pif × Pix × Piy × Pof.
Loop Tiling (T*): input pixel buffer size = Tif × Tix × Tiy × pixel_datawidth; weight buffer size = Tkx × Tky × Tif × Tof × weight_datawidth; output pixel buffer size = Tox × Toy × Tof × pixel_datawidth.
Loop Interchange: intra-tile order (buffers to registers or PEs) and inter-tile order (external memory to buffers).
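These expressions are easy to check numerically. A small sketch follows; the concrete dimensions are made-up illustration values, not taken from the paper:

```python
# Hypothetical unrolling (P*) and tiling (T*) variables, for illustration only.
Pkx, Pky, Pif, Pix, Piy, Pof = 1, 1, 1, 14, 14, 16
Tkx, Tky, Tif, Tof = 3, 3, 96, 32
Tix, Tiy, Tox, Toy = 16, 16, 14, 14
pixel_bits = weight_bits = 16

# Total number of parallel MACs (= multipliers), per the slide's formula:
Pm = Pkx * Pky * Pif * Pix * Piy * Pof

# On-chip buffer sizes implied by the tiling variables, in bits:
in_buf_bits  = Tif * Tix * Tiy * pixel_bits          # input pixel buffer
wt_buf_bits  = Tkx * Tky * Tif * Tof * weight_bits   # weight buffer
out_buf_bits = Tox * Toy * Tof * pixel_bits          # output pixel buffer
```

With these values Pm = 3,136 MACs, so the unrolling factors directly set the multiplier count, while the tiling factors directly set the RAM budget.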

Outline: Overview of CNN Algorithm and Accelerator; Convolution Loop Optimization; Design Objectives of CNN Accelerator (Minimize Partial Sum Storage, Minimize On-chip Buffer Access, Minimize Off-chip DRAM Access); Loop Optimization in Related Works; Proposed CNN Accelerator; Experimental Results; Conclusion.

Design Space Exploration: Minimize #psum
#psum = the number of partial sums that must be stored in memory. To obtain one output pixel, Loop-1 and Loop-2 must be finished. With 1 ≤ P* ≤ T* ≤ N* (e.g. 1 ≤ Pof ≤ Tof ≤ Nof), the decision tree is:
If Loop-1&2 are fully unrolled (Pkx = Tkx = Nkx, Pky = Tky = Nky, Pif = Tif = Nif): #psum = Pkx × Pky × Pif.
Else, if Loop-1&2 are fully buffered (Tkx = Nkx, Tky = Nky, Tif = Nif), the intra-tile order decides: Loop-1&2 computed first gives #psum = Pof × Pox × Poy; Loop-3 computed before Loop-4 gives #psum = Pof × Tox × Toy; Loop-4 computed before Loop-3 gives #psum = Tof × Pox × Poy; Loop-1 or Loop-2 computed last gives #psum = Tof × Tox × Toy.
Otherwise, the inter-tile order decides: Loop-1&2 computed first gives #psum = Tof × Tox × Toy; Loop-3 computed before Loop-4 gives #psum = Tof × Nox × Noy; Loop-4 computed before Loop-3 gives #psum = Nof × Tox × Toy; Loop-1 or Loop-2 computed last gives #psum = Nof × Nox × Noy.

Design Space Exploration: Minimize #psum (case 1)
If both Loop-1 and Loop-2 are fully unrolled (Pkx = Tkx = Nkx, Pky = Tky = Nky, Pif = Tif = Nif), all partial sums of one output pixel are produced together, so #psum = Pkx × Pky × Pif.

Design Space Exploration: Minimize #psum (case 2)
If Loop-1&2 are fully buffered (Tkx = Nkx, Tky = Nky, Tif = Nif), partial sums can be consumed within one tile: computing intra-tile Loop-1&2 first gives #psum = Pof × Pox × Poy, whereas computing Loop-1 or Loop-2 last gives #psum = Tof × Tox × Toy. The earlier Loop-1&2 are computed, the smaller #psum is.

Design Space Exploration: Minimize #psum (case 3)
Otherwise, partial sums have to be kept across different tiles: computing inter-tile Loop-1&2 first gives #psum = Tof × Tox × Toy, while computing Loop-1 or Loop-2 last gives #psum = Nof × Nox × Noy, the largest partial-sum storage, which may even require external memory.

Design Space Exploration: Minimize #psum (our design point)
Loop-1&2 are fully buffered and computed first, so only #psum = Pof × Pox × Poy partial sums are live at a time, and they are stored in registers.
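The headline cases above can be condensed into a small helper. This is a simplified sketch covering only the main branches of the decision tree, not the paper's full enumeration of intra-/inter-tile orders:

```python
def num_psum(P, T, N, loops12_first):
    """Live partial-sum count for the main cases of the #psum decision tree.

    P, T, N: dicts of unroll / tile / layer sizes keyed 'kx','ky','if','of','ox','oy'.
    loops12_first: whether Loop-1&2 are computed before Loop-3&4.
    """
    fully_unrolled = all(P[k] == T[k] == N[k] for k in ('kx', 'ky', 'if'))
    fully_buffered = all(T[k] == N[k] for k in ('kx', 'ky', 'if'))
    if fully_unrolled:                     # one output pixel finished at once
        return P['kx'] * P['ky'] * P['if']
    if fully_buffered and loops12_first:   # consumed within one tile
        return P['of'] * P['ox'] * P['oy']
    if fully_buffered:                     # kept until the tile ends
        return T['of'] * T['ox'] * T['oy']
    return N['of'] * N['ox'] * N['oy']     # worst case: kept across tiles
```

For example, with Loop-1&2 buffered and computed first, the count drops from a whole tile (Tof × Tox × Toy) to just the unrolled window (Pof × Pox × Poy).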

Reduce Buffer Access by Data Reuse
A single pixel or weight fetched from the on-chip buffers can be reused by multiple multipliers. With Loop-3 unrolled, the reuse count of a weight is Reuse_wt = Pix × Piy.

Reduce Buffer Access by Data Reuse
Reuse count of a pixel: with Loop-4 unrolled, Reuse_px = Pof if Pkx = 1 and Pky = 1. With both Loop-1 and Loop-3 unrolled, overlapping kernel windows add further reuse: Reuse_px = Pof × Pkx × Pky × Pix × Piy / (((Pix - 1)·S + Pkx) × ((Piy - 1)·S + Pky)), where S is the sliding stride.
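The two reuse counts can be sketched directly from the slide's formulas (names follow the slide; S is the sliding stride):

```python
def weight_reuse(Pix, Piy):
    # Loop-3 unrolled: Pix * Piy MAC units consume the same weight each cycle.
    return Pix * Piy

def pixel_reuse(Pof, Pkx, Pky, Pix, Piy, S):
    # Loop-4 unrolled (and, when Pkx/Pky > 1, Loop-1 too): a fetched pixel is
    # shared by Pof kernels and by the overlapping kernel windows.
    window_px = ((Pix - 1) * S + Pkx) * ((Piy - 1) * S + Pky)
    return Pof * Pkx * Pky * Pix * Piy / window_px
```

With Pkx = Pky = 1 and stride S = 1 the second formula reduces to Pof, matching the special case stated on the slide.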

Design Space Exploration: Minimize DRAM Access
Both weights and intermediate pixel results are stored in external memory, because large-scale CNNs (>50 MB) exceed the limited FPGA RAM capacity (<50 Mbits). The minimum external-memory traffic reads every pixel and weight only once, which requires maximal data reuse through the on-chip buffers and proper loop computing orders.

Design Space Exploration: Minimize DRAM Access
#DRAM_px = number of DRAM reads of one input pixel in one layer; #DRAM_wt = number of DRAM reads of one weight in one layer. The decision rules:
If all input pixels are buffered (Tif × Tix × Tiy = Nif × Nix × Niy) or all weights are buffered (Tif × Tkx × Tky × Tof = Nif × Nkx × Nky × Nof): #DRAM_px = 1, #DRAM_wt = 1.
Else, if all the weights required by a pixel are buffered (Tkx × Tky × Tof = Nkx × Nky × Nof): #DRAM_px = 1; #DRAM_wt = 1 if inter-tile Loop-3 is computed first, otherwise Nox × Noy / (Tox × Toy).
Else, if all the pixels required by a weight are buffered (Tix × Tiy = Nix × Niy): #DRAM_wt = 1; #DRAM_px = 1 if inter-tile Loop-4 is computed first, otherwise Nof / Tof.
Otherwise, the inter-tile order decides: Loop-3 computed first gives #DRAM_px = Nof/Tof, #DRAM_wt = 1; Loop-4 computed first gives #DRAM_px = 1, #DRAM_wt = Nox × Noy / (Tox × Toy); if neither is completed first, #DRAM_px = Nof/Tof and #DRAM_wt = Nox × Noy / (Tox × Toy).
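The rules above can be encoded as a small helper; this is a sketch of the decision rules as read from the slide, with `first` naming the inter-tile loop that is completed first (3, 4, or None):

```python
def dram_reads(T, N, first=None):
    """Return (#DRAM_px, #DRAM_wt): DRAM reads per pixel / per weight.

    T, N: dicts of tile / layer sizes keyed 'if','ix','iy','kx','ky','of','ox','oy'.
    """
    all_px = T['if'] * T['ix'] * T['iy'] == N['if'] * N['ix'] * N['iy']
    all_wt = (T['if'] * T['kx'] * T['ky'] * T['of']
              == N['if'] * N['kx'] * N['ky'] * N['of'])
    if all_px or all_wt:                      # everything needed fits on chip
        return 1, 1
    wt_for_px = T['kx'] * T['ky'] * T['of'] == N['kx'] * N['ky'] * N['of']
    px_for_wt = T['ix'] * T['iy'] == N['ix'] * N['iy']
    px = 1 if (wt_for_px or first == 4) else N['of'] // T['of']
    wt = (1 if (px_for_wt or first == 3)
          else (N['ox'] * N['oy']) // (T['ox'] * T['oy']))
    return px, wt
```

The helper makes the trade-off concrete: with insufficient buffering, whichever of Loop-3/Loop-4 is completed first protects one operand from re-reads at the cost of the other.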

Design Space Exploration: Minimize DRAM Access (case 1)
If all input pixels are buffered (Tif × Tix × Tiy = Nif × Nix × Niy) or all weights are buffered (Tif × Tkx × Tky × Tof = Nif × Nkx × Nky × Nof), all buffered data are exhaustively utilized: #DRAM_px = 1, #DRAM_wt = 1.

Design Space Exploration: Minimize DRAM Access (case 2)
If all the weights required by a pixel are buffered (Tkx × Tky × Tof = Nkx × Nky × Nof), then #DRAM_px = 1; likewise, if all the pixels required by a weight are buffered (Tix × Tiy = Nix × Niy), then #DRAM_wt = 1.

Design Space Exploration: Minimize DRAM Access (case 3)
Otherwise, the inter-tile loop order decides which operand is re-read: if Loop-3 is computed first, each weight stays on chip until fully used (#DRAM_wt = 1); if Loop-4 is computed first, each pixel stays on chip until fully used (#DRAM_px = 1).

Design Space Exploration: Minimize DRAM Access (our design point)
Loop-3 and Loop-4 are NOT computed first in our schedule; instead, we assign enough buffer size so that #DRAM_px = 1 and #DRAM_wt = 1 regardless.

Design Space Exploration: Minimize DRAM Access
Buffer size is traded against the number of DRAM accesses by tuning Toy and Tof, with Tkx × Tky = Nkx × Nky and Tif = Nif (Loop-1&2 fully buffered) and Tox = Nox (continuous data benefits DMA transactions). Our design points are chosen to fit the ~50 Mbits of RAM in our FPGA. (Figure: on-chip buffer size vs. number of DRAM accesses and conv. delay, where estimated conv. delay = computing time + DRAM delay.)

Outline: Overview of CNN Algorithm and Accelerator; Convolution Loop Optimization; Design Objectives of CNN Accelerator; Loop Optimization in Related Works (Type-(A): unroll Loop-1, Loop-2, Loop-4; Type-(B): unroll Loop-2, Loop-4; Type-(C): unroll Loop-1, Loop-3; Type-(D): unroll Loop-3, Loop-4); Proposed CNN Accelerator; Experimental Results; Conclusion.

Loop Optimization in Related Work
Type-(A): unroll Loop-1, Loop-2, and Loop-4 [1][2][3], with different combinations of Pkx × Pky, Pif, and Pof in each design.
[1] J. Qiu, et al. Going deeper with embedded FPGA platform for convolutional neural network. In ACM FPGA 2016.
[2] H. Li, et al. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In FPL 2016.
[3] M. Motamedi, et al. Design space exploration of FPGA-based deep convolutional neural networks. In ASP-DAC 2016.

Loop Optimization in Related Work
Unrolling Loop-1 employs parallelism within the kernel maps, but kernel sizes (Nkx × Nky) are small, so it cannot provide sufficient parallelism; kernel sizes also vary considerably across different conv. layers, causing workload imbalance and PE mapping difficulty.

Loop Optimization in Related Work
Type-(B): unroll Loop-2 and Loop-4 [4][5]. Type-(C): unroll Loop-1 and Loop-3 [6][7].
[4] C. Zhang, et al. Optimizing FPGA-based accelerator design for deep Convolutional Neural Networks. In ACM FPGA 2015.
[5] Y. Ma, et al. Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In FPL 2016.
[6] Y.-H. Chen, et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep Convolutional Neural Networks. In ISSCC 2016.
[7] Y.-H. Chen, et al. Eyeriss: A spatial architecture for energy-efficient dataflow for Convolutional Neural Networks. In ISCA 2016.

Loop Optimization in Related Work
Type-(A) and Type-(B) only reuse pixels by unrolling Loop-4. Type-(C) reuses pixels through the overlapping of Loop-1 and Loop-3, and reuses weights by unrolling Loop-3; however, Type-(C) is also affected by the issue of unrolling Loop-1 (small, variable kernel sizes).

Loop Optimization in Related Work
Type-(D): unroll Loop-3 and Loop-4 [8]. Type-(D) reuses weights by unrolling Loop-3 and pixels by unrolling Loop-4, and Nix × Niy × Nof is large enough to provide sufficient parallelism. However, the data tiles in [8] do NOT cover Loop-2, which increases the movement and storage of partial sums.
[8] A. Rahman, et al. Efficient FPGA acceleration of Convolutional Neural Networks using logical-3D compute array. In DATE 2016.

Outline: Overview of CNN Algorithm and Accelerator; Convolution Loop Optimization; Design Objectives of CNN Accelerator; Loop Optimization in Related Works; Proposed CNN Accelerator (Acceleration Scheme, Convolution Dataflow, Convolution PE Architecture); Experimental Results; Conclusion.

Proposed Acceleration Scheme
Loop Unrolling (P*): Pkx = Pky = 1, Pif = 1, Pox = Poy = 14, Pof = 16 for all conv. layers; weights are reused by unrolling Loop-3 and pixels by unrolling Loop-4; total # of PEs = 3,136 (the # of DSP multipliers in our FPGA); constant unrolling factors give a uniform PE mapping for all conv. layers.
Loop Tiling (T*): Loop-1 & Loop-2 are fully buffered (Tkx × Tky = Nkx × Nky, Tif = Nif); only Pof × Pox × Poy partial sums are stored, in separate registers; either Tix × Tiy = Nix × Niy or Tof = Nof for every conv. layer, ensuring every pixel and weight is read only once from DRAM.
Loop Interchange: Loop-1&2 are serially computed first to consume partial sums as soon as possible.
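A quick consistency check of the chosen factors (all numbers from the slide):

```python
# Unrolling factors chosen for all conv. layers.
Pkx = Pky = Pif = 1
Pox = Poy = 14      # with Pkx = Pky = 1 and stride 1, Pix = Pox and Piy = Poy
Pof = 16

total_PEs = Pkx * Pky * Pif * Pox * Poy * Pof   # parallel MACs
psum_regs = Pof * Pox * Poy                      # partial sums kept in registers

print(total_PEs, psum_regs)   # -> 3136 3136
```

So the unrolling factors exactly fill the 3,136 DSP multipliers, and the live partial sums (one per MAC) fit in registers rather than RAM.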

Proposed Convolution Dataflow
Serially compute Loop-1, illustrated with a 3×3 conv kernel on a zero-padded input feature map. MAC unit M21 consumes one kernel weight per clock cycle, stepping through the nine weights of the kernel map over nine cycles against the pixels of its kernel window, which are held in registers (R), to produce one output pixel. (Figure: zero-padded 6×6 input feature map, the kernel map numbered 1-9, and the per-clock weight (WT) schedule for M21.)

Proposed Convolution Dataflow
The adjacent output pixel is computed the same way by MAC unit M22, using the same nine kernel weights against a kernel window shifted by one pixel. (Figure: per-clock weight (WT) schedule for M22.)

Proposed Convolution Dataflow
Likewise, the next adjacent output pixel is computed by MAC unit M23 with the same nine kernel weights against its own shifted kernel window. (Figure: per-clock weight (WT) schedule for M23.)

Proposed Convolution Dataflow
The same weight is used (multiplied) at all MAC units (M21, M22, M23) in a given clock cycle, since adjacent output pixels consume the kernel weights in the same order. (Figure: per-clock pixel/weight schedule shared by the three MACs.)

Proposed Convolution Dataflow
Pixels are reused by movements among the register arrays (R23-R25): overlapping kernel windows obtain their pixels from neighboring registers instead of re-reading the on-chip buffers. (Figure: per-clock register/weight schedule.)

Proposed Convolution Dataflow
Unroll Loop-3: parallelism within one input feature map. A 3×3 array of MAC units (M11-M33) computes 3×3 adjacent output pixels simultaneously, and the same weight is used (multiplied) at all MACs in a given clock cycle. (Figure: per-clock register/weight schedule for the nine MACs.)

Proposed Convolution Dataflow
With Loop-3 unrolled, the same weight feeds all MAC units in each clock cycle while pixels are reused by movements among the register arrays (R11-R35), so each fetched pixel serves several kernel windows. (Figure: per-clock register/weight schedule.)

Proposed Convolution Dataflow
Blue pixels reside in on-chip buffers (BUF) and are loaded into the register arrays Pix × Piy at a time, with BUF addresses offset to account for the zero padding; green pixels are in registers (R), and M* denotes MAC units. Illustration parameters: Nkx = Nky = 3, Pkx = Pky = 1, Pix = Piy = 3. (Figure: contents of Input Pixel BUF 1-3 and the per-clock register/weight schedule.)

BUF address Kernel Weights - 44 - Proposed Convolution Dataflow Same weight is used (multiplied) at all MACs in a given clock cycle Pixels are reused by movements among register arrays. Input feature map with zero padding Pix Piy 0 0 0 0 0 0 0 0 0 11 12 13 14 15 16 0 0 21 22 23 24 25 26 0 0 31 32 33 34 35 36 0 0 41 42 43 44 45 46 0 0 51 52 53 54 55 56 0 0 61 62 63 64 65 66 0 0 0 0 0 0 0 0 0 Kernel Map 1 2 3 4 5 6 7 8 9 Nkx = Nky = 3 Pkx = Pky = 1 Pix = Piy = 3 M* denotes MAC units Green pixels are in registers (R) Blue pixels are in buffers (BUF) M11 M12 M13 M21 M22 M23 M31 M32 M33 clk R11 R12 R13 R14 R15 R21 R22 R23 R24 R25 R31 R32 R33 R34 R35 WT 0 0 0 0 0 11 12 0 21 22 1 1 0 0 0 0 0 11 12 13 0 21 22 23 2 2 0 0 0 0 0 0 11 12 13 14 0 21 22 23 24 3 3 13 14 0 11 12 23 24 0 21 22 0 31 32 4 4 14 0 11 12 13 24 0 21 22 23 0 31 32 33 5 5 0 11 12 13 14 0 21 22 23 24 0 31 32 33 34 6 6 23 24 0 21 22 33 34 0 31 32 0 41 42 7 7 24 0 21 22 23 34 0 31 32 33 0 41 42 43 8 8 0 21 22 23 24 0 31 32 33 34 0 41 42 43 44 9 0 0 0 0 16 11 12 26 21 22 1 0 0 0 13 14 15 23 24 25 2 36 31 32 46 41 42 offset by zero 56 51 52 padding 3 33 34 35 43 44 45 53 54 55 Input Pixel BUF 1 Input Pixel BUF 2 Input Pixel BUF 3

BUF address Kernel Weights - 44 - Proposed Convolution Dataflow Same weight is used (multiplied) at all MACs in a given clock cycle Pixels are reused by movements among register arrays. Input feature map with zero padding Pix Piy 0 0 0 0 0 0 0 0 0 11 12 13 14 15 16 0 0 21 22 23 24 25 26 0 0 31 32 33 34 35 36 0 0 41 42 43 44 45 46 0 0 51 52 53 54 55 56 0 0 61 62 63 64 65 66 0 0 0 0 0 0 0 0 0 Kernel Map 1 2 3 4 5 6 7 8 9 Nkx = Nky = 3 Pkx = Pky = 1 Pix = Piy = 3 M* denotes MAC units Green pixels are in registers (R) Blue pixels are in buffers (BUF) M11 M12 M13 M21 M22 M23 M31 M32 M33 clk R11 R12 R13 R14 R15 R21 R22 R23 R24 R25 R31 R32 R33 R34 R35 WT 0 0 0 0 0 11 12 0 21 22 1 1 0 0 0 0 0 11 12 13 0 21 22 23 2 2 0 0 0 0 0 0 11 12 13 14 0 21 22 23 24 3 3 13 14 0 11 12 23 24 0 21 22 0 31 32 4 4 14 0 11 12 13 24 0 21 22 23 0 31 32 33 5 5 0 11 12 13 14 0 21 22 23 24 0 31 32 33 34 6 6 23 24 0 21 22 33 34 0 31 32 0 41 42 7 7 24 0 21 22 23 34 0 31 32 33 0 41 42 43 8 8 0 21 22 23 24 0 31 32 33 34 0 41 42 43 44 9 0 0 0 0 16 11 12 26 21 22 1 0 0 0 13 14 15 23 24 25 2 36 31 32 46 41 42 offset by zero 56 51 52 padding 3 33 34 35 43 44 45 53 54 55 Input Pixel BUF 1 Input Pixel BUF 2 Input Pixel BUF 3

Convolution PE Architecture
- Pix (14) × Piy (14) × Pof (16) independent PEs (MAC units).
- Pixels are shared by Pof PEs; weights are shared by Pix × Piy PEs, respectively.
- Partial sums are consumed inside each MAC unit.
[Figure: Weight BUF feeding the Pix × Piy MAC array (M11-M33), replicated Pof times; register rows R11-R35 load pixels from Input Pixel BUFs 1-3 and shift them across clock cycles 0-8.]
- 56 -
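As a sanity check on these unroll factors: Pix × Piy × Pof MAC units, each performing a multiply and an add per cycle at the design's 150 MHz clock, bound the peak throughput, and comparing against the 645.25 GOPS reported later gives the achieved MAC efficiency. The 2-ops-per-MAC accounting is the usual GOPS convention, assumed here rather than stated on the slide.

```python
Pix, Piy, Pof = 14, 14, 16              # unrolled loop dimensions from the slide
macs = Pix * Piy * Pof                  # 3,136 parallel MAC units
freq_mhz = 150
peak_gops = macs * 2 * freq_mhz / 1e3   # 2 ops (multiply + add) per MAC per cycle
reported_gops = 645.25                  # end-to-end throughput from the results
efficiency = reported_gops / peak_gops  # fraction of peak sustained end to end
print(macs, peak_gops, round(efficiency, 3))
```

Roughly two thirds of peak is sustained end to end, the remainder going to DMA transfers and layers that do not fill the array.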

Outline
- Overview of CNN Algorithm and Accelerator
- Convolution Loop Optimization
- Design Objectives of CNN Accelerator
- Loop Optimization in Related Works
- Proposed CNN Accelerator
- Experimental Results
- Conclusion
- 57 -

Experiment System Setup
- Altera Arria 10 GX 1150 FPGA + two 4 GB DDR3L SDRAMs
  - 1,150K logic elements, 1,518 DSP blocks, 2,713 M20K RAMs
- VGG-16 CNN model
  - 13 convolution, 5 pooling, and 3 FC layers; 138.3 million parameters
- Fixed-point data with dynamically adjusted decimal points
  - 16-bit pixels, 8-bit weights, 30-bit partial sums
[Figure: system block diagram - Controller and DMA Manager (msgdmas) connect to External Memory; control signals, pixels, and weights flow through the Input Pixel Buffers, Weight Buffers, Convolution Registers/PEs, Pooling Registers/PEs, and Output Pixel Buffers.]
- 58 -
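"Dynamically adjusted decimal points" amounts to choosing, per layer, how many of the fixed-point bits are fractional so the layer's dynamic range fits. A minimal sketch, assuming round-to-nearest with saturation; the function names and the exact range rule are illustrative, not taken from the paper.

```python
import math

def choose_frac_bits(max_abs, total_bits):
    """Spend just enough integer bits (plus a sign bit) to cover the
    layer's largest magnitude; the rest become fraction bits."""
    int_bits = max(0, math.ceil(math.log2(max_abs))) if max_abs > 0 else 0
    return total_bits - 1 - int_bits        # 1 bit reserved for sign

def quantize(x, total_bits, frac_bits):
    """Round to the nearest representable fixed-point value, saturating
    at the two's-complement limits of total_bits."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = min(hi, max(lo, round(x * scale)))
    return q / scale
```

With 16-bit pixels a layer whose activations peak near 6.0 would get 12 fraction bits, while a layer peaking near 100 would get fewer, trading precision for range.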

Experimental Results
- End-to-end latency per image is 47.97 ms
  - Conv. computation (70%), DMA_conv (10%), DMA_FC (20%)
- 1,900 M20K RAMs are used
  - Conv. buffers (70.1%), FC buffers (11.9%), FIFO (3.1%), DMA (11.7%)
- Simulated thermal power is 21.2 W (from PowerPlay)
  - DSP (3.6%), M20K (15.0%), logic cells (16%), clock (12.9%), transceiver (15.7%), I/O (13.2%), static consumption (23.5%)
[Figure: timing breakdown of VGG-16 processing one input image - execution time (ms) split across conv1-conv5, pools, DMA_conv, and DMA_FC.]
- 59 -
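The percentages above decompose the 47.97 ms per-image latency into absolute times; a quick check with the slide's numbers:

```python
latency_ms = 47.97
shares = {"conv compute": 0.70, "DMA_conv": 0.10, "DMA_FC": 0.20}
# Convert each share of the total latency into milliseconds.
breakdown = {name: round(latency_ms * frac, 2) for name, frac in shares.items()}
print(breakdown)  # conv compute ~33.6 ms; DMA traffic ~14.4 ms combined
```

Convolution compute dominates, but the two DMA components together still cost roughly 14 ms per image, which is why the comparison that follows reports end-to-end rather than compute-only latency.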

Comparison with Prior Works

|                       | J. Qiu FPGA'16 | N. Suda FPGA'16 | C. Zhang ISLPED'16 | This Work        |
|-----------------------|----------------|-----------------|--------------------|------------------|
| CNN Model             | VGG-16         | VGG-16          | VGG-16             | VGG-16           |
| FPGA                  | Zynq XC7Z045   | Stratix-V GSD8  | Virtex-7 VX690t    | Arria-10 GX 1150 |
| Frequency (MHz)       | 150            | 120             | 150                | 150              |
| # Operations (GOP)    | 30.76          | 30.95           | 30.95              | 30.95            |
| # of Weights          | 50.18 M        | 138.3 M         | 138.3 M            | 138.3 M          |
| Precision             | 16 bit         | 8-16 bit        | 16 bit             | 8-16 bit         |
| DSP Utilization       | 780 (89%)      | -               | -                  | 1,518 (100%)     |
| Logic Utilization (a) | 183K (84%)     | -               | -                  | 161K (38%)       |
| On-chip RAM (b)       | 486 (87%)      | -               | -                  | 1,900 (70%)      |
| Latency/Image (ms)    | 224.6          | 262.9           | 151.8              | 47.97            |
| Throughput (GOPS)     | 136.97         | 117.8           | 203.9              | 645.25           |

a. Xilinx FPGAs in LUTs; Altera FPGAs in ALMs
b. Xilinx FPGAs in BRAMs (36 Kb); Altera FPGAs in M20K RAMs (20 Kb)
- 60 -
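Throughput in the table is just operations divided by latency; recomputing it from the "# Operations" and "Latency/Image" rows reproduces the listed GOPS figures to within rounding of the published latencies:

```python
# (total GOP per image, latency per image in ms) from the comparison table
works = {
    "J. Qiu FPGA'16":     (30.76, 224.6),
    "N. Suda FPGA'16":    (30.95, 262.9),
    "C. Zhang ISLPED'16": (30.95, 151.8),
    "This Work":          (30.95, 47.97),
}
for name, (gop, latency_ms) in works.items():
    gops = gop / latency_ms * 1000      # GOP per second
    print(f"{name}: {gops:.1f} GOPS")
```

The ratio of this work's throughput to C. Zhang's 203.9 GOPS, the best prior result, is where the 3.2× speedup in the conclusion comes from.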

Outline
- Overview of CNN Algorithm and Accelerator
- Convolution Loop Optimization
- Design Objectives of CNN Accelerator
- Loop Optimization in Related Works
- Proposed CNN Accelerator
- Experimental Results
- Conclusion
- 61 -

Conclusion
- In-depth analysis of the convolution acceleration strategy
  - Numerically characterize the loop optimization techniques, including loop unrolling, tiling, and interchange
  - Quantitatively investigate the relationship between accelerator objectives and design variables
- Propose an efficient dataflow and architecture to minimize data communication and enhance throughput
- Implement VGG-16 on an Arria 10 FPGA
  - End-to-end throughput of 645.25 GOPS at 47.97 ms latency per image
  - Outperforms all prior VGG-16 FPGA implementations by 3.2×
- 62 -

2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Thanks! Questions?
- 63 -