SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga, Dr. Hyesoon Kim, Dr. Saibal Mukhopadhyay
Motivation Power is a first-order design constraint, especially for embedded devices. Certain applications are prevalent in embedded devices: image processing, audio processing, context awareness, etc. There is a trend of including more specialized, energy-efficient accelerators (e.g., Moto X, 2013).
Cellular Neural Networks We investigated using Cellular Neural Networks (CNNs) as a specialized accelerator. They are used for a variety of applications, particularly image processing, and hardware implementations are known to consume very little power: Lee et al. (2008) implemented a CNN chip that consumed 84 mW @ 130 nm. Issue: the processing capabilities of current hardware Cellular Neural Networks have not scaled with the growth in image sizes.
Outline Cellular Neural Network (CNN) Background Multiplexing SP-CNN Architecture Results
Neural Networks Radial Basis Function Network, Self-Organizing Map, Recurrent Neural Network, Adaptive Resonance Theory, Pulse-Coupled Neural Network, Spiking Neural Network, Artificial Neural Network, Hopfield Neural Network, Convolutional Neural Network, Cellular Neural Network (CNN)
CNN Background Introduced by Leon Chua and Lin Yang in 1988. Characterized by a spatial arrangement of locally-coupled cells; the set of cells a cell is coupled with is known as the cell's neighborhood. A typical Cellular Neural Network consists of an MxN array of cells C(i,j).
CNN Applications Image Processing (edge detection, image segmentation, movement detection, etc.), Pattern Recognition, Associative Memory, Solving Partial Differential Equations, Pre-processing data for other neural networks. The CNN Universal Machine is Turing complete.
CNN Cell Each cell is a dynamical system with an input, output, and a state that evolves according to some prescribed laws. Most CNNs follow the standard CNN state and output equations:
\frac{dx_{ij}(t)}{dt} = -x_{ij}(t) + \sum_{C(k,l) \in N_r(i,j)} a_{kl} y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} b_{kl} u_{kl} + z
y_{ij}(t) = \frac{1}{2} \left( |x_{ij}(t) + 1| - |x_{ij}(t) - 1| \right)
where N_r(i,j) is the neighborhood of the cell at location (i,j), x is the current state, y is the cell output, u is the cell input, z is the threshold value, and a and b are the feedback and feed-forward weights.
Digital CNN Cell For our work, we focused on digital CNN implementations. In these implementations, the state equation simplifies to:
x_{ij}(t+1) = \sum_{C(k,l) \in N_r(i,j)} a_{kl} y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} b_{kl} u_{kl} + z
[Diagram: outputs of neighborhood cells weighted by A and inputs weighted by B are summed with z to produce the next state x_ij, from which the output y_ij is computed.]
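As an illustration, one synchronous update of this digital state equation can be sketched in NumPy. This is a minimal sketch, not the SP-CNN implementation; the function and parameter names are ours, and a 3x3 neighborhood (r = 1) is assumed:

```python
import numpy as np

def cnn_step(x, u, A, B, z, boundary=0.0):
    # One synchronous update of a digital CNN cell array (3x3 neighborhood).
    # x: current state, u: input, A/B: 3x3 feedback/feed-forward templates,
    # z: threshold. Cells outside the array are fixed at `boundary`.
    y = 0.5 * (np.abs(x + 1) - np.abs(x - 1))            # standard output equation
    yp = np.pad(y, 1, constant_values=boundary)          # outputs with boundary halo
    up = np.pad(u, 1, constant_values=boundary)          # inputs with boundary halo
    new_x = np.full_like(x, z, dtype=float)
    M, N = x.shape
    for di in range(3):                                  # accumulate the two template sums
        for dj in range(3):
            new_x += A[di, dj] * yp[di:di + M, dj:dj + N]
            new_x += B[di, dj] * up[di:di + M, dj:dj + N]
    return new_x
```

For a single cell with only center weights a = 2, b = 4, z = -1, state 1 and input 1, the update yields -1 + 2*1 + 4*1 = 5.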
CNN Gene The programmable unit of a CNN is known as the CNN gene. The CNN gene specifies: the threshold value z of the CNN state equation, and the a and b coefficients of the CNN state equation. Furthermore, the programmer specifies the initial state and boundary conditions for the CNN.
Hole-Filling gene: A = [0.0 1.0 0.0; 1.0 2.0 1.0; 0.0 1.0 0.0], B = [0.0 0.0 0.0; 0.0 4.0 0.0; 0.0 0.0 0.0], z = -1.0
Initial state: x_{ij}(0) = 1. Boundary condition: y_{kl} = 0, u_{kl} = 0.
Example Application The following is an example of running the Hole-Filling gene, an application commonly used in character recognition, on an input image of the number eight.
Hardware CNNs Different companies and research institutions have implemented CNNs in hardware:
VAE: 80x60 array, 130 nm
Flak+: 4x4 array, 180 nm
SCAMP: 21x21 array, 600 nm
4-layer Gabor Filter: 32x64 array, 250 nm
ACE16k: 128x128 array, 350 nm
QCIF: 176x144 array, 150 nm
Kingetset+: 128x128 array (estimated size), 240 nm
The array sizes of most implementations are typically small.
Issue: Image Dim vs. CNN Array Dim Generally, most applications assume the CNN array's dimensions match the input (image) dimensions. But what if the CNN array is smaller? 1. Shrink the image. Issue: the image-size to CNN-size ratio. E.g., a 1024x1024 image on a 128x128 CNN is a 64:1 ratio! 2. Emulate an ideal CNN through multiplexing.
Ideal Multiplexing With ideal multiplexing, we transfer and run each partition on the CNN array for 1 CNN unit of time (Rd. P0 u, Rd. P0 x(0), Comp, Wr. P0 x(1), Rd. P1 u, ...). This is heavily memory bound: the number of input/state data transfers is proportional to the convergence time, and there is too little work to utilize memory-level parallelism (MLP). What if we add another CNN array to take advantage of the independence between computations?
Key Insight Let's take advantage of a CNN's ability to still converge to the correct solution in the presence of small errors. So, instead of running a partition on the CNN array for 1 unit of time, run the partition for an interval T >> 1 unit of time. This is the key insight of our Scalable and Programmable CNN (SP-CNN) architecture.
SP-CNN Multiplexing With SP-CNN multiplexing, we run each partition on the CNN array for a certain interval of time: instead of computing 1 CNN unit of time per visit (Rd. P0 u, Rd. P0 x(0), Comp, Wr. P0 x(1)), each visit computes INTERVAL units of time (e.g., INTERVAL = 4) between the partition's reads and its write-back. An iteration ends once every partition has been visited.
Boundary Conditions When running a partition on the CNN array, what should we set the boundary conditions to? To ensure that information propagates between partitions, we set the boundary condition for a given partition from its neighboring partitions' cells; along the image edge, the partition uses the image's true boundary condition. [Figure: Partition 0's boundary pixels are drawn from Partition 1 data and from the image's true boundary condition.]
Boundary Conditions Example [Figure: starting from partitions P0-P3 at time T1, the CNN array runs each partition in turn for INTERVAL units; each partition starts from its state at T1, using the true boundary condition on image edges and neighboring partitions' T1 data on internal edges. After all four partitions run, the iteration completes at T2 = T1 + 4*INTERVAL, and the next iteration begins with P0 at T2, finishing at T3 = T2 + 4*INTERVAL.]
Total Time and Virtual Time To compare convergence of Ideal-CNN vs. SP-CNN, we introduce the concept of virtual time. For Ideal-CNN, virtual time equals total time: v(4) = 4, v(8) = 8. For SP-CNN with INTERVAL = 4 and four partitions, one iteration takes 4*INTERVAL = 16 total time units but advances each partition by only INTERVAL units, so the 1st iteration ends at v = 4 and the 2nd at v = 8.
Reduced Memory Transfers If the virtual convergence time does not significantly increase in the SP-CNN case, we can significantly reduce the number of memory transfers. Ideal-CNN transfers every partition at every time step, so converging at virtual time T costs on the order of T memory transfers per partition. SP-CNN transfers each partition once per interval, so converging at virtual time T2 costs about T2/INTERVAL transfers per partition. If T ≈ T2, then memory transfers are reduced by a factor of about 1/INTERVAL.
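This reduction can be made concrete with a rough transfer-count model. This is our own back-of-the-envelope sketch, not figures from the work: it assumes each partition visit costs one input read, one state read, and one state write, and that ideal multiplexing is simply the INTERVAL = 1 case:

```python
def transfers(virtual_conv_time, interval, n_partitions):
    # Rough model: 3 transfers (read u, read x, write x) per partition visit;
    # number of visits = ceil(virtual convergence time / interval).
    visits = -(-virtual_conv_time // interval)  # ceiling division
    return 3 * n_partitions * visits

# 1024x1024 image on a 128x128 array -> 64 partitions.
ideal = transfers(1024, 1, 64)      # ideal multiplexing: one time unit per visit
spcnn = transfers(1024, 128, 64)    # SP-CNN with INTERVAL = 128
# ideal / spcnn == 128, i.e. a reduction by a factor of about INTERVAL
```

With equal virtual convergence times, the ratio of transfer counts is exactly the interval length, matching the 1/INTERVAL estimate above.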
Increased MLP By increasing the computation time per visit, we can take advantage of parallel computation within an iteration: while CNN0 computes on P0 (INTERVAL = 4), CNN1 can be reading P1's input and state, and CNN2 P2's. The lengthier computation increases memory-level parallelism.
SP-CNN Architecture
Host Processor: sends work to SP-CNN
Global Memory: stores the input and CNN state
Scheduler: assigns partitions to CNN-P units
CNN-P: CNN processing unit; processes a partition
Methodology 6 benchmarks with 10 input images. The test images are of size 1024x1024 (~720p) or 2048x2048 (~1440p).
Ideal: CNN array size equal to image size. SP-CNN: proposed architecture.
Benchmarks: Corner Detection (Local*), Edge Detection (Local*), Connected Component (Global), Hole Filling (Global), Rotation Detector (Global), Shadow Creator (Global)
*Local applications are essentially simple convolution operations that depend only on neighborhood values.
Simulators Functional simulator specifications: CNN array size = 128x128; interval time = 128 CNN time units. Timing simulator (DRAMSim2 for memory): CNN timing parameters based on the VAE architecture* (200 MHz @ 130 nm); 1 processing element per 32 cells; each CNN-P unit has 64 MSHR entries (based on NVIDIA Fermi); 2 GB DRAM with DDR3 timing parameters. *VAE is an existing hardware CNN chip implementation [Lee et al. 2008]
Virtual Convergence Results [Chart: virtual convergence time (CNN time units) of Ideal vs. SP-CNN for 1024x1024 and 2048x2048 images across Crn-Detect, Edge-Detect, Conn-Comp, Hole-Fill, Rot-Detect, and Shadow] The virtual convergence time of SP-CNN is comparable to the ideal case for both image dimensions.
Timing Results for 1024x1024 [Chart: execution time (ms) for CNN-P = 1, 2, 4, 8 across the six benchmarks, with 30 FPS and 60 FPS boundaries] We meet 60 FPS for most applications when CNN-P = 1, and for all applications when CNN-P = 8.
Timing Results for 2048x2048 [Chart: execution time (ms) for CNN-P = 1, 2, 4, 8 across the six benchmarks, with 30 FPS and 60 FPS boundaries; off-scale values: 375 ms and 195 ms] We meet 30 FPS for most applications when CNN-P = 8; 60 FPS is difficult to meet even with CNN-P = 8.
Future Work Try more computationally intensive CNN applications: video processing, face recognition, pattern recognition, etc. FPGA implementation of SP-CNN: will allow us to perform better performance and power comparisons against CPUs, GPUs, etc. Further SP-CNN optimizations: determine optimal interval values, boundary condition propagation time, and whether a given CNN gene is suitable for SP-CNN.
Conclusion Our proposed SP-CNN architecture brings scalability to small hardware CNN arrays. SP-CNN shows performance comparable to the ideal case in terms of virtual convergence time, with energy consumption around 10 to 30 mJ. SP-CNN can meet the 30 FPS and 60 FPS standards for most benchmarks; this can be further improved with better scaling of the CNN technology.
Questions?
Digital CNN Operation
x_{ij}(t+1) = \sum_{C(k,l) \in N_r(i,j)} a_{kl} y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} b_{kl} u_{kl} + z
Hole-Filling gene: A = [0.0 1.0 0.0; 1.0 2.0 1.0; 0.0 1.0 0.0], B = [0.0 0.0 0.0; 0.0 4.0 0.0; 0.0 0.0 0.0], z = -1.0
[Worked-example figure: the A template applied to a cell's neighborhood outputs with u = -1.0, showing y(t) values such as 1.0 and 0.75]
The basis of the CNN operation is that we eventually converge to the correct solution. However, there are no guarantees on the convergence time, and different inputs can converge at different rates. [Chart: output error (0-100%) of running the Hole-Filling gene on a CNN over time for various 1024x1024 test images: capitala, circle, eight, filledsquare, lowera, nine, rect, seven, vertline, zero]
Slow Propagation With slow propagation, updated information in boundary conditions is seen only at the beginning of the next iteration. [Figure: partition 0 computes y(1) from y(0); partition 1 computes from y(0)'s output, seeing partition 0's updates only in the following iteration.]
The SP-CNN algorithm can be viewed below:

while change do
  change = false
  for p in partitions do
    loadStateAndInputToCNN(p)
    for n = 0; n < interval; n++, t++ do
      change |= parallel computation of state
      parallel computation of output
    end for
    saveToNextStateFromCNN(p)
  end for
  swapStateAndNextStatePointers()
  iter += 1, vt += interval
end while
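The loop above can be sketched in runnable form. This is a hypothetical Python sketch, not the SP-CNN implementation: `sp_cnn_run`, `step`, and `tiles` are our own names, the halo handling stands in for the CNN array's boundary registers, and a double buffer plays the role of the state/next-state pointers:

```python
import numpy as np

def sp_cnn_run(state, step, tiles, interval, max_iters=1000):
    """SP-CNN multiplexing sketch. `tiles` lists (row_slice, col_slice)
    partitions of `state`; `step(padded)` advances one CNN time unit on a
    tile carrying a 1-pixel halo and returns the new interior. The halo
    (boundary condition) stays frozen for the whole interval."""
    vt = 0                                   # virtual time
    for _ in range(max_iters):
        next_state = state.copy()            # double buffer: next-state
        changed = False
        for rs, cs in tiles:
            # Tile plus halo: interior from `state`, halo from neighboring
            # partitions (or the zero true boundary at the image edge).
            padded = np.pad(state, 1)[rs.start:rs.stop + 2, cs.start:cs.stop + 2]
            for _ in range(interval):
                padded[1:-1, 1:-1] = step(padded)
            if not np.array_equal(next_state[rs, cs], padded[1:-1, 1:-1]):
                changed = True
            next_state[rs, cs] = padded[1:-1, 1:-1]
        state = next_state                   # swap state/next-state pointers
        vt += interval                       # virtual time advances by one interval
        if not changed:
            break
    return state, vt
```

With a toy "flood" update (each cell takes the max of its 3x3 neighborhood), a seed value spreads across partition boundaries over successive iterations, illustrating how boundary exchange lets information propagate through the whole image.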
Fast Propagation With fast propagation, updated information in boundary conditions is seen at the beginning of the next interval. [Figure: partition 1's interval uses the boundary data partition 0 just produced within the same iteration.]
Using fast propagation, the partition order can improve program convergence time. For example, if data is passed through the cells from right to left, then we can converge faster by moving through partitions in reverse column-major order. Partition ordering only matters if fast propagation is used.
[Chart: average speedup in total convergence time from fast propagation, for 1024x1024 and 2048x2048 images across the six benchmarks] For Connected Component, Hole Filling, and Rotation Detector, fast propagation provides an average speedup of 15% to 30%. Shadow Creator shows no benefit since information propagation in that benchmark is from right to left.
[Chart: average speedup in total convergence time from partition reordering, for 1024x1024 and 2048x2048 images] With Shadow Creator, using reverse row-major order provides average speedups of 13% and 30% for 1024x1024 and 2048x2048 images, respectively.
Virtual Convergence Time
Originally: virtualConvTime = n * T, where n is the number of iterations.
Now: virtualConvTime = \sum_{i=1}^{n} \min\left(T, \max_{p \in Partitions} convTime(p, i)\right)
where convTime(p, i) is the convergence time for partition p during iteration i.
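The refined formula translates directly into a small helper (our own illustrative naming; `conv_times` is assumed to be a per-iteration list of per-partition convergence times):

```python
def virtual_conv_time(conv_times, T):
    # conv_times[i][p]: convergence time of partition p during iteration i.
    # Each iteration contributes the slowest partition's convergence time,
    # capped at the interval length T.
    return sum(min(T, max(per_partition)) for per_partition in conv_times)

# e.g. with T = 4, iterations [(3, 5), (2, 2)] contribute min(4, 5) + min(4, 2) = 6
```

When every partition uses the full interval in every iteration, this reduces to the original n * T.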
[Chart: average and maximum percent error of the No-Share and Share mechanisms across the six benchmarks] Both naïve mechanisms show errors of 20% on average, with max errors as high as 90% for global applications. Local-type applications only show error for the No-Share mechanism. SP-CNN converged to the correct solution for all benchmarks and tests.
[Chart: average execution-time speedup for CNN-P = 2, 4, 8 on 1024x1024 and 2048x2048 images across the six benchmarks] For the global applications, memory bandwidth contention limits the ability to achieve linear scaling. Shadow shows very poor scaling since many of its partitions converge quickly, so the CNN occupation time is low.
[Chart: average execution-time speedup from prefetching for 1, 2, 4, and 8 CNN-Ps across the six benchmarks] Prefetching only provides some benefit when the number of CNN-Ps is 1. As the number of CNN-Ps scales up, memory bandwidth contention becomes too large, and prefetching can actually cause slowdowns.
We also evaluated our SP-CNN mechanisms against CPU and GPU implementations of the benchmarks. For two benchmarks, Hole Filling and Rotation Detector, we could not easily develop a CPU/GPU algorithm, so we collected their timing results by emulating the CNN's operation on the corresponding platform.
CPU: Intel Core i5-3550, 77 W (TDP), 3.3 GHz, 22 nm
GPU: NVIDIA K1 mobile, 1.5 W, 900 MHz, 28 nm
SP-CNN: 0.73 mW per PE, 35.6 uW per node, 200 MHz, 45 nm
[Chart: execution time (ms) for CPU, GPU, SP-CNN (CNN-P = 1, 4), and Ideal-Mul (CNN-P = 1, 4) across the six benchmarks, with 30 FPS and 60 FPS boundaries; off-scale emulated results: CPU* 2923 ms / GPU* 1006 ms and CPU* 3004 ms / GPU* 1038 ms] For simple global applications like Connected Component and Shadow, the CPU/GPU versions perform better than the SP-CNN versions. However, emulation of a CNN is prohibitively slow, and for those applications SP-CNN provides much better performance.
[Chart: energy (mJ) for CPU, GPU, and SP-CNN (CNN-P = 1, 4) across the six benchmarks] For Connected Component and Shadow, execution time dominates over the low power consumption of SP-CNN, causing overall energy to be larger than in the GPU case. Hole Filling and Rotation Detector do show cases where SP-CNN can perform complex tasks at a low energy cost.
Optimization: Early-Finish Originally, each partition runs on the CNN for the entire interval time T. We saw cases where a partition converged on the CNN before the interval finished, so the rest of the execution essentially does nothing. We therefore introduced Early-Finish, where a partition runs on the CNN until it either converges or the interval time T is reached.
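The Early-Finish idea amounts to adding a convergence check inside a partition's interval loop. A minimal sketch, assuming the same tile-with-halo representation as before (`padded` carries a 1-pixel boundary, `step` advances one CNN time unit; names are ours, not from the SP-CNN implementation):

```python
import numpy as np

def run_partition_early_finish(padded, step, T):
    # Run a partition for up to T CNN time units, but stop as soon as the
    # state stops changing (Early-Finish) instead of burning the full interval.
    for t in range(T):
        new_interior = step(padded)
        if np.array_equal(new_interior, padded[1:-1, 1:-1]):
            return t                      # converged early; skip the rest of the interval
        padded[1:-1, 1:-1] = new_interior
    return T                              # used the whole interval
```

Returning the number of units actually computed also feeds naturally into the per-iteration convTime(p, i) term of the virtual convergence time formula.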