Hardware Acceleration of DNNs
1 Lecture 12: Hardware Acceleration of DNNs. Visual Computing Systems. Stanford CS348V, Winter 2018
2 Hardware acceleration for DNNs: Huawei Kirin NPU, Google TPU, Apple Neural Engine, Intel Lake Crest Deep Learning Accelerator, MIT Eyeriss, Volta GPU with Tensor Cores. Slide credit: Xuan Yang
3 And many more. Image credit: Shan Tang
4 Modern NVIDIA GPU (Volta)
5 Recall properties of GPUs: compute rich, packed densely with processing elements - Good for compute-bound applications. Good, because dense-matrix multiplication and DNN convolutional layers (when implemented properly) are compute bound. But also remember the cost of instruction stream processing and control in a programmable processor. Note: these figures are estimates for a CPU. [Figure credit: Eric Chung]
6 Volta GPU: single instruction to perform 2x4x4x4 + 4x4 ops. Each SM core has: 64 fp32 ALUs (mul-add), 32 fp64 ALUs, 8 tensor cores. Each tensor core executes a 4x4 matrix mul-add instruction D = A x B + C for 4x4 matrices A, B, C, with A and B stored as fp16 and accumulation in fp32. There are 80 SM cores in the GV100 GPU: 5,120 fp32 mul-add ALUs, 640 tensor cores, 6 MB of L2 cache, 1.5 GHz max clock = 15.7 TFLOPs fp32 = 125 TFLOPs (fp16/32 mixed) in tensor cores
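The tensor-core numerics above can be sketched in NumPy. This is a model of the arithmetic only (fp16 operands, fp32 accumulation), not of the hardware datapath, and `tensor_core_mma` is an invented name:

```python
import numpy as np

def tensor_core_mma(A, B, C):
    """Model of one tensor-core op: D = A @ B + C on 4x4 matrices,
    with A, B rounded to fp16 and all 64 multiply-adds accumulated
    in fp32 (numerics only, not the hardware datapath)."""
    assert A.shape == B.shape == C.shape == (4, 4)
    A16 = A.astype(np.float16)
    B16 = B.astype(np.float16)
    D = C.astype(np.float32).copy()
    for i in range(4):
        for j in range(4):
            for k in range(4):
                # each product of fp16 operands is accumulated at fp32
                D[i, j] += np.float32(A16[i, k]) * np.float32(B16[k, j])
    return D

A = np.arange(16, dtype=np.float32).reshape(4, 4) / 16  # values exactly representable in fp16
B = np.eye(4, dtype=np.float32)
C = np.ones((4, 4), dtype=np.float32)
D = tensor_core_mma(A, B, C)
```

With B the identity and C all ones, D is simply A + 1; the interesting part is that the rounding of A and B to fp16 happens before the multiply, while the running sum never leaves fp32.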
7 Google TPU (version 1)
8 Discussion: workloads. What did the TPU paper state about characteristics of modern DNN workloads at Google?
9 Google's TPU. Figure credit: Jouppi et al.
10 TPU area proportionality: compute is ~30% of the chip. Note the low area footprint of control. Key instructions: read host memory, write host memory, read weights, matrix_multiply / convolve, activate. Figure credit: Jouppi et al.
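The five key instructions suggest how a host might sequence one fully-connected layer. This sketch is entirely hypothetical: `FakeTPU` and its method names are invented stand-ins for the CISC instruction stream Jouppi et al. describe, used here only to show a plausible ordering:

```python
class FakeTPU:
    """Records the issued instruction stream; stands in for the device."""
    def __init__(self):
        self.trace = []
    def read_weights(self, w):      self.trace.append("read_weights")
    def read_host_memory(self, a):  self.trace.append("read_host_memory")
    def matrix_multiply(self):      self.trace.append("matrix_multiply")
    def activate(self):             self.trace.append("activate")
    def write_host_memory(self):
        self.trace.append("write_host_memory")
        return None

def run_fc_layer(tpu, weights, activations):
    """One fully-connected layer as a plausible instruction sequence."""
    tpu.read_weights(weights)          # stream weights toward the matrix unit
    tpu.read_host_memory(activations)  # DMA input activations on chip
    tpu.matrix_multiply()              # push activations through the systolic array
    tpu.activate()                     # apply the nonlinearity to the accumulators
    return tpu.write_host_memory()     # DMA results back to the host

tpu = FakeTPU()
run_fc_layer(tpu, weights=None, activations=None)
```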
11-16 Systolic array (matrix-vector multiplication example: y = Wx). Animation over six slides: a Weights FIFO loads a 4x4 grid of stationary weights w00..w33; activations x0..x3 enter skewed in time; partial sums (e.g. x0 * w00 + x1 * w01 + x2 * w02 + x3 * w03) build up as they pass through the array and land in 32-bit accumulators.
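The matrix-vector dataflow animated above can be simulated cycle by cycle. A minimal sketch, assuming a weight-stationary array in which PE[r][c] holds W[c][r] (so column c of the array computes y[c]), activations enter the left edge skewed in time, and partial sums march down each column into the accumulators:

```python
import numpy as np

def systolic_matvec(W, x):
    """Cycle-level simulation of a weight-stationary systolic array
    computing y = W x (a sketch of the dataflow, not the exact TPU
    microarchitecture)."""
    n = W.shape[0]
    pe_w = W.T.copy()            # stationary weights: PE[r][c] holds W[c][r]
    x_reg = np.zeros((n, n))     # activation latch in each PE
    psum = np.zeros((n, n))      # partial-sum latch in each PE
    acc = np.zeros(n)            # accumulators under each column
    for t in range(2 * n):
        acc += psum[n - 1]       # drain whatever reached the bottom row
        new_x = np.zeros_like(x_reg)
        new_psum = np.zeros_like(psum)
        for r in range(n):
            for c in range(n):
                # activation from the left neighbour (skewed injection at the edge)
                xin = (x[r] if t == r else 0.0) if c == 0 else x_reg[r][c - 1]
                # partial sum from the PE above (zero at the top edge)
                pin = psum[r - 1][c] if r > 0 else 0.0
                new_x[r][c] = xin
                new_psum[r][c] = pin + pe_w[r][c] * xin
        x_reg, psum = new_x, new_psum
    return acc + psum[n - 1]     # final drain

W = np.array([[ 1.,  2.,  3.,  4.],
              [ 5.,  6.,  7.,  8.],
              [ 9., 10., 11., 12.],
              [13., 14., 15., 16.]])
x = np.array([1., 2., 3., 4.])
y = systolic_matvec(W, x)
```

Output y[c] exits the bottom of column c at cycle (n-1)+c, which is why the inputs must be skewed: every partial sum meets exactly one new activation per hop.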
17 Systolic array (matrix-matrix multiplication example: Y = WX). Multiple columns of X (x00..x33) stream through the same stationary weights, skewed in time, each accumulating partial sums such as x00 * w10 + x01 * w11 + x02 * w12. Notice: need multiple 4x32-bit accumulators to hold output columns.
18-21 Building larger matrix-matrix multiplies (animation over four slides). Example: A = 8x8, B = 8x4096, C = A x B = 8x4096. Assume 4096 accumulators: the columns of B stream through the array while A stays resident.
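The strategy above generalizes to the usual tiled multiply: break the operands into array-sized tiles, run each tile pair through the matrix unit, and let the accumulators collect partial products across the reduction dimension. A minimal sketch (the tile size and names are illustrative, and `@` stands in for one pass through the systolic array):

```python
import numpy as np

def tiled_matmul(A, B, tile=8):
    """C = A @ B built from tile x tile matrix-multiply primitives,
    with a bank of accumulators holding every output element, as in
    the slide's example (A = 8x8, B = 8x4096, 4096 accumulator columns)."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))                       # the accumulators
    for kk in range(0, k, tile):               # reduction dimension
        for i in range(0, n, tile):
            for j in range(0, m, tile):
                # one 'array pass': tile-sized mul-add into the accumulators
                C[i:i+tile, j:j+tile] += A[i:i+tile, kk:kk+tile] @ B[kk:kk+tile, j:j+tile]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 4096))
C = tiled_matmul(A, B)
```

With A only 8x8 the reduction loop runs once; for a larger A the accumulators are what let successive kk-tiles add into the same outputs without spilling them off chip.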
22 TPU Perf/Watt. GM = geometric mean over all apps; WM = weighted mean over all apps; total = cost of host machine + TPU; incremental = only cost of TPU. Figure credit: Jouppi et al.
23 Alternative scheduling strategies. The TPU was weight stationary (weights kept in registers at each PE). (a) Weight Stationary, (b) Output Stationary, (c) No Local Reuse. Fig. 8: Dataflows for DNNs. Figure credit: Sze et al.
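What "stationary" means is easiest to see as a loop nest. A sketch for Y = WX over a batch of inputs (function names invented): weight stationary pins each weight in a PE register and reuses it across the whole batch before replacing it, while output stationary pins each partial sum; the no-local-reuse dataflow would keep neither in PE registers.

```python
import numpy as np

def matmul_weight_stationary(W, X):
    """Weight stationary: load each weight once, reuse it against
    every input column in the batch before moving on."""
    n, k = W.shape
    _, batch = X.shape
    Y = np.zeros((n, batch))
    for i in range(n):
        for j in range(k):
            w = W[i, j]                  # weight held in the PE register
            for b in range(batch):       # reused for every input in the batch
                Y[i, b] += w * X[j, b]
    return Y

def matmul_output_stationary(W, X):
    """Output stationary: each Y[i, b] accumulates in a local register
    while weights and activations stream past it."""
    n, k = W.shape
    _, batch = X.shape
    Y = np.zeros((n, batch))
    for i in range(n):
        for b in range(batch):
            acc = 0.0                    # partial sum stays put in a register
            for j in range(k):
                acc += W[i, j] * X[j, b]
            Y[i, b] = acc
    return Y

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 5))
X = rng.standard_normal((5, 3))
```

Both nests compute the same Y = WX; what changes is which operand stays in the innermost register, and therefore which memory traffic is amortized.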
24 EIE: targeting sparsified networks
25 Sparse, weight-sharing fully-connected layer. Dense fully-connected layer: b_i = ReLU(sum over j of W_ij * a_j). Sparse, weight-sharing representation: b_i = ReLU(sum over j in (X_i intersect Y) of S[I_ij] * a_j), where S[] = table of shared weight values, I_ij = index for weight W_ij, X_i = list of non-zero indices in row i, Y = list of non-zero indices in a. Note: activations are sparse due to ReLU. To exploit sparsity, each sparse weight matrix W is stored in a variant of compressed sparse column (CSC) format: for each column W_j, a vector v contains the non-zero weights, and an equal-length vector z encodes the number of zeros before the corresponding entry in v, each represented by a four-bit value. For example, a column beginning [0, 0, 1, 2, 0, ...] is encoded as v = [1, 2, 0, 3], z = [2, 0, ...]. The v and z of all columns are stored in one large pair of arrays, with a pointer vector p giving the start of each column's vectors; the final entry in p points one beyond the last element, so the number of non-zeros in column j is p_{j+1} - p_j.
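The (v, z) encoding can be sketched directly. This assumes the convention above, where z counts the zeros before each non-zero and is stored in 4 bits; the padding-zero trick for long runs is what allows v itself to contain zeros:

```python
def encode_column(col, zbits=4):
    """Encode one sparse column as (v, z): v holds the non-zero values,
    z the number of zeros before each one. A run of zeros longer than
    2**zbits - 1 is broken with an explicit padding zero in v."""
    zmax = (1 << zbits) - 1
    v, z, run = [], [], 0
    for entry in col:
        if entry == 0:
            run += 1
            if run > zmax:       # counter would overflow:
                v.append(0)      # emit a padding zero value
                z.append(zmax)
                run = 0
        else:
            v.append(entry)
            z.append(run)
            run = 0
    return v, z

def decode_column(v, z, length):
    """Inverse of encode_column, for checking round trips."""
    col, pos = [0] * length, -1
    for value, zeros in zip(v, z):
        pos += zeros + 1
        col[pos] = value
    return col

v, z = encode_column([0, 0, 1, 2, 0, 0, 3])  # -> v = [1, 2, 3], z = [2, 0, 2]
```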
26 Efficient Inference Engine (EIE) ASIC: custom hardware to decode and evaluate sparse, compressed DNNs. Hardware represents the weight matrix in compressed sparse column (CSC) format to exploit sparsity in activations:

for each nonzero a_j in a:
  for each nonzero M_ij in column M_j:
    b_i += M_ij * a_j

More detailed version:

int16* a_values;
PTR* M_j_start;   // start of column j
int4* M_j_values; // 4-bit cluster indices
int4* M_j_indices; // 4-bit relative row indices
int16* lookup;    // lookup table for cluster values

for j = 0 to length(a):
  if (a_values[j] == 0) continue; // scan to next nonzero activation
  col_values = &M_j_values[M_j_start[j]];
  col_indices = &M_j_indices[M_j_start[j]];
  col_nonzeros = M_j_start[j+1] - M_j_start[j];
  i = 0;
  for i_count = 0 to col_nonzeros:
    i += col_indices[i_count];  // relative index -> absolute row
    b[i] += lookup[col_values[i_count]] * a_values[j];

* Keep in mind there's a unique lookup table for each chunk of matrix values
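A runnable version of that inner loop, assuming the relative index counts the zeros preceding each entry within its column (the function and variable names are illustrative, and a single lookup table is used for simplicity even though the hardware has one per chunk of matrix values):

```python
import numpy as np

def eie_spmv(a, col_ptr, values, indices, lookup, n_rows):
    """Sketch of the EIE inner loop in software: CSC matrix with
    cluster indices in `values`, relative row indices in `indices`,
    shared weight values in `lookup`."""
    b = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0:
            continue                  # skip zero activations entirely
        start, end = col_ptr[j], col_ptr[j + 1]
        i = -1                        # absolute row index, rebuilt from
        for k in range(start, end):   # the relative indices
            i += indices[k] + 1       # indices[k] zeros precede this entry
            b[i] += lookup[values[k]] * aj
    return b

lookup  = [0.0, 1.5, -2.0, 0.5]  # shared weight values (hypothetical)
col_ptr = [0, 2, 2, 3]           # column j spans values[col_ptr[j]:col_ptr[j+1]]
values  = [2, 1, 3]              # cluster indices into `lookup`
indices = [1, 1, 0]              # zeros before each entry in its column
a = [2.0, 0.0, 4.0]
b = eie_spmv(a, col_ptr, values, indices, lookup, n_rows=4)
```

Column 1 is empty and a_1 is zero, so both the sparse storage and the activation skip do real work even in this toy case.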
27 Parallelization of sparse matrix-vector product. Stride rows of the matrix across processing elements (PE0-PE3); output activations are likewise strided across processing elements. Example in the figure: a 16x8 sparse matrix W times a sparse input a = [a0, 0, a2, 0, a4, a5, 0, a7], producing b0..b15, followed by ReLU (which zeroes several outputs). Weights stored local to PEs. Must broadcast non-zero a_j's to all PEs. Accumulation of each output b_i is local to its PE.
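The PE partitioning can be sketched as follows; this models only the scheduling (each PE owns every n_pes-th row, non-zero activations are broadcast, accumulation stays local), not the hardware:

```python
import numpy as np

def spmv_striped(W, a, n_pes=4):
    """Rows of W (and outputs b) strided across PEs; each non-zero a_j
    is broadcast to all PEs, and every accumulation into b_i happens
    on the PE that owns row i."""
    n_rows = W.shape[0]
    # PE p owns rows p, p + n_pes, p + 2*n_pes, ...
    local_b = [np.zeros(len(range(pe, n_rows, n_pes))) for pe in range(n_pes)]
    for j, aj in enumerate(a):
        if aj == 0:
            continue                              # only non-zeros are broadcast
        for pe in range(n_pes):                   # the broadcast
            for slot, i in enumerate(range(pe, n_rows, n_pes)):
                if W[i, j] != 0:                  # only stored weights do work
                    local_b[pe][slot] += W[i, j] * aj
    # gather the strided outputs back into b
    b = np.zeros(n_rows)
    for pe in range(n_pes):
        b[pe::n_pes] = local_b[pe]
    return b

rng = np.random.default_rng(1)
W = rng.integers(-2, 3, size=(16, 8)).astype(float)
a = np.array([1., 0., 2., 0., 3., 1., 0., 2.])
b = spmv_striped(W, a)
```

Striding (rather than blocking) the rows helps balance load when non-zeros cluster, though PEs can still idle when one column's non-zeros fall unevenly across owners.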
28 EIE unit for quantized sparse matrix-vector product. A tuple representing a non-zero activation (a_j, j) arrives and is enqueued in the Act Queue (Act Index / Act Value). Datapath blocks: Leading NZero Detect; Even/Odd Ptr SRAM Banks (Pointer Read: Col Start/End Addr); Sparse Matrix SRAM (Sparse Matrix Access: Encoded Weight, Relative Index); Weight Decoder; Address Accum (Relative Index to Absolute Address); Bypass; Arithmetic Unit; Dest/Src Act Regs; Act SRAM (Act R/W); ReLU.
29 EIE Efficiency.
Figure 6 (Han et al.): speedups of GPU, mobile GPU, and EIE compared with a CPU running the uncompressed DNN model; no batching in all cases. Figure 7: energy efficiency of the same configurations.
Comparison baselines, three off-the-shelf computing units: CPU = Intel Core i7-5930k (a Haswell-E class processor, used in the NVIDIA DIGITS Deep Learning Dev Box), GPU = GTX Titan X, mGPU = Tegra K1. EIE energy was estimated by dumping gate-level switching activity (SAIF) from the RTL netlist and running Prime-Time PX.
Table III benchmarks: Alex-6/7/8 (compressed AlexNet for large-scale image classification), VGG-6/7/8 (compressed VGG-16), NT-We, NT-Wd, NT-LSTM, with per-layer weight/activation/FLOP densities.
Sources of energy savings:
- Compression allows all weights to be stored in SRAM (few DRAM loads)
- Low-precision 16-bit fixed-point math (5x more efficient than 32-bit fixed math)
- Skip math on input activations that are zero (65% less math)
Warning: these are not end-to-end results, just fully connected layers!
30 Thoughts. The EIE paper highlights performance on fully connected layers (see graph above) - Final layers of networks like AlexNet, VGG - Common in recurrent network topologies like LSTMs. But many state-of-the-art image processing networks have moved to fully convolutional solutions - Recall Inception, SqueezeNet, etc.
31 Summary of hardware techniques. Specialized datapaths for dense linear algebra computations - Reduce overhead of control (compared to CPUs/GPUs). Reduced precision (computation and storage). Exploit sparsity. Accelerate decompression.
More informationMIXED PRECISION TRAINING ON VOLTA GPUS
MIXED PRECISION TRAINING ON VOLTA GPUS Boris Ginsburg, Paulius Micikevicius, Oleksii Kuchaiev, Sergei Nikolaev bginsburg, pauliusm, okuchaiev, snikolaev@nvidia.com 10/17/2017 ACKNOWLEDGMENTS Michael Houston,
More informationLinear Algebra and Eigenproblems
Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details
More informationLecture 23: Illusiveness of Parallel Performance. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 23: Illusiveness of Parallel Performance James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L23 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping peel
More informationLecture 2: Metrics to Evaluate Systems
Lecture 2: Metrics to Evaluate Systems Topics: Metrics: power, reliability, cost, benchmark suites, performance equation, summarizing performance with AM, GM, HM Sign up for the class mailing list! Video
More informationTYPES OF MODEL COMPRESSION. Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad
TYPES OF MODEL COMPRESSION Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad 1. Pruning 2. Quantization 3. Architectural Modifications PRUNING WHY PRUNING? Deep Neural Networks have redundant parameters.
More informationLogic. Basic Logic Functions. Switches in series (AND) Truth Tables. Switches in Parallel (OR) Alternative view for OR
TOPIS: Logic Logic Expressions Logic Gates Simplifying Logic Expressions Sequential Logic (Logic with a Memory) George oole (85-864), English mathematician, oolean logic used in digital computers since
More information<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)
Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation
More informationFaster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)
Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) 2 How many bits do you need to represent a single number in machine
More informationDigital Integrated Circuits A Design Perspective
Semiconductor Memories Adapted from Chapter 12 of Digital Integrated Circuits A Design Perspective Jan M. Rabaey et al. Copyright 2003 Prentice Hall/Pearson Outline Memory Classification Memory Architectures
More informationLecture 4: Linear Algebra 1
Lecture 4: Linear Algebra 1 Sourendu Gupta TIFR Graduate School Computational Physics 1 February 12, 2010 c : Sourendu Gupta (TIFR) Lecture 4: Linear Algebra 1 CP 1 1 / 26 Outline 1 Linear problems Motivation
More informationSlide credit from Hung-Yi Lee & Richard Socher
Slide credit from Hung-Yi Lee & Richard Socher 1 Review Recurrent Neural Network 2 Recurrent Neural Network Idea: condition the neural network on all previous words and tie the weights at each time step
More informationDistributed Machine Learning: A Brief Overview. Dan Alistarh IST Austria
Distributed Machine Learning: A Brief Overview Dan Alistarh IST Austria Background The Machine Learning Cambrian Explosion Key Factors: 1. Large s: Millions of labelled images, thousands of hours of speech
More informationEE141-Fall 2011 Digital Integrated Circuits
EE4-Fall 20 Digital Integrated Circuits Lecture 5 Memory decoders Administrative Stuff Homework #6 due today Project posted Phase due next Friday Project done in pairs 2 Last Lecture Last lecture Logical
More informationToday. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of
ESE532: System-on-a-Chip Architecture Day 20: November 8, 2017 Energy Today Energy Today s bottleneck What drives Efficiency of Processors, FPGAs, accelerators How does parallelism impact energy? 1 2 Message
More informationSystem Data Bus (8-bit) Data Buffer. Internal Data Bus (8-bit) 8-bit register (R) 3-bit address 16-bit register pair (P) 2-bit address
Intel 8080 CPU block diagram 8 System Data Bus (8-bit) Data Buffer Registry Array B 8 C Internal Data Bus (8-bit) F D E H L ALU SP A PC Address Buffer 16 System Address Bus (16-bit) Internal register addressing:
More informationMachine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016
Machine Learning for Signal Processing Neural Networks Continue Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 1 So what are neural networks?? Voice signal N.Net Transcription Image N.Net Text
More informationLogic BIST. Sungho Kang Yonsei University
Logic BIST Sungho Kang Yonsei University Outline Introduction Basics Issues Weighted Random Pattern Generation BIST Architectures Deterministic BIST Conclusion 2 Built In Self Test Test/ Normal Input Pattern
More informationLecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets)
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets) Sanjeev Arora Elad Hazan Recap: Structure of a deep
More informationEECS 579: Logic and Fault Simulation. Simulation
EECS 579: Logic and Fault Simulation Simulation: Use of computer software models to verify correctness Fault Simulation: Use of simulation for fault analysis and ATPG Circuit description Input data for
More information