Efficient Methods and Hardware for Deep Learning


1 1 Efficient Methods and Hardware for Deep Learning Song Han Stanford University

2 2 Deep Learning is Changing Our Lives: Self-Driving, Machine Translation, AlphaGo, Smart Robots

3 3 Models are Getting Larger. Image recognition (16x more compute): 2012 AlexNet, 8 layers, 1.4 GFLOP, ~16% error => 2015 ResNet, 152 layers, 22.6 GFLOP, ~3.5% error. Speech recognition (10x more training ops): 2014 Deep Speech, 80 GFLOP, 7,000 hrs of data, ~8% error => Deep Speech 2, 465 GFLOP, 12,000 hrs of data, ~5% error. Dally, NIPS 2016 workshop on Efficient Methods for Deep Neural Networks

4 4 The First Challenge: Model Size. Hard to distribute large models through over-the-air updates.

5 The Second Challenge: Speed. Error rate / training time: ResNet18: 10.76%, 2.5 days; ResNet50: 7.02%, 5 days; ResNet101: 6.21%, 1 week; ResNet152: 6.16%, 1.5 weeks. Such long training times limit ML researchers' productivity. Training time benchmarked with fb.resnet.torch using four M40 GPUs. 5

6 6 The Third Challenge: Energy Efficiency. AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game. On mobile: drains battery; in the data center: increases TCO.

7 8 The Problem of Large DNN. Hardware engineers suffer from the large model size: larger model => more memory references => more energy. Energy per 32-bit operation at 45nm [pJ]:
32-bit int ADD: 0.1
32-bit float ADD: 0.9
32-bit register file: 1
32-bit int MULT: 3.1
32-bit float MULT: 3.7
32-bit SRAM cache: 5
32-bit DRAM memory: 640
The relative energy cost axis spans 1 to 1000 (log scale): a 32-bit DRAM access costs about 6400x a 32-bit int ADD.
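To make that scale concrete, here is a back-of-envelope sketch in Python (a hypothetical calculation built from the table above, not a measurement from the talk): for a large fully-connected layer, fetching every weight from DRAM dwarfs the arithmetic itself.

# Energy per 32-bit operation at 45nm, in pJ (values from the table above).
PJ_MAC  = 3.7 + 0.9    # one float MULT + one float ADD
PJ_SRAM = 5.0          # one 32-bit SRAM access
PJ_DRAM = 640.0        # one 32-bit DRAM access

def fc_layer_energy_uj(n_weights, weights_on_chip):
    """Energy (uJ) for one M x V pass: one weight fetch + one MAC per weight."""
    mem = PJ_SRAM if weights_on_chip else PJ_DRAM
    return n_weights * (PJ_MAC + mem) * 1e-6

n = 25088 * 4096                      # a VGG-16 fc6-sized layer
print(fc_layer_energy_uj(n, False))   # weights in DRAM: ~66,000 uJ
print(fc_layer_energy_uj(n, True))    # weights in SRAM: ~990 uJ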

8 9 The Problem of Large DNN: given that memory access dominates energy, how to make deep learning more efficient?

9 10 Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design

10 11 Application as a Black Box: Algorithm (a fixed spec, e.g. SPEC 2006) => Hardware

11 12 Open the Box before Hardware Design Algorithm? Hardware Breaks the boundary between algorithm and hardware

12 14 What's in the Box: Deep Learning 101. Weights/activations; training dataset, test data; Training => Inference; Model: CNN, RNN, LSTM; training hardware, inference hardware.

13 Proposed Paradigm. Conventional: Training => Inference (slow, power-hungry). Proposed: Training (Han et al. ICLR'17) => Compression: Pruning, Quantization (Han et al. NIPS'15, ICLR'16 Best Paper Award) => Accelerated Inference (Han et al. ISCA'16, FPGA'17 Best Paper Award): fast, power-efficient. 15


15 17 The Goal & Trade-off Small Fast Accurate Energy Efficient

16 Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Compression Pruning Quantization Pruning Quantization Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Accelerated Inference Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] Training 18

17 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] 19

18 Learning both Weights and Connections for Efficient Neural Networks, Han et al. NIPS 2015

19 Pruning Neural Networks [LeCun et al. NIPS 89] [Han et al. NIPS 15] Pruning Trained Quantization Huffman Coding 22

20 [Han et al. NIPS 15] Pruning Neural Networks: analogous to dropping the negligible term in -0.01x^2 + x + 1. 60 million => 6M connections: 10x fewer. Pruning Trained Quantization Huffman Coding 23

21 Pruning Neural Networks [Han et al. NIPS 15] - chart: accuracy loss (+0.5% to -4.5%) vs. parameters pruned away (40% to 100%). Pruning Trained Quantization Huffman Coding 24

22 Pruning Neural Networks [Han et al. NIPS 15] - chart: with pruning alone, accuracy loss grows steep at high pruning ratios. Pruning Trained Quantization Huffman Coding 25

23 Retrain to Recover Accuracy [Han et al. NIPS 15] - chart: pruning + retraining holds accuracy loss near zero to much higher pruning ratios than pruning alone. Pruning Trained Quantization Huffman Coding 26

24 [Han et al. NIPS 15] Iteratively Retrain to Recover Accuracy - chart: iterative pruning and retraining pushes the no-accuracy-loss point to about 90% of parameters pruned away. Pruning Trained Quantization Huffman Coding 27
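A minimal numpy sketch of this prune-and-retrain loop (the gradient here is a random placeholder for real backpropagation): the sparsity ratio is raised step by step, and a mask keeps pruned connections at zero while the surviving weights keep training.

import numpy as np

def magnitude_mask(w, sparsity):
    """Mask that keeps the largest-magnitude (1 - sparsity) fraction of w."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return np.ones_like(w, dtype=bool)
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.abs(w) > thresh

w = np.random.randn(256, 256).astype(np.float32)
for sparsity in (0.5, 0.7, 0.9):          # iterative pruning schedule
    mask = magnitude_mask(w, sparsity)
    w *= mask                              # prune
    for _ in range(100):                   # stand-in for retraining epochs
        grad = np.random.randn(*w.shape).astype(np.float32)
        w -= 1e-3 * grad * mask            # masked update: pruned weights stay zero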

25 Pruning RNN and LSTM [Han et al. NIPS 15] *Karpathy et al., "Deep Visual-Semantic Alignments for Generating Image Descriptions" Pruning Trained Quantization Huffman Coding 28

26 Pruning RNN and LSTM [Han et al. NIPS 15] Original: a basketball player in a white uniform is playing with a ball. Pruned 90%: a basketball player in a white uniform is playing with a basketball. Original: a brown dog is running through a grassy field. Pruned 90%: a brown dog is running through a grassy area. Original: a man is riding a surfboard on a wave. Pruned 90%: a man in a wetsuit is riding a wave on a beach. Original: a soccer player in red is running in the field. Pruned 95%: a man in a red shirt and black and white black shirt is running through a field. Pruning Trained Quantization Huffman Coding 29

27 [Han et al. NIPS 15] Pruning Changes Weight Distribution: before pruning, after pruning, after retraining. Conv5 layer of AlexNet, representative of other network layers as well. Pruning Trained Quantization Huffman Coding 30

28 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Pruning Quantization Pruning Quantization Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] 31

29 Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Han et al. ICLR 2016 Best Paper Pruning Trained Quantization Huffman Coding 32

30 [Han et al. ICLR 16] Trained Quantization: 32-bit weights => 4-bit cluster indices, 8x less memory footprint; nearby weights (e.g. 2.09, 2.12, 1.92) share one centroid (~2.0). Pruning Trained Quantization Huffman Coding 33
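A sketch of the clustering step, assuming plain k-means with linear initialization (the paper also studies density-based and random init); after clustering, only the 4-bit index per weight and the small codebook are stored, and during retraining the gradients of weights sharing a centroid are summed to fine-tune that centroid (not shown here).

import numpy as np

def kmeans_quantize(w, bits=4, iters=20):
    """Cluster weights into 2**bits shared values (the 'codebook')."""
    n = 2 ** bits
    centroids = np.linspace(w.min(), w.max(), n)        # linear initialization
    flat = w.ravel()
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(n):
            members = flat[idx == c]
            if members.size:                            # skip empty clusters
                centroids[c] = members.mean()
    return idx.reshape(w.shape).astype(np.uint8), centroids

w = np.random.randn(64, 64).astype(np.float32)
idx, codebook = kmeans_quantize(w)                      # 4-bit index per weight
w_hat = codebook[idx]                                   # decoded (shared) weights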


38 Before Trained Quantization: Continuous Weight [Han et al. ICLR 16] - histogram: count vs. weight value, a continuous distribution. Pruning Trained Quantization Huffman Coding 41

39 After Trained Quantization: Discrete Weight [Han et al. ICLR 16] - histogram: weights collapse onto a small set of discrete centroid values. Pruning Trained Quantization Huffman Coding 42

40 After Trained Quantization: Discrete Weight after Training [Han et al. ICLR 16] - histogram: the centroids shift as they are fine-tuned during retraining. Pruning Trained Quantization Huffman Coding 43

41 [Han et al. ICLR 16] Bits Per Weight - chart: accuracy vs. number of bits per weight after trained quantization. Pruning Trained Quantization Huffman Coding 44

42 [Han et al. ICLR 16] Pruning + Trained Quantization - chart: accuracy vs. compression ratio for pruning alone, quantization alone, and both combined, AlexNet on ImageNet; the combination compresses furthest without accuracy loss. Pruning Trained Quantization Huffman Coding 45

43 [Han et al. ICLR 16] Huffman Coding. Infrequent weights: use more bits to represent; frequent weights: use fewer bits to represent. Pruning Trained Quantization Huffman Coding 46
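A small sketch of this step using Python's heapq (standard Huffman tree construction, not the paper's exact implementation), applied to the stream of quantized weight indices, whose distribution is highly non-uniform after pruning and quantization:

import heapq
from collections import Counter

def huffman_code(symbols):
    """Map each symbol to a bit string; frequent symbols get fewer bits."""
    freq = Counter(symbols)
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)    # two least frequent subtrees
        n1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (n0 + n1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

code = huffman_code([0, 0, 0, 1, 1, 2])    # most frequent index: shortest code
bits = "".join(code[s] for s in [0, 0, 0, 1, 1, 2])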

44 [Han et al. ICLR 16] Summary of Deep Compression Pruning Trained Quantization Huffman Coding 47

45 [Han et al. ICLR 16] Results: Compression Ratio. Network: original size => compressed size (ratio), original => compressed accuracy:
LeNet-300-100: 1070KB => 27KB (40x), 98.36% => 98.42%
LeNet-5: 1720KB => 44KB (39x), 99.20% => 99.26%
AlexNet: 240MB => 6.9MB (35x), 80.27% => 80.30%
VGGNet: 550MB => 11.3MB (49x), 88.68% => 89.09%
GoogleNet: 28MB => 2.8MB (10x), 88.90% => 88.92%
ResNet: 44MB => 4.0MB (11x), 89.24% => 89.28%
Fits in cache! Can we make compact models to begin with? 48

46 SqueezeNet: squeeze (1x1 convolution filters, ReLU) => expand (1x1 and 3x3 convolution filters, ReLU). Iandola et al., SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv 2016

47 Compressing SqueezeNet. Network / approach / size / ratio / top-1 / top-5 accuracy:
AlexNet, -, 240MB, 1x, 57.2%, 80.3%
AlexNet, SVD, 48MB, 5x, 56.0%, 79.4%
AlexNet, Deep Compression, 6.9MB, 35x, 57.2%, 80.3%
SqueezeNet, -, 4.8MB, 50x, 57.5%, 80.3%
SqueezeNet, Deep Compression, 0.47MB, 510x, 57.5%, 80.3%
50

48 Results: Speedup - chart: layer-wise speedup of the pruned model over the dense baseline on CPU, GPU, and mobile GPU, with average bars (one case as low as 0.6x). 51

49 Results: Energy Efficiency - chart: layer-wise energy efficiency of the pruned model over the dense baseline on CPU, GPU, and mobile GPU, with average bars. 52

50 Industrial Impact of Deep Compression. "At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file. As a mobile-first company, we frequently update various apps via different app stores. We're very sensitive to the size of our binary files, and a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB." Andrew Ng 53

51 Challenges:
- Online decompression while computing: requires special-purpose logic.
- Computation becomes irregular: sparse weights, sparse activations, indirect lookup.
- Parallelization becomes challenging: synchronization overhead, load imbalance, scalability.
54

52 Having Opened the Box, HW Design? Algorithm Hardware? Breaks the boundary between algorithm and hardware 55

53 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] 56

54 EIE: Efficient Inference Engine on Compressed Deep Neural Network, Han et al. ISCA 2016

55 57 Memory access dominates energy (640 pJ for a 32-bit DRAM access vs. single-digit pJ for ALU operations, per the table above): how to reduce the memory footprint?

56 Related Work: Eyeriss [1] (MIT, dataflow); TPU [2] (Google, 8-bit integer); DaDianNao [3] (CAS, eDRAM); EIE [this work] (Stanford, compression). [1] Yu-Hsin Chen, et al., "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," ISSCC 2016. [2] Norm Jouppi, "Google supercharges machine learning tasks with TPU custom chip," 2016. [3] Yunji Chen, et al., "DaDianNao: A machine-learning supercomputer," MICRO 2014. [4] Song Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," ISCA 2016. 59

57 [Han et al. ISCA 16] EIE: Efficient Inference Engine. Exploits: 0 * A = 0 (sparse weight: 90% static sparsity => 10x less computation, 5x less memory footprint); W * 0 = 0 (sparse activation: 70% dynamic sparsity => 3x less computation); 2.09, 1.92 => 2 (weight sharing: 4-bit weights => 8x less memory footprint). 60

58 EIE: Parallelization on Sparsity [Han et al. ISCA 16] - example: sparse activation vector a = (0, a1, 0, a3) times an 8x4 sparse weight matrix W gives b = ReLU(W a); only nonzero entries participate. 61

59 EIE: Parallelization on Sparsity [Han et al. ISCA 16] - an array of 16 processing elements (PEs) with central control; in the example, the rows of W (and the corresponding entries of b) are interleaved across PE0-PE3. 62

60 EIE: Parallelization on Sparsity [Han et al. ISCA 16] - logically, each PE sees its slice of the sparse matrix; physically, it stores only the virtual weights (e.g. W0,0 W0,1 W4,2 W0,3 W4,3), a relative index per entry (zeros skipped since the previous nonzero), and a column pointer marking where each column starts. 63
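A Python sketch of this storage scheme, assuming 4-bit relative indices as in EIE: each stored entry carries its value plus the run of zeros since the previous nonzero in its column; a run longer than 15 is broken up by inserting filler zero entries, and a column pointer array marks where each column's entries begin.

import numpy as np

def encode_csc_relative(W, index_bits=4):
    """Compress W to (values, relative indices, column pointers)."""
    max_run = 2 ** index_bits - 1
    vals, rel, col_ptr = [], [], [0]
    for j in range(W.shape[1]):
        last = -1
        for i in np.flatnonzero(W[:, j]):
            run = i - last - 1
            while run > max_run:           # run too long for 4 bits:
                vals.append(0.0)           # insert a filler zero entry
                rel.append(max_run)
                run -= max_run + 1
            vals.append(float(W[i, j]))
            rel.append(run)
            last = i
        col_ptr.append(len(vals))
    return np.array(vals), np.array(rel), np.array(col_ptr)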

61 [Han et al. ISCA 16] Dataflow (animated over slides 64-73): the nonzero activations a_j are fetched one at a time and broadcast; each PE walks the stored nonzeros of column j in its slice, multiplies them by a_j, and accumulates into the output b, which then passes through ReLU. 64-73
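The same dataflow as a software sketch (eie_spmv is a name chosen here, not from the paper; it consumes the (vals, rel, col_ptr) encoding from the sketch above): only nonzero activations are visited, and within each selected column only stored nonzero weights are touched.

import numpy as np

def eie_spmv(vals, rel, col_ptr, a, n_rows):
    """b = ReLU(W a) over the relative-indexed sparse encoding of W."""
    b = np.zeros(n_rows)
    for j in np.flatnonzero(a):            # dynamic (activation) sparsity
        row = -1
        for k in range(col_ptr[j], col_ptr[j + 1]):
            row += rel[k] + 1              # relative => absolute row index
            b[row] += vals[k] * a[j]       # static (weight) sparsity
    return np.maximum(b, 0.0)              # ReLU feeds the next layer

# usage, given the encoder above: b = eie_spmv(*encode_csc_relative(W), a, W.shape[0])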

71 [Han et al. ISCA 16] EIE Architecture: the compressed DNN model (encoded weights, relative indices, sparse format) is decoded on the fly. A weight look-up turns each 4-bit virtual weight into a 16-bit real weight, and an address accumulator turns each 4-bit relative index into an absolute index; the ALU and memory then produce the prediction result from the input image. 74

72 Micro Architecture for each PE [Han et al. ISCA 16] - block diagram: activation queue (act value/index), leading non-zero detect, pointer read (even/odd pointer SRAM banks, column start/end address), sparse matrix access (sparse matrix SRAM), weight decoder (encoded weight, relative index), address accumulator (absolute address), arithmetic unit (bypass), source/destination activation registers, act R/W (act SRAM), ReLU. Legend: SRAM / Regs / Comb. 75

73-81 [Han et al. ISCA 16] PE pipeline walkthrough (one slide per stage): Load Balance (the activation queue decouples PEs); Activation Sparsity (leading non-zero detection feeds only nonzero activations); Weight Sparsity (pointer read from even/odd pointer SRAM banks, then sparse matrix SRAM access); Weight Sharing (the weight decoder expands encoded weights); Address Accumulate (relative => absolute index); Arithmetic (multiply-accumulate with bypass); Write Back (source/destination activation registers, activation SRAM); ReLU and non-zero detection. What's special: the entire pipeline operates directly on the compressed model. 76-84

82 [Han et al. ISCA 16] Post-Layout Results of EIE:
Technology: 45 nm
# PEs: 64
On-chip SRAM: 8 MB
Max model size: 84 million parameters
Static sparsity: 10x
Dynamic sparsity: 3x
Quantization: 4-bit
ALU width: 16-bit
Area: 40.8 mm^2
MxV throughput: 81,967 layers/s
Power: 586 mW
1. Post-layout result. 2. Throughput measured on AlexNet FC-7. 85

83 [Han et al. ISCA 16] Benchmark. CPU: Intel Core-i7 5930k; GPU: NVIDIA TitanX; mobile GPU: NVIDIA Jetson TK1.
Layer / size / weight density / activation density / FLOP reduction:
Alex-6: 9216x4096, 9%, 35%, 33x
Alex-7: 4096x4096, 9%, 35%, 33x
Alex-8: 4096x1000, 25%, 38%, 10x
VGG-6: 25088x4096, 4%, 18%, 100x
VGG-7: 4096x4096, 4%, 37%, 50x
VGG-8: 4096x1000, 23%, 41%, 10x
NT-We: 4096x600, 10%, 100%, 10x
NT-Wd: 600x8791, 11%, 100%, 10x
NT-LSTM: 1201x2400, 10%, 100%, 10x
AlexNet for image classification; VGG-16 for image classification and object detection; RNN and LSTM (NeuralTalk) for image captioning. 86

84 [Han et al. ISCA 16] Speedup on EIE - chart (Figure 6 of the paper): speedups of CPU, GPU, and mobile GPU, each running dense and compressed models, and of EIE, all normalized to the CPU running the uncompressed model, per benchmark layer (Alex-6/7/8, VGG-6/7/8, NT-We/Wd/LSTM), with no batching; EIE's geometric-mean speedup over the dense CPU baseline is 189x. Methodology: EIE is implemented in Verilog, verified against a custom cycle-accurate C++ simulator, synthesized with Synopsys Design Compiler and placed-and-routed with IC Compiler under TSMC 45nm (critical path 1.15 ns, Table II); SRAM modeled with CACTI; power estimated with Prime-Time PX from RTL-annotated switching activity. Baselines: Intel Core i7-5930k (MKL CBLAS GEMV for dense, MKL SPBLAS CSRMV for sparse, socket power via pcm-power), NVIDIA GeForce GTX TitanX (cuBLAS GEMV for dense, CSR format for sparse, power via nvidia-smi), NVIDIA Tegra K1; models from the Caffe and NeuralTalk model zoos, compressed as in Deep Compression. 87

85 [Han et al. ISCA 16] Energy Efficiency on EIE - chart (Figure 7 of the paper): energy efficiency of GPU, mobile GPU, and EIE relative to the CPU running the uncompressed model, per benchmark layer, with no batching; EIE is four to five orders of magnitude more energy efficient than the dense CPU baseline (about 24,000x geometric mean). 88

86 [Han et al. ISCA 16] Comparison: Throughput - chart (layers/s, log scale): Core-i7 5930k (22nm CPU), TitanX (28nm GPU), Tegra K1 (28nm mGPU), A-Eye (28nm FPGA), DaDianNao (28nm ASIC), TrueNorth (28nm ASIC), EIE (45nm ASIC, 64 PEs), EIE (28nm ASIC, 256 PEs); EIE leads. 89

87 Comparison: Energy Efficiency [Han et al. ISCA 16] - chart (layers/J, log scale), same platforms as the throughput comparison: CPU, GPU, mGPU, FPGA, and the DaDianNao and TrueNorth ASICs; EIE leads by orders of magnitude. 90

88 [Han et al. ISCA 16] Scalability - chart (Figure 11): speedup vs. number of PEs (1 to 256) on each benchmark layer; the speedup is near-linear. #PEs vs. speedup: 64 PEs: 64x; 128 PEs: 124x; 256 PEs: 210x. 91

89 [Han et al. ISCA 16] Load Balancing - chart (Figure 8): load efficiency vs. FIFO depth (1 to 256) per benchmark layer; load efficiency improves as FIFO size increases, and beyond FIFO depth 8 the marginal gain quickly diminishes, so depth 8 is chosen. Imbalanced non-zeros among PEs degrade system utilization; this load imbalance is absorbed by the activation FIFO. With FIFO depth = 8, ALU utilization is > 80%. 92

90 Can we do better with load imbalance? Feedforward => Recurrent neural network? 93

91 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Compression Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] 94

92 ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA Han et al. FPGA 2017 Best Paper 95

93 Accelerating Recurrent Neural Networks: speech recognition, image captioning, machine translation, visual question answering. The recurrent nature of RNNs/LSTMs produces complicated data dependencies, which are more challenging than feedforward neural nets. 96

94 Rethinking Model Compression [Han et al. FPGA 17]: Compression (Pruning, Quantization) => load-balance-aware pruning => Accelerated Inference 97

95 Pruning Leads to Load Imbalance [Han et al. FPGA 17] - example sparse matrix with rows interleaved over four PEs: the PEs hold 5, 2, 4, and 1 nonzeros, so they need 5, 2, 4, and 1 cycles; the overall time is set by the slowest PE: 5 cycles. 100

96-98 Load-Balance-Aware Pruning [Han et al. FPGA 17] - prune under the constraint that every PE keeps the same number of nonzeros: in the example, the unbalanced matrix needs 5 cycles (PEs with 5/2/4/1 nonzeros), while the balanced matrix gives every PE 3 nonzeros, so all PEs finish in 3 cycles. A sketch follows. 101-103
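A numpy sketch of this constraint, assuming rows are interleaved across PEs as in the example above: rather than one global magnitude threshold, each PE's submatrix is pruned to the same sparsity, so every PE ends up with the same nonzero count.

import numpy as np

def load_balanced_prune(W, sparsity, n_pe=4):
    """Prune each PE's (interleaved) rows to the same sparsity."""
    Wp = W.copy()
    for pe in range(n_pe):
        sub = Wp[pe::n_pe]                 # view of this PE's rows
        k = int(sparsity * sub.size)
        if k:
            thr = np.partition(np.abs(sub).ravel(), k - 1)[k - 1]
            sub[np.abs(sub) <= thr] = 0.0  # in-place: modifies Wp
    return Wp

Wp = load_balanced_prune(np.random.randn(8, 4), sparsity=0.625)
print([np.count_nonzero(Wp[pe::4]) for pe in range(4)])  # [3, 3, 3, 3]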

99 Load-Balance-Aware Pruning: Same Accuracy [Han et al. FPGA 17] - chart: phone error rate (19% to 28%) vs. parameters pruned away (0% to 100%), with and without load balance; the two curves track each other, with a sweet spot at a high pruning ratio where accuracy is preserved. 106

100 Load-Balance-Aware Pruning: Better Speedup [Han et al. FPGA 17] - chart: speedup vs. parameters pruned away (0% to 90%); with load balance the pruned model reaches 6.2x speedup over dense, versus 5.5x without. 109

101 [Han et al. FPGA 17] Hardware Architecture: a software program on the CPU (with external memory) talks to the FPGA over a PCIE controller; on the FPGA, a memory controller and data bus feed input/output buffers and the ESE accelerator, organized as channels of multiple PEs under an ESE controller. Each PE contains an activation queue (FIFOs), pointer reads with pointer buffers, sparse-matrix reads (SpmatRead) with weight buffers, SpMV units with accumulators, and activation buffers; shared element-wise units (ElemMul, adder tree, sigmoid/tanh) combine Wx_t, Wy_t-1, and the cell state c_t, assembling the LSTM output y_t. 110

102 Speedup and Energy Efficiency. Platform / latency / power / speedup / energy efficiency:
GPU: 240 us, 202 W, 1x, 1x
CPU: 6017 us, 111 W, 0.04x, 0.07x
ESE: 82.7 us, 41 W, 3x, 14x
Table 6: ESE resource utilization on the Xilinx KU060 FPGA.
Available: LUT 331,680; LUTRAM 146,880; FF 663,360; BRAM 1,080; DSP 2,760.
Used: LUT 293,920 (88.6%); LUTRAM 47.6%; FF 68.3%; BRAM 87.7%; DSP 1,504 (54.5%).
111

103 From Compression to Acceleration. Challenge 1: memory access is expensive => Deep Compression: 10x-49x smaller with no loss of accuracy. Challenge 2: sparsity, indirection, load balance => EIE / ESE accelerators: energy-efficient accelerated inference. 112

104 What about Training? At the compressed model size: same accuracy. At the original model size: can we get higher accuracy? 113

105 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] Training 114

106 DSD: Dense-Sparse-Dense Training for Deep Neural Networks Han et al. ICLR 2017

107 DSD: Dense-Sparse-Dense Training [Han et al. ICLR 2017]. Dense => Sparse (pruning under a sparsity constraint) => Re-Dense (increase model capacity). DSD produces the same model architecture but finds a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy across a wide range of deep neural networks: CNNs, RNNs, and LSTMs. 116
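A minimal sketch of the three phases, with a placeholder gradient function standing in for real training (dsd_train and grad_fn are names chosen here, not from the paper): the sparse phase trains under a magnitude-pruning mask, and the re-dense phase lifts the mask, with the recovered weights restarting from zero.

import numpy as np

def dsd_train(w, grad_fn, steps=1000, sparsity=0.5, lr=1e-3):
    """Dense => Sparse => Re-Dense training schedule."""
    mask = np.ones_like(w, dtype=bool)
    for phase in ("dense", "sparse", "re-dense"):
        if phase == "sparse":              # impose the sparsity constraint
            k = int(sparsity * w.size)
            thr = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
            mask = np.abs(w) > thr
            w = w * mask
        elif phase == "re-dense":          # restore full model capacity
            mask = np.ones_like(w, dtype=bool)
        for _ in range(steps):
            w = w - lr * grad_fn(w) * mask
    return w

w = dsd_train(np.random.randn(128, 128),
              grad_fn=lambda w: 0.01 * w + 0.01 * np.random.randn(*w.shape))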

108 [Han et al. ICLR 2017] DSD: Intuition Learn the trunk first Then learn the leaves 117

109 [Han et al. ICLR 2017] DSD is General Purpose: Vision, Speech, Natural Language. Network / domain / dataset / type: baseline => DSD (abs. imp., rel. imp.):
GoogleNet, Vision, ImageNet, CNN: 31.1% => 30.0% (1.1%, 3.6%)
VGG-16, Vision, ImageNet, CNN: 31.5% => 27.2% (4.3%, 13.7%)
ResNet-18, Vision, ImageNet, CNN: 30.4% => 29.3% (1.1%, 3.7%)
ResNet-50, Vision, ImageNet, CNN: 24.0% => 23.2% (0.9%, 3.5%)
NeuralTalk, Caption, Flickr-8K, LSTM
DeepSpeech, Speech, WSJ 93, RNN: 33.6% => 31.6% (2.0%, 5.8%)
DeepSpeech-2, Speech, WSJ 93, RNN: 14.5% => 13.4% (1.1%, 7.4%)
Open-sourced DSD Model Zoo. The baseline results of AlexNet, VGG-16, GoogleNet, SqueezeNet are from the Caffe Model Zoo; ResNet-18, ResNet-50 are from fb.resnet.torch. 120

110 Related Work. Dropout [1] and DropConnect [2]: Dropout uses a random sparsity pattern, whereas DSD training learns with a deterministic, data-driven sparsity pattern. Distillation [3]: transfers knowledge from a large model to a small model; both DSD and distillation incur no architectural changes. [1] Srivastava, Nitish, et al., "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research 15.1 (2014). [2] Wan, Li, et al., "Regularization of neural networks using DropConnect," ICML 2013. [3] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network," arXiv preprint (2015). 121

111 Small Fast Summary Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Compression Pruning Quantization Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Accelerated Inference Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] Training 122

112 Summary Algorithm Inference sparsity Training Hardware 126

113 Summary Algorithm Smaller Size: Deep Compression Higher Accuracy: DSD Regularization Inference sparsity Training Better Speed, Energy Efficiency: EIE / ESE Accelerator Hardware 129

114 Summary: Algorithm / Hardware, Inference / Training, sparsity at the center. Smaller size: Deep Compression. Higher accuracy: DSD Regularization. Better speed and energy efficiency: EIE / ESE accelerators. Future work: phones, drones, robots, self-driving cars, AI in the cloud. 130

115 131 Future Smart Low Latency Privacy Mobility Energy-Efficient

116 132 Outlook: the Path for Computation. PC Computing => Mobile Computing => Brain-Inspired Cognitive Computing; "Mobile-First to AI-First" (Sundar Pichai, Google I/O, 2016)

117 Thank you! stanford.edu/~songhan. Smart: DSD Training, Dense => Sparse => Re-Dense (Han et al. ICLR'17). Small: Compression via Pruning and Quantization (Han et al. NIPS'15, ICLR'16 Best Paper Award). Fast, Efficient: Accelerated Inference with EIE / ESE (Han et al. ISCA'16, FPGA'17 Best Paper Award). 134


More information

Efficient Deep Learning Inference based on Model Compression

Efficient Deep Learning Inference based on Model Compression Efficient Deep Learning Inference based on Model Compression Qing Zhang, Mengru Zhang, Mengdi Wang, Wanchen Sui, Chen Meng, Jun Yang Alibaba Group {sensi.zq, mengru.zmr, didou.wmd, wanchen.swc, mc119496,

More information

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs Article Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs E. George Walters III Department of Electrical and Computer Engineering, Penn State Erie,

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Normalization Techniques in Training of Deep Neural Networks

Normalization Techniques in Training of Deep Neural Networks Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,

More information

Some Applications of Machine Learning to Astronomy. Eduardo Bezerra 20/fev/2018

Some Applications of Machine Learning to Astronomy. Eduardo Bezerra 20/fev/2018 Some Applications of Machine Learning to Astronomy Eduardo Bezerra ebezerra@cefet-rj.br 20/fev/2018 Overview 2 Introduction Definition Neural Nets Applications do Astronomy Ads: Machine Learning Course

More information

2. Accelerated Computations

2. Accelerated Computations 2. Accelerated Computations 2.1. Bent Function Enumeration by a Circular Pipeline Implemented on an FPGA Stuart W. Schneider Jon T. Butler 2.1.1. Background A naive approach to encoding a plaintext message

More information

TETRIS: TilE-matching the TRemendous Irregular Sparsity

TETRIS: TilE-matching the TRemendous Irregular Sparsity TETRIS: TilE-matching the TRemendous Irregular Sparsity Yu Ji 1,2,3 Ling Liang 3 Lei Deng 3 Youyang Zhang 1 Youhui Zhang 1,2 Yuan Xie 3 {jiy15,zhang-yy15}@mails.tsinghua.edu.cn,zyh02@tsinghua.edu.cn 1

More information

FPGA Implementation of a HOG-based Pedestrian Recognition System

FPGA Implementation of a HOG-based Pedestrian Recognition System MPC Workshop Karlsruhe 10/7/2009 FPGA Implementation of a HOG-based Pedestrian Recognition System Sebastian Bauer sebastian.bauer@fh-aschaffenburg.de Laboratory for Pattern Recognition and Computational

More information

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs April 16, 2009 John Wawrzynek Spring 2009 EECS150 - Lec24-blocks Page 1 Cross-coupled NOR gates remember, If both R=0 & S=0, then

More information

Quantum Artificial Intelligence and Machine Learning: The Path to Enterprise Deployments. Randall Correll. +1 (703) Palo Alto, CA

Quantum Artificial Intelligence and Machine Learning: The Path to Enterprise Deployments. Randall Correll. +1 (703) Palo Alto, CA Quantum Artificial Intelligence and Machine : The Path to Enterprise Deployments Randall Correll randall.correll@qcware.com +1 (703) 867-2395 Palo Alto, CA 1 Bundled software and services Professional

More information

Compressing deep neural networks

Compressing deep neural networks From Data to Decisions - M.Sc. Data Science Compressing deep neural networks Challenges and theoretical foundations Presenter: Simone Scardapane University of Exeter, UK Table of contents Introduction

More information

Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So Performance, Power & Energy ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So Recall: Goal of this class Performance Reconfiguration Power/ Energy H. So, Sp10 Lecture 3 - ELEC8106/6102 2 PERFORMANCE EVALUATION

More information

EECS 579: Logic and Fault Simulation. Simulation

EECS 579: Logic and Fault Simulation. Simulation EECS 579: Logic and Fault Simulation Simulation: Use of computer software models to verify correctness Fault Simulation: Use of simulation for fault analysis and ATPG Circuit description Input data for

More information

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, et al. Google arxiv:1609.08144v2 Reviewed by : Bill

More information

CS 700: Quantitative Methods & Experimental Design in Computer Science

CS 700: Quantitative Methods & Experimental Design in Computer Science CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,

More information

An Analytical Method to Determine Minimum Per-Layer Precision of Deep Neural Networks

An Analytical Method to Determine Minimum Per-Layer Precision of Deep Neural Networks An Analytical Method to Determine Minimum Per-Layer Precision of Deep Neural Networks Charbel Sakr, Naresh Shanbhag Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign

More information

EnergyNet: Energy-Efficient Dynamic Inference

EnergyNet: Energy-Efficient Dynamic Inference EnergyNet: Energy-Efficient Dynamic Inference Yue Wang 1, Tan Nguyen 1, Yang Zhao 2, Zhangyang Wang 3, Yingyan Lin 1, and Richard Baraniuk 1 1 Rice University, Houston, TX 77005, USA 2 UC Santa Barbara,

More information

YodaNN 1 : An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration

YodaNN 1 : An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration 1 YodaNN 1 : An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration Renzo Andri, Lukas Cavigelli, avide Rossi and Luca Benini Integrated Systems Laboratory, ETH Zürich, Zurich, Switzerland

More information

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering TIMING ANALYSIS Overview Circuits do not respond instantaneously to input changes

More information

Introduction to Convolutional Neural Networks 2018 / 02 / 23

Introduction to Convolutional Neural Networks 2018 / 02 / 23 Introduction to Convolutional Neural Networks 2018 / 02 / 23 Buzzword: CNN Convolutional neural networks (CNN, ConvNet) is a class of deep, feed-forward (not recurrent) artificial neural networks that

More information

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network

More information

Today. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of

Today. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of ESE532: System-on-a-Chip Architecture Day 20: November 8, 2017 Energy Today Energy Today s bottleneck What drives Efficiency of Processors, FPGAs, accelerators How does parallelism impact energy? 1 2 Message

More information

EECS150 - Digital Design Lecture 15 SIFT2 + FSM. Recap and Outline

EECS150 - Digital Design Lecture 15 SIFT2 + FSM. Recap and Outline EECS150 - Digital Design Lecture 15 SIFT2 + FSM Oct. 15, 2013 Prof. Ronald Fearing Electrical Engineering and Computer Sciences University of California, Berkeley (slides courtesy of Prof. John Wawrzynek)

More information

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions 2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 18) Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions Authors: S. Scardapane, S. Van Vaerenbergh,

More information

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang Deep Learning Basics Lecture 7: Factor Analysis Princeton University COS 495 Instructor: Yingyu Liang Supervised v.s. Unsupervised Math formulation for supervised learning Given training data x i, y i

More information

Spiral 2-1. Datapath Components: Counters Adders Design Example: Crosswalk Controller

Spiral 2-1. Datapath Components: Counters Adders Design Example: Crosswalk Controller 2-. piral 2- Datapath Components: Counters s Design Example: Crosswalk Controller 2-.2 piral Content Mapping piral Theory Combinational Design equential Design ystem Level Design Implementation and Tools

More information

Deep Learning with Low Precision by Half-wave Gaussian Quantization

Deep Learning with Low Precision by Half-wave Gaussian Quantization Deep Learning with Low Precision by Half-wave Gaussian Quantization Zhaowei Cai UC San Diego zwcai@ucsd.edu Xiaodong He Microsoft Research Redmond xiaohe@microsoft.com Jian Sun Megvii Inc. sunjian@megvii.com

More information

Deep Learning with Coherent Nanophotonic Circuits

Deep Learning with Coherent Nanophotonic Circuits Yichen Shen, Nicholas Harris, Dirk Englund, Marin Soljacic Massachusetts Institute of Technology @ Berkeley, Oct. 2017 1 Neuromorphic Computing Biological Neural Networks Artificial Neural Networks 2 Artificial

More information

Deep Learning for Automatic Speech Recognition Part II

Deep Learning for Automatic Speech Recognition Part II Deep Learning for Automatic Speech Recognition Part II Xiaodong Cui IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Fall, 2018 Outline A brief revisit of sampling, pitch/formant and MFCC DNN-HMM

More information

Accelerating Convolutional Neural Networks by Group-wise 2D-filter Pruning

Accelerating Convolutional Neural Networks by Group-wise 2D-filter Pruning Accelerating Convolutional Neural Networks by Group-wise D-filter Pruning Niange Yu Department of Computer Sicence and Technology Tsinghua University Beijing 0008, China yng@mails.tsinghua.edu.cn Shi Qiu

More information

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis T. BEN-NUN, T. HOEFLER Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis https://www.arxiv.org/abs/1802.09941 What is Deep Learning good for? Digit Recognition Object

More information

EECS150 - Digital Design Lecture 11 - Shifters & Counters. Register Summary

EECS150 - Digital Design Lecture 11 - Shifters & Counters. Register Summary EECS50 - Digital Design Lecture - Shifters & Counters February 24, 2003 John Wawrzynek Spring 2005 EECS50 - Lec-counters Page Register Summary All registers (this semester) based on Flip-flops: q 3 q 2

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) Human Brain Neurons Input-Output Transformation Input Spikes Output Spike Spike (= a brief pulse) (Excitatory Post-Synaptic Potential)

More information

arxiv: v1 [cs.cv] 1 Oct 2015

arxiv: v1 [cs.cv] 1 Oct 2015 A Deep Neural Network ion Pipeline: Pruning, Quantization, Huffman Encoding arxiv:1510.00149v1 [cs.cv] 1 Oct 2015 Song Han 1 Huizi Mao 2 William J. Dally 1 1 Stanford University 2 Tsinghua University {songhan,

More information

Deep Neural Network Compression with Single and Multiple Level Quantization

Deep Neural Network Compression with Single and Multiple Level Quantization The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) Deep Neural Network Compression with Single and Multiple Level Quantization Yuhui Xu, 1 Yongzhuang Wang, 1 Aojun Zhou, 2 Weiyao Lin,

More information

Determination of Linear Force- Free Magnetic Field Constant αα Using Deep Learning

Determination of Linear Force- Free Magnetic Field Constant αα Using Deep Learning Determination of Linear Force- Free Magnetic Field Constant αα Using Deep Learning Bernard Benson, Zhuocheng Jiang, W. David Pan Dept. of Electrical and Computer Engineering (Dept. of ECE) G. Allen Gary

More information