Efficient Methods and Hardware for Deep Learning


1 1 Efficient Methods and Hardware for Deep Learning Song Han Stanford University

2 2 Deep Learning is Changing Our Lives: Self-Driving, Machine Translation, AlphaGo, Smart Robots

3 3 Models are Getting Larger. Image recognition (16x more compute): 2012 AlexNet, 8 layers, 1.4 GFLOP, ~16% error => 2015 ResNet, 152 layers, 22.6 GFLOP, ~3.5% error. Speech recognition (10x more training ops): 2014 Deep Speech, 80 GFLOP, 7,000 hrs of data, ~8% error => Deep Speech 2, 465 GFLOP, 12,000 hrs of data, ~5% error. Dally, NIPS 2016 workshop on Efficient Methods for Deep Neural Networks

4 4 The First Challenge: Model Size. Hard to distribute large models through over-the-air updates.

5 The Second Challenge: Speed. Error rate / training time: ResNet18: 10.76%, 2.5 days; ResNet50: 7.02%, 5 days; ResNet101: 6.21%, 1 week; ResNet152: 6.16%, 1.5 weeks. Such long training times limit ML researchers' productivity. Training time benchmarked with fb.resnet.torch using four M40 GPUs. 5

6 6 The Third Challenge: Energy Efficiency. AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game. On mobile: drains battery; in the data center: increases TCO.

7 8 The Problem of Large DNN. Hardware engineers suffer from the large model size: larger model => more memory references => more energy. Energy per 32-bit operation at 45nm [pJ]:
32-bit int ADD: 0.1
32-bit float ADD: 0.9
32-bit register file: 1
32-bit int MULT: 3.1
32-bit float MULT: 3.7
32-bit SRAM cache: 5
32-bit DRAM memory: 640
The relative energy cost axis spans 1 to 1000 (log scale): a 32-bit DRAM access costs about 6400x a 32-bit int ADD.
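To make that scale concrete, here is a back-of-envelope sketch in Python (a hypothetical calculation built from the table above, not a measurement from the talk): for a large fully-connected layer, fetching every weight from DRAM dwarfs the arithmetic itself.

# Energy per 32-bit operation at 45nm, in pJ (values from the table above).
PJ_MAC  = 3.7 + 0.9    # one float MULT + one float ADD
PJ_SRAM = 5.0          # one 32-bit SRAM access
PJ_DRAM = 640.0        # one 32-bit DRAM access

def fc_layer_energy_uj(n_weights, weights_on_chip):
    """Energy (uJ) for one M x V pass: one weight fetch + one MAC per weight."""
    mem = PJ_SRAM if weights_on_chip else PJ_DRAM
    return n_weights * (PJ_MAC + mem) * 1e-6

n = 25088 * 4096                      # a VGG-16 fc6-sized layer
print(fc_layer_energy_uj(n, False))   # weights in DRAM: ~66,000 uJ
print(fc_layer_energy_uj(n, True))    # weights in SRAM: ~990 uJ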

8 9 The Problem of Large DNN: given that memory access dominates energy, how to make deep learning more efficient?

9 10 Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design

10 11 Application as a Black Box: Algorithm (a fixed spec, e.g. SPEC 2006) => Hardware

11 12 Open the Box before Hardware Design Algorithm? Hardware Breaks the boundary between algorithm and hardware

12 14 What's in the Box: Deep Learning 101. Weights/activations; training dataset, test data; Training => Inference; Model: CNN, RNN, LSTM; training hardware, inference hardware.

13 Proposed Paradigm. Conventional: Training => Inference (slow, power-hungry). Proposed: Training (Han et al. ICLR'17) => Compression: Pruning, Quantization (Han et al. NIPS'15, ICLR'16 Best Paper Award) => Accelerated Inference (Han et al. ISCA'16, FPGA'17 Best Paper Award): fast, power-efficient. 15


15 17 The Goal & Trade-off Small Fast Accurate Energy Efficient

16 Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Compression Pruning Quantization Pruning Quantization Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Accelerated Inference Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] Training 18

17 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] 19

18 Learning both Weights and Connections for Efficient Neural Networks, Han et al. NIPS 2015

19 Pruning Neural Networks [LeCun et al. NIPS 89] [Han et al. NIPS 15] Pruning Trained Quantization Huffman Coding 22

20 [Han et al. NIPS 15] Pruning Neural Networks: analogous to dropping the negligible term in -0.01x^2 + x + 1. 60 million => 6M connections: 10x fewer. Pruning Trained Quantization Huffman Coding 23

21 Pruning Neural Networks [Han et al. NIPS 15] - chart: accuracy loss (+0.5% to -4.5%) vs. parameters pruned away (40% to 100%). Pruning Trained Quantization Huffman Coding 24

22 Pruning Neural Networks [Han et al. NIPS 15] - chart: with pruning alone, accuracy loss grows steep at high pruning ratios. Pruning Trained Quantization Huffman Coding 25

23 Retrain to Recover Accuracy [Han et al. NIPS 15] - chart: pruning + retraining holds accuracy loss near zero to much higher pruning ratios than pruning alone. Pruning Trained Quantization Huffman Coding 26

24 [Han et al. NIPS 15] Iteratively Retrain to Recover Accuracy - chart: iterative pruning and retraining pushes the no-accuracy-loss point to about 90% of parameters pruned away. Pruning Trained Quantization Huffman Coding 27
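A minimal numpy sketch of this prune-and-retrain loop (the gradient here is a random placeholder for real backpropagation): the sparsity ratio is raised step by step, and a mask keeps pruned connections at zero while the surviving weights keep training.

import numpy as np

def magnitude_mask(w, sparsity):
    """Mask that keeps the largest-magnitude (1 - sparsity) fraction of w."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return np.ones_like(w, dtype=bool)
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.abs(w) > thresh

w = np.random.randn(256, 256).astype(np.float32)
for sparsity in (0.5, 0.7, 0.9):          # iterative pruning schedule
    mask = magnitude_mask(w, sparsity)
    w *= mask                              # prune
    for _ in range(100):                   # stand-in for retraining epochs
        grad = np.random.randn(*w.shape).astype(np.float32)
        w -= 1e-3 * grad * mask            # masked update: pruned weights stay zero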

25 Pruning RNN and LSTM [Han et al. NIPS 15] *Karpathy et al., "Deep Visual-Semantic Alignments for Generating Image Descriptions" Pruning Trained Quantization Huffman Coding 28

26 Pruning RNN and LSTM [Han et al. NIPS 15] Original: a basketball player in a white uniform is playing with a ball. Pruned 90%: a basketball player in a white uniform is playing with a basketball. Original: a brown dog is running through a grassy field. Pruned 90%: a brown dog is running through a grassy area. Original: a man is riding a surfboard on a wave. Pruned 90%: a man in a wetsuit is riding a wave on a beach. Original: a soccer player in red is running in the field. Pruned 95%: a man in a red shirt and black and white black shirt is running through a field. Pruning Trained Quantization Huffman Coding 29

27 [Han et al. NIPS 15] Pruning Changes Weight Distribution: before pruning, after pruning, after retraining. Conv5 layer of AlexNet, representative of other network layers as well. Pruning Trained Quantization Huffman Coding 30

28 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Pruning Quantization Pruning Quantization Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] 31

29 Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Han et al. ICLR 2016 Best Paper Pruning Trained Quantization Huffman Coding 32

30 [Han et al. ICLR 16] Trained Quantization: 32-bit weights => 4-bit cluster indices, 8x less memory footprint; nearby weights (e.g. 2.09, 2.12, 1.92) share one centroid (~2.0). Pruning Trained Quantization Huffman Coding 33
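A sketch of the clustering step, assuming plain k-means with linear initialization (the paper also studies density-based and random init); after clustering, only the 4-bit index per weight and the small codebook are stored, and during retraining the gradients of weights sharing a centroid are summed to fine-tune that centroid (not shown here).

import numpy as np

def kmeans_quantize(w, bits=4, iters=20):
    """Cluster weights into 2**bits shared values (the 'codebook')."""
    n = 2 ** bits
    centroids = np.linspace(w.min(), w.max(), n)        # linear initialization
    flat = w.ravel()
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(n):
            members = flat[idx == c]
            if members.size:                            # skip empty clusters
                centroids[c] = members.mean()
    return idx.reshape(w.shape).astype(np.uint8), centroids

w = np.random.randn(64, 64).astype(np.float32)
idx, codebook = kmeans_quantize(w)                      # 4-bit index per weight
w_hat = codebook[idx]                                   # decoded (shared) weights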


38 Before Trained Quantization: Continuous Weight [Han et al. ICLR 16] - histogram: count vs. weight value, a continuous distribution. Pruning Trained Quantization Huffman Coding 41

39 After Trained Quantization: Discrete Weight [Han et al. ICLR 16] - histogram: weights collapse onto a small set of discrete centroid values. Pruning Trained Quantization Huffman Coding 42

40 After Trained Quantization: Discrete Weight after Training [Han et al. ICLR 16] - histogram: the centroids shift as they are fine-tuned during retraining. Pruning Trained Quantization Huffman Coding 43

41 [Han et al. ICLR 16] Bits Per Weight - chart: accuracy vs. number of bits per weight after trained quantization. Pruning Trained Quantization Huffman Coding 44

42 [Han et al. ICLR 16] Pruning + Trained Quantization - chart: accuracy vs. compression ratio for pruning alone, quantization alone, and both combined, AlexNet on ImageNet; the combination compresses furthest without accuracy loss. Pruning Trained Quantization Huffman Coding 45

43 [Han et al. ICLR 16] Huffman Coding. Infrequent weights: use more bits to represent; frequent weights: use fewer bits to represent. Pruning Trained Quantization Huffman Coding 46
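A small sketch of this step using Python's heapq (standard Huffman tree construction, not the paper's exact implementation), applied to the stream of quantized weight indices, whose distribution is highly non-uniform after pruning and quantization:

import heapq
from collections import Counter

def huffman_code(symbols):
    """Map each symbol to a bit string; frequent symbols get fewer bits."""
    freq = Counter(symbols)
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)    # two least frequent subtrees
        n1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (n0 + n1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

code = huffman_code([0, 0, 0, 1, 1, 2])    # most frequent index: shortest code
bits = "".join(code[s] for s in [0, 0, 0, 1, 1, 2])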

44 [Han et al. ICLR 16] Summary of Deep Compression Pruning Trained Quantization Huffman Coding 47

45 [Han et al. ICLR 16] Results: Compression Ratio. Network: original size => compressed size (ratio), original => compressed accuracy:
LeNet-300-100: 1070KB => 27KB (40x), 98.36% => 98.42%
LeNet-5: 1720KB => 44KB (39x), 99.20% => 99.26%
AlexNet: 240MB => 6.9MB (35x), 80.27% => 80.30%
VGGNet: 550MB => 11.3MB (49x), 88.68% => 89.09%
GoogleNet: 28MB => 2.8MB (10x), 88.90% => 88.92%
ResNet: 44MB => 4.0MB (11x), 89.24% => 89.28%
Fits in cache! Can we make compact models to begin with? 48

46 SqueezeNet: squeeze (1x1 convolution filters, ReLU) => expand (1x1 and 3x3 convolution filters, ReLU). Iandola et al., SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv 2016

47 Compressing SqueezeNet. Network / approach / size / ratio / top-1 / top-5 accuracy:
AlexNet, -, 240MB, 1x, 57.2%, 80.3%
AlexNet, SVD, 48MB, 5x, 56.0%, 79.4%
AlexNet, Deep Compression, 6.9MB, 35x, 57.2%, 80.3%
SqueezeNet, -, 4.8MB, 50x, 57.5%, 80.3%
SqueezeNet, Deep Compression, 0.47MB, 510x, 57.5%, 80.3%
50

48 Results: Speedup - chart: layer-wise speedup of the pruned model over the dense baseline on CPU, GPU, and mobile GPU, with average bars (one case as low as 0.6x). 51

49 Results: Energy Efficiency - chart: layer-wise energy efficiency of the pruned model over the dense baseline on CPU, GPU, and mobile GPU, with average bars. 52

50 Industrial Impact of Deep Compression. "At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file. As a mobile-first company, we frequently update various apps via different app stores. We're very sensitive to the size of our binary files, and a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB." Andrew Ng 53

51 Challenges:
- Online decompression while computing: requires special-purpose logic.
- Computation becomes irregular: sparse weights, sparse activations, indirect lookup.
- Parallelization becomes challenging: synchronization overhead, load imbalance, scalability.
54

52 Having Opened the Box, HW Design? Algorithm Hardware? Breaks the boundary between algorithm and hardware 55

53 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] 56

54 EIE: Efficient Inference Engine on Compressed Deep Neural Network, Han et al. ISCA 2016

55 57 Memory access dominates energy (640 pJ for a 32-bit DRAM access vs. single-digit pJ for ALU operations, per the table above): how to reduce the memory footprint?

56 Related Work: Eyeriss [1] (MIT, dataflow); TPU [2] (Google, 8-bit integer); DaDianNao [3] (CAS, eDRAM); EIE [this work] (Stanford, compression). [1] Yu-Hsin Chen, et al., "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," ISSCC 2016. [2] Norm Jouppi, "Google supercharges machine learning tasks with TPU custom chip," 2016. [3] Yunji Chen, et al., "DaDianNao: A machine-learning supercomputer," MICRO 2014. [4] Song Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," ISCA 2016. 59

57 [Han et al. ISCA 16] EIE: Efficient Inference Engine. Exploits: 0 * A = 0 (sparse weight: 90% static sparsity => 10x less computation, 5x less memory footprint); W * 0 = 0 (sparse activation: 70% dynamic sparsity => 3x less computation); 2.09, 1.92 => 2 (weight sharing: 4-bit weights => 8x less memory footprint). 60

58 EIE: Parallelization on Sparsity [Han et al. ISCA 16] - example: sparse activation vector a = (0, a1, 0, a3) times an 8x4 sparse weight matrix W gives b = ReLU(W a); only nonzero entries participate. 61

59 EIE: Parallelization on Sparsity [Han et al. ISCA 16] - an array of 16 processing elements (PEs) with central control; in the example, the rows of W (and the corresponding entries of b) are interleaved across PE0-PE3. 62

60 EIE: Parallelization on Sparsity [Han et al. ISCA 16] - logically, each PE sees its slice of the sparse matrix; physically, it stores only the virtual weights (e.g. W0,0 W0,1 W4,2 W0,3 W4,3), a relative index per entry (zeros skipped since the previous nonzero), and a column pointer marking where each column starts. 63
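A Python sketch of this storage scheme, assuming 4-bit relative indices as in EIE: each stored entry carries its value plus the run of zeros since the previous nonzero in its column; a run longer than 15 is broken up by inserting filler zero entries, and a column pointer array marks where each column's entries begin.

import numpy as np

def encode_csc_relative(W, index_bits=4):
    """Compress W to (values, relative indices, column pointers)."""
    max_run = 2 ** index_bits - 1
    vals, rel, col_ptr = [], [], [0]
    for j in range(W.shape[1]):
        last = -1
        for i in np.flatnonzero(W[:, j]):
            run = i - last - 1
            while run > max_run:           # run too long for 4 bits:
                vals.append(0.0)           # insert a filler zero entry
                rel.append(max_run)
                run -= max_run + 1
            vals.append(float(W[i, j]))
            rel.append(run)
            last = i
        col_ptr.append(len(vals))
    return np.array(vals), np.array(rel), np.array(col_ptr)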

61 [Han et al. ISCA 16] Dataflow (animated over slides 64-73): the nonzero activations a_j are fetched one at a time and broadcast; each PE walks the stored nonzeros of column j in its slice, multiplies them by a_j, and accumulates into the output b, which then passes through ReLU. 64-73
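The same dataflow as a software sketch (eie_spmv is a name chosen here, not from the paper; it consumes the (vals, rel, col_ptr) encoding from the sketch above): only nonzero activations are visited, and within each selected column only stored nonzero weights are touched.

import numpy as np

def eie_spmv(vals, rel, col_ptr, a, n_rows):
    """b = ReLU(W a) over the relative-indexed sparse encoding of W."""
    b = np.zeros(n_rows)
    for j in np.flatnonzero(a):            # dynamic (activation) sparsity
        row = -1
        for k in range(col_ptr[j], col_ptr[j + 1]):
            row += rel[k] + 1              # relative => absolute row index
            b[row] += vals[k] * a[j]       # static (weight) sparsity
    return np.maximum(b, 0.0)              # ReLU feeds the next layer

# usage, given the encoder above: b = eie_spmv(*encode_csc_relative(W), a, W.shape[0])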

71 [Han et al. ISCA 16] EIE Architecture: the compressed DNN model (encoded weights, relative indices, sparse format) is decoded on the fly. A weight look-up turns each 4-bit virtual weight into a 16-bit real weight, and an address accumulator turns each 4-bit relative index into an absolute index; the ALU and memory then produce the prediction result from the input image. 74

72 Micro Architecture for each PE [Han et al. ISCA 16] - block diagram: activation queue (act value/index), leading non-zero detect, pointer read (even/odd pointer SRAM banks, column start/end address), sparse matrix access (sparse matrix SRAM), weight decoder (encoded weight, relative index), address accumulator (absolute address), arithmetic unit (bypass), source/destination activation registers, act R/W (act SRAM), ReLU. Legend: SRAM / Regs / Comb. 75

73-81 [Han et al. ISCA 16] PE pipeline walkthrough (one slide per stage): Load Balance (the activation queue decouples PEs); Activation Sparsity (leading non-zero detection feeds only nonzero activations); Weight Sparsity (pointer read from even/odd pointer SRAM banks, then sparse matrix SRAM access); Weight Sharing (the weight decoder expands encoded weights); Address Accumulate (relative => absolute index); Arithmetic (multiply-accumulate with bypass); Write Back (source/destination activation registers, activation SRAM); ReLU and non-zero detection. What's special: the entire pipeline operates directly on the compressed model. 76-84

82 [Han et al. ISCA 16] Post-Layout Results of EIE:
Technology: 45 nm
# PEs: 64
On-chip SRAM: 8 MB
Max model size: 84 million parameters
Static sparsity: 10x
Dynamic sparsity: 3x
Quantization: 4-bit
ALU width: 16-bit
Area: 40.8 mm^2
MxV throughput: 81,967 layers/s
Power: 586 mW
1. Post-layout result. 2. Throughput measured on AlexNet FC-7. 85

83 [Han et al. ISCA 16] Benchmark. CPU: Intel Core-i7 5930k; GPU: NVIDIA TitanX; mobile GPU: NVIDIA Jetson TK1.
Layer / size / weight density / activation density / FLOP reduction:
Alex-6: 9216x4096, 9%, 35%, 33x
Alex-7: 4096x4096, 9%, 35%, 33x
Alex-8: 4096x1000, 25%, 38%, 10x
VGG-6: 25088x4096, 4%, 18%, 100x
VGG-7: 4096x4096, 4%, 37%, 50x
VGG-8: 4096x1000, 23%, 41%, 10x
NT-We: 4096x600, 10%, 100%, 10x
NT-Wd: 600x8791, 11%, 100%, 10x
NT-LSTM: 1201x2400, 10%, 100%, 10x
AlexNet for image classification; VGG-16 for image classification and object detection; RNN and LSTM (NeuralTalk) for image captioning. 86

84 [Han et al. ISCA 16] Speedup on EIE - chart (Figure 6 of the paper): speedups of CPU, GPU, and mobile GPU, each running dense and compressed models, and of EIE, all normalized to the CPU running the uncompressed model, per benchmark layer (Alex-6/7/8, VGG-6/7/8, NT-We/Wd/LSTM), with no batching; EIE's geometric-mean speedup over the dense CPU baseline is 189x. Methodology: EIE is implemented in Verilog, verified against a custom cycle-accurate C++ simulator, synthesized with Synopsys Design Compiler and placed-and-routed with IC Compiler under TSMC 45nm (critical path 1.15 ns, Table II); SRAM modeled with CACTI; power estimated with Prime-Time PX from RTL-annotated switching activity. Baselines: Intel Core i7-5930k (MKL CBLAS GEMV for dense, MKL SPBLAS CSRMV for sparse, socket power via pcm-power), NVIDIA GeForce GTX TitanX (cuBLAS GEMV for dense, CSR format for sparse, power via nvidia-smi), NVIDIA Tegra K1; models from the Caffe and NeuralTalk model zoos, compressed as in Deep Compression. 87

85 [Han et al. ISCA 16] Energy Efficiency on EIE - chart (Figure 7 of the paper): energy efficiency of GPU, mobile GPU, and EIE relative to the CPU running the uncompressed model, per benchmark layer, with no batching; EIE is four to five orders of magnitude more energy efficient than the dense CPU baseline (about 24,000x geometric mean). 88

86 [Han et al. ISCA 16] Comparison: Throughput - chart (layers/s, log scale): Core-i7 5930k (22nm CPU), TitanX (28nm GPU), Tegra K1 (28nm mGPU), A-Eye (28nm FPGA), DaDianNao (28nm ASIC), TrueNorth (28nm ASIC), EIE (45nm ASIC, 64 PEs), EIE (28nm ASIC, 256 PEs); EIE leads. 89

87 Comparison: Energy Efficiency [Han et al. ISCA 16] - chart (layers/J, log scale), same platforms as the throughput comparison: CPU, GPU, mGPU, FPGA, and the DaDianNao and TrueNorth ASICs; EIE leads by orders of magnitude. 90

88 [Han et al. ISCA 16] Scalability - chart (Figure 11): speedup vs. number of PEs (1 to 256) on each benchmark layer; the speedup is near-linear. #PEs vs. speedup: 64 PEs: 64x; 128 PEs: 124x; 256 PEs: 210x. 91

89 [Han et al. ISCA 16] Load Balancing - chart (Figure 8): load efficiency vs. FIFO depth (1 to 256) per benchmark layer; load efficiency improves as FIFO size increases, and beyond FIFO depth 8 the marginal gain quickly diminishes, so depth 8 is chosen. Imbalanced non-zeros among PEs degrade system utilization; this load imbalance is absorbed by the activation FIFO. With FIFO depth = 8, ALU utilization is > 80%. 92

90 Can we do better with load imbalance? Feedforward => Recurrent neural network? 93

91 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Compression Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] 94

92 ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA Han et al. FPGA 2017 Best Paper 95

93 Accelerating Recurrent Neural Networks: speech recognition, image captioning, machine translation, visual question answering. The recurrent nature of RNNs/LSTMs produces complicated data dependencies, which are more challenging than feedforward neural nets. 96

94 Rethinking Model Compression [Han et al. FPGA 17]: Compression (Pruning, Quantization) => load-balance-aware pruning => Accelerated Inference 97

95 Pruning Leads to Load Imbalance [Han et al. FPGA 17] - example sparse matrix with rows interleaved over four PEs: the PEs hold 5, 2, 4, and 1 nonzeros, so they need 5, 2, 4, and 1 cycles; the overall time is set by the slowest PE: 5 cycles. 100

96-98 Load-Balance-Aware Pruning [Han et al. FPGA 17] - prune under the constraint that every PE keeps the same number of nonzeros: in the example, the unbalanced matrix needs 5 cycles (PEs with 5/2/4/1 nonzeros), while the balanced matrix gives every PE 3 nonzeros, so all PEs finish in 3 cycles. A sketch follows. 101-103
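A numpy sketch of this constraint, assuming rows are interleaved across PEs as in the example above: rather than one global magnitude threshold, each PE's submatrix is pruned to the same sparsity, so every PE ends up with the same nonzero count.

import numpy as np

def load_balanced_prune(W, sparsity, n_pe=4):
    """Prune each PE's (interleaved) rows to the same sparsity."""
    Wp = W.copy()
    for pe in range(n_pe):
        sub = Wp[pe::n_pe]                 # view of this PE's rows
        k = int(sparsity * sub.size)
        if k:
            thr = np.partition(np.abs(sub).ravel(), k - 1)[k - 1]
            sub[np.abs(sub) <= thr] = 0.0  # in-place: modifies Wp
    return Wp

Wp = load_balanced_prune(np.random.randn(8, 4), sparsity=0.625)
print([np.count_nonzero(Wp[pe::4]) for pe in range(4)])  # [3, 3, 3, 3]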

99 Load-Balance-Aware Pruning: Same Accuracy [Han et al. FPGA 17] - chart: phone error rate (19% to 28%) vs. parameters pruned away (0% to 100%), with and without load balance; the two curves track each other, with a sweet spot at a high pruning ratio where accuracy is preserved. 106

100 Load-Balance-Aware Pruning: Better Speedup [Han et al. FPGA 17] - chart: speedup vs. parameters pruned away (0% to 90%); with load balance the pruned model reaches 6.2x speedup over dense, versus 5.5x without. 109

101 [Han et al. FPGA 17] Hardware Architecture: a software program on the CPU (with external memory) talks to the FPGA over a PCIE controller; on the FPGA, a memory controller and data bus feed input/output buffers and the ESE accelerator, organized as channels of multiple PEs under an ESE controller. Each PE contains an activation queue (FIFOs), pointer reads with pointer buffers, sparse-matrix reads (SpmatRead) with weight buffers, SpMV units with accumulators, and activation buffers; shared element-wise units (ElemMul, adder tree, sigmoid/tanh) combine Wx_t, Wy_t-1, and the cell state c_t, assembling the LSTM output y_t. 110

102 Speedup and Energy Efficiency. Platform / latency / power / speedup / energy efficiency:
GPU: 240 us, 202 W, 1x, 1x
CPU: 6017 us, 111 W, 0.04x, 0.07x
ESE: 82.7 us, 41 W, 3x, 14x
Table 6: ESE resource utilization on the Xilinx KU060 FPGA.
Available: LUT 331,680; LUTRAM 146,880; FF 663,360; BRAM 1,080; DSP 2,760.
Used: LUT 293,920 (88.6%); LUTRAM 47.6%; FF 68.3%; BRAM 87.7%; DSP 1,504 (54.5%).
111

103 From Compression to Acceleration. Challenge 1: memory access is expensive => Deep Compression: 10x-49x smaller with no loss of accuracy. Challenge 2: sparsity, indirection, load balance => EIE / ESE accelerators: energy-efficient accelerated inference. 112

104 What about Training? At the compressed model size: same accuracy. At the original model size: can we get higher accuracy? 113

105 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] Training 114

106 DSD: Dense-Sparse-Dense Training for Deep Neural Networks Han et al. ICLR 2017

107 DSD: Dense-Sparse-Dense Training [Han et al. ICLR 2017]. Dense => Sparse (pruning under a sparsity constraint) => Re-Dense (increase model capacity). DSD produces the same model architecture but finds a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy across a wide range of deep neural networks: CNNs, RNNs, and LSTMs. 116
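A minimal sketch of the three phases, with a placeholder gradient function standing in for real training (dsd_train and grad_fn are names chosen here, not from the paper): the sparse phase trains under a magnitude-pruning mask, and the re-dense phase lifts the mask, with the recovered weights restarting from zero.

import numpy as np

def dsd_train(w, grad_fn, steps=1000, sparsity=0.5, lr=1e-3):
    """Dense => Sparse => Re-Dense training schedule."""
    mask = np.ones_like(w, dtype=bool)
    for phase in ("dense", "sparse", "re-dense"):
        if phase == "sparse":              # impose the sparsity constraint
            k = int(sparsity * w.size)
            thr = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
            mask = np.abs(w) > thr
            w = w * mask
        elif phase == "re-dense":          # restore full model capacity
            mask = np.ones_like(w, dtype=bool)
        for _ in range(steps):
            w = w - lr * grad_fn(w) * mask
    return w

w = dsd_train(np.random.randn(128, 128),
              grad_fn=lambda w: 0.01 * w + 0.01 * np.random.randn(*w.shape))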

108 [Han et al. ICLR 2017] DSD: Intuition Learn the trunk first Then learn the leaves 117

109 [Han et al. ICLR 2017] DSD is General Purpose: Vision, Speech, Natural Language. Network / domain / dataset / type: baseline => DSD (abs. imp., rel. imp.):
GoogleNet, Vision, ImageNet, CNN: 31.1% => 30.0% (1.1%, 3.6%)
VGG-16, Vision, ImageNet, CNN: 31.5% => 27.2% (4.3%, 13.7%)
ResNet-18, Vision, ImageNet, CNN: 30.4% => 29.3% (1.1%, 3.7%)
ResNet-50, Vision, ImageNet, CNN: 24.0% => 23.2% (0.9%, 3.5%)
NeuralTalk, Caption, Flickr-8K, LSTM
DeepSpeech, Speech, WSJ 93, RNN: 33.6% => 31.6% (2.0%, 5.8%)
DeepSpeech-2, Speech, WSJ 93, RNN: 14.5% => 13.4% (1.1%, 7.4%)
Open-sourced DSD Model Zoo. The baseline results of AlexNet, VGG-16, GoogleNet, SqueezeNet are from the Caffe Model Zoo; ResNet-18, ResNet-50 are from fb.resnet.torch. 120

110 Related Work. Dropout [1] and DropConnect [2]: Dropout uses a random sparsity pattern, whereas DSD training learns with a deterministic, data-driven sparsity pattern. Distillation [3]: transfers knowledge from a large model to a small model; both DSD and distillation incur no architectural changes. [1] Srivastava, Nitish, et al., "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research 15.1 (2014). [2] Wan, Li, et al., "Regularization of neural networks using DropConnect," ICML 2013. [3] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network," arXiv preprint (2015). 121

111 Small Fast Summary Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Compression Pruning Quantization Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Accelerated Inference Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] Training 122

112 Summary Algorithm Inference sparsity Training Hardware 126

113 Summary Algorithm Smaller Size: Deep Compression Higher Accuracy: DSD Regularization Inference sparsity Training Better Speed, Energy Efficiency: EIE / ESE Accelerator Hardware 129

114 Summary: Algorithm / Hardware, Inference / Training, sparsity at the center. Smaller size: Deep Compression. Higher accuracy: DSD Regularization. Better speed and energy efficiency: EIE / ESE accelerators. Future work: phones, drones, robots, self-driving cars, AI in the cloud. 130

115 131 Future Smart Low Latency Privacy Mobility Energy-Efficient

116 132 Outlook: the Path for Computation. PC Computing => Mobile Computing => Brain-Inspired Cognitive Computing; "Mobile-First to AI-First" (Sundar Pichai, Google I/O, 2016)

117 Thank you! stanford.edu/~songhan. Smart: DSD Training, Dense => Sparse => Re-Dense (Han et al. ICLR'17). Small: Compression via Pruning and Quantization (Han et al. NIPS'15, ICLR'16 Best Paper Award). Fast, Efficient: Accelerated Inference with EIE / ESE (Han et al. ISCA'16, FPGA'17 Best Paper Award). 134


More information

Efficient Deep Learning Inference based on Model Compression

Efficient Deep Learning Inference based on Model Compression Efficient Deep Learning Inference based on Model Compression Qing Zhang, Mengru Zhang, Mengdi Wang, Wanchen Sui, Chen Meng, Jun Yang Alibaba Group {sensi.zq, mengru.zmr, didou.wmd, wanchen.swc, mc119496,

More information

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs Article Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs E. George Walters III Department of Electrical and Computer Engineering, Penn State Erie,

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Normalization Techniques in Training of Deep Neural Networks

Normalization Techniques in Training of Deep Neural Networks Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,

More information

Some Applications of Machine Learning to Astronomy. Eduardo Bezerra 20/fev/2018

Some Applications of Machine Learning to Astronomy. Eduardo Bezerra 20/fev/2018 Some Applications of Machine Learning to Astronomy Eduardo Bezerra ebezerra@cefet-rj.br 20/fev/2018 Overview 2 Introduction Definition Neural Nets Applications do Astronomy Ads: Machine Learning Course

More information

2. Accelerated Computations

2. Accelerated Computations 2. Accelerated Computations 2.1. Bent Function Enumeration by a Circular Pipeline Implemented on an FPGA Stuart W. Schneider Jon T. Butler 2.1.1. Background A naive approach to encoding a plaintext message

More information

TETRIS: TilE-matching the TRemendous Irregular Sparsity

TETRIS: TilE-matching the TRemendous Irregular Sparsity TETRIS: TilE-matching the TRemendous Irregular Sparsity Yu Ji 1,2,3 Ling Liang 3 Lei Deng 3 Youyang Zhang 1 Youhui Zhang 1,2 Yuan Xie 3 {jiy15,zhang-yy15}@mails.tsinghua.edu.cn,zyh02@tsinghua.edu.cn 1

More information

FPGA Implementation of a HOG-based Pedestrian Recognition System

FPGA Implementation of a HOG-based Pedestrian Recognition System MPC Workshop Karlsruhe 10/7/2009 FPGA Implementation of a HOG-based Pedestrian Recognition System Sebastian Bauer sebastian.bauer@fh-aschaffenburg.de Laboratory for Pattern Recognition and Computational

More information

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs April 16, 2009 John Wawrzynek Spring 2009 EECS150 - Lec24-blocks Page 1 Cross-coupled NOR gates remember, If both R=0 & S=0, then

More information

Quantum Artificial Intelligence and Machine Learning: The Path to Enterprise Deployments. Randall Correll. +1 (703) Palo Alto, CA

Quantum Artificial Intelligence and Machine Learning: The Path to Enterprise Deployments. Randall Correll. +1 (703) Palo Alto, CA Quantum Artificial Intelligence and Machine : The Path to Enterprise Deployments Randall Correll randall.correll@qcware.com +1 (703) 867-2395 Palo Alto, CA 1 Bundled software and services Professional

More information

Compressing deep neural networks

Compressing deep neural networks From Data to Decisions - M.Sc. Data Science Compressing deep neural networks Challenges and theoretical foundations Presenter: Simone Scardapane University of Exeter, UK Table of contents Introduction

More information

Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So Performance, Power & Energy ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So Recall: Goal of this class Performance Reconfiguration Power/ Energy H. So, Sp10 Lecture 3 - ELEC8106/6102 2 PERFORMANCE EVALUATION

More information

EECS 579: Logic and Fault Simulation. Simulation

EECS 579: Logic and Fault Simulation. Simulation EECS 579: Logic and Fault Simulation Simulation: Use of computer software models to verify correctness Fault Simulation: Use of simulation for fault analysis and ATPG Circuit description Input data for

More information

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, et al. Google arxiv:1609.08144v2 Reviewed by : Bill

More information

CS 700: Quantitative Methods & Experimental Design in Computer Science

CS 700: Quantitative Methods & Experimental Design in Computer Science CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,

More information

An Analytical Method to Determine Minimum Per-Layer Precision of Deep Neural Networks

An Analytical Method to Determine Minimum Per-Layer Precision of Deep Neural Networks An Analytical Method to Determine Minimum Per-Layer Precision of Deep Neural Networks Charbel Sakr, Naresh Shanbhag Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign

More information

EnergyNet: Energy-Efficient Dynamic Inference

EnergyNet: Energy-Efficient Dynamic Inference EnergyNet: Energy-Efficient Dynamic Inference Yue Wang 1, Tan Nguyen 1, Yang Zhao 2, Zhangyang Wang 3, Yingyan Lin 1, and Richard Baraniuk 1 1 Rice University, Houston, TX 77005, USA 2 UC Santa Barbara,

More information

YodaNN 1 : An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration

YodaNN 1 : An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration 1 YodaNN 1 : An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration Renzo Andri, Lukas Cavigelli, avide Rossi and Luca Benini Integrated Systems Laboratory, ETH Zürich, Zurich, Switzerland

More information

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering TIMING ANALYSIS Overview Circuits do not respond instantaneously to input changes

More information

Introduction to Convolutional Neural Networks 2018 / 02 / 23

Introduction to Convolutional Neural Networks 2018 / 02 / 23 Introduction to Convolutional Neural Networks 2018 / 02 / 23 Buzzword: CNN Convolutional neural networks (CNN, ConvNet) is a class of deep, feed-forward (not recurrent) artificial neural networks that

More information

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network

More information

Today. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of

Today. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of ESE532: System-on-a-Chip Architecture Day 20: November 8, 2017 Energy Today Energy Today s bottleneck What drives Efficiency of Processors, FPGAs, accelerators How does parallelism impact energy? 1 2 Message

More information

EECS150 - Digital Design Lecture 15 SIFT2 + FSM. Recap and Outline

EECS150 - Digital Design Lecture 15 SIFT2 + FSM. Recap and Outline EECS150 - Digital Design Lecture 15 SIFT2 + FSM Oct. 15, 2013 Prof. Ronald Fearing Electrical Engineering and Computer Sciences University of California, Berkeley (slides courtesy of Prof. John Wawrzynek)

More information

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions 2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 18) Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions Authors: S. Scardapane, S. Van Vaerenbergh,

More information

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang Deep Learning Basics Lecture 7: Factor Analysis Princeton University COS 495 Instructor: Yingyu Liang Supervised v.s. Unsupervised Math formulation for supervised learning Given training data x i, y i

More information

Spiral 2-1. Datapath Components: Counters Adders Design Example: Crosswalk Controller

Spiral 2-1. Datapath Components: Counters Adders Design Example: Crosswalk Controller 2-. piral 2- Datapath Components: Counters s Design Example: Crosswalk Controller 2-.2 piral Content Mapping piral Theory Combinational Design equential Design ystem Level Design Implementation and Tools

More information

Deep Learning with Low Precision by Half-wave Gaussian Quantization

Deep Learning with Low Precision by Half-wave Gaussian Quantization Deep Learning with Low Precision by Half-wave Gaussian Quantization Zhaowei Cai UC San Diego zwcai@ucsd.edu Xiaodong He Microsoft Research Redmond xiaohe@microsoft.com Jian Sun Megvii Inc. sunjian@megvii.com

More information

Deep Learning with Coherent Nanophotonic Circuits

Deep Learning with Coherent Nanophotonic Circuits Yichen Shen, Nicholas Harris, Dirk Englund, Marin Soljacic Massachusetts Institute of Technology @ Berkeley, Oct. 2017 1 Neuromorphic Computing Biological Neural Networks Artificial Neural Networks 2 Artificial

More information

Deep Learning for Automatic Speech Recognition Part II

Deep Learning for Automatic Speech Recognition Part II Deep Learning for Automatic Speech Recognition Part II Xiaodong Cui IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Fall, 2018 Outline A brief revisit of sampling, pitch/formant and MFCC DNN-HMM

More information

Accelerating Convolutional Neural Networks by Group-wise 2D-filter Pruning

Accelerating Convolutional Neural Networks by Group-wise 2D-filter Pruning Accelerating Convolutional Neural Networks by Group-wise D-filter Pruning Niange Yu Department of Computer Sicence and Technology Tsinghua University Beijing 0008, China yng@mails.tsinghua.edu.cn Shi Qiu

More information

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis T. BEN-NUN, T. HOEFLER Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis https://www.arxiv.org/abs/1802.09941 What is Deep Learning good for? Digit Recognition Object

More information

EECS150 - Digital Design Lecture 11 - Shifters & Counters. Register Summary

EECS150 - Digital Design Lecture 11 - Shifters & Counters. Register Summary EECS50 - Digital Design Lecture - Shifters & Counters February 24, 2003 John Wawrzynek Spring 2005 EECS50 - Lec-counters Page Register Summary All registers (this semester) based on Flip-flops: q 3 q 2

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) Human Brain Neurons Input-Output Transformation Input Spikes Output Spike Spike (= a brief pulse) (Excitatory Post-Synaptic Potential)

More information

arxiv: v1 [cs.cv] 1 Oct 2015

arxiv: v1 [cs.cv] 1 Oct 2015 A Deep Neural Network ion Pipeline: Pruning, Quantization, Huffman Encoding arxiv:1510.00149v1 [cs.cv] 1 Oct 2015 Song Han 1 Huizi Mao 2 William J. Dally 1 1 Stanford University 2 Tsinghua University {songhan,

More information

Deep Neural Network Compression with Single and Multiple Level Quantization

Deep Neural Network Compression with Single and Multiple Level Quantization The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) Deep Neural Network Compression with Single and Multiple Level Quantization Yuhui Xu, 1 Yongzhuang Wang, 1 Aojun Zhou, 2 Weiyao Lin,

More information

Determination of Linear Force- Free Magnetic Field Constant αα Using Deep Learning

Determination of Linear Force- Free Magnetic Field Constant αα Using Deep Learning Determination of Linear Force- Free Magnetic Field Constant αα Using Deep Learning Bernard Benson, Zhuocheng Jiang, W. David Pan Dept. of Electrical and Computer Engineering (Dept. of ECE) G. Allen Gary

More information