Efficient Methods and Hardware for Deep Learning
1 Efficient Methods and Hardware for Deep Learning — Song Han, Stanford University
2 Deep Learning is Changing Our Lives: Self-Driving, Machine Translation, AlphaGo, Smart Robots
3 Models are Getting Larger
IMAGE RECOGNITION: 2012 AlexNet, 8 layers, 1.4 GFLOP, ~16% error → 2015 ResNet, 152 layers, 22.6 GFLOP, ~3.5% error (16x larger model)
SPEECH RECOGNITION: 2014 Deep Speech, 80 GFLOP, 7,000 hrs of data, ~8% error → Deep Speech 2, 465 GFLOP, 12,000 hrs of data, ~5% error (10x training ops)
Dally, NIPS 2016 workshop on Efficient Methods for Deep Neural Networks
4 The First Challenge: Model Size — hard to distribute large models through over-the-air updates
5 The Second Challenge: Speed
            Error rate   Training time
ResNet18:   10.76%       2.5 days
ResNet50:    7.02%       5 days
ResNet101:   6.21%       1 week
ResNet152:   6.16%       1.5 weeks
Such long training times limit ML researchers' productivity. Training time benchmarked with fb.resnet.torch using four M40 GPUs.
6 The Third Challenge: Energy Efficiency
AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game.
On mobile: drains the battery. In the data center: increases TCO.
7 The Problem of Large DNNs
Hardware engineers suffer from the large model size: larger model => more memory references => more energy.
Energy per operation (45 nm):
  32-bit int ADD        0.1 pJ
  32-bit float ADD      0.9 pJ
  32-bit Register File  1 pJ
  32-bit int MULT       3.1 pJ
  32-bit float MULT     3.7 pJ
  32-bit SRAM Cache     5 pJ
  32-bit DRAM Memory    640 pJ
Relative energy cost: a DRAM access costs on the order of 1000x more than a simple arithmetic operation.
8 The Problem of Large DNNs (same energy table) — how do we make deep learning more efficient?
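To make the energy numbers concrete, here is a back-of-envelope sketch (not from the talk): it prices one pass over a hypothetical 4096x4096 fully connected layer, assuming one weight fetch, one multiply, and one add per weight. The layer size and the one-fetch-per-weight model are illustrative assumptions.

```python
# Per-operation energy at 45 nm, taken from the table above (pJ).
ENERGY_PJ = {
    "fp32_add": 0.9, "fp32_mult": 3.7,
    "sram_32b": 5.0, "dram_32b": 640.0,
}

def fc_layer_energy_uj(n_weights, weight_mem="dram_32b"):
    """Energy (uJ) for one dense matrix-vector pass: one weight fetch,
    one multiply, and one add per weight (illustrative model)."""
    per_weight = (ENERGY_PJ[weight_mem]
                  + ENERGY_PJ["fp32_mult"] + ENERGY_PJ["fp32_add"])
    return n_weights * per_weight * 1e-6  # pJ -> uJ

n = 4096 * 4096  # hypothetical FC layer
from_dram = fc_layer_energy_uj(n, "dram_32b")
from_sram = fc_layer_energy_uj(n, "sram_32b")
print(f"DRAM: {from_dram:.0f} uJ, SRAM: {from_sram:.0f} uJ, "
      f"ratio: {from_dram / from_sram:.0f}x")  # -> ratio: 67x
```

Keeping the weights on-chip is worth well over an order of magnitude in energy before any algorithmic savings, which is why the rest of the talk works to shrink the model until it fits in SRAM.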
9 10 Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design
10 Application as a Black Box: Algorithm (e.g., SPEC 2006) → Hardware
11 12 Open the Box before Hardware Design Algorithm? Hardware Breaks the boundary between algorithm and hardware
12 What's in the Box: Deep Learning 101 — Training (training dataset, training hardware) produces the weights/activations; Inference (test data, inference hardware) runs the trained model. Model: CNN, RNN, LSTM.
13 Proposed Paradigm
Conventional: Training → Inference (slow, power-hungry).
Proposed: Training [Han et al. ICLR 17] → Compression (Pruning, Quantization) [Han et al. NIPS 15; ICLR 16, best paper award] → Accelerated Inference [Han et al. ISCA 16; FPGA 17, best paper award] (fast, power-efficient).
14 Proposed Paradigm (build of the previous slide)
15 17 The Goal & Trade-off Small Fast Accurate Energy Efficient
16 Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Compression Pruning Quantization Pruning Quantization Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Accelerated Inference Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] Training 18
17 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] 19
18 Learning both Weights and Connections for Efficient Neural Networks, Han et al., NIPS 2015
19 Pruning Neural Networks [Lecun et al. NIPS 89] [Han et al. NIPS 15] Pruning Trained Quantization Huffman Coding 22
20 Pruning Neural Networks [Han et al. NIPS 15]: 60 million → 6M connections, 10x fewer connections. (Analogy: the negligible coefficient in -0.01x² + x + 1 can be pruned away.)
21 Pruning Neural Networks [Han et al. NIPS 15]
[Chart: accuracy loss (+0.5% to -4.5%) vs. parameters pruned away (40%–100%).]
22 With pruning alone, accuracy holds until a large fraction of parameters is removed, then drops sharply.
23 Retrain to Recover Accuracy [Han et al. NIPS 15]: retraining the surviving weights after pruning recovers the lost accuracy, allowing far more parameters to be pruned.
24 Iteratively Retrain to Recover Accuracy [Han et al. NIPS 15]: alternating pruning and retraining pushes this further, to roughly 90% of parameters pruned with no loss of accuracy.
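The prune-and-retrain loop on these slides can be written out as follows. This is a minimal NumPy sketch, not the authors' code: magnitude-based pruning with a keep-mask, raising sparsity gradually; `retrain_fn` is a stand-in for a real training loop.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights; return the
    pruned copy and the boolean keep-mask."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.sort(flat)[k] if k < flat.size else np.inf
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def iterative_prune(weights, retrain_fn, target_sparsity=0.9, steps=5):
    """Raise sparsity gradually; after each pruning step, retrain the
    surviving weights and clamp the pruned ones to zero via the mask."""
    mask = np.ones_like(weights, dtype=bool)
    for step in range(1, steps + 1):
        weights, mask = prune_by_magnitude(
            weights, target_sparsity * step / steps)
        weights = retrain_fn(weights) * mask  # retrain, then re-apply mask
    return weights, mask

w = np.random.RandomState(0).randn(100, 100)
pruned, mask = iterative_prune(w, retrain_fn=lambda x: x)  # identity stand-in
print(f"{(pruned != 0).mean():.0%} of weights remain")  # -> 10% of weights remain
```

In a real setting `retrain_fn` would run several epochs of SGD with the mask applied after every update, which is what lets the surviving weights compensate for the pruned ones.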
25 Pruning RNN and LSTM [Han et al. NIPS 15] *Karpathy et al "Deep Visual- Semantic Alignments for Generating Image Descriptions" Pruning Trained Quantization Huffman Coding 28
26 Pruning RNN and LSTM [Han et al. NIPS 15]
Original: "a basketball player in a white uniform is playing with a ball" / Pruned 90%: "a basketball player in a white uniform is playing with a basketball"
Original: "a brown dog is running through a grassy field" / Pruned 90%: "a brown dog is running through a grassy area"
Original: "a man is riding a surfboard on a wave" / Pruned 90%: "a man in a wetsuit is riding a wave on a beach"
Original: "a soccer player in red is running in the field" / Pruned 95%: "a man in a red shirt and black and white black shirt is running through a field"
27 Pruning Changes the Weight Distribution [Han et al. NIPS 15]: before pruning, after pruning, after retraining. Conv5 layer of AlexNet; representative of other network layers as well.
28 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Pruning Quantization Pruning Quantization Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] 31
29 Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Han et al. ICLR 2016 Best Paper Pruning Trained Quantization Huffman Coding 32
30 Trained Quantization [Han et al. ICLR 16]: 32-bit weights → 4-bit cluster indices, 8x less memory footprint. Nearby weight values (e.g., 2.09, 2.12, 1.92, …) are replaced by one shared value.
31 Trained Quantization [Han et al. ICLR 16] (slides 31–37 animate the same figure: the weights are clustered, the centroids form a shared codebook, and per-cluster gradient sums fine-tune the centroids)
38 Before Trained Quantization: Continuous Weights [Han et al. ICLR 16] [Histogram: count vs. weight value — a continuous distribution]
39 After Trained Quantization: Discrete Weights [Han et al. ICLR 16] [Histogram: the weights collapse onto a small set of shared values]
40 After Trained Quantization: Discrete Weights after Training [Han et al. ICLR 16] [Histogram: the shared values shift as the centroids are fine-tuned]
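Trained quantization can be sketched as k-means over the weights, where 16 clusters correspond to 4-bit indices. A minimal sketch, not the paper's implementation (the linear codebook initialization follows the paper's description; the per-cluster gradient fine-tuning step is omitted):

```python
import numpy as np

def kmeans_quantize(weights, n_clusters=16, n_iter=20):
    """Cluster weights into n_clusters shared values (16 clusters = 4-bit
    indices). Returns the codebook and the per-weight cluster index."""
    flat = weights.ravel()
    # linear initialization over the weight range
    codebook = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iter):
        # assign each weight to the nearest centroid
        idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(idx == c):
                codebook[c] = flat[idx == c].mean()  # centroid update
    return codebook, idx.reshape(weights.shape)

w = np.random.RandomState(0).randn(64, 64)
codebook, idx = kmeans_quantize(w)
quantized = codebook[idx]   # every weight becomes a shared codebook value
assert idx.max() < 16       # indices fit in 4 bits
```

During fine-tuning, the paper groups the gradients of all weights in a cluster and sums them to update the shared centroid, which is what makes the quantization "trained".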
41 [Han et al. ICLR 16] Bits Per Weight Pruning Trained Quantization Huffman Coding 44
42 [Han et al. ICLR 16] Pruning + Trained Quantization AlexNet on ImageNet Pruning Trained Quantization Huffman Coding 45
43 Huffman Coding [Han et al. ICLR 16]: infrequent weights use more bits to represent; frequent weights use fewer bits to represent.
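A minimal sketch of the idea, assuming the symbols are quantized weight indices with a skewed distribution (the dominant index ends up with a 1-bit code):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code: frequent symbols get shorter bit strings."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # heap entries: (count, tiebreak, member symbols)
    heap = [(n, i, [s]) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    codes = {s: "" for s in freq}
    tie = len(heap)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        for s in s1:
            codes[s] = "0" + codes[s]   # left branch
        for s in s2:
            codes[s] = "1" + codes[s]   # right branch
        heapq.heappush(heap, (n1 + n2, tie, s1 + s2))
        tie += 1
    return codes

# hypothetical quantized weight indices where index 3 dominates
indices = [3] * 70 + [1] * 15 + [0] * 10 + [2] * 5
codes = huffman_code(indices)
bits = sum(len(codes[s]) for s in indices)
print(f"fixed 2-bit: {2 * len(indices)} bits, Huffman: {bits} bits")
# -> fixed 2-bit: 200 bits, Huffman: 145 bits
```

The more skewed the index distribution after pruning and quantization, the larger the win; in Deep Compression this final lossless stage buys an extra 20–30% of compression.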
44 [Han et al. ICLR 16] Summary of Deep Compression Pruning Trained Quantization Huffman Coding 47
45 Results: Compression Ratio [Han et al. ICLR 16]
Network     Original  Compressed  Ratio  Orig. accuracy  Comp. accuracy
LeNet-300   1070 KB   27 KB       40x    98.36%          98.42%
LeNet-5     1720 KB   44 KB       39x    99.20%          99.26%
AlexNet     240 MB    6.9 MB      35x    80.27%          80.30%
VGGNet      550 MB    11.3 MB     49x    88.68%          89.09%
GoogleNet   28 MB     2.8 MB      10x    88.90%          88.92%
ResNet-18   44 MB     4.0 MB      11x    89.24%          89.28%
The compressed models fit in cache! Can we make compact models to begin with?
46 SqueezeNet: a squeeze layer (1x1 convolution filters, ReLU) followed by an expand layer (1x1 and 3x3 convolution filters, ReLU). Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", arXiv 2016.
47 Compressing SqueezeNet
Network     Approach           Size     Ratio  Top-1   Top-5
AlexNet     -                  240 MB   1x     57.2%   80.3%
AlexNet     SVD                48 MB    5x     56.0%   79.4%
AlexNet     Deep Compression   6.9 MB   35x    57.2%   80.3%
SqueezeNet  -                  4.8 MB   50x    57.5%   80.3%
SqueezeNet  Deep Compression   0.47 MB  510x   57.5%   80.3%
48 Results: Speedup [Chart: speedup of the pruned (sparse) model over the dense baseline on CPU, GPU, and mobile GPU, with per-platform averages.]
49 Results: Energy Efficiency [Chart: energy-efficiency gain of the pruned model over the dense baseline on CPU, GPU, and mobile GPU.]
50 Industrial Impact: Deep Compression
"At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file. As a mobile-first company, we frequently update various apps via different app stores. We're very sensitive to the size of our binary files, and a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB." — Andrew Ng
51 Challenges
- Online decompression while computing: requires special-purpose logic.
- Computation becomes irregular: sparse weights, sparse activations, indirect lookup.
- Parallelization becomes challenging: synchronization overhead, load imbalance, scalability.
52 Having Opened the Box, HW Design? Algorithm Hardware? Breaks the boundary between algorithm and hardware 55
53 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] 56
54 EIE: Efficient Inference Engine on Compressed Deep Neural Network, Han et al., ISCA 2016
55 Energy per operation (45 nm): 32-bit int ADD 0.1 pJ; 32-bit float ADD 0.9 pJ; 32-bit Register File 1 pJ; 32-bit int MULT 3.1 pJ; 32-bit float MULT 3.7 pJ; 32-bit SRAM Cache 5 pJ; 32-bit DRAM Memory 640 pJ — roughly a 1000x range. How to reduce the memory footprint?
56 Related Work
Eyeriss (MIT): dataflow [1]; TPU (Google): 8-bit integer [2]; DaDianNao (CAS): eDRAM [3]; EIE (Stanford) [this work]: compression [4].
[1] Yu-Hsin Chen, et al., "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," ISSCC 2016
[2] Norm Jouppi, "Google supercharges machine learning tasks with TPU custom chip," 2016
[3] Yunji Chen, et al., "DaDianNao: A machine-learning supercomputer," MICRO 2014
[4] Song Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," ISCA 2016
57 EIE: Efficient Inference Engine [Han et al. ISCA 16]
EIE exploits three properties of the compressed model:
- Sparse weights (90% static sparsity): W × 0 = 0 → 10x less computation, 5x less memory footprint
- Sparse activations (70% dynamic sparsity): 0 × A = 0 → 3x less computation
- Weight sharing (4-bit weights): e.g., 1.92 => shared value 2 → 8x less memory footprint
58 EIE: Parallelization on Sparsity [Han et al. ISCA 16]
Example: b = ReLU(W a), with a sparse input a = (0, a1, 0, a3) and a sparse 8x4 weight matrix:
  w0,0  w0,1  0     w0,3
  0     0     w1,2  0
  0     w2,1  0     w2,3
  0     0     0     0
  0     0     w4,2  w4,3
  w5,0  0     0     0
  0     0     0     w6,3
  0     w7,1  0     0
59 The rows are interleaved across an array of processing elements (PEs) under central control: row i goes to PE (i mod 4), so PE0 owns rows 0 and 4, PE1 owns rows 1 and 5, and so on.
60 Logically each PE sees its slice of the matrix; physically it stores only the nonzeros in a compressed-sparse-column layout: a virtual-weight array (for PE0: W0,0 W0,1 W4,2 W0,3 W4,3), a relative row index per entry (the distance from the previous nonzero, stored in 4 bits), and column pointers marking where each column starts.
61 Dataflow [Han et al. ISCA 16] (build slides 61–70 animate the same example): the nonzero activations a1 and a3 are broadcast one at a time; for each, every PE walks the matching column of its stored weights, multiplies, and accumulates into its output accumulators b; finally ReLU produces the output activations.
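The dataflow above amounts to a column-wise sparse matrix-vector multiply that skips zero activations. A plain-Python sketch of the same computation (a single "PE" over the whole matrix, with absolute rather than relative indices, and no fixed-point arithmetic):

```python
import numpy as np

def to_csc(W):
    """Column-compressed storage, like EIE's per-PE weight layout:
    nonzero values, their row indices, and per-column start pointers."""
    vals, rows, colptr = [], [], [0]
    for j in range(W.shape[1]):
        nz = np.flatnonzero(W[:, j])
        vals.extend(W[nz, j])
        rows.extend(nz)
        colptr.append(len(vals))
    return np.array(vals), np.array(rows), np.array(colptr)

def sparse_mv_relu(W, a):
    """Scan the input activations and skip zeros (dynamic sparsity);
    for each nonzero a_j, accumulate a_j times the stored nonzeros of
    column j (static sparsity); apply ReLU at the end."""
    vals, rows, colptr = to_csc(W)
    b = np.zeros(W.shape[0])
    for j, aj in enumerate(a):
        if aj == 0:
            continue  # zero activation: no work broadcast to the PEs
        for k in range(colptr[j], colptr[j + 1]):
            b[rows[k]] += vals[k] * aj
    return np.maximum(b, 0)

W = np.array([[1.0, 2, 0, 3], [0, 0, 4, 0], [0, 5, 0, 6]])
a = np.array([0.0, 1, 0, 2])
print(sparse_mv_relu(W, a))  # -> [ 8.  0. 17.]
```

The inner loop touches only stored nonzeros in columns whose activation is nonzero, which is exactly the 10x × 3x work reduction claimed on the earlier slide.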
71 EIE Architecture [Han et al. ISCA 16]: the compressed DNN model is stored in sparse format as 4-bit virtual weights plus 4-bit relative indices. At run time, a weight look-up decodes each 4-bit virtual weight into a 16-bit real weight, and an index accumulator turns relative indices into absolute addresses; the ALU then accumulates into memory to produce the prediction result.
72 Microarchitecture of each PE [Han et al. ISCA 16]: activation queue (value + index) → leading non-zero detect → pointer read (even/odd pointer SRAM banks, column start/end address) → sparse matrix access (sparse matrix SRAM: encoded weight + relative index) → weight decoder and address accumulator (absolute address) → arithmetic unit (with bypass) → source/destination activation registers → activation R/W and ReLU → activation SRAM.
73 Load Balance [Han et al. ISCA 16]: the per-PE activation queue decouples the PEs from one another.
74 Activation Sparsity: leading non-zero detection feeds only nonzero activations into the queue.
75 Weight Sparsity: pointer read (even/odd pointer SRAM banks) locates each column's start/end in the sparse matrix SRAM, so only nonzero weights are fetched.
76 Weight Sharing: the weight decoder expands each 4-bit encoded weight into a 16-bit decoded weight.
77 Address Accumulate: relative indices are accumulated into absolute addresses.
78 Arithmetic: the arithmetic unit multiplies and accumulates, with a bypass path.
79 Write Back: source/destination activation registers and activation read/write.
80 ReLU and Non-zero Detection: applied to the output activations, producing the next layer's sparsity.
81 What's Special: every stage of this PE pipeline exploits one property of the compressed model.
82 Post-Layout Results of EIE [Han et al. ISCA 16]
Technology: 45 nm; # PEs: 64; on-chip SRAM: 8 MB; max model size: 84 million parameters; static sparsity: 10x; dynamic sparsity: 3x; quantization: 4-bit; ALU width: 16-bit; area: 40.8 mm²; MxV throughput: 81,967 layers/s; power: 586 mW.
1. Post-layout result. 2. Throughput measured on AlexNet FC-7.
83 Benchmark [Han et al. ISCA 16]
CPU: Intel Core-i7 5930k; GPU: NVIDIA TitanX; Mobile GPU: NVIDIA Jetson TK1
Layer     Size        Weight density  Activation density  FLOP reduction
Alex-6    9216x4096   9%              35%                 33x
Alex-7    4096x4096   9%              35%                 33x
Alex-8    4096x1000   25%             38%                 10x
VGG-6     25088x4096  4%              18%                 100x
VGG-7     4096x4096   4%              37%                 50x
VGG-8     4096x1000   23%             41%                 10x
NT-We     4096x600    10%             100%                10x
NT-Wd     600x8791    11%             100%                10x
NT-LSTM   1201x2400   10%             100%                10x
AlexNet and VGG-16 layers for image classification; NeuralTalk RNN/LSTM layers for image captioning.
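A quick sanity check on the FLOP-reduction column: if a fraction W of the weights survives pruning (static sparsity) and a fraction A of the activations is nonzero (dynamic sparsity), roughly W x A of the dense multiply-accumulates remain. The table's densities are rounded, so the estimates land near, not exactly on, the quoted reductions:

```python
# Estimate the surviving fraction of dense FLOPs from the two densities.
def flop_fraction(weight_density, act_density):
    return weight_density * act_density

for name, wd, ad in [("Alex-6", 0.09, 0.35), ("Alex-8", 0.25, 0.38)]:
    f = flop_fraction(wd, ad)
    print(f"{name}: {f:.1%} of dense FLOPs -> ~{1 / f:.0f}x reduction")
```

For Alex-6 this gives roughly a 32x reduction against the table's 33x, and for Alex-8 roughly 11x against 10x, consistent once the rounding of the density columns is taken into account.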
84 Speedup on EIE [Han et al. ISCA 16]
[Chart (Figure 6): per-layer and geomean speedups of dense and compressed models on CPU, GPU, mobile GPU, and EIE over the dense-CPU baseline, with no batching in all cases. EIE's geomean speedup is 189x over CPU and 13x over GPU.]
85 Energy Efficiency on EIE [Han et al. ISCA 16]
[Chart (Figure 7): energy efficiency relative to the dense-CPU baseline, no batching in all cases. EIE is 24,000x more energy efficient than CPU and 3,400x more than GPU.]
86 Comparison: Throughput [Han et al. ISCA 16]
[Chart, layers/s on a log scale: EIE (45nm ASIC, 64 PEs; and a scaled 28nm, 256-PE version) vs. CPU (Core-i7 5930k, 22nm), GPU (TitanX, 28nm), mGPU (Tegra K1, 28nm), FPGA (A-Eye, 28nm), and ASICs (DaDianNao, TrueNorth, 28nm).]
87 Comparison: Energy Efficiency [Han et al. ISCA 16]
[Chart, layers/J on a log scale, same platforms: EIE leads by orders of magnitude.]
88 Scalability [Han et al. ISCA 16]
Figure 11. System scalability: speedups measured with 1–256 PEs across Alex-6 … NT-LSTM are near-linear.
#PEs → speedup: 64 PEs: 64x; 128 PEs: 124x; 256 PEs: 210x
89 Load Balancing [Han et al. ISCA 16]
[Chart (Figure 8): load efficiency (0–100%) vs. FIFO size (1–256) for each benchmark layer. Load efficiency improves as the FIFO size increases; beyond a FIFO depth of 8 the marginal gain quickly diminishes, so depth 8 is chosen.]
Imbalanced non-zeros among PEs degrade system utilization; this load imbalance is absorbed by the FIFO. With FIFO depth = 8, ALU utilization is > 80%.
90 Can we do better with load imbalance? Feedforward => Recurrent neural network? 93
91 Small Fast Agenda Model Compression (Small) Pruning [NIPS 15] Trained Quantization [ICLR 16] Accurate Energy Efficient Compression Hardware Acceleration (Fast, Efficient) EIE Accelerator [ISCA 16] ESE Accelerator [FPGA 17] Efficient Training (Accurate) Dense-Sparse-Dense Regularization [ICLR 17] 94
92 ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA Han et al. FPGA 2017 Best Paper 95
93 Accelerating Recurrent Neural Networks speech recognition image caption machine translation visual question answering The recurrent nature of RNN/LSTM produces complicated data dependency, which is more challenging than feedforward neural nets. 96
94 Rethinking Model Compression [Han et al. FPGA 17] Compression Pruning Quantization load balance-aware pruning Accelerated Inference 97
95 Pruning Leads to Load Imbalance [Han et al. FPGA 17]
With rows interleaved across 4 PEs, unstructured pruning leaves an unbalanced 5, 2, 4, and 1 nonzeros on the four PEs. Every PE waits for the slowest one, so the matrix-vector multiply takes 5 cycles overall.
96 Load-Balance-Aware Pruning [Han et al. FPGA 17] (build slides 96–98)
Prune each PE's submatrix to the same number of nonzeros instead: balanced at 3, 3, 3, and 3 nonzeros, the multiply finishes in 3 cycles overall instead of 5.
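Load-balance-aware pruning can be sketched as per-PE top-k magnitude pruning: rows interleave across PEs and each PE's submatrix keeps the same nonzero budget. A NumPy sketch (the 8x4 matrix and 4-PE layout mirror the slide's example; `keep_per_pe` is an illustrative name, not from the paper):

```python
import numpy as np

def prune_per_pe(W, keep_per_pe, n_pes=4):
    """Rows interleave across PEs (row i -> PE i % n_pes); each PE's
    submatrix keeps exactly keep_per_pe largest-magnitude weights, so
    no PE becomes the straggler."""
    W = W.copy()
    for pe in range(n_pes):
        sub = W[pe::n_pes]                   # this PE's rows (a view)
        flat = np.abs(sub).ravel()
        k = flat.size - keep_per_pe          # number of weights to drop
        if k > 0:
            thresh = np.partition(flat, k - 1)[k - 1]
            sub[np.abs(sub) <= thresh] = 0   # writes through the view
    return W

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                  # 8x4 matrix, as on the slide
Wp = prune_per_pe(W, keep_per_pe=3)
per_pe = [int((Wp[pe::4] != 0).sum()) for pe in range(4)]
print(per_pe)  # -> [3, 3, 3, 3]: every PE finishes in 3 cycles
```

Because the constraint is enforced per PE rather than globally, the overall sparsity is the same while the worst-case PE workload, and hence the cycle count, drops.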
99 Load-Balance-Aware Pruning: Same Accuracy [Han et al. FPGA 17]
[Chart: phone error rate vs. parameters pruned away (0–100%), with and without load balance — the two curves track each other, with a sweet spot around 90% pruned.]
100 Load Balance Aware Pruning: Better Speedup [Han et al. FPGA 17] [Figure: speedup vs. parameters pruned away; pruning with load balance reaches 6.2x speedup over the dense baseline, versus 5.5x without load balance.]
101 Hardware Architecture [Han et al. FPGA 17] [Figure: ESE system diagram. A CPU running the software program and external memory connect over PCIE to the FPGA. On chip, the ESE controller, memory controller, and data bus feed input/output buffers and multiple channels, each holding several PEs. Each PE contains an activation queue (FIFO), a pointer read unit with pointer buffers, a sparse matrix read unit with weight buffers, an SpMV accumulate unit, and activation buffers; shared logic performs element-wise multiplication, an adder tree, and sigmoid/tanh activation to assemble the LSTM outputs y_t, c_t, and h_t.]
102 Speedup and Energy Efficiency

Platform   Latency    Power   Speedup   Energy Efficiency
GPU        240 us     202 W   1x        1x
CPU        6017 us    111 W   0.04x     0.07x
ESE        82.7 us    41 W    3x        14x

Table 6: ESE Resource Utilization on Xilinx KU060.

             LUT       LUTRAM    FF        BRAM    DSP
Available    331,680   146,880   663,360   1,080   2,760
Used         293,920   69,939    453,068   947     1,504
Utilization  88.6%     47.6%     68.3%     87.7%   54.5%
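The speedup and energy-efficiency columns follow directly from the measured latency and power, taking energy per inference as latency times power with the GPU as the 1x baseline. A quick check of the arithmetic:

```python
# Reproduce the speedup / energy-efficiency ratios from the latency and
# power measurements on the slide (GPU is the 1x baseline).
platforms = {
    "GPU": {"latency_us": 240.0, "power_w": 202.0},
    "CPU": {"latency_us": 6017.0, "power_w": 111.0},
    "ESE": {"latency_us": 82.7, "power_w": 41.0},
}
base = platforms["GPU"]
base_energy = base["latency_us"] * base["power_w"]  # energy/inference ~ latency x power
for name, p in platforms.items():
    speedup = base["latency_us"] / p["latency_us"]
    efficiency = base_energy / (p["latency_us"] * p["power_w"])
    print(f"{name}: {speedup:.2f}x speedup, {efficiency:.2f}x energy efficiency")
```

Note that ESE wins on energy efficiency (14x) by more than on raw speed (3x), because it is both faster and roughly 5x lower power than the GPU.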
103 From Compression to Acceleration. Challenge 1: memory access is expensive. Deep Compression: 10x-49x smaller, no loss of accuracy. Challenge 2: sparsity, indirection, load balance. EIE / ESE Accelerator: energy-efficient accelerated inference.
104 What about Training? If a compressed model reaches the same accuracy as the original, can a model trained at the original size reach higher accuracy?
105 Agenda: Model Compression (Small): Pruning [NIPS 15], Trained Quantization [ICLR 16]. Hardware Acceleration (Fast, Efficient): EIE Accelerator [ISCA 16], ESE Accelerator [FPGA 17]. Efficient Training (Accurate): Dense-Sparse-Dense Regularization [ICLR 17].
106 DSD: Dense-Sparse-Dense Training for Deep Neural Networks Han et al. ICLR 2017
107 DSD: Dense Sparse Dense Training [Han et al. ICLR 2017] Dense → (Pruning, sparsity constraint) → Sparse → (Re-Dense, increase model capacity) → Dense. DSD keeps the same model architecture but finds a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy across a wide range of deep neural networks (CNNs, RNNs, LSTMs).
108 [Han et al. ICLR 2017] DSD: Intuition. Learn the trunk first, then learn the leaves.
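The three-phase schedule can be sketched on a toy objective. This is an illustrative sketch of the dense/sparse/re-dense idea, with a quadratic loss, plain gradient descent, and a single magnitude-pruning step standing in for real training (all hyperparameters and the `grad_fn` interface are assumptions, not the ICLR '17 recipe):

```python
import numpy as np

def dsd_train(W, grad_fn, lr=0.1, steps=50, sparsity=0.5):
    """Dense -> Sparse -> Dense training schedule (illustrative sketch)."""
    # Phase 1: ordinary dense training.
    for _ in range(steps):
        W = W - lr * grad_fn(W)
    # Phase 2: fix a magnitude-pruning mask and train under the
    # sparsity constraint (the regularizing "trunk" phase).
    thresh = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= thresh
    W = W * mask
    for _ in range(steps):
        W = (W - lr * grad_fn(W)) * mask
    # Phase 3: re-dense -- drop the mask; pruned weights restart from zero
    # and the model recovers its full capacity (the "leaves").
    for _ in range(steps):
        W = W - lr * grad_fn(W)
    return W

# Toy quadratic objective ||W - T||^2, just to exercise the schedule.
T = np.array([[1.0, -2.0], [0.1, 3.0]])
W = dsd_train(np.zeros_like(T), grad_fn=lambda W: 2 * (W - T))
```

On a real network the point is not the final loss of this toy but the schedule itself: the sparse phase restricts optimization to the high-magnitude weights before the re-dense phase releases the rest.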
109 [Han et al. ICLR 2017] DSD is General Purpose: Vision, Speech, Natural Language

Network       Domain   Dataset    Type   Baseline  DSD     Abs. Imp.  Rel. Imp.
GoogleNet     Vision   ImageNet   CNN    31.1%     30.0%   1.1%       3.6%
VGG-16        Vision   ImageNet   CNN    31.5%     27.2%   4.3%       13.7%
ResNet-18     Vision   ImageNet   CNN    30.4%     29.3%   1.1%       3.7%
ResNet-50     Vision   ImageNet   CNN    24.0%     23.2%   0.9%       3.5%
NeuralTalk    Caption  Flickr-8K  LSTM   -         -       -          -
DeepSpeech    Speech   WSJ '93    RNN    33.6%     31.6%   2.0%       5.8%
DeepSpeech-2  Speech   WSJ '93    RNN    14.5%     13.4%   1.1%       7.4%

Open Sourced DSD Model Zoo. The baseline results of AlexNet, VGG16, GoogleNet, and SqueezeNet are from the Caffe Model Zoo; ResNet18 and ResNet50 are from fb.resnet.torch.
110 Related Work. Dropout [1] and DropConnect [2]: Dropout uses a random sparsity pattern, while DSD training learns with a deterministic, data-driven sparsity pattern. Distillation [3]: transfers knowledge from a large model to a small model; like DSD, Distillation incurs no architectural changes. [1] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." Journal of Machine Learning Research 15.1 (2014): 1929-1958. [2] Wan, Li, et al. "Regularization of neural networks using DropConnect." Proceedings of the 30th International Conference on Machine Learning (ICML-13). 2013. [3] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
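The contrast with Dropout can be made concrete: Dropout samples a fresh random mask at every training step, while DSD's sparse phase derives one fixed mask from the weight magnitudes and reuses it throughout the phase. A small sketch (names and the 50% rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))

# Dropout: a new random sparsity pattern every step, independent of W.
dropout_mask_step1 = rng.random(W.shape) > 0.5
dropout_mask_step2 = rng.random(W.shape) > 0.5

# DSD sparse phase: one deterministic, data-driven pattern --
# keep the larger-magnitude half of the weights, every step.
dsd_mask = np.abs(W) >= np.median(np.abs(W))
```

The DSD mask depends only on the learned weights, so recomputing it always yields the same pattern; the Dropout masks are independent of the weights and of each other.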
111 Summary: Model Compression (Small): Pruning [NIPS 15], Trained Quantization [ICLR 16]. Hardware Acceleration (Fast, Efficient): EIE Accelerator [ISCA 16], ESE Accelerator [FPGA 17]. Efficient Training (Accurate): Dense-Sparse-Dense Regularization [ICLR 17].
112 Summary [Figure: a quadrant diagram spanning Algorithm vs. Hardware and Inference vs. Training, linked by sparsity.]
113 Summary. Smaller Size: Deep Compression (Algorithm, Inference). Higher Accuracy: DSD Regularization (Algorithm, Training). Better Speed, Energy Efficiency: EIE / ESE Accelerator (Hardware, Inference). Linked by sparsity.
114 Summary. Smaller Size: Deep Compression. Higher Accuracy: DSD Regularization. Better Speed, Energy Efficiency: EIE / ESE Accelerator. Future work: efficient hardware for training. Applications: phones, drones, robots, self-driving cars, AI in the cloud.
115 Future: Smart, Low Latency, Privacy, Mobility, Energy-Efficient.
116 Outlook: the Path for Computation. PC Computing, Mobile Computing, Brain-Inspired Cognitive Computing; from Mobile-First to AI-First. (Sundar Pichai, Google I/O, 2016)
117 Thank you! stanford.edu/~songhan. Recap: Training (Smart): DSD, Dense Sparse Dense with a re-dense phase to increase model capacity (Han et al. ICLR 17). Compression: Pruning and Quantization (Han et al. NIPS 15; Han et al. ICLR 16, Best Paper Award). Accelerated Inference (Fast, Efficient): EIE / ESE (Han et al. ISCA 16; Han et al. FPGA 17, Best Paper Award). [Figure: EIE chip layout showing SpMat, pointer, activation, and arithmetic units.]
More informationDetermination of Linear Force- Free Magnetic Field Constant αα Using Deep Learning
Determination of Linear Force- Free Magnetic Field Constant αα Using Deep Learning Bernard Benson, Zhuocheng Jiang, W. David Pan Dept. of Electrical and Computer Engineering (Dept. of ECE) G. Allen Gary
More information