arxiv: v2 [cs.cv] 17 Jul 2018 ABSTRACT

PACT: PARAMETERIZED CLIPPING ACTIVATION FOR QUANTIZED NEURAL NETWORKS

Jungwook Choi 1, Zhuo Wang 2, Swagath Venkataramani 2, Pierce I-Jen Chuang 1, Vijayalakshmi Srinivasan 1, Kailash Gopalakrishnan 1
IBM Research AI, Yorktown Heights, NY 10598, USA
1 {choij,pchuang,viji,kailash}@us.ibm.com
2 {zhuo.wang,swagath.venkataramani}@ibm.com

arXiv: v2 [cs.CV] 17 Jul 2018

ABSTRACT

Deep learning algorithms achieve high classification accuracy at the expense of significant computation cost. To address this cost, a number of quantization schemes have been proposed - but most of these techniques focused on quantizing weights, which are relatively smaller in size compared to activations. This paper proposes a novel quantization scheme for activations during training - that enables neural networks to work well with ultra low precision weights and activations without any significant accuracy degradation. This technique, PArameterized Clipping acTivation (PACT), uses an activation clipping parameter α that is optimized during training to find the right quantization scale. PACT allows quantizing activations to arbitrary bit precisions, while achieving much better accuracy relative to published state-of-the-art quantization schemes. We show, for the first time, that both weights and activations can be quantized to 4 bits of precision while still achieving accuracy comparable to full precision networks across a range of popular models and datasets. We also show that exploiting these reduced-precision computational units in hardware can enable a super-linear improvement in inferencing performance due to a significant reduction in the area of accelerator compute engines coupled with the ability to retain the quantized model and activation data in on-chip memories.

1 INTRODUCTION

Deep Convolutional Neural Networks (CNNs) have achieved remarkable accuracy for tasks in a wide range of application domains including image processing (He et al. (2016b)), machine translation (Gehring et al.
(2017)), and speech recognition (Zhang et al. (2017)). These state-of-the-art CNNs use very deep models, consuming hundreds of ExaOps of computation during training and GBs of storage for model and data. This poses a tremendous challenge to widespread deployment, especially in resource-constrained edge environments, and has led to a plethora of explorations in compressed models that minimize memory footprint and computation while preserving model accuracy as much as possible.

Recently, a whole host of different techniques have been proposed to alleviate these computational costs. Among them, reducing the bit-precision of key CNN data structures, namely weights and activations, has gained attention due to its potential to significantly reduce both storage requirements and computational complexity. In particular, several weight quantization techniques (Li & Liu (2016) and Zhu et al. (2017)) showed significant reduction in the bit-precision of CNN weights with limited accuracy degradation. However, prior work (Hubara et al. (2016b); Zhou et al. (2016)) has shown that a straightforward extension of weight quantization schemes to activations incurs significant accuracy degradation in large-scale image classification tasks such as ImageNet (Russakovsky et al. (2015)). Recently, activation quantization schemes based on greedy layer-wise optimization were proposed (Park et al. (2017); Graham (2017); Cai et al. (2017)), but they achieve limited accuracy improvement. In this paper, we propose a novel activation quantization technique, PArameterized Clipping acTivation function (PACT), that automatically optimizes the quantization scales during model training.

Zhuo Wang is now an employee at Google.

PACT allows significant reductions in the bit-widths needed to represent both weights and activations and opens up new opportunities for trading off hardware complexity with model accuracy. The primary contributions of this work include:

1) PACT: A new activation quantization scheme for finding the optimal quantization scale during training. We introduce a new parameter α that is used to represent the clipping level in the activation function and is learned via back-propagation. α sets the quantization scale smaller than ReLU to reduce the quantization error, but larger than a conventional clipping activation function (used in previous schemes) to allow gradients to flow more effectively. In addition, regularization is applied to α in the loss function to enable faster convergence. We provide reasoning and analysis on the expected effectiveness of PACT in preserving model accuracy.

2) Quantitative results demonstrating the effectiveness of PACT on a spectrum of models and datasets. Empirically, we show that: (a) for extremely low bit-precision (≤ 2 bits for weights and activations), PACT achieves the highest model accuracy compared to all published schemes and (b) 4-bit quantized CNNs based on PACT achieve accuracies similar to single-precision floating point representations.

3) System performance analysis to demonstrate the trade-offs in hardware complexity for different bit representations vs. model accuracy. We show that a dramatic reduction in the area of the computing engines is possible and use it to estimate the achievable system-level performance gains.

The rest of the paper is organized as follows: Section 2 provides a summary of related prior work on quantized CNNs. Challenges in activation quantization are presented in Section 3. We present PACT, our proposed solution for activation quantization, in Section 4. In Section 5 we demonstrate the effectiveness of PACT relative to prior schemes using experimental results on popular CNNs.
Overall system performance analysis for a representative hardware system is presented in Section 6, demonstrating the observed trade-offs in hardware complexity for different bit representations.

2 RELATED WORK

Recently, a whole host of different techniques have been proposed to minimize CNN computation and storage costs. Early studies of weight quantization schemes (Hwang & Sung (2014) and Courbariaux et al. (2015)) show that it is indeed possible to quantize weights to 1 bit (binary) or 2 bits (ternary), enabling an entire DNN model to fit effectively in resource-constrained platforms (e.g., mobile devices). The effectiveness of weight quantization techniques has been further improved (Li & Liu (2016) and Zhu et al. (2017)) by ternarizing weights using the statistical distribution of weight values or by tuning quantization scales during training. However, the gain in system performance is limited when only weights are quantized while activations are left in high precision. This is particularly severe in convolutional neural networks (CNNs) since weights are relatively smaller in convolution layers in comparison to fully-connected (FC) layers.

To reduce the overhead of activations, prior work (Kim & Smaragdis (2015), Hubara et al. (2016a), and Rastegari et al. (2016)) proposed the use of fully binarized neural networks where activations are quantized using 1 bit as well. More recently, activation quantization schemes using more general selections in bit-precision (Hubara et al. (2016b); Zhou et al. (2016; 2017); Mishra et al. (2017); Mellempudi et al. (2017)) have been studied. However, these techniques show significant degradation in accuracy (> 1%) for ImageNet tasks (Russakovsky et al. (2015)) when bit precision is reduced significantly (≤ 2 bits). Improvements to previous logarithmic quantization schemes (Miyashita et al. (2016)) using modified base and offset based on weighted entropy of activations have also been studied (Park et al. (2017)).
Graham (2017) suggests that the normalized activation produced during batch normalization (Ioffe & Szegedy (2015), BatchNorm) is a good candidate for quantization. Cai et al. (2017) further exploit the statistics of activations and propose variants of the ReLU activation function for better quantization. However, such schemes typically rely on local (and greedy) optimizations, and are therefore not adaptable or optimized effectively during training. This is further elaborated in Section 3 where we present a detailed discussion on the challenges in quantizing activations.

3 CHALLENGES IN ACTIVATION QUANTIZATION

Quantization of weights is equivalent to discretizing the hypothesis space of the loss function with respect to the weight variables. Therefore, it is indeed possible to compensate weight quantization errors during model training (Hwang & Sung, 2014; Courbariaux et al., 2015). Traditional activation functions, on the other hand, do not have any trainable parameters, and therefore the errors arising from quantizing activations cannot be directly compensated using back-propagation.

Activation quantization becomes even more challenging when ReLU (the activation function most commonly used in CNNs) is used as the layer activation function (ActFn). ReLU allows the gradient of activations to propagate through deep layers and therefore achieves superior accuracy relative to other activation functions (Nair & Hinton (2010)). However, as the output of the ReLU function is unbounded, the quantization after ReLU requires a high dynamic range (i.e., more bit-precision). In Fig. 1 we present the training and validation errors of ResNet20 with the CIFAR10 dataset using ReLU and show that accuracy degrades significantly when the ReLU output is quantized.

Figure 1: (a) Training error, (b) Validation error across epochs for different activation functions (ReLU and clipping) with and without quantization for the ResNet20 model using the CIFAR10 dataset.

It has been shown that this dynamic range problem can be alleviated by using a clipping activation function, which places an upper bound on the output (Hubara et al. (2016b); Zhou et al. (2016)). However, because of layer-to-layer and model-to-model differences, it is difficult to determine a globally optimal clipping value. In addition, as shown in Fig. 1, even though the training error obtained using clipping with quantization is less than that obtained with quantized ReLU, the validation error is still noticeably higher than the baseline.
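The dynamic-range argument above can be made concrete with a small sketch. It assumes a plain uniform k-bit quantizer over [0, upper bound]; the helper names and the sample activations are hypothetical and are not taken from the paper's experiments.

```python
def quant_step(upper_bound, k):
    """Step size of a k-bit uniform quantizer covering [0, upper_bound]."""
    return upper_bound / (2 ** k - 1)


def quantize(y, upper_bound, k):
    """Clip y to [0, upper_bound], then round to the nearest quantization level."""
    step = quant_step(upper_bound, k)
    clipped = min(max(y, 0.0), upper_bound)
    return round(clipped / step) * step


def mean_abs_error(values, upper_bound, k):
    """Average absolute quantization error over a list of activations."""
    return sum(abs(v - quantize(v, upper_bound, k)) for v in values) / len(values)


# Hypothetical activations: a bulk of small values and one large outlier.
bulk = [0.1, 0.4, 0.7, 1.2, 1.9]
outlier = 12.0
K = 2  # bits

# Unbounded (ReLU-like) case: the range must stretch to cover the outlier.
bulk_error_relu = mean_abs_error(bulk, outlier, K)
# Clipped case: the range is capped at 2.0, so the step is six times finer.
bulk_error_clipped = mean_abs_error(bulk, 2.0, K)
# The price of clipping: the outlier itself is now represented as 2.0.
outlier_error_clipped = abs(outlier - quantize(outlier, 2.0, K))
```

In this toy setting the bulk of the activations is represented far more accurately once the range is capped, while the outlier absorbs a large clipping error. Choosing the cap is therefore a trade-off, which is why a single hand-picked clipping value is hard to get right across layers and models.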
Recently, this challenge has been partially addressed by applying a half-wave Gaussian quantization scheme to activations (Cai et al. (2017)). Based on the observation that activation after BatchNorm normalization is close to a Gaussian distribution with zero mean and unit variance, they used Lloyd's algorithm to find the optimal quantization scale for this Gaussian distribution and use that scale for every layer. However, this technique also does not fully utilize the strength of back-propagation to optimally learn the clipping level because all the quantization parameters are determined offline and remain fixed throughout the training process.

4 PACT: PARAMETERIZED CLIPPING ACTIVATION FUNCTION

Building on these insights, we introduce PACT, a new activation quantization scheme in which the ActFn has a parameterized clipping level, α. α is dynamically adjusted via gradient descent-based training with the objective of minimizing the accuracy degradation arising from quantization. In PACT, the conventional ReLU activation function in CNNs is replaced with the following:

$$ y = \mathrm{PACT}(x) = 0.5\,(|x| - |x - \alpha| + \alpha) = \begin{cases} 0, & x \in (-\infty, 0) \\ x, & x \in [0, \alpha) \\ \alpha, & x \in [\alpha, +\infty) \end{cases} \qquad (1) $$

where α limits the range of activation to [0, α]. The truncated activation output is then linearly quantized to k bits for the dot-product computations, where

$$ y_q = \mathrm{round}\left(y \cdot \frac{2^k - 1}{\alpha}\right) \cdot \frac{\alpha}{2^k - 1} \qquad (2) $$

With this new activation function, α is a variable in the loss function, whose value can be optimized during training. For back-propagation, the gradient ∂y_q/∂α can be computed using the Straight-Through Estimator (STE) (Bengio et al. (2013)) to estimate ∂y_q/∂y as 1. Thus,

$$ \frac{\partial y_q}{\partial \alpha} = \frac{\partial y_q}{\partial y} \frac{\partial y}{\partial \alpha} = \begin{cases} 0, & x \in (-\infty, \alpha) \\ 1, & x \in [\alpha, +\infty) \end{cases} \qquad (3) $$

The larger the α, the more the parameterized clipping function resembles a ReLU ActFn. To avoid large quantization errors due to a wide dynamic range, we include an L2-regularizer for α in the loss function. Fig. 2 illustrates how the value of α changes during full-precision training of CIFAR10-ResNet20 starting with an initial value of 10 and using the L2-regularizer. It can be observed that α converges to values much smaller than the initial value as the training epochs proceed, thereby limiting the dynamic range of activations and minimizing quantization loss.

Figure 2: Evolution of α values during training using a ResNet20 model on the CIFAR10 dataset.

4.1 UNDERSTANDING HOW PARAMETERIZED CLIPPING WORKS

When activation is quantized, the overall behavior of network parameters is affected by the quantization error during training. To observe the impact of activation quantization during network training, we sweep the clipping parameter α and record the training loss with and without quantization. Figs. 3a,b and 3c show cross-entropy and training loss (cross-entropy + regularization), respectively, over a range of α for the pre-trained SVHN network. The loaded network is trained with the proposed quantization scheme in which ReLU is replaced with the proposed parameterized clipping ActFn for each of its seven convolution layers.
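As a concrete reading of Eqs. 1-3, the scalar sketch below implements the clipping function, the k-bit linear quantizer, and the STE gradient with respect to α. It is a minimal illustration in plain Python, not the authors' TensorFlow implementation; the function names are hypothetical.

```python
def pact(x, alpha):
    """Eq. 1: clip x to [0, alpha] via the closed form 0.5 * (|x| - |x - alpha| + alpha)."""
    return 0.5 * (abs(x) - abs(x - alpha) + alpha)


def pact_quantized(x, alpha, k):
    """Eq. 2: linearly quantize the clipped activation to k bits."""
    scale = (2 ** k - 1) / alpha
    return round(pact(x, alpha) * scale) / scale


def grad_alpha(x, alpha):
    """Eq. 3: d(y_q)/d(alpha), with d(y_q)/dy estimated as 1 by the STE."""
    return 1.0 if x >= alpha else 0.0


# The closed form agrees with the piecewise definition in each region.
below = pact(-1.0, 2.0)    # x < 0: output is 0
inside = pact(0.5, 2.0)    # 0 <= x < alpha: output is x
above = pact(5.0, 2.0)     # x >= alpha: output is alpha
```

Only inputs at or above the clipping level contribute a nonzero gradient to α, so the update to α in a layer is driven by the activations that are currently being clipped.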
We sweep the value of α one layer at a time, keeping all other parameters (weight (W), bias (b), BatchNorm parameters (β, γ), and the α of other layers) fixed when computing the cross-entropy and training loss. The cross-entropy computed via the full-precision forward pass of training is shown in Fig. 3a. In this case, the cross-entropy converges to a small value in many layers as α increases, indicating that ReLU is a good activation function when no quantization is applied. But even for the full-precision case, training the clipping parameter α may help reduce the cross-entropy for certain layers; for example, ReLU (i.e., α = ∞) is not optimal for the act0 and act6 layers. Next, the cross-entropy computed with quantization in the forward pass is shown in Fig. 3b. With quantization, the cross-entropy increases in most cases as α increases, implying that ReLU is no longer

effective. We also observe that the optimal α has different ranges for different layers, motivating the need to "learn" the quantization scale via training. In addition, we observe plateaus of cross-entropy for certain ranges of α (e.g., act6), leading to difficulties for gradient descent-based training. Finally, in Fig. 3c, we show the total training loss including both the cross-entropy discussed above and the cost from α regularization. The regularization effectively gets rid of the plateaus in the training loss, thereby favoring convergence for gradient descent-based training. At the same time, α regularization does not perturb the global minimum point. For example, the solid circles in Fig. 3c, which are the optimal α extracted from the pre-trained model, are at the minimum of the training loss curves. The regularization coefficient, λ_α, discussed in the next section, is an additional hyper-parameter which controls the impact of regularization on α.

Figure 3: Cross-entropy vs α for SVHN image classification. (a) Cross-entropy in full-precision, (b) Cross-entropy with quantization, (c) Training loss.

4.2 EXPLORATION OF HYPER-PARAMETERS

For this new quantization approach, we studied the scope of α, the choice of initial values of α, and the impact of regularizing α. We briefly summarize our findings below, and present more detailed analysis in Appendix A.

From our experiments, the best scope for α was to share α per layer. This choice also reduces hardware complexity because α needs to be multiplied only once after all multiply-accumulate (MAC) operations in reduced-precision in a layer are completed. Among initialization choices for α, we found it to be advantageous to initialize α to a larger value relative to typical values of activation, and then apply regularization to reduce it during training. Finally, we observed that applying L2-regularization for α with the same regularization parameter λ used for weight works reasonably well.
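One way to write the training loss described here is sketched below. The source states only that an L2-regularizer on α is added to the cross-entropy and that its coefficient can be set equal to the weight regularization coefficient λ. The exact form, the layer index l, and the absence of constant factors are assumptions made for illustration.

$$ L_{\mathrm{train}} = L_{\mathrm{CE}} + \lambda \sum_{l} \lVert W_l \rVert_2^2 + \lambda_\alpha \sum_{l} \alpha_l^2 $$

Under this assumed form, with α shared per layer, the last term adds one scalar penalty per quantized layer, and its gradient 2 λ_α α_l pulls each clipping level toward smaller values even where the cross-entropy is flat in α.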
We also observed that, as expected, the optimal value for λ_α slightly decreases when higher bit-precision is used because more quantization levels result in higher resolution for activation quantization. Additionally, we follow the practice of many other quantized CNN studies (e.g., Hubara et al. (2016b); Zhou et al. (2016)), and do not quantize the first and last layers, as these have been reported to significantly impact accuracy.

5 EXPERIMENTS

We implemented PACT in TensorFlow (Abadi et al. (2015)) using Tensorpack (Zhou et al. (2016)). To demonstrate the effectiveness of PACT, we studied several well-known CNNs. The following is a summary of the Dataset-Network pairs for the tested CNNs. More implementation details can be found in Appendix B. Note that the baseline networks use the same hyper-parameters and ReLU activation functions as described in the references. For PACT experiments, we only replace ReLU with PACT and keep the same hyper-parameters. In all cases, the networks are trained from scratch.

CIFAR10-ResNet20 (CIFAR10, Krizhevsky & Hinton (2010)): a convolution (CONV) layer followed by 3 ResNet blocks (16 CONV layers with 3x3 filter) and a final fully-connected (FC) layer.
SVHN-SVHN (SVHN, Netzer et al. (2011)): 7 CONV layers followed by 1 FC layer.

IMAGENET-AlexNet (AlexNet, Krizhevsky et al. (2012)): 5 parallel-conv layers followed by 3 FC layers. BatchNorm is used before ReLU.
IMAGENET-ResNet18 (ResNet18, He et al. (2016b)): a CONV layer followed by 8 ResNet blocks (16 CONV layers with 3x3 filter) and a final FC layer. The "full pre-activation" ResNet structure (He et al. (2016a)) is employed.
IMAGENET-ResNet50 (ResNet50, He et al. (2016b)): a CONV layer followed by 16 ResNet bottleneck blocks (total 48 CONV layers) and a final FC layer. The "full pre-activation" ResNet structure (He et al. (2016a)) is employed.

Figure 4: (a-e) Training and validation error with different bit-precision for various CNNs: (a) CIFAR10, (b) SVHN, (c) AlexNet, (d) ResNet18, (e) ResNet50. (f) Comparison of accuracy degradation for ResNet18 (left) and ResNet50 (right). The lower the better.

For comparisons, we include accuracy results reported in the following prior work: DoReFa (Zhou et al. (2016)), BalancedQ (Zhou et al. (2017)), WRPN (Mishra et al. (2017)), FGQ (Mellempudi et al. (2017)), WEP (Park et al. (2017)), LPBN (Graham (2017)), and HWGQ (Cai et al. (2017)). The detailed experimental setting for each of these papers, as well as a full comparison of accuracy (top-1 and top-5) for AlexNet, ResNet18, and ResNet50, can be found in Appendix C. In the following section, we present key results demonstrating the effectiveness of PACT relative to prior work.

Figure 5: Comparison of accuracy degradation (Top-1) for (a) AlexNet, (b) ResNet18, and (c) ResNet50.

5.1 ACTIVATION QUANTIZATION PERFORMANCE

We first evaluate our activation quantization scheme using various CNNs. Fig. 4 shows training and validation error of PACT for the tested CNNs. Overall, the higher the bit-precision, the closer the training/validation errors are to the full-precision reference. Specifically, it can be seen that training using bit-precision higher than 3 bits converges almost identically to the full-precision baseline. The final validation error has less than 1% difference relative to the full-precision validation error for all cases when the activation bit-precision is at least 4 bits.

We further compare activation quantization performance with 3 previous schemes: DoReFa, LPBN, and HWGQ. We use accuracy degradation as the quantization performance metric, which is calculated as the difference between full-precision accuracy and the accuracy for each quantization bit-precision. Fig. 4f shows accuracy degradation (top-1) for ResNet18 (left) and ResNet50 (right) for increasing activation bit-precision, when the same weight bit-precision is used for each quantization scheme (indicated within the parentheses). Overall, we observe that accuracy degradation is reduced as we increase the bit-precision of activations. For both ResNet18 and ResNet50, PACT achieves consistently lower accuracy degradation compared to the other quantization schemes, demonstrating the robustness of PACT relative to prior quantization approaches.

5.2 PACT PERFORMANCE FOR QUANTIZED CNNS

In this section, we demonstrate that although PACT targets activation quantization, it does not preclude us from using weight quantization as well. We used PACT to quantize the activations of CNNs, and the DoReFa scheme to quantize weights. Table 1 summarizes top-1 accuracy of PACT for the tested CNNs (CIFAR10, SVHN, AlexNet, ResNet18, and ResNet50).
We also show the accuracy of CNNs when both the weights and activations are quantized by DoReFa's scheme. As can be seen, with 4-bit precision for both weights and activations, PACT achieves full-precision accuracy consistently across the networks tested. To the best of our knowledge, this is the lowest bit precision for both weights and activations ever reported that can achieve near (≤ 1%) full-precision accuracy.

We further compare the performance of PACT-based quantized CNNs with 7 previous quantization schemes (DoReFa, BalancedQ, WRPN, FGQ, WEP, LPBN, and HWGQ). Fig. 5 shows a comparison of accuracy degradation (top-1) for AlexNet, ResNet18, and ResNet50. Overall, the accuracy degradation decreases as bit-precision for activation or weight increases. For example, in Fig. 5a, the accuracy degradation decreases when activation bit-precision increases given the same weight precision or when weight bit-precision increases given the same activation bit-precision. PACT outperforms other schemes for all the cases. In fact, AlexNet even achieves marginally better accuracy (i.e., negative accuracy degradation) using PACT instead of full-precision.

6 SYSTEM-LEVEL PERFORMANCE GAIN

In this section, we demonstrate the gain in system performance as a result of the reduction in bit-precision achieved using PACT-CNN. To this end, as shown in Fig. 6(a), we consider a DNN

accelerator system comprising a DNN accelerator chip, made up of multiple cores, interfaced with an external memory. Each core consists of a 2D-systolic array of fixed-point multiply-and-accumulate (MAC) processing elements on which DNN layers are executed. Each core also contains an on-chip memory, which stores the operands that are fed into the MAC processing array.

Figure 6: (a) System architecture and parameters, (b) Variation in MAC area with bit-precision, and (c) Speedup at different quantizations for inference using the ResNet50 DNN. Panel (a) shows an external memory attached to a DNN accelerator chip whose cores (systolic MAC array plus memory) are connected by an on-chip interconnect network. System architectural parameters: MAC array: 8 rows x 64 cols; Frequency: 2 GHz; Num. Cores: 4 (16A-16W), 8 (8A-8W), 16 (4A-4W), 32 (2A-2W); On-chip Memory: 16 MB. Panel (b) plots normalized area against configuration (16A-16W, 8A-8W, 4A-4W, 2A-2W). Panel (c) plots images/sec against configuration, with observed and linear curves for unconstrained bandwidth, 32 GBps, and 64 GBps.

To estimate system performance at different bit precisions, we studied different versions of the DNN accelerator, each comprising the same amount of on-chip memory, the same external memory bandwidth, and occupying iso-silicon area. First, using real hardware implementations in a state-of-the-art technology (14 nm CMOS), we accurately estimate the reduction in the MAC area achieved by aggressively scaling bit precision. As shown in Fig. 6(b), we achieve a 14× improvement in density when the bit-precisions of both activations and weights are uniformly reduced from 16 bits to 2 bits.

Next, to translate the reduction in area to improvement in overall performance, we built a precision-configurable MAC unit, whose bit precision can be modulated dynamically. The peak compute capability (FLOPs) of the MAC unit varied such that we achieve iso-area at each precision. Note that the total on-chip memory and external bandwidth remain constant at all precisions.
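The iso-area scaling can be turned into a small worked calculation from the architectural parameters listed with Fig. 6 (8 x 64 MAC array, 2 GHz, and 4/8/16/32 cores for the four configurations). The sketch assumes, for illustration only, that every MAC unit completes one multiply-accumulate per cycle; the names are hypothetical.

```python
ROWS, COLS = 8, 64   # MAC array dimensions per core (from Fig. 6)
FREQ_HZ = 2e9        # clock frequency (from Fig. 6)

# Number of cores that fit in the same silicon area at each precision (from Fig. 6).
CORES = {"16A-16W": 4, "8A-8W": 8, "4A-4W": 16, "2A-2W": 32}


def peak_macs_per_second(cores):
    """Peak MAC throughput, assuming one MAC per unit per cycle (an assumption)."""
    return cores * ROWS * COLS * FREQ_HZ


baseline = peak_macs_per_second(CORES["16A-16W"])
peak_speedup = {name: peak_macs_per_second(n) / baseline for name, n in CORES.items()}
```

Under these assumptions the peak compute grows in proportion to the core count, for example 4× when moving from 16A-16W to 4A-4W. Any observed gain beyond that ratio has to come from somewhere other than raw compute, such as reduced off-chip traffic.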
We estimate the overall system performance using DeepMatrix, a detailed performance modelling framework for DNN accelerators (Venkataramani et al.). Fig. 6(c) shows the gain in inference performance for the ResNet50 DNN benchmark. We study the performance improvement using different external memory bandwidths, namely, a bandwidth-unconstrained system (infinite memory bandwidth) and two bandwidth-constrained systems at 32 and 64 GBps. In the bandwidth-unconstrained scenario, the gain in performance is limited by how amenable the work is to parallelization. In this case, we see a near-linear increase in performance down to 4 bits and a small drop at extreme quantization levels (2 bits).

Practical systems, whose bandwidths are constrained, (surprisingly) exhibit a super-linear growth in performance with quantization. For example, when external bandwidth is limited to 64 GBps, quantizing from 16 to 4 bits leads to a 4× increase in peak FLOPs but a 4.5× improvement in performance. This is because the total amount of on-chip memory remains constant, and at very low precision some of the data structures begin to fit within the memory present in the cores, thereby avoiding data transfers from the external memory. Consequently, in bandwidth-limited systems, reducing the amount of data transferred from off-chip can provide an additional boost in system performance beyond the increase in peak FLOPs. Note that for the 4- and 2-bit precision configurations, we still used 8-bit precision to execute the first and last layers of the DNN. If we are able to quantize the first and last layers as well to 4 or 2 bits, we estimate an additional 1.24× improvement in performance, motivating the need to explore ways to quantize the first and last layers.

7 CONCLUSION

In this paper, we propose a novel activation quantization scheme based on the PArameterized Clipping acTivation function (PACT).
The proposed scheme replaces ReLU with an activation function with a clipping parameter, α, that is optimized via gradient descent-based training. We provide analysis on why PACT outperforms ReLU when quantization is applied during training. Extensive empirical evaluation using several popular convolutional neural networks, such as CIFAR10, SVHN, AlexNet,

ResNet18 and ResNet50, shows that PACT quantizes activations very effectively while simultaneously allowing weights to be heavily quantized. In comparison to all previous quantization schemes, we show that both weights and activations can be quantized much more aggressively (down to 4 bits) while achieving near (≤ 1%) full-precision accuracy. In addition, we have shown that the area savings from using reduced-precision MAC units enable a dramatic increase in the number of accelerator cores in the same area, thereby significantly improving overall system performance.

Table 1: Comparison of top-1 accuracy between DoReFa and PACT. Weights are quantized with the DoReFa scheme, whereas activations are quantized with our scheme. Note that CNNs with 4b quantization based on our scheme achieve full-precision accuracy for all the CNNs we explored.
Columns: Network, FullPrec, DoReFa (2b, 3b, 4b, 5b), PACT (2b, 3b, 4b, 5b).
Rows: CIFAR10, SVHN, AlexNet, ResNet18, ResNet50.

ACKNOWLEDGMENTS

The authors would like to thank Naigang Wang, Daniel Brand, Ankur Agrawal, Wei Zhang and I-Hsin Chung for helpful discussions and support. This research was supported by IBM Research AI, IBM SoftLayer, and IBM Cognitive Computing Cluster (CCC).

REFERENCES

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, URL Software available from tensorflow.org.

Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville.
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. CoRR, abs/ ,

Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep Learning With Low Precision by Half-Wave Gaussian Quantization. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. CoRR, abs/ ,

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional Sequence to Sequence Learning. CoRR, abs/ ,

Benjamin Graham. Low-Precision Batch-Normalized Activations. CoRR, abs/ ,

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity Mappings in Deep Residual Networks. CoRR, abs/ , 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp , 2016b.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized Neural Networks. NIPS, pp , 2016a.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. CoRR, abs/ , 2016b.

Kyuyeon Hwang and Wonyong Sung. Fixed-point Feedforward Deep Neural Network Design Using Weights +1, 0, and -1. In IEEE Workshop on Signal Processing Systems (SiPS), pp. 1 6, Oct

Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, pp ,

Minje Kim and Paris Smaragdis. Bitwise Neural Networks. ICML Workshop on Resource-Efficient Machine Learning,

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (ICLR),

Alex Krizhevsky and G Hinton. Convolutional deep belief networks on cifar-10. Unpublished manuscript, 40,

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25 (NIPS), pp ,

Fengfu Li and Bin Liu. Ternary Weight Networks. CoRR, abs/ ,

Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey. Ternary Neural Networks with Fine-Grained Quantization. CoRR, abs/ ,

Asit K. Mishra, Jeffrey J. Cook, Eriko Nurvitadhi, and Debbie Marr. WRPN: Training and Inference using Wide Reduced-Precision Networks. CoRR, abs/ ,

Daisuke Miyashita, Edward H. Lee, and Boris Murmann. Convolutional Neural Networks using Logarithmic Data Representation. CoRR, abs/ ,

Vinod Nair and Geoffrey E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. 27th International Conference on Machine Learning (ICML), pp ,

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning.
In NIPS workshop on deep learning and unsupervised feature learning, number 2, pp. 5,

Eunhyeok Park, Junwhan Ahn, and Sungjoo Yoo. Weighted-Entropy-Based Quantization for Deep Neural Networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. CoRR, abs/ ,

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115 (3): ,

Swagath Venkataramani, Jungwook Choi, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan, and Leland Chang. POSTER: Design Space Exploration for Performance Optimization of Deep Neural Networks on Shared Memory Accelerators. Proc. PACT

Ying Zhang, Mohammad Pezeshki, Philemon Brakel, Saizheng Zhang, César Laurent, Yoshua Bengio, and Aaron C. Courville. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. CoRR, abs/ ,

Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. CoRR, abs/ ,

Shuchang Zhou, Yuzhi Wang, He Wen, Qinyao He, and Yuheng Zou. Balanced Quantization: An Effective and Efficient Approach to Quantized Neural Networks. CoRR, abs/ , 2017.

Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained Ternary Quantization. International Conference on Learning Representations (ICLR),

APPENDIX A EXPLORATION OF HYPER-PARAMETERS AND DESIGN CHOICES

In this section, we present details on the hyper-parameters and design choices studied for PACT.

A.1 SCOPE OF α

One of the key questions is the optimal scope for α, that is, which neuron activations should share the same α. We considered three possible choices: (a) an individual α for each neuron activation, (b) a shared α among neurons within the same output channel, and (c) a shared α within a layer. We empirically studied each of these choices (without quantization) using CIFAR10-ResNet20 and measured the training and validation error for PACT. As shown in Fig. 7, sharing α per layer is the best choice in terms of accuracy. It is also the preferred option from the perspective of hardware complexity, since α needs to be multiplied only once, after all multiply-accumulate (MAC) operations in a layer are completed.

Figure 7: Training and validation error of CIFAR10-ResNet20 for PACT with different scopes of α.

A.2 INITIAL VALUE AND REGULARIZATION OF α

The optimization behavior of α can be explained from the formulation of the parameterized clipping function. From Eq. 3 it is clear that, if α is initialized to a very small value, more activations fall into the range with nonzero gradient, leading to an unstable α in the early epochs and potentially causing accuracy degradation. On the other hand, if α is initialized to a very large value, the gradient becomes too small and α may be stuck at a large value, potentially suffering more from quantization error.
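The dependence of the α gradient on its initial value can be illustrated numerically. The sketch below is not the training code used for the experiments. It assumes the clipping function has the form y = clip(x, 0, α) and that the gradient with respect to α is 1 for x ≥ α and 0 otherwise, as in the PACT formulation of the main text. The function names and the standard-normal sample of pre-activations are hypothetical choices made only for illustration.

```python
import random


def pact(x: float, alpha: float) -> float:
    """Clip x to the range [0, alpha] (forward pass, no quantization)."""
    return min(max(x, 0.0), alpha)


def pact_grad_alpha(x: float, alpha: float) -> float:
    """Assumed gradient of the output w.r.t. alpha: 1 where x >= alpha, else 0."""
    return 1.0 if x >= alpha else 0.0


def fraction_driving_alpha(xs: list[float], alpha: float) -> float:
    """Fraction of activations that contribute a nonzero gradient to alpha."""
    return sum(pact_grad_alpha(x, alpha) for x in xs) / len(xs)


# Hypothetical pre-activations: standard-normal samples with a fixed seed.
rng = random.Random(0)
activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# A small alpha receives gradient from many activations; a large alpha from few.
for alpha in (0.1, 1.0, 10.0):
    frac = fraction_driving_alpha(activations, alpha)
    print(f"alpha={alpha:5.1f}  fraction with nonzero gradient={frac:.4f}")
```

Under these assumptions, the fraction of activations that update α shrinks as α grows, which matches the two failure modes described above: many competing updates for a small α, and a vanishing update for a large α.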
Therefore, it is intuitive to start with a reasonably large value to cover a wide dynamic range and avoid unstable adaptation of α, but to apply a regularizer that reduces the value of α so as to alleviate quantization error. In practice, we found that applying L2-regularization to α, with its coefficient λ_α set equal to the L2-regularization coefficient for the weights, λ, works well. Fig. 8 shows that the validation error for PACT-quantized CIFAR10-ResNet20 does not vary significantly over a wide range of λ_α. We also observed that, as expected, the optimal value of λ_α decreases slightly when a higher bit-precision is used, because more quantization levels result in higher resolution for activation quantization.

Figure 8: Training and validation error of quantized CIFAR10-ResNet20 for PACT with different regularization parameter λ_α.

A.3 QUANTIZATION OF FIRST AND LAST LAYERS

Many previous works (e.g., Hubara et al. (2016b); Zhou et al. (2016)) follow the convention of keeping the first and last layers in full precision during training, since quantizing those layers leads to substantial accuracy degradation. We empirically studied this for the proposed quantization approach for CIFAR10-ResNet20. In Fig. 9, the only difference among the curves is whether the input activation and weight of the first convolution layer or the last fully-connected layer are quantized. As can be seen from the plots, there can be noticeable accuracy degradation if the first or last layers are aggressively quantized. However, computation in floating point is very expensive in hardware. Therefore, we further studied the option of quantizing the first and last layers with a higher bit-precision than that of the other layers. Table 2 shows that, independent of the quantization level of the other layers, there is little accuracy degradation if the first and last layers are quantized with 8 bits. This motivates us to employ reduced-precision computation even for the first and last layers.

Table 2: Validation error (in %) of CIFAR10-ResNet20 when the first and last layers are quantized with different bit-precision. FL/M/FL means the first and last layers are quantized with Bit-FL bits, while the other layers are quantized with Bit-M bits. NQ denotes that no quantization is applied.
Columns: BIT-M (bits) | BIT-FL (bits) | FL/M/FL | FL/M/NQ | NQ/M/FL

Figure 9: Comparison of accuracy of CIFAR10-ResNet20 with and without quantization of the first and last layers.

APPENDIX B CNN IMPLEMENTATION DETAILS

In this section, we summarize details of our CNN implementation as well as our training settings, which are based on the default networks provided by Tensorpack (Zhou et al. (2016)). Unless mentioned otherwise, ReLU following BatchNorm is used for ActFn of the convolution (CONV) layers, and Softmax is used for the fully-connected (FC) layer. Note that the baseline networks use the same hyper-parameters and ReLU activation functions as described in the references. For PACT experiments, we only replace ReLU with PACT and keep the same hyper-parameters. In all cases, the networks are trained from scratch.

The CIFAR10 dataset (Krizhevsky & Hinton (2010)) is an image classification benchmark containing 32 × 32 pixel RGB images. It consists of 50K training and 10K test images. We used the standard ResNet structure (He et al. (2016a)), which consists of a CONV layer followed by 3 ResNet blocks (16 CONV layers with 3x3 filter) and a final FC layer. We used stochastic gradient descent (SGD) with momentum of 0.9 and a learning rate starting from 0.1 and scaled by 0.1 at epochs 60 and 120. An L2-regularizer with decay of […] is applied to the weights. The mini-batch size is 128, and the maximum number of epochs is 200.

The SVHN dataset (Netzer et al. (2011)) is a real-world digit recognition dataset containing photos of house numbers in Google Street View images, where the cropped color images (re-sized to […] as input to the network) centered around a single character are used. It consists of […] digits for training and […] digits for testing. We used a CNN model which contains 7 CONV layers followed by 1 FC layer. We used ADAM (Kingma & Ba (2015)) with epsilon 10^-5 and a learning rate starting from 10^-3 and scaled by 0.5 every 50 epochs. An L2-regularizer with decay of 10^-7 is applied to the weights. The mini-batch size is 128, and the maximum number of epochs is 200.

The IMAGENET dataset (Russakovsky et al. (2015)) consists of 1000 categories of objects with over 1.2M training and 50K validation images. Images are first re-sized to […] and randomly cropped to […] prior to being used as input to the network. We used a modified AlexNet, ResNet18 and ResNet50.
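The step learning-rate schedules quoted above for CIFAR10 and SVHN can be written down as a short sketch. This is an illustration only, not the Tensorpack configuration used for the experiments. The function names are hypothetical, and two conventions are assumed: epochs are counted from 0, and a scaling step takes effect at the start of the stated epoch.

```python
def step_lr(epoch: int, base_lr: float, factor: float, milestones: list[int]) -> float:
    """Learning rate when base_lr is multiplied by factor at each milestone epoch.

    Assumption: the scaling applies from the milestone epoch onward.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed


def periodic_lr(epoch: int, base_lr: float, factor: float, period: int) -> float:
    """Learning rate when base_lr is multiplied by factor once every period epochs."""
    return base_lr * factor ** (epoch // period)


# CIFAR10 setting from the text: start at 0.1, scale by 0.1 at epochs 60 and 120.
cifar10_lr = [step_lr(e, 0.1, 0.1, [60, 120]) for e in range(200)]

# SVHN setting from the text: start at 10^-3, scale by 0.5 every 50 epochs.
svhn_lr = [periodic_lr(e, 1e-3, 0.5, 50) for e in range(200)]

print(cifar10_lr[0], cifar10_lr[60], cifar10_lr[120])
print(svhn_lr[0], svhn_lr[50], svhn_lr[150])
```

Keeping the schedule as a pure function of the epoch makes it easy to compare against whatever scheduler a given framework provides.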
We used an AlexNet network (Krizhevsky et al. (2012)) in which the local contrast renormalization (R-Norm) layer is replaced with a BatchNorm layer. We used ADAM with epsilon 10^-5 and a learning rate starting from 10^-4 and scaled by 0.2 at epochs 56 and 64. An L2-regularizer with decay factor of […] is applied to the weights. The mini-batch size is 128, and the maximum number of epochs is 100.

ResNet18 consists of a CONV layer followed by 8 ResNet blocks (16 CONV layers with 3x3 filter) and a final FC layer. The "full pre-activation" ResNet structure (He et al. (2016a)) is employed.

ResNet50 consists of a CONV layer followed by 16 ResNet bottleneck blocks (48 CONV layers in total) and a final FC layer. The "full pre-activation" ResNet structure (He et al. (2016a)) is employed.

For both ResNet18 and ResNet50, we used stochastic gradient descent (SGD) with momentum of 0.9 and a learning rate starting from 0.1 and scaled by 0.1 at epochs 30, 60, 85 and 95. An L2-regularizer with decay of 10^-4 is applied to the weights. The mini-batch size is 256, and the maximum number of epochs is 110.

APPENDIX C COMPARISON WITH RELATED WORK

C.1 QUANTIZATION EXPERIMENT SETTING

DoReFa-Net (DoReFa, Zhou et al. (2016)): A general bit-precision uniform quantization scheme for the weights, activations, and gradients of DNN training. We compared the experimental results of DoReFa for CIFAR10, SVHN, AlexNet and ResNet18 under the same experimental setting as PACT. Note that a clipped absolute activation function is used for SVHN in DoReFa.

Balanced Quantization (BalancedQ, Zhou et al. (2017)): A quantization scheme based on recursive partitioning of data into balanced bins. We compared the reported top-1/top-5 validation accuracy of their quantization scheme for AlexNet and ResNet18.

Quantization using Wide Reduced-Precision Networks (WRPN, Mishra et al. (2017)): A scheme that increases the number of filter maps to improve robustness to activation quantization. We compared the reported top-1 accuracy of their quantization with various weight/activation bit-precisions for AlexNet.

Fine-grained Quantization (FGQ, Mellempudi et al. (2017)): A direct quantization scheme (i.e., little re-training needed) based on fine-grained grouping (i.e., within a small subset of filter maps). We compared the reported top-1 validation accuracy of their quantization with 2-bit weights and 4-bit activations for AlexNet and ResNet50.

Weighted-entropy-based quantization (WEP, Park et al. (2017)): A quantization scheme that considers the statistics of weights and activations. We compared the reported top-1/top-5 accuracy of their quantization with various bit-precisions for AlexNet, where the first and last layers are not quantized.

Low-precision batch normalization (LPBN, Graham (2017)): A scheme for activation quantization in the process of batch normalization. We compared the reported top-1/top-5 accuracy of their quantization with 3-5 bit precision for activations. The first layer activation is not quantized.

Half-wave Gaussian quantization (HWGQ, Cai et al. (2017)): A quantization scheme that finds the scale via Lloyd search on a Normal distribution. We compared the reported top-1/top-5 accuracy of their quantization with 1-bit weights and varying activation bit-precision for AlexNet, and with 2-bit weights for ResNet18 and ResNet50. The first and last layers are not quantized.

C.2 COMPARISON OF ACCURACY

In this section, we present a full comparison of accuracy (top-1 and top-5) of the tested CNNs (AlexNet, ResNet18, ResNet50) for image classification on the IMAGENET dataset. All the data points for PACT and DoReFa are obtained by running experiments on Tensorpack. All the other data points are the accuracies reported in the corresponding papers. As can be seen, PACT achieves the best accuracy across the board for various flavors of quantization. We also observe that using PACT for activation quantization enables more aggressive weight quantization without loss in accuracy.

Table 3: Comparison of Top-1 accuracy (in %) for AlexNet. Bold entries indicate the lowest accuracy degradation compared to single-precision reference from each work. Baseline (full-precision) accuracy for PACT is 55.1%.
Columns: BitW | BitA | WRPN | BalancedQ | FGQ | WEQ | DoReFa | PACT

Table 4: Comparison of Top-5 accuracy (in %) for AlexNet. Bold entries indicate the lowest accuracy degradation compared to single-precision reference from each work. Baseline (full-precision) accuracy for PACT is 77.0%.
Columns: BitW | BitA | LogQuant | WEQ | BalancedQ | DoReFa | PACT

Table 5: Comparison of Top-1 accuracy (in %) for ResNet18. Bold entries indicate the lowest accuracy degradation compared to single-precision reference from each work. Baseline (full-precision) accuracy for PACT is 70.4%.
Columns: BitW | BitA | BalancedQ | LPBN | HWGQ | DoReFa | PACT

Table 6: Comparison of Top-5 accuracy (in %) for ResNet18. Bold entries indicate the lowest accuracy degradation compared to single-precision reference from each work. Baseline (full-precision) accuracy for PACT is 89.6%.
Columns: BitW | BitA | BalancedQ | LPBN | HWGQ | DoReFa | PACT

Table 7: Comparison of Top-1 accuracy (in %) for ResNet50. Bold entries indicate the lowest accuracy degradation compared to single-precision reference from each work. Baseline (full-precision) accuracy for PACT is 76.9%.
Columns: BitW | BitA | FGQ | LPBN | HWGQ | DoReFa | PACT

Table 8: Comparison of Top-5 accuracy (in %) for ResNet50. Bold entries indicate the lowest accuracy degradation compared to single-precision reference from each work. Baseline (full-precision) accuracy for PACT is 93.1%.
Columns: BitW | BitA | LPBN | HWGQ | DoReFa | PACT

ATTENTION-BASED GUIDED STRUCTURED SPARSITY

ATTENTION-BASED GUIDED STRUCTURED SPARSITY ATTENTION-BASED GUIDED STRUCTURED SPARSITY OF DEEP NEURAL NETWORKS Amirsina Torfi Virginia Tech atorfi@vt.edu Rouzbeh A. Shirvani Howard University rouzbeh.asgharishir@bison.howard.edu ABSTRACT arxiv:1802.09902v3

More information

σ(x) = clip( x m + α 2, 0, α) = min(max( x m + α 2, 0), α) (1)

σ(x) = clip( x m + α 2, 0, α) = min(max( x m + α 2, 0), α) (1) TRUE GRADIENT-BASED TRAINING OF DEEP BINARY ACTIVATED NEURAL NETWORKS VIA CONTINUOUS BINARIZATION Charbel Sakr, Jungwook Choi, Zhuo Wang, Kailash Gopalakrishnan, Naresh Shanbhag Dept. of Electrical and

More information

TYPES OF MODEL COMPRESSION. Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad

TYPES OF MODEL COMPRESSION. Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad TYPES OF MODEL COMPRESSION Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad 1. Pruning 2. Quantization 3. Architectural Modifications PRUNING WHY PRUNING? Deep Neural Networks have redundant parameters.

More information

arxiv: v2 [cs.cv] 20 Aug 2018

arxiv: v2 [cs.cv] 20 Aug 2018 Joint Training of Low-Precision Neural Network with Quantization Interval Parameters arxiv:1808.05779v2 [cs.cv] 20 Aug 2018 Sangil Jung sang-il.jung Youngjun Kwak yjk.kwak Changyong Son cyson Jae-Joon

More information

Towards Accurate Binary Convolutional Neural Network

Towards Accurate Binary Convolutional Neural Network Paper: #261 Poster: Pacific Ballroom #101 Towards Accurate Binary Convolutional Neural Network Xiaofan Lin, Cong Zhao and Wei Pan* firstname.lastname@dji.com Photos and videos are either original work

More information

Layer 1. Layer L. difference between the two precisions (B A and B W ) is chosen to balance the sum in (1), as follows: B A B W = round log 2 (2)

Layer 1. Layer L. difference between the two precisions (B A and B W ) is chosen to balance the sum in (1), as follows: B A B W = round log 2 (2) AN ANALYTICAL METHOD TO DETERMINE MINIMUM PER-LAYER PRECISION OF DEEP NEURAL NETWORKS Charbel Sakr Naresh Shanbhag Dept. of Electrical Computer Engineering, University of Illinois at Urbana Champaign ABSTRACT

More information

Weighted-Entropy-based Quantization for Deep Neural Networks

Weighted-Entropy-based Quantization for Deep Neural Networks Weighted-Entropy-based Quantization for Deep Neural Networks Eunhyeok Park, Junwhan Ahn, and Sungjoo Yoo canusglow@gmail.com, junwhan@snu.ac.kr, sungjoo.yoo@gmail.com Seoul National University Computing

More information

Binary Deep Learning. Presented by Roey Nagar and Kostya Berestizshevsky

Binary Deep Learning. Presented by Roey Nagar and Kostya Berestizshevsky Binary Deep Learning Presented by Roey Nagar and Kostya Berestizshevsky Deep Learning Seminar, School of Electrical Engineering, Tel Aviv University January 22 nd 2017 Lecture Outline Motivation and existing

More information

arxiv: v2 [cs.ne] 23 Nov 2017

arxiv: v2 [cs.ne] 23 Nov 2017 Minimum Energy Quantized Neural Networks Bert Moons +, Koen Goetschalckx +, Nick Van Berckelaer* and Marian Verhelst + Department of Electrical Engineering* + - ESAT/MICAS +, KU Leuven, Leuven, Belgium

More information

arxiv: v1 [cs.cv] 3 Aug 2017

arxiv: v1 [cs.cv] 3 Aug 2017 DONG ET AL.: STOCHASTIC QUANTIZATION 1 arxiv:1708.01001v1 [cs.cv] 3 Aug 2017 Learning Accurate Low-Bit Deep Neural Networks with Stochastic Quantization Yinpeng Dong 1 dyp17@mails.tsinghua.edu.cn Renkun

More information

Neural Network Approximation. Low rank, Sparsity, and Quantization Oct. 2017

Neural Network Approximation. Low rank, Sparsity, and Quantization Oct. 2017 Neural Network Approximation Low rank, Sparsity, and Quantization zsc@megvii.com Oct. 2017 Motivation Faster Inference Faster Training Latency critical scenarios VR/AR, UGV/UAV Saves time and energy Higher

More information

Deep Learning with Low Precision by Half-wave Gaussian Quantization

Deep Learning with Low Precision by Half-wave Gaussian Quantization Deep Learning with Low Precision by Half-wave Gaussian Quantization Zhaowei Cai UC San Diego zwcai@ucsd.edu Xiaodong He Microsoft Research Redmond xiaohe@microsoft.com Jian Sun Megvii Inc. sunjian@megvii.com

More information

Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error

Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error Chunhui Jiang, Guiying Li, Chao Qian, Ke Tang Anhui Province Key Lab of Big Data Analysis and Application, University

More information

RAGAV VENKATESAN VIJETHA GATUPALLI BAOXIN LI NEURAL DATASET GENERALITY

RAGAV VENKATESAN VIJETHA GATUPALLI BAOXIN LI NEURAL DATASET GENERALITY RAGAV VENKATESAN VIJETHA GATUPALLI BAOXIN LI NEURAL DATASET GENERALITY SIFT HOG ALL ABOUT THE FEATURES DAISY GABOR AlexNet GoogleNet CONVOLUTIONAL NEURAL NETWORKS VGG-19 ResNet FEATURES COMES FROM DATA

More information

Scalable Methods for 8-bit Training of Neural Networks

Scalable Methods for 8-bit Training of Neural Networks Scalable Methods for 8-bit Training of Neural Networks Ron Banner 1, Itay Hubara 2, Elad Hoffer 2, Daniel Soudry 2 {itayhubara, elad.hoffer, daniel.soudry}@gmail.com {ron.banner}@intel.com (1) Intel -

More information

Differentiable Fine-grained Quantization for Deep Neural Network Compression

Differentiable Fine-grained Quantization for Deep Neural Network Compression Differentiable Fine-grained Quantization for Deep Neural Network Compression Hsin-Pai Cheng hc218@duke.edu Yuanjun Huang University of Science and Technology of China Anhui, China yjhuang@mail.ustc.edu.cn

More information

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate

More information

arxiv: v3 [cs.ne] 2 Feb 2018

arxiv: v3 [cs.ne] 2 Feb 2018 DOREFA-NET: TRAINING LOW BITWIDTH CONVOLU- TIONAL NEURAL NETWORKS WITH LOW BITWIDTH GRADIENTS Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, Yuheng Zou Megvii Inc. {zsc, wyx, nzk, zxy, wenhe, zouyuheng}@megvii.com

More information

arxiv: v1 [cs.ne] 20 Jun 2016

arxiv: v1 [cs.ne] 20 Jun 2016 DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, Yuheng Zou Megvii Inc. {shuchang.zhou,mike.zekun}@gmail.com

More information

arxiv: v1 [cs.ne] 20 Apr 2018

arxiv: v1 [cs.ne] 20 Apr 2018 arxiv:1804.07802v1 [cs.ne] 20 Apr 2018 Value-aware Quantization for Training and Inference of Neural Networks Eunhyeok Park 1, Sungjoo Yoo 1, Peter Vajda 2 1 Department of Computer Science and Engineering

More information

Introduction to Deep Neural Networks

Introduction to Deep Neural Networks Introduction to Deep Neural Networks Presenter: Chunyuan Li Pattern Classification and Recognition (ECE 681.01) Duke University April, 2016 Outline 1 Background and Preliminaries Why DNNs? Model: Logistic

More information

Value-aware Quantization for Training and Inference of Neural Networks

Value-aware Quantization for Training and Inference of Neural Networks Value-aware Quantization for Training and Inference of Neural Networks Eunhyeok Park 1, Sungjoo Yoo 1, and Peter Vajda 2 1 Department of Computer Science and Engineering Seoul National University {eunhyeok.park,sungjoo.yoo}@gmail.com

More information

SHAKE-SHAKE REGULARIZATION OF 3-BRANCH

SHAKE-SHAKE REGULARIZATION OF 3-BRANCH SHAKE-SHAKE REGULARIZATION OF 3-BRANCH RESIDUAL NETWORKS Xavier Gastaldi xgastaldi.mba2011@london.edu ABSTRACT The method introduced in this paper aims at helping computer vision practitioners faced with

More information

Quantisation. Efficient implementation of convolutional neural networks. Philip Leong. Computer Engineering Lab The University of Sydney

Quantisation. Efficient implementation of convolutional neural networks. Philip Leong. Computer Engineering Lab The University of Sydney 1/51 Quantisation Efficient implementation of convolutional neural networks Philip Leong Computer Engineering Lab The University of Sydney July 2018 / PAPAA Workshop Australia 2/51 3/51 Outline 1 Introduction

More information

arxiv: v1 [cs.cv] 25 May 2018

arxiv: v1 [cs.cv] 25 May 2018 Heterogeneous Bitwidth Binarization in Convolutional Neural Networks arxiv:1805.10368v1 [cs.cv] 25 May 2018 Josh Fromm Department of Electrical Engineering University of Washington Seattle, WA 98195 jwfromm@uw.edu

More information

Deep Learning Year in Review 2016: Computer Vision Perspective

Deep Learning Year in Review 2016: Computer Vision Perspective Deep Learning Year in Review 2016: Computer Vision Perspective Alex Kalinin, PhD Candidate Bioinformatics @ UMich alxndrkalinin@gmail.com @alxndrkalinin Architectures Summary of CNN architecture development

More information

DNN FEATURE MAP COMPRESSION USING LEARNED REPRESENTATION OVER GF(2)

DNN FEATURE MAP COMPRESSION USING LEARNED REPRESENTATION OVER GF(2) DNN FEATURE MAP COMPRESSION USING LEARNED REPRESENTATION OVER GF(2) Anonymous authors Paper under double-blind review ABSTRACT In this paper, we introduce a method to compress intermediate feature maps

More information

Accelerating Convolutional Neural Networks by Group-wise 2D-filter Pruning

Accelerating Convolutional Neural Networks by Group-wise 2D-filter Pruning Accelerating Convolutional Neural Networks by Group-wise D-filter Pruning Niange Yu Department of Computer Sicence and Technology Tsinghua University Beijing 0008, China yng@mails.tsinghua.edu.cn Shi Qiu

More information

arxiv: v1 [cs.cv] 3 Feb 2017

arxiv: v1 [cs.cv] 3 Feb 2017 Deep Learning with Low Precision by Half-wave Gaussian Quantization Zhaowei Cai UC San Diego zwcai@ucsd.edu Xiaodong He Microsoft Research Redmond xiaohe@microsoft.com Jian Sun Megvii Inc. sunjian@megvii.com

More information

LARGE BATCH TRAINING OF CONVOLUTIONAL NET-

LARGE BATCH TRAINING OF CONVOLUTIONAL NET- LARGE BATCH TRAINING OF CONVOLUTIONAL NET- WORKS WITH LAYER-WISE ADAPTIVE RATE SCALING Anonymous authors Paper under double-blind review ABSTRACT A common way to speed up training of deep convolutional

More information

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks. Microsoft Research

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks. Microsoft Research LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua Microsoft Research zdqzeros@gmail.com jiaoyan@microsoft.com

More information

arxiv: v1 [cs.cv] 11 Sep 2018

arxiv: v1 [cs.cv] 11 Sep 2018 Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Embedded Inference Jeffrey L. McKinstry jlmckins@us.ibm.com Steven K. Esser sesser@us.ibm.com Rathinakumar Appuswamy rappusw@us.ibm.com

More information

Encoder Based Lifelong Learning - Supplementary materials

Encoder Based Lifelong Learning - Supplementary materials Encoder Based Lifelong Learning - Supplementary materials Amal Rannen Rahaf Aljundi Mathew B. Blaschko Tinne Tuytelaars KU Leuven KU Leuven, ESAT-PSI, IMEC, Belgium firstname.lastname@esat.kuleuven.be

More information

FreezeOut: Accelerate Training by Progressively Freezing Layers

FreezeOut: Accelerate Training by Progressively Freezing Layers FreezeOut: Accelerate Training by Progressively Freezing Layers Andrew Brock, Theodore Lim, & J.M. Ritchie School of Engineering and Physical Sciences Heriot-Watt University Edinburgh, UK {ajb5, t.lim,

More information

Normalization Techniques in Training of Deep Neural Networks

Normalization Techniques in Training of Deep Neural Networks Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,

More information

TRAINING AND INFERENCE WITH INTEGERS IN DEEP NEURAL NETWORKS

TRAINING AND INFERENCE WITH INTEGERS IN DEEP NEURAL NETWORKS Published as a conference paper at ICLR 218 TRAINING AND INFERENCE WITH INTEGERS IN DEEP NEURAL NETWORKS Shuang Wu 1, Guoqi Li 1, Feng Chen 2, Luping Shi 1 1 Department of Precision Instrument 2 Department

More information

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE-

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- Workshop track - ICLR COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- CURRENT NEURAL NETWORKS Daniel Fojo, Víctor Campos, Xavier Giró-i-Nieto Universitat Politècnica de Catalunya, Barcelona Supercomputing

More information

Layer-wise Relevance Propagation for Neural Networks with Local Renormalization Layers

Layer-wise Relevance Propagation for Neural Networks with Local Renormalization Layers Layer-wise Relevance Propagation for Neural Networks with Local Renormalization Layers Alexander Binder 1, Grégoire Montavon 2, Sebastian Lapuschkin 3, Klaus-Robert Müller 2,4, and Wojciech Samek 3 1 ISTD

More information

Byte-based Language Identification with Deep Convolutional Networks

Byte-based Language Identification with Deep Convolutional Networks Byte-based Language Identification with Deep Convolutional Networks Johannes Bjerva University of Groningen The Netherlands j.bjerva@rug.nl Abstract We report on our system for the shared task on discrimination

More information

LOGNET: ENERGY-EFFICIENT NEURAL NETWORKS USING LOGARITHMIC COMPUTATION. Stanford University 1, Toshiba 2

LOGNET: ENERGY-EFFICIENT NEURAL NETWORKS USING LOGARITHMIC COMPUTATION. Stanford University 1, Toshiba 2 LOGNET: ENERGY-EFFICIENT NEURAL NETWORKS USING LOGARITHMIC COMPUTATION Edward H. Lee 1, Daisuke Miyashita 1,2, Elaina Chai 1, Boris Murmann 1, S. Simon Wong 1 Stanford University 1, Toshiba 2 ABSTRACT

More information

HIGHLY EFFICIENT 8-BIT LOW PRECISION INFERENCE

HIGHLY EFFICIENT 8-BIT LOW PRECISION INFERENCE HIGHLY EFFICIENT 8-BIT LOW PRECISION INFERENCE OF CONVOLUTIONAL NEURAL NETWORKS Anonymous authors Paper under double-blind review ABSTRACT High throughput and low latency inference of deep neural networks

More information

arxiv: v1 [cs.lg] 21 Jun 2018

arxiv: v1 [cs.lg] 21 Jun 2018 Quantizing deep convolutional networks for efficient inference: A whitepaper arxiv:1806.08342v1 [cs.lg] 21 Jun 2018 Contents Raghuraman Krishnamoorthi raghuramank@google.com June 2018 1 Introduction 3

More information

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Department of Electrical and Computer Engineering Seoul National University Seoul, 08826 Korea Email: sajid@dsp.snu.ac.kr, khwang@dsp.snu.ac.kr, wysung@snu.ac.kr

More information

Multi-Precision Quantized Neural Networks via Encoding Decomposition of { 1, +1}

Multi-Precision Quantized Neural Networks via Encoding Decomposition of { 1, +1} Multi-Precision Quantized Neural Networks via Encoding Decomposition of {, +} Qigong Sun, Fanhua Shang, Kang Yang, Xiufang Li, Yan Ren, Licheng Jiao Key Laboratory of Intelligent Perception and Image Understanding

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

arxiv: v2 [cs.lg] 23 May 2017

arxiv: v2 [cs.lg] 23 May 2017 Shake-Shake regularization Xavier Gastaldi xgastaldi.mba2011@london.edu arxiv:1705.07485v2 [cs.lg] 23 May 2017 Abstract The method introduced in this paper aims at helping deep learning practitioners faced

More information

arxiv: v1 [cs.lg] 10 Jan 2019

arxiv: v1 [cs.lg] 10 Jan 2019 Quantized Epoch-SGD for Communication-Efficient Distributed Learning arxiv:1901.0300v1 [cs.lg] 10 Jan 2019 Shen-Yi Zhao Hao Gao Wu-Jun Li Department of Computer Science and Technology Nanjing University,

More information

Improved Bayesian Compression

Improved Bayesian Compression Improved Bayesian Compression Marco Federici University of Amsterdam marco.federici@student.uva.nl Karen Ullrich University of Amsterdam karen.ullrich@uva.nl Max Welling University of Amsterdam Canadian

More information

ANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS

ANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS ANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS Anonymous authors Paper under double-blind review ABSTRACT We conduct mathematical analysis on the effect of batch normalization (BN)

More information

arxiv: v3 [cs.lg] 2 Jun 2016

arxiv: v3 [cs.lg] 2 Jun 2016 Darryl D. Lin Qualcomm Research, San Diego, CA 922, USA Sachin S. Talathi Qualcomm Research, San Diego, CA 922, USA DARRYL.DLIN@GMAIL.COM TALATHI@GMAIL.COM arxiv:5.06393v3 [cs.lg] 2 Jun 206 V. Sreekanth

More information

Lecture 7 Convolutional Neural Networks

Lecture 7 Convolutional Neural Networks Lecture 7 Convolutional Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 17, 2017 We saw before: ŷ x 1 x 2 x 3 x 4 A series of matrix multiplications:

More information

arxiv: v2 [stat.ml] 18 Jun 2017

arxiv: v2 [stat.ml] 18 Jun 2017 FREEZEOUT: ACCELERATE TRAINING BY PROGRES- SIVELY FREEZING LAYERS Andrew Brock, Theodore Lim, & J.M. Ritchie School of Engineering and Physical Sciences Heriot-Watt University Edinburgh, UK {ajb5, t.lim,

More information

arxiv: v1 [cs.cv] 11 May 2015 Abstract

arxiv: v1 [cs.cv] 11 May 2015 Abstract Training Deeper Convolutional Networks with Deep Supervision Liwei Wang Computer Science Dept UIUC lwang97@illinois.edu Chen-Yu Lee ECE Dept UCSD chl260@ucsd.edu Zhuowen Tu CogSci Dept UCSD ztu0@ucsd.edu

More information

Stochastic Layer-Wise Precision in Deep Neural Networks

Stochastic Layer-Wise Precision in Deep Neural Networks Stochastic Layer-Wise Precision in Deep Neural Networks Griffin Lacey NVIDIA Graham W. Taylor University of Guelph Vector Institute for Artificial Intelligence Canadian Institute for Advanced Research

More information

Towards a Deeper Understanding of Training Quantized Neural Networks

Towards a Deeper Understanding of Training Quantized Neural Networks Hao Li * 1 Soham De * 1 Zheng Xu 1 Christoph Studer Hanan Samet 1 Tom Goldstein 1 Abstract Training neural networks with coarsely quantized weights is a key step towards learning on embedded platforms

More information

Local Affine Approximators for Improving Knowledge Transfer

Local Affine Approximators for Improving Knowledge Transfer Local Affine Approximators for Improving Knowledge Transfer Suraj Srinivas & François Fleuret Idiap Research Institute and EPFL {suraj.srinivas, francois.fleuret}@idiap.ch Abstract The Jacobian of a neural

More information

Multiscale methods for neural image processing. Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein Hao Li, Soham De, Zheng Xu, Hanan Samet

Multiscale methods for neural image processing. Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein Hao Li, Soham De, Zheng Xu, Hanan Samet Multiscale methods for neural image processing Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein Hao Li, Soham De, Zheng Xu, Hanan Samet A TALK IN TWO ACTS Part I: Stacked U-nets The globalization

More information

Compressing deep neural networks

Compressing deep neural networks From Data to Decisions - M.Sc. Data Science Compressing deep neural networks Challenges and theoretical foundations Presenter: Simone Scardapane University of Exeter, UK Table of contents Introduction

More information

Auto-balanced Filter Pruning for Efficient Convolutional Neural Networks

Xiaohan Ding, 1 Guiguang Ding, 1 Jungong Han, 2 Sheng Tang 3 1 School of Software, Tsinghua University, Beijing 100084, China 2

On the Complexity of Neural-Network-Learned Functions

Caren Marzban 1, Raju Viswanathan 2 1 Applied Physics Lab, and Dept of Statistics, University of Washington, Seattle, WA 98195 2 Cyberon LLC, 3073

ON THE INEFFECTIVENESS OF VARIANCE REDUCED OPTIMIZATION FOR DEEP LEARNING

Anonymous authors Paper under double-blind review ABSTRACT The application of stochastic variance reduction to optimization has

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)

Outline Summary of Previous Stanford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation

Frequency-Domain Dynamic Pruning for Convolutional Neural Networks

Zhenhua Liu 1, Jizheng Xu 2, Xiulian Peng 2, Ruiqin Xiong 1 1 Institute of Digital Media, School of Electronic Engineering and Computer

Classification of One-Dimensional Non-Stationary Signals Using the Wigner-Ville Distribution in Convolutional Neural Networks

Johan Brynolfsson Mathematical Statistics, Centre for Mathematical Sciences,

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning

Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16

Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, Michael Nielson

Deep Neural Network Compression with Single and Multiple Level Quantization

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) Yuhui Xu, 1 Yongzhuang Wang, 1 Aojun Zhou, 2 Weiyao Lin,

arxiv: v2 [cs.cv] 17 Nov 2017

Towards Effective Low-bitwidth Convolutional Neural Networks Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, Ian Reid arxiv:1711.00205v2 [cs.cv] 17 Nov 2017 Abstract This paper tackles the problem

A Deep Learning Analytic Suite for Maximizing Twitter Impact

Zhao Chen Department of Physics Stanford University Stanford CA, 94305 zchen89[at]stanford.edu Alexander Hristov Department of Physics Stanford

Analytical Guarantees on Numerical Precision of Deep Neural Networks

Charbel Sakr Yongjune Kim Naresh Shanbhag Abstract The acclaimed successes of neural networks often overshadow their tremendous complexity. We focus on numerical precision - a key parameter defining the

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba

Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation

Rethinking Binary Neural Network for Accurate Image Classification and Semantic Segmentation

Bohan Zhuang 1, Chunhua Shen 1, Mingkui Tan 2, Lingqiao Liu 1, and Ian Reid 1 1 The University of Adelaide, Australia;

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence

ESANN 0 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 7-9 April 0, idoc.com publ., ISBN 97-7707-.

Importance Reweighting Using Adversarial-Collaborative Training

Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a

SYQ: Learning Symmetric Quantization For Efficient Deep Neural Networks

Julian Faraone* Nicholas Fraser # Michaela Blott # Philip H.W. Leong* The University of Sydney* Xilinx Research Labs # (julian.faraone,

arxiv: v1 [cs.lg] 11 May 2015

Improving neural networks with bunches of neurons modeled by Kumaraswamy units: Preliminary study Jakub M. Tomczak JAKUB.TOMCZAK@PWR.EDU.PL Wrocław University of Technology, wybrzeże Wyspiańskiego 7, 5-37,

Swapout: Learning an ensemble of deep architectures

Saurabh Singh, Derek Hoiem, David Forsyth Department of Computer Science University of Illinois, Urbana-Champaign {ss1, dhoiem, daf}@illinois.edu Abstract

Day 3 Lecture 3. Optimizing deep networks

Convex optimization A function is convex if for all α ∈ [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

arxiv: v1 [cs.cv] 21 Jul 2017

CONFIDENCE ESTIMATION IN DEEP NEURAL NETWORKS VIA DENSITY MODELLING Akshayvarun Subramanya Suraj Srinivas R.Venkatesh Babu Video Analytics Lab, Department of Computational and Data Sciences Indian Institute

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

Journal of Machine Learning Research 18 (2018) 1-30 Submitted 9/16; Revised 4/17; Published 4/18 Itay Hubara*

Two-Step Quantization for Low-bit Neural Networks

Peisong Wang 1,2, Qinghao Hu 1,2, Yifan Zhang 1,2, Chunjie Zhang 1,2, Yang Liu 4, and Jian Cheng 1,2,3 1 Institute of Automation, Chinese Academy of Sciences,

arxiv: v3 [cs.cl] 24 Feb 2018

FACTORIZATION TRICKS FOR LSTM NETWORKS Oleksii Kuchaiev NVIDIA okuchaiev@nvidia.com Boris Ginsburg NVIDIA bginsburg@nvidia.com arxiv:1703.10722v3 [cs.cl] 24 Feb 2018 ABSTRACT We present two simple ways

Introduction to Convolutional Neural Networks 2018 / 02 / 23

Buzzword: CNN Convolutional neural networks (CNN, ConvNet) are a class of deep, feed-forward (not recurrent) artificial neural networks that

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

Hiroaki Hayashi 1,* Jayanth Koushik 1,* Graham Neubig 1 arxiv:1611.01505v3 [cs.lg] 11 Jun 2018 Abstract Adaptive

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017. Table of contents

Website Acknowledgments Notation 1 Introduction 1.1 Who Should Read This Book?

Normalization Techniques

Devansh Arpit Table of Contents 1 Introduction 2 Motivation 3 Batch Normalization 4 Normalization Propagation 5 Weight Normalization 6 Layer Normalization

Based on the original slides of Hung-yi Lee

Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services

Introduction to Convolutional Neural Networks (CNNs)

nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei

Convolutional Neural Network Architecture

Zhisheng Zhong February 2nd, 2018 Outline 1 Introduction of Convolution Motivation

Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas

Center for Machine Perception Department of Cybernetics Faculty of Electrical Engineering

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation

arxiv: v1 [cs.lg] 25 Sep 2018

Utilizing Class Information for DNN Representation Shaping Daeyoung Choi and Wonjong Rhee Department of Transdisciplinary Studies Seoul National University Seoul, 08826, South Korea {choid, wrhee}@snu.ac.kr

ACIQ: ANALYTICAL CLIPPING FOR INTEGER QUANTIZATION OF NEURAL NETWORKS

Anonymous authors Paper under double-blind review ABSTRACT Unlike traditional approaches that focus on the quantization at the network

A Neural-Symbolic Approach to Design of CAPTCHA

Qiuyuan Huang, Paul Smolensky, Xiaodong He, Li Deng, Dapeng Wu {qihua,psmo,xiaohe}@microsoft.com, l.deng@ieee.org, dpwu@ufl.edu Microsoft Research AI, Redmond,

arxiv: v1 [cs.cv] 2 Dec 2015

Rethinking the Inception Architecture for Computer Vision arxiv:1512.00567v1 [cs.cv] 2 Dec 2015 Christian Szegedy Google Inc. szegedy@google.com Vincent Vanhoucke vanhoucke@google.com Abstract Convolutional

arxiv: v2 [cs.cv] 12 Apr 2016

arxiv:1603.05027v2 [cs.cv] 12 Apr 2016 Identity Mappings in Deep Residual Networks Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun Microsoft Research Abstract Deep residual networks [1] have emerged

ADAPTIVE QUANTIZATION OF NEURAL NETWORKS

Soroosh Khoram Department of Electrical and Computer Engineering University of Wisconsin - Madison khoram@wisc.edu Jing Li Department of Electrical and Computer

Classification of Higgs Boson Tau-Tau decays using GPU accelerated Neural Networks

Mohit Shridhar Stanford University mohits@stanford.edu, mohit@u.nus.edu Abstract In particle physics, Higgs Boson to tau-tau

Data Analysis Project

Design of Weight-Learning Efficient Convolutional Modules in Deep Convolutional Neural Networks and its Application to Large-Scale Visual Recognition Tasks Felix Juefei-Xu May 3, 2017 1 Abstract Background.

A MAIN/SUBSIDIARY NETWORK FRAMEWORK FOR SIMPLIFYING BINARY NEURAL NETWORKS

Under review as a conference paper at ICLR 2019 Anonymous authors Paper under double-blind review ABSTRACT To reduce memory footprint