Compression of Deep Neural Networks on the Fly

Guillaume Soulié, Vincent Gripon, Maëlys Robert
Télécom Bretagne

arXiv v1 [cs.LG], 9 Sep 2015. This work was funded in part by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement.

Abstract—Because of their performance, deep neural networks are increasingly used for object recognition. They are particularly attractive because of their ability to absorb great quantities of labeled data through millions of parameters. However, as accuracy and model size increase, so do the memory requirements of the classifiers. This prohibits their usage on resource-limited hardware such as cell phones or other embedded devices. We introduce a novel compression method for deep neural networks that operates during the learning phase. It consists in adding an extra regularization term to the cost function of fully connected layers. We combine this method with Product Quantization (PQ) of the trained weights for higher savings in memory and storage consumption. We evaluate our method on two data sets (MNIST and CIFAR10), on which we achieve significantly larger compression than state-of-the-art methods.

I. MOTIVATION

Deep Convolutional Neural Networks (CNNs) [1, 2, 3, 4] have become the gold standard for object recognition, image classification and retrieval. Almost all recent successful recognition systems [4, 5, 6, 7, 8, 9] are built on top of this architecture. Importing CNNs onto embedded platforms [10], in line with the recent trend toward mobile computing, has a wide range of potential applications. It is especially useful when bandwidth is limited and when files are sensitive and should not be sent to distant servers. However, the size of a CNN is typically large (e.g. more than 200 MB), which limits the applicability of such models on embedded platforms. For example, it is hard to convince users to download an iPhone application that big. The purpose of this paper is to propose new techniques for compressing networks without sacrificing performance.

In this article we focus on compressing CNNs for computer vision tasks, although our methodology does not take advantage of this particular application field and we expect it to perform similarly on other types of learning tasks. A typical CNN [5, 7, 8] with good performance in visual object recognition contains eight layers (five convolutional layers and three densely connected layers) and a huge number of parameters in order to achieve state-of-the-art results. These parameters are highly redundant [11] and we aim at compressing them. Note that our motivation is mainly reducing storage rather than speeding up testing time [12].

A few early works on compressing CNNs have been published. In [12] and [13] the authors use compression methods to speed up CNN testing time. More recently, some works have focused on compressing neural networks to reduce their storage. These works can generally be put in two categories: some focus on compressing the fully connected layers and others on compressing the convolutional layers. In [14] the authors focus on compressing densely connected layers. In their work, instead of the traditional matrix factorization methods considered in [12, 13], they consider information-theoretic vector quantization methods [15, 16] for compressing densely connected layers. They obtain good results by applying Product Quantization.
In [17] the authors focus on compressing the fully connected layers of a Multi-Layer Perceptron (MLP) using the Hashing Trick, a low-cost hash function that randomly groups connection weights into hash buckets and assigns the same value to all the parameters in the same bucket. In [18] the authors propose compressing convolutional layers using a Discrete Cosine Transform applied to the convolutional filters, followed by the Hashing Trick, as for the fully connected layers.

For a typical CNN such as the one introduced in [8], about 90% of the storage is taken up by the densely connected layers and more than 90% of the running time is taken by the convolutional layers. This is why our work focuses on compressing the densely connected layers to reduce the storage of neural networks. Instead of using a post-learning method to compress the network, our approach consists in modifying the regularization function used during the learning phase in order to favor quantized weights in some layers, especially the output ones. To achieve this, we use an idea that was originally proposed in [19]. In order to compress the obtained networks further, we also apply PQ as described in [14] afterwards. We perform experiments on both Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs).

In this paper, we introduce a novel way to quantize weights in deep learning systems. More precisely:
- We introduce a regularization term that forces weights to converge to either 0 or 1, before applying product quantization to the learned weights.
- We show how this extra term impacts performance depending on the depth of the layer it is applied to.
- We evaluate our proposed method on celebrated benchmarks and compare it with state-of-the-art techniques.

The outline of the paper is as follows. In Section II we discuss related work. Section III introduces our methodology for compressing layers in deep neural networks. In Section IV we run experiments on celebrated databases. In Section V we discuss the biological motivation for our work. Section VI is a conclusion.

II. RELATED WORK

As already mentioned in the introduction, a state-of-the-art CNN usually involves hundreds of millions of parameters, thus requiring a storage capability that may be hard to obtain in practice. The bottleneck comes from model storage and testing speed.

Several works have been published on speeding up CNN prediction. In [20] the authors use CPU-specific tricks to speed up the execution of CNNs; in particular they focus on memory alignment and SIMD operations to boost matrix computations. In [21] the authors show that convolutional operations can be efficiently carried out in the Fourier domain, leading to a speed-up of 200%. Two very recent works [12, 13] use linear matrix factorization methods for speeding up convolutional layers and obtain a 200% speed-up with almost no loss in classification performance.

Almost all of the above-mentioned works focus on making CNN prediction faster; little work has been specifically devoted to making CNN models smaller. In [11], the authors demonstrate the redundancy of neural network parameters: they show that the weights within one layer can be accurately predicted from a small (about 5%) subset of the parameters. These results motivated the authors of [14] to apply vector quantization methods to compress the redundancy in parameter space. Their work can be viewed as a compression counterpart of the parameter prediction results reported in [11]: they are able to compress the parameters about 20 times with little decrease in performance, which further confirms the empirical findings of [11]. In their paper, they tackle the model storage issue by investigating information-theoretic vector quantization methods for compressing the parameters of CNNs. In particular, they show that for compressing the most storage-demanding densely connected layers, vector quantization performs much better than existing matrix factorization methods: simply applying k-means clustering to the weights or conducting product quantization leads to a very good balance between model size and recognition accuracy. For the 1000-category classification task of the ImageNet challenge, they achieve 16 to 24 times compression of the network with only 1% loss of classification accuracy using a state-of-the-art CNN.

In [17], a learning-based method is proposed for the first time to compress neural networks. This method, based on the Hashing Trick, allows efficient compression rates. In particular, the authors show that compressing a big neural network may be more efficient than using a smaller one: in their example they are able to divide the loss by two using an eight times larger neural network compressed eight times. The same authors also propose in [18] to compress filters in convolutional layers, arguing that the size of the convolutional layers in state-of-the-art CNNs is increasing year after year. Using the fact that learned CNN filters are often smooth, a Discrete Cosine Transform followed by the Hashing Trick allows them to compress a whole neural network without losing too much accuracy.

III. METHODOLOGY

In this section, we consider two methods for compressing the parameters of CNN layers.
We first introduce the product quantization method proposed in [14], and then present our learning-based method.

A. Product Quantization (PQ)

For more details on this quantization method, refer to [14]. The key idea is to use structured vector quantization to compress the parameters. More precisely, it consists in applying Product Quantization (PQ) [15] to exploit the redundancy of structures in vector space. PQ partitions the vector space into disjoint sub-spaces and performs quantization in each of them; it performs increasingly better as the redundancy within each subspace grows. Specifically, given the matrix W, we partition it column-wise into several sub-matrices:

W = [W^1, W^2, \dots, W^s],  (1)

where W^i \in \mathbb{R}^{m \times (n/s)}, assuming n is divisible by s (n/s is called the segment size). We can then perform a k-means clustering on the rows of each sub-matrix W^i:

\min \sum_{z=1}^{m} \min_{1 \le j \le k} \left\| w_z^i - c_j^i \right\|_2^2,  (2)

where w_z^i denotes the z-th row of sub-matrix W^i, and c_j^i denotes the j-th row of the sub-codebook C^i \in \mathbb{R}^{k \times (n/s)}. For each sub-vector w_z^i, we only need to store its cluster index and the associated codebook. Thus, the reconstructed matrix is

\hat{W} = [\hat{W}^1, \hat{W}^2, \dots, \hat{W}^s],  (3)

where \hat{w}_z^i = c_j^i, with j a minimizer of \| w_z^i - c_j^i \|_2. Note that PQ can be applied to either the x-axis or the y-axis of the matrix; in our experiments, we evaluated both cases and obtained similar performance. We need to store the cluster indexes and codebooks for each sub-matrix, and the codebook in particular is not negligible. The compression rate for this method is 32mn / (32kn + \log_2(k) m s). With a fixed segment size, increasing k decreases the compression rate.
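To make the procedure above concrete, here is a minimal Python sketch of PQ applied to a single weight matrix, using scikit-learn's k-means. The matrix size, the segment count s and the codebook size k below are illustrative choices of ours, not values taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def product_quantize(W, s=4, k=16):
    """Split an m x n matrix W column-wise into s sub-matrices and run
    k-means with k centroids on the rows of each one."""
    m, n = W.shape
    assert n % s == 0, "n must be divisible by the number of segments"
    seg = n // s
    codebooks, indexes = [], []
    for i in range(s):
        Wi = W[:, i * seg:(i + 1) * seg]              # sub-matrix W^i, shape (m, n/s)
        km = KMeans(n_clusters=k, n_init=10).fit(Wi)  # k-means on its rows
        codebooks.append(km.cluster_centers_)         # sub-codebook C^i, shape (k, n/s)
        indexes.append(km.labels_)                    # one cluster index per row
    return codebooks, indexes

def reconstruct(codebooks, indexes):
    """Rebuild the approximated matrix W_hat from indexes and codebooks."""
    return np.hstack([C[idx] for C, idx in zip(codebooks, indexes)])

# Toy usage with hypothetical sizes (not the paper's): a 512 x 256 layer.
W = np.random.randn(512, 256).astype(np.float32)
codebooks, indexes = product_quantize(W, s=4, k=16)
W_hat = reconstruct(codebooks, indexes)

m, n, s, k = 512, 256, 4, 16
rate = (32 * m * n) / (32 * k * n + np.log2(k) * m * s)   # formula from the text
print("relative reconstruction error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
print("compression rate:", rate)
```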

B. Proposed method

Our proposed method is twofold: first, we use a specific additional regularization term in order to attract network weights toward binary values; then we coarsely quantize the output layers. Let us recall that training a neural network is generally performed by minimizing a cost function using a variant of gradient descent. In order to attract network weights toward binary values, we add a binarization cost (regularizer) during the learning phase. This added cost pushes weights toward binary values. As a result, solutions of the minimization problem are expected to be binary or almost binary, depending on the scaling of the added cost with respect to the initial one. This idea is not new, although we did not find any work applying it to deep learning in the literature. Our choice of regularization term has been greatly inspired by [19]. More precisely, denoting by W the weights of the neural network, f(W) the cost associated with W, and h_W(X) and y(X) respectively the output and the label for a given input X, we obtain:

f(W) = \sum_X \left\| h_W(X) - y(X) \right\|^2 + \alpha \sum_{w \in W} |w| \, |w - 1|,  (4)

where \alpha is a scaling parameter representing the importance of the binarization cost with respect to the initial cost. Finding a good value for \alpha may be tricky: too small a value results in a failure of the binarization process, while too large a value creates local minima that prevent the network from training successfully. To facilitate this selection, we use a barrier method [19] that consists in starting with a small value of \alpha and incrementing it regularly to help the quantization process.

We observed that some layers are typically very well quantized at the end of this learning phase, whereas others are still far from binary. For that reason we then binarize some of the layers but not all. In order to further improve the compression rate, we then apply the PQ method presented in the previous subsection. The compression rate for the combined method is 32mn / (kn + \log_2(k) m s) (instead of 32mn / (32kn + \log_2(k) m s) for Product Quantization alone). With a fixed segment size, increasing k decreases the compression rate.
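As an illustration of the training-time regularizer, here is a minimal PyTorch-style sketch in which a binarization penalty in the spirit of Equation (4) is added to the task loss and α is increased on a regular schedule, mimicking the barrier method. The architecture, the cross-entropy task loss, the optimizer settings, the schedule and the `train_loader` object are our own assumptions for the sake of the example, not the paper's actual setup.

```python
import torch
import torch.nn as nn

def binarization_penalty(layer):
    """Sum of |w| * |w - 1| over the weights of one layer (cf. Equation (4))."""
    w = layer.weight
    return (w.abs() * (w - 1).abs()).sum()

# Hypothetical small MLP; only the two output (fully connected) layers
# receive the binarization cost, as suggested by the method above.
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(784, 500), nn.ReLU(),
                      nn.Linear(500, 10))
regularized_layers = [model[1], model[3]]

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

alpha, alpha_step = 1e-4, 1e-4          # barrier method: start small, grow regularly

for epoch in range(20):                 # assumed number of epochs
    for x, y in train_loader:           # train_loader is assumed to exist
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss = loss + alpha * sum(binarization_penalty(l) for l in regularized_layers)
        loss.backward()
        optimizer.step()
    alpha += alpha_step                 # increment the barrier parameter each epoch

# After training, the weights of the well-quantized layers are rounded to {0, 1}
# before PQ is applied, as described in the text.
```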
IV. EXPERIMENTS

We evaluate these methods on two image classification datasets: MNIST and CIFAR10.

A. Experimental settings

1) MNIST: The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. The neural network we use with MNIST is LeNet-5, a convolutional neural network introduced by Yann LeCun [22] and one of the LeNet family of networks, depicted in Figure 1. We use exactly the same parameters for the network (number of layers, sizes of layers, etc.).

Figure 1. A graphical depiction of a LeNet model: alternating convolution and max-pooling layers producing successive feature maps, followed by a fully connected MLP (figure inspired by [23]).

2) CIFAR10: The CIFAR10 database has a training set of 50,000 examples and a test set of 10,000 examples. It is a subset of the 80 million tiny images dataset and consists of 32x32 colour images partitioned into ten classes. With the CIFAR10 database, we use a convolutional neural network made of four convolutional layers followed by two fully connected layers. This network has been introduced in [24].

B. Layers to quantize

Our first experiments (with the MNIST database), depicted in Figure 2, show the influence of the choice of quantized layers on performance. We observe that performance strongly depends on which layers are quantized. More precisely, this experiment shows that one should quantize layers from the output to the input rather than the contrary. This result is not surprising to us, as input layers have often been described as similar to wavelet transforms, which are intrinsically analog operators, whereas output layers are often compared to high-level patterns whose detection in an image is often enough for good classification.

Figure 2. Performance of the classification task depending on which layers of the network are quantized, on the MNIST database.

C. Performance comparison

Our second experiment is a comparison with previous work. The results are depicted in Figures 3 and 4. Note that in both cases the compared networks have the exact same architecture. As far as our proposed method is concerned, we chose to compress only the two output layers, which are fully connected. Since their sizes are distinct, we were not able to use the same PQ coefficients k and m twice. Note that the larger of these layers contains almost all of the weights and is therefore the one on which we investigated the role of each parameter. We found that our added regularization cost significantly improves performance: for example, on the MNIST database, at a fixed tolerated accuracy loss, PQ alone yields a compression rate of 33, whereas our learning-based method leads to a compression rate of 107.

Figure 3. Comparison of our proposed method with previous work on the MNIST dataset.
Figure 4. Comparison of our proposed method with previous work on the CIFAR10 dataset.
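For reference, a short helper evaluating the two compression-rate expressions of Section III; the layer dimensions and PQ parameters below are arbitrary examples of ours, so the printed rates are not meant to reproduce the figures reported above.

```python
import math

def rate_pq(m, n, s, k):
    """PQ alone: 32-bit weights replaced by 32-bit codebooks plus log2(k)-bit indexes."""
    return (32 * m * n) / (32 * k * n + math.log2(k) * m * s)

def rate_binarized_pq(m, n, s, k):
    """Binarization followed by PQ: codebook entries cost 1 bit instead of 32."""
    return (32 * m * n) / (k * n + math.log2(k) * m * s)

# Hypothetical fully connected layer: 800 rows, 500 columns, 4 segments, 16 centroids.
m, n, s, k = 800, 500, 4, 16
print("PQ alone:       %.1f" % rate_pq(m, n, s, k))
print("binarized + PQ: %.1f" % rate_binarized_pq(m, n, s, k))
```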

V. BIOLOGICAL DISCUSSION

Deep neural networks have often been compared to the sensory parts of the human cortex, both in terms of architecture and performance [25]. As a matter of fact, some authors [26] propose biologically inspired models that are very similar to CNNs. For decades, the question of whether the encoding of information in the brain is analog or digital has been debated. A recent paper [27] performed experiments suggesting that the brain might use different encodings depending on the region considered. In particular, the authors show that, as far as the visual cortex is concerned, the further a region is from the sensors, the more likely its encodings are to be digital. It is tempting to draw a parallel between these findings and the ability of the brain to associate analog-like data, such as sounds and images, with digital data, such as language and labels. The results depicted in Figure 2 suggest a similar behavior in deep artificial neural systems.

VI. CONCLUSION AND PERSPECTIVES

In this paper we introduced a new method to compress convolutional neural networks. This method consists in adding an extra term to the cost function that forces weights to become almost binary. In order to compress the network even more, we then apply Product Quantization; the combination of both allows us to reach compression performance above state-of-the-art methods. We also demonstrate the influence of the depth of the binarized layers on performance. These findings are of particular interest to us and a motivation to further explore the connections between actual biological data and deep neural systems.

In future work, we will consider exploring more complex regularization functions. We also plan on performing PQ on the fly during the learning phase. Finally, other strategies could try to benefit from the sparsity of the trained weights; such sparsity could be enforced through mechanisms similar to the ones we introduce in this paper.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[2] B. B. Le Cun, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Handwritten digit recognition with a back-propagation network," in Advances in Neural Information Processing Systems, 1990.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," arXiv preprint arXiv:1409.4842, 2014.
[4] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[5] Y. Jia, "Caffe: An open source convolutional architecture for fast feature embedding," 2013.
[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," arXiv preprint, 2013.
[7] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," arXiv preprint arXiv:1312.6229, 2013.
[8] M. D. Zeiler and R. Fergus, "Stochastic pooling for regularization of deep convolutional neural networks," arXiv preprint, 2013.
[9] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, "Multi-scale orderless pooling of deep convolutional activation features," in Computer Vision - ECCV 2014. Springer, 2014.
[10] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s mobile coprocessor for deep neural networks," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on. IEEE, 2014.
[11] M. Denil, B. Shakibi, L. Dinh, N. de Freitas et al., "Predicting parameters in deep learning," in Advances in Neural Information Processing Systems, 2013, pp. 2148-2156.
[12] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in Advances in Neural Information Processing Systems, 2014, pp. 1269-1277.
[13] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," arXiv preprint arXiv:1405.3866, 2014.
[14] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," arXiv preprint arXiv:1412.6115, 2014.
[15] H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, 2011.
[16] Y. Chen, T. Guan, and C. Wang, "Approximate nearest neighbor search by residual vector quantization," Sensors, vol. 10, no. 12, 2010.
[17] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," arXiv preprint arXiv:1504.04788, 2015.
[18] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing convolutional neural networks," arXiv preprint, 2015.
[19] W. Murray and K.-M. Ng, "An algorithm for nonlinear optimization problems with binary variables," Computational Optimization and Applications, vol. 47, no. 2, pp. 257-288, 2010.
[20] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, vol. 1, 2011.
[21] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," arXiv preprint arXiv:1312.5851, 2013.
[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[23] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio, "Theano: new features and speed improvements," Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
[24] F. Chollet, "Keras: Theano-based deep learning library." [Online]. Available: https://github.com/fchollet/keras
[25] C. F. Cadieu, H. Hong, D. L. Yamins, N. Pinto, D. Ardila, E. A. Solomon, N. J. Majaj, and J. J. DiCarlo, "Deep neural networks rival the representation of primate IT cortex for core visual object recognition," PLoS Computational Biology, vol. 10, no. 12, 2014.
[26] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, "Robust object recognition with cortex-like mechanisms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 411-426, 2007.
[27] Y. Mochizuki and S. Shinomoto, "Analog and digital codes in the brain," Physical Review E, vol. 89, no. 2, p. 022705, 2014.
