Channel Pruning and Other Methods for Compressing CNN



1 Channel Pruning and Other Methods for Compressing CNN. Yihui He, Xi'an Jiaotong Univ. October 12, 2017

2 Background

Recent CNN acceleration works fall into three categories:
(1) optimized implementation (e.g., FFT, dictionary)
(2) quantization (e.g., BinaryNet)
(3) structured simplification (converting a CNN structure into a compact one)

We focus on the last one.

3 Background

[Figure: four panels (a) to (d) showing conv layers W1, W2, W3 separated by nonlinearities, with the number of channels marked.]

Structured simplification methods that accelerate CNNs: (a) a network with 3 conv layers. (b) sparse connection deactivates some connections between channels. (c) tensor factorization factorizes a convolutional layer into several pieces. (d) channel pruning reduces the number of channels in each layer (the focus of this paper).

4 Sparse Connection

S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 2015.

5 Tensor factorization

Accelerating very deep convolutional networks for classification and detection.
Speeding up convolutional neural networks with low rank expansions.

6 Our approach

[Figure: feature maps A, B, and C connected by filters W, with a nonlinearity after each convolution.]

We aim to reduce the width of feature map B while minimizing the reconstruction error on feature map C. Our optimization algorithm operates within the dotted box, which does not involve the nonlinearity. The figure illustrates the case where two channels of feature map B are pruned, so the corresponding channels of the filters W can be removed. Furthermore, even though they are not directly optimized by our algorithm, the corresponding filters in the previous layer can also be removed (marked by dotted filters). $c$, $n$: number of channels of feature maps B and C; $k_h \times k_w$: kernel size.
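The bookkeeping described on this slide can be sketched with plain array slicing. This is only an illustration under assumed conditions: the layer sizes, the random weights, and the choice of which two channels to drop are all made up, and the array names `W_prev` and `W` are hypothetical stand-ins for the filters that produce and consume feature map B.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: feature map A has 3 channels, B has c = 6, C has n = 4.
c_prev, c, n, kh, kw = 3, 6, 4, 3, 3
W_prev = rng.standard_normal((c, c_prev, kh, kw))  # filters producing B
W = rng.standard_normal((n, c, kh, kw))            # filters consuming B, producing C

# Assume channel selection decided to drop channels 1 and 4 of B.
keep = np.array([0, 2, 3, 5])

W_pruned = W[:, keep]         # drop the matching input channels of W
W_prev_pruned = W_prev[keep]  # drop the previous-layer filters that produced them

# Multiply-accumulates per output position scale linearly with input channels.
macs_before = n * c * kh * kw
macs_after = n * len(keep) * kh * kw
speedup = macs_before / macs_after

print(W_pruned.shape, W_prev_pruned.shape, speedup)
```

Note that the previous layer also gets cheaper, since it now has fewer filters to evaluate; the ratio printed here covers only the layer that consumes B.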

7 Optimization

Formally, to prune a feature map with $c$ channels, we consider applying $n \times c \times k_h \times k_w$ convolutional filters $W$ on $N \times c \times k_h \times k_w$ input volumes $X$ sampled from this feature map, which produces an $N \times n$ output matrix $Y$. Here, $N$ is the number of samples, $n$ is the number of output channels, and $k_h$, $k_w$ are the kernel size. For simplicity, the bias term is not included in our formulation. To prune the input channels from $c$ to a desired $c'$ ($0 \le c' \le c$) while minimizing the reconstruction error, we formulate the problem as follows:

$$\arg\min_{\beta, W} \frac{1}{2N} \left\| Y - \sum_{i=1}^{c} \beta_i X_i W_i^{\top} \right\|_F^2 \quad \text{subject to } \|\beta\|_0 \le c' \tag{1}$$

$\|\cdot\|_F$ is the Frobenius norm. $X_i$ is the $N \times k_h k_w$ matrix sliced from the $i$th channel of the input volumes $X$, $i = 1, \dots, c$. $W_i$ is the $n \times k_h k_w$ matrix of filter weights sliced from the $i$th channel of $W$. $\beta$ is a coefficient vector of length $c$ for channel selection, and $\beta_i$ is the $i$th entry of $\beta$. Notice that if $\beta_i = 0$, $X_i$ is no longer useful and can be safely pruned from the feature map. $W_i$ can also be removed.
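To make the shapes in Eqn. 1 concrete, here is a minimal NumPy sketch that evaluates the reconstruction error for a given selection vector β. The function name `reconstruction_error`, the synthetic data, and the sizes are assumptions made for illustration; the kernel dimensions are flattened so that X has shape (N, c, k_h·k_w) and W has shape (n, c, k_h·k_w).

```python
import numpy as np

def reconstruction_error(Y, X, W, beta):
    """Evaluate 1/(2N) * ||Y - sum_i beta_i X_i W_i^T||_F^2.

    Y: (N, n), X: (N, c, kh*kw), W: (n, c, kh*kw), beta: (c,)
    """
    N = Y.shape[0]
    # Sum over channels i and kernel positions k; s = sample, o = output channel.
    Y_hat = np.einsum("i,sik,oik->so", beta, X, W)
    return np.sum((Y - Y_hat) ** 2) / (2 * N)

# Synthetic example (all sizes and values are made up).
rng = np.random.default_rng(0)
N, c, n, kh, kw = 50, 4, 3, 3, 3
X = rng.standard_normal((N, c, kh * kw))
W = rng.standard_normal((n, c, kh * kw))
Y = np.einsum("sik,oik->so", X, W)        # output of the unpruned layer

full = np.ones(c)                          # keep every channel
masked = np.array([1.0, 0.0, 1.0, 1.0])    # prune channel 1

err_full = reconstruction_error(Y, X, W, full)
err_masked = reconstruction_error(Y, X, W, masked)
l0 = np.count_nonzero(masked)              # ||beta||_0, the number of kept channels

print(err_full, err_masked, l0)
```

With every channel kept the error is zero up to rounding, and zeroing an entry of β removes that channel's contribution and raises the error, which is what the constraint on ‖β‖₀ trades against.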

8 Optimization

Solving the $\ell_0$ minimization problem in Eqn. 1 is NP-hard. We relax the $\ell_0$ to $\ell_1$ regularization:

$$\arg\min_{\beta, W} \frac{1}{2N} \left\| Y - \sum_{i=1}^{c} \beta_i X_i W_i^{\top} \right\|_F^2 + \lambda \|\beta\|_1 \quad \text{subject to } \|\beta\|_0 \le c', \ \forall i\ \|W_i\|_F = 1 \tag{2}$$

$\lambda$ is a penalty coefficient. Increasing $\lambda$ produces more zero terms in $\beta$, and hence a higher speed-up ratio. We also add the constraint $\forall i\ \|W_i\|_F = 1$ to this formulation, which avoids the trivial solution.

We now solve this problem in two alternating steps. First, we fix $W$ and solve for $\beta$ to select channels. Second, we fix $\beta$ and solve for $W$ to minimize the reconstruction error.
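The effect of λ on sparsity can be seen in the simplest special case. For a single coefficient, the minimizer of 0.5·(b − a)² + λ|b| is the soft-thresholding of a, so any coefficient whose unpenalized magnitude falls below λ becomes exactly zero. The sketch below assumes this decoupled case (the general problem in Eqn. 2 couples the channels), and the vector `a` is made-up data.

```python
import numpy as np

def soft_threshold(a, lam):
    """Elementwise minimizer of 0.5 * (b - a)**2 + lam * |b|."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

# Made-up unpenalized coefficients, one per channel.
a = np.array([0.05, -0.3, 0.8, -1.5, 2.0])

lams = [0.0, 0.1, 0.5, 1.0]
nonzeros = [int(np.count_nonzero(soft_threshold(a, lam))) for lam in lams]

for lam, nz in zip(lams, nonzeros):
    print(f"lambda={lam}: {nz} nonzero coefficients")
```

The surviving coefficients are also shrunk toward zero by λ, which is one reason a least-squares refit of W after channel selection is useful.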

9 Optimization

(i) The subproblem of $\beta$: in this case, $W$ is fixed, and we solve for $\beta$ to select channels.

$$\hat{\beta}^{LASSO}(\lambda) = \arg\min_{\beta} \frac{1}{2N} \left\| Y - \sum_{i=1}^{c} \beta_i Z_i \right\|_F^2 + \lambda \|\beta\|_1 \quad \text{subject to } \|\beta\|_0 \le c' \tag{3}$$

Here $Z_i = X_i W_i^{\top}$ (size $N \times n$). We ignore the $i$th channel if $\beta_i = 0$.

(ii) The subproblem of $W$: in this case, $\beta$ is fixed. We use the selected channels to minimize the reconstruction error, and find the optimum by least squares:

$$\arg\min_{W'} \left\| Y - X' (W')^{\top} \right\|_F^2 \tag{4}$$

Here $X' = [\beta_1 X_1 \; \beta_2 X_2 \; \dots \; \beta_i X_i \; \dots \; \beta_c X_c]$ (size $N \times c k_h k_w$). $W'$ is $W$ reshaped to $n \times c k_h k_w$, $W' = [W_1 \; W_2 \; \dots \; W_i \; \dots \; W_c]$. After $W'$ is obtained, it is reshaped back to $W$. Then we assign $\beta_i \leftarrow \beta_i \|W_i\|_F$ and $W_i \leftarrow W_i / \|W_i\|_F$, so that the constraint $\forall i\ \|W_i\|_F = 1$ is satisfied.
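One round of the two subproblems can be sketched in NumPy as follows. This is a sketch under assumptions, not the authors' implementation: the LASSO step uses a plain coordinate-descent solver written here for self-containment (the slides do not say which solver is used), the helper names `objective`, `solve_beta`, `solve_W`, and `renormalize` are hypothetical, and the data, sizes, and λ = 1.0 are made up so that three of six channels carry very little signal.

```python
import numpy as np

def objective(Y, X, W, beta):
    """1/(2N) * ||Y - sum_i beta_i X_i W_i^T||_F^2 for X (N,c,k), W (n,c,k)."""
    Y_hat = np.einsum("i,sik,oik->so", beta, X, W)
    return np.sum((Y - Y_hat) ** 2) / (2 * Y.shape[0])

def solve_beta(X, Y, W, lam, beta0, n_sweeps=50):
    """Step (i): coordinate descent on the LASSO subproblem, W fixed."""
    N = Y.shape[0]
    Z = np.einsum("sik,oik->iso", X, W)        # Z_i = X_i W_i^T, shape (c, N, n)
    sq = np.sum(Z ** 2, axis=(1, 2)) / N       # ||Z_i||_F^2 / N
    beta = beta0.astype(float).copy()
    R = Y - np.tensordot(beta, Z, axes=1)      # current residual
    for _ in range(n_sweeps):
        for i in range(len(beta)):
            if sq[i] == 0.0:
                continue
            R += beta[i] * Z[i]                # take channel i out of the fit
            rho = np.sum(Z[i] * R) / N
            beta[i] = np.sign(rho) * max(abs(rho) - lam, 0.0) / sq[i]
            R -= beta[i] * Z[i]                # put the updated channel back
    return beta

def solve_W(X, Y, beta):
    """Step (ii): least squares for W' with X' = [beta_1 X_1 ... beta_c X_c]."""
    N, c, k = X.shape
    X_prime = (X * beta[None, :, None]).reshape(N, c * k)
    W_prime_T, *_ = np.linalg.lstsq(X_prime, Y, rcond=None)  # (c*k, n)
    W = W_prime_T.T.reshape(Y.shape[1], c, k)                # back to (n, c, k)
    W[:, beta == 0] = 0.0                      # pruned channels carry no weights
    return W

def renormalize(W, beta):
    """beta_i <- beta_i * ||W_i||_F and W_i <- W_i / ||W_i||_F for kept channels."""
    norms = np.sqrt(np.sum(W ** 2, axis=(0, 2)))
    keep = (beta != 0) & (norms > 0)
    W, beta = W.copy(), beta.copy()
    beta[keep] *= norms[keep]
    W[:, keep] /= norms[keep][None, :, None]
    return W, beta

# Synthetic layer: channels 3..5 have tiny weights, so they are cheap to drop.
rng = np.random.default_rng(0)
N, c, n, k = 200, 6, 4, 9
X = rng.standard_normal((N, c, k))
W0 = rng.standard_normal((n, c, k))
W0[:, 3:] *= 0.05
Y = np.einsum("sik,oik->so", X, W0)

beta = solve_beta(X, Y, W0, lam=1.0, beta0=np.ones(c))
err_lasso = objective(Y, X, W0, beta)
W1 = solve_W(X, Y, beta)
err_ls = objective(Y, X, W1, beta)
W1, beta = renormalize(W1, beta)
print(np.flatnonzero(beta), err_lasso, err_ls)
```

The renormalization leaves each product β_i W_i unchanged, so it does not affect the reconstruction; it only moves the scale into β so that the unit-norm constraint holds before the next round.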

10 Analysis

11 Analysis

[Figure: six panels (conv1_1, conv2_1, conv3_1, conv3_2, conv4_1, conv4_2), each plotting increase of error (%) against speed-up ratio for three methods: first k, max response, and ours.]

12 Analysis

13 Results

Table 1: Accelerating the VGG-16 model using a speed-up ratio of 2, 4, or 5. Entries are the increase of top-5 error (1-view, baseline 89.9%); smaller is better.

Solutions compared: Jaderberg (Zhang 2015's impl.); Asym; Filter pruning (fine-tuned, our impl.); Ours (without fine-tune); Ours (fine-tuned).

14 Results

Table 2: Performance of combined methods on the VGG-16 model using a speed-up ratio of 4 or 5. Entries are the increase of top-5 error (1-view, baseline 89.9%); smaller is better. Our 3C solution outperforms previous approaches.

Solutions compared: Asym. 3D; Asym. 3D (fine-tuned); Our 3C; Our 3C (fine-tuned).

15 Absolute performance

Table 3: GPU acceleration comparison. We measure forward-pass time per image (smaller is better). Our approach generalizes well on GPU.

16 Acc Detection

2×, 4× acceleration for Faster R-CNN detection. mmAP is AP at IoU=.50:.05:.95 (the primary challenge metric of COCO [32]).


arXiv: v2 [cs.CV] 21 Aug 2017. Channel Pruning for Accelerating Very Deep Neural Networks. Yihui He*, Xi'an Jiaotong University, Xi'an, 70049, China, heyihui@stu.xjtu.edu.cn. Xiangyu Zhang, Megvii Inc., Beijing, 0090, China, zhangxiangyu@megvii.com