Rethinking Binary Neural Network for Accurate Image Classification and Semantic Segmentation


Bohan Zhuang¹, Chunhua Shen¹, Mingkui Tan², Lingqiao Liu¹, and Ian Reid¹
¹The University of Adelaide, Australia; and Australian Centre for Robotic Vision
²South China University of Technology, China

arXiv v1 [cs.cv] 22 Nov 2018

Abstract—In this paper, we propose to train a network with both binary weights and binary activations, designed specifically for mobile devices with limited computation capacity and power consumption. Previous works on quantizing CNNs uncritically assume the same architecture as full-precision networks, which we term value approximation: their objective is to preserve the floating-point information using a set of discrete values. However, we take a novel view: for best performance, it is very likely that a different architecture is better suited to binary weights and binary activations. Thus we directly design such a highly accurate binary network structure, which we term structure approximation. In particular, we propose a network decomposition strategy in which we divide the network into groups and aggregate a set of homogeneous binary branches to implicitly reconstruct the full-precision intermediate feature maps. In addition, we also learn the connections between groups. We further provide a comprehensive comparison among all quantization categories. Experiments on ImageNet classification demonstrate the superior performance of the proposed model, named Group-Net, over various popular architectures. In particular, we outperform the previous best binary neural network in terms of accuracy while saving substantial computational complexity. Furthermore, the proposed Group-Net can effectively exploit task-specific properties for strong generalization: we propose to extend Group-Net for lossless semantic segmentation. To our knowledge, this is the first work in the literature to address dense pixel prediction with BNNs. We argue that considering both value and structure approximation should be the future development direction of BNNs.

CONTENTS
I    Introduction
II   Related Work
III  Method
     III-A  Binarization function
     III-B  Network Binary Decomposition
            III-B1  Layer-wise feature reconstruction
            III-B2  Group-wise feature approximation
            III-B3  Learning to decompose
     III-C  From image classification to semantic segmentation
IV   Discussions
V    Experiment
     V-A  Implementation details
     V-B  Evaluation on ImageNet
            V-B1  Comparison with binary neural networks
            V-B2  Comparison with fixed-point approaches
            V-B3  Hardware implementation on ImageNet
     V-C  Evaluation on PASCAL VOC
VI   Conclusion
VII  Ablation study
     VII-1  Group space exploration
     VII-2  Layer-wise vs. group-wise
     VII-3  Effect of the number of bases
VIII More discussions
IX   Extending Group-Net to binary weights and low-precision activations
     IX-1  Fixed-point activation quantization
     IX-2  Implementation details
     IX-3  Evaluation on ImageNet
References

I. INTRODUCTION

Designing deeper and wider convolutional neural networks has led to significant breakthroughs in many machine learning tasks, such as image classification [1], [2], object detection [3], [4] and object segmentation [5]. However, accurate deep models require billions of FLOPs, which makes it infeasible to run real-time applications on resource-constrained mobile platforms. To solve this problem, many existing works focus on network pruning [6], [7], low-bit quantization [8], [9] and efficient architecture design [10], [11]. Quantization approaches represent weights and activations with low-bitwidth fixed-point integers, so that a dot product can be computed by several bitwise operations. The xnor-popcount operation requires only a single logic gate instead of the hundreds of units needed for floating-point multiplication [12], [13]. BNNs [14], [15] can be treated as an extreme quantization approach where both weights and activations are restricted to a single bit, either +1 or -1. In this paper, we aim to design a highly accurate binary neural network from both the quantization and the efficient-architecture-design perspectives.

Existing quantization methods are mainly divided into two categories. Methods in the first category design more effective optimization algorithms to find better local minima for quantized weights, either by introducing knowledge distillation [8], [16], [17] or by using loss-aware objectives [18], [19]. Approaches in the second category focus on improving the quantization function [20]-[22]; it is essential to learn suitable mappings between discrete values and their floating-point counterparts to maintain performance. However, designing the quantization function is highly non-trivial, especially for BNNs, because quantization functions are non-differentiable and gradients can only be roughly approximated. Both of these categories belong to value approximation: their common objective is to discretize continuous weights and/or activations while preserving most of the representability of the original network. However, value approximation has a natural disadvantage, since it is merely a local approximation: it can only approximate a full-precision model without adapting to a specific task. Given a model pretrained on a given task, the quantization error will inevitably affect the final performance.

In contrast, we explore a third category, which we call structure approximation. The objective is to redesign a binary architecture that can match the capability of the floating-point model. In particular, we propose to partition the full-precision model into groups and use a set of parallel binary bases to approximate its floating-point structural counterpart. Compared to value approximation, this is a higher-level category in which structural information is preserved. Moreover, we can design flexible binary structures according to a given task and utilize the task-specific structure to compensate for the quantization loss and facilitate training. For example, when transferring Group-Net from image classification to semantic segmentation, we are motivated by the structure of Atrous Spatial Pyramid Pooling (ASPP) [23]. In DeepLab v3 [24] and v3+ [25], ASPP is applied only on top of the extracted features, while each block in the backbone network can employ only one atrous rate.
In contrast, we propose to directly apply different atrous rates to the parallel binary bases in the backbone network, which is equivalent to absorbing ASPP into the feature extraction stage. In this way, we do not theoretically increase the computational complexity of binary convolution while significantly boosting the performance on semantic segmentation. Notably, previous quantization approaches cannot generalize their value approximation strategies to semantic segmentation or other computer vision tasks in this way; this is the most significant advantage of Group-Net. In fact, value and structure approximation are complementary rather than contradictory: both should be carefully designed for highly accurate BNNs.

We are also motivated by energy-efficient architecture design approaches [11], [26], [27], whose objective is to replace the traditional expensive convolution with computationally efficient convolutional operations (e.g., depthwise separable convolution, 1x1 convolution). In comparison, we propose to design binary network architectures for dedicated hardware from the quantization view. Most previous quantization works focus on directly quantizing the full-precision architecture; here we begin to explore alternative architectures that we show are better suited to binary weights and activations. Specifically, apart from decomposing each group into several binary bases, we also propose to learn the connections between groups by introducing a fusion gate. Moreover, Group-Net has strong scalability, so that further improvements such as Neural Architecture Search [28]-[30] can be incorporated.

Our contributions are summarized as follows:
- We propose to design accurate BNN structures from the structure approximation perspective. Specifically, we divide the network into groups and approximate each group using a set of binary bases. We also propose to automatically learn the decomposition by introducing soft connections.
- The proposed Group-Net has strong scalability and can be easily transferred to other tasks. In this paper, we propose Binary Parallel Atrous Convolution (BPAC), which embeds rich multi-scale context into BNNs for accurate semantic segmentation. Group-Net with BPAC significantly improves the performance while maintaining the complexity, compared to employing Group-Net alone.
- We evaluate our models on the ImageNet and PASCAL VOC datasets based on ResNet. Extensive experiments show that the proposed Group-Net achieves a state-of-the-art trade-off between accuracy and computational complexity. From the results, we also claim that increasing the number of gradient paths (i.e., more filters or more shortcuts) is essential for training BNNs.

II. RELATED WORK

Network quantization: The recent increasing demand for implementing fixed-point deep neural networks on embedded devices motivates the study of network quantization. Fixed-point approaches that use low-bitwidth discrete values to approximate real ones have been extensively explored in the literature [8], [14], [15], [20], [21], [31]. BNNs [14], [15] propose to constrain both weights and activations to binary values (i.e., +1 and -1), so that the multiply-accumulations can be replaced purely by xnor(·) and popcount(·) operations. To trade off accuracy against complexity, [32]-[35] propose to recursively binarize the residual from lower bits while still retaining the advantage of binary operations. However, multiple binarizations form a sequential process that cannot be parallelized. In [31], a linear combination of binary weight/activation bases is used to approximate the floating-point counterparts during forward propagation. Different from all these local tensor approximation approaches, we propose to directly design BNNs from the structure approximation perspective.

Efficient architecture design: There has been rising interest in designing efficient architectures in the recent literature. Efficient model designs like GoogLeNet [36] and SqueezeNet [26] propose to replace 3x3 convolutional kernels with 1x1 kernels to reduce the complexity while increasing the depth and accuracy. Additionally, separable convolutions have proved effective in the Inception approaches [37], [38]. This idea is further generalized as depthwise separable convolutions by Xception [10], MobileNet [11], [39] and ShuffleNet [27] to generate energy-efficient network structures. To avoid hand-crafted heuristics, neural architecture search [28]-[30], [40], [41] has been explored for automatic model design.

III. METHOD

The objective of this paper is to binarize both the weights and the activations of a network. To aid the description, we adopt the following terminology: a layer is a standard single parameterized layer in a network, such as a dense or convolutional layer; a block is a collection of layers whose final output is connected to the input of the next block (e.g., a residual block); a group is a collection of blocks. We first elaborate on the binarization functions in Sec. III-A. Then, in Sec. III-B, we explain our binary architecture design strategy. Finally, in Sec. III-C, we describe how to utilize task-specific attributes to generalize our approach to semantic segmentation.

A. Binarization function

For a convolutional layer, we define the input $x \in \mathbb{R}^{c_{in} \times w_{in} \times h_{in}}$, the weight filter $w \in \mathbb{R}^{c \times w \times h}$ and the output $y \in \mathbb{R}^{c_{out} \times w_{out} \times h_{out}}$, respectively.

Binarization of weights: Following [15], we estimate the floating-point weight w with a binary weight filter b^w and a scaling factor α ∈ R+ such that w ≈ α b^w, where b^w is the sign of w and α is the mean of the absolute values of w. In general, sign(·) is non-differentiable, so we adopt the straight-through estimator (STE) [42] to approximate the gradient. Formally, the forward and backward processes are:

$$\text{Forward: } b^w = \alpha\,\mathrm{sign}(w), \qquad \text{Backward: } \frac{\partial \ell}{\partial w} = \frac{\partial \ell}{\partial b^w}\,\frac{\partial b^w}{\partial w} \approx \frac{\partial \ell}{\partial b^w}, \qquad (1)$$

where ℓ is the loss. In practice, we find such a binarization scheme to be quite stable.

Binarization of activations: For activation binarization, we utilize a piecewise polynomial function to approximate the sign function, as in [43].
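To make Eq. (1) concrete, below is a minimal PyTorch sketch of the weight binarization with the straight-through estimator. The class name is illustrative and the snippet is an assumption about one possible implementation, not the authors' released code.

```python
import torch

class BinarizeWeightSTE(torch.autograd.Function):
    """Eq. (1): forward b^w = alpha * sign(w); backward passes the gradient straight through."""

    @staticmethod
    def forward(ctx, w):
        alpha = w.abs().mean()  # scaling factor: mean of the absolute weight values
        return alpha * torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: dl/dw ~= dl/db^w
        return grad_out

if __name__ == "__main__":
    w = torch.randn(16, 3, 3, 3, requires_grad=True)  # a bank of 3x3 filters
    b_w = BinarizeWeightSTE.apply(w)
    b_w.sum().backward()
    print(b_w.unique().numel(), w.grad.shape)  # two unique values (+alpha, -alpha)
```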
The forward and backward passes can be written as:

$$\text{Forward: } b^a = \mathrm{sign}(x), \qquad \text{Backward: } \frac{\partial \ell}{\partial x} = \frac{\partial \ell}{\partial b^a}\,\frac{\partial b^a}{\partial x}, \quad \text{where } \frac{\partial b^a}{\partial x} = \begin{cases} 2 + 2x, & -1 \le x < 0 \\ 2 - 2x, & 0 \le x < 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$

Remark: Most previous works focus on designing accurate binarization functions for weights and activations (e.g., multiple binarizations [31], [33]-[35]). This belongs to tensor (value) approximation. In contrast, we propose a new perspective for training BNNs in Sec. III-B.

B. Network Binary Decomposition

Suppose a floating-point residual network Φ has N blocks. Our objective is to decompose Φ into binary structures while preserving its representability, as shown in Figure 1. We assume Φ is decomposed into P binary fragments [F_1, ..., F_P], where each F_i(·) can be any binary structure and the F_i(·) may differ from one another. We explore two different architectural choices for F(·), which we call layer-wise and group-wise, in Sec. III-B1 and Sec. III-B2, respectively. We then explain the strategy for automatic decomposition in Sec. III-B3.

1) Layer-wise feature reconstruction: The simplest structure for F(·) is a single convolutional layer. At each layer, we aim to reconstruct the full-precision output feature map, given the binary input tensor x, using a set of binarized homogeneous branches:

$$F(x) = \sum_{i=1}^{K} \lambda_i f_i(x) = \sum_{i=1}^{K} \lambda_i\,\big(b^w_i \oplus \mathrm{sign}(x)\big), \qquad (3)$$

where f_i(·) is a binary convolution realized by the bitwise operations xnor(·) and popcount(·) (denoted ⊕), K is the number of branches and λ_i is a scale factor. Fig. 2 (b) illustrates the layer-wise feature reconstruction for a single block. The scale factor can be absorbed into batch normalization during inference. All f_i's in Eq. (3) have the same convolution hyper-parameters as the original floating-point counterpart. In this way, each binary branch gives a rough transformation and all the transformations are aggregated to approximate the original full-precision output feature map. It can be expected that with the increase of K, we obtain stronger fitting capability through more complex transformations.
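The activation binarization of Eq. (2), with the piecewise polynomial gradient of [43], can be sketched in the same illustrative style (again an assumption, not the paper's released code):

```python
import torch

class BinarizeActivationSTE(torch.autograd.Function):
    """Eq. (2): forward b^a = sign(x); backward approximates d sign/dx with the
    piecewise polynomial 2+2x on [-1, 0), 2-2x on [0, 1), and 0 elsewhere."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        grad = torch.zeros_like(x)
        grad = torch.where((x >= -1) & (x < 0), 2 + 2 * x, grad)
        grad = torch.where((x >= 0) & (x < 1), 2 - 2 * x, grad)
        return grad_out * grad
```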

Fig. 1: Illustration of the main objective. We directly design the BNN structure by decomposing the full-precision network (floating-point layers) into binary bases (binary layers).

Fig. 2: Overview of the baseline binarization method and the two proposed approaches, illustrated on one residual block. (a): Directly binarizing the full-precision block. (b): Reconstructing the full-precision activations layer-wise using a set of binary convolutional layers with scales λ_1, ..., λ_K. (c): Approximating the full-precision activations group-wise with a linear combination of binary bases µ_i.

Fig. 3: Illustration of several possible group-wise architectures. We assume the original full-precision network comprises four blocks, as shown in (a); skip connections are omitted for convenience. (b): Each group comprises one block of (a) and we approximate each floating-point block's output with a set of binarized blocks. (c): We partition two blocks into a group. (d): Approximating features using a flexible block combination. (e): An extreme case, where we ensemble a set of binarized networks to approximate the output before the classification layer.

A special case is K = 1, which corresponds to directly quantizing the full-precision network (Fig. 2 (a)). The structure is fixed during training and each binary convolutional kernel b^w_i is directly optimized end-to-end. During inference, the K homogeneous bases are parallelizable and hardware-friendly, which brings a large speed-up at test time. However, this strategy has an apparent limitation: the quantization function in each branch introduces a certain amount of error, and as the estimation error of previous layers propagates into the current multi-branch convolutional layer, it is enlarged by the quantization function and the final aggregation. As a result, it may cause large quantization errors, especially in deeper layers, and large deviations in the gradients during backpropagation. To solve this problem, we further propose a flexible group-wise approximation approach in Sec. III-B2.

Complexity: The bitwise XNOR operation and bit-counting can be performed with 64-way parallelism by the current generation of CPUs [15], [43]. We need to calculate K binary convolutions and K full-precision additions, so the speed-up ratio σ is

$$\sigma = \frac{c_{in} c_{out} w h\, w_{in} h_{in}}{\frac{1}{64} K c_{in} c_{out} w h\, w_{in} h_{in} + K c_{out} w_{out} h_{out}} = \frac{64}{K}\cdot\frac{c_{in} c_{out} w h\, w_{in} h_{in}}{c_{in} c_{out} w h\, w_{in} h_{in} + 64\, c_{out} w_{out} h_{out}}. \qquad (4)$$
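To illustrate Eq. (3) and Fig. 2 (b), the sketch below aggregates K binary branches with learnable scales, reusing the two binarization functions sketched above. It is a training-time sketch under our own assumptions: each branch is computed with an ordinary float convolution over {-1, +1} tensors, whereas a deployment would use xnor/popcount kernels (whose speed-up is given by Eq. (4)).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerwiseBinaryConv(nn.Module):
    """Eq. (3): F(x) = sum_i lambda_i * (b^w_i conv sign(x)) with K homogeneous branches.
    Every branch keeps the hyper-parameters of the floating-point convolution it replaces."""

    def __init__(self, in_ch, out_ch, k_bases=5, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.weights = nn.ParameterList([
            nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.01)
            for _ in range(k_bases)])
        self.scales = nn.Parameter(torch.ones(k_bases))  # lambda_i, absorbed into BN at inference
        self.stride, self.padding = stride, padding

    def forward(self, x):
        b_a = BinarizeActivationSTE.apply(x)             # sign(x) with the STE of Eq. (2)
        out = 0
        for lam, w in zip(self.scales, self.weights):
            b_w = BinarizeWeightSTE.apply(w)             # alpha * sign(w) with the STE of Eq. (1)
            out = out + lam * F.conv2d(b_a, b_w, stride=self.stride, padding=self.padding)
        return out
```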

2) Group-wise feature approximation: In the layer-wise approach, we approximate each layer separately. In this section, we instead explore the idea of a structural ensemble: approximating an entire group, where F(·) is a hierarchical cascade of layers. The motivation is as follows. As explained in Sec. III-B1, there is a trade-off between how often we reconstruct the intermediate activations and how well we suppress the quantization error. In analogy to the extreme low-level layer-wise case, we also show the extreme high-level case in Fig. 3 (e), where we directly ensemble a set of binary networks. In the ensemble model, the output quantization error of each branch can be very large, even though we aggregate the outputs only once. Clearly, these two extreme cases may not be optimal, which motivates us to explore the mid-level cases. More specifically, we propose a group-wise approximation strategy that approximates residual blocks or several consecutive layers as a whole.

We first consider the simplest case where each group consists of only one block (i.e., the group comprises one block of Fig. 3 (a)). The layer-wise approximation method above can then be easily extended to the group-wise structure. The most typical block structure is the bottleneck architecture, and we can extend Eq. (3) as:

$$F(x) = \sum_{i=1}^{K} \lambda_i\,\mu_i(x), \qquad (5)$$

where µ_i(·) is a binary residual bottleneck [2] that consists of several binary convolutions and floating-point element-wise operations (i.e., ReLU, BN), and λ_i is the scale factor. In Eq. (5), we use a linear combination of homogeneous binary bases to approximate one group, where each base µ_i is a binarized block. We illustrate such a group in Fig. 2 (c) and the framework consisting of these groups in Fig. 3 (b). In this way, we effectively keep the original residual structure in each base to preserve the network capacity, and we keep a balance between suppressing quantization error accumulation and reconstructing features. Moreover, compared to the layer-wise strategy, both the number of parameters and the complexity decrease, since there is no need to apply floating-point tensor aggregation within the group.

Furthermore, the group-wise approximation is flexible. We now analyze the case where each group may contain a different number of blocks. Suppose we partition the network into P groups, following the simple rule that each group must include one or multiple complete residual building blocks. Let {G_1, G_2, ..., G_P} be the group indexes at which we approximate the output feature maps. For the p-th group, we consider the blocks {B_{p-1}+1, ..., B_p}, where the index B_{p-1} = 0 if p = 1. We can then extend Eq. (5) into the multiple-block format:

$$F(x_{B_{p-1}+1}) = \sum_{i=1}^{K} \theta_i\, u_i^{B_p}\big(u_i^{B_p-1}(\cdots(u_i^{B_{p-1}+1}(x_{B_{p-1}+1}))\cdots)\big), \qquad (6)$$

Based on F(·), we can efficiently construct a network by stacking these groups, where each group may consist of one or multiple blocks. Different from Eq. (5), we further expose a new dimension on each base, namely the number of blocks. This greatly increases the flexibility of the framework, and we illustrate several possible connections in Fig. 3. We describe how to learn the connections in Sec. III-B3.
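A minimal sketch of the group-wise approximation of Eq. (5): each base µ_i is a binarized block that keeps the floating-point residual structure, and the K bases are combined linearly. The factory argument make_binary_block is a placeholder for any binarized block (for instance one built from the layer-wise convolution sketched earlier); all names here are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class GroupWiseBase(nn.Module):
    """Eq. (5): F(x) = sum_i lambda_i * mu_i(x), where each mu_i is a binarized residual
    block (binary convolutions plus floating-point BN, ReLU and skip connection)."""

    def __init__(self, make_binary_block, k_bases=5):
        super().__init__()
        self.bases = nn.ModuleList([make_binary_block() for _ in range(k_bases)])  # mu_i
        self.scales = nn.Parameter(torch.ones(k_bases))                            # lambda_i

    def forward(self, x):
        return sum(lam * mu(x) for lam, mu in zip(self.scales, self.bases))
```

Stacking such groups (one or more blocks per group) yields the architectures sketched in Fig. 3 (b)-(d); Eq. (6) corresponds to letting each µ_i itself be a cascade of blocks.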
3) Learning to decompose: There is still a problem to be solved in Eq. (6). Since the network has N blocks, the possible number of connections is 2^N, so it is not practical to enumerate all possible structures for training. To solve this problem, we explain how to automatically learn the optimal structural decomposition. We introduce a fusion gate as the soft connection between blocks µ(·), and define the input of the i-th branch for the n-th block as:

$$G_i^n = \mathrm{sigmoid}(\theta_i^n), \qquad x_i^n = G_i^n \odot \mu_i^{n-1}(x_i^{n-1}) + (1 - G_i^n) \odot \sum_{j=1}^{K} \lambda_j\,\mu_j^{n-1}(x_j^{n-1}), \qquad (7)$$

where θ ∈ R^K is a learnable parameter. The branch input x_i^n is a weighted combination of two paths. The first path is the output of the corresponding i-th branch in the (n-1)-th block, which is a straight connection. The second path is the aggregated output of the (n-1)-th block. In this way, we let more information flow into each branch and increase the number of gradient paths, which improves the convergence of BNNs. The extreme case is when Σ_{i=1}^{N} G_i = 0: Eq. (7) then reduces to Eq. (5), where we approximate each block independently. When Σ_{i=1}^{N} G_i = N, it corresponds to directly ensembling K binary networks, as shown in Fig. 3 (e). We also discuss further extensions in Sec. VIII.

Fig. 4: Comparison between the conventional dilated convolution and BPAC. For convenience, the group has only one convolutional layer. (a): The original floating-point dilated convolution (e.g., a 3x3 convolution with dilation rate 2). (b): We decompose the floating-point atrous convolution into a combination of binary bases, where each base uses a different dilation rate, and the output features of the binary branches are summed as the final representation.

Fig. 5: Illustration of the soft connection between two neighbouring blocks. The fusion gate is optimized directly end-to-end.
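The soft connection of Eq. (7) can be read as the following sketch: θ is a learnable vector of length K, and the gate mixes, per branch, the straight connection from the same branch of the previous block with the aggregated output of that block. This is only an illustrative reading of Eq. (7), not the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Eq. (7): x_i^n = G_i * mu_i(x_i^{n-1}) + (1 - G_i) * sum_j lambda_j * mu_j(x_j^{n-1}),
    with G_i = sigmoid(theta_i). G_i -> 1 keeps the straight per-branch connection,
    G_i -> 0 feeds every branch the aggregated output of the previous block."""

    def __init__(self, k_bases=5):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(k_bases))   # one gate logit per branch
        self.scales = nn.Parameter(torch.ones(k_bases))   # lambda_j of the previous block

    def forward(self, branch_outputs):
        # branch_outputs: list of K tensors mu_j^{n-1}(x_j^{n-1}) from the previous block
        gates = torch.sigmoid(self.theta)
        aggregated = sum(lam * y for lam, y in zip(self.scales, branch_outputs))
        return [g * y + (1 - g) * aggregated for g, y in zip(gates, branch_outputs)]
```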

C. From image classification to semantic segmentation

The proposed Group-Net structure has strong generalization ability. We first analyse the relationship between semantic segmentation and general image classification. Semantic segmentation is a dense pixel-wise classification task. The image classification network serves as the encoder of the segmentation model, and the feature quality greatly influences the pixel-wise prediction. In other words, binarizing the encoder inevitably introduces a large bias into the final prediction. How can this be compensated? For segmentation, it is essential to capture multi-scale information in order to accurately classify regions of arbitrary scale. Inspired by this, we propose to embed rich multi-scale context into Group-Net. To achieve this, we first revisit Atrous Spatial Pyramid Pooling [24], where several parallel atrous convolutions with different atrous rates are applied on top of the feature map. However, traditional ASPP is applied only on top of the features extracted from the network backbone, while each block in the network can employ only one atrous rate, as shown in Fig. 4 (a). Interestingly, the proposed Group-Net can naturally remove this limitation of ASPP and seamlessly embed it into the backbone. Specifically, each binary base uses a different atrous rate, as shown in Fig. 4 (b). We call the new structure Binary Parallel Atrous Convolution (BPAC). In this way, each binary base captures features at a different scale, and we fuse the multi-level features from all the binary branches by summation at the end of the group. Moreover, changing the atrous rates does not theoretically increase the complexity of binary atrous convolution. We effectively compensate for the binarization error by incorporating rich context into the approximation for semantic segmentation, which brings a large performance improvement, as shown in Sec. V-C. In this paper, we use the same ResNet backbone as in [24], [25] with output stride = 8, where the last two blocks employ atrous convolution. In BPAC, we set rates = 2·{1, ..., K} and rates = 4·{1, ..., K} for the K bases in the last two blocks, respectively. BPAC is a general framework and can be applied to other tasks such as depth estimation.
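Under the same assumptions as the earlier layer-wise sketch (float convolutions standing in for xnor/popcount kernels, and the two STE functions above in scope), BPAC can be sketched as K parallel binary branches that differ only in their dilation rate, with the outputs summed as in Fig. 4 (b):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BPACConv(nn.Module):
    """Binary Parallel Atrous Convolution: branch i uses dilation rate_scale * i, so each
    binary base captures context at a different scale at no extra binary-convolution cost."""

    def __init__(self, in_ch, out_ch, k_bases=5, rate_scale=2):
        super().__init__()
        self.weights = nn.ParameterList([
            nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01) for _ in range(k_bases)])
        self.scales = nn.Parameter(torch.ones(k_bases))
        self.rates = [rate_scale * (i + 1) for i in range(k_bases)]  # e.g. 2, 4, ..., 2K

    def forward(self, x):
        b_a = BinarizeActivationSTE.apply(x)
        out = 0
        for lam, w, rate in zip(self.scales, self.weights, self.rates):
            b_w = BinarizeWeightSTE.apply(w)
            # padding = dilation keeps the spatial size of a 3x3 atrous convolution unchanged
            out = out + lam * F.conv2d(b_a, b_w, padding=rate, dilation=rate)
        return out
```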
IV. DISCUSSIONS

Theoretical computational complexity analysis: We provide a clear and comprehensive comparison of various quantization approaches in Table I. With respect to the trade-off between computational complexity and prediction accuracy, our approach achieves the best performance. For example, in the previous state-of-the-art binary model [31], each convolutional layer is approximated using K weight bases and K activation bases, which requires K² binary convolutions. In contrast, we only need to approximate several intermediate feature maps with K structural bases. As reported in Sec. V-B, we save approximately K times the computational complexity while still achieving comparable Top-1 accuracy. Since we use K structural bases, the number of parameters increases by K times in comparison to the full-precision counterpart, but we still save memory bandwidth by 32/K times since all the weights in our approach are binary. Because there are element-wise operations between groups, the computational complexity saving is slightly less than 64/K.

Differences between Group-Net and fixed-point approaches [8], [20], [22]: Intuitively, one might think the proposed Group-Net with K bases is equivalent to K-bit fixed-point approaches. However, this is not the case. We now show the bitwise operations for the inner product between fixed-point weights and activations. Let a weight vector w ∈ R^N be encoded by vectors b^w_i ∈ {-1, 1}^N, i = 0, ..., K-1. Assume we also quantize the activations to K bits; similarly, the activations x can be encoded by b^a_j ∈ {-1, 1}^N, j = 0, ..., K-1. The convolution can then be written as¹

$$Q_K(w)^{T} Q_K(x) = \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} 2^{\,i+j}\,\big(b^w_i \odot b^a_j\big). \qquad (8)$$

During inference, it first needs to obtain the encoding b^a_j for each bit via linear regression, and then it calculates and sums over K² xnor(·) and popcount(·) operations; the complexity is slightly larger than O(K²). Moreover, the output range for each element of the activations is [-(2^K - 1)² N, (2^K - 1)² N]. In contrast, we directly obtain b^a by applying sign(·), and we only need K xnor(·) and popcount(·) operations (see Eq. (3)) before summing the outputs, so the computational complexity is slightly larger than O(K). For binary convolution, each element-wise product lies in {-1, 1}, and the value range for each element after summation is [-KN, KN], in which the number of representable distinct values is much smaller than in fixed-point methods. In summary, we need only 1/K of the complexity of K-bit quantization and also save a factor of (2^K - 1)²/K in intermediate storage during computation. Even K-bit quantization with similar computational complexity requires more memory bandwidth to feed signals in SRAM or in registers. Since the representability and computational complexity differ substantially, it is not entirely appropriate to directly compare Group-Net with fixed-point methods; even so, we provide such a comparison in Sec. V-B2.

Differences between Group-Net and multiple-binarization approaches: In [31], a linear combination of binary weight/activation bases is obtained from the full-precision weights/activations without being directly learned, and a linear regression problem has to be solved for each layer during forward propagation. In contrast, we directly design the binary network structure, where the binary weights are optimized end-to-end. [33]-[35] propose to use multiple binarizations, where each bit binarizes the residual error from the previous bits; however, this is a sequential process that cannot be parallelized. All multiple-binarization methods belong to local tensor approximation.

¹We omit learned scales here and only consider uniform quantization.
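The following small numerical check illustrates the contrast drawn above: a K-bit fixed-point inner product expands into K² binary inner products weighted by powers of two (Eq. (8)), whereas a Group-Net-style sum needs only K binary inner products (Eq. (3)). The {-1, +1} bit encoding below is chosen purely for illustration and is our own assumption, not the paper's exact quantizer.

```python
import torch

K, N = 4, 1024  # bit-width and vector length (illustrative values)

# Random {-1, +1} encodings of a K-bit weight vector and a K-bit activation vector.
b_w = torch.randint(0, 2, (K, N)) * 2 - 1
b_a = torch.randint(0, 2, (K, N)) * 2 - 1

# Fixed-point values reconstructed from their binary expansions.
q_w = sum((2 ** i) * b_w[i] for i in range(K))
q_a = sum((2 ** j) * b_a[j] for j in range(K))

# Eq. (8): the fixed-point inner product decomposes into K*K binary inner products.
lhs = torch.dot(q_w.float(), q_a.float())
rhs = sum((2 ** (i + j)) * torch.dot(b_w[i].float(), b_a[j].float())
          for i in range(K) for j in range(K))
assert torch.isclose(lhs, rhs)  # K^2 = 16 binary dot products were needed

# A Group-Net-style aggregation with K bases needs only K binary inner products.
x_sign = torch.randint(0, 2, (N,)) * 2 - 1
lam = torch.rand(K)
group_out = sum(lam[i] * torch.dot(b_w[i].float(), x_sign.float()) for i in range(K))
print(lhs.item(), rhs.item(), group_out.item())
```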

| Model | Weights | Activations | Operations | Memory saving | Computation saving |
| Full-precision DNN | F | F | +, -, x | 1 | 1 |
| [14], [15] | B | B | XNOR, bitcount | 32 | ~64 |
| [19], [44] | B | F | +, - | 32 | ~2 |
| [45], [46] | Q_K | F | +, - | 32/K | < 2 |
| [8], [17], [20], [22] | Q_K | Q_K | +, - | 32/K | < 64/K² |
| [31] | K x B | K x B | +, -, XNOR, bitcount | 32/K | < 64/K² |
| Ours | K x (B, B) | | +, -, XNOR, bitcount | 32/K | < 64/K |

TABLE I: Theoretical computational complexity and memory usage for all categories of quantization approaches. F: full-precision, B: binary, Q_K: K-bit quantization; K x (B, B) denotes K structural bases, each with binary weights and binary activations.

In contrast to value approximation, we propose a structural ensemble approach to mimic the full-precision network. Moreover, tensor-based methods are tightly tied to local value approximation and hardly generalize to other tasks. In addition, our structure decomposition strategy achieves much better performance than tensor-level approximation, as shown in the ablation study (Sec. VII). More discussion is provided in Sec. VIII.

V. EXPERIMENT

We define several methods for comparison as follows:
- Layerwise: implements the layer-wise feature approximation strategy described in Sec. III-B1.
- Group-Net v1: implements the group-wise feature approximation strategy, where each base consists of one block. It corresponds to Eq. (5) and is illustrated in Fig. 3 (b).
- Group-Net v2: identical to Group-Net v1 except that each group base has two blocks. It is illustrated in Fig. 3 (c) and corresponds to Eq. (6).
- Group-Net v3: an extreme case where each base is a whole network, which can be treated as an ensemble of binary networks. This case is shown in Fig. 3 (e).
- Ours: implements the full model with learned soft connections, as described in Eq. (7).
- Ours**: following Bi-Real Net [43], we additionally apply a shortcut bypassing every convolution to improve convergence. This strategy can only be applied to the basic blocks in ResNet-18 and ResNet-34.

Due to limited space, the ablation study is provided in the appendix (Sec. VII).

A. Implementation details

As in [8], [15], [20], [21], we quantize the weights and activations of all convolutional layers, except that the first and last layers keep full precision. In all ImageNet experiments, training images are resized to 256x256, and a 224x224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted. We do not use any further data augmentation. We use simple single-crop testing for standard evaluation. No bias term is used. We first pretrain the full-precision model as initialization and then fine-tune the binary counterpart (for pretraining Ours, we use ReLU(·) as the nonlinearity; for pretraining Ours**, we use HardTanh(·)). We use Adam [47] for optimization. For training all binary networks, the mini-batch size is set to 256.
Several representative networks are tested: ResNet-18 [2], ResNet-34 and ResNet- 50. As discussed in Sec. VIII, binary approaches and fixedpoint approaches differ a lot in computational complexity as well as storage consumption. So we compare the proposed approach with binary neural networks in Table II and fixedpoint approaches in Table III, respectively. 1) Comparison with binary neural networks: Since we employ binary weights and binary activations, we directly compare to the previous state-of-the-art binary approaches, including BNN [14], XNOR-Net [15],-Real Net [43] and ABC-Net [31]. In all cases, our model uses binary weights and activations. We report the results in Table II and summarize the following points. 1): The most comparable baseline for Group-Net is ABC-Net. As discussed in Sec. VIII, we save considerable computational complexity while still achieving better performance compared to ABC-Net. In comparison to directly binarizing networks, Group-Net achieves much better performance but needs K times more storage and complexity. However, the K homogeneous bases can be easily parallelized on the real chip. 2): By comparing Ours** (5 bases) and Ours (8 bases), we can observe comparable performance. It justifies adding more shortcuts can facilitate BNNs training. 3): For Bottleneck structure in ResNet-50, we observe larger quantization error than the counterparts using basic blocks with 3 3 convolutions in ResNet-18 and ResNet-34. We assume that this is mainly attributable to the 1 1 convolutions in Bottleneck. The reason is 1 1 filters are limited to two states only (either 1 or -1) and they have very limited learning power. What s more, the bottleneck structure reduces the number of filters significantly, which means the gradient paths are greatly reduced. In other

Even though the bottleneck structure can benefit full-precision training, it really needs to be redesigned for BNNs: to increase the number of gradient paths, the 1x1 convolutions should be removed.

2) Comparison with fixed-point approaches: Since we use K binary group bases, we compare our approach with at least K-bit fixed-point approaches. In Table III, we compare with the state-of-the-art fixed-point approaches DoReFa-Net [20], SYQ [50] and LQ-Nets [22]. As described in Sec. IV, K binarizations are superior to K-bit quantization with respect to resource consumption. DoReFa-Net and LQ-Nets use 2-bit weights and 2-bit activations; SYQ employs binary weights and 8-bit activations. All comparison results are cited directly from the corresponding papers. Even though we consume much fewer resources, we achieve comparable or even improved performance relative to the compared approaches. LQ-Nets is the current best-performing fixed-point approach and its activations have a long-tail distribution. As discussed in Sec. IV, we require less memory bandwidth while still achieving comparable accuracy; the remaining performance gap is mainly due to the limited representability of binary activations.

3) Hardware implementation on ImageNet: We implement ResNet-34 using Ours** and test the inference time on an Intel Xeon E5 v3 CPU with 8 cores. We use the open-source BMXNet [51] and OpenMP for acceleration. The speed-up of the 64-bit OpenMP xnor kernel is greatly influenced by the input channel size, the number of filters, the filter size and the batch size. Moreover, BN + sign(·) is accelerated by XNOR operations following the implementation in FINN [52]. We cannot accelerate floating-point element-wise operations, including tensor additions and nonlinearities. We finally achieve a 7.5x speed-up compared to the original ResNet-34 model, where inter-core communication occupies a considerable portion of the CPU time. We also achieve an actual 5.8x memory saving. We expect better acceleration on parallelization-friendly FPGA platforms, but we currently do not have sufficient resources for that.

C. Evaluation on PASCAL VOC

We evaluate the proposed methods on the PASCAL VOC 2012 semantic segmentation benchmark [53], which contains 20 foreground object classes and one background class. The original dataset contains 1,464 (train), 1,449 (val) and 1,456 (test) images. The dataset is augmented with the extra annotations from [54], resulting in 10,582 training images. Performance is measured in terms of averaged pixel intersection-over-union (mIoU) over the 21 classes. Our experiments are based on the original FCN [5] with ResNet-18, -34 and -50 backbones, in which we use atrous convolution in the last two blocks with output stride = 8. We first pretrain the full-precision classification network on ImageNet and fine-tune it on PASCAL VOC. During fine-tuning, we use Adam with initial learning rate 1e-4, decay 1e-5 and batch size 16. We set the number of bases K = 5 in the experiments. We train for 40 epochs in total and decay the learning rate by a factor of 10 at epochs 20 and 30. We do not add any auxiliary loss or ASPP. We empirically observe that the full-precision FCN with dilation rates (4, 8) in the last two blocks achieves the best performance. We report the main results in Table IV. From the results, we observe that when all bases use the same dilation rate, there is a large performance gap with respect to the full-precision counterpart.
This performance drop is consistent with the classification results on ImageNet in Table II, and shows that the quality of the extracted features has a great impact on segmentation performance. Moreover, by utilizing the task-specific BPAC, we observe a significant performance increase with no theoretical computational complexity added, which strongly justifies the scalability and flexibility of the proposed Group-Net. We expect BPAC can be further extended to other dense prediction tasks. We also observe that Ours + BPAC based on ResNet-34 even outperforms the ResNet-50 counterpart, which shows that the widely used bottleneck structure is not well suited to BNNs.

VI. CONCLUSION

In this paper, we have begun to explore highly efficient and accurate CNN architectures with binary weights and activations. Specifically, we have proposed to directly decompose the full-precision network into multiple groups, each of which is approximated using a set of binary bases that can be optimized in an end-to-end manner. We also propose to learn the decomposition automatically. Experimental results have demonstrated the effectiveness of the proposed approach on the ImageNet classification task. Moreover, we have generalized Group-Net from image classification to semantic segmentation and achieved promising performance on PASCAL VOC. We have implemented the homogeneous multi-branch structure on CPU and achieved promising acceleration at test-time inference.

APPENDIX

VII. ABLATION STUDY

The core idea of Group-Net is decomposing a floating-point network into binary structures, and we evaluate it comprehensively in this section on the ImageNet dataset with ResNet-18 and ResNet-50.

1) Group space exploration: We are interested in exploring the influence of different decomposition strategies. We present the results in Table V. We observe that learning the soft connections between blocks results in the best performance on ResNet-18, while methods based on hard connections perform relatively worse. From the results, we conclude that designing a compact binary structure is essential for highly accurate classification. Moreover, we expect to further boost performance by integrating with the NAS approaches discussed in Sec. VIII.

| Model | Metric | Full | BNN | XNOR | Bi-Real Net | ABC-Net (25 bases) | Ours (5 bases) | Ours** (5 bases) | Ours (8 bases) |
| ResNet-18 | Top-1 | - | 42.2% | 51.2% | 56.4% | 65.0% | 64.8% | 67.0% | 67.5% |
| ResNet-18 | Top-5 | - | 67.1% | 73.2% | 79.5% | 85.9% | 85.7% | 87.5% | 88.0% |
| ResNet-34 | Top-1 | - | - | - | - | 68.4% | 68.5% | 70.5% | 71.8% |
| ResNet-34 | Top-5 | - | - | - | - | 88.2% | 88.0% | 89.3% | 90.4% |
| ResNet-50 | Top-1 | - | - | - | - | - | 69.5% | - | - |
| ResNet-50 | Top-5 | - | - | - | - | - | 89.2% | - | - |

TABLE II: Comparison with the state-of-the-art binary models using ResNet-18, ResNet-34 and ResNet-50 on ImageNet. All compared results are cited directly from the original papers.

| Model | W | A | Top-1 | Top-5 |
| ResNet-18 Full-precision | 32 | 32 | - | 89.4% |
| ResNet-18 Ours (4 bases) | 1 | 1 | - | 85.6% |
| LQ-Net [22] | 2 | 2 | - | 85.9% |
| DoReFa-Net [20] | 2 | 2 | - | 84.4% |
| SYQ [50] | 1 | 8 | - | 84.6% |

TABLE III: Comparison with the state-of-the-art fixed-point models with ResNet-18 on ImageNet.

| Backbone / decoder | Model | mIoU |
| ResNet-18, FCN-32s | Full-precision | 64.9 |
| | Ours | 60.5 |
| | Ours + BPAC | 63.8 |
| | Ours** + BPAC | 65.1 |
| ResNet-18, FCN-16s | Full-precision | 67.3 |
| | Ours | 62.7 |
| | Ours + BPAC | 66.5 |
| | Ours** + BPAC | 67.4 |
| ResNet-34, FCN-32s | Full-precision | 72.7 |
| | Ours | 68.2 |
| | Ours + BPAC | 71.2 |
| | Ours** + BPAC | 72.8 |
| ResNet-50, FCN-32s | Full-precision | 73.1 |
| | Ours | 67.2 |
| | Ours + BPAC | 70.4 |

TABLE IV: Performance (mIoU) on the PASCAL VOC 2012 validation set.

2) Layer-wise vs. group-wise: We explore the difference between the layer-wise and group-wise design strategies in Table V. Comparing the results, we find that Ours outperforms Layerwise by 8.2% while saving the complexity of the floating-point tensor aggregation. Note that the Layerwise approach can be treated as a kind of tensor approximation, which has similarities with the multiple-binarization methods of [31], [33]-[35]; the differences are described in Sec. IV. This strongly shows the necessity of employing the group-wise design strategy to obtain promising results. We speculate that this significant gain is partly due to the preserved block structure in the binary bases. It also shows that, apart from designing accurate binarization functions, it is essential to design an appropriate structure for BNNs.

3) Effect of the number of bases: We further explore the influence of the number of bases on the final performance in Table VI. When the number is set to 1, it corresponds to directly binarizing the original full-precision network, and we observe an apparent accuracy drop compared to the full-precision counterpart. With more bases employed, the performance steadily increases. This can be attributed to a better fit of the intermediate feature maps, which is a trade-off between accuracy and complexity; with enough bases, the network should have the capacity to approximate the full-precision network precisely. With the multi-branch group-wise design, we can achieve high accuracy while still significantly reducing inference time and power consumption. Interestingly, each base can be implemented with few resources, and the parallel structure is quite friendly to FPGA/ASIC.

VIII. MORE DISCUSSIONS

Relation to ResNeXt [55]: The homogeneous multi-branch architecture design shares some spirit with ResNeXt and enjoys the advantage of introducing a cardinality dimension. However, our objectives are totally different. ResNeXt aims to increase the capacity while maintaining the complexity: it first divides the input channels into groups and performs efficient group convolutions, and then all the group outputs are aggregated to approximate the original feature map.
In contrast, we first divide the network into groups and directly replicate the floating-point structure for each branch, while both weights and activations are quantized. In this way, we can reconstruct the full-precision outputs by aggregating a set of low-precision transformations, reducing the complexity on energy-efficient hardware. Furthermore, our transformations are not restricted to a single block as in ResNeXt.

Group-Net has strong scalability and flexibility: The group-wise approximation approach can be efficiently integrated with Neural Architecture Search (NAS) frameworks [28]-[30], [40], [56] to explore the optimal architecture. Based on Group-Net, we can further add the number of bases, the number of filters, the connections among bases and even the bit-width of each structural base to the search space. The proposed approach can also be combined with a knowledge distillation strategy as in [8], [57]. The basic idea is to train a target network alongside a pretrained guidance network, with an additional regularizer that minimizes the difference between the student's and the teacher's intermediate feature representations for higher accuracy. In this way, we expect to further decrease the number of bases while maintaining the performance.

IX. EXTENDING GROUP-NET TO BINARY WEIGHTS AND LOW-PRECISION ACTIVATIONS

All the experiments above are based on binary weights and binary activations. To trade off accuracy against computational complexity, we can add more bases as discussed in Sec. VII-3; however, we can also increase the bit-width of the activations for better accuracy.

| Model | Bases | Top-1 | Top-5 | Top-1 gap | Top-5 gap |
| ResNet-18 Full-precision | - | - | 89.4% | - | - |
| ResNet-18 Ours | 5 | - | 85.7% | 4.9% | 3.7% |
| ResNet-18 Group-Net v1 | - | - | 84.8% | 6.7% | 4.6% |
| ResNet-18 Group-Net v2 | - | - | 84.1% | 7.5% | 5.3% |
| ResNet-18 Group-Net v3 | - | - | 81.3% | 10.8% | 8.1% |
| ResNet-18 Layerwise | - | - | 78.7% | 14.1% | 10.7% |

TABLE V: Comparison of different decomposition strategies with ResNet-18 on ImageNet. The Top-1/Top-5 gaps to the corresponding full-precision network are also reported.

| Model | Bases | Top-1 | Top-5 | Top-1 gap | Top-5 gap |
| Full-precision | - | - | 89.4% | - | - |
| Ours | 1 | - | 77.6% | 15.5% | 11.8% |
| Ours | - | - | 84.2% | 7.2% | 5.2% |
| Ours | 5 | - | 85.7% | 4.9% | 3.7% |

TABLE VI: Validation accuracy of Group-Net on ImageNet with different numbers of bases. All cases are based on ResNet-18 with binary weights and activations.

We conduct experiments on the ImageNet dataset and report the accuracy in Table VII, Table VIII and Table IX.

1) Fixed-point activation quantization: We apply simple uniform activation quantization in this paper. As the output of the ReLU function is unbounded, quantization after ReLU requires a high dynamic range, which causes large quantization errors, especially when the bit-precision is low. To alleviate this problem, similar to [20], [58], we use a clip function h(y) = clip(y, 0, β) to limit the range of the activation to [0, β], where β is fixed (not learned) during training. The truncated activation output y is then uniformly quantized to k bits (k > 1), and we still use the STE to estimate the gradient:

$$\text{Forward: } \tilde{y} = \mathrm{round}\!\Big(y \cdot \frac{2^k - 1}{\beta}\Big) \cdot \frac{\beta}{2^k - 1}, \qquad \text{Backward: } \frac{\partial \ell}{\partial y} = \frac{\partial \ell}{\partial \tilde{y}}. \qquad (9)$$

Since the weights are binary, the multiplications in the convolution are replaced by fixed-point additions. One can simply replace the uniform quantizer with a non-uniform quantizer for more accurate quantization, similar to [21], [22].

2) Implementation details: Data preprocessing follows the same pipeline as for BNNs. We also quantize the weights and activations of all convolutional layers, except that the first and last layers keep full precision. For training ResNet with non-binary activations, the learning rate starts at 0.05 and is divided by 10 whenever it saturates. We use SGD with Nesterov momentum for optimization; the mini-batch size is set to 128 and the momentum ratio is 0.9. We learn directly from scratch, since we empirically observe that fine-tuning does not bring further benefits. The convolution and element-wise operations are applied in the order: Conv → BN → ReLU → Quantize.

3) Evaluation on ImageNet: From Table VII, we observe that with binary weights and fixed-point activations we can achieve highly accurate results. For example, also referring to Table II, the Top-1 accuracy drops for ResNet-50 Ours with ternary and binary activations are 1.5% and 6.5%, respectively. Furthermore, our approach still works well on plain network structures such as AlexNet, as shown in Table VIII. We also provide a comparison with different numbers of bases in Table IX.

| Model | W | A | Top-1 | Top-5 | Top-1 gap | Top-5 gap |
| ResNet-18 Full-precision | 32 | 32 | - | 89.4% | - | - |
| ResNet-18 Ours | 1 | - | - | 89.0% | 0.1% | 0.4% |
| ResNet-18 Ours | 1 | - | - | 89.8% | -0.7% | -0.4% |
| ResNet-18 Group-Net v1 | 1 | - | - | 88.5% | 0.5% | 0.9% |
| ResNet-18 Group-Net v2 | 1 | - | - | 87.9% | 1.4% | 1.5% |
| ResNet-18 Group-Net v3 | 1 | - | - | 85.0% | 5.2% | 4.4% |
| ResNet-18 Layerwise | 1 | - | - | 82.2% | 9.6% | 7.2% |
| ResNet-50 Full-precision | 32 | 32 | - | 92.9% | - | - |
| ResNet-50 Ours | 1 | - | - | 91.5% | 1.5% | 1.4% |
| ResNet-50 Ours | 1 | - | - | 92.7% | 0.0% | 0.2% |

TABLE VII: Validation accuracy of Group-Net on ImageNet with different choices of W and A for Group-Net v1. W and A refer to the weight and activation bit-width, respectively. The effect of different group divisions is also reported (Layerwise, Group-Net v1, Group-Net v2 and Group-Net v3).

| Model | Full-precision | Layerwise | Group-Net v1 | Ours |
| Top-1 | - | 54.2% | 57.3% | 57.8% |
| Top-5 | - | 77.6% | 80.1% | 80.9% |

TABLE VIII: Performance of AlexNet on ImageNet. All cases use binary weights and 2-bit activations.

| Model | Bases | bitW | bitA | Top-1 | Top-5 | Top-1 gap | Top-5 gap |
| Full-precision | - | 32 | 32 | - | 89.4% | - | - |
| Ours | - | 1 | 4 | - | 81.8% | 9.6% | 7.6% |
| Ours | - | 1 | 4 | - | 88.7% | 1.2% | 0.7% |
| Ours | - | 1 | 4 | - | 89.5% | -0.4% | -0.1% |

TABLE IX: Validation accuracy of Group-Net on ImageNet with different numbers of bases. All cases are based on ResNet-18 with binary weights and 4-bit activations.

REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[4] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91-99.
W and A refer to the weight and activation bitwidth, respectively. Effect of different group divisions is also reported as Layerwise, Group-Net v1, Group-Net v2 and Group-Net v3. Model Full-precision Layerwise Group-Net v1 Ours Top % 54.2% 57.3% 57.8 % Top % 77.6% 80.1% 80.9% TABLE VIII: Performance of AlexNet on ImageNet. All cases use binary weights and 2-bit activations. Model Bases bitw bita Top-1 Top-5 Top-1 gap Top-5 gap Full-precision % 89.4% - - Ours % 81.8% 9.6% 7.6% Ours % 88.7% 1.2% 0.7% Ours % 89.5% -0.4% -0.1% TABLE IX: Validation accuracy of Group-Net on ImageNet with number of bases. All cases are based on the ResNet-18 network with binary weights and 4-bit activations. REFERENCES [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proc. Adv. Neural Inf. Process. Syst., pages , [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages , , 5, 7 [3] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages , [4] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proc. Adv. Neural Inf. Process. Syst., pages 91 99,


More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error

Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error Chunhui Jiang, Guiying Li, Chao Qian, Ke Tang Anhui Province Key Lab of Big Data Analysis and Application, University

More information

Deep Learning with Low Precision by Half-wave Gaussian Quantization

Deep Learning with Low Precision by Half-wave Gaussian Quantization Deep Learning with Low Precision by Half-wave Gaussian Quantization Zhaowei Cai UC San Diego zwcai@ucsd.edu Xiaodong He Microsoft Research Redmond xiaohe@microsoft.com Jian Sun Megvii Inc. sunjian@megvii.com

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

Accelerating Convolutional Neural Networks by Group-wise 2D-filter Pruning

Accelerating Convolutional Neural Networks by Group-wise 2D-filter Pruning Accelerating Convolutional Neural Networks by Group-wise D-filter Pruning Niange Yu Department of Computer Sicence and Technology Tsinghua University Beijing 0008, China yng@mails.tsinghua.edu.cn Shi Qiu

More information

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE-

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- Workshop track - ICLR COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- CURRENT NEURAL NETWORKS Daniel Fojo, Víctor Campos, Xavier Giró-i-Nieto Universitat Politècnica de Catalunya, Barcelona Supercomputing

More information

Deep Learning (CNNs)

Deep Learning (CNNs) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Deep Learning (CNNs) Deep Learning Readings: Murphy 28 Bishop - - HTF - - Mitchell

More information

A practical theory for designing very deep convolutional neural networks. Xudong Cao.

A practical theory for designing very deep convolutional neural networks. Xudong Cao. A practical theory for designing very deep convolutional neural networks Xudong Cao notcxd@gmail.com Abstract Going deep is essential for deep learning. However it is not easy, there are many ways of going

More information

Multiscale methods for neural image processing. Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein Hao Li, Soham De, Zheng Xu, Hanan Samet

Multiscale methods for neural image processing. Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein Hao Li, Soham De, Zheng Xu, Hanan Samet Multiscale methods for neural image processing Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein Hao Li, Soham De, Zheng Xu, Hanan Samet A TALK IN TWO ACTS Part I: Stacked U-nets The globalization

More information

Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net

Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net Supplementary Material Xingang Pan 1, Ping Luo 1, Jianping Shi 2, and Xiaoou Tang 1 1 CUHK-SenseTime Joint Lab, The Chinese University

More information

arxiv: v1 [cs.ne] 20 Apr 2018

arxiv: v1 [cs.ne] 20 Apr 2018 arxiv:1804.07802v1 [cs.ne] 20 Apr 2018 Value-aware Quantization for Training and Inference of Neural Networks Eunhyeok Park 1, Sungjoo Yoo 1, Peter Vajda 2 1 Department of Computer Science and Engineering

More information

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Department of Electrical and Computer Engineering Seoul National University Seoul, 08826 Korea Email: sajid@dsp.snu.ac.kr, khwang@dsp.snu.ac.kr, wysung@snu.ac.kr

More information

Multiple Wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis

Multiple Wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis Multiple Wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis Minghang Zhao, Myeongsu Kang, Baoping Tang, Michael Pecht 1 Backgrounds Accurate fault diagnosis is important to ensure

More information

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions

More information

Value-aware Quantization for Training and Inference of Neural Networks

Value-aware Quantization for Training and Inference of Neural Networks Value-aware Quantization for Training and Inference of Neural Networks Eunhyeok Park 1, Sungjoo Yoo 1, and Peter Vajda 2 1 Department of Computer Science and Engineering Seoul National University {eunhyeok.park,sungjoo.yoo}@gmail.com

More information

CS230: Lecture 5 Advanced topics in Object Detection

CS230: Lecture 5 Advanced topics in Object Detection CS230: Lecture 5 Advanced topics in Object Detection Stanford University Today s outline We will learn: - the Intuition behind object detection methods - a series of computer vision papers I. Object detection

More information

Bayesian Networks (Part I)

Bayesian Networks (Part I) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks (Part I) Graphical Model Readings: Murphy 10 10.2.1 Bishop 8.1,

More information

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, et al. Google arxiv:1609.08144v2 Reviewed by : Bill

More information

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Machine Learning for Computer Vision 8. Neural Networks and Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group INTRODUCTION Nonlinear Coordinate Transformation http://cs.stanford.edu/people/karpathy/convnetjs/

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35 Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How

More information

Normalization Techniques in Training of Deep Neural Networks

Normalization Techniques in Training of Deep Neural Networks Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,

More information

Convolutional neural networks

Convolutional neural networks 11-1: Convolutional neural networks Prof. J.C. Kao, UCLA Convolutional neural networks Motivation Biological inspiration Convolution operation Convolutional layer Padding and stride CNN architecture 11-2:

More information

Jakub Hajic Artificial Intelligence Seminar I

Jakub Hajic Artificial Intelligence Seminar I Jakub Hajic Artificial Intelligence Seminar I. 11. 11. 2014 Outline Key concepts Deep Belief Networks Convolutional Neural Networks A couple of questions Convolution Perceptron Feedforward Neural Network

More information

SHAKE-SHAKE REGULARIZATION OF 3-BRANCH

SHAKE-SHAKE REGULARIZATION OF 3-BRANCH SHAKE-SHAKE REGULARIZATION OF 3-BRANCH RESIDUAL NETWORKS Xavier Gastaldi xgastaldi.mba2011@london.edu ABSTRACT The method introduced in this paper aims at helping computer vision practitioners faced with

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Vanishing and Exploding Gradients. ReLUs. Xavier Initialization

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Vanishing and Exploding Gradients. ReLUs. Xavier Initialization TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 Vanishing and Exploding Gradients ReLUs Xavier Initialization Batch Normalization Highway Architectures: Resnets, LSTMs and GRUs Causes

More information

SGD and Deep Learning

SGD and Deep Learning SGD and Deep Learning Subgradients Lets make the gradient cheating more formal. Recall that the gradient is the slope of the tangent. f(w 1 )+rf(w 1 ) (w w 1 ) Non differentiable case? w 1 Subgradients

More information

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks. Microsoft Research

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks. Microsoft Research LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua Microsoft Research zdqzeros@gmail.com jiaoyan@microsoft.com

More information

Binary Deep Learning. Presented by Roey Nagar and Kostya Berestizshevsky

Binary Deep Learning. Presented by Roey Nagar and Kostya Berestizshevsky Binary Deep Learning Presented by Roey Nagar and Kostya Berestizshevsky Deep Learning Seminar, School of Electrical Engineering, Tel Aviv University January 22 nd 2017 Lecture Outline Motivation and existing

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Weighted-Entropy-based Quantization for Deep Neural Networks

Weighted-Entropy-based Quantization for Deep Neural Networks Weighted-Entropy-based Quantization for Deep Neural Networks Eunhyeok Park, Junwhan Ahn, and Sungjoo Yoo canusglow@gmail.com, junwhan@snu.ac.kr, sungjoo.yoo@gmail.com Seoul National University Computing

More information

Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)

Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) 2 How many bits do you need to represent a single number in machine

More information

Channel Pruning and Other Methods for Compressing CNN

Channel Pruning and Other Methods for Compressing CNN Channel Pruning and Other Methods for Compressing CNN Yihui He Xi an Jiaotong Univ. October 12, 2017 Yihui He (Xi an Jiaotong Univ.) Channel Pruning and Other Methods for Compressing CNN October 12, 2017

More information

PRUNING CONVOLUTIONAL NEURAL NETWORKS. Pavlo Molchanov Stephen Tyree Tero Karras Timo Aila Jan Kautz

PRUNING CONVOLUTIONAL NEURAL NETWORKS. Pavlo Molchanov Stephen Tyree Tero Karras Timo Aila Jan Kautz PRUNING CONVOLUTIONAL NEURAL NETWORKS Pavlo Molchanov Stephen Tyree Tero Karras Timo Aila Jan Kautz 2017 WHY WE CAN PRUNE CNNS? 2 WHY WE CAN PRUNE CNNS? Optimization failures : Some neurons are "dead":

More information

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions 2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 18) Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions Authors: S. Scardapane, S. Van Vaerenbergh,

More information

Handwritten Indic Character Recognition using Capsule Networks

Handwritten Indic Character Recognition using Capsule Networks Handwritten Indic Character Recognition using Capsule Networks Bodhisatwa Mandal,Suvam Dubey, Swarnendu Ghosh, RiteshSarkhel, Nibaran Das Dept. of CSE, Jadavpur University, Kolkata, 700032, WB, India.

More information

Layer 1. Layer L. difference between the two precisions (B A and B W ) is chosen to balance the sum in (1), as follows: B A B W = round log 2 (2)

Layer 1. Layer L. difference between the two precisions (B A and B W ) is chosen to balance the sum in (1), as follows: B A B W = round log 2 (2) AN ANALYTICAL METHOD TO DETERMINE MINIMUM PER-LAYER PRECISION OF DEEP NEURAL NETWORKS Charbel Sakr Naresh Shanbhag Dept. of Electrical Computer Engineering, University of Illinois at Urbana Champaign ABSTRACT

More information

Introduction to (Convolutional) Neural Networks

Introduction to (Convolutional) Neural Networks Introduction to (Convolutional) Neural Networks Philipp Grohs Summer School DL and Vis, Sept 2018 Syllabus 1 Motivation and Definition 2 Universal Approximation 3 Backpropagation 4 Stochastic Gradient

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Lecture 9 Numerical optimization and deep learning Niklas Wahlström Division of Systems and Control Department of Information Technology Uppsala University niklas.wahlstrom@it.uu.se

More information

A Novel Activity Detection Method

A Novel Activity Detection Method A Novel Activity Detection Method Gismy George P.G. Student, Department of ECE, Ilahia College of,muvattupuzha, Kerala, India ABSTRACT: This paper presents an approach for activity state recognition of

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

arxiv: v1 [cs.cv] 3 Feb 2017

arxiv: v1 [cs.cv] 3 Feb 2017 Deep Learning with Low Precision by Half-wave Gaussian Quantization Zhaowei Cai UC San Diego zwcai@ucsd.edu Xiaodong He Microsoft Research Redmond xiaohe@microsoft.com Jian Sun Megvii Inc. sunjian@megvii.com

More information

arxiv: v2 [cs.cv] 12 Apr 2016

arxiv: v2 [cs.cv] 12 Apr 2016 arxiv:1603.05027v2 [cs.cv] 12 Apr 2016 Identity Mappings in Deep Residual Networks Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun Microsoft Research Abstract Deep residual networks [1] have emerged

More information

Quantisation. Efficient implementation of convolutional neural networks. Philip Leong. Computer Engineering Lab The University of Sydney

Quantisation. Efficient implementation of convolutional neural networks. Philip Leong. Computer Engineering Lab The University of Sydney 1/51 Quantisation Efficient implementation of convolutional neural networks Philip Leong Computer Engineering Lab The University of Sydney July 2018 / PAPAA Workshop Australia 2/51 3/51 Outline 1 Introduction

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

An overview of deep learning methods for genomics

An overview of deep learning methods for genomics An overview of deep learning methods for genomics Matthew Ploenzke STAT115/215/BIO/BIST282 Harvard University April 19, 218 1 Snapshot 1. Brief introduction to convolutional neural networks What is deep

More information

arxiv: v4 [cs.cv] 2 Aug 2016

arxiv: v4 [cs.cv] 2 Aug 2016 arxiv:1603.05279v4 [cs.cv] 2 Aug 2016 XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi Allen Institute for AI,

More information

arxiv: v2 [cs.cv] 20 Aug 2018

arxiv: v2 [cs.cv] 20 Aug 2018 Joint Training of Low-Precision Neural Network with Quantization Interval Parameters arxiv:1808.05779v2 [cs.cv] 20 Aug 2018 Sangil Jung sang-il.jung Youngjun Kwak yjk.kwak Changyong Son cyson Jae-Joon

More information

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 Machine Learning for Signal Processing Neural Networks Continue Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 1 So what are neural networks?? Voice signal N.Net Transcription Image N.Net Text

More information

Introduction to Machine Learning (67577)

Introduction to Machine Learning (67577) Introduction to Machine Learning (67577) Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Deep Learning Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks

More information

Introduction to Neural Networks

Introduction to Neural Networks CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character

More information

Deep Learning Lab Course 2017 (Deep Learning Practical)

Deep Learning Lab Course 2017 (Deep Learning Practical) Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Deep Learning Year in Review 2016: Computer Vision Perspective

Deep Learning Year in Review 2016: Computer Vision Perspective Deep Learning Year in Review 2016: Computer Vision Perspective Alex Kalinin, PhD Candidate Bioinformatics @ UMich alxndrkalinin@gmail.com @alxndrkalinin Architectures Summary of CNN architecture development

More information

CSC321 Lecture 16: ResNets and Attention

CSC321 Lecture 16: ResNets and Attention CSC321 Lecture 16: ResNets and Attention Roger Grosse Roger Grosse CSC321 Lecture 16: ResNets and Attention 1 / 24 Overview Two topics for today: Topic 1: Deep Residual Networks (ResNets) This is the state-of-the

More information

IN recent years, convolutional neural networks (CNNs) have

IN recent years, convolutional neural networks (CNNs) have SUBMISSION TO IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Holistic CNN Compression via Low-rank Decomposition with Knowledge Transfer Shaohui Lin, Rongrong Ji, Senior Member, IEEE, Chao

More information

Neural networks COMS 4771

Neural networks COMS 4771 Neural networks COMS 4771 1. Logistic regression Logistic regression Suppose X = R d and Y = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a

More information

Correlation Autoencoder Hashing for Supervised Cross-Modal Search

Correlation Autoencoder Hashing for Supervised Cross-Modal Search Correlation Autoencoder Hashing for Supervised Cross-Modal Search Yue Cao, Mingsheng Long, Jianmin Wang, and Han Zhu School of Software Tsinghua University The Annual ACM International Conference on Multimedia

More information

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued) Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound Lecture 3: Introduction to Deep Learning (continued) Course Logistics - Update on course registrations - 6 seats left now -

More information

arxiv: v1 [cs.cv] 16 Mar 2016

arxiv: v1 [cs.cv] 16 Mar 2016 arxiv:1603.05279v1 [cs.cv] 16 Mar 2016 XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi Allen Institute for AI,

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Binary Convolutional Neural Network on RRAM

Binary Convolutional Neural Network on RRAM Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training

More information

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Hiroaki Hayashi 1,* Jayanth Koushik 1,* Graham Neubig 1 arxiv:1611.01505v3 [cs.lg] 11 Jun 2018 Abstract Adaptive

More information

Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas

Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas Center for Machine Perception Department of Cybernetics Faculty of Electrical Engineering

More information

FreezeOut: Accelerate Training by Progressively Freezing Layers

FreezeOut: Accelerate Training by Progressively Freezing Layers FreezeOut: Accelerate Training by Progressively Freezing Layers Andrew Brock, Theodore Lim, & J.M. Ritchie School of Engineering and Physical Sciences Heriot-Watt University Edinburgh, UK {ajb5, t.lim,

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information