Rethinking Binary Neural Network for Accurate Image Classification and Semantic Segmentation


Bohan Zhuang¹, Chunhua Shen¹, Mingkui Tan², Lingqiao Liu¹, and Ian Reid¹
¹The University of Adelaide, Australia; and Australian Centre for Robotic Vision
²South China University of Technology, China

arXiv v1 [cs.cv] 22 Nov 2018

Abstract—In this paper, we propose to train a network with both binary weights and binary activations, designed specifically for mobile devices with limited computation capacity and power consumption. Previous works on quantizing CNNs uncritically assume the same architecture as full-precision networks, which we term value approximation: their objective is to preserve the floating-point information using a set of discrete values. However, we take a novel view: for best performance, it is very likely that a different architecture is better suited to binary weights and binary activations. Thus we directly design such a highly accurate binary network structure, which we term structure approximation. In particular, we propose a network decomposition strategy in which we divide the network into groups and aggregate a set of homogeneous binary branches to implicitly reconstruct the full-precision intermediate feature maps. In addition, we also learn the connections between groups. We further provide a comprehensive comparison among all quantization categories. Experiments on ImageNet classification demonstrate the superior performance of the proposed model, named Group-Net, over various popular architectures. In particular, we outperform the previous best binary neural network in terms of accuracy while saving substantial computational complexity. Furthermore, the proposed Group-Net can effectively exploit task-specific properties for strong generalization: we propose to extend Group-Net for lossless semantic segmentation. To our knowledge, this is the first work in the literature to address dense pixel prediction with BNNs. We argue that considering both value and structure approximation should be the future development direction of BNNs.

CONTENTS
I    Introduction
II   Related Work
III  Method
     III-A  Binarization function
     III-B  Network Binary Decomposition
            III-B1  Layer-wise feature reconstruction
            III-B2  Group-wise feature approximation
            III-B3  Learning to decompose
     III-C  From image classification to semantic segmentation
IV   Discussions
V    Experiment
     V-A  Implementation details
     V-B  Evaluation on ImageNet
            V-B1  Comparison with binary neural networks
            V-B2  Comparison with fixed-point approaches
            V-B3  Hardware implementation on ImageNet
     V-C  Evaluation on PASCAL VOC
VI   Conclusion
VII  Ablation study
     VII-1  Group space exploration
     VII-2  Layer-wise vs. group-wise
     VII-3  Effect of the number of bases
VIII More discussions
IX   Extending Group-Net to binary weights and low-precision activations
     IX-1  Fixed-point activation quantization
     IX-2  Implementation details
     IX-3  Evaluation on ImageNet
References

I. INTRODUCTION

Designing deeper and wider convolutional neural networks has led to significant breakthroughs in many machine learning tasks, such as image classification [1], [2], object detection [3], [4] and object segmentation [5]. However, accurate deep models require billions of FLOPs, which makes it infeasible to run real-time applications on resource-constrained mobile platforms. To solve this problem, many existing works focus on network pruning [6], [7], low-bit quantization [8], [9] and efficient architecture design [10], [11]. Quantization approaches represent weights and activations with low-bitwidth fixed-point integers, so that a dot product can be computed by several bitwise operations. The xnor-popcount operation requires only a single logic gate instead of the hundreds of units needed for floating-point multiplication [12], [13]. BNNs [14], [15] can be treated as an extreme quantization approach where both weights and activations are restricted to a single bit, either +1 or -1. In this paper, we aim to design a highly accurate binary neural network from both the quantization and the efficient-architecture-design perspectives.

Existing quantization methods are mainly divided into two categories. Methods in the first category design more effective optimization algorithms to find better local minima for quantized weights, either by introducing knowledge distillation [8], [16], [17] or by using loss-aware objectives [18], [19]. Approaches in the second category focus on improving the quantization function [20]-[22]; it is essential to learn suitable mappings between discrete values and their floating-point counterparts to maintain performance. However, designing the quantization function is highly non-trivial, especially for BNNs, because quantization functions are non-differentiable and gradients can only be roughly approximated. Both of these categories belong to value approximation: their common objective is to discretize continuous weights and/or activations while preserving most of the representability of the original network. However, value approximation has a natural disadvantage, since it is merely a local approximation: it can only approximate a full-precision model without adapting to a specific task. Given a model pretrained on a given task, the quantization error will inevitably affect the final performance.

In contrast, we explore a third category, which we call structure approximation. The objective is to redesign a binary architecture that can match the capability of the floating-point model. In particular, we propose to partition the full-precision model into groups and use a set of parallel binary bases to approximate its floating-point structural counterpart. Compared to value approximation, this is a higher-level category in which structural information is preserved. Moreover, we can design flexible binary structures according to a given task and utilize the task-specific structure to compensate for the quantization loss and facilitate training. For example, when transferring Group-Net from image classification to semantic segmentation, we are motivated by the structure of Atrous Spatial Pyramid Pooling (ASPP) [23]. In DeepLab v3 [24] and v3+ [25], ASPP is applied only on top of the extracted features, while each block in the backbone network can employ only one atrous rate.
In contrast, we propose to directly apply different atrous rates to the parallel binary bases in the backbone network, which is equivalent to absorbing ASPP into the feature extraction stage. In this way, we do not theoretically increase the computational complexity of binary convolution while significantly boosting the performance on semantic segmentation. Notably, previous quantization approaches cannot generalize their value approximation strategies to semantic segmentation or other computer vision tasks in this way; this is the most significant advantage of Group-Net. In fact, value and structure approximation are complementary rather than contradictory: both should be carefully designed for highly accurate BNNs.

We are also motivated by energy-efficient architecture design approaches [11], [26], [27], whose objective is to replace the traditional expensive convolution with computationally efficient convolutional operations (e.g., depthwise separable convolution, 1x1 convolution). In comparison, we propose to design binary network architectures for dedicated hardware from the quantization view. Most previous quantization works focus on directly quantizing the full-precision architecture; here we begin to explore alternative architectures that we show are better suited to binary weights and activations. Specifically, apart from decomposing each group into several binary bases, we also propose to learn the connections between groups by introducing a fusion gate. Moreover, Group-Net has strong scalability, so that further improvements such as Neural Architecture Search [28]-[30] can be incorporated.

Our contributions are summarized as follows:
- We propose to design accurate BNN structures from the structure approximation perspective. Specifically, we divide the network into groups and approximate each group using a set of binary bases. We also propose to automatically learn the decomposition by introducing soft connections.
- The proposed Group-Net has strong scalability and can be easily transferred to other tasks. In this paper, we propose Binary Parallel Atrous Convolution (BPAC), which embeds rich multi-scale context into BNNs for accurate semantic segmentation. Group-Net with BPAC significantly improves the performance while maintaining the complexity, compared to employing Group-Net alone.
- We evaluate our models on the ImageNet and PASCAL VOC datasets based on ResNet. Extensive experiments show that the proposed Group-Net achieves a state-of-the-art trade-off between accuracy and computational complexity. From the results, we also claim that increasing the number of gradient paths (i.e., more filters or more shortcuts) is essential for training BNNs.

II. RELATED WORK

Network quantization: The recent increasing demand for implementing fixed-point deep neural networks on embedded devices motivates the study of network quantization. Fixed-point approaches that use low-bitwidth discrete values to approximate real ones have been extensively explored in the literature [8], [14], [15], [20], [21], [31]. BNNs [14], [15] propose to constrain both weights and activations to binary values (i.e., +1 and -1), so that the multiply-accumulations can be replaced purely by xnor(·) and popcount(·) operations. To trade off accuracy against complexity, [32]-[35] propose to recursively binarize the residual from lower bits while still retaining the advantage of binary operations. However, multiple binarizations form a sequential process that cannot be parallelized. In [31], a linear combination of binary weight/activation bases is used to approximate the floating-point counterparts during forward propagation. Different from all these local tensor approximation approaches, we propose to directly design BNNs from the structure approximation perspective.

Efficient architecture design: There has been rising interest in designing efficient architectures in the recent literature. Efficient model designs like GoogLeNet [36] and SqueezeNet [26] propose to replace 3x3 convolutional kernels with 1x1 kernels to reduce the complexity while increasing the depth and accuracy. Additionally, separable convolutions have proved effective in the Inception approaches [37], [38]. This idea is further generalized as depthwise separable convolutions by Xception [10], MobileNet [11], [39] and ShuffleNet [27] to generate energy-efficient network structures. To avoid hand-crafted heuristics, neural architecture search [28]-[30], [40], [41] has been explored for automatic model design.

III. METHOD

The objective of this paper is to binarize both the weights and the activations of a network. To aid the description, we adopt the following terminology: a layer is a standard single parameterized layer in a network, such as a dense or convolutional layer; a block is a collection of layers whose final output is connected to the input of the next block (e.g., a residual block); a group is a collection of blocks. We first elaborate on the binarization functions in Sec. III-A. Then, in Sec. III-B, we explain our binary architecture design strategy. Finally, in Sec. III-C, we describe how to utilize task-specific attributes to generalize our approach to semantic segmentation.

A. Binarization function

For a convolutional layer, we define the input $x \in \mathbb{R}^{c_{in} \times w_{in} \times h_{in}}$, the weight filter $w \in \mathbb{R}^{c \times w \times h}$ and the output $y \in \mathbb{R}^{c_{out} \times w_{out} \times h_{out}}$, respectively.

Binarization of weights: Following [15], we estimate the floating-point weight w with a binary weight filter b^w and a scaling factor α ∈ R+ such that w ≈ α b^w, where b^w is the sign of w and α is the mean of the absolute values of w. In general, sign(·) is non-differentiable, so we adopt the straight-through estimator (STE) [42] to approximate the gradient. Formally, the forward and backward processes are:

$$\text{Forward: } b^w = \alpha\,\mathrm{sign}(w), \qquad \text{Backward: } \frac{\partial \ell}{\partial w} = \frac{\partial \ell}{\partial b^w}\,\frac{\partial b^w}{\partial w} \approx \frac{\partial \ell}{\partial b^w}, \qquad (1)$$

where ℓ is the loss. In practice, we find such a binarization scheme to be quite stable.

Binarization of activations: For activation binarization, we utilize a piecewise polynomial function to approximate the sign function, as in [43].
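To make Eq. (1) concrete, below is a minimal PyTorch sketch of the weight binarization with the straight-through estimator. The class name is illustrative and the snippet is an assumption about one possible implementation, not the authors' released code.

```python
import torch

class BinarizeWeightSTE(torch.autograd.Function):
    """Eq. (1): forward b^w = alpha * sign(w); backward passes the gradient straight through."""

    @staticmethod
    def forward(ctx, w):
        alpha = w.abs().mean()  # scaling factor: mean of the absolute weight values
        return alpha * torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: dl/dw ~= dl/db^w
        return grad_out

if __name__ == "__main__":
    w = torch.randn(16, 3, 3, 3, requires_grad=True)  # a bank of 3x3 filters
    b_w = BinarizeWeightSTE.apply(w)
    b_w.sum().backward()
    print(b_w.unique().numel(), w.grad.shape)  # two unique values (+alpha, -alpha)
```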
The forward and backward passes can be written as:

$$\text{Forward: } b^a = \mathrm{sign}(x), \qquad \text{Backward: } \frac{\partial \ell}{\partial x} = \frac{\partial \ell}{\partial b^a}\,\frac{\partial b^a}{\partial x}, \quad \text{where } \frac{\partial b^a}{\partial x} = \begin{cases} 2 + 2x, & -1 \le x < 0 \\ 2 - 2x, & 0 \le x < 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$

Remark: Most previous works focus on designing accurate binarization functions for weights and activations (e.g., multiple binarizations [31], [33]-[35]). This belongs to tensor (value) approximation. In contrast, we propose a new perspective for training BNNs in Sec. III-B.

B. Network Binary Decomposition

Suppose a floating-point residual network Φ has N blocks. Our objective is to decompose Φ into binary structures while preserving its representability, as shown in Figure 1. We assume Φ is decomposed into P binary fragments [F_1, ..., F_P], where each F_i(·) can be any binary structure and the F_i(·) may differ from one another. We explore two different architectural choices for F(·), which we call layer-wise and group-wise, in Sec. III-B1 and Sec. III-B2, respectively. We then explain the strategy for automatic decomposition in Sec. III-B3.

1) Layer-wise feature reconstruction: The simplest structure for F(·) is a single convolutional layer. At each layer, we aim to reconstruct the full-precision output feature map, given the binary input tensor x, using a set of binarized homogeneous branches:

$$F(x) = \sum_{i=1}^{K} \lambda_i f_i(x) = \sum_{i=1}^{K} \lambda_i\,\big(b^w_i \oplus \mathrm{sign}(x)\big), \qquad (3)$$

where f_i(·) is a binary convolution realized by the bitwise operations xnor(·) and popcount(·) (denoted ⊕), K is the number of branches and λ_i is a scale factor. Fig. 2 (b) illustrates the layer-wise feature reconstruction for a single block. The scale factor can be absorbed into batch normalization during inference. All f_i's in Eq. (3) have the same convolution hyper-parameters as the original floating-point counterpart. In this way, each binary branch gives a rough transformation and all the transformations are aggregated to approximate the original full-precision output feature map. It can be expected that with the increase of K, we obtain stronger fitting capability through more complex transformations.
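The activation binarization of Eq. (2), with the piecewise polynomial gradient of [43], can be sketched in the same illustrative style (again an assumption, not the paper's released code):

```python
import torch

class BinarizeActivationSTE(torch.autograd.Function):
    """Eq. (2): forward b^a = sign(x); backward approximates d sign/dx with the
    piecewise polynomial 2+2x on [-1, 0), 2-2x on [0, 1), and 0 elsewhere."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        grad = torch.zeros_like(x)
        grad = torch.where((x >= -1) & (x < 0), 2 + 2 * x, grad)
        grad = torch.where((x >= 0) & (x < 1), 2 - 2 * x, grad)
        return grad_out * grad
```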

Fig. 1: Illustration of the main objective. We directly design the BNN structure by decomposing the full-precision network (floating-point layers) into binary bases (binary layers).

Fig. 2: Overview of the baseline binarization method and the two proposed approaches, illustrated on one residual block. (a): Directly binarizing the full-precision block. (b): Reconstructing the full-precision activations layer-wise using a set of binary convolutional layers with scales λ_1, ..., λ_K. (c): Approximating the full-precision activations group-wise with a linear combination of binary bases µ_i.

Fig. 3: Illustration of several possible group-wise architectures. We assume the original full-precision network comprises four blocks, as shown in (a); skip connections are omitted for convenience. (b): Each group comprises one block of (a) and we approximate each floating-point block's output with a set of binarized blocks. (c): We partition two blocks into a group. (d): Approximating features using a flexible block combination. (e): An extreme case, where we ensemble a set of binarized networks to approximate the output before the classification layer.

A special case is K = 1, which corresponds to directly quantizing the full-precision network (Fig. 2 (a)). The structure is fixed during training and each binary convolutional kernel b^w_i is directly optimized end-to-end. During inference, the K homogeneous bases are parallelizable and hardware-friendly, which brings a large speed-up at test time. However, this strategy has an apparent limitation: the quantization function in each branch introduces a certain amount of error, and as the estimation error of previous layers propagates into the current multi-branch convolutional layer, it is enlarged by the quantization function and the final aggregation. As a result, it may cause large quantization errors, especially in deeper layers, and large deviations in the gradients during backpropagation. To solve this problem, we further propose a flexible group-wise approximation approach in Sec. III-B2.

Complexity: The bitwise XNOR operation and bit-counting can be performed with 64-way parallelism by the current generation of CPUs [15], [43]. We need to calculate K binary convolutions and K full-precision additions, so the speed-up ratio σ is

$$\sigma = \frac{c_{in} c_{out} w h\, w_{in} h_{in}}{\frac{1}{64} K c_{in} c_{out} w h\, w_{in} h_{in} + K c_{out} w_{out} h_{out}} = \frac{64}{K}\cdot\frac{c_{in} c_{out} w h\, w_{in} h_{in}}{c_{in} c_{out} w h\, w_{in} h_{in} + 64\, c_{out} w_{out} h_{out}}. \qquad (4)$$
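To illustrate Eq. (3) and Fig. 2 (b), the sketch below aggregates K binary branches with learnable scales, reusing the two binarization functions sketched above. It is a training-time sketch under our own assumptions: each branch is computed with an ordinary float convolution over {-1, +1} tensors, whereas a deployment would use xnor/popcount kernels (whose speed-up is given by Eq. (4)).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerwiseBinaryConv(nn.Module):
    """Eq. (3): F(x) = sum_i lambda_i * (b^w_i conv sign(x)) with K homogeneous branches.
    Every branch keeps the hyper-parameters of the floating-point convolution it replaces."""

    def __init__(self, in_ch, out_ch, k_bases=5, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.weights = nn.ParameterList([
            nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.01)
            for _ in range(k_bases)])
        self.scales = nn.Parameter(torch.ones(k_bases))  # lambda_i, absorbed into BN at inference
        self.stride, self.padding = stride, padding

    def forward(self, x):
        b_a = BinarizeActivationSTE.apply(x)             # sign(x) with the STE of Eq. (2)
        out = 0
        for lam, w in zip(self.scales, self.weights):
            b_w = BinarizeWeightSTE.apply(w)             # alpha * sign(w) with the STE of Eq. (1)
            out = out + lam * F.conv2d(b_a, b_w, stride=self.stride, padding=self.padding)
        return out
```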

2) Group-wise feature approximation: In the layer-wise approach, we approximate each layer separately. In this section, we instead explore the idea of a structural ensemble: approximating an entire group, where F(·) is a hierarchical cascade of layers. The motivation is as follows. As explained in Sec. III-B1, there is a trade-off between how often we reconstruct the intermediate activations and how well we suppress the quantization error. In analogy to the extreme low-level layer-wise case, we also show the extreme high-level case in Fig. 3 (e), where we directly ensemble a set of binary networks. In the ensemble model, the output quantization error of each branch can be very large, even though we aggregate the outputs only once. Clearly, these two extreme cases may not be optimal, which motivates us to explore the mid-level cases. More specifically, we propose a group-wise approximation strategy that approximates residual blocks or several consecutive layers as a whole.

We first consider the simplest case where each group consists of only one block (i.e., the group comprises one block of Fig. 3 (a)). The layer-wise approximation method above can then be easily extended to the group-wise structure. The most typical block structure is the bottleneck architecture, and we can extend Eq. (3) as:

$$F(x) = \sum_{i=1}^{K} \lambda_i\,\mu_i(x), \qquad (5)$$

where µ_i(·) is a binary residual bottleneck [2] that consists of several binary convolutions and floating-point element-wise operations (i.e., ReLU, BN), and λ_i is the scale factor. In Eq. (5), we use a linear combination of homogeneous binary bases to approximate one group, where each base µ_i is a binarized block. We illustrate such a group in Fig. 2 (c) and the framework consisting of these groups in Fig. 3 (b). In this way, we effectively keep the original residual structure in each base to preserve the network capacity, and we keep a balance between suppressing quantization error accumulation and reconstructing features. Moreover, compared to the layer-wise strategy, both the number of parameters and the complexity decrease, since there is no need to apply floating-point tensor aggregation within the group.

Furthermore, the group-wise approximation is flexible. We now analyze the case where each group may contain a different number of blocks. Suppose we partition the network into P groups, following the simple rule that each group must include one or multiple complete residual building blocks. Let {G_1, G_2, ..., G_P} be the group indexes at which we approximate the output feature maps. For the p-th group, we consider the blocks {B_{p-1}+1, ..., B_p}, where the index B_{p-1} = 0 if p = 1. We can then extend Eq. (5) into the multiple-block format:

$$F(x_{B_{p-1}+1}) = \sum_{i=1}^{K} \theta_i\, u_i^{B_p}\big(u_i^{B_p-1}(\cdots(u_i^{B_{p-1}+1}(x_{B_{p-1}+1}))\cdots)\big), \qquad (6)$$

Based on F(·), we can efficiently construct a network by stacking these groups, where each group may consist of one or multiple blocks. Different from Eq. (5), we further expose a new dimension on each base, namely the number of blocks. This greatly increases the flexibility of the framework, and we illustrate several possible connections in Fig. 3. We describe how to learn the connections in Sec. III-B3.
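A minimal sketch of the group-wise approximation of Eq. (5): each base µ_i is a binarized block that keeps the floating-point residual structure, and the K bases are combined linearly. The factory argument make_binary_block is a placeholder for any binarized block (for instance one built from the layer-wise convolution sketched earlier); all names here are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class GroupWiseBase(nn.Module):
    """Eq. (5): F(x) = sum_i lambda_i * mu_i(x), where each mu_i is a binarized residual
    block (binary convolutions plus floating-point BN, ReLU and skip connection)."""

    def __init__(self, make_binary_block, k_bases=5):
        super().__init__()
        self.bases = nn.ModuleList([make_binary_block() for _ in range(k_bases)])  # mu_i
        self.scales = nn.Parameter(torch.ones(k_bases))                            # lambda_i

    def forward(self, x):
        return sum(lam * mu(x) for lam, mu in zip(self.scales, self.bases))
```

Stacking such groups (one or more blocks per group) yields the architectures sketched in Fig. 3 (b)-(d); Eq. (6) corresponds to letting each µ_i itself be a cascade of blocks.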
3) Learning to decompose: There is still a problem to be solved in Eq. (6). Since the network has N blocks, the possible number of connections is 2^N, so it is not practical to enumerate all possible structures for training. To solve this problem, we explain how to automatically learn the optimal structural decomposition. We introduce a fusion gate as the soft connection between blocks µ(·), and define the input of the i-th branch for the n-th block as:

$$G_i^n = \mathrm{sigmoid}(\theta_i^n), \qquad x_i^n = G_i^n \odot \mu_i^{n-1}(x_i^{n-1}) + (1 - G_i^n) \odot \sum_{j=1}^{K} \lambda_j\,\mu_j^{n-1}(x_j^{n-1}), \qquad (7)$$

where θ ∈ R^K is a learnable parameter. The branch input x_i^n is a weighted combination of two paths. The first path is the output of the corresponding i-th branch in the (n-1)-th block, which is a straight connection. The second path is the aggregated output of the (n-1)-th block. In this way, we let more information flow into each branch and increase the number of gradient paths, which improves the convergence of BNNs. The extreme case is when Σ_{i=1}^{N} G_i = 0: Eq. (7) then reduces to Eq. (5), where we approximate each block independently. When Σ_{i=1}^{N} G_i = N, it corresponds to directly ensembling K binary networks, as shown in Fig. 3 (e). We also discuss further extensions in Sec. VIII.

Fig. 4: Comparison between the conventional dilated convolution and BPAC. For convenience, the group has only one convolutional layer. (a): The original floating-point dilated convolution (e.g., a 3x3 convolution with dilation rate 2). (b): We decompose the floating-point atrous convolution into a combination of binary bases, where each base uses a different dilation rate, and the output features of the binary branches are summed as the final representation.

Fig. 5: Illustration of the soft connection between two neighbouring blocks. The fusion gate is optimized directly end-to-end.
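The soft connection of Eq. (7) can be read as the following sketch: θ is a learnable vector of length K, and the gate mixes, per branch, the straight connection from the same branch of the previous block with the aggregated output of that block. This is only an illustrative reading of Eq. (7), not the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Eq. (7): x_i^n = G_i * mu_i(x_i^{n-1}) + (1 - G_i) * sum_j lambda_j * mu_j(x_j^{n-1}),
    with G_i = sigmoid(theta_i). G_i -> 1 keeps the straight per-branch connection,
    G_i -> 0 feeds every branch the aggregated output of the previous block."""

    def __init__(self, k_bases=5):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(k_bases))   # one gate logit per branch
        self.scales = nn.Parameter(torch.ones(k_bases))   # lambda_j of the previous block

    def forward(self, branch_outputs):
        # branch_outputs: list of K tensors mu_j^{n-1}(x_j^{n-1}) from the previous block
        gates = torch.sigmoid(self.theta)
        aggregated = sum(lam * y for lam, y in zip(self.scales, branch_outputs))
        return [g * y + (1 - g) * aggregated for g, y in zip(gates, branch_outputs)]
```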

C. From image classification to semantic segmentation

The proposed Group-Net structure has strong generalization ability. We first analyse the relationship between semantic segmentation and general image classification. Semantic segmentation is a dense pixel-wise classification task. The image classification network serves as the encoder of the segmentation model, and the feature quality greatly influences the pixel-wise prediction. In other words, binarizing the encoder inevitably introduces a large bias into the final prediction. How can this be compensated? For segmentation, it is essential to capture multi-scale information in order to accurately classify regions of arbitrary scale. Inspired by this, we propose to embed rich multi-scale context into Group-Net. To achieve this, we first revisit Atrous Spatial Pyramid Pooling [24], where several parallel atrous convolutions with different atrous rates are applied on top of the feature map. However, traditional ASPP is applied only on top of the features extracted from the network backbone, while each block in the network can employ only one atrous rate, as shown in Fig. 4 (a). Interestingly, the proposed Group-Net can naturally remove this limitation of ASPP and seamlessly embed it into the backbone. Specifically, each binary base uses a different atrous rate, as shown in Fig. 4 (b). We call the new structure Binary Parallel Atrous Convolution (BPAC). In this way, each binary base captures features at a different scale, and we fuse the multi-level features from all the binary branches by summation at the end of the group. Moreover, changing the atrous rates does not theoretically increase the complexity of binary atrous convolution. We effectively compensate for the binarization error by incorporating rich context into the approximation for semantic segmentation, which brings a large performance improvement, as shown in Sec. V-C. In this paper, we use the same ResNet backbone as in [24], [25] with output stride = 8, where the last two blocks employ atrous convolution. In BPAC, we set rates = 2·{1, ..., K} and rates = 4·{1, ..., K} for the K bases in the last two blocks, respectively. BPAC is a general framework and can be applied to other tasks such as depth estimation.
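Under the same assumptions as the earlier layer-wise sketch (float convolutions standing in for xnor/popcount kernels, and the two STE functions above in scope), BPAC can be sketched as K parallel binary branches that differ only in their dilation rate, with the outputs summed as in Fig. 4 (b):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BPACConv(nn.Module):
    """Binary Parallel Atrous Convolution: branch i uses dilation rate_scale * i, so each
    binary base captures context at a different scale at no extra binary-convolution cost."""

    def __init__(self, in_ch, out_ch, k_bases=5, rate_scale=2):
        super().__init__()
        self.weights = nn.ParameterList([
            nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01) for _ in range(k_bases)])
        self.scales = nn.Parameter(torch.ones(k_bases))
        self.rates = [rate_scale * (i + 1) for i in range(k_bases)]  # e.g. 2, 4, ..., 2K

    def forward(self, x):
        b_a = BinarizeActivationSTE.apply(x)
        out = 0
        for lam, w, rate in zip(self.scales, self.weights, self.rates):
            b_w = BinarizeWeightSTE.apply(w)
            # padding = dilation keeps the spatial size of a 3x3 atrous convolution unchanged
            out = out + lam * F.conv2d(b_a, b_w, padding=rate, dilation=rate)
        return out
```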
IV. DISCUSSIONS

Theoretical computational complexity analysis: We provide a clear and comprehensive comparison of various quantization approaches in Table I. With respect to the trade-off between computational complexity and prediction accuracy, our approach achieves the best performance. For example, in the previous state-of-the-art binary model [31], each convolutional layer is approximated using K weight bases and K activation bases, which requires K² binary convolutions. In contrast, we only need to approximate several intermediate feature maps with K structural bases. As reported in Sec. V-B, we save approximately K times the computational complexity while still achieving comparable Top-1 accuracy. Since we use K structural bases, the number of parameters increases by K times in comparison to the full-precision counterpart, but we still save memory bandwidth by 32/K times since all the weights in our approach are binary. Because there are element-wise operations between groups, the computational complexity saving is slightly less than 64/K.

Differences between Group-Net and fixed-point approaches [8], [20], [22]: Intuitively, one might think the proposed Group-Net with K bases is equivalent to K-bit fixed-point approaches. However, this is not the case. We now show the bitwise operations for the inner product between fixed-point weights and activations. Let a weight vector w ∈ R^N be encoded by vectors b^w_i ∈ {-1, 1}^N, i = 0, ..., K-1. Assume we also quantize the activations to K bits; similarly, the activations x can be encoded by b^a_j ∈ {-1, 1}^N, j = 0, ..., K-1. The convolution can then be written as¹

$$Q_K(w)^{T} Q_K(x) = \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} 2^{\,i+j}\,\big(b^w_i \odot b^a_j\big). \qquad (8)$$

During inference, it first needs to obtain the encoding b^a_j for each bit via linear regression, and then it calculates and sums over K² xnor(·) and popcount(·) operations; the complexity is slightly larger than O(K²). Moreover, the output range for each element of the activations is [-(2^K - 1)² N, (2^K - 1)² N]. In contrast, we directly obtain b^a by applying sign(·), and we only need K xnor(·) and popcount(·) operations (see Eq. (3)) before summing the outputs, so the computational complexity is slightly larger than O(K). For binary convolution, each element-wise product lies in {-1, 1}, and the value range for each element after summation is [-KN, KN], in which the number of representable distinct values is much smaller than in fixed-point methods. In summary, we need only 1/K of the complexity of K-bit quantization and also save a factor of (2^K - 1)²/K in intermediate storage during computation. Even K-bit quantization with similar computational complexity requires more memory bandwidth to feed signals in SRAM or in registers. Since the representability and computational complexity differ substantially, it is not entirely appropriate to directly compare Group-Net with fixed-point methods; even so, we provide such a comparison in Sec. V-B2.

Differences between Group-Net and multiple-binarization approaches: In [31], a linear combination of binary weight/activation bases is obtained from the full-precision weights/activations without being directly learned, and a linear regression problem has to be solved for each layer during forward propagation. In contrast, we directly design the binary network structure, where the binary weights are optimized end-to-end. [33]-[35] propose to use multiple binarizations, where each bit binarizes the residual error from the previous bits; however, this is a sequential process that cannot be parallelized. All multiple-binarization methods belong to local tensor approximation.

¹We omit learned scales here and only consider uniform quantization.
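The following small numerical check illustrates the contrast drawn above: a K-bit fixed-point inner product expands into K² binary inner products weighted by powers of two (Eq. (8)), whereas a Group-Net-style sum needs only K binary inner products (Eq. (3)). The {-1, +1} bit encoding below is chosen purely for illustration and is our own assumption, not the paper's exact quantizer.

```python
import torch

K, N = 4, 1024  # bit-width and vector length (illustrative values)

# Random {-1, +1} encodings of a K-bit weight vector and a K-bit activation vector.
b_w = torch.randint(0, 2, (K, N)) * 2 - 1
b_a = torch.randint(0, 2, (K, N)) * 2 - 1

# Fixed-point values reconstructed from their binary expansions.
q_w = sum((2 ** i) * b_w[i] for i in range(K))
q_a = sum((2 ** j) * b_a[j] for j in range(K))

# Eq. (8): the fixed-point inner product decomposes into K*K binary inner products.
lhs = torch.dot(q_w.float(), q_a.float())
rhs = sum((2 ** (i + j)) * torch.dot(b_w[i].float(), b_a[j].float())
          for i in range(K) for j in range(K))
assert torch.isclose(lhs, rhs)  # K^2 = 16 binary dot products were needed

# A Group-Net-style aggregation with K bases needs only K binary inner products.
x_sign = torch.randint(0, 2, (N,)) * 2 - 1
lam = torch.rand(K)
group_out = sum(lam[i] * torch.dot(b_w[i].float(), x_sign.float()) for i in range(K))
print(lhs.item(), rhs.item(), group_out.item())
```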

| Model | Weights | Activations | Operations | Memory saving | Computation saving |
| Full-precision DNN | F | F | +, -, x | 1 | 1 |
| [14], [15] | B | B | XNOR, bitcount | 32 | ~64 |
| [19], [44] | B | F | +, - | 32 | ~2 |
| [45], [46] | Q_K | F | +, - | 32/K | < 2 |
| [8], [17], [20], [22] | Q_K | Q_K | +, - | 32/K | < 64/K² |
| [31] | K x B | K x B | +, -, XNOR, bitcount | 32/K | < 64/K² |
| Ours | K x (B, B) | | +, -, XNOR, bitcount | 32/K | < 64/K |

TABLE I: Theoretical computational complexity and memory usage for all categories of quantization approaches. F: full-precision, B: binary, Q_K: K-bit quantization; K x (B, B) denotes K structural bases, each with binary weights and binary activations.

In contrast to value approximation, we propose a structural ensemble approach to mimic the full-precision network. Moreover, tensor-based methods are tightly tied to local value approximation and hardly generalize to other tasks. In addition, our structure decomposition strategy achieves much better performance than tensor-level approximation, as shown in the ablation study (Sec. VII). More discussion is provided in Sec. VIII.

V. EXPERIMENT

We define several methods for comparison as follows:
- Layerwise: implements the layer-wise feature approximation strategy described in Sec. III-B1.
- Group-Net v1: implements the group-wise feature approximation strategy, where each base consists of one block. It corresponds to Eq. (5) and is illustrated in Fig. 3 (b).
- Group-Net v2: identical to Group-Net v1 except that each group base has two blocks. It is illustrated in Fig. 3 (c) and corresponds to Eq. (6).
- Group-Net v3: an extreme case where each base is a whole network, which can be treated as an ensemble of binary networks. This case is shown in Fig. 3 (e).
- Ours: implements the full model with learned soft connections, as described in Eq. (7).
- Ours**: following Bi-Real Net [43], we additionally apply a shortcut bypassing every convolution to improve convergence. This strategy can only be applied to the basic blocks in ResNet-18 and ResNet-34.

Due to limited space, the ablation study is provided in the appendix (Sec. VII).

A. Implementation details

As in [8], [15], [20], [21], we quantize the weights and activations of all convolutional layers, except that the first and last layers keep full precision. In all ImageNet experiments, training images are resized to 256x256, and a 224x224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted. We do not use any further data augmentation. We use simple single-crop testing for standard evaluation. No bias term is used. We first pretrain the full-precision model as initialization and then fine-tune the binary counterpart (for pretraining Ours, we use ReLU(·) as the nonlinearity; for pretraining Ours**, we use HardTanh(·)). We use Adam [47] for optimization. For training all binary networks, the mini-batch size is set to 256.
Several representative networks are tested: ResNet-18 [2], ResNet-34 and ResNet- 50. As discussed in Sec. VIII, binary approaches and fixedpoint approaches differ a lot in computational complexity as well as storage consumption. So we compare the proposed approach with binary neural networks in Table II and fixedpoint approaches in Table III, respectively. 1) Comparison with binary neural networks: Since we employ binary weights and binary activations, we directly compare to the previous state-of-the-art binary approaches, including BNN [14], XNOR-Net [15],-Real Net [43] and ABC-Net [31]. In all cases, our model uses binary weights and activations. We report the results in Table II and summarize the following points. 1): The most comparable baseline for Group-Net is ABC-Net. As discussed in Sec. VIII, we save considerable computational complexity while still achieving better performance compared to ABC-Net. In comparison to directly binarizing networks, Group-Net achieves much better performance but needs K times more storage and complexity. However, the K homogeneous bases can be easily parallelized on the real chip. 2): By comparing Ours** (5 bases) and Ours (8 bases), we can observe comparable performance. It justifies adding more shortcuts can facilitate BNNs training. 3): For Bottleneck structure in ResNet-50, we observe larger quantization error than the counterparts using basic blocks with 3 3 convolutions in ResNet-18 and ResNet-34. We assume that this is mainly attributable to the 1 1 convolutions in Bottleneck. The reason is 1 1 filters are limited to two states only (either 1 or -1) and they have very limited learning power. What s more, the bottleneck structure reduces the number of filters significantly, which means the gradient paths are greatly reduced. In other

Even though the bottleneck structure can benefit full-precision training, it really needs to be redesigned for BNNs: to increase the number of gradient paths, the 1x1 convolutions should be removed.

2) Comparison with fixed-point approaches: Since we use K binary group bases, we compare our approach with at least K-bit fixed-point approaches. In Table III, we compare with the state-of-the-art fixed-point approaches DoReFa-Net [20], SYQ [50] and LQ-Nets [22]. As described in Sec. IV, K binarizations are superior to K-bit quantization with respect to resource consumption. DoReFa-Net and LQ-Nets use 2-bit weights and 2-bit activations; SYQ employs binary weights and 8-bit activations. All comparison results are cited directly from the corresponding papers. Even though we consume much fewer resources, we achieve comparable or even improved performance relative to the compared approaches. LQ-Nets is the current best-performing fixed-point approach and its activations have a long-tail distribution. As discussed in Sec. IV, we require less memory bandwidth while still achieving comparable accuracy; the remaining performance gap is mainly due to the limited representability of binary activations.

3) Hardware implementation on ImageNet: We implement ResNet-34 using Ours** and test the inference time on an Intel Xeon E5 v3 CPU with 8 cores. We use the open-source BMXNet [51] and OpenMP for acceleration. The speed-up of the 64-bit OpenMP xnor kernel is greatly influenced by the input channel size, the number of filters, the filter size and the batch size. Moreover, BN + sign(·) is accelerated by XNOR operations following the implementation in FINN [52]. We cannot accelerate floating-point element-wise operations, including tensor additions and nonlinearities. We finally achieve a 7.5x speed-up compared to the original ResNet-34 model, where inter-core communication occupies a considerable portion of the CPU time. We also achieve an actual 5.8x memory saving. We expect better acceleration on parallelization-friendly FPGA platforms, but we currently do not have sufficient resources for that.

C. Evaluation on PASCAL VOC

We evaluate the proposed methods on the PASCAL VOC 2012 semantic segmentation benchmark [53], which contains 20 foreground object classes and one background class. The original dataset contains 1,464 (train), 1,449 (val) and 1,456 (test) images. The dataset is augmented with the extra annotations from [54], resulting in 10,582 training images. Performance is measured in terms of averaged pixel intersection-over-union (mIoU) over the 21 classes. Our experiments are based on the original FCN [5] with ResNet-18, -34 and -50 backbones, in which we use atrous convolution in the last two blocks with output stride = 8. We first pretrain the full-precision classification network on ImageNet and fine-tune it on PASCAL VOC. During fine-tuning, we use Adam with initial learning rate 1e-4, decay 1e-5 and batch size 16. We set the number of bases K = 5 in the experiments. We train for 40 epochs in total and decay the learning rate by a factor of 10 at epochs 20 and 30. We do not add any auxiliary loss or ASPP. We empirically observe that the full-precision FCN with dilation rates (4, 8) in the last two blocks achieves the best performance. We report the main results in Table IV. From the results, we observe that when all bases use the same dilation rate, there is a large performance gap with respect to the full-precision counterpart.
This performance drop is consistent with the classification results on ImageNet in Table II, and shows that the quality of the extracted features has a great impact on segmentation performance. Moreover, by utilizing the task-specific BPAC, we observe a significant performance increase with no theoretical computational complexity added, which strongly justifies the scalability and flexibility of the proposed Group-Net. We expect BPAC can be further extended to other dense prediction tasks. We also observe that Ours + BPAC based on ResNet-34 even outperforms the ResNet-50 counterpart, which shows that the widely used bottleneck structure is not well suited to BNNs.

VI. CONCLUSION

In this paper, we have begun to explore highly efficient and accurate CNN architectures with binary weights and activations. Specifically, we have proposed to directly decompose the full-precision network into multiple groups, each of which is approximated using a set of binary bases that can be optimized in an end-to-end manner. We also propose to learn the decomposition automatically. Experimental results have demonstrated the effectiveness of the proposed approach on the ImageNet classification task. Moreover, we have generalized Group-Net from image classification to semantic segmentation and achieved promising performance on PASCAL VOC. We have implemented the homogeneous multi-branch structure on CPU and achieved promising acceleration at test-time inference.

APPENDIX

VII. ABLATION STUDY

The core idea of Group-Net is decomposing a floating-point network into binary structures, and we evaluate it comprehensively in this section on the ImageNet dataset with ResNet-18 and ResNet-50.

1) Group space exploration: We are interested in exploring the influence of different decomposition strategies. We present the results in Table V. We observe that learning the soft connections between blocks results in the best performance on ResNet-18, while methods based on hard connections perform relatively worse. From the results, we conclude that designing a compact binary structure is essential for highly accurate classification. Moreover, we expect to further boost performance by integrating with the NAS approaches discussed in Sec. VIII.

| Model | Metric | Full | BNN | XNOR | Bi-Real Net | ABC-Net (25 bases) | Ours (5 bases) | Ours** (5 bases) | Ours (8 bases) |
| ResNet-18 | Top-1 | - | 42.2% | 51.2% | 56.4% | 65.0% | 64.8% | 67.0% | 67.5% |
| ResNet-18 | Top-5 | - | 67.1% | 73.2% | 79.5% | 85.9% | 85.7% | 87.5% | 88.0% |
| ResNet-34 | Top-1 | - | - | - | - | 68.4% | 68.5% | 70.5% | 71.8% |
| ResNet-34 | Top-5 | - | - | - | - | 88.2% | 88.0% | 89.3% | 90.4% |
| ResNet-50 | Top-1 | - | - | - | - | - | 69.5% | - | - |
| ResNet-50 | Top-5 | - | - | - | - | - | 89.2% | - | - |

TABLE II: Comparison with the state-of-the-art binary models using ResNet-18, ResNet-34 and ResNet-50 on ImageNet. All compared results are cited directly from the original papers.

| Model | W | A | Top-1 | Top-5 |
| ResNet-18 Full-precision | 32 | 32 | - | 89.4% |
| ResNet-18 Ours (4 bases) | 1 | 1 | - | 85.6% |
| LQ-Net [22] | 2 | 2 | - | 85.9% |
| DoReFa-Net [20] | 2 | 2 | - | 84.4% |
| SYQ [50] | 1 | 8 | - | 84.6% |

TABLE III: Comparison with the state-of-the-art fixed-point models with ResNet-18 on ImageNet.

| Backbone / decoder | Model | mIoU |
| ResNet-18, FCN-32s | Full-precision | 64.9 |
| | Ours | 60.5 |
| | Ours + BPAC | 63.8 |
| | Ours** + BPAC | 65.1 |
| ResNet-18, FCN-16s | Full-precision | 67.3 |
| | Ours | 62.7 |
| | Ours + BPAC | 66.5 |
| | Ours** + BPAC | 67.4 |
| ResNet-34, FCN-32s | Full-precision | 72.7 |
| | Ours | 68.2 |
| | Ours + BPAC | 71.2 |
| | Ours** + BPAC | 72.8 |
| ResNet-50, FCN-32s | Full-precision | 73.1 |
| | Ours | 67.2 |
| | Ours + BPAC | 70.4 |

TABLE IV: Performance (mIoU) on the PASCAL VOC 2012 validation set.

2) Layer-wise vs. group-wise: We explore the difference between the layer-wise and group-wise design strategies in Table V. Comparing the results, we find that Ours outperforms Layerwise by 8.2% while saving the complexity of the floating-point tensor aggregation. Note that the Layerwise approach can be treated as a kind of tensor approximation, which has similarities with the multiple-binarization methods of [31], [33]-[35]; the differences are described in Sec. IV. This strongly shows the necessity of employing the group-wise design strategy to obtain promising results. We speculate that this significant gain is partly due to the preserved block structure in the binary bases. It also shows that, apart from designing accurate binarization functions, it is essential to design an appropriate structure for BNNs.

3) Effect of the number of bases: We further explore the influence of the number of bases on the final performance in Table VI. When the number is set to 1, it corresponds to directly binarizing the original full-precision network, and we observe an apparent accuracy drop compared to the full-precision counterpart. With more bases employed, the performance steadily increases. This can be attributed to a better fit of the intermediate feature maps, which is a trade-off between accuracy and complexity; with enough bases, the network should have the capacity to approximate the full-precision network precisely. With the multi-branch group-wise design, we can achieve high accuracy while still significantly reducing inference time and power consumption. Interestingly, each base can be implemented with few resources, and the parallel structure is quite friendly to FPGA/ASIC.

VIII. MORE DISCUSSIONS

Relation to ResNeXt [55]: The homogeneous multi-branch architecture design shares some spirit with ResNeXt and enjoys the advantage of introducing a cardinality dimension. However, our objectives are totally different. ResNeXt aims to increase the capacity while maintaining the complexity: it first divides the input channels into groups and performs efficient group convolutions, and then all the group outputs are aggregated to approximate the original feature map.
In contrast, we first divide the network into groups and directly replicate the floating-point structure for each branch, while both weights and activations are quantized. In this way, we can reconstruct the full-precision outputs by aggregating a set of low-precision transformations, reducing the complexity on energy-efficient hardware. Furthermore, our transformations are not restricted to a single block as in ResNeXt.

Group-Net has strong scalability and flexibility: The group-wise approximation approach can be efficiently integrated with Neural Architecture Search (NAS) frameworks [28]-[30], [40], [56] to explore the optimal architecture. Based on Group-Net, we can further add the number of bases, the number of filters, the connections among bases and even the bit-width of each structural base to the search space. The proposed approach can also be combined with a knowledge distillation strategy as in [8], [57]. The basic idea is to train a target network alongside a pretrained guidance network, with an additional regularizer that minimizes the difference between the student's and the teacher's intermediate feature representations for higher accuracy. In this way, we expect to further decrease the number of bases while maintaining the performance.

IX. EXTENDING GROUP-NET TO BINARY WEIGHTS AND LOW-PRECISION ACTIVATIONS

All the experiments above are based on binary weights and binary activations. To trade off accuracy against computational complexity, we can add more bases as discussed in Sec. VII-3; however, we can also increase the bit-width of the activations for better accuracy.

| Model | Bases | Top-1 | Top-5 | Top-1 gap | Top-5 gap |
| ResNet-18 Full-precision | - | - | 89.4% | - | - |
| ResNet-18 Ours | 5 | - | 85.7% | 4.9% | 3.7% |
| ResNet-18 Group-Net v1 | - | - | 84.8% | 6.7% | 4.6% |
| ResNet-18 Group-Net v2 | - | - | 84.1% | 7.5% | 5.3% |
| ResNet-18 Group-Net v3 | - | - | 81.3% | 10.8% | 8.1% |
| ResNet-18 Layerwise | - | - | 78.7% | 14.1% | 10.7% |

TABLE V: Comparison of different decomposition strategies with ResNet-18 on ImageNet. The Top-1/Top-5 gaps to the corresponding full-precision network are also reported.

| Model | Bases | Top-1 | Top-5 | Top-1 gap | Top-5 gap |
| Full-precision | - | - | 89.4% | - | - |
| Ours | 1 | - | 77.6% | 15.5% | 11.8% |
| Ours | - | - | 84.2% | 7.2% | 5.2% |
| Ours | 5 | - | 85.7% | 4.9% | 3.7% |

TABLE VI: Validation accuracy of Group-Net on ImageNet with different numbers of bases. All cases are based on ResNet-18 with binary weights and activations.

We conduct experiments on the ImageNet dataset and report the accuracy in Table VII, Table VIII and Table IX.

1) Fixed-point activation quantization: We apply simple uniform activation quantization in this paper. As the output of the ReLU function is unbounded, quantization after ReLU requires a high dynamic range, which causes large quantization errors, especially when the bit-precision is low. To alleviate this problem, similar to [20], [58], we use a clip function h(y) = clip(y, 0, β) to limit the range of the activation to [0, β], where β is fixed (not learned) during training. The truncated activation output y is then uniformly quantized to k bits (k > 1), and we still use the STE to estimate the gradient:

$$\text{Forward: } \tilde{y} = \mathrm{round}\!\Big(y \cdot \frac{2^k - 1}{\beta}\Big) \cdot \frac{\beta}{2^k - 1}, \qquad \text{Backward: } \frac{\partial \ell}{\partial y} = \frac{\partial \ell}{\partial \tilde{y}}. \qquad (9)$$

Since the weights are binary, the multiplications in the convolution are replaced by fixed-point additions. One can simply replace the uniform quantizer with a non-uniform quantizer for more accurate quantization, similar to [21], [22].

2) Implementation details: Data preprocessing follows the same pipeline as for BNNs. We also quantize the weights and activations of all convolutional layers, except that the first and last layers keep full precision. For training ResNet with non-binary activations, the learning rate starts at 0.05 and is divided by 10 whenever it saturates. We use SGD with Nesterov momentum for optimization; the mini-batch size is set to 128 and the momentum ratio is 0.9. We learn directly from scratch, since we empirically observe that fine-tuning does not bring further benefits. The convolution and element-wise operations are applied in the order: Conv → BN → ReLU → Quantize.

3) Evaluation on ImageNet: From Table VII, we observe that with binary weights and fixed-point activations we can achieve highly accurate results. For example, also referring to Table II, the Top-1 accuracy drops for ResNet-50 Ours with ternary and binary activations are 1.5% and 6.5%, respectively. Furthermore, our approach still works well on plain network structures such as AlexNet, as shown in Table VIII. We also provide a comparison with different numbers of bases in Table IX.

| Model | W | A | Top-1 | Top-5 | Top-1 gap | Top-5 gap |
| ResNet-18 Full-precision | 32 | 32 | - | 89.4% | - | - |
| ResNet-18 Ours | 1 | - | - | 89.0% | 0.1% | 0.4% |
| ResNet-18 Ours | 1 | - | - | 89.8% | -0.7% | -0.4% |
| ResNet-18 Group-Net v1 | 1 | - | - | 88.5% | 0.5% | 0.9% |
| ResNet-18 Group-Net v2 | 1 | - | - | 87.9% | 1.4% | 1.5% |
| ResNet-18 Group-Net v3 | 1 | - | - | 85.0% | 5.2% | 4.4% |
| ResNet-18 Layerwise | 1 | - | - | 82.2% | 9.6% | 7.2% |
| ResNet-50 Full-precision | 32 | 32 | - | 92.9% | - | - |
| ResNet-50 Ours | 1 | - | - | 91.5% | 1.5% | 1.4% |
| ResNet-50 Ours | 1 | - | - | 92.7% | 0.0% | 0.2% |

TABLE VII: Validation accuracy of Group-Net on ImageNet with different choices of W and A for Group-Net v1. W and A refer to the weight and activation bit-width, respectively. The effect of different group divisions is also reported (Layerwise, Group-Net v1, Group-Net v2 and Group-Net v3).

| Model | Full-precision | Layerwise | Group-Net v1 | Ours |
| Top-1 | - | 54.2% | 57.3% | 57.8% |
| Top-5 | - | 77.6% | 80.1% | 80.9% |

TABLE VIII: Performance of AlexNet on ImageNet. All cases use binary weights and 2-bit activations.

| Model | Bases | bitW | bitA | Top-1 | Top-5 | Top-1 gap | Top-5 gap |
| Full-precision | - | 32 | 32 | - | 89.4% | - | - |
| Ours | - | 1 | 4 | - | 81.8% | 9.6% | 7.6% |
| Ours | - | 1 | 4 | - | 88.7% | 1.2% | 0.7% |
| Ours | - | 1 | 4 | - | 89.5% | -0.4% | -0.1% |

TABLE IX: Validation accuracy of Group-Net on ImageNet with different numbers of bases. All cases are based on ResNet-18 with binary weights and 4-bit activations.

REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[4] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91-99.
W and A refer to the weight and activation bitwidth, respectively. Effect of different group divisions is also reported as Layerwise, Group-Net v1, Group-Net v2 and Group-Net v3. Model Full-precision Layerwise Group-Net v1 Ours Top % 54.2% 57.3% 57.8 % Top % 77.6% 80.1% 80.9% TABLE VIII: Performance of AlexNet on ImageNet. All cases use binary weights and 2-bit activations. Model Bases bitw bita Top-1 Top-5 Top-1 gap Top-5 gap Full-precision % 89.4% - - Ours % 81.8% 9.6% 7.6% Ours % 88.7% 1.2% 0.7% Ours % 89.5% -0.4% -0.1% TABLE IX: Validation accuracy of Group-Net on ImageNet with number of bases. All cases are based on the ResNet-18 network with binary weights and 4-bit activations. REFERENCES [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proc. Adv. Neural Inf. Process. Syst., pages , [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages , , 5, 7 [3] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages , [4] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proc. Adv. Neural Inf. Process. Syst., pages 91 99,


More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error

Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error Chunhui Jiang, Guiying Li, Chao Qian, Ke Tang Anhui Province Key Lab of Big Data Analysis and Application, University

More information

Deep Learning with Low Precision by Half-wave Gaussian Quantization

Deep Learning with Low Precision by Half-wave Gaussian Quantization Deep Learning with Low Precision by Half-wave Gaussian Quantization Zhaowei Cai UC San Diego zwcai@ucsd.edu Xiaodong He Microsoft Research Redmond xiaohe@microsoft.com Jian Sun Megvii Inc. sunjian@megvii.com

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

Accelerating Convolutional Neural Networks by Group-wise 2D-filter Pruning

Accelerating Convolutional Neural Networks by Group-wise 2D-filter Pruning Accelerating Convolutional Neural Networks by Group-wise D-filter Pruning Niange Yu Department of Computer Sicence and Technology Tsinghua University Beijing 0008, China yng@mails.tsinghua.edu.cn Shi Qiu

More information

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE-

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- Workshop track - ICLR COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- CURRENT NEURAL NETWORKS Daniel Fojo, Víctor Campos, Xavier Giró-i-Nieto Universitat Politècnica de Catalunya, Barcelona Supercomputing

More information

Deep Learning (CNNs)

Deep Learning (CNNs) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Deep Learning (CNNs) Deep Learning Readings: Murphy 28 Bishop - - HTF - - Mitchell

More information

A practical theory for designing very deep convolutional neural networks. Xudong Cao.

A practical theory for designing very deep convolutional neural networks. Xudong Cao. A practical theory for designing very deep convolutional neural networks Xudong Cao notcxd@gmail.com Abstract Going deep is essential for deep learning. However it is not easy, there are many ways of going

More information

Multiscale methods for neural image processing. Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein Hao Li, Soham De, Zheng Xu, Hanan Samet

Multiscale methods for neural image processing. Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein Hao Li, Soham De, Zheng Xu, Hanan Samet Multiscale methods for neural image processing Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein Hao Li, Soham De, Zheng Xu, Hanan Samet A TALK IN TWO ACTS Part I: Stacked U-nets The globalization

More information

Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net

Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net Supplementary Material Xingang Pan 1, Ping Luo 1, Jianping Shi 2, and Xiaoou Tang 1 1 CUHK-SenseTime Joint Lab, The Chinese University

More information

arxiv: v1 [cs.ne] 20 Apr 2018

arxiv: v1 [cs.ne] 20 Apr 2018 arxiv:1804.07802v1 [cs.ne] 20 Apr 2018 Value-aware Quantization for Training and Inference of Neural Networks Eunhyeok Park 1, Sungjoo Yoo 1, Peter Vajda 2 1 Department of Computer Science and Engineering

More information

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Department of Electrical and Computer Engineering Seoul National University Seoul, 08826 Korea Email: sajid@dsp.snu.ac.kr, khwang@dsp.snu.ac.kr, wysung@snu.ac.kr

More information

Multiple Wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis

Multiple Wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis Multiple Wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis Minghang Zhao, Myeongsu Kang, Baoping Tang, Michael Pecht 1 Backgrounds Accurate fault diagnosis is important to ensure

More information

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions

More information

Value-aware Quantization for Training and Inference of Neural Networks

Value-aware Quantization for Training and Inference of Neural Networks Value-aware Quantization for Training and Inference of Neural Networks Eunhyeok Park 1, Sungjoo Yoo 1, and Peter Vajda 2 1 Department of Computer Science and Engineering Seoul National University {eunhyeok.park,sungjoo.yoo}@gmail.com

More information

CS230: Lecture 5 Advanced topics in Object Detection

CS230: Lecture 5 Advanced topics in Object Detection CS230: Lecture 5 Advanced topics in Object Detection Stanford University Today s outline We will learn: - the Intuition behind object detection methods - a series of computer vision papers I. Object detection

More information

Bayesian Networks (Part I)

Bayesian Networks (Part I) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks (Part I) Graphical Model Readings: Murphy 10 10.2.1 Bishop 8.1,

More information

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, et al. Google arxiv:1609.08144v2 Reviewed by : Bill

More information

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Machine Learning for Computer Vision 8. Neural Networks and Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group INTRODUCTION Nonlinear Coordinate Transformation http://cs.stanford.edu/people/karpathy/convnetjs/

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35 Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How

More information

Normalization Techniques in Training of Deep Neural Networks

Normalization Techniques in Training of Deep Neural Networks Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,

More information

Convolutional neural networks

Convolutional neural networks 11-1: Convolutional neural networks Prof. J.C. Kao, UCLA Convolutional neural networks Motivation Biological inspiration Convolution operation Convolutional layer Padding and stride CNN architecture 11-2:

More information

Jakub Hajic Artificial Intelligence Seminar I

Jakub Hajic Artificial Intelligence Seminar I Jakub Hajic Artificial Intelligence Seminar I. 11. 11. 2014 Outline Key concepts Deep Belief Networks Convolutional Neural Networks A couple of questions Convolution Perceptron Feedforward Neural Network

More information

SHAKE-SHAKE REGULARIZATION OF 3-BRANCH

SHAKE-SHAKE REGULARIZATION OF 3-BRANCH SHAKE-SHAKE REGULARIZATION OF 3-BRANCH RESIDUAL NETWORKS Xavier Gastaldi xgastaldi.mba2011@london.edu ABSTRACT The method introduced in this paper aims at helping computer vision practitioners faced with

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Vanishing and Exploding Gradients. ReLUs. Xavier Initialization

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Vanishing and Exploding Gradients. ReLUs. Xavier Initialization TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 Vanishing and Exploding Gradients ReLUs Xavier Initialization Batch Normalization Highway Architectures: Resnets, LSTMs and GRUs Causes

More information

SGD and Deep Learning

SGD and Deep Learning SGD and Deep Learning Subgradients Lets make the gradient cheating more formal. Recall that the gradient is the slope of the tangent. f(w 1 )+rf(w 1 ) (w w 1 ) Non differentiable case? w 1 Subgradients

More information

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks. Microsoft Research

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks. Microsoft Research LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua Microsoft Research zdqzeros@gmail.com jiaoyan@microsoft.com

More information

Binary Deep Learning. Presented by Roey Nagar and Kostya Berestizshevsky

Binary Deep Learning. Presented by Roey Nagar and Kostya Berestizshevsky Binary Deep Learning Presented by Roey Nagar and Kostya Berestizshevsky Deep Learning Seminar, School of Electrical Engineering, Tel Aviv University January 22 nd 2017 Lecture Outline Motivation and existing

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Weighted-Entropy-based Quantization for Deep Neural Networks

Weighted-Entropy-based Quantization for Deep Neural Networks Weighted-Entropy-based Quantization for Deep Neural Networks Eunhyeok Park, Junwhan Ahn, and Sungjoo Yoo canusglow@gmail.com, junwhan@snu.ac.kr, sungjoo.yoo@gmail.com Seoul National University Computing

More information

Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)

Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) 2 How many bits do you need to represent a single number in machine

More information

Channel Pruning and Other Methods for Compressing CNN

Channel Pruning and Other Methods for Compressing CNN Channel Pruning and Other Methods for Compressing CNN Yihui He Xi an Jiaotong Univ. October 12, 2017 Yihui He (Xi an Jiaotong Univ.) Channel Pruning and Other Methods for Compressing CNN October 12, 2017

More information

PRUNING CONVOLUTIONAL NEURAL NETWORKS. Pavlo Molchanov Stephen Tyree Tero Karras Timo Aila Jan Kautz

PRUNING CONVOLUTIONAL NEURAL NETWORKS. Pavlo Molchanov Stephen Tyree Tero Karras Timo Aila Jan Kautz PRUNING CONVOLUTIONAL NEURAL NETWORKS Pavlo Molchanov Stephen Tyree Tero Karras Timo Aila Jan Kautz 2017 WHY WE CAN PRUNE CNNS? 2 WHY WE CAN PRUNE CNNS? Optimization failures : Some neurons are "dead":

More information

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions 2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 18) Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions Authors: S. Scardapane, S. Van Vaerenbergh,

More information

Handwritten Indic Character Recognition using Capsule Networks

Handwritten Indic Character Recognition using Capsule Networks Handwritten Indic Character Recognition using Capsule Networks Bodhisatwa Mandal,Suvam Dubey, Swarnendu Ghosh, RiteshSarkhel, Nibaran Das Dept. of CSE, Jadavpur University, Kolkata, 700032, WB, India.

More information

Layer 1. Layer L. difference between the two precisions (B A and B W ) is chosen to balance the sum in (1), as follows: B A B W = round log 2 (2)

Layer 1. Layer L. difference between the two precisions (B A and B W ) is chosen to balance the sum in (1), as follows: B A B W = round log 2 (2) AN ANALYTICAL METHOD TO DETERMINE MINIMUM PER-LAYER PRECISION OF DEEP NEURAL NETWORKS Charbel Sakr Naresh Shanbhag Dept. of Electrical Computer Engineering, University of Illinois at Urbana Champaign ABSTRACT

More information

Introduction to (Convolutional) Neural Networks

Introduction to (Convolutional) Neural Networks Introduction to (Convolutional) Neural Networks Philipp Grohs Summer School DL and Vis, Sept 2018 Syllabus 1 Motivation and Definition 2 Universal Approximation 3 Backpropagation 4 Stochastic Gradient

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Lecture 9 Numerical optimization and deep learning Niklas Wahlström Division of Systems and Control Department of Information Technology Uppsala University niklas.wahlstrom@it.uu.se

More information

A Novel Activity Detection Method

A Novel Activity Detection Method A Novel Activity Detection Method Gismy George P.G. Student, Department of ECE, Ilahia College of,muvattupuzha, Kerala, India ABSTRACT: This paper presents an approach for activity state recognition of

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

arxiv: v1 [cs.cv] 3 Feb 2017

arxiv: v1 [cs.cv] 3 Feb 2017 Deep Learning with Low Precision by Half-wave Gaussian Quantization Zhaowei Cai UC San Diego zwcai@ucsd.edu Xiaodong He Microsoft Research Redmond xiaohe@microsoft.com Jian Sun Megvii Inc. sunjian@megvii.com

More information

arxiv: v2 [cs.cv] 12 Apr 2016

arxiv: v2 [cs.cv] 12 Apr 2016 arxiv:1603.05027v2 [cs.cv] 12 Apr 2016 Identity Mappings in Deep Residual Networks Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun Microsoft Research Abstract Deep residual networks [1] have emerged

More information

Quantisation. Efficient implementation of convolutional neural networks. Philip Leong. Computer Engineering Lab The University of Sydney

Quantisation. Efficient implementation of convolutional neural networks. Philip Leong. Computer Engineering Lab The University of Sydney 1/51 Quantisation Efficient implementation of convolutional neural networks Philip Leong Computer Engineering Lab The University of Sydney July 2018 / PAPAA Workshop Australia 2/51 3/51 Outline 1 Introduction

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

An overview of deep learning methods for genomics

An overview of deep learning methods for genomics An overview of deep learning methods for genomics Matthew Ploenzke STAT115/215/BIO/BIST282 Harvard University April 19, 218 1 Snapshot 1. Brief introduction to convolutional neural networks What is deep

More information

arxiv: v4 [cs.cv] 2 Aug 2016

arxiv: v4 [cs.cv] 2 Aug 2016 arxiv:1603.05279v4 [cs.cv] 2 Aug 2016 XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi Allen Institute for AI,

More information

arxiv: v2 [cs.cv] 20 Aug 2018

arxiv: v2 [cs.cv] 20 Aug 2018 Joint Training of Low-Precision Neural Network with Quantization Interval Parameters arxiv:1808.05779v2 [cs.cv] 20 Aug 2018 Sangil Jung sang-il.jung Youngjun Kwak yjk.kwak Changyong Son cyson Jae-Joon

More information

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 Machine Learning for Signal Processing Neural Networks Continue Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 1 So what are neural networks?? Voice signal N.Net Transcription Image N.Net Text

More information

Introduction to Machine Learning (67577)

Introduction to Machine Learning (67577) Introduction to Machine Learning (67577) Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Deep Learning Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks

More information

Introduction to Neural Networks

Introduction to Neural Networks CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character

More information

Deep Learning Lab Course 2017 (Deep Learning Practical)

Deep Learning Lab Course 2017 (Deep Learning Practical) Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Deep Learning Year in Review 2016: Computer Vision Perspective

Deep Learning Year in Review 2016: Computer Vision Perspective Deep Learning Year in Review 2016: Computer Vision Perspective Alex Kalinin, PhD Candidate Bioinformatics @ UMich alxndrkalinin@gmail.com @alxndrkalinin Architectures Summary of CNN architecture development

More information

CSC321 Lecture 16: ResNets and Attention

CSC321 Lecture 16: ResNets and Attention CSC321 Lecture 16: ResNets and Attention Roger Grosse Roger Grosse CSC321 Lecture 16: ResNets and Attention 1 / 24 Overview Two topics for today: Topic 1: Deep Residual Networks (ResNets) This is the state-of-the

More information

IN recent years, convolutional neural networks (CNNs) have

IN recent years, convolutional neural networks (CNNs) have SUBMISSION TO IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Holistic CNN Compression via Low-rank Decomposition with Knowledge Transfer Shaohui Lin, Rongrong Ji, Senior Member, IEEE, Chao

More information

Neural networks COMS 4771

Neural networks COMS 4771 Neural networks COMS 4771 1. Logistic regression Logistic regression Suppose X = R d and Y = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a

More information

Correlation Autoencoder Hashing for Supervised Cross-Modal Search

Correlation Autoencoder Hashing for Supervised Cross-Modal Search Correlation Autoencoder Hashing for Supervised Cross-Modal Search Yue Cao, Mingsheng Long, Jianmin Wang, and Han Zhu School of Software Tsinghua University The Annual ACM International Conference on Multimedia

More information

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued) Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound Lecture 3: Introduction to Deep Learning (continued) Course Logistics - Update on course registrations - 6 seats left now -

More information

arxiv: v1 [cs.cv] 16 Mar 2016

arxiv: v1 [cs.cv] 16 Mar 2016 arxiv:1603.05279v1 [cs.cv] 16 Mar 2016 XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi Allen Institute for AI,

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Binary Convolutional Neural Network on RRAM

Binary Convolutional Neural Network on RRAM Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training

More information

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Hiroaki Hayashi 1,* Jayanth Koushik 1,* Graham Neubig 1 arxiv:1611.01505v3 [cs.lg] 11 Jun 2018 Abstract Adaptive

More information

Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas

Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas Center for Machine Perception Department of Cybernetics Faculty of Electrical Engineering

More information

FreezeOut: Accelerate Training by Progressively Freezing Layers

FreezeOut: Accelerate Training by Progressively Freezing Layers FreezeOut: Accelerate Training by Progressively Freezing Layers Andrew Brock, Theodore Lim, & J.M. Ritchie School of Engineering and Physical Sciences Heriot-Watt University Edinburgh, UK {ajb5, t.lim,

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information