A practical theory for designing very deep convolutional neural networks. Xudong Cao.

A practical theory for designing very deep convolutional neural networks Xudong Cao notcxd@gmail.com Abstract Going deep is essential for deep learning. However it is not easy, there are many ways of going deep but most of them are ineffective. In this work, we propose two novel constrains in the design of deep structure to guarantee the performance gain when going deep. Firstly, for each convolutional layer, its capacity of learning more complex patterns should be guaranteed; Secondly, the receptive field of the topmost layer should be no larger than the image region. Given these two constrains, we cast the task of designing deep convolutional neural network into a constrained optimization problem. We present an analytic optimal solution under certain conditions. 1. Introduction Going deep has demonstrated a great success in the ImageNet competition 2014. Both Google and VGG won the competition using very deep convolutional neural networks. Going deep greatly improves the learning/fitting capacity of the entire network while only increase model size and computational cost linearly. Although it has become a common sense to going deep, it is still unclear how to design a very deep convolution neural network effectively. Most times arbitrarily adding more layers does not help, or even worsen the performance. In this work, we propose a practical theory for designing very deep convolutional neural network effectively. As shown in Figure 1, we divide a convolutional neural network into two levels i.e. classifier and feature levels, we focus on designing a very deep architecture in the feature level while use a fixed simple design in the classifier level (will describe in the end of this section). We cast the design of deep convolutional neural network into a constrained optimization problem. The objective is maximizing the depth of the target convolutional neural network, subjecting to two constraints: (1) the c-value of each layer should not be too small, c-value is a metric for measuring the capacity of learning more complex patterns; (2) the receptive field of the topmost convolutional layer in the feature-level should no larger than image size. We proof Classifier Level Feature Level Small Feature Map Large Conv Kernel Identical for All Nets Our theory will provide a CNN design Figure 1. We divide a convolutional neural network into two levels, i.e. classifier level and feature level. In this work, we focus on designing very deep convolutional neural network in the feature level, while using a fixed simple classifier-level design for all networks. that there is a close form solution for this constrained optimization problem under certain conditions. The convolutional neural networks we designed achieved state-of-theart performance in many tasks, including cifar10/cifar100 classification, kaggle plankton classification and face recognition/verification 1. It is worth noting that our theory is task independent. The architecture of the designed convolutional neural network only depends on the raw image size and the fixed filter size 2. This fact could become less surprising if we consider that some recent works achieve good results on various tasks using VGGNet and GooleNet. It implies that we could design good architecture regardless the specific tasks. Of course, we believe better architectures can be designed by incorporating task specific knowledge, such as the cyclic pooling and rolling network designed by the first prize winner of the national data science bowl. Before directly dive into detailed mathematical formulation, we firstly describe the two novel constraints and the underlined intuitions in the subsequent sections, then we present the formal mathematical formulations as well as the 1 I will detail those experiments in a formal academic paper. For this tech report, I mainly describe the methods and the underlined intuitions. 2 In this work we use one kind of filter (size) in the designing of convolutional neural network for clarity and simplicity. It is possible to remove this restriction and extend the ideas to more general cases. 1

Pattern A Pattern B Pattern Response down sample stride 2 down sample stride 2 (a) Figure 2. An illustration of learning capacity. The exemplar (a) has better learning capacity comparing to the exemplar (b). The blue rectangle represent the convolutional filter/kernel. (b) 8 conv layers 3-by-3 kernel 2: 2: 16 4 conv layers 3-by-3 kernel 20: 4: 32 4 conv layers 3-by-3 kernel 40: 8: 64 receptive field size Figure 3. An illustration of receptive field size. The receptive field sizes of all convolutional layers are represented in matlab-style arrays for clarity. Fror example, 40: 8: 64 represents [40, 48, 56, 64]. The size of input image is 64-by-64. We subtract an annoying small constant 1 when calculating the receptive field sizes in order to make the description and subsequent derivation more concise. close form solution under certain conditions. The design of classifier level. The common design in classifier level is two fully connected layers with dropout on each of them. However we found this design is prone to over-fitting if the training set is not sufficiently large. We empirically found that it is better to down sample the input feature map to a small size (6x6, 7x7 or 8x8), and then apply two 5x5 convolutional layers, finally max pool the feature map into a vector and dropout this vector. As kernel size is very large relative to the feature map, herein the convolutional layers are more like fully connected layers. One perspective of understanding this design is it conduct dense sliding window test (used in VGG s work) and aggregate the final results by max pooling. This design is inspired by Network in Network and GoogleNet. The difference is that we found it is better to use large convolutional kernels relative to the feature map and replace the average pooling with max pooling. 2. Capacity of learning the first constraint The functionality of a convolutional layer is composing more complex patterns from input patterns. As shown in Figure 2(a), given the responses of pattern A and B and their spatial relationship, a convolutional layer can form/detect a more complex pattern AB. However, a convolutional layer does not always has the capacity of learning more complex patterns. In Figure 2(b), we show a simple case when a convolutional layer fail to learn a more complex pattern. In this case, the filter size is smaller than the distance between the responses of pattern A and B, in other words, it can not detect both pattern A and B as well as their spatial relationship, therefore fail to learn the more complex pattern AB. To regain the capacity of learning, we can either use larger convolutional kernel, or cut the responses distance by half via down sampling with stride 2. In this work, we only s- tudy regaining the learning capacity via down sampling. The odds of the learning failure grows, as a convolutional neural network goes deep without down sampling, because the sizes of the detected patterns and their meaningful s- patial relationships grows layer by layer. Once the spatial relationships of the majority input patterns exceed the filter size of a convolutional layer, this convolutional layer will lose its capacity of learning more complex patterns. To quantitatively measure the learning capacity of a convolutional layer we define the c-value of a convolutional layer as follows. c-value = Real Filter Size Receptive Field Size where the real filter size of a k-by-k convolutional layer is k if there is no down sampling, it doubles after each down sampling i.e. 2k after one down sampling and 4k after two down sampling etc. The receptive field size is defined as the maximum size of a neuron can see on the raw image. It grows proportionally as the convolutional neural network goes deep. Figure 3 shows how the receptive fields grows in an exemplar convolutional neural network. It is worth noting that we use the receptive field size instead of the typical size of the spatial relationships of the input patterns. Because the later is not measurable, even though it is conceptually more accurate. In the definition, we implicitly assume that the receptive field size of a convolutional layer is proportional to the typical size of the spatial relationships of the input patterns. Provided the discussions above, we are ready to present the the first constraint for designing very deep convolutional neural network: the c-value of each convolutional layer should be larger than a minimum value t. We empirically found t = 1/6 is a good lower bound of c-value for all convolutional layers in various tasks. (1)

3. Necessity of learning the second constraint As the receptive field grows, new and more complex patterns are constantly emerging. In the process, we need additional layers to learn the newly emerged patterns as the receptive field grows. However when the receptive field reach the image size i.e. neurons have seen the entire image region, it stops emerging new and more complex patterns, the driven force for adding new layers no longer exists. Empirically, we found adding many layers after the saturation of receptive field does not help the performance in general, what is worse, it increase the risk of over-fitting and hurt the performance in most cases. Base the analysis above and our empirical study, we propose the second constraint for designing very deep convolutional neural network. The receptive field size of the topmost convolutional layer should be no larger than the image size. This constraint implies that the receptive field of the topmost convolutional layer must be around the entire image region. If the receptive field of the topmost layer is much smaller the image region, we can add one more layer to improve our objective (i.e. increase the depth) without violating the constraint. From another perspective, if the receptive field is much smaller the image size, the network will lose the opportunity to see and learn the high-level patterns/features, which is suboptimal to the performance. For some tasks, there is no meaningful or discriminative patterns/features in a certain range of receptive field, thus no necessity of using many layers for learning i.e. maximizing the depth of entire network is no longer valid. We will have more discussions about this topic in the later part of this paper. 4. The mathematical formulation In this section, we formulate the design of deep architecture into a constrained optimization problem, and then we present and prove an optimal solution under certain conditions. As aforementioned, we assume the filter sizes are the same for all layers, we do not take the filter numbers into consideration, because it is task and data set dependent. 4.1. The formulation Herein we define the notations of the input parameters. They are the image size z, the filter size k and the minimum c-value t. The architecture of a deep model can determined by the total number of stages n and the number of layers in various stages {a i }. Various stages are divided by a down sampling (stride 2). For example, n = 3 and a 1, a 2, a 3 = 4, 3, 2 represent a model with 3 stages, the number of layers in the first, second and the third stage are 4, 3, 2 respectively. In between are down sampling with stride 2. The goal of going deep essentially is to maximize the total number of layers i.e. i a i, given the two constrains proposed in the previous two sections. The first constrain requires the c-values of all layers are no smaller than the minimum c-value t. As the receptive field keep growing, and the real filter size in one stage stay the same, the c-value of the last layer in one stage is the smallest. Therefore the first constrain is equivalent to ensuring the c-value of the last layer in each stage no smaller than the minimum c-value t, which can be translated into a set of inequations, 2 l k l 2i 1 (k 1)a i t, where l = 1, 2,..., n (2) where 2 l k is the real filter size at l-th stage, l 2i 1 (k 1)a i is the receptive field 3 of the last layer at l-th stage, t is the minimum c-value, we set t = 1/6 for all tasks. The second constrain requires the receptive field of the topmost of convolutional layer is no larger than the entire image region. It can be formally represented as follows. 2 i 1 (k 1)a i z (3) i where the left term is the receptive field of the topmost convolutional layer, 2 i 1 (k 1) is the receptive field increment of a layer at i-th stage, 2 i 1 (k 1)a i is the total receptive field increment at i-th stage. The objective function of our formulation can be formally represented by maximizing the total number of layers, subjecting to the two constraints in Equations (2) and (3). With slightly transforming the two constraints, we achieve our final formulation as follows. i max n,{a i} n a i (4) i a i 2 i 1 2l k where l = 1, 2,..., n (5) t(k 1) a i 2 i 1 z (6) k 1 where both n and {a i } are integers. We simplify the forms of the first and second constrains by equivalent transforms to facilitate the later discussions. 3 We omit the small constant 1 in calculating the receptive field sizes to simplify the description and subsequent derivation.

net without augment net with augment 32x32 input image 24x24 input image conv11, 5x5, 32x32, 128 conv11, 5x5, 24x24, 192 conv12, 5x5, 32x32, 128 conv12, 5x5, 24x24, 192 conv13, 5x5, 32x32, 128 conv13, 5x5, 24x24, 192 conv14, 5x5, 32x32, 128 conv14, 5x5, 24x24, 192 conv15, 5x5, 32x32, 128 conv15, 5x5, 24x24, 192 conv16, 5x5, 32x32, 128 conv16, 5x5, 24x24, 192 max pool stride 2 conv21, 5x5, 16x16, 128 max pool stride 2 max pool stride 4 conv31, 5x5, 8x8, 128 conv21, 5x5, 6x6, 192 conv32, 5x5, 8x8, 64 conv22, 5x5, 6x6, 64 max pool stride 8 max pool stride 6 Table 1. Networks designed using 5x5 filter size for cifar10 and cifar100 data. The description conv15, 5x5, 32x32, 128, from left to right, represents layer name, filter size, feature map size and filter number. 4.2. The optimal solution under certain conditions In this section, we show that the optimal solution of our formulation can be found under certain conditions. Assume that the image size z = 2 m 1 k/t, and the layer numbers {a i } are relaxed from integers to positive real numbers, we can show that the optimal solution to our objective function is that a 1 = n = m (7) k (k 1)t, a 2 =... = a n = 1/2a 1 (8) Due to the limit of time, we just roughly sketch the proof in Section 6. Discussion about the optimal solution. Although this optimal solution is obtained under certain conditions, it provides great insights about how to design effective deep architecture under general conditions. First, it guides how many times of down samplings should we choose given the input parameters. Second, it shows the number of layers should be as evenly distributed as possible in all stages, except the first stage. Third, it shows the maximum depth could be achieved by various filter size, based on which we can make better tradeoff between various filter sizes. net without augment net with augment 32x32 input image 24x24 input image conv11, 3x3, 32x32, 192 conv11, 3x3, 24x24, 256 conv12, 3x3, 32x32, 192 conv12, 3x3, 24x24, 256 conv13, 3x3, 32x32, 192 conv13, 3x3, 24x24, 256 conv14, 3x3, 32x32, 192 conv14, 3x3, 24x24, 256 conv15, 3x3, 32x32, 192 conv15, 3x3, 24x24, 256 conv16, 3x3, 32x32, 192 conv16, 3x3, 24x24, 256 conv17, 3x3, 32x32, 192 conv17, 3x3, 24x24, 256 conv18, 3x3, 32x32, 192 conv18, 3x3, 24x24, 256 max pool stride 2 max pool stride 2 conv21, 3x3, 16x16, 192 conv21, 3x3, 16x16, 256 conv22, 3x3, 16x16, 192 conv22, 3x3, 16x16, 256 conv23, 3x3, 16x16, 192 conv24, 3x3, 16x16, 192 max pool stride 2 max pool stride 4 conv31, 5x5, 8x8, 128 conv31, 5x5, 6x6, 192 conv32, 5x5, 8x8, 64 conv32, 5x5, 6x6, 64 max pool stride 8 max pool stride 6 Table 2. Networks designed using 3x3 filter size for cifar10 and cifar100 data. 5. Some exemplar networks Herein we show some networks designed according to our theory. 5.1. Networks for cifar10/cifar100 CIFAR10 and CIFAR100 are proposed by Alex. It contains 60,000 tiny color images with 32x32 size. Without data augmentation, the image size feed into the convolutional neural network is 32x32. With Alex s data augmentation, the image size feed into convolutional neural network is 24x24. As our design depends on the input image size and the filter size, we present four networks in Table (2) and (1). We achieve the best performance on cifar10 and cifar100 using the designed networks. On cifar10, our best result is 7.44%/9.02% (7.44% with data augmentation and 9.02% without data augmentation), better than Deeply Supervised Network (7.97%/9.69%), Network in Network (8.81%/10.41%), Maxout Network (9.32%/11.68%). On cifar100, our best result is 33.03% without data augmentation, better than Deeply Supervised Network (34.57%), Network in Network (35.68%) and Maxout Network (38.57%). It is worth noting that fractional max pooling proposed by Dr. Ben achieves much better results using more sophisticated data augmentation. Due to different data augmentation and testing settings, herein we did not directly compare with his work.

image size 48x48 image size 112x112 image size 128x128 image size 144x144 filter size 5x5 filter size 2x2 filter size 3x3 filter size 4x4 conv01, 112x112, 5x5, 16 conv01, 128x128, 5x5, 16 conv01, 144x144, 5x5, 16 max pool stride 2 max pool stride 2 max pool stride 2 conv11, 48x48, 5x5, 16 conv11, 56x56, 2x2, 32 conv11, 64x64, 3x3, 32 conv11, 72x72, 4x4, 32 conv12, 48x48, 5x5, 16 conv12, 56x56, 2x2, 32 conv12, 64x64, 3x3, 32 conv12, 72x72, 4x4, 32 conv13, 48x48, 5x5, 32 conv13, 56x56, 2x2, 32 conv13, 64x64, 3x3, 64 conv13, 72x72, 4x4, 64 conv14, 48x48, 5x5, 32 conv14, 56x56, 2x2, 32 conv14, 64x64, 3x3, 64 conv14, 72x72, 4x4, 64 conv15, 48x48, 5x5, 128 conv15, 56x56, 2x2, 64 conv15, 64x64, 3x3, 128 conv15, 72x72, 4x4, 128 conv16, 48x48, 5x5, 128 conv16, 56x56, 2x2, 64 conv16, 64x64, 3x3, 128 conv16, 72x72, 4x4, 128 conv17, 56x56, 2x2, 64 conv17, 64x64, 3x3, 192 conv17, 72x72, 4x4, 128 conv18, 56x56, 2x2, 64 conv18, 64x64, 3x3, 192 conv18, 72x72, 4x4, 128 conv19, 56x56, 2x2, 192 conv110, 56x56, 2x2, 192 conv111, 56x56, 2x2, 192 conv112, 56x56, 2x2, 192 max pool stride 2 max pool stride 2 max pool stride 2 max pool stride 2 conv21, 24x24, 5x5, 256 conv21, 28x28, 2x2, 384 conv21, 32x32, 3x3, 256 conv21, 36x36, 4x4, 256 conv22, 24x24, 5x5, 256 conv22, 28x28, 2x2, 384 conv22, 32x32, 3x3, 256 conv22, 36x36, 4x4, 256 conv23, 24x24, 5x5, 256 conv23, 28x28, 2x2, 384 conv23, 32x32, 3x3, 256 conv23, 36x36, 4x4, 256 conv24, 28x28, 2x2, 384 conv24, 32x32, 3x3, 256 conv24, 36x36, 4x4, 256 conv25, 28x28, 2x2, 384 conv26, 28x28, 2x2, 384 max pool stride 2 max pool stride 2 max pool stride 2 conv31, 14x14, 2x2, 768 conv31, 16x16, 3x3, 512 conv31, 18x18, 4x4, 512 conv32, 14x14, 2x2, 768 conv32, 16x16, 3x3, 512 conv32, 18x18, 4x4, 512 conv33, 14x14, 2x2, 768 conv33, 16x16, 3x3, 512 conv34, 14x14, 2x2, 768 conv34, 16x16, 3x3, 512 conv35, 14x14, 2x2, 768 conv36, 14x14, 2x2, 768 max pool stride 2 conv41, 7x7, 2x2, 1024 conv42, 7x7, 2x2, 1024 max pool stride 4 max pool stride 2 max pool stride 2 max pool stride 3 conv31, 6x6, 5x5, 256 conv51, 3x3, 3x3, 768 conv41, 8x8, 5x5, 512 conv41, 6x6, 5x5, 512 conv32, 6x6, 5x5, 256 conv52, 3x3, 3x3, 768 conv42, 8x8, 5x5, 512 conv42, 6x6, 5x5, 512 Table 3. Networks for national data science bowl. 5.2. Networks for national data science bowl National data science bowl is a kaggle competition. The task is classifying plankton images into 121 pre-defined classes. There are around 30k training data and 130k testing data. To make the designed networks complementary to each other, we design our networks with various input image sizes and filter sizes. In Table (3), we show 4 representative networks designed following our theory. Among them, the third net (image size 128, filter size 3) and the forth net (image size 144, filter size 4) achieved 0.606 and 0.609 log loss score on the public leader board. Our final ensemble consists of the four models in Table (3), a VGG-like model and 2 variants of our designs. The 7-model ensemble achieved 0.582 public leader board score. Combining with Bings score, we final achieved 0.574 score on the public leader board.

6. The sketch of the proof First, the solution satisfies the constrains in the inequations 5 and 6. We can easily verify this by plugging the solution into the inequations. We found that the solution pushes the left term of those inequations exactly equal to the right term. In other words, this solution is at the boundary of the set of feasible solutions. Second, assume there is a solution l and {b 1,..., b l }, and l m. By linearly composing the inequations in 5 and 6, we can always show that m b i (l + 1)C/2 = k where C = (k 1)t. This inequations implies there is no better solution when l m. Third, assume there is a solution l and {b 1,..., b l }, and l m. By linearly composing the inequations in 5 and 6, we can have that m b i + 2 i=m+1 a i b i (l + 1)C/2 By reorganize the the left term, we have b i + i=m+1 b i (l + 1)C/2 = m From the inequations above we tell that there is no better solution when l > m. References a i