Convolutional Neural Network Architecture

Zhisheng Zhong
February 2nd, 2018

Outline
1 Introduction of Convolution
    Motivation
    Operation
    Properties
    From FC Layer to CONV Layer
2 CNN Architecture
    The Benchmark Dataset
    Go Deeper
    Go Wider
    Information Flow
3 Summary

Outline
NN -> CNN (via the CONV layer), which then develops in three directions:
    Go deeper: AlexNet, VGGNet, GoogLeNet, ResNet
    Go wider: ResNeXt, MultiResNet, FractalNet
    Information flow: DenseNet, CliqueNet (our work)

Introduction of Convolution: Motivation
In the 1960s, scientists studied cells in the cat's visual cortex and found that each visual neuron processes only a small area of the visual image, called the receptive field.

Introduction of Convolution: Operation
We first take an input with a single channel as an example.
[Figure: an $H \times W$ input zero-padded by $P$ on each side, an $h \times w$ convolution kernel, and the $H' \times W'$ output (S = 1).]
The parameters are the input size $H, W$, the padding $P$, the stride $S$, the kernel size $h, w$, and the output size $H', W'$. They are related by
$$H' = \frac{H + 2P - h}{S} + 1, \qquad W' = \frac{W + 2P - w}{S} + 1.$$
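
The size relation above can be checked with a few lines of Python. This is a minimal sketch: the function name `conv_output_size` is made up for illustration, and the use of floor division for strides that do not divide evenly is an assumption that the slide does not state.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """Output length along one axis: (size + 2*padding - kernel) / stride + 1."""
    span = size + 2 * padding - kernel
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    # Assumption: round down when the stride does not divide the span evenly.
    return span // stride + 1


# A 28 x 28 input with a 5 x 5 kernel, no padding, stride 1.
print(conv_output_size(28, 5))
# A 3 x 3 kernel with padding 1 keeps the spatial size unchanged.
print(conv_output_size(32, 3, padding=1))
```

The same function is applied once for the height and once for the width, since the two relations have the same form.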

Introduction of Convolution: Operation
Suppose we use $X$ to represent the input matrix, $W$ the convolution kernel, and $X'$ the output matrix. The convolution operation is computed by the following formula ($\ast$ represents the convolution operation):
$$X'(m, n) = (X \ast W)(m, n) = \sum_{i=1}^{h} \sum_{j=1}^{w} X(m + i, n + j)\, w(i, j), \tag{1}$$
where $w(i, j)$ is the $(i, j)$ entry of the kernel $W$.
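
As a rough illustration of formula (1), the sketch below computes a single-channel convolution with nested Python lists. The function name and the zero-based indexing are choices made here, not taken from the slides; padding and stride follow the size relation given earlier, with floor division assumed.

```python
def conv2d_single_channel(x, kernel, padding=0, stride=1):
    """Slide a kernel over a zero-padded 2-D input, as in formula (1)."""
    H, W = len(x), len(x[0])
    h, w = len(kernel), len(kernel[0])

    # Zero-pad the input by `padding` on every side.
    padded = [[0.0] * (W + 2 * padding) for _ in range(H + 2 * padding)]
    for r in range(H):
        for c in range(W):
            padded[r + padding][c + padding] = x[r][c]

    out_h = (H + 2 * padding - h) // stride + 1
    out_w = (W + 2 * padding - w) // stride + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for m in range(out_h):
        for n in range(out_w):
            # Weighted sum over the h x w window anchored at (m, n).
            out[m][n] = sum(
                padded[m * stride + i][n * stride + j] * kernel[i][j]
                for i in range(h)
                for j in range(w)
            )
    return out


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
k = [[1, 0], [0, 1]]
y = conv2d_single_channel(x, k)
y_padded = conv2d_single_channel(x, k, padding=1)
```

With this kernel each output entry is the sum of an input entry and its lower-right neighbour, which makes the result easy to verify by hand.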

Introduction of Convolution: Operation
In general, the input is not a matrix but a 3-D tensor $X \in \mathbb{R}^{H \times W \times C_i}$, the kernel becomes a 4-D tensor $W \in \mathbb{R}^{h \times w \times C_i \times C_o}$, and the output is also a 3-D tensor $X' \in \mathbb{R}^{H' \times W' \times C_o}$.

Introduction of Convolution: Properties
In image processing, the convolution kernel is also called a filter. It is important to note that filters act as feature detectors on the original input image.

Introduction of Convolution: Properties
In signal processing, convolution has a strong connection with the Fourier transform.
Theorem (Convolution Theorem):
$$X \ast Y = \mathcal{F}^{-1}\big(\mathcal{F}(X) \odot \mathcal{F}(Y)\big), \tag{2}$$
where $\mathcal{F}$ represents the Fourier transform, $\mathcal{F}^{-1}$ the inverse Fourier transform, and $\odot$ the element-wise product. Convolution in the spatial domain is equivalent to an element-wise product in the Fourier domain.
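
The theorem can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: it uses 1-D signals, a naive $O(n^2)$ discrete Fourier transform written with the standard `cmath` module instead of a real FFT, and circular (wrap-around) convolution, which is the form for which the discrete theorem holds exactly. The function names are made up for this example.

```python
import cmath


def dft(signal, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a 1-D sequence."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def circular_convolution(x, y):
    """Direct circular convolution: out[m] = sum_t x[t] * y[(m - t) mod n]."""
    n = len(x)
    return [sum(x[t] * y[(m - t) % n] for t in range(n)) for m in range(n)]


x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, -1.0, 0.0, 2.0]

direct = circular_convolution(x, y)
# Transform both signals, multiply element-wise, transform back.
product = [a * b for a, b in zip(dft(x), dft(y))]
via_fourier = [v.real for v in dft(product, inverse=True)]
```

The two results agree up to floating-point error. In practice an FFT routine would replace the naive `dft`, which is what makes the Fourier route fast.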

Introduction of Convolution: Properties
The FFT and the IFFT each have computational complexity $O(HW \log(HW))$, whereas direct computation of the convolution requires $O(HWhw)$. For large kernels we can therefore use the FFT and IFFT to reduce the amount of computation and accelerate forward propagation.

Introduction of Convolution: From FC Layer to CONV Layer

Items     FC layer    CONV layer
Input     vector      3-D tensor
Weight    matrix      4-D tensor
Output    vector      3-D tensor

Introduction of Convolution: From FC Layer to CONV Layer
[Figure: Convert CONV layer to FC layer. An input with entries 1 to 9 is convolved with a kernel with entries a, b, c, d to give an output with entries I, J, K, L; the same output is obtained by treating the entries 1 to 9 as the input of an FC layer.]
Compared with the FC layer, the CONV layer has two properties:
    Local connectivity.
    Weight sharing.
Both the FC layer and the CONV layer are linear transformations. Due to the convolution operator, the CONV layer can preserve spatial information.
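
The two properties can be made concrete by writing a convolution as a matrix product. The sketch below assumes a 3 x 3 input, a 2 x 2 kernel, stride 1 and no padding, which matches the entry counts in the figure but is not stated on the slide. The numeric kernel values stand in for a, b, c, d, and the helper name `conv_as_matrix` is hypothetical.

```python
def conv_as_matrix(kernel, in_h, in_w):
    """Build the FC weight matrix equivalent to a stride-1, unpadded convolution."""
    h, w = len(kernel), len(kernel[0])
    out_h, out_w = in_h - h + 1, in_w - w + 1
    rows = []
    for m in range(out_h):
        for n in range(out_w):
            row = [0] * (in_h * in_w)
            for i in range(h):
                for j in range(w):
                    # Only the inputs under the kernel window get a weight.
                    row[(m + i) * in_w + (n + j)] = kernel[i][j]
            rows.append(row)
    return rows


kernel = [[10, 20], [30, 40]]          # stand-ins for a, b, c, d
x_flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # the 3 x 3 input, flattened row by row

matrix = conv_as_matrix(kernel, 3, 3)
output = [sum(wt * v for wt, v in zip(row, x_flat)) for row in matrix]

for row in matrix:
    print(row)
```

Each row has only four non-zero entries (local connectivity), and every row reuses the same four values (weight sharing). A general FC layer with the same input and output sizes would have 36 free weights instead of 4.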

Introduction of Convolution: From FC Layer to CONV Layer
Strong Representation and Predictive Power
The toy models are run on the MNIST dataset, which contains 60,000 training examples; each example is a 28 × 28 pixel image of a digit from 0 to 9.

Items                          Setting 1          Setting 2
Hidden layer 1 (H1)            CONV (5,5,1,10)    FC (784,1440)
Number of parameters of H1     250                1,128,960
Hidden layer 2 (H2)            FC (1440,10)       FC (1440,10)
Number of parameters of H2     14,400             14,400
Minimum training loss          0.0038             0.0107
Maximum test accuracy          98.32%             97.25%
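
The parameter counts in the table follow directly from the layer shapes. The sketch below reproduces them, ignoring biases as the table appears to do. The last step is an assumption: the slides do not say how the CONV output becomes a 1440-dimensional vector, and one reading that gives this number is an unpadded 5 x 5 convolution followed by 2 x 2 pooling.

```python
def conv_params(h, w, c_in, c_out):
    """Weights in a CONV layer with an h x w kernel (biases ignored)."""
    return h * w * c_in * c_out


def fc_params(n_in, n_out):
    """Weights in an FC layer (biases ignored)."""
    return n_in * n_out


h1_setting1 = conv_params(5, 5, 1, 10)   # CONV (5,5,1,10)
h1_setting2 = fc_params(784, 1440)       # FC (784,1440)
h2 = fc_params(1440, 10)                 # FC (1440,10)

# Assumption, not stated on the slide: no padding, stride 1, then 2 x 2 pooling.
conv_side = 28 - 5 + 1          # side length after the convolution
pooled_side = conv_side // 2    # side length after pooling
flattened = pooled_side * pooled_side * 10

print(h1_setting1, h1_setting2, h2, flattened)
```

The CONV layer uses several thousand times fewer weights in the first hidden layer while reaching a lower training loss and a higher test accuracy in the table.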


The Benchmark Dataset: ImageNet

The Benchmark Dataset: ImageNet difficulty

Go Deeper: Overview
Revolution of depth: ImageNet classification top-5 error (%)

Competition    Model        Depth         Top-5 error (%)
ILSVRC'10                   shallow       28.2
ILSVRC'11                   shallow       25.8
ILSVRC'12      AlexNet      8 layers      16.4
ILSVRC'13      ZFNet        8 layers      11.7
ILSVRC'14      VGG          19 layers     7.3
ILSVRC'14      GoogLeNet    22 layers     6.7
ILSVRC'15      ResNet       152 layers    3.57

Go Deeper: Vanishing gradient problem
The vanishing gradient problem is a difficulty found in training neural networks with gradient-based learning methods and backpropagation. We take backpropagation as an example. Denote an L-layer feedforward neural network by
$$X = \Sigma_0 \xrightarrow{W_1} \Sigma_1 \xrightarrow{W_2} \Sigma_2 \longrightarrow \cdots \xrightarrow{W_L} \Sigma_L = Y,$$
where
$$\Sigma_i = \sigma(\Sigma_{i-1} W_i), \quad i = 1, \ldots, L-1, \qquad \Sigma_L = g(\Sigma_{L-1} W_L).$$
$\sigma(\cdot)$: activation function, e.g., sigmoid, ReLU.
$g(\cdot)$: transform function of the last layer, e.g., identity for regression, softmax for classification.

Go Deeper: Vanishing gradient problem
Let $f(W)$ be the loss as a function of the weights $W$. After choosing a proper loss function $l$, backpropagation gives the gradients for all weights:
$$\nabla_{W_L} f = \Sigma_{L-1}^{T} \Phi_1, \qquad \Phi_1 = g'(\Sigma_{L-1} W_L) \odot l'(\Sigma_L),$$
$$\nabla_{W_{L-1}} f = \Sigma_{L-2}^{T} \Phi_2, \qquad \Phi_2 = \sigma'(\Sigma_{L-2} W_{L-1}) \odot (\Phi_1 W_L^{T}),$$
$$\nabla_{W_{L-2}} f = \Sigma_{L-3}^{T} \Phi_3, \qquad \Phi_3 = \sigma'(\Sigma_{L-3} W_{L-2}) \odot (\Phi_2 W_{L-1}^{T}),$$
where $\odot$ denotes the element-wise product.

Go Deeper: Vanishing gradient problem
When we use the sigmoid function as the activation function,
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma(x) \in (0, 1).$$
Consider the first-order derivative of the sigmoid function:
$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x)), \qquad \sigma'(x) \leq \frac{1}{4}.$$
When we use the ReLU function as the activation function,
$$\sigma'(x) = \begin{cases} 1, & \text{if } x \geq 0, \\ 0, & \text{otherwise.} \end{cases}$$
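
Each backward step multiplies the gradient by one more activation derivative, so with sigmoid the activation-derivative part of the gradient can shrink by a factor of up to 4 per layer. The sketch below looks only at that factor and ignores the weight matrices, which is a simplification made here; the function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0 with value 1/4.
max_grad = max(sigmoid_grad(i / 100.0) for i in range(-1000, 1001))
print("largest sigmoid derivative on the grid:", max_grad)

# Upper bound on the product of `depth` sigmoid derivatives.
for depth in (5, 20, 50):
    print(depth, "layers: at most", 0.25 ** depth)

# For ReLU the derivative on the active side is 1, so the same product stays at 1.
relu_bound = 1.0 ** 50
```

This is one reason ReLU-style activations are preferred in deep networks: the activation derivative alone does not force the gradient towards zero.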

Go Deeper: AlexNet
[Figure: AlexNet architecture, divided into a feature extraction part and a classification part.]

Go Deeper: VGGNet

Go Deeper: VGGNet
Main contribution: the use of only 3 × 3 filters is quite different from AlexNet's 11 × 11 filters in the first layer and ZFNet's 7 × 7 filters. This idea is widely applied by later CNN architectures such as ResNet and DenseNet.
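
One commonly cited reason for preferring small filters, which the slide itself does not spell out, is that a stack of 3 × 3 layers covers the same receptive field as one larger filter while using fewer weights. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the function names are hypothetical.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stride-1 convolutions with a square kernel."""
    # Each extra layer extends the field by (kernel - 1) pixels.
    return 1 + layers * (kernel - 1)


def stacked_params(kernel, layers, channels):
    """Weights in `layers` convolutions with `channels` inputs and outputs each."""
    return layers * kernel * kernel * channels * channels


channels = 64
for layers, big in ((2, 5), (3, 7)):
    field = stacked_receptive_field(3, layers)
    small_cost = stacked_params(3, layers, channels)
    big_cost = stacked_params(big, 1, channels)
    print(f"{layers} x (3x3): field {field}, weights {small_cost}; "
          f"one {big}x{big}: weights {big_cost}")
```

The stacked version also applies a non-linearity after every layer, so it is not only cheaper but also more expressive than the single large filter.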

Go Deeper: GoogLeNet

Go Deeper: GoogLeNet: Inception
Main contribution:
1. GoogLeNet was one of the first models to introduce the idea that CNN layers do not always have to be stacked sequentially. The Inception block contains parallel paths.
2. Auxiliary loss functions in the middle layers help the network learn more efficiently.

Go Deeper: ResNet: Overview

Go Deeper: ResNet: Motivation
[Figure: Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer plain networks. Axes: error (%) against iterations (1e4).]
The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize.

Go Deeper: ResNet: Conclusion & Solution
Conclusion: an identity mapping may be difficult to learn as networks become deeper.
Solution: design a residual block with an identity skip connection.
[Figure: residual block. The input x passes through a weight layer, a ReLU and a second weight layer to give F(x); the identity shortcut adds x, and a ReLU is applied to F(x) + x.]
$$X_{i+1} = \sigma(F(X_i, W_i) + X_i). \tag{3}$$
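
Equation (3) can be sketched in a few lines. The example below is an illustration only: it uses small dense layers in place of convolutions, takes F to be two weight layers with a ReLU in between as in the figure, and all names and numbers are invented here.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def residual_block(x, w1, w2):
    """X_{i+1} = relu(F(X_i) + X_i), with F = weight layer, ReLU, weight layer."""
    fx = matvec(w2, relu(matvec(w1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0]

# With all weights at zero, F(x) = 0 and the block passes x through unchanged.
zero = [[0.0, 0.0], [0.0, 0.0]]
identity_out = residual_block(x, zero, zero)

# With non-zero weights, the block outputs x plus a learned correction.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[0.5, 0.0], [0.0, -1.0]]
out = residual_block(x, w1, w2)
```

The zero-weight case shows why the identity mapping becomes easy to represent: the block only has to push F towards zero, at least for non-negative inputs where the final ReLU is inactive.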

Go Deeper: ResNet: Improvement
[Figure: Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. Axes: error (%) against iterations (1e4).]
In this plot, the residual networks have no extra parameters compared to their plain counterparts.

Go Deeper: ResNet: Main contribution
Importance of identity skip connection
For simplicity, we do not consider the effect of the activation function.
$$X_{i+1} = X_i + F(X_i, W_i), \tag{4}$$
$$X_{i+2} = X_{i+1} + F(X_{i+1}, W_{i+1}) = X_i + F(X_i, W_i) + F(X_{i+1}, W_{i+1}). \tag{5}$$
From the two equations, we can get the recursive form:
$$X_L = X_k + \sum_{i=k}^{L-1} F(X_i, W_i). \tag{6}$$
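
The recursive form can be verified numerically on a toy scalar network. In the sketch below the residual function `residual_fn` is a made-up stand-in for F, chosen only so that the example runs; any function of the current state and its weight would do.

```python
import math


def residual_fn(x, w):
    """Hypothetical residual function F(X_i, W_i) on scalars."""
    return w * math.tanh(x)


def forward(x0, weights):
    """Apply X_{i+1} = X_i + F(X_i, W_i) and keep every intermediate state."""
    states = [x0]
    for w in weights:
        states.append(states[-1] + residual_fn(states[-1], w))
    return states


weights = [0.3, -0.2, 0.5, 0.1]
states = forward(1.0, weights)

k, L = 1, len(weights)
lhs = states[L]
# Any deeper state equals a shallower state plus the sum of residuals in between.
rhs = states[k] + sum(residual_fn(states[i], weights[i]) for i in range(k, L))
print(lhs, rhs)
```

The check holds for every choice of k, which is what lets the signal from any shallow unit reach any deeper unit directly.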

Go Deeper: ResNet: Main contribution
Importance of identity skip connection
$$X_L = X_k + \sum_{i=k}^{L-1} F(X_i, W_i).$$
We compute the gradient at the $k$-th layer:
$$\frac{\partial l}{\partial X_k} = \frac{\partial l}{\partial X_L}\left(1 + \frac{\partial}{\partial X_k}\sum_{i=k}^{L-1} F(X_i, W_i)\right). \tag{7}$$
The additive term $\frac{\partial l}{\partial X_L}$ ensures that information is directly propagated back to any shallower layer $k$.

Go Deeper: ResNet: Main contribution
Importance of identity skip connection
Let us consider a simple modification that breaks the identity shortcut:
$$X_{i+1} = \lambda_i X_i + F(X_i, W_i), \tag{8}$$
$$X_L = \left(\prod_{i=k}^{L-1} \lambda_i\right) X_k + \sum_{i=k}^{L-1} \hat{F}(X_i, W_i), \tag{9}$$
where $\hat{F}$ denotes the residual functions with the scaling factors absorbed. The gradient at the $k$-th layer becomes:
$$\frac{\partial l}{\partial X_k} = \frac{\partial l}{\partial X_L}\left(\prod_{i=k}^{L-1} \lambda_i + \frac{\partial}{\partial X_k}\sum_{i=k}^{L-1} \hat{F}(X_i, W_i)\right). \tag{10}$$
For an extremely deep network ($L$ is large), if $\lambda_i > 1$ for all $i$, the factor $\prod_{i=k}^{L-1} \lambda_i$ can be exponentially large; if $\lambda_i < 1$ for all $i$, this factor can be exponentially small and vanish (the information flow is broken).
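
A quick calculation shows how fast the shortcut factor drifts away from 1. The sketch below simply multiplies the scaling factors; the depth of 100 and the values 0.9 and 1.1 are arbitrary choices for illustration.

```python
def shortcut_factor(lambdas):
    """Product of the shortcut scaling factors between layer k and layer L."""
    factor = 1.0
    for lam in lambdas:
        factor *= lam
    return factor


depth = 100
shrinking = shortcut_factor([0.9] * depth)   # every factor slightly below 1
growing = shortcut_factor([1.1] * depth)     # every factor slightly above 1
identity = shortcut_factor([1.0] * depth)    # the identity shortcut

print("all 0.9:", shrinking)
print("all 1.1:", growing)
print("all 1.0:", identity)
```

Only the identity shortcut keeps the factor at exactly 1 regardless of depth, so the direct term in the gradient neither vanishes nor explodes.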

Go Deeper: ResNet: Another inference (Ensemble)
$$X_3 = X_2 + f_3(X_2) \tag{11}$$
$$X_3 = X_1 + f_2(X_1) + f_3\big(X_1 + f_2(X_1)\big) \tag{12}$$
$$X_3 = X_0 + f_1(X_0) + f_2\big(X_0 + f_1(X_0)\big) + f_3\big(X_0 + f_1(X_0) + f_2(X_0 + f_1(X_0))\big)$$
If $f_1, f_2, f_3$ are linear operators:
$$X_3 = X_0 + f_1(X_0) + f_2(X_0) + f_2 f_1(X_0) + f_3(X_0) + f_3 f_1(X_0) + f_3 f_2(X_0) + f_3 f_2 f_1(X_0)$$
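
The expansion into eight terms can be checked with scalar linear operators $f_i(x) = a_i x$. This is a toy sketch; the coefficient values are arbitrary and the function names are chosen here for illustration.

```python
from itertools import combinations


def residual_forward(x0, coeffs):
    """Apply X_{i+1} = X_i + f_i(X_i) with f_i(x) = a_i * x."""
    x = x0
    for a in coeffs:
        x = x + a * x
    return x


def sum_over_paths(x0, coeffs):
    """Sum the contribution of every path; a path picks a subset of the f_i."""
    total = 0.0
    n = len(coeffs)
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            term = x0
            for idx in subset:
                term *= coeffs[idx]
            total += term
    return total


coeffs = [0.5, -0.25, 2.0]   # a_1, a_2, a_3
x0 = 3.0

layered = residual_forward(x0, coeffs)
paths = sum_over_paths(x0, coeffs)
num_paths = 2 ** len(coeffs)
print(layered, paths, num_paths)
```

With three blocks there are 2^3 = 8 paths, one for each subset of blocks that a signal passes through, which matches the eight terms of the linear expansion.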

Go Deeper: ResNet: Another inference (Ensemble)
The distribution of path lengths follows a binomial distribution. With 54 blocks in total (ResNet-110), more than 95% of the paths go through 19 to 35 modules.
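
The 95% figure can be reproduced by counting paths. The sketch below assumes that every path either passes through or skips each of the 54 blocks, so the number of paths of length k is the binomial coefficient C(54, k) out of 2^54 paths in total. The function name is made up for this example.

```python
from math import comb


def fraction_of_paths(blocks, low, high):
    """Share of all 2**blocks paths whose length lies in [low, high]."""
    total = 2 ** blocks
    in_range = sum(comb(blocks, k) for k in range(low, high + 1))
    return in_range / total


share = fraction_of_paths(54, 19, 35)
print(f"paths through 19 to 35 modules: {share:.4f}")
```

Very short and very long paths are rare because there are few ways to choose almost none or almost all of the blocks.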

Go Deeper: ResNet: Another inference (Ensemble)
The effective paths in residual networks are relatively shallow.
The gradient at the input layer mostly passes through the shallow paths.

Go Wider
Motivated by the observation that the effective paths in residual networks are relatively shallow, researchers tried making networks shallower and wider, and achieved better results.
    ResNeXt
    Multi-ResNet
    FractalNet

Information Flow: DenseNet
DenseNet is motivated by ResNet, in which identity skip connections are very important. Because of identity skip connections, the signal can be directly propagated from any unit to another, both forward and backward.
$$X_l = H_l([X_0, X_1, \ldots, X_{l-1}]) \tag{13}$$

Information Flow: DenseNet
This introduces $\frac{(L-1)L}{2}$ connections in an L-layer network.
Problems:
1. The number of connections grows quadratically with depth.
2. The number of input feature maps grows linearly with depth.
Solutions: add blocks and transition layers.
[Figure: dense blocks separated by transition layers.]
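
Both growth rates can be seen by counting. The sketch below counts one connection for every pair of layers in which the earlier one feeds the later one, and assumes every layer outputs the same number of feature maps. The parameter names `growth_rate` and `initial_channels` are introduced here and do not appear on the slides.

```python
def dense_connections(num_layers):
    """One connection for each pair (i, j) of layers with i before j."""
    return sum(1 for j in range(num_layers) for i in range(j))


def dense_input_channels(layer_index, growth_rate, initial_channels):
    """Input channels of layer l (1-based): X_0 plus l - 1 earlier outputs."""
    return initial_channels + (layer_index - 1) * growth_rate


for L in (5, 10, 20):
    print(L, "layers:", dense_connections(L), "connections")

for l in (1, 2, 5):
    print("layer", l, "sees", dense_input_channels(l, 64, 64), "input channels")
```

Doubling the depth roughly quadruples the number of connections, while the input width of the last layer only doubles. Splitting the network into blocks resets this growth at every transition layer.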

Information Flow: CliqueNet (our work)
Motivated by the success of ResNet and DenseNet, we add more identity skip connections to maximize the information flow of the networks.

Information Flow: CliqueNet (our work)
Stage-I:
$$X_i^{(1)} = \sigma\left(\sum_{0 \leq l < i} W_{li} \ast X_l^{(1)}\right) \tag{14}$$
Stage-II ($k \geq 2$):
$$X_i^{(k)} = \sigma\left(\sum_{0 < l < i} W_{li} \ast X_l^{(k)} + \sum_{i < m \leq L} W_{mi} \ast X_m^{(k-1)}\right) \tag{15}$$
[Figure: the block drawn as a loop.]
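
The two stages can be sketched with scalar features. The example below is a toy illustration, not the actual implementation: convolutions are replaced by scalar multiplication, ReLU is used for σ, and it assumes that in Stage-II the layers after the current one contribute their values from the previous stage. The dictionary layout `weights[(l, i)]` is invented for this sketch.

```python
def relu(v):
    return max(0.0, v)


def clique_block(x0, weights, num_layers, num_stages=2):
    """weights[(l, i)] is the scalar weight from layer l to layer i."""
    # Stage-I: each layer sees the input X_0 and all earlier layers.
    current = {0: x0}
    for i in range(1, num_layers + 1):
        current[i] = relu(sum(weights[(l, i)] * current[l] for l in range(0, i)))

    # Stage-II and later: each layer is refreshed from all the other layers.
    for _ in range(2, num_stages + 1):
        previous, current = current, {}
        for i in range(1, num_layers + 1):
            # Earlier layers were already updated in this stage.
            from_updated = sum(weights[(l, i)] * current[l] for l in range(1, i))
            # Later layers still hold their values from the previous stage.
            from_previous = sum(
                weights[(m, i)] * previous[m] for m in range(i + 1, num_layers + 1)
            )
            current[i] = relu(from_updated + from_previous)
    return current


weights = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 2.0, (2, 1): 0.5}
stage1 = clique_block(1.0, weights, num_layers=2, num_stages=1)
stage2 = clique_block(1.0, weights, num_layers=2, num_stages=2)
```

In Stage-II the first layer is updated from the second one through the weight `(2, 1)`, which is the backward direction that a purely feed-forward dense block does not have.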

Information Flow: CliqueNet (our work)
[Figure: CliqueNet architecture; recoverable labels: Transition, Block feature.]

Information Flow: CliqueNet (our work)
Comparison of the number of parameters (DenseNet vs. CliqueNet)
We set the number of blocks to 5. Each block contains 5 layers, and each layer produces 64 feature maps.
[Figure: the number of input channels per layer of DenseNet (left) and of CliqueNet (right).]
Under the same conditions, the ratio of the number of parameters of DenseNet to that of CliqueNet is about 0.95. When the number of blocks becomes larger, the number of parameters of CliqueNet becomes smaller than that of DenseNet.

Information Flow: CliqueNet (our work)
[Figure: Visualization of the weights in the first block of pretrained DenseNet (left) and CliqueNet (right), obtained by calculating the average absolute value of $W_{ij}$.]
From the visualization, CliqueNet's parameter efficiency is better than DenseNet's.

Information Flow: CliqueNet (our work)
Results on CIFAR-10, CIFAR-100 and SVHN

Information Flow: CliqueNet (our work)
Results on ImageNet

Information Flow: CliqueNet (our work)
Our contribution:
1. To maximize the information flow, we use a fully connected graph.
2. We are the first to propose an unfolded loop block structure.
3. We use multi-scale features as the input of the loss function, and two kinds of features are propagated in our network.

Summary
CONV layer: locally connected, weight sharing, more powerful in image feature extraction.
NN -> CNN (via the CONV layer), which then develops in three directions:
    Go deeper: AlexNet, VGGNet, GoogLeNet, ResNet
    Go wider: ResNeXt, MultiResNet, FractalNet
    Information flow: DenseNet, CliqueNet (our work)
The most important part: identity skip connection.
Design criterion: directed computation graph without loop.