CSE 591: Introduction to Deep Learning in Visual Computing - Parag S. Chandakkar - Instructors: Dr. Baoxin Li and Ragav Venkatesan
Overview: Background; Why another network structure?; Vanishing and exploding gradients; Deep Residual Networks; Identity Mappings: a way to make them deeper; Extensions; Future Scope
Background Three famous network architectures are shown. Do we yet have a way to make networks arbitrarily deeper? For example, a module that can be repeated to increase the network depth. Courtesy: Kaiming He, Deep learning gets way deeper, ICML 2016 tutorial
Background Just appending layers to increase depth leads to an increase in the training error! Theoretically, the training error of a 56-layer network should be less than or equal to that of its 20-layer counterpart. This is not overfitting, nor can it be fully attributed to the vanishing/exploding gradient problem. Courtesy: He et al., Deep Residual Learning for Image Recognition, CVPR 2016
Background There exists a naïve solution, by construction, for building arbitrarily deep networks that are at least as good as their shallower counterparts, and our solver/optimizer should be able to find it: copy all the layers from the learned shallower network into the deeper network, and let the remaining layers of the deeper network do nothing but an identity mapping. Do our best solvers ever find this solution in a reasonably deep network?
Background Solvers are not able to find a solution as simple as identity mappings in deeper networks. This is the degradation problem in deep neural networks. It is not overfitting, and it is only partially caused by vanishing/exploding gradients.
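To make the construction from the previous slide concrete, here is a minimal PyTorch sketch (illustrative only, not from the slides; the layer sizes and the single extra layer are arbitrary assumptions). The shallow network's layers are copied into a deeper network whose extra layer is initialized to the identity, so the deeper network computes exactly the same function and its training error can be no worse.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# A "learned" shallow network (weights here are just random for illustration).
shallow = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# An extra layer initialized to the identity mapping.
extra = nn.Linear(8, 8)
with torch.no_grad():
    extra.weight.copy_(torch.eye(8))
    extra.bias.zero_()

# Copy the shallow layers, then append the identity layer.
deeper = nn.Sequential(*shallow, extra)

x = torch.randn(4, 8)
print(torch.allclose(shallow(x), deeper(x)))  # True: the deeper net matches by construction
```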
Background Assume linear activations, a single-layer neural network with cost function $C$, independently initialized weights, and input features with the same variance. Then $Y = W_1 X_1 + W_2 X_2 + \cdots + W_{n_{\text{in}}} X_{n_{\text{in}}}$, so in the forward pass $\mathrm{Var}[Y] = n_{\text{in}}\,\mathrm{Var}[W_i]\,\mathrm{Var}[X_i]$, and in the backward pass $\mathrm{Var}\!\left[\frac{\partial C}{\partial X_i}\right] = n_{\text{out}}\,\mathrm{Var}[W_i]\,\mathrm{Var}\!\left[\frac{\partial C}{\partial Y_i}\right]$. Courtesy: Kaiming He, Deep learning gets way deeper, ICML 2016 Reading: Xavier Glorot, Yoshua Bengio, Understanding the difficulty of training deep feedforward neural networks, AISTATS 2010.
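As a quick sanity check of the forward-variance relation above, here is a minimal NumPy sketch (the values of n_in, Var[W] and Var[X] are arbitrary assumptions); stacking many such layers multiplies these factors, which is why poorly scaled weights make signals vanish or explode.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, batch = 512, 100_000

# Independently initialized, zero-mean weights and inputs.
var_w, var_x = 0.01, 1.0
W = rng.normal(0.0, np.sqrt(var_w), size=n_in)
X = rng.normal(0.0, np.sqrt(var_x), size=(batch, n_in))

Y = X @ W                      # Y = W_1 X_1 + ... + W_{n_in} X_{n_in}
print(Y.var())                 # empirical Var[Y], roughly 5.12
print(n_in * var_w * var_x)    # predicted n_in * Var[W_i] * Var[X_i] = 5.12
```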
Deep Residual Networks The key is to define a residual block that can be stacked to create a network of arbitrary depth. [Block diagram: input $X$ passes through $W_1$, BN, ReLU, $W_2$, BN to give $F(X)$; the shortcut adds $X$, followed by ReLU, so $F(X) + X = H(X)$.] With a plain net, we hope to discover the underlying mapping $H(X)$; with a residual net, we hope to discover only the residual mapping $F(X)$. Reading: He et al., Deep Residual Learning for Image Recognition, CVPR 2016.
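A minimal PyTorch sketch of such a residual block, assuming 3x3 convolutions for the weight layers $W_1$ and $W_2$ and a fixed channel count (both are illustrative assumptions, not prescribed by the slides):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Post-activation residual block: H(x) = f(F(x) + x), with f = ReLU."""
    def __init__(self, channels):
        super().__init__()
        # F(x): W1 -> BN -> ReLU -> W2 -> BN
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(residual + x)   # identity shortcut, then f = ReLU

# Blocks with matching input/output shapes can be stacked to arbitrary depth:
net = nn.Sequential(*[ResidualBlock(64) for _ in range(20)])
```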
Deep Residual Networks [Block diagram: $x_l$ passes through $W_1$, BN, ReLU, $W_2$, BN to give $F(x_l, W_l)$; the shortcut adds $x_l$, followed by ReLU.] $y_l = x_l + F(x_l, W_l)$ and $x_{l+1} = f(y_l) = f(x_l + F(x_l, W_l))$, where $f$ is the ReLU after the addition. There is an almost uninterrupted flow of gradients from any layer to the input layer, and the block only has to find the residual mapping, which complements the identity mapping. Reading: He et al., Deep Residual Learning for Image Recognition, CVPR 2016.
Deep Residual Networks With $y_l = x_l + F(x_l, W_l)$ and $x_{l+1} = f(y_l) = f(x_l + F(x_l, W_l))$, recursively $x_L = f(\cdots f(f(x_0 + F(x_0, W_0)) + F(x_1, W_1)) \cdots)$, with $f$ applied $L-1$ times. Then $\frac{\partial C}{\partial x_l} = \frac{\partial C}{\partial x_L}\,\frac{\partial x_L}{\partial x_l} = \frac{\partial C}{\partial x_L} \cdot \,?$ (What is the issue? Because $f$ is not an identity mapping, the factor $\frac{\partial x_L}{\partial x_l}$ does not simplify into an identity term plus a residual term.) Reading: He et al., Deep Residual Learning for Image Recognition, CVPR 2016.
Identity Mappings in Residual Networks With $y_l = x_l + F(x_l, W_l)$ and $x_{l+1} = f(y_l) = f(x_l + F(x_l, W_l))$, the adverse effects of this phenomenon can be seen in ultra-deep networks of 1000+ layers. Solution: make $f$ an identity mapping too. Options: just remove the last non-linearity, OR change the order of BN, ReLU and W. Reading: He et al., Identity Mappings in Deep Residual Networks, ECCV 2016.
Identity Mappings in Residual Networks Solution: make $f$ an identity mapping too. [Pre-activation block diagram: $x_l$ passes through BN, ReLU, $W_1$, BN, ReLU, $W_2$ to give $F(x_l, W_l)$; the shortcut adds $x_l$ with nothing after the addition.] Now $x_{l+1} = x_l + F(x_l, W_l)$. Options: just remove the last non-linearity, OR change the order of W, BN and ReLU. Unrolling the recursion gives $x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)$, so $\frac{\partial C}{\partial x_l} = \frac{\partial C}{\partial x_L}\,\frac{\partial x_L}{\partial x_l} = \frac{\partial C}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right)$. Now the gradient flows more smoothly from a deeper layer $L$ to a shallower layer $l$. This allows us to construct ultra-deep networks of 1000+ layers. Reading: He et al., Identity Mappings in Deep Residual Networks, ECCV 2016.
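A minimal PyTorch sketch of the re-ordered (pre-activation) block, again assuming 3x3 convolutions and a fixed channel count as illustrative choices; note that nothing follows the addition, so $f$ is the identity:

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation block: x_{l+1} = x_l + F(x_l), with BN/ReLU moved before the weights."""
    def __init__(self, channels):
        super().__init__()
        # F(x): BN -> ReLU -> W1 -> BN -> ReLU -> W2
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.conv2(self.relu(self.bn2(self.conv1(self.relu(self.bn1(x))))))
        return x + residual   # no non-linearity after the addition: f is the identity
```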
Extensions Deep networks with stochastic depth [1]: only a randomly chosen subset of the layers is executed during the training of a mini-batch. This allows us to train networks of even greater depth (see the sketch after this slide). Residual networks behave like ensembles of relatively shallow networks [2]: this paper claims that a residual network is nothing but an ensemble of multiple shallow networks. It provides a completely new perspective on the success of residual networks and how they avoid the vanishing gradient problem. An intriguing paper! Reading: [1] Huang, Gao, et al. "Deep networks with stochastic depth." ECCV, 2016. [2] Veit, Andreas, et al. "Residual networks behave like ensembles of relatively shallow networks." NIPS, 2016.
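To illustrate the stochastic-depth idea from [1], here is a minimal sketch assuming a fixed survival probability per block (the paper uses a linearly decaying schedule across depth); the wrapper class and its parameters are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block that is randomly skipped during training (stochastic depth)."""
    def __init__(self, residual_branch: nn.Module, survival_prob: float = 0.8):
        super().__init__()
        self.residual_branch = residual_branch   # F(x): any residual function
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            # With probability (1 - p), skip the block entirely: only the identity runs.
            if torch.rand(1).item() > self.survival_prob:
                return x
            return x + self.residual_branch(x)
        # At test time every block runs; F(x) is scaled by its survival probability.
        return x + self.survival_prob * self.residual_branch(x)
```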
Future Work Analysis of depth versus error rate: does the error rate keep decreasing as we increase depth? Yes, we can build a 1200-layer neural network now, but is it feasible to run inference with such a huge network? As always, is there anything we can do to increase the accuracy further?