arxiv: v2 [stat.ml] 18 Jun 2017

Similar documents
FreezeOut: Accelerate Training by Progressively Freezing Layers

SHAKE-SHAKE REGULARIZATION OF 3-BRANCH

arxiv: v2 [cs.lg] 23 May 2017

Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE-

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

arxiv: v4 [cs.cv] 6 Sep 2017

Swapout: Learning an ensemble of deep architectures

WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY,

Beyond finite layer neural network

CondenseNet: An Efficient DenseNet using Learned Group Convolutions

arxiv: v2 [cs.cv] 8 Mar 2018

arxiv: v1 [cs.lg] 10 Aug 2018

Deep Residual. Variations

CondenseNet: An Efficient DenseNet using Learned Group Convolutions

CONVOLUTIONAL neural networks [18] have contributed

Convolutional Neural Network Architecture

Improving training of deep neural networks via Singular Value Bounding

Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error

Introduction to Deep Neural Networks

Regularizing Deep Networks Using Efficient Layerwise Adversarial Training

Normalization Techniques in Training of Deep Neural Networks

Theories of Deep Learning

P-TELU : Parametric Tan Hyperbolic Linear Unit Activation for Deep Neural Networks

Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas

Fast Learning with Noise in Deep Neural Nets

arxiv: v1 [cs.cv] 18 May 2018

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation

ANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS

Learning Deep Architectures

Day 3 Lecture 3. Optimizing deep networks

arxiv: v3 [cs.lg] 4 Jun 2016 Abstract

Differentiable Fine-grained Quantization for Deep Neural Network Compression

Denoising Autoencoders

arxiv: v2 [cs.cv] 18 Jul 2017

arxiv: v1 [stat.ml] 2 Sep 2014

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)

Machine Learning Lecture 14

arxiv: v1 [cs.lg] 16 Jun 2017

FIXING WEIGHT DECAY REGULARIZATION IN ADAM

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung

Deep Learning Year in Review 2016: Computer Vision Perspective

Deep Learning & Neural Networks Lecture 4

Deep Learning & Artificial Intelligence WS 2018/2019

Negative Momentum for Improved Game Dynamics

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

arxiv: v1 [cs.lg] 25 Sep 2018

arxiv: v2 [cs.cv] 12 Apr 2016

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems

arxiv: v1 [cs.cv] 11 May 2015 Abstract

Optimization Algorithm Inspired Deep Neural Network Structure Design

Hypernetwork-based Implicit Posterior Estimation and Model Averaging of Convolutional Neural Networks

Deep Feedforward Networks

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Advanced Training Techniques. Prajit Ramachandran

Optimizing CNNs. Timothy Dozat Stanford. Abstract. 1. Introduction. 2. Background Momentum

Deeply-Supervised Nets

Deep Learning Lecture 2

Local Affine Approximators for Improving Knowledge Transfer

Maxout Networks. Hien Quoc Dang

DEEP neural networks with a huge number of learnable

Encoder Based Lifelong Learning - Supplementary materials

Normalization Techniques

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning

Appendix A.1 Derivation of Nesterov s Accelerated Gradient as a Momentum Method

CSE 591: Introduction to Deep Learning in Visual Computing. - Parag S. Chandakkar - Instructors: Dr. Baoxin Li and Ragav Venkatesan

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

arxiv: v2 [cs.lg] 9 Jun 2017

The K-FAC method for neural network optimization

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba

Fenchel Lifted Networks: A Lagrange Relaxation of Neural Network Training

Training Neural Networks with Implicit Variance

Lecture 7 Convolutional Neural Networks

Trusting SVM for Piecewise Linear CNNs

Deep Learning with Low Precision by Half-wave Gaussian Quantization

arxiv: v1 [cs.lg] 29 Dec 2018

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

arxiv: v1 [stat.ml] 27 Feb 2018

TUTORIAL PART 1 Unsupervised Learning

arxiv: v2 [cs.lg] 13 Sep 2017

Greedy Layer-Wise Training of Deep Networks

Abstention Protocol for Accuracy and Speed

arxiv: v2 [cs.lg] 14 Feb 2018

arxiv: v1 [cs.cv] 11 Sep 2018

Reading Group on Deep Learning Session 1

IMPROVING STOCHASTIC GRADIENT DESCENT

Gaussian Cardinality Restricted Boltzmann Machines

From CDF to PDF A Density Estimation Method for High Dimensional Data

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)

Machine Learning Techniques

Introduction to Convolutional Neural Networks (CNNs)

BRIDGING NONLINEARITIES

Adversarial Image Perturbation for Privacy Protection A Game Theory Perspective Supplementary Materials

arxiv: v3 [cs.lg] 11 Nov 2018

Deep Belief Networks are compact universal approximators

Lecture 14: Deep Generative Learning

arxiv: v1 [cs.lg] 18 Sep 2018

Lecture 16 Deep Neural Generative Models

Transcription:

FREEZEOUT: ACCELERATE TRAINING BY PROGRES- SIVELY FREEZING LAYERS Andrew Brock, Theodore Lim, & J.M. Ritchie School of Engineering and Physical Sciences Heriot-Watt University Edinburgh, UK {ajb5, t.lim, j.m.ritchie}@hw.ac.uk Nick Weston Renishaw plc Research Ave, North Edinburgh, UK Nick.Weston@renishaw.com arxiv:176.4983v2 [stat.ml] 18 Jun 217 ABSTRACT The early layers of a deep neural net have the fewest parameters, but take up the most computation. In this extended abstract, we propose to only train the hidden layers for a set portion of the training run, freezing them out one-by-one and excluding them from the backward pass. Through experiments on CIFAR, we empirically demonstrate that FreezeOut yields savings of up to 2% wall-clock time during training with 3% loss in accuracy for DenseNets, a 2% speedup without loss of accuracy for ResNets, and no improvement for VGG networks. Our code is publicly available at https://github.com/ajbrock/freezeout 1 INTRODUCTION Layer-wise pre-training of neural nets (Bengio et al., 27) (Vincent et al.) has largely been replaced by careful initialization strategies and fully end-to-end training. Successful techniques such as DropOut (Srivastava et al., 212) and Stochastic Depth (Huang et al.), however, suggest that it is not necessary to have every unit in a network participate in the training process at every training step. Stochastic Depth leverages this to both regularize and reduce computational costs by dropping whole layers at a time (though it requires the use of residual connections (He et al.)), while DropOut s use of masks does not by default result in a computational speedup. In this work, we are concerned with reducing the time required for training a network by only training each layer for a set portion of the training schedule, progressively "freezing out" layers and excluding them from the backward pass. This technique is motivated by the observation that in many deep architectures, the early layers take up most of the budget, but have the fewest parameters and tend to converge to fairly simple configurations (e.g. edge detectors), suggesting that they do not require as much fine tuning as the later layers, where most of the parameters reside. This work is also motivated by a (wholly unverified) hypothesis that the success of SGDR (Loshchilov & Hutter, 217) is partially due to certain layers achieving near-optimality after every annealing-step before restart. We suspect that this results in faster convergence due to a reduction in internal covariate shift (Ioffe & Szegedy): if a unit s gradient is near zero, the unit will remain near its local optimum even when the learning rate is kicked back up on warm restart. 2 FREEZEOUT We propose a simple modification to the standard backprop+sgd pipeline to reduce training time. FreezeOut employs cosine annealing (as proposed by (Loshchilov & Hutter, 217)) without restarts (as used by (Gastaldi, 217)) with a layer-wise schedule, where the first layer s learning rate is reduced to zero partway through training (at t ), and each subsequent layer s learning rate is annealed to zero some set time thereafter. Once a layer s learning rate reaches zero, we put it in inference mode and exclude it from all future backward passes, resulting in an immediate per-iteration speedup proportional to the computational cost of the layer. In the simplest version of FreezeOut, each layer L i starts with a single fixed learning rate α, which anneals to zero at t i, where t i is linearly spaced between a user-selected t and the total number of 1

1 2 9 8 7 6 5 4 3 18 16 14 12 1 8 6 2 4 1 2 (a) Unscaled Linear Schedule (b) Scaled Linear Schedule 1 8 9 8 7 6 5 4 3 2 7 6 5 4 3 2 1 1 (c) Unscaled Cubic Schedule (d) Scaled Cubic Schedule Figure 1: Per-Layer Learning Rate Schedules for a 5-hidden-layer network with t =.5. iterations, as shown in Figure 1(a). Each layer s learning rate at iteration t is thus given as α i (t) =.5 α i ()(1 + cos(πt/t i )) (1) We experiment with varying two aspects of this strategy. First, we consider scaling the initial layerwise learning rate to be α i () = α/t i, where α is the base learning rate (and the learning rate of the final layer). This scaling, shown in Figure 1 (b), causes each layer s learning curve to integrate to the same value, meaning that each layer travels the same distance in the weight space (modulo its observed gradients and weight dimension) despite the reduced number of steps taken. Second, we vary the strategy for the how t relates to the t i s of the remaining layers. The simple version we explore takes the t i values determined by the linear scheduling rule, and cubes them with t i(cubed) = t 3 i(linear), as shown in Figure 1 (c) and (d). This gives more priority (in terms of training time) to later layers than does the linear schedule. As with the linear schedule, we consider an "unscaled" variant where the α i values are identical, and a "scaled" variant where the α i values are scaled based on the cubed t i values. Throughout this paper, we refer to t i s with respect to their uncubed values, such that such that a user selected t =.5 results in a cubed t =.125. FreezeOut adds two user decisions the choice of t and the choice of FreezeOut strategy to the standard choices of initial learning rate and number of training iterations. In the following section, we empirically investigate the relative merit of each of the four strategies, and provide recommendations for a default configuration, reducing the need for user-tuning of the hyperparameters. FreezeOut is easy to implement in a dynamic graph framework, requiring approximately 15 unique lines of code in PyTorch (Paszke et al., 217). 2

Figure 2: FreezeOut results for k=12, L=76 DenseNets on CIFAR-1 for 1 epochs. Shaded areas represent one standard deviation from the mean across 2-5 training runs. 3 EXPERIMENTS We modify publicly available implementations of DenseNets (Huang et al., 217) 1, Wide ResNets (Zagoruyko & Komodakis, 216) 2, and VGG (Simonyan & Zisserman) and test each of the four scheduling strategies across a wide range of t values on CIFAR-1 and CIFAR-1. Our PyTorch (Paszke et al., 217) code is publicly available. 3 3.1 DENSENET EXPERIMENTS Unless otherwise noted, all DenseNet experiments were performed using DenseNet-BC models with a growth rate of 12 and a depth of 76, and we swap the order of BatchNorm (Ioffe & Szegedy) and ReLU from the original implementation (such that we perform ReLU-BatchNorm-Convolution). We make use of standard data augmentation, train using SGD with Nesterov Momentum (Sutskever et al.) and report results on the test set after training is completed. We compare achieved speedups (in terms of observed wall clock time relative to a non-freezeout baseline) to performance on the test set, and are primarily interested in the accuracy reduction incurred for a given reduction in training time. For the speedups presented here, we use the values attained by running each test sweep in a controlled environment, on a GTX18Ti with no other programs running. The actual speedups we observe in practice are slightly different, as we run the experiments on shared servers, with various types of GPUs, and with many other programs also running. We found that a quick back-of the envelope calculation based on the reduction in operations accurately estimates the obtained speedups. Given c i, the computational cost for the forward pass through each convolutional layer, and noting that the cost of a full forward-backward pass through that layer is 2c i, we calculate the baseline computational cost as the sum of each layer s cost: 1 https://github.com/bamos/densenet.pytorch 2 https://github.com/xternalz/wideresnet-pytorch 3 https://github.com/ajbrock/freezeout C = Σ(2c i n itr ) (2) 3

Figure 3: FreezeOut results for k=12, L=76 DenseNets on CIFAR-1 for 1 epochs. with n itr the total number of training iterations. The compute cost for training with FreezeOut is: C f = Σ((1 + t i ) c i n itr ) (3) and the approximate relative speedup is simply the ratio 1 C f /C. We found that this accurately estimates speedup for cubic scheduling, but underestimates the speedups (e.g. the actual speedup is greater than the calculation gives) for linear scheduling; multiplying the predicted speedup by a correction factor of 1.3 resolves this almost entirely. This estimate neglects computational overhead and the cost of other operations (nonlinearities and BatchNorm), but this seems to be offset by the fact that a frozen layer s batchnorm costs are reduced (no longer having to calculate means and variances), and we find that this is generally a reliable estimate of the achieved speedups for a given setting. Our most-investigated setup is shown in Figure 2, where we train on CIFAR-1 for 1 epochs and repeat each experimental setting 2 to 5 times. We also train for a single pass on CIFAR-1 for 1 epochs (Figure 3). We observe a clear speedup versus accuracy tradeoff. For every strategy, we observe a speedup of up to 2%, with a maximum relative 3% increase in test error. Lower speedup levels perform better and occasionally outperform the baseline, though given the inherent level of non-determinism in training a network, we consider this margin insignificant. Whether this tradeoff is acceptable is up to the user. If one is prototyping many different designs and simply wants to observe how they rank relative to one another, then employing higher levels of FreezeOut may be tenable. If, however, one has set one s network design and hyperparameters and simply wants to maximize performance on a test set, then a reduction in training time is likely of no value, and FreezeOut is not a desirable technique to use. Based on these experiments, we recommend a default strategy of cubic scheduling with learning rate scaling, using a t value of.8 before cubing (so t =.512) for maximizing speed while remaining within an envelope of 3% relative error. As a close alternative, we suggest linear scheduling without learning rate scaling, using t =.5. The user is, of course, free to select parameters to fit a particular point on the design curve. 4

Figure 4: FreezeOut results for WRN4-4 on CIFAR-1. Shaded areas represent one standard deviation from the mean across 3 training runs for Cubic Scaled and 4 training runs for Linear Unscaled. 3.2 WIDE RESNET EXPERIMENTS We next investigate the feasibility of FreezeOut for use in Residual architectures (He et al.). We train Wide ResNets (Zagoruyko & Komodakis, 216) with depth of 4 and widening factor of 4, and compare our two recommended strategies against a no-freezeout baseline. We vary the number of epochs for which we train, and investigate how FreezeOut performs against a baseline of simply training for fewer epochs (e.g. a network trained with FreezeOut for 1 epochs has approximately the same training time as a network trained without FreezeOut for 8 epochs). The results of this investigation are shown in Figure 4. FreezeOut appears to be better suited to Wide ResNets than DenseNets, achieving higher accuracy even when trained for the same number of epochs (the final point on the Cubic Scaled curve and the final point on the Baseline curve were both trained for the same number of iterations). 3.3 VGG EXPERIMENTS Finally, we investigate the use of FreezeOut on an architecture without skip connections. We employ a variant of the VGG-16 (Simonyan & Zisserman) architecture with Batch Normalization (Ioffe & Szegedy), no Dropout (Srivastava et al., 212), and 512 units in each of the fully connected layers rather than 496. We perform a more limited set of experiments using the Linear Unscaled strategy against a baseline without FreezeOut, as shown in Figure 5. FreezeOut appears to be less well-suited for use in VGG, suggesting that skip connections (either dense or residual) may be an important element enabling FreezeOut to work for the other architectures we investigate. 4 CONCLUSION In this extended abstract, we presented FreezeOut, a simple technique to accelerate neural network training by progressively freezing hidden layers. ACKNOWLEDGMENTS This research was made possible by grants and support from Renishaw plc and the Edinburgh Centre For Robotics. The work presented herein is also partially funded under the European H22 Programme BEACONING project, Grant Agreement nr. 687676. 5

Figure 5: FreezeOut results for VGG-16 on CIFAR-1. Error Bars represent a single standard deviation from the mean across three training runs. Error bars instead of shaded error lines used here for improved clarity. REFERENCES Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 27. X. Gastaldi. Shake-shake regularization of 3-branch residual networks. ICLR 217 Workshop, 217. K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV 216. G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV 216. G. Huang, Z. Liu, K.Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR 217, 217. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML 215. I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR 217, 217. A. Paszke, S. Gross, and S. Chintala. Pytorch. github.com/pytorch/pytorch, 217. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR 215. N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arxiv Preprint arxiv: 127.58, 212. I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML 213. 6

P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. In JMLR 21. S. Zagoruyko and N. Komodakis. Wide residual networks. arxiv Preprint arxiv: 165.7146, 216. 7