arxiv: v2 [cs.lg] 30 Mar 2017


Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

Sergey Ioffe, Google Inc.

Abstract

Batch Normalization is quite effective at accelerating and improving the training of deep models. However, its effectiveness diminishes when the training minibatches are small, or do not consist of independent samples. We hypothesize that this is due to the dependence of model layer inputs on all the examples in the minibatch, and different activations being produced between training and inference. We propose Batch Renormalization, a simple and effective extension to ensure that the training and inference models generate the same outputs that depend on individual examples rather than the entire minibatch. Models trained with Batch Renormalization perform substantially better than batchnorm when training with small or non-i.i.d. minibatches. At the same time, Batch Renormalization retains the benefits of batchnorm such as insensitivity to initialization and training efficiency.

1 Introduction

Batch Normalization ("batchnorm" [6]) has recently become a part of the standard toolkit for training deep networks. By normalizing activations, batch normalization helps stabilize the distributions of internal activations as the model trains. Batch normalization also makes it possible to use significantly higher learning rates, and reduces the sensitivity to initialization. These effects help accelerate the training, sometimes dramatically so. Batchnorm has been successfully used to enable state-of-the-art architectures such as residual networks [5].

Batchnorm works on minibatches in stochastic gradient training, and uses the mean and variance of the minibatch to normalize the activations. Specifically, consider a particular node in the deep network, producing a scalar value for each input example. Given a minibatch $B$ of $m$ examples, consider the values of this node, $x_1 \ldots x_m$.
Then batchnorm takes the form:

$$\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sigma_B}$$

where $\mu_B$ is the sample mean of $x_1 \ldots x_m$, and $\sigma_B^2$ is the sample variance (in practice, a small $\epsilon$ is added to it for numerical stability). It is clear that the normalized activations corresponding to an input example will depend on the other examples in the minibatch. This is undesirable during inference, and therefore the mean and variance computed over all training data can be used instead. In practice, the model usually maintains moving averages of minibatch means and variances, and during inference uses those in place of the minibatch statistics.

While it appears to make sense to replace the minibatch statistics with whole-data ones during inference, this changes the activations in the network. In particular, this means that the upper layers (whose inputs are normalized using the minibatch) are trained on representations different from those computed in inference (when the inputs are normalized using the population statistics). When the minibatch size is large and its elements are i.i.d. samples from the training distribution, this difference is small, and can in fact aid generalization. However, minibatch-wise normalization may have significant drawbacks:

For small minibatches, the estimates of the mean and variance become less accurate. These inaccuracies are compounded with depth, and reduce the quality of resulting models. Moreover, as each example is used to compute the variance used in its own normalization, the normalization operation is less well approximated by an affine transform, which is what is used in inference.

Non-i.i.d. minibatches can have a detrimental effect on models with batchnorm. For example, in a metric learning scenario (e.g. [4]), it is common to bias the minibatch sampling to include sets of examples that are known to be related. For instance, for a minibatch of size 32, we may randomly select 16 labels, then choose 2 examples for each of those labels.
Without batchnorm, the loss computed for the minibatch decouples over the examples, and the intra-batch dependence introduced by our sampling mechanism may, at worst, increase the variance of the minibatch gradient. With batchnorm, however, the examples interact at every layer, which may cause the model to overfit to the specific distribution of minibatches and suffer when used on individual examples.

The dependence of the batch-normalized activations on the entire minibatch makes batchnorm powerful, but it is also the source of its drawbacks. Several approaches have been proposed to alleviate this. However, unlike batchnorm which can be easily applied to an existing model, these methods may require careful analysis of nonlinearities [1] and may change the class of functions representable by the model [2]. Weight normalization [10] presents an alternative, but does not offer guarantees about the activations and gradients when the model contains arbitrary nonlinearities, or contains layers without such normalization. Furthermore, weight normalization has been shown to benefit from mean-only batch normalization, which, like batchnorm, results in different outputs during training and inference. Another alternative [9] is to use a separate and fixed minibatch to compute the normalization parameters, but this makes the training more expensive, and does not guarantee that the activations outside the fixed minibatch are normalized.

In this paper we propose Batch Renormalization, a new extension to batchnorm. Our method ensures that the activations computed in the forward pass of the training step depend only on a single example and are identical to the activations computed in inference. This significantly improves the training on non-i.i.d. or small minibatches, compared to batchnorm, without incurring extra cost.

2 Prior Work: Batch Normalization

We are interested in stochastic gradient optimization of deep networks. The task is to minimize the loss, which decomposes over training examples:

$$\Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell_i(\Theta)$$

where $\ell_i$ is the loss incurred on the $i$th training example, and $\Theta$ is the vector of model weights. At each training step, a minibatch of $m$ examples is used to compute the gradient

$$\frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell_i(\Theta)}{\partial \Theta}$$

which the optimizer uses to adjust $\Theta$.

Consider a particular node $x$ in a deep network. We observe that $x$ depends on all the model parameters that are used for its computation, and when those change, the distribution of $x$ also changes. Since $x$ itself affects the loss through all the layers above it, this change in distribution complicates the training of the layers above. This has been referred to as internal covariate shift. Batch Normalization [6] addresses it by considering the values of $x$ in a minibatch $B = \{x_1 \ldots x_m\}$.
It then normalizes them as follows:

$$\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$$

$$\sigma_B \leftarrow \sqrt{\frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 + \epsilon}$$

$$\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sigma_B}$$

$$y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}(x_i)$$

Here $\gamma$ and $\beta$ are trainable parameters (learned using the same procedure, such as stochastic gradient descent, as all the other model weights), and $\epsilon$ is a small constant. Crucially, the computation of the sample mean $\mu_B$ and sample standard deviation $\sigma_B$ is part of the model architecture: both are themselves functions of the model parameters, and as such participate in backpropagation. The backpropagation formulas for batchnorm are easy to derive by chain rule and are given in [6].

When applying batchnorm to a layer of activations $x$, the normalization takes place independently for each dimension (or, in the convolutional case, for each channel or feature map). When $x$ is itself a result of applying a linear transform $W$ to the previous layer, batchnorm makes the model invariant to the scale of $W$ (ignoring the small $\epsilon$). This invariance makes it possible to not be picky about weight initialization, and to use larger learning rates.

Besides the reduction of internal covariate shift, an intuition for another effect of batchnorm can be obtained by considering the gradients with respect to different layers. Consider the normalized layer $\hat{x}$, whose elements all have zero mean and unit variance. For a thought experiment, let us assume that the dimensions of $\hat{x}$ are independent. Further, let us approximate the loss $\ell(\hat{x})$ as its first-order Taylor expansion: $\ell \approx \ell_0 + g^T \hat{x}$, where $g = \frac{\partial \ell}{\partial \hat{x}}$. Because the dimensions are independent with unit variance, $\mathrm{Var}[g^T \hat{x}] = \sum_j g_j^2 \, \mathrm{Var}[\hat{x}_j] = \|g\|^2$. It then follows that $\mathrm{Var}[\ell] \approx \|g\|^2$, in which the left-hand side does not depend on the layer we picked. This means that the norm of the gradient w.r.t. a normalized layer $\hat{x}$ is approximately the same for different normalized layers. Therefore the gradients, as they flow through the network, neither explode nor vanish, thus facilitating the training. While the assumptions of independence and linearity do not hold in practice, the gradient flow is in fact significantly improved in batch-normalized models.
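The batchnorm transform defined at the start of this section, and its invariance to the scale of the incoming weights, can be illustrated with a minimal sketch. It is written in plain Python for a single node; the function name `batchnorm_train` and the toy values are assumptions made for illustration, not part of the paper.

```python
import math

def batchnorm_train(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-mode batchnorm for one node over a minibatch (illustrative helper)."""
    m = len(xs)
    mu_b = sum(xs) / m                                   # minibatch mean
    sigma_b = math.sqrt(sum((x - mu_b) ** 2 for x in xs) / m + eps)  # minibatch std
    return [gamma * (x - mu_b) / sigma_b + beta for x in xs]

# Toy activations for one node over a minibatch of 4 examples (made-up values).
xs = [0.5, -1.2, 3.0, 0.7]
ys = batchnorm_train(xs)

# Scaling the incoming weights W by 10 scales every activation by 10;
# the normalized outputs are unchanged up to the effect of the small eps.
ys_scaled = batchnorm_train([10.0 * x for x in xs])
```

Each output in `ys` depends on every entry of `xs` through `mu_b` and `sigma_b`, which is the minibatch dependence discussed in this paper.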
During inference, the standard practice is to normalize the activations using the moving averages $\mu$, $\sigma^2$ instead of the minibatch mean $\mu_B$ and variance $\sigma_B^2$:

$$y_{\mathrm{inference}} = \frac{x - \mu}{\sigma} \cdot \gamma + \beta$$

which depends only on a single input example rather than requiring a whole minibatch.

It is natural to ask whether we could simply use the moving averages $\mu$, $\sigma$ to perform the normalization during training, since this would remove the dependence of the normalized activations on the other examples in the minibatch. This, however, has been observed to lead to the model blowing up. As argued in [6], such use of moving averages would cause the gradient optimization and the normalization to counteract each other. For example, the gradient step may increase a bias or scale the convolutional weights, in spite of the fact that the normalization would cancel the effect of these changes on the loss. This would result in unbounded growth of model parameters without actually improving the loss. It is thus crucial to use the minibatch moments, and to backpropagate through them.
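The gap between training-mode and inference-mode activations described above can be made concrete with a small sketch. The helper names, the toy minibatches, and the stored statistics (`mu=2.0`, `var=4.0`) are hypothetical values chosen for illustration.

```python
import math

def batchnorm_train_one(x, batch, eps=1e-5):
    """Training-mode output for x, normalized with the statistics of `batch`."""
    m = len(batch)
    mu_b = sum(batch) / m
    var_b = sum((v - mu_b) ** 2 for v in batch) / m
    return (x - mu_b) / math.sqrt(var_b + eps)

def batchnorm_inference(x, mu, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Inference-mode output: depends only on x and the stored moving averages."""
    return gamma * (x - mu) / math.sqrt(var + eps) + beta

def update_moving(avg, batch_value, alpha):
    """Exponentially-decayed moving average update with rate alpha."""
    return avg + alpha * (batch_value - avg)

# The same example (x = 1.0) placed in two different minibatches.
batch_a = [1.0, 0.0, 2.0, 3.0]
batch_b = [1.0, 5.0, 6.0, 8.0]
train_a = batchnorm_train_one(1.0, batch_a)   # normalized by batch_a statistics
train_b = batchnorm_train_one(1.0, batch_b)   # normalized by batch_b statistics
infer = batchnorm_inference(1.0, mu=2.0, var=4.0)  # same value for any minibatch
```

Here `train_a` and `train_b` differ even though the input example is identical, while `infer` is a function of the example alone.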

3 Batch Renormalization

With batchnorm, the activities in the network differ between training and inference, since the normalization is done differently between the two models. Here, we aim to rectify this, while retaining the benefits of batchnorm.

Let us observe that if we have a minibatch and normalize a particular node $x$ using either the minibatch statistics or their moving averages, then the results of these two normalizations are related by an affine transform. Specifically, let $\mu$ be an estimate of the mean of $x$, and $\sigma$ be an estimate of its standard deviation, computed perhaps as a moving average over the last several minibatches. Then, we have:

$$\frac{x_i - \mu}{\sigma} = \frac{x_i - \mu_B}{\sigma_B} \cdot r + d, \quad \text{where } r = \frac{\sigma_B}{\sigma}, \quad d = \frac{\mu_B - \mu}{\sigma}$$

If $\sigma = E[\sigma_B]$ and $\mu = E[\mu_B]$, then $E[r] = 1$ and $E[d] = 0$ (the expectations are w.r.t. a minibatch $B$). Batch Normalization, in fact, simply sets $r = 1$, $d = 0$.

We propose to retain $r$ and $d$, but treat them as constants for the purposes of gradient computation. In other words, we augment a network, which contains batch normalization layers, with a per-dimension affine transformation applied to the normalized activations. We treat the parameters $r$ and $d$ of this affine transform as fixed, even though they were computed from the minibatch itself. It is important to note that this transform is identity in expectation, as long as $\sigma = E[\sigma_B]$ and $\mu = E[\mu_B]$. We refer to batch normalization augmented with this affine transform as Batch Renormalization: the fixed (for the given minibatch) $r$ and $d$ correct for the fact that the minibatch statistics differ from the population ones. This allows the above layers to observe the correct activations, namely the ones that would be generated by the inference model.

In practice, it is beneficial to train the model for a certain number of iterations with batchnorm alone, without the correction, then ramp up the amount of allowed correction. We do this by imposing bounds on $r$ and $d$, which initially constrain them to 1 and 0, respectively, and then are gradually relaxed.
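The affine correction and its bounds, as described above, can be sketched for a single node as follows. This is a forward-pass illustration only: plain Python has no automatic differentiation, so the "treat as constants" step appears as a comment rather than an actual stop-gradient operation. The function names and toy values are assumptions.

```python
import math

def clip(v, lo, hi):
    """Clamp v to the interval [lo, hi]."""
    return max(lo, min(hi, v))

def batch_renorm_forward(xs, mu, sigma, r_max, d_max,
                         gamma=1.0, beta=0.0, eps=1e-5):
    """Forward pass of the corrected normalization for one node (illustrative)."""
    m = len(xs)
    mu_b = sum(xs) / m
    sigma_b = math.sqrt(eps + sum((x - mu_b) ** 2 for x in xs) / m)
    # r and d are computed from the minibatch, but an autodiff framework
    # would treat them as constants (no gradient flows through them).
    r = clip(sigma_b / sigma, 1.0 / r_max, r_max)
    d = clip((mu_b - mu) / sigma, -d_max, d_max)
    return [gamma * ((x - mu_b) / sigma_b * r + d) + beta for x in xs]

xs = [1.0, 2.0, 4.0, 5.0]          # toy minibatch (made-up values)
mu, sigma = 2.5, 2.0               # hypothetical moving averages

# With loose bounds, the output equals the inference normalization (x - mu) / sigma.
corrected = batch_renorm_forward(xs, mu, sigma, r_max=3.0, d_max=5.0)

# With r_max = 1 and d_max = 0, the correction is disabled: plain batchnorm.
plain = batch_renorm_forward(xs, mu, sigma, r_max=1.0, d_max=0.0)
```

When neither `r` nor `d` is clipped, each entry of `corrected` matches `(x - mu) / sigma`, so the training-time activation of an example no longer depends on the rest of the minibatch.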
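Algorithm 1 presents Batch Renormalization. Unlike batchnorm, where the moving averages are computed during training but used only for inference, Batch Renorm does use $\mu$ and $\sigma$ during training to perform the correction. We use a fairly high rate of update $\alpha$ for these averages, to ensure that they benefit from averaging multiple batches but do not become stale relative to the model parameters.

Algorithm 1: Training (top) and inference (bottom) with Batch Renormalization, applied to activation $x$ over a mini-batch. During backpropagation, standard chain rule is used. The values marked with stop gradient are treated as constant for a given training step, and the gradient is not propagated through them.

Input: Values of $x$ over a training mini-batch $B = \{x_1 \ldots x_m\}$; parameters $\gamma$, $\beta$; current moving mean $\mu$ and standard deviation $\sigma$; moving average update rate $\alpha$; maximum allowed correction $r_{\max}$, $d_{\max}$.

Output: $\{y_i = \mathrm{BatchRenorm}(x_i)\}$; updated $\mu$, $\sigma$.

$$\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$$

$$\sigma_B \leftarrow \sqrt{\epsilon + \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2}$$

$$r \leftarrow \mathrm{stop\_gradient}\left(\mathrm{clip}_{[1/r_{\max},\, r_{\max}]}\left(\frac{\sigma_B}{\sigma}\right)\right)$$

$$d \leftarrow \mathrm{stop\_gradient}\left(\mathrm{clip}_{[-d_{\max},\, d_{\max}]}\left(\frac{\mu_B - \mu}{\sigma}\right)\right)$$

$$\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sigma_B} \cdot r + d$$

$$y_i \leftarrow \gamma \hat{x}_i + \beta$$

$$\mu := \mu + \alpha(\mu_B - \mu) \quad \text{// Update moving averages}$$

$$\sigma := \sigma + \alpha(\sigma_B - \sigma)$$

Inference: $y \leftarrow \gamma \cdot \frac{x - \mu}{\sigma} + \beta$

We explicitly update the exponentially-decayed moving averages $\mu$ and $\sigma$, and optimize the rest of the model using gradient optimization, with the gradients calculated via backpropagation:

$$\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i} \cdot \gamma$$

$$\frac{\partial \ell}{\partial \sigma_B} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \frac{-r}{\sigma_B^2}$$

$$\frac{\partial \ell}{\partial \mu_B} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-r}{\sigma_B}$$

$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{r}{\sigma_B} + \frac{\partial \ell}{\partial \sigma_B} \cdot \frac{x_i - \mu_B}{m \sigma_B} + \frac{\partial \ell}{\partial \mu_B} \cdot \frac{1}{m}$$

$$\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i$$

$$\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}$$

These gradient equations reveal another interpretation of Batch Renormalization. Because the loss $\ell$ is unaffected when all $x_i$ are shifted or scaled by the same amount, the functions $\ell(\{x_i + t\})$ and $\ell(\{x_i \cdot (1 + t)\})$ are constant in $t$, and computing their derivatives at $t = 0$ gives $\sum_{i=1}^{m} \frac{\partial \ell}{\partial x_i} = 0$ and $\sum_{i=1}^{m} x_i \frac{\partial \ell}{\partial x_i} = 0$. Therefore, if we consider the $m$-dimensional vector $\left\{\frac{\partial \ell}{\partial x_i}\right\}$ (with one element per example in the minibatch), and further consider two vectors $p_0 = (1, \ldots, 1)$ and $p_1 = (x_1, \ldots, x_m)$, then $\left\{\frac{\partial \ell}{\partial x_i}\right\}$ lies in the null-space of $p_0$ and $p_1$. In fact, it is easy to see from the Batch Renorm backprop formulas that to compute the gradient $\left\{\frac{\partial \ell}{\partial x_i}\right\}$ from $\left\{\frac{\partial \ell}{\partial \hat{x}_i}\right\}$, we need to first scale the latter by $r / \sigma_B$, then project it onto the null-space of $p_0$ and $p_1$. For $r = \frac{\sigma_B}{\sigma}$, this is equivalent to the backprop for the transformation $\frac{x - \mu}{\sigma}$, but combined with the null-space projection. In other words, Batch Renormalization allows us to normalize using moving averages $\mu$, $\sigma$ in training, and makes it work using the extra projection step in backprop.

Batch Renormalization shares many of the beneficial properties of batchnorm, such as insensitivity to initialization and ability to train efficiently with large learning rates. Unlike batchnorm, our method ensures that all layers are trained on internal representations that will be actually used during inference.

4 Results

To evaluate Batch Renormalization, we applied it to the problem of image classification. Our baseline model is Inception v3 [12], trained on 1000 classes from the ImageNet training set [8], and evaluated on the ImageNet validation data. In the baseline model, batchnorm was used after convolution and before the ReLU [7]. To apply Batch Renorm, we simply swapped it into the model in place of batchnorm. Both methods normalize each feature map over examples as well as over spatial locations. We fix the scale $\gamma = 1$, since it could be propagated through the ReLU and absorbed into the next layer.

The training used 50 synchronized workers [3]. Each worker processed a minibatch of 32 examples per training step. The gradients computed for all 50 minibatches were aggregated and then used by the RMSProp optimizer [13].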
As is common practice, the inference model used exponentially-decayed moving averages of all model parameters, including the $\mu$ and $\sigma$ computed by both batchnorm and Batch Renorm.

For Batch Renorm, we used $r_{\max} = 1$, $d_{\max} = 0$ (i.e. simply batchnorm) for the first 5000 training steps, after which these were gradually relaxed to reach $r_{\max} = 3$ at 40k steps, and $d_{\max} = 5$ at 25k steps. These final values resulted in clipping a small fraction of $r$s, and none of $d$s. However, at the beginning of training, when the learning rate was larger, it proved important to increase $r_{\max}$ slowly: otherwise, occasional large gradients were observed to suddenly and severely increase the loss. To account for the fact that the means and variances change as the model trains, we used relatively fast updates to the moving statistics $\mu$ and $\sigma$, with $\alpha =$ [value lost in extraction]. Because of this and keeping $r_{\max} = 1$ for a relatively large number of steps, we did not need to apply initialization bias correction [?].

All the hyperparameters other than those related to normalization were fixed between the models and across experiments.

[Figure 1: Validation top-1 accuracy of Inception-v3 model with batchnorm and its Batch Renorm version, trained on 50 synchronized workers, each processing minibatches of size 32. The Batch Renorm model achieves a marginally higher validation accuracy. Plot legend: Baseline (78.3%), Batch Renormalization (78.5%); axes: accuracy (%) vs. training steps.]

4.1 Baseline

As a baseline, we trained the batchnorm model using the minibatch size of 32. More specifically, batchnorm was applied to each of the 50 minibatches; each example was normalized using 32 examples, but the resulting gradients were aggregated over 50 minibatches. This model achieved the top-1 validation accuracy of 78.3% after 130k training steps. To verify that Batch Renorm does not diminish performance on such minibatches, we also trained the model with Batch Renorm, see Figure 1.
The test accuracy of this model closely tracked the baseline, achieving a slightly higher test accuracy (78.5%) after the same number of steps.

4.2 Small minibatches

To investigate the effectiveness of Batch Renorm when training on small minibatches, we reduced the number of examples used for normalization to 4. Each minibatch of size 32 was thus broken into "microbatches" each having 4 examples; each microbatch was normalized independently, but the loss for each minibatch was computed as before. In other words, the gradient was still aggregated over 1600 examples per step, but the normalization involved groups of 4 examples rather than 32 as in the baseline. Figure 2 shows the results.

The validation accuracy of the batchnorm model is significantly lower than the baseline that normalized over minibatches of size 32, and training is slow, achieving 74.2% at 210k steps. We obtain a substantial improvement much faster (76.5% at 130k steps) by replacing batchnorm with Batch Renorm. However, the resulting test accuracy is still below what we get when applying either batchnorm or Batch Renorm to size 32 minibatches. Although Batch Renorm improves the training with small minibatches, it does not eliminate the benefit of having larger ones.

[Figure 2: Validation accuracy for models trained with either batchnorm or Batch Renorm, where normalization is performed for sets of 4 examples (but with the gradients aggregated over all $50 \times 32$ examples processed by the 50 workers). Batch Renorm allows the model to train faster and achieve a higher accuracy, although normalizing sets of 32 examples performs better. Plot legend: Batchnorm (74.2%), Batch Renormalization (76.5%); axes: accuracy (%) vs. training steps.]

[Figure 3 plot legend: Batchnorm (67.0%); Batchnorm, train accuracy (72.8%); Non-i.i.d. in test, using $\mu_B$, $\sigma_B$ (76.5%); Batchnorm on half-minibatches (77.4%); Batch Renormalization (78.6%); axes: accuracy (%) vs. training steps.]

4.3 Non-i.i.d. minibatches

When examples in a minibatch are not sampled independently, batchnorm can perform rather poorly. However, sampling with dependencies may be necessary for tasks such as metric learning [4, 11]. We may want to ensure that images with the same label have more similar representations than otherwise, and to learn this we require that a reasonable number of same-label image pairs can be found within the same minibatch.

In this experiment (Figure 3), we selected each minibatch of size 32 by randomly sampling 16 labels (out of the total 1000) with replacement, then randomly selecting 2 images for each of those labels.
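The minibatch sampling procedure just described could look roughly like the sketch below. The dataset layout (a dictionary from label to a list of image identifiers), the function name, and the choice to draw the 2 images of a label without replacement are all assumptions; the text does not specify them.

```python
import random

def sample_noniid_minibatch(images_by_label, num_labels=16, per_label=2, rng=random):
    """Draw labels with replacement, then `per_label` images for each drawn label."""
    labels = list(images_by_label)
    batch = []
    for _ in range(num_labels):
        label = rng.choice(labels)  # labels are sampled with replacement
        # Assumption: the images for one label are distinct from each other.
        for image in rng.sample(images_by_label[label], per_label):
            batch.append((image, label))
    return batch

# Toy stand-in for a 1000-class dataset with 5 image ids per class.
dataset = {label: [f"img_{label}_{k}" for k in range(5)] for label in range(1000)}
batch = sample_noniid_minibatch(dataset, rng=random.Random(0))
```

Every image in such a minibatch has at least one same-label counterpart, which is the structure a batch-normalized model can overfit to.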
When training with batchnorm, the test accuracy is much lower than for i.i.d. minibatches, achieving only 67%. Surprisingly, even the training accuracy is much lower (72.8%) than the test accuracy in the i.i.d. case, and in fact exhibits a drop that is consistent with overfitting. We suspect that this is in fact what happens: the model learns to predict labels for images that come in a set, where each image has a counterpart with the same label. This does not directly translate to classifying images individually, thus producing a drop in the accuracy computed on the training data.

[Figure 3: Validation accuracy when training on non-i.i.d. minibatches, obtained by sampling 2 images for each of 16 (out of total 1000) random labels. This distribution bias results not only in a low test accuracy, but also low accuracy on the training set, with an eventual drop. This indicates overfitting to the particular minibatch distribution, which is confirmed by the improvement when the test minibatches also contain 2 images per label, and batchnorm uses minibatch statistics $\mu_B$, $\sigma_B$ during inference. It improves further if batchnorm is applied separately to 2 halves of a training minibatch, making each of them more i.i.d. Finally, by using Batch Renorm, we are able to just train and evaluate normally, and achieve the same validation accuracy as we get for i.i.d. minibatches in Fig. 1.]

To verify this, we also evaluated the model in the "training mode", i.e. using minibatch statistics $\mu_B$, $\sigma_B$ instead of moving averages $\mu$, $\sigma$, where each test minibatch had size 50 and was obtained using the same procedure as the training minibatches: 25 labels, with 2 images per label. As expected, this does much better, achieving 76.5%, though still below the baseline accuracy. Of course, this evaluation scenario is usually infeasible, as we want the image representation to be a deterministic function of that image alone.
We can improve the accuracy for this problem by splitting each minibatch into two halves of size 16 each, so that for every pair of images belonging to the same class, one image is assigned to the first half-minibatch, and the other to the second. Each half is then more i.i.d., and this achieves a much better test accuracy (77.4% at 140k steps), but still below the baseline. This method is only applicable when the number of examples per label is small (since this determines the number of microbatches that a minibatch needs to be split into).

With Batch Renorm, we simply trained the model with minibatch size of 32. The model achieved the same test accuracy (78.5% at 120k steps) as the equivalent model on i.i.d. minibatches, vs. 67% obtained with batchnorm. By replacing batchnorm with Batch Renorm, we ensured that the inference model can effectively classify individual images. This has completely eliminated the effect of overfitting the model to image sets with a biased label distribution.

5 Conclusions

We have demonstrated that Batch Normalization, while effective, is not well suited to small or non-i.i.d. training minibatches. We hypothesized that these drawbacks are due to the fact that the activations in the model, which are in turn used by other layers as inputs, are computed differently during training than during inference. We address this with Batch Renormalization, which replaces batchnorm and ensures that the outputs computed by the model are dependent only on the individual examples and not the entire minibatch, during both training and inference.

Batch Renormalization extends batchnorm with a per-dimension correction to ensure that the activations match between the training and inference networks. This correction is identity in expectation; its parameters are computed from the minibatch but are treated as constant by the optimizer. Unlike batchnorm, where the means and variances used during inference do not need to be computed until the training has completed, Batch Renormalization benefits from having these statistics directly participate in the training. Batch Renormalization is as easy to implement as batchnorm itself, runs at the same speed during both training and inference, and significantly improves training on small or non-i.i.d. minibatches. Our method does have extra hyperparameters: the update rate $\alpha$ for the moving averages, and the schedules for correction limits $d_{\max}$, $r_{\max}$. A more extensive investigation of the effect of these is a part of future work.
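As an illustration of what a schedule for the correction limits might look like, the sketch below uses the endpoints reported in Section 4 ($r_{\max}$ from 1 to 3 between 5k and 40k steps, $d_{\max}$ from 0 to 5 between 5k and 25k steps). The text only says the limits were "gradually relaxed"; the linear ramp and the function names here are assumptions.

```python
def relax(step, start_step, end_step, start_value, end_value):
    """Linear ramp between two steps (the linear shape is an assumption)."""
    if step <= start_step:
        return start_value
    if step >= end_step:
        return end_value
    frac = (step - start_step) / (end_step - start_step)
    return start_value + frac * (end_value - start_value)

def correction_limits(step):
    """Return (r_max, d_max) for a training step, using the Section 4 endpoints."""
    r_max = relax(step, 5_000, 40_000, 1.0, 3.0)   # plain batchnorm until 5k steps
    d_max = relax(step, 5_000, 25_000, 0.0, 5.0)
    return r_max, d_max
```

For the first 5000 steps this returns `(1.0, 0.0)`, which disables the correction so that the layer behaves as plain batchnorm.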
Batch Renormalization offers a promise of improving the performance of any model that would normally use batchnorm. This includes Residual Networks [5]. Another application is Generative Adversarial Networks [9], where the non-determinism introduced by batchnorm has been found to be an issue, and Batch Renorm may provide a solution.

Finally, Batch Renormalization may benefit applications where applying batch normalization has been difficult, such as recurrent networks. There, batchnorm would require each timestep to be normalized independently, but Batch Renormalization may make it possible to use the same running averages to normalize all timesteps, and then update those averages using all timesteps. This remains one of the areas that warrants further exploration.

References

[1] D. Arpit, Y. Zhou, B. U. Kota, and V. Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. arXiv preprint.
[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint.
[3] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint.
[4] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems 17.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[6] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15).
[7] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML. Omnipress.
[8] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge.
[9] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems.
[10] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems.
[11] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. CoRR.
[12] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[13] T. Tieleman and G. Hinton. Lecture rmsprop. COURSERA: Neural Networks for Machine Learning.
