Centered Weight Normalization in Accelerating Training of Deep Neural Networks


Lei Huang    Xianglong Liu    Yang Liu    Bo Lang    Dacheng Tao
State Key Laboratory of Software Development Environment, Beihang University, P.R. China
UBTECH Sydney AI Centre, School of IT, FEIT, The University of Sydney, Australia
{huanglei, xlliu, blonster,

Abstract

Training deep neural networks is difficult due to the pathological curvature problem. Re-parameterization is an effective way to relieve this problem, either by learning the curvature approximately or by constraining the weights to solutions with properties that are favorable for optimization. This paper proposes to re-parameterize the input weight of each neuron in a deep neural network by normalizing it to zero mean and unit norm, followed by a learnable scalar parameter that adjusts the norm of the weight. This technique effectively stabilizes the pre-activation distributions implicitly. Besides, it improves the conditioning of the optimization problem and thus accelerates the training of deep neural networks. It can be wrapped as a linear module in practice and plugged into any architecture to replace the standard linear module. We highlight the benefits of our method on both multi-layer perceptrons and convolutional neural networks, and demonstrate its scalability and efficiency on the SVHN, CIFAR-10, CIFAR-100 and ImageNet datasets.

1. Introduction

Recently, deep neural networks have achieved great success across a broad range of applications, e.g., image classification, speech recognition and object detection [33, 35, 14, 36, 7, 5]. Neural networks are typically composed of stacked layers, and the transformation between layers consists of a linear mapping with learnable parameters, in which each neuron computes a weighted sum over its inputs and adds a bias term, followed by an element-wise nonlinear activation. The stacked structure allows a neural network to learn feature hierarchies, with features at higher levels formed by the composition of lower-level features. Further, deep architectures give neural networks the powerful representation capacity to learn complicated functions that can represent high-level abstractions.

While the deep and complex structure enjoys appealing advantages, it also makes learning difficult. Indeed, many explanations for the difficulty of deep learning have been explored, such as the problem of vanishing and exploding gradients [2], internal covariate shift [16], and the pathological curvature [20]. To address these problems, various studies such as careful weight initialization [19, 9, 13, 29, 21], normalization of internal activations [16, 4], and sophisticated optimization methods have been proposed accordingly [20, 12, 11]. Our work is dedicated to the problem of pathological curvature [20], i.e., the conditioning of the Hessian matrix of the objective function is poor near the optimum regions [28], which makes learning extremely difficult via first-order stochastic gradient descent. Several studies [12, 11] have recently attempted to use pre-conditioning techniques to improve the conditioning of the cost curvature. However, these methods introduce too much overhead and are not convenient to apply. An alternative track to facilitate the optimization progress is the transformation of the parameterization space of a model [1], which is called re-parameterization. The motivation of re-parameterization is that there may be various equivalent ways to parameterize the same model, some of which are much easier to optimize than others [28].
Therefore, exploring good ways of parameterizing neural networks [6, 28] is essentially important for training deep neural networks. Inspired by the practical trick that, for initialization, weights are sampled from a distribution with zero mean and a carefully chosen standard deviation [19, 9, 13], in this paper we propose to constrain the input weight of each neuron to zero mean and unit norm by re-parameterization during the course of training, followed by a learnable scalar parameter to adjust the norm of the input weight. We use proxy parameters and perform gradient updates on these proxy parameters by back-propagating the gradient information through the normalization process (Figure 1). By introducing this process, the summed input of each neuron is more likely to possess the properties of zero mean and stable variance.

Besides, this technique can effectively improve the conditioning of the optimization problem and thus accelerate the training of deep neural networks. We wrap the proposed re-parameterization method into a linear module in practice, which can be plugged into any architecture as a substitute for the standard linear module. The technique we present is generic and can be applied to a broad range of models. Our method is also orthogonal and complementary to recent advances in deep learning, such as Adam optimization [17] and batch normalization [16]. We conduct comprehensive experiments on the Yale-B, SVHN, CIFAR-10, CIFAR-100 and ImageNet datasets over multi-layer perceptron and convolutional neural network architectures. The results show that centered weight normalization draws its strength in improving the performance of deep neural networks. Our code is available at

2. Related work

Training deep neural networks via first-order stochastic gradient descent is difficult in practice, mainly due to the pathological curvature problem. Several works have tried preconditioning techniques to accelerate the training. Martens and Sutskever [20] developed a second-order optimization method based on the Hessian-free approach. Other studies [12, 11] explicitly pre-multiply the cost gradient by an approximate inverse of the Fisher information matrix, thereby expecting to obtain an approximate natural gradient. The approximate inverse can be obtained by using Cholesky factorization [12] or a Kronecker-factored approximation [11]. These methods usually introduce much overhead. Besides, their optimization procedures are usually coupled with the preconditioning, and thus cannot be easily applied.

Some works addressed the benefits of centered activations [36] and gradients [26]. They show that these transformations make the Fisher information matrix approximately block diagonal, and improve the optimization performance. Batch normalization [16] further standardizes the activations with centering and scaling based on the mini-batch, and includes normalization as a part of the model architecture. Ba et al. [4] compute the layer normalization statistics over all the hidden units in the same layer, targeting the scenario where the size of the mini-batch is limited. These methods focus on normalizing the activations of the neurons explicitly, while our method works by re-parameterizing the model, and is expected to achieve better conditioning and stabilize the activations implicitly.

Figure 1. An illustrative example of layered neural networks with re-parameterization (for brevity, we leave out the bias nodes). The proposed re-parameterization method constructs a transformation ψ over the proxy parameter v to ensure that the transformed weight w has certain beneficial properties for the training of the neural network. Gradient updating is executed on the proxy parameter v by back-propagating the gradient information through the normalization process.

Re-parameterization is an effective technique to facilitate the optimization progress [6, 28]. Guillaume et al. [6] tried to estimate the expected projection matrix, implicitly whitening the activations. However, the operation of updating the projection matrix is still coupled with the optimization. Salimans and Kingma [28] designed weight normalization as a part of the model architecture.
The proposed weight normalization addresses the normalization of the input weight of each neuron and decouples the length of the weight vectors from their directions. Our work further strengthens weight normalization by centering the input weight, and we argue that this can further improve the conditioning and speed up the training of deep neural networks. There exist studies constructing orthogonal matrices in recurrent neural networks (RNNs) [2, 37, 8] by using re-parameterization to avoid the gradient vanishing and explosion problem. However, these methods are limited to the hidden-to-hidden transformation in RNNs, because they require the weight matrix to be square. Other alternative methods [5, 38] focus on reducing the storage and computation costs by re-parameterization. Different from them, our work aims to design a general re-parameterized linear module that accelerates training and improves the performance of deep neural networks, making it an alternative for the standard linear module.

3. Centered weight normalization

We follow the matrix notation that vectors are in column form, except that derivatives are row vectors. Given training data D = {(x_i, y_i), i = 1, 2, ..., m}, x denotes the input and y the target. A neural network is a function f(x; θ) parameterized by θ that is expected to fit the training data well and generalize to the test data. The function f(x; θ) adopted by a neural network usually consists of stacked layers. The transformation between layers consists of a linear mapping z^l = (W^l)^T h^{l-1} + b^l with learnable parameters W^l ∈ R^{d×n} and b^l ∈ R^n, followed by an element-wise nonlinearity h^l = φ(z^l), where l ∈ {1, 2, ..., L} indexes the layer and L denotes the total number of layers. By convention, h^0 corresponds to the input x and h^L corresponds to the output of the network f(x; θ). For clarification, we refer to z and h as the pre-activation and the activation, respectively.
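To make the notation concrete, the following is a minimal sketch (our illustration, not from the paper) of this standard stacked transformation in numpy; the layer sizes and the ReLU nonlinearity are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Layer sizes d0 -> d1 -> d2 (illustrative only).
sizes = [16, 32, 10]
W = [rng.normal(0, 1.0 / np.sqrt(din), size=(din, dout))  # W^l in R^{d x n}
     for din, dout in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(dout) for dout in sizes[1:]]

def forward(x):
    h = x                                  # h^0 is the input
    for l, (Wl, bl) in enumerate(zip(W, b), start=1):
        z = Wl.T @ h + bl                  # pre-activation z^l = (W^l)^T h^{l-1} + b^l
        h = relu(z) if l < len(W) else z   # activation h^l = phi(z^l); last layer kept linear
    return h                               # h^L = f(x; theta)

print(forward(rng.normal(size=16)).shape)  # (10,)
```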

Under this notation, the learnable parameters are θ = {W^l, b^l | l = 1, 2, ..., L}. Training the neural network can be viewed as tuning the parameters to minimize the discrepancy between the desired output y and the predicted output f(x; θ). This discrepancy is usually described by a loss function L(y, f(x; θ)), and thus the objective is to optimize the parameters θ by minimizing the loss as follows:

    θ* = argmin_θ E_{(x,y)∈D} [ L(y, f(x; θ)) ].    (1)

Stochastic gradient descent (SGD) has proved to be an effective way to train deep networks, in which the gradient of the loss function with respect to the parameters θ is approximated on a mini-batch x_{1...m} of size m at each iteration, by computing

    ∂L/∂θ = (1/m) Σ_{i=1}^{m} ∂L(y_i, f(x_i; θ)) / ∂θ.

3.1. Methodology

Despite the fact that SGD can guarantee a local optimum, it is also well known that the practical success of SGD is highly dependent on the curvature of the objective to be optimized. Deep neural networks usually exhibit the pathological curvature problem, which makes learning difficult. An alternative track to relieve the pathological curvature issue and thus facilitate the optimization progress is the transformation of the parameterization space of a network model, which is called re-parameterization.

In the literature, there exist practical tricks for weight initialization where weights are sampled from a distribution with zero mean and a carefully chosen standard deviation [19, 9, 13]. These initialization techniques can effectively avoid exponentially reducing or magnifying the magnitudes of the input signals or the back-propagated signals in the initial phase, by constructing stable and proper variances of the activations among different layers. Motivated by this observation, we propose to constrain the input weight of each neuron to zero mean and unit norm during the course of training by re-parameterization, which enjoys stable and fast training of deep neural networks.

For simplicity, we start by considering one certain neuron i in layer l, whose pre-activation is z_i^l = (w_i^l)^T h^{l-1} + b_i^l. We denote w_i^l as the input weight of the neuron, as shown in Figure 1. We leave out the superscript l and the subscript i for brevity in the following discussion.

Standardize weight We first re-parameterize the input weight w of each neuron and make sure that it has the following properties: (1) zero mean, i.e., w^T 1 = 0, where 1 is a column vector of all ones; (2) unit norm, i.e., ||w|| = 1, where ||w|| denotes the Euclidean norm of w. To achieve this goal, we express the input weight w in terms of the proxy parameter v (Figure 1) using

    w = (v - (1/d) 1 (1^T v)) / || v - (1/d) 1 (1^T v) ||,    (2)

where d is the dimension of the input weight. Stochastic gradient descent or alternative techniques such as Adam [17] can be directly applied to the optimization with respect to the proxy parameter v.
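As a concrete illustration of Eqn. (2), the following sketch (ours, not the paper's code) centers and normalizes a proxy parameter and uses the result as the effective weight of a single neuron:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
v = rng.normal(size=d)                     # proxy parameter, updated by SGD/Adam
h = rng.normal(size=d)                     # input activation of the neuron
b = 0.0

# Eqn. (2): center the proxy parameter, then project it onto the unit sphere.
v_hat = v - v.mean()                       # v - (1/d) * 1 * (1^T v)
w = v_hat / np.linalg.norm(v_hat)          # zero-mean, unit-norm input weight

print(np.isclose(w.sum(), 0.0), np.isclose(np.linalg.norm(w), 1.0))  # True True
z = w @ h + b                              # pre-activation of the neuron
```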
We refer to the proposed re-parameterization as centered weight normalization, that is, we center and scale the proxy parameter v to ensure that the input weight w has the desired zero-mean and unit-norm properties discussed above. In centered weight normalization, the parameter update is performed on the proxy parameters, and thus the gradient signal should back-propagate through the normalization process. Specifically, given the derivative ∂L/∂w, we can obtain the derivative of the loss with respect to the proxy parameter v as follows:

    ∂L/∂v = (1/||v̂||) [ ∂L/∂w - (∂L/∂w w) w^T - (1/d)(∂L/∂w 1) 1^T ],    (3)

where v̂ = v - (1/d) 1 (1^T v) is the centered auxiliary parameter.
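Equation (3) can be sanity-checked numerically; the sketch below (ours) compares the closed form against central finite differences for a random proxy parameter and an arbitrary upstream gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=d)
g = rng.normal(size=d)                     # stands in for dL/dw (a row vector)

def w_of(v):
    v_hat = v - v.mean()
    return v_hat / np.linalg.norm(v_hat)

w, v_hat = w_of(v), v - v.mean()

# Closed form of Eqn. (3); the scalar g.mean() broadcasts as (1/d)(dL/dw 1) 1^T.
analytic = (g - g.dot(w) * w - g.mean()) / np.linalg.norm(v_hat)

# Central finite differences of L(v) = g^T w(v).
eps, I = 1e-6, np.eye(d)
numeric = np.array([(g.dot(w_of(v + eps * I[i])) - g.dot(w_of(v - eps * I[i]))) / (2 * eps)
                    for i in range(d)])
print(np.max(np.abs(analytic - numeric)))  # tiny, so the two gradients agree
```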

Adjustable weight scale Our method can effectively be viewed as a solution to the following constrained optimization problem over neural networks:

    θ* = argmin_θ E_{(x,y)∈D} [ L(y, f(x; θ)) ]   s.t.  w^T 1 = 0 and ||w|| = 1,    (4)

where w indicates the input weight of each neuron in each layer. Therefore, we regularize the network with n constraints in each layer, where n is the number of filters in that layer, and the optimization is over the embedded submanifold of the original weight space. While these constraints provide regularization, they may also reduce the representation capacity of the network. To address this, we simply introduce a learnable scalar parameter g to fine-tune the norm of w. A similar idea has been introduced in [28], and proved useful in practice. We initialize it to 1, and expect that the network learning process can find a proper scale under the supervision signals. To summarize, we rewrite the pre-activation z of each neuron in the following way:

    z = g ( (v - (1/d) 1 (1^T v)) / || v - (1/d) 1 (1^T v) || )^T h + b.    (5)

Wrapped module with re-parameterization We can wrap the proposed method as a common module and plug it into neural networks as a substitute for the ordinary linear module. To achieve this goal, the key is to implement the forward and back-propagation passes. Based on Eqn. (5), it is straightforward to extend to multiple output neurons of size n. We describe the details of the forward pass in Algorithm 1, and the back-propagation pass in Algorithm 2. In the algorithms, the operator ⊙ represents element-wise matrix multiplication, 1_d indicates a column vector of all ones of size d, and X ∈ R^{d×m} is the mini-batch input data fed into this module, where d is the dimension of the input and m is the size of the mini-batch. v_i is the i-th column of V, and W = (w_1, w_2, ..., w_n).

Algorithm 1 Forward pass of the linear mapping with centered weight normalization.
1: Input: the mini-batch input data X ∈ R^{d×m} and the parameters to be learned: g ∈ R^{n×1}, b ∈ R^{n×1}, V ∈ R^{d×n}.
2: Output: pre-activation Z ∈ R^{n×m}.
3: Compute the centered weight: V̂ = V - (1/d) 1_d (1_d^T V).
4: for i = 1 to n do
5:   Calculate the normalized weight of the i-th neuron: w_i = v̂_i / ||v̂_i||.
6: end for
7: Calculate: Ẑ = W^T X.
8: Calculate the pre-activation: Z = (g 1_m^T) ⊙ Ẑ + b 1_m^T.

Algorithm 2 Back-propagation pass of the linear mapping with centered weight normalization.
1: Input: the pre-activation derivative ∂L/∂Z ∈ R^{n×m}, and the auxiliary variables from the respective forward pass: V̂, W, Ẑ, X, g.
2: Output: the gradient with respect to the input, ∂L/∂X ∈ R^{d×m}, and the learnable parameters: ∂L/∂g ∈ R^{1×n}, ∂L/∂b ∈ R^{1×n}, ∂L/∂V ∈ R^{d×n}.
3: ∂L/∂g = 1_m^T (∂L/∂Z ⊙ Ẑ)^T
4: ∂L/∂b = 1_m^T (∂L/∂Z)^T
5: ∂L/∂Ẑ = ∂L/∂Z ⊙ (g 1_m^T)
6: ∂L/∂X = W (∂L/∂Ẑ)
7: ∂L/∂W = X (∂L/∂Ẑ)^T
8: for i = 1 to n do
9:   ∂L/∂v_i = (1/||v̂_i||) ( ∂L/∂w_i - (∂L/∂w_i w_i) w_i^T - (1/d)(∂L/∂w_i 1_d) 1_d^T )
10: end for

Convolutional layer For a convolutional layer, the weight of each feature map is w_c ∈ R^{d×F_h×F_w}, where F_h and F_w indicate the height and width of the filter and d is the number of input feature maps. We unroll w_c as an (F_h · F_w · d)-dimensional vector w, and the same normalization can then be applied directly to the unrolled w.
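As a concrete illustration of the wrapped module, here is a minimal PyTorch-style sketch (ours, not the authors' Torch implementation). It combines the forward pass of Algorithm 1 with the per-filter unrolling described above for convolutional layers, folds the per-channel scale g into the weight (equivalent to scaling the pre-activation), and relies on automatic differentiation in place of the hand-derived Algorithm 2. The class name CWNConv2d is our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CWNConv2d(nn.Module):
    """Convolution whose filters are centered and normalized per output channel (sketch)."""
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        # Proxy parameter V; the effective weight W is recomputed at every forward pass.
        self.v = nn.Parameter(torch.randn(out_channels, in_channels, kernel_size, kernel_size) * 0.05)
        self.g = nn.Parameter(torch.ones(out_channels))    # learnable scale, initialized to 1
        self.b = nn.Parameter(torch.zeros(out_channels))   # bias
        self.stride, self.padding = stride, padding

    def forward(self, x):
        v = self.v.flatten(1)                               # unroll each filter to a vector
        v_hat = v - v.mean(dim=1, keepdim=True)             # zero mean (centering)
        w = v_hat / v_hat.norm(dim=1, keepdim=True)         # unit norm per filter
        w = (self.g.unsqueeze(1) * w).view_as(self.v)       # scale by g, reshape back
        return F.conv2d(x, w, self.b, stride=self.stride, padding=self.padding)

# Usage: drop-in replacement for a standard 3x3 convolution.
layer = CWNConv2d(3, 16, 3, padding=1)
out = layer(torch.randn(8, 3, 32, 32))                      # -> (8, 16, 32, 32)
```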
3.2. Analysis and discussions

Next, we show that centered weight normalization enjoys certain appealing properties. It can stabilize the distribution of the pre-activation z of each neuron. To illustrate this point, we introduce the following proposition, where we omit the bias term b to simplify the discussion.

Proposition 1. Let z = w^T h, where w^T 1 = 0 and ||w|| = 1. Assume h has a Gaussian distribution with mean E_h[h] = μ1 and covariance matrix cov(h) = σ² I, where μ ∈ R and σ ∈ R. Then E_z[z] = 0 and Var(z) = σ².

The proof of this proposition is provided in the supplementary materials. The proposition tells us that, when the assumption is satisfied, the pre-activation z of each neuron has zero mean and the same variance as the activations fed in. This property does not strictly hold in practice due to the nonlinearities in the hidden layers; however, we still empirically find that it approximately holds under our proposed centered weight normalization. Note that our method is similar to batch normalization [16]. However, batch normalization focuses on normalizing the pre-activation forcibly such that the pre-activation over the current mini-batch is zero-mean and unit-variance, which is a data-dependent normalization. Our method normalizes the weights and is expected to have the effect of zero mean and stable variance implicitly, which is a data-independent normalization. In fact, our method works well in combination with batch normalization, as described in the subsequent experiments.

Besides the good property guaranteed by Proposition 1, another remarkable observation is that the proposed re-parameterization method can make the optimization problem easier, which is supported by the following proposition.

Proposition 2. With respect to the proxy parameter v, centered weight normalization makes the gradient ∂L/∂v have the following properties: (1) zero mean, i.e., (∂L/∂v) 1 = 0; (2) orthogonality to the parameter w, i.e., (∂L/∂v) w = 0.

The derivation of Proposition 2 is also given in the supplementary materials. The zero-mean property of ∂L/∂v usually has the effect that the leading eigenvalue of the Hessian is smaller, and therefore the optimization problem is likely to become better conditioned [30]. This allows the network to learn at a higher rate, and hence converge much faster. Meanwhile, the gradient is orthogonal to the parameter w, and therefore orthogonal to v̂. Following the analysis in [28], the gradient ∂L/∂v can self-stabilize its norm, which makes the optimization of neural networks robust to the value of the learning rate.
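Both propositions are easy to check numerically; the sketch below (ours, not a proof) verifies the zero-mean, variance-preserving behaviour of Proposition 1 by Monte Carlo, and the two gradient identities of Proposition 2 via the closed form of Eqn. (3):

```python
import numpy as np

rng = np.random.default_rng(0)
d, mu, sigma = 64, 3.0, 1.5

v = rng.normal(size=d)                    # proxy parameter
v_hat = v - v.mean()
w = v_hat / np.linalg.norm(v_hat)         # centered, unit-norm weight

# Proposition 1: z = w^T h with h ~ N(mu*1, sigma^2 I) has E[z] ~ 0 and Var(z) ~ sigma^2.
H = rng.normal(mu, sigma, size=(200_000, d))
z = H @ w
print(round(z.mean(), 3), round(z.var(), 3))    # ~0.0 and ~2.25 (= sigma^2)

# Proposition 2: dL/dv from Eqn. (3) sums to zero and is orthogonal to w,
# for an arbitrary upstream gradient dL/dw.
dL_dw = rng.normal(size=d)
dL_dv = (dL_dw - dL_dw.dot(w) * w - dL_dw.mean()) / np.linalg.norm(v_hat)
print(dL_dv.sum(), dL_dv.dot(w))                # both ~0 up to floating point error
```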

Computational complexity Given mini-batch input data X ∈ R^{d×m} and n neurons, both the forward pass (Algorithm 1) and the back-propagation pass (Algorithm 2) of centered weight normalization have computational complexity O(mnd + nd). This means that our centered weight normalization has the same computational complexity as the standard linear module, since the extra cost O(nd) of centered weight normalization is negligible compared to O(mnd). For a convolutional layer with filters W ∈ R^{n×d×F_h×F_w}, given m mini-batch data {x_i ∈ R^{d×h×w}, i = 1, 2, ..., m}, where h and w represent the height and width of the feature map, the computational complexity of the standard convolutional layer is O(nmdhwF_hF_w), while that of our proposed method is O(mndhwF_hF_w + ndF_hF_w). The extra cost O(ndF_hF_w) of the proposed normalization is also negligible compared to the convolution operation.

4. Experiments

In this section, we begin with a set of ablation studies to support the preliminary analysis of the proposed method. Then, in the following three subsections, we show the effectiveness of our proposed method on general network architectures, including (1) Multi-Layer Perceptron (MLP) architectures for face and digit recognition tasks; and (2) Convolutional Neural Network (CNN) models for large-scale image classification.

Experimental protocols For all experiments, we adopt ReLUs [22] as the nonlinear activation and the negative log-likelihood as the loss function, where (x, y) represents the (input image, target class) pair. We use random weight initialization by default as described in [19] if we do not specify the weight initialization method. For the experiments on MLP architectures, the input images are resized and transformed to 1,024-dimensional vectors in gray scale, with the mean subtracted and the variance divided out.

4.1. Case study

As discussed in Section 3.2, the proposed Centered Weight Normalization (CWN) method can stabilize the distribution of the pre-activations during training under certain assumptions, and a network with the proposed re-parameterization technique is likely to become better conditioned. In this part, we conduct experiments to validate these conclusions empirically. We conduct the experiments on the Yale-B dataset (leekc/extyaledatabase/extyaleb.html) for the face recognition task. The Yale-B dataset has 2,414 images from 38 classes. We randomly sample 380 images (i.e., 10 images per class) as the test set and use the others as the training set, where all images are resized to 32×32 pixels. We train a 5-layer MLP with {128, 64, 48, 48} neurons in the hidden layers. We compare our CWN with the network without re-parameterization (referred to as 'plain' for short). In the training, we use stochastic gradient descent with a batch size of 32. For each method, the best results obtained by tuning the learning rate over {0.1, 0.2, 0.5, 1} are reported.

Figure 2. A case study on the Yale-B dataset ('-Ll' indicates the l-th layer). With respect to the training updates, we analyze (a) the mean over the mini-batch data and all neurons; (b) the variance calculated over the mini-batch data and all neurons; (c) the condition number (log-scale) of the relative FIM in the second last layer; and (d) the training loss.

Stabilize activation We investigate the mean and variance of the pre-activations in each layer during the course of training. Figure 2 (a) and (b) respectively show the mean and variance of the first and second layers, with respect to different numbers of training updates. The mean and variance are calculated over the mini-batch data and all neurons in the corresponding layers. We also observed the mean and variance of randomly selected individual neurons in each layer, which show similar behaviour.
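Such per-layer pre-activation statistics can be logged with forward hooks; below is a minimal sketch (ours; the architecture and sizes are only illustrative, assuming PyTorch):

```python
import torch
import torch.nn as nn

# Record per-layer pre-activation mean/variance with forward hooks
# (illustrative architecture; not the exact MLP used in the paper).
model = nn.Sequential(
    nn.Linear(1024, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 38),
)
stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        # `output` is the pre-activation z of this linear layer for the mini-batch
        stats.setdefault(name, []).append((output.mean().item(), output.var().item()))
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_hook(name))

model(torch.randn(32, 1024))   # one mini-batch forward pass populates `stats`
print(stats)
```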
We find that the distribution of the pre-activations in the plain network varies remarkably. For instance, the mean decreases and the variance increases gradually, even though the network has nearly converged to the minimal loss. The unstable distribution of the pre-activations may result in unstable learning, as shown in Figure 2 (d), where the loss of the plain network fluctuates sharply. By comparing the loss curve of plain and the others, we find that the pre-activations of the network with re-parameterization have a stable mean and variance, as shown in Figure 2 (a) and (b). In particular, in the first layer the pre-activation is nearly zero-mean, which empirically supports Proposition 1. These beneficial properties make the network with re-parameterization converge to an optimum region in a stable and fast way.

Conditioning analysis The condition number of the Fisher Information Matrix (FIM) is a good indicator of whether the optimization problem is easy to solve. The FIM is defined as

    F_θ = E_{x∼π} { E_{y∼P(y|x,θ)} [ (∂log p(y|x,θ)/∂θ) (∂log p(y|x,θ)/∂θ)^T ] },

and its condition number is cond(F_θ) = λ(F_θ)_max / λ(F_θ)_min, where λ(F_θ)_max and λ(F_θ)_min are the maximum and minimum of the absolute values of the eigenvalues of F_θ.
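As an illustration of this diagnostic, the sketch below (ours) estimates the Fisher matrix of a toy softmax layer by sampling labels from the model's own predictive distribution and reports the condition number over the non-null part of the spectrum; the softmax parameterization has exactly redundant directions, so the raw smallest eigenvalue is zero. This is only a toy stand-in, not the relative FIM evaluated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d, c = 2000, 10, 5
X = rng.normal(size=(n_samples, d))
W = 0.1 * rng.normal(size=(d, c))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

P = softmax(X @ W)                              # model predictive distribution P(y|x, W)
F = np.zeros((d * c, d * c))
for x, p in zip(X, P):
    y = rng.choice(c, p=p)                      # y ~ P(y|x, W), as in the FIM definition
    g = np.outer(x, np.eye(c)[y] - p).ravel()   # d log P(y|x, W) / dW, flattened
    F += np.outer(g, g)
F /= n_samples

eig = np.abs(np.linalg.eigvalsh(F))
eig = eig[eig > 1e-8 * eig.max()]               # drop the exactly-redundant softmax directions
print("condition number:", eig.max() / eig.min())
```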

We evaluated the condition number of the relative FIM [32] with respect to the last two layers, considering the feasibility of the computation. We compared our method with plain and the other two re-parameterization methods: Natural Neural Network (NNN) [6] and Weight Normalization (WN) [28]. The results for the second last layer are shown in Figure 2 (c), and the last layer shows similar results, from which we can find that the re-parameterization methods achieve better conditioning compared to plain, and thus converge faster as shown in Figure 2 (d). This indicates that re-parameterization serves as an effective way to make the optimization problem easier if good properties are designed carefully. Compared to the other re-parameterization methods, our proposed method further significantly improves the conditioning and speeds up convergence. This observation is consistent with our analysis in Section 3.2 that the proposed zero-mean property of the input weight contributes to making the optimization problem easier. At the testing stage, the proposed method achieves the lowest test error, as shown in Table 1, which means that CWN not only accelerates and stabilizes training, but also has great potential to improve the generalization of the neural network.

Figure 3. Performance evaluation on the MLP architecture over permutation-invariant SVHN ((a) training error; (b) test error).

Figure 4. Comparison of different methods by (a) combining batch normalization and (b) using Adam optimization on permutation-invariant SVHN. We evaluate the training error (solid lines) and test error (lines marked with plus) with respect to the training epochs.

4.2. MLP architecture

In this part, we comprehensively evaluate our proposed method on the MLP architecture. We investigate both the training and test performance on SVHN [23]. SVHN consists of 32×32 color images of house numbers collected by Google Street View. It has 73,257 training images and 26,032 test images. We train a 6-layer MLP with 128 neurons in each hidden layer.
Table 3 presents the test for different methods, where we further find that achieves the best performances. These observations together show the great potential of in improving the generalization of the neural network. Combining batch normalization Batch normalization [16] has shown to be very helpful for speeding up the training of deep networks. Combined with batch normalization, previous re-parameterization methods [6] have shown success in improving the performance. Here we show that over the networks equipped with batch normalization, our proposed still outperforms others in all cases. In this experiment, Batch Normalization (BN) is plugged in before the nonlinearity in neural networks as suggested in [16]. With batch normalization, we build different networks using the re-parameterized linear mapping including, and, named +BN, +BN and +BN respectively. Figure 4 (a) shows their experimental results on SVHN dataset. We find that +BN and +BN have no advantages compared to BN, while +BN significantly speeds up the training and achieves better test per- 88

Combining batch normalization Batch normalization [16] has been shown to be very helpful for speeding up the training of deep networks. Combined with batch normalization, previous re-parameterization methods [6] have shown success in improving performance. Here we show that, over networks equipped with batch normalization, our proposed CWN still outperforms the others in all cases. In this experiment, Batch Normalization (BN) is plugged in before the nonlinearity as suggested in [16]. With batch normalization, we build different networks using the re-parameterized linear mappings of NNN, WN and CWN, named NNN+BN, WN+BN and CWN+BN respectively. Figure 4 (a) shows their experimental results on the SVHN dataset. We find that NNN+BN and WN+BN have no advantage over BN, while CWN+BN significantly speeds up the training and achieves better test performance. Indeed, batch normalization is invariant to the scaling of the weights, as analyzed in [4] and [24]. However, batch normalization is not re-centering invariant [4]. Therefore, our CWN can further improve the performance of batch normalization by centering the weights. Remarkably, CWN by itself achieves better performance in terms of both training speed and test error than BN, and even better than the other re-parameterized networks combined with BN. This is mainly due to the following reasons. Batch normalization can stabilize the distribution of the pre-activations and activations well; however, it also introduces noise stemming from the forceful transformation based on the mini-batch. Besides, during testing the means and variances are estimated from moving averages of the mini-batch means and variances [16, 3, 4]. These flaws may have a detrimental effect on the models. Our CWN can stabilize the distribution of the pre-activations implicitly and is deterministic, without introducing stochastic noise. Moreover, the properties described in Proposition 2 can improve the conditioning and self-stabilize the gradient norm, and thus achieve better performance.

Adam optimization We consider an alternative optimization method, Adam [17], to evaluate the performance. The initial learning rate is set by grid search over {0.001, 0.002, 0.005, 0.01}, and the best results for all methods on the SVHN dataset are shown in Figure 4 (b). Again, we find that CWN reaches the lowest errors at both the training and test stages.

Different initialization We have also tried different initialization methods: Xavier-Init [9] and He-Init [13]. The experimental setup is the same as in the previous experiments. The final test errors on both the Yale-B and SVHN datasets are shown in Table 2. We reach similar conclusions as with the random initialization [19] used in the previous experiments: CWN achieves the best test performance.

Table 2. Comparison of test errors (%) averaged over 5 independent runs on Yale-B and permutation-invariant SVHN with Xavier-Init and He-Init.

4.3. CNN architectures on CIFAR datasets

In this part, we highlight that the proposed centered weight normalization method also works well on several popular state-of-the-art CNN architectures (the details of the CNN architectures used are given in the supplementary materials), including VGG [31], GoogLeNet [34], and residual networks [14, 15]. We evaluate it on the CIFAR-10 and CIFAR-100 datasets [18], which consist of 50,000 training images and 10,000 test images from 10 and 100 classes respectively. Each input image consists of 32×32 pixels.

VGG-A We first evaluate the proposed method on the VGG-A architecture [31], where we set the number of neurons in the fully connected layers to 512 and use batch normalization in the first fully connected layer. The experiments run on the CIFAR-10 dataset, and we preprocess the dataset by subtracting the mean and dividing by the variance. We employ stochastic gradient descent as the optimization method with a mini-batch size of 256, a momentum of 0.9, and a weight decay of 0.0005. The learning rate is halved every K iterations. We initialize the learning rate lr ∈ {0.5, 1, 2, 4} and K ∈ {1, 2, 4, 8}. The hyper-parameters are chosen by grid search over a validation set of 5,000 examples taken from the training set. We report the results of plain, WN and CWN in Figure 5 (a).
It is easy to see that with the proposed CWN the training converges much faster, and CWN yields the lowest test error of 13.6%, compared to 15.89% and 14.33% for the other two methods. Note that here WN obtains even worse performance than the plain network, which is mainly because WN cannot ensure that the pre-activations are nearly zero-mean, and thus degrades performance when the network is deep.

GoogLeNet We conduct experiments on GoogLeNet. Here GoogLeNet is equipped with batch normalization [16], plugged in after the linear mappings. The dataset is preprocessed as described in [10] with global contrast normalization and ZCA whitening. We follow the simple data augmentation in which 4 pixels are padded on each side and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. The models are trained with a mini-batch size of 64, a momentum of 0.9, and a weight decay of 0.0005. The learning rate starts from 0.1 and decays exponentially every two epochs until the end of training. The results on CIFAR-10 are reported in Figure 5 (b), from which we can find that CWN obtains a slight speedup compared to plain and WN. Moreover, CWN attains the lowest test error of 4.45%, compared to 5.5% and 5.39% for the other two methods. We also obtain similar observations on CIFAR-100 (see the supplementary materials).
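The padding-and-crop augmentation described above is a standard recipe; here is a minimal torchvision sketch (ours). It covers only the crop and flip steps, since the global contrast normalization and ZCA whitening used in the paper are a separate preprocessing stage:

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

# Pad 4 pixels on each side, take a random 32x32 crop, and flip horizontally at random.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_set = CIFAR10(root="./data", train=True, download=True, transform=train_transform)
```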

Figure 5. Performance comparison on various architectures over the CIFAR datasets. We evaluate the training error (solid lines) and test error (lines marked with plus) with respect to the training epochs. (a) VGG-A architecture over CIFAR-10; (b) GoogLeNet architecture over CIFAR-10; (c) residual network architecture over CIFAR-10; (d) residual network architecture over CIFAR-100.

Residual network We also apply our module to the residual network [14] with 56 layers. We follow the same experimental protocol as described in [14] and the publicly available Torch implementation of the residual network. Figure 5 (c) and (d) show the training and test errors with respect to the training epochs on the CIFAR-10 and CIFAR-100 datasets respectively, and the test errors after training are shown in Table 3. We can find that CWN converges slightly faster than plain and WN, and achieves the best test performance.

Table 3. Comparison of test errors (%) averaged over 3 independent runs of the 56-layer residual network on the CIFAR-10 and CIFAR-100 datasets.

Computational cost We implement our module based on Torch, and the re-parameterization is wrapped around the fastest SpatialConvolution of the cudnn package. We compare the wall-clock time of the CWN module with the standard convolution of cudnn. We use a 3×3 convolution on 32×32 feature maps with 128 channels and a mini-batch size of 64. The reported times are averages over repeated runs. CWN costs 18.1 ms while the standard convolution takes 17.3 ms.

4.4. Large-scale classification on the ImageNet dataset

To show the scalability of our proposed method, we apply it to the large-scale ImageNet dataset [27] using the GoogLeNet architecture. ImageNet includes images of 1,000 classes, and is split into a training set with 1.2M images, a validation set with 50k images, and a test set with 100k images. We employ the validation set as the test set, and evaluate the classification performance based on the top-1 and top-5 errors. We use single-scale and single-crop testing to simplify the discussion. Here we again use stochastic gradient descent with a mini-batch size of 64, a momentum of 0.9, and a weight decay of 0.0001. The learning rate starts from 0.1 and decays exponentially until the end of training. The top-5 training and test error curves are shown in Figure 6 and the final test errors are shown in Table 4, from which we can find that the deep network with CWN converges to a lower training error faster and obtains a lower test error. We argue that our CWN module draws its strength in improving the performance of deep neural networks.

Figure 6. Top-5 training and test error curves of the GoogLeNet architecture on the ImageNet dataset.

Table 4. Comparison of test errors (%) of GoogLeNet on the ImageNet dataset.

5. Conclusions

In this paper we introduced a new, efficient re-parameterization method named centered weight normalization, which improves the conditioning and accelerates the convergence of deep neural network training. We validate its effectiveness in image classification tasks over both multi-layer perceptron and convolutional neural network architectures. The re-parameterization method is able to stabilize the distributions and accelerate the training of a number of popular networks. More importantly, it has very low computational overhead and can be wrapped as a linear module suitable for different types of networks. We argue that the centered weight normalization module has the potential to replace the standard linear module, and contributes to new deep architectures with easier optimization and better generalization.

Acknowledgement

This work was partially supported by NSFC, NSFC-6146, SKLSDE-17ZX-3, FL, DP, LP and the Innovation Foundation of BUAA for PhD Graduates.

References

[1] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
[2] M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In ICML, 2016.
[3] D. Arpit, Y. Zhou, B. U. Kota, and V. Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In ICML, 2016.
[4] L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
[5] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas. Predicting parameters in deep learning. In NIPS, 2013.
[6] G. Desjardins, K. Simonyan, R. Pascanu, and K. Kavukcuoglu. Natural neural networks. In NIPS, 2015.
[7] Z. Ding, M. Shao, and Y. Fu. Deep robust encoder through locality preserving low-rank dictionary. In ECCV, 2016.
[8] V. Dorobantu, P. A. Stromhaug, and J. Renteria. DizzyRNN: Reparameterizing recurrent neural networks for norm-preserving backpropagation. CoRR, 2016.
[9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[10] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[11] R. B. Grosse and J. Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In ICML, 2016.
[12] R. B. Grosse and R. Salakhutdinov. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In ICML, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.
[16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[18] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[19] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient BackProp. In Neural Networks: Tricks of the Trade, 1998.
[20] J. Martens and I. Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade (2nd ed.), 2012.
[21] D. Mishkin and J. Matas. All you need is a good init. In ICLR, 2016.
[22] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[23] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[24] B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro. Data-dependent path normalization in neural networks. CoRR, 2015.
[25] G. Qi. Hierarchically gated deep networks for semantic segmentation. In CVPR, 2016.
[26] T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformations in perceptrons. In AISTATS, 2012.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[28] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.
[29] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.
[30] N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[32] K. Sun and F. Nielsen. Relative natural gradient for learning large complex models. CoRR, 2016.
[33] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[35] Y. Tian, P. Luo, X. Wang, and X. Tang. Deep learning strong parts for pedestrian detection. In ICCV, 2015.
[36] S. Wiesler, A. Richard, R. Schlüter, and H. Ney. Mean-normalized stochastic gradient for large-scale deep learning. In ICASSP, 2014.
[37] S. Wisdom, T. Powers, J. Hershey, J. Le Roux, and L. Atlas. Full-capacity unitary recurrent neural networks. In NIPS, 2016.
[38] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. J. Smola, L. Song, and Z. Wang. Deep fried convnets. In ICCV, 2015.


arxiv: v1 [cs.lg] 9 Nov 2015 HOW FAR CAN WE GO WITHOUT CONVOLUTION: IM- PROVING FULLY-CONNECTED NETWORKS Zhouhan Lin & Roland Memisevic Université de Montréal Canada {zhouhan.lin, roland.memisevic}@umontreal.ca arxiv:1511.02580v1

More information

Maxout Networks. Hien Quoc Dang

Maxout Networks. Hien Quoc Dang Maxout Networks Hien Quoc Dang Outline Introduction Maxout Networks Description A Universal Approximator & Proof Experiments with Maxout Why does Maxout work? Conclusion 10/12/13 Hien Quoc Dang Machine

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence ESANN 0 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 7-9 April 0, idoc.com publ., ISBN 97-7707-. Stochastic Gradient

More information

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 Machine Learning for Signal Processing Neural Networks Continue Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 1 So what are neural networks?? Voice signal N.Net Transcription Image N.Net Text

More information

Convolutional Neural Network Architecture

Convolutional Neural Network Architecture Convolutional Neural Network Architecture Zhisheng Zhong Feburary 2nd, 2018 Zhisheng Zhong Convolutional Neural Network Architecture Feburary 2nd, 2018 1 / 55 Outline 1 Introduction of Convolution Motivation

More information

Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net

Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net Supplementary Material Xingang Pan 1, Ping Luo 1, Jianping Shi 2, and Xiaoou Tang 1 1 CUHK-SenseTime Joint Lab, The Chinese University

More information

Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas

Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas Center for Machine Perception Department of Cybernetics Faculty of Electrical Engineering

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

arxiv: v1 [cs.lg] 16 Jun 2017

arxiv: v1 [cs.lg] 16 Jun 2017 L 2 Regularization versus Batch and Weight Normalization arxiv:1706.05350v1 [cs.lg] 16 Jun 2017 Twan van Laarhoven Institute for Computer Science Radboud University Postbus 9010, 6500GL Nijmegen, The Netherlands

More information

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35 Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How

More information

A summary of Deep Learning without Poor Local Minima

A summary of Deep Learning without Poor Local Minima A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Neural Networks: A brief touch Yuejie Chi Department of Electrical and Computer Engineering Spring 2018 1/41 Outline

More information

Learning Deep Architectures for AI. Part II - Vijay Chakilam

Learning Deep Architectures for AI. Part II - Vijay Chakilam Learning Deep Architectures for AI - Yoshua Bengio Part II - Vijay Chakilam Limitations of Perceptron x1 W, b 0,1 1,1 y x2 weight plane output =1 output =0 There is no value for W and b such that the model

More information

Advanced Training Techniques. Prajit Ramachandran

Advanced Training Techniques. Prajit Ramachandran Advanced Training Techniques Prajit Ramachandran Outline Optimization Regularization Initialization Optimization Optimization Outline Gradient Descent Momentum RMSProp Adam Distributed SGD Gradient Noise

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

SHAKE-SHAKE REGULARIZATION OF 3-BRANCH

SHAKE-SHAKE REGULARIZATION OF 3-BRANCH SHAKE-SHAKE REGULARIZATION OF 3-BRANCH RESIDUAL NETWORKS Xavier Gastaldi xgastaldi.mba2011@london.edu ABSTRACT The method introduced in this paper aims at helping computer vision practitioners faced with

More information

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Department of Electrical and Computer Engineering Seoul National University Seoul, 08826 Korea Email: sajid@dsp.snu.ac.kr, khwang@dsp.snu.ac.kr, wysung@snu.ac.kr

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

TUTORIAL PART 1 Unsupervised Learning

TUTORIAL PART 1 Unsupervised Learning TUTORIAL PART 1 Unsupervised Learning Marc'Aurelio Ranzato Department of Computer Science Univ. of Toronto ranzato@cs.toronto.edu Co-organizers: Honglak Lee, Yoshua Bengio, Geoff Hinton, Yann LeCun, Andrew

More information

Lecture 14: Deep Generative Learning

Lecture 14: Deep Generative Learning Generative Modeling CSED703R: Deep Learning for Visual Recognition (2017F) Lecture 14: Deep Generative Learning Density estimation Reconstructing probability density function using samples Bohyung Han

More information

arxiv: v7 [cs.ne] 2 Sep 2014

arxiv: v7 [cs.ne] 2 Sep 2014 Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks Caglar Gulcehre, Kyunghyun Cho, Razvan Pascanu, and Yoshua Bengio Département d Informatique et de Recherche Opérationelle Université

More information

Learning Deep Architectures

Learning Deep Architectures Learning Deep Architectures Yoshua Bengio, U. Montreal Microsoft Cambridge, U.K. July 7th, 2009, Montreal Thanks to: Aaron Courville, Pascal Vincent, Dumitru Erhan, Olivier Delalleau, Olivier Breuleux,

More information

arxiv: v1 [cs.lg] 25 Sep 2018

arxiv: v1 [cs.lg] 25 Sep 2018 Utilizing Class Information for DNN Representation Shaping Daeyoung Choi and Wonjong Rhee Department of Transdisciplinary Studies Seoul National University Seoul, 08826, South Korea {choid, wrhee}@snu.ac.kr

More information

Layer-wise Relevance Propagation for Neural Networks with Local Renormalization Layers

Layer-wise Relevance Propagation for Neural Networks with Local Renormalization Layers Layer-wise Relevance Propagation for Neural Networks with Local Renormalization Layers Alexander Binder 1, Grégoire Montavon 2, Sebastian Lapuschkin 3, Klaus-Robert Müller 2,4, and Wojciech Samek 3 1 ISTD

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Understanding How ConvNets See

Understanding How ConvNets See Understanding How ConvNets See Slides from Andrej Karpathy Springerberg et al, Striving for Simplicity: The All Convolutional Net (ICLR 2015 workshops) CSC321: Intro to Machine Learning and Neural Networks,

More information

Variational Autoencoders

Variational Autoencoders Variational Autoencoders Recap: Story so far A classification MLP actually comprises two components A feature extraction network that converts the inputs into linearly separable features Or nearly linearly

More information

arxiv: v2 [cs.ne] 7 Apr 2015

arxiv: v2 [cs.ne] 7 Apr 2015 A Simple Way to Initialize Recurrent Networks of Rectified Linear Units arxiv:154.941v2 [cs.ne] 7 Apr 215 Quoc V. Le, Navdeep Jaitly, Geoffrey E. Hinton Google Abstract Learning long term dependencies

More information

Deep Belief Networks are compact universal approximators

Deep Belief Networks are compact universal approximators 1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation

More information

Intro to Neural Networks and Deep Learning

Intro to Neural Networks and Deep Learning Intro to Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi UVA CS 6316 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs

More information

Compressing deep neural networks

Compressing deep neural networks From Data to Decisions - M.Sc. Data Science Compressing deep neural networks Challenges and theoretical foundations Presenter: Simone Scardapane University of Exeter, UK Table of contents Introduction

More information

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems Weinan E 1 and Bing Yu 2 arxiv:1710.00211v1 [cs.lg] 30 Sep 2017 1 The Beijing Institute of Big Data Research,

More information

Hypernetwork-based Implicit Posterior Estimation and Model Averaging of Convolutional Neural Networks

Hypernetwork-based Implicit Posterior Estimation and Model Averaging of Convolutional Neural Networks Proceedings of Machine Learning Research 95:176-191, 2018 ACML 2018 Hypernetwork-based Implicit Posterior Estimation and Model Averaging of Convolutional Neural Networks Kenya Ukai ukai@ai.cs.kobe-u.ac.jp

More information

ANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS

ANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS ANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS Anonymous authors Paper under double-blind review ABSTRACT We conduct mathematical analysis on the effect of batch normalization (BN)

More information

Nonparametric Inference for Auto-Encoding Variational Bayes

Nonparametric Inference for Auto-Encoding Variational Bayes Nonparametric Inference for Auto-Encoding Variational Bayes Erik Bodin * Iman Malik * Carl Henrik Ek * Neill D. F. Campbell * University of Bristol University of Bath Variational approximations are an

More information

Based on the original slides of Hung-yi Lee

Based on the original slides of Hung-yi Lee Based on the original slides of Hung-yi Lee Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services

More information

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information Mathias Berglund, Tapani Raiko, and KyungHyun Cho Department of Information and Computer Science Aalto University

More information

arxiv: v2 [cs.lg] 16 Mar 2013

arxiv: v2 [cs.lg] 16 Mar 2013 Metric-Free Natural Gradient for Joint-Training of Boltzmann Machines arxiv:1301.3545v2 cs.lg 16 Mar 2013 Guillaume Desjardins, Razvan Pascanu, Aaron Courville and Yoshua Bengio Département d informatique

More information

RegML 2018 Class 8 Deep learning

RegML 2018 Class 8 Deep learning RegML 2018 Class 8 Deep learning Lorenzo Rosasco UNIGE-MIT-IIT June 18, 2018 Supervised vs unsupervised learning? So far we have been thinking of learning schemes made in two steps f(x) = w, Φ(x) F, x

More information

arxiv: v1 [cs.cv] 27 Mar 2017

arxiv: v1 [cs.cv] 27 Mar 2017 Active Convolution: Learning the Shape of Convolution for Image Classification Yunho Jeon EE, KAIST jyh2986@aist.ac.r Junmo Kim EE, KAIST junmo.im@aist.ac.r arxiv:170.09076v1 [cs.cv] 27 Mar 2017 Abstract

More information