Decoupled Parallel Backpropagation with Convergence Guarantee


Zhouyuan Huo 1  Bin Gu 1  Qian Yang 1  Heng Huang 1

1 Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, United States. Correspondence to: Heng Huang <henghuanghh@gmail.com>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR, Copyright 2018 by the author(s).

Abstract

The backpropagation algorithm is indispensable for training feedforward neural networks. It requires propagating error gradients sequentially from the output layer all the way back to the input layer. The backward locking in the backpropagation algorithm constrains us from updating network layers in parallel and fully leveraging the computing resources. Recently, several algorithms have been proposed for breaking the backward locking. However, their performances degrade seriously when networks are deep. In this paper, we propose the decoupled parallel backpropagation algorithm for deep learning optimization with a convergence guarantee. Firstly, we decouple the backpropagation algorithm using delayed gradients, and show that the backward locking is removed when we split the networks into multiple modules. Then, we utilize decoupled parallel backpropagation in two stochastic methods and prove that our method guarantees convergence to critical points for the non-convex problem. Finally, we perform experiments for training deep convolutional neural networks on benchmark datasets. The experimental results not only confirm our theoretical analysis, but also demonstrate that the proposed method can achieve significant speedup without loss of accuracy.

1. Introduction

We have witnessed a series of breakthroughs in computer vision using deep convolutional neural networks (LeCun et al., 2015). Most neural networks are trained using stochastic gradient descent (SGD) or its variants, in which the gradients of the networks are computed by the backpropagation algorithm (Rumelhart et al., 1988). As shown in Figure 1, the backpropagation algorithm consists of two processes: the forward pass to compute the prediction and the backward pass to compute the gradients and update the model.

Figure 1. We split a multilayer feedforward neural network into three modules. Each module is a stack of layers. The backpropagation algorithm requires running the forward pass (from 1 to 3) and the backward pass (from 4 to 6) in sequential order. For example, module A cannot perform step 6 before receiving δ_A^t, which is an output of step 5 in module B.

After computing the prediction in the forward pass, the backpropagation algorithm requires propagating the error gradients from the top (output layer) all the way back to the bottom (input layer). Therefore, in the backward pass, all layers, or more generally modules, of the network are locked until their dependencies have executed. The backward locking constrains us from updating models in parallel and fully leveraging the computing resources.

It has been shown in practice (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Huang et al., 2016) and in theory (Eldan & Shamir, 2016; Telgarsky, 2016; Bengio et al., 2009) that depth is one of the most critical factors contributing to the success of deep learning. From AlexNet with 8 layers (Krizhevsky et al., 2012) to ResNet-101 with more than one hundred layers (He et al., 2016), the forward and backward time grow from (4.31ms and 9.58ms) to (53.38ms and ms) when we train the networks on a Titan X with the input size of (Johnson, 2017). Therefore, parallelizing the backward pass can greatly reduce the training time when the backward time is about twice the forward time.

We can easily split a deep neural network into modules as in Figure 1 and distribute them across multiple GPUs. However, because of the backward locking, all GPUs are idle before receiving error gradients from the dependent modules in the backward pass.

There have been several algorithms proposed for breaking the backward locking. For example, (Jaderberg et al., 2016; Czarnecki et al., 2017) proposed to remove the lockings in backpropagation by employing additional neural networks to approximate error gradients. In the backward pass, all modules use the synthetic gradients to update the weights of the model without incurring any delay. (Nøkland, 2016; Balduzzi et al., 2015) broke the local dependencies between successive layers and made all hidden layers receive error information from the output layer directly. In (Carreira-Perpinan & Wang, 2014; Taylor et al., 2016), the authors loosened the exact connections between layers by introducing auxiliary variables. In each layer, they imposed an equality constraint between the auxiliary variable and the activation, and optimized the new problem using the Alternating Direction Method, which is easy to parallelize. However, for convolutional neural networks, the performances of all the above methods are much worse than the backpropagation algorithm when the network is deep.

In this paper, we focus on breaking the backward locking in the backpropagation algorithm for training feedforward neural networks, such that we can update models in parallel without loss of accuracy. The main contributions of our work are as follows:

- Firstly, we decouple the backpropagation using delayed gradients in Section 3, such that all modules of the network can be updated in parallel without backward locking.
- Then, we propose two stochastic algorithms using decoupled parallel backpropagation in Section 3 for deep learning optimization.
- We also provide convergence analysis for the proposed method in Section 4 and prove that it guarantees convergence to critical points for the non-convex problem.
- Finally, we perform experiments for training deep convolutional neural networks in Section 5; the experimental results verify that the proposed method can significantly speed up the training without loss of accuracy.

2. Backgrounds

We begin with a brief overview of the backpropagation algorithm for the optimization of neural networks. Suppose that we want to train a feedforward neural network with L layers, each layer l taking an input h_{l-1} and producing an activation h_l = F_l(h_{l-1}; w_l) with weight w_l. Letting d be the dimension of the weights in the network, we have w = [w_1, w_2, ..., w_L] ∈ R^d. Thus, the output of the network can be represented as h_L = F(h_0; w), where h_0 denotes the input data x. Taking a loss function f and targets y, the training problem is as follows:

min_{w = [w_1, ..., w_L]} f(F(x; w), y).   (1)

In the following context, we use f(w) for simplicity. Gradient-based methods are widely used for deep learning optimization (Robbins & Monro, 1951; Qian, 1999; Hinton et al., 2012; Kingma & Ba, 2014). In iteration t, we put a data sample x_{i(t)} into the network, where i(t) denotes the index of the sample. According to stochastic gradient descent (SGD), we update the weights of the network through:

w_l^{t+1} = w_l^t − γ_t [∇f_{l, x_{i(t)}}(w^t)]_l,   ∀ l ∈ {1, 2, ..., L},   (2)

where γ_t is the stepsize and ∇f_{l, x_{i(t)}}(w^t) ∈ R^d is the gradient of the loss function (1) with respect to the weights at layer l for data sample x_{i(t)}; all the coordinates of ∇f_{l, x_{i(t)}}(w^t) other than those of layer l are 0. We always utilize the backpropagation algorithm to compute the gradients (Rumelhart et al., 1988).
The backpropagation algorithm consists of two passes through the network: in the forward pass, the activations of all layers are calculated from l = 1 to L as follows:

h_l^t = F_l(h_{l-1}^t; w_l^t);   (3)

in the backward pass, we apply the chain rule for gradients and repeatedly propagate the error gradients through the network from the output layer l = L to the input layer l = 1:

∂f(w^t)/∂w_l^t = (∂h_l^t/∂w_l^t) · (∂f(w^t)/∂h_l^t),   (4)

∂f(w^t)/∂h_{l-1}^t = (∂h_l^t/∂h_{l-1}^t) · (∂f(w^t)/∂h_l^t),   (5)

where we let ∇f_{l, x_{i(t)}}(w^t) = ∂f(w^t)/∂w_l^t. From equations (4) and (5), it is obvious that the computation in layer l depends on the error gradient ∂f(w^t)/∂h_l^t from layer l + 1. Therefore, the backward locking constrains all layers from updating before receiving the error gradients from their dependent layers. When the network is very deep or distributed across multiple resources, the backward locking is the main bottleneck in the training process.
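To make the sequential dependency in equations (3)-(5) concrete, here is a minimal NumPy sketch of the forward and backward passes for a small fully-connected network with tanh activations and a squared-error loss; the layer sizes, activation and loss are assumptions made for the illustration and are not the networks used in this paper. The backward loop cannot process layer l until the error gradient from layer l + 1 is available, which is exactly the backward locking described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: h_l = tanh(W_l h_{l-1}) with illustrative layer sizes.
sizes = [16, 32, 32, 10]
W = [rng.standard_normal((sizes[l + 1], sizes[l])) * 0.1 for l in range(len(sizes) - 1)]

def forward(x):
    """Forward pass, Eq. (3): compute and store all activations h_0, ..., h_L."""
    h = [x]
    for W_l in W:
        h.append(np.tanh(W_l @ h[-1]))
    return h

def backward(h, y):
    """Backward pass, Eqs. (4)-(5), for the loss f = 0.5 * ||h_L - y||^2."""
    grads = [None] * len(W)
    dh = h[-1] - y                           # df/dh_L at the output layer
    for l in reversed(range(len(W))):        # from the output layer back to the input layer
        dz = dh * (1.0 - h[l + 1] ** 2)      # chain rule through tanh
        grads[l] = np.outer(dz, h[l])        # df/dW_l, Eq. (4)
        dh = W[l].T @ dz                     # df/dh_{l-1}, Eq. (5): required by layer l-1
    return grads

x, y = rng.standard_normal(sizes[0]), rng.standard_normal(sizes[-1])
activations = forward(x)
gradients = backward(activations, y)         # each layer waits for dh from the layer above
```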

3. Decoupled Parallel Backpropagation

In this section, we propose to decouple the backpropagation algorithm using delayed gradients (DDG). Suppose we split an L-layer feedforward neural network into K modules, such that the weights of the network are divided into K groups. Therefore, we have w = [w_{G(1)}, w_{G(2)}, ..., w_{G(K)}], where G(k) denotes the layer indices in group k.

Figure 2. We split a multilayer feedforward neural network into three modules (A, B and C), where each module is a stack of layers. After executing the forward pass (from 1 to 3) to predict, our proposed method allows all modules to run the backward pass (4) using delayed gradients without locking. In particular, module A can perform the backward pass using the stale error gradient δ_A^{t-2}. Meanwhile, it also receives δ_A^{t-1} from module B for the update of the next iteration.

3.1. Backpropagation Using Delayed Gradients

In iteration t, data sample x_{i(t)} is input to the network. We run the forward pass from module k = 1 to k = K. In each module, we compute the activations in sequential order as in equation (3). In the backward pass, all modules except the last one have delayed error gradients in store, such that they can execute the backward computation without locking. The last module updates with the up-to-date gradients. In particular, module k keeps the stale error gradient ∂f(w^{t-K+k})/∂h_{L_k}^{t-K+k}, where L_k denotes the last layer in module k. Therefore, the backward computation in module k is as follows:

∂f(w^{t-K+k})/∂w_l^{t-K+k} = (∂h_l^{t-K+k}/∂w_l^{t-K+k}) · (∂f(w^{t-K+k})/∂h_l^{t-K+k}),   (6)

∂f(w^{t-K+k})/∂h_{l-1}^{t-K+k} = (∂h_l^{t-K+k}/∂h_{l-1}^{t-K+k}) · (∂f(w^{t-K+k})/∂h_l^{t-K+k}),   (7)

where l ∈ G(k). Meanwhile, each module also receives an error gradient from its dependent module for further computation. From (6) and (7), we can see that the stale error gradients in the modules have different time delays. From module k = 1 to k = K, the corresponding time delays range from K − 1 to 0. Delay 0 indicates that the gradients are up-to-date. In this way, we break the backward locking and achieve parallel updates in the backward pass. Figure 2 shows an example of the decoupled backpropagation, where the error gradient is denoted δ := ∂f(w)/∂h.

3.2. Speedup of Decoupled Parallel Backpropagation

When K = 1, there is no time delay and the proposed method is equivalent to the backpropagation algorithm. When K > 1, we can distribute the network across multiple GPUs and fully leverage the computing resources. Table 1 lists the computation time when we sequentially allocate the network across K GPUs. Because T_F is necessary for computing accurate predictions, we accelerate the training by reducing the backward time. Because T_B is much larger than T_F, we can achieve a large speedup even when K is small.

Table 1. Comparison of computation time when the network is sequentially distributed across K GPUs. T_F and T_B denote the forward and backward time for the backpropagation algorithm.

Method            Computation Time
Backpropagation   T_F + T_B
DDG               T_F + T_B / K

Relation to model parallelism: Model parallelism usually refers to filter-wise parallelism (Yadan et al., 2013). For example, we split a convolutional layer with N filters into two GPUs, each part containing N/2 filters. Although filter-wise parallelism accelerates the training when we distribute the workloads across multiple GPUs, it still suffers from the backward locking. We can think of the DDG algorithm as layer-wise parallelism. It is also easy to combine filter-wise parallelism with layer-wise parallelism for further speedup.

3.3. Stochastic Methods Using Delayed Gradients

After computing the gradients of the loss function with respect to the weights of the model, we update the model using delayed gradients. Letting

∇f_{G(k), x_{i(t-K+k)}}(w^{t-K+k}) := ∂f_{x_{i(t-K+k)}}(w^{t-K+k}) / ∂w_{G(k)}^{t-K+k}  if t − K + k ≥ 0, and 0 otherwise,   (8)

for any k ∈ {1, 2, ..., K}, we update the weights in module k following SGD:

w_{G(k)}^{t+1} = w_{G(k)}^t − γ_t [∇f_{G(k), x_{i(t-K+k)}}(w^{t-K+k})]_{G(k)},   (9)

where γ_t denotes the stepsize. Different from SGD, we update the weights with delayed gradients. Besides, the delayed iteration (t − K + k) for group k is deterministic. We summarize the proposed method in Algorithm 1.

Algorithm 1 SGD-DDG
Require: Initial weights w^0 = [w_{G(1)}^0, ..., w_{G(K)}^0] ∈ R^d; stepsize sequence {γ_t}.
1: for t = 0, 1, 2, ..., T − 1 do
2:   for k = 1, ..., K in parallel do
3:     Compute the delayed gradient: g_k^t ← [∇f_{G(k), x_{i(t-K+k)}}(w^{t-K+k})]_{G(k)};
4:     Update the weights: w_{G(k)}^{t+1} ← w_{G(k)}^t − γ_t g_k^t;
5:   end for
6: end for
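As a toy illustration of the update rule in (8) and (9), the snippet below simulates only the staleness schedule of Algorithm 1: the gradient block of module k computed at iteration t − K + k is the one applied at iteration t, so the last module updates with delay 0 and the first with delay K − 1. The parameter groups and the stand-in gradient function are assumptions made for the example, and the parallel execution across modules is not shown.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
K = 3                                                        # number of modules
params = [rng.standard_normal(5) * 0.1 for _ in range(K)]    # toy weight groups w_{G(k)}
gamma = 0.1                                                  # fixed stepsize

def gradient_blocks(params, sample):
    """Stand-in for one backpropagation pass: returns one gradient block per module.
    In DDG these blocks become usable by the modules at different delays."""
    return [p + 0.01 * sample for p in params]               # illustrative only, not a real loss

# buffers[k] queues the gradient blocks module k has produced but not yet applied.
buffers = [deque() for _ in range(K)]

for t in range(20):
    sample = rng.standard_normal()
    blocks = gradient_blocks(params, sample)
    for k in range(K):
        delay = K - 1 - k                        # last module has delay 0, first has K - 1
        buffers[k].append(blocks[k])
        if len(buffers[k]) > delay:              # Eq. (8): no update until t - K + k >= 0
            stale_grad = buffers[k].popleft()    # block computed at iteration t - delay
            params[k] -= gamma * stale_grad      # Eq. (9)
```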

Moreover, we can also apply the delayed gradients to other variants of SGD, for example Adam in Algorithm 2. In each iteration, we update the weights and the moment vectors with delayed gradients. We analyze the convergence of Algorithm 1 in Section 4, which is the basis of the analysis for the other methods.

Algorithm 2 Adam-DDG
Require: Initial weights w^0 = [w_{G(1)}^0, ..., w_{G(K)}^0] ∈ R^d; stepsize γ; constant ε = 10^{-8}; exponential decay rates β_1 = 0.9 and β_2 = ; first moment vector m_{G(k)}^0 ← 0, ∀k ∈ {1, 2, ..., K}; second moment vector v_{G(k)}^0 ← 0, ∀k ∈ {1, 2, ..., K}.
1: for t = 0, 1, 2, ..., T − 1 do
2:   for k = 1, ..., K in parallel do
3:     Compute the delayed gradient: g_k^t ← [∇f_{G(k), x_{i(t-K+k)}}(w^{t-K+k})]_{G(k)};
4:     Update the biased first moment estimate: m_{G(k)}^{t+1} ← β_1 m_{G(k)}^t + (1 − β_1) g_k^t;
5:     Update the biased second moment estimate: v_{G(k)}^{t+1} ← β_2 v_{G(k)}^t + (1 − β_2) (g_k^t)^2;
6:     Compute the bias-corrected first moment estimate: m̂_{G(k)}^{t+1} ← m_{G(k)}^{t+1} / (1 − β_1^{t+1});
7:     Compute the bias-corrected second moment estimate: v̂_{G(k)}^{t+1} ← v_{G(k)}^{t+1} / (1 − β_2^{t+1});
8:     Update the weights: w_{G(k)}^{t+1} ← w_{G(k)}^t − γ m̂_{G(k)}^{t+1} / (√(v̂_{G(k)}^{t+1}) + ε);
9:   end for
10: end for

4. Convergence Analysis

In this section, we establish the convergence guarantees to critical points for Algorithm 1 when the problem is non-convex. The analysis shows that our method admits a similar convergence rate to vanilla stochastic gradient descent (Bottou et al., 2016). Throughout this paper, we make the following commonly used assumptions:

Assumption 1 (Lipschitz-continuous gradient). The gradient of f(w) is Lipschitz continuous with Lipschitz constant L > 0, such that for all w, v ∈ R^d:

‖∇f(w) − ∇f(v)‖_2 ≤ L ‖w − v‖_2.   (10)

Assumption 2 (Bounded variance). To bound the variance of the stochastic gradient, we assume that the second moment of the stochastic gradient is upper bounded, such that there exists a constant M ≥ 0 with, for any sample x_i and any w ∈ R^d:

‖∇f_{x_i}(w)‖_2^2 ≤ M.   (11)

Because the stochastic gradient is unbiased, E[∇f_{x_i}(w)] = ∇f(w), and because of the identity E‖∇f_{x_i}(w) − ∇f(w)‖_2^2 = E‖∇f_{x_i}(w)‖_2^2 − ‖∇f(w)‖_2^2, the variance of the stochastic gradient is guaranteed to be less than M.

Under Assumptions 1 and 2, we obtain the following lemma about the sequence of objective function values.

Lemma 1. Assume Assumptions 1 and 2 hold. In addition, let σ := max_t γ_{max{0, t-K+1}} / γ_t and M_K := KM + σK^4 M. The iterations in Algorithm 1 satisfy the following inequality for all t ∈ N:

E[f(w^{t+1})] ≤ f(w^t) − (γ_t / 2) ‖∇f(w^t)‖_2^2 + γ_t^2 L M_K.   (12)

From Lemma 1, we can observe that the expected decrease of the objective function is controlled by the stepsize γ_t and M_K. Therefore, we can guarantee that the values of the objective function are decreasing as long as the stepsize γ_t is small enough that −(γ_t / 2) ‖∇f(w^t)‖_2^2 + γ_t^2 L M_K is less than zero. Using the lemma above, we can analyze the convergence properties of Algorithm 1.

4.1. Fixed Stepsize γ_t

Firstly, we analyze the convergence of Algorithm 1 when γ_t is fixed and prove that the learned model converges sub-linearly to a neighborhood of the critical points.

Theorem 1. Assume Assumptions 1 and 2 hold and the fixed stepsize sequence {γ_t} satisfies γ_t = γ with γL ≤ 1 for all t ∈ {0, 1, ..., T − 1}. In addition, we assume w* to be the optimal solution to f(w) and let σ = 1 such that M_K = KM + K^4 M.
Then, the output of Algorithm 1 satisfies:

(1/T) Σ_{t=0}^{T-1} E‖∇f(w^t)‖_2^2 ≤ 2 (f(w^0) − f(w*)) / (γT) + 2γLM_K.   (13)
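For intuition, here is a brief sketch of how the bound (13) follows from Lemma 1: summing (12) over t = 0, ..., T − 1 with the fixed stepsize γ_t = γ, taking total expectations, telescoping the objective values and using f(w^T) ≥ f(w*) gives

(γ/2) Σ_{t=0}^{T-1} E‖∇f(w^t)‖_2^2 ≤ f(w^0) − f(w*) + Tγ^2 L M_K,

and dividing both sides by γT/2 yields (13).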

Figure 3. Training and testing curves regarding epochs for ResNet-8 on CIFAR-10 (panels correspond to split points at layers 1, 3, 5 and 7). Upper: loss function values regarding epochs; Bottom: Top 1 classification accuracies regarding epochs. We split the network into two modules such that there is only one split point in the network for DNI and DDG.

In Theorem 1, we can observe that when T → ∞, the average norm of the gradients is upper bounded by 2γLM_K. The number of modules K affects the value of the upper bound. Selecting a small stepsize γ allows us to reach a better neighborhood of the critical points; however, it also seriously decreases the speed of convergence.

4.2. Diminishing Stepsize γ_t

In this section, we prove that Algorithm 1 with diminishing stepsizes guarantees convergence to critical points for the non-convex problem.

Theorem 2. Assume Assumptions 1 and 2 hold and the diminishing stepsize sequence {γ_t} satisfies γ_t = γ_0 / (1 + t) with γ_t L ≤ 1 for all t ∈ {0, 1, ..., T − 1}. In addition, we assume w* to be the optimal solution to f(w) and let σ = K such that M_K = KM + K^5 M. Setting Γ_T = Σ_{t=0}^{T-1} γ_t, the output of Algorithm 1 satisfies:

(1/Γ_T) Σ_{t=0}^{T-1} γ_t E‖∇f(w^t)‖_2^2 ≤ 2 (f(w^0) − f(w*)) / Γ_T + (Σ_{t=0}^{T-1} γ_t^2) L M_K / Γ_T.   (14)

Corollary 1. Since γ_t = γ_0 / (t + 1), the stepsize requirements in (Robbins & Monro, 1951) are satisfied:

lim_{T→∞} Σ_{t=0}^{T-1} γ_t = ∞   and   lim_{T→∞} Σ_{t=0}^{T-1} γ_t^2 < ∞.   (15)

Therefore, according to Theorem 2, when T → ∞, the right-hand side of (14) converges to 0.

Corollary 2. Suppose w^s is chosen randomly from {w^t}_{t=0}^{T-1} with probabilities proportional to {γ_t}_{t=0}^{T-1}. According to Theorem 2, we can prove that Algorithm 1 guarantees convergence to critical points for the non-convex problem:

lim_{s→∞} E‖∇f(w^s)‖_2^2 = 0.   (16)

5. Experiments

In this section, we experiment with ResNet (He et al., 2016) on the image classification benchmark datasets CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). In Section 5.1, we evaluate our method by varying the positions and the number of split points in the network; in Section 5.2, we use our method to optimize deeper neural networks and show that its performance is as good as the performance of backpropagation; finally, we split and distribute ResNet-110 across GPUs in Section 5.3, with results showing that the proposed method achieves a speedup of about two times without loss of accuracy.

Implementation Details: We implement the DDG algorithm using the PyTorch library (Paszke et al., 2017). The trained network is split into K modules, where each module runs in a subprocess. The subprocesses are spawned using the multiprocessing package, such that we can fully leverage multiple processors on a given machine. Running modules in different subprocesses makes the communication difficult. To make the communication fast, we utilize the shared memory objects in the multiprocessing package. As in Figure 2, every two adjacent modules share a pair of an activation (h) and an error gradient (δ).
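As a rough sketch of this setup (an illustration under assumed sizes, not the actual implementation), the snippet below splits a stand-in sequential network into K layer-wise modules and allocates shared-memory buffers for the activation h and the error gradient δ exchanged by each pair of adjacent modules; the layer sizes, split position and buffer shapes are assumptions for the example, and the subprocess loop that would run each module is only indicated in a comment.

```python
import torch
import torch.nn as nn

# Stand-in for the real network (the paper uses ResNet); sizes are illustrative.
layers = [nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10)]

K = 2                                    # number of modules (one split point)
cuts = [0, 3, len(layers)]               # assumed split position
modules = [nn.Sequential(*layers[cuts[i]:cuts[i + 1]]) for i in range(K)]

# One pair of shared-memory buffers per split point: the activation h passed
# forward and the error gradient delta passed backward between adjacent modules.
batch_size, boundary_dim = 128, 32       # assumed tensor shape at the split point
shared_h = [torch.zeros(batch_size, boundary_dim).share_memory_() for _ in range(K - 1)]
shared_delta = [torch.zeros(batch_size, boundary_dim).share_memory_() for _ in range(K - 1)]

# In a full implementation, each module would run in its own subprocess, e.g.
#   torch.multiprocessing.Process(target=run_module, args=(modules[k], ...)),
# reading its input activation from shared_h[k - 1] and writing the error
# gradient for the previous module into shared_delta[k - 1].
```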

Figure 4. Training and testing curves regarding epochs for ResNet-8 on CIFAR-10 (panels correspond to split points at layers {1,3}, {1,3,5} and {1,3,5,7}). Upper: loss function values regarding epochs; Bottom: Top 1 classification accuracies regarding epochs. For DNI and DDG, the number of split points in the network ranges from 2 to 4.

5.1. Comparison of BP, DNI and DDG

In this section, we train ResNet-8 on CIFAR-10 on a single Titan X GPU. The architecture of ResNet-8 is given in Table 2. All experiments are run for 300 epochs and optimized using the Adam optimizer (Kingma & Ba, 2014) with a batch size of 128. The stepsize is initialized at . We augment the dataset with random cropping and random horizontal flipping, and normalize the images using the mean and standard deviation. There are three compared methods in this experiment:

- BP: the Adam optimizer in PyTorch, which uses the backpropagation algorithm with data parallelism (Rumelhart et al., 1988) to compute gradients.
- DNI: the decoupled neural interface (DNI) in (Jaderberg et al., 2016). Following (Jaderberg et al., 2016), the synthetic network is a stack of three convolutional layers with L 5x5 filters with resolution-preserving padding. The filter depth L is determined by the position of the DNI. We also input label information into the synthetic network to increase the final accuracy.
- DDG: the Adam optimizer using delayed gradients in Algorithm 2.

Table 2. Architectural details. Units denotes the number of residual units in each group. Each unit is a basic residual block without bottleneck. Channels indicates the number of filters used in each unit in each group.

Architecture   Units   Channels
ResNet-8
ResNet-56
ResNet-110

Impact of split position (depth). The position (depth) of the split points determines the number of layers using delayed gradients. Stale or synthetic gradients induce noise in the training process, affecting the convergence of the objective. Figure 3 exhibits the experimental results when there is only one split point with varying positions. From the first column, we can see that all compared methods have similar performances when the split point is at layer 1. DDG performs consistently well when we place the split point at the deeper positions 3, 5 or 7. On the contrary, the performance of DNI degrades as we vary the positions, and it cannot even converge when the split point is at layer 7.

Impact of the number of split points. From equation (7), we know that the maximum time delay is determined by the number of modules K. Theorem 2 also shows that K affects the convergence rate. In this experiment, we vary the number of split points in the network from 2 to 4 and plot the results in Figure 4. It is easy to observe that DDG performs as well as BP, regardless of the number of split points in the network. However, DNI is very unstable when we place more split points, and sometimes cannot even converge.

5.2. Optimizing Deeper Neural Networks

In this section, we employ DDG to optimize two very deep neural networks (ResNet-56 and ResNet-110) on CIFAR-10 and CIFAR-100. Each network is split into two modules at the center. We use SGD with a momentum of 0.9 and the stepsize is initialized to . Each model is trained for 300 epochs and the stepsize is divided by a factor of 10 at 1 and 225 epochs. The weight decay constant is set to . We perform the same data augmentation as in Section 5.1. Experiments are run on a single Titan X GPU.

Figure 7 presents the experimental results of BP and DDG. We do not compare with DNI because its performance is far worse when the models are deep. The figures in the first column present the convergence of the loss regarding epochs, showing that DDG and BP admit similar convergence rates. We can also observe that DDG converges faster when we compare the loss regarding computation time in the second column of Figure 7. In this experiment, the volatile GPU utilization is about % when we train the models with BP. Our method runs in two subprocesses such that it fully leverages the computing capacity of the GPU. We can draw similar conclusions when we compare the Top 1 accuracy in the third and fourth columns of Figure 7. In Table 3, we list the best Top 1 accuracy on the test data of CIFAR-10 and CIFAR-100. We can observe that DDG obtains comparable or better accuracy even when the network is deep.

Table 3. The best Top 1 classification accuracy (%) for ResNet-56 and ResNet-110 on the test data of CIFAR-10 and CIFAR-100.

Architecture   CIFAR-10 (BP)   CIFAR-10 (DDG)   CIFAR-100 (BP)   CIFAR-100 (DDG)
ResNet-56
ResNet-110

Figure 7. Training and testing curves for ResNet-56 and ResNet-110 on CIFAR-10 and CIFAR-100. Columns 1 and 2 present the loss function value regarding epochs and computation time, respectively; columns 3 and 4 present the Top 1 classification accuracy regarding epochs and computation time. For DDG, there is only one split point at the center of the network.

Figure 5. Training and testing loss curves for ResNet-110 on CIFAR-10 using 1, 2, 3 and 4 GPUs. (a) Loss function value regarding epochs. (b) Loss function value regarding computation time.

Figure 6. Computation time and the best Top 1 accuracy for ResNet-110 on the test data of CIFAR-10, for the settings BP #GPUs=1, DDG #GPUs=1, DDG #GPUs=2, DDG #GPUs=3 and DDG #GPUs=4; each bar is decomposed into forward and backward time, and the best Top 1 accuracies shown are between 93.38% and 93.41%. The leftmost bar denotes the computation time using the backpropagation algorithm on a GPU, where the forward time accounts for about 32%. We normalize the computation time of all optimization settings by the amount of time required by backpropagation.

5.3. Scaling the Number of GPUs

In this section, we split ResNet-110 into K modules and allocate them across K Titan X GPUs sequentially. We do not consider filter-wise model parallelism in this experiment. The selection of the parameters in this experiment is similar to Section 5.2. From Figure 5, we know that training the network on multiple GPUs does not affect the convergence rate. For comparison, we also count the computation time of the backpropagation algorithm on a single GPU. The computation time is worse when we run the backpropagation algorithm on multiple GPUs because of the communication overhead. In Figure 6, we can observe that the forward time only accounts for about 32% of the total computation time for the backpropagation algorithm. Therefore, backward locking is the main bottleneck. In Figure 6, it is also obvious that when we increase the number of GPUs from 2 to 4, our method reduces about 30% to % of the total computation time. In other words, DDG achieves a speedup of about 2 times without loss of accuracy when we train the networks across 4 GPUs.

6. Conclusion

In this paper, we propose the decoupled parallel backpropagation algorithm, which breaks the backward locking in the backpropagation algorithm using delayed gradients. We then apply decoupled parallel backpropagation to two stochastic methods for deep learning optimization. We also provide convergence analysis and prove that the proposed method guarantees convergence to critical points for the non-convex problem. Finally, we perform experiments on deep convolutional neural networks; the results verify that our method can accelerate the training significantly without loss of accuracy.

Acknowledgement

This work was partially supported by U.S. NIH R01 AG049371 and NSF grants IIS , IIS , DBI , IIS , IIS .

References

Balduzzi, D., Vanchinathan, H., and Buhmann, J. M. Kickback cuts backprop's red-tape: Biologically plausible credit assignment in neural networks. In AAAI.

Bengio, Y. et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. arXiv preprint.

Carreira-Perpinan, M. and Wang, W. Distributed optimization of deeply nested systems. In Artificial Intelligence and Statistics.

Czarnecki, W. M., Świrszcz, G., Jaderberg, M., Osindero, S., Vinyals, O., and Kavukcuoglu, K. Understanding synthetic gradients and decoupled neural interfaces. arXiv preprint.

Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Conference on Learning Theory.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Hinton, G., Srivastava, N., and Swersky, K. Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent.

Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten, L. Densely connected convolutional networks. arXiv preprint.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. arXiv preprint.

Johnson, J. Benchmarks for popular CNN models. https://github.com/jcjohnson/cnn-benchmarks.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553).

Nøkland, A. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch.

Qian, N. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1).

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics.

Rumelhart, D. E., Hinton, G. E., Williams, R. J., et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9.

Taylor, G., Burmeister, R., Xu, Z., Singh, B., Patel, A., and Goldstein, T. Training neural networks without gradients: A scalable ADMM approach. In International Conference on Machine Learning.

Telgarsky, M. Benefits of depth in neural networks. arXiv preprint.

Yadan, O., Adams, K., Taigman, Y., and Ranzato, M. Multi-GPU training of convnets. arXiv preprint.


More information

Safety Evaluation Model of Chemical Logistics Park Operation Based on Back Propagation Neural Network

Safety Evaluation Model of Chemical Logistics Park Operation Based on Back Propagation Neural Network 1513 A pubication of CHEMICAL ENGINEERING TRANSACTIONS VOL. 6, 017 Guest Editors: Fei Song, Haibo Wang, Fang He Copyright 017, AIDIC Servizi S.r.. ISBN 978-88-95608-60-0; ISSN 83-916 The Itaian Association

More information

SVM: Terminology 1(6) SVM: Terminology 2(6)

SVM: Terminology 1(6) SVM: Terminology 2(6) Andrew Kusiak Inteigent Systems Laboratory 39 Seamans Center he University of Iowa Iowa City, IA 54-57 SVM he maxima margin cassifier is simiar to the perceptron: It aso assumes that the data points are

More information

Supplement of Limited-memory Common-directions Method for Distributed Optimization and its Application on Empirical Risk Minimization

Supplement of Limited-memory Common-directions Method for Distributed Optimization and its Application on Empirical Risk Minimization Suppement of Limited-memory Common-directions Method for Distributed Optimization and its Appication on Empirica Risk Minimization Ching-pei Lee Po-Wei Wang Weizhu Chen Chih-Jen Lin I Introduction In this

More information

Primal and dual active-set methods for convex quadratic programming

Primal and dual active-set methods for convex quadratic programming Math. Program., Ser. A 216) 159:469 58 DOI 1.17/s117-15-966-2 FULL LENGTH PAPER Prima and dua active-set methods for convex quadratic programming Anders Forsgren 1 Phiip E. Gi 2 Eizabeth Wong 2 Received:

More information

New Efficiency Results for Makespan Cost Sharing

New Efficiency Results for Makespan Cost Sharing New Efficiency Resuts for Makespan Cost Sharing Yvonne Beischwitz a, Forian Schoppmann a, a University of Paderborn, Department of Computer Science Fürstenaee, 3302 Paderborn, Germany Abstract In the context

More information

Melodic contour estimation with B-spline models using a MDL criterion

Melodic contour estimation with B-spline models using a MDL criterion Meodic contour estimation with B-spine modes using a MDL criterion Damien Loive, Ney Barbot, Oivier Boeffard IRISA / University of Rennes 1 - ENSSAT 6 rue de Kerampont, B.P. 80518, F-305 Lannion Cedex

More information

MONTE CARLO SIMULATIONS

MONTE CARLO SIMULATIONS MONTE CARLO SIMULATIONS Current physics research 1) Theoretica 2) Experimenta 3) Computationa Monte Caro (MC) Method (1953) used to study 1) Discrete spin systems 2) Fuids 3) Poymers, membranes, soft matter

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness 1 Scheduabiity Anaysis of Deferrabe Scheduing Agorithms for Maintaining Rea-Time Data Freshness Song Han, Deji Chen, Ming Xiong, Kam-yiu Lam, Aoysius K. Mok, Krithi Ramamritham UT Austin, Emerson Process

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

Improved Min-Sum Decoding of LDPC Codes Using 2-Dimensional Normalization

Improved Min-Sum Decoding of LDPC Codes Using 2-Dimensional Normalization Improved Min-Sum Decoding of LDPC Codes sing -Dimensiona Normaization Juntan Zhang and Marc Fossorier Department of Eectrica Engineering niversity of Hawaii at Manoa Honouu, HI 968 Emai: juntan, marc@spectra.eng.hawaii.edu

More information

Fast Spectral Clustering via the Nyström Method

Fast Spectral Clustering via the Nyström Method Fast Spectra Custering via the Nyström Method Anna Choromanska, Tony Jebara, Hyungtae Kim, Mahesh Mohan 3, and Caire Monteeoni 3 Department of Eectrica Engineering, Coumbia University, NY, USA Department

More information

Some Measures for Asymmetry of Distributions

Some Measures for Asymmetry of Distributions Some Measures for Asymmetry of Distributions Georgi N. Boshnakov First version: 31 January 2006 Research Report No. 5, 2006, Probabiity and Statistics Group Schoo of Mathematics, The University of Manchester

More information

arxiv: v1 [math.fa] 23 Aug 2018

arxiv: v1 [math.fa] 23 Aug 2018 An Exact Upper Bound on the L p Lebesgue Constant and The -Rényi Entropy Power Inequaity for Integer Vaued Random Variabes arxiv:808.0773v [math.fa] 3 Aug 08 Peng Xu, Mokshay Madiman, James Mebourne Abstract

More information