Decoupled Parallel Backpropagation with Convergence Guarantee


Zhouyuan Huo 1  Bin Gu 1  Qian Yang 1  Heng Huang 1

1 Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, United States. Correspondence to: Heng Huang <henghuanghh@gmail.com>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR, Copyright 2018 by the author(s).

Abstract

The backpropagation algorithm is indispensable for training feedforward neural networks. It requires propagating error gradients sequentially from the output layer all the way back to the input layer. The backward locking in the backpropagation algorithm constrains us from updating network layers in parallel and fully leveraging the computing resources. Recently, several algorithms have been proposed for breaking the backward locking. However, their performances degrade seriously when networks are deep. In this paper, we propose the decoupled parallel backpropagation algorithm for deep learning optimization with a convergence guarantee. Firstly, we decouple the backpropagation algorithm using delayed gradients, and show that the backward locking is removed when we split the networks into multiple modules. Then, we utilize decoupled parallel backpropagation in two stochastic methods and prove that our method guarantees convergence to critical points for the non-convex problem. Finally, we perform experiments for training deep convolutional neural networks on benchmark datasets. The experimental results not only confirm our theoretical analysis, but also demonstrate that the proposed method can achieve significant speedup without loss of accuracy.

1. Introduction

We have witnessed a series of breakthroughs in computer vision using deep convolutional neural networks (LeCun et al., 2015). Most neural networks are trained using stochastic gradient descent (SGD) or its variants, in which the gradients of the networks are computed by the backpropagation algorithm (Rumelhart et al., 1988). As shown in Figure 1, the backpropagation algorithm consists of two processes: the forward pass to compute the prediction and the backward pass to compute the gradients and update the model.

Figure 1. We split a multilayer feedforward neural network into three modules. Each module is a stack of layers. The backpropagation algorithm requires running the forward pass (from 1 to 3) and the backward pass (from 4 to 6) in sequential order. For example, module A cannot perform step 6 before receiving δ_A^t, which is an output of step 5 in module B.

After computing the prediction in the forward pass, the backpropagation algorithm requires propagating the error gradients from the top (output layer) all the way back to the bottom (input layer). Therefore, in the backward pass, all layers, or more generally modules, of the network are locked until their dependencies have executed. The backward locking constrains us from updating models in parallel and fully leveraging the computing resources.

It has been shown in practice (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Huang et al., 2016) and in theory (Eldan & Shamir, 2016; Telgarsky, 2016; Bengio et al., 2009) that depth is one of the most critical factors contributing to the success of deep learning. From AlexNet with 8 layers (Krizhevsky et al., 2012) to ResNet-101 with more than one hundred layers (He et al., 2016), the forward and backward time grow from (4.31ms and 9.58ms) to (53.38ms and ms) when we train the networks on a Titan X with the input size of (Johnson, 2017). Therefore, parallelizing the backward pass can greatly reduce the training time when the backward time is about twice the forward time.

We can easily split a deep neural network into modules as in Figure 1 and distribute them across multiple GPUs. However, because of the backward locking, all GPUs are idle before receiving error gradients from the dependent modules in the backward pass.

There have been several algorithms proposed for breaking the backward locking. For example, (Jaderberg et al., 2016; Czarnecki et al., 2017) proposed to remove the lockings in backpropagation by employing additional neural networks to approximate error gradients. In the backward pass, all modules use the synthetic gradients to update the weights of the model without incurring any delay. (Nøkland, 2016; Balduzzi et al., 2015) broke the local dependencies between successive layers and made all hidden layers receive error information from the output layer directly. In (Carreira-Perpinan & Wang, 2014; Taylor et al., 2016), the authors loosened the exact connections between layers by introducing auxiliary variables. In each layer, they imposed an equality constraint between the auxiliary variable and the activation, and optimized the new problem using the Alternating Direction Method, which is easy to parallelize. However, for convolutional neural networks, the performances of all the above methods are much worse than the backpropagation algorithm when the network is deep.

In this paper, we focus on breaking the backward locking in the backpropagation algorithm for training feedforward neural networks, such that we can update models in parallel without loss of accuracy. The main contributions of our work are as follows:

- Firstly, we decouple the backpropagation using delayed gradients in Section 3, such that all modules of the network can be updated in parallel without backward locking.
- Then, we propose two stochastic algorithms using decoupled parallel backpropagation in Section 3 for deep learning optimization.
- We also provide convergence analysis for the proposed method in Section 4 and prove that it guarantees convergence to critical points for the non-convex problem.
- Finally, we perform experiments for training deep convolutional neural networks in Section 5; the experimental results verify that the proposed method can significantly speed up the training without loss of accuracy.

2. Backgrounds

We begin with a brief overview of the backpropagation algorithm for the optimization of neural networks. Suppose that we want to train a feedforward neural network with L layers, each layer l taking an input h_{l-1} and producing an activation h_l = F_l(h_{l-1}; w_l) with weight w_l. Letting d be the dimension of the weights in the network, we have w = [w_1, w_2, ..., w_L] ∈ R^d. Thus, the output of the network can be represented as h_L = F(h_0; w), where h_0 denotes the input data x. Taking a loss function f and targets y, the training problem is as follows:

min_{w = [w_1, ..., w_L]} f(F(x; w), y).   (1)

In the following context, we use f(w) for simplicity. Gradient-based methods are widely used for deep learning optimization (Robbins & Monro, 1951; Qian, 1999; Hinton et al., 2012; Kingma & Ba, 2014). In iteration t, we put a data sample x_{i(t)} into the network, where i(t) denotes the index of the sample. According to stochastic gradient descent (SGD), we update the weights of the network through:

w_l^{t+1} = w_l^t − γ_t [∇f_{l, x_{i(t)}}(w^t)]_l,   ∀ l ∈ {1, 2, ..., L},   (2)

where γ_t is the stepsize and ∇f_{l, x_{i(t)}}(w^t) ∈ R^d is the gradient of the loss function (1) with respect to the weights at layer l for data sample x_{i(t)}; all the coordinates of ∇f_{l, x_{i(t)}}(w^t) other than those of layer l are 0. We always utilize the backpropagation algorithm to compute the gradients (Rumelhart et al., 1988).
The backpropagation algorithm consists of two passes through the network: in the forward pass, the activations of all layers are calculated from l = 1 to L as follows:

h_l^t = F_l(h_{l-1}^t; w_l^t);   (3)

in the backward pass, we apply the chain rule for gradients and repeatedly propagate the error gradients through the network from the output layer l = L to the input layer l = 1:

∂f(w^t)/∂w_l^t = (∂h_l^t/∂w_l^t) · (∂f(w^t)/∂h_l^t),   (4)

∂f(w^t)/∂h_{l-1}^t = (∂h_l^t/∂h_{l-1}^t) · (∂f(w^t)/∂h_l^t),   (5)

where we let ∇f_{l, x_{i(t)}}(w^t) = ∂f(w^t)/∂w_l^t. From equations (4) and (5), it is obvious that the computation in layer l depends on the error gradient ∂f(w^t)/∂h_l^t from layer l + 1. Therefore, the backward locking constrains all layers from updating before receiving the error gradients from their dependent layers. When the network is very deep or distributed across multiple resources, the backward locking is the main bottleneck in the training process.
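To make the sequential dependency in equations (3)-(5) concrete, here is a minimal NumPy sketch of the forward and backward passes for a small fully-connected network with tanh activations and a squared-error loss; the layer sizes, activation and loss are assumptions made for the illustration and are not the networks used in this paper. The backward loop cannot process layer l until the error gradient from layer l + 1 is available, which is exactly the backward locking described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: h_l = tanh(W_l h_{l-1}) with illustrative layer sizes.
sizes = [16, 32, 32, 10]
W = [rng.standard_normal((sizes[l + 1], sizes[l])) * 0.1 for l in range(len(sizes) - 1)]

def forward(x):
    """Forward pass, Eq. (3): compute and store all activations h_0, ..., h_L."""
    h = [x]
    for W_l in W:
        h.append(np.tanh(W_l @ h[-1]))
    return h

def backward(h, y):
    """Backward pass, Eqs. (4)-(5), for the loss f = 0.5 * ||h_L - y||^2."""
    grads = [None] * len(W)
    dh = h[-1] - y                           # df/dh_L at the output layer
    for l in reversed(range(len(W))):        # from the output layer back to the input layer
        dz = dh * (1.0 - h[l + 1] ** 2)      # chain rule through tanh
        grads[l] = np.outer(dz, h[l])        # df/dW_l, Eq. (4)
        dh = W[l].T @ dz                     # df/dh_{l-1}, Eq. (5): required by layer l-1
    return grads

x, y = rng.standard_normal(sizes[0]), rng.standard_normal(sizes[-1])
activations = forward(x)
gradients = backward(activations, y)         # each layer waits for dh from the layer above
```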

3. Decoupled Parallel Backpropagation

In this section, we propose to decouple the backpropagation algorithm using delayed gradients (DDG). Suppose we split an L-layer feedforward neural network into K modules, such that the weights of the network are divided into K groups. Therefore, we have w = [w_{G(1)}, w_{G(2)}, ..., w_{G(K)}], where G(k) denotes the layer indices in group k.

Figure 2. We split a multilayer feedforward neural network into three modules (A, B and C), where each module is a stack of layers. After executing the forward pass (from 1 to 3) to predict, our proposed method allows all modules to run the backward pass (4) using delayed gradients without locking. In particular, module A can perform the backward pass using the stale error gradient δ_A^{t-2}. Meanwhile, it also receives δ_A^{t-1} from module B for the update of the next iteration.

3.1. Backpropagation Using Delayed Gradients

In iteration t, data sample x_{i(t)} is input to the network. We run the forward pass from module k = 1 to k = K. In each module, we compute the activations in sequential order as in equation (3). In the backward pass, all modules except the last one have delayed error gradients in store, such that they can execute the backward computation without locking. The last module updates with the up-to-date gradients. In particular, module k keeps the stale error gradient ∂f(w^{t-K+k})/∂h_{L_k}^{t-K+k}, where L_k denotes the last layer in module k. Therefore, the backward computation in module k is as follows:

∂f(w^{t-K+k})/∂w_l^{t-K+k} = (∂h_l^{t-K+k}/∂w_l^{t-K+k}) · (∂f(w^{t-K+k})/∂h_l^{t-K+k}),   (6)

∂f(w^{t-K+k})/∂h_{l-1}^{t-K+k} = (∂h_l^{t-K+k}/∂h_{l-1}^{t-K+k}) · (∂f(w^{t-K+k})/∂h_l^{t-K+k}),   (7)

where l ∈ G(k). Meanwhile, each module also receives an error gradient from its dependent module for further computation. From (6) and (7), we can see that the stale error gradients in the modules have different time delays. From module k = 1 to k = K, the corresponding time delays range from K − 1 to 0. Delay 0 indicates that the gradients are up-to-date. In this way, we break the backward locking and achieve parallel updates in the backward pass. Figure 2 shows an example of the decoupled backpropagation, where the error gradient is denoted δ := ∂f(w)/∂h.

3.2. Speedup of Decoupled Parallel Backpropagation

When K = 1, there is no time delay and the proposed method is equivalent to the backpropagation algorithm. When K > 1, we can distribute the network across multiple GPUs and fully leverage the computing resources. Table 1 lists the computation time when we sequentially allocate the network across K GPUs. Because T_F is necessary for computing accurate predictions, we accelerate the training by reducing the backward time. Because T_B is much larger than T_F, we can achieve a large speedup even when K is small.

Table 1. Comparison of computation time when the network is sequentially distributed across K GPUs. T_F and T_B denote the forward and backward time for the backpropagation algorithm.

Method            Computation Time
Backpropagation   T_F + T_B
DDG               T_F + T_B / K

Relation to model parallelism: Model parallelism usually refers to filter-wise parallelism (Yadan et al., 2013). For example, we split a convolutional layer with N filters into two GPUs, each part containing N/2 filters. Although filter-wise parallelism accelerates the training when we distribute the workloads across multiple GPUs, it still suffers from the backward locking. We can think of the DDG algorithm as layer-wise parallelism. It is also easy to combine filter-wise parallelism with layer-wise parallelism for further speedup.

3.3. Stochastic Methods Using Delayed Gradients

After computing the gradients of the loss function with respect to the weights of the model, we update the model using delayed gradients. Letting

∇f_{G(k), x_{i(t-K+k)}}(w^{t-K+k}) := ∂f_{x_{i(t-K+k)}}(w^{t-K+k}) / ∂w_{G(k)}^{t-K+k}  if t − K + k ≥ 0, and 0 otherwise,   (8)

for any k ∈ {1, 2, ..., K}, we update the weights in module k following SGD:

w_{G(k)}^{t+1} = w_{G(k)}^t − γ_t [∇f_{G(k), x_{i(t-K+k)}}(w^{t-K+k})]_{G(k)},   (9)

where γ_t denotes the stepsize. Different from SGD, we update the weights with delayed gradients. Besides, the delayed iteration (t − K + k) for group k is deterministic. We summarize the proposed method in Algorithm 1.

Algorithm 1 SGD-DDG
Require: Initial weights w^0 = [w_{G(1)}^0, ..., w_{G(K)}^0] ∈ R^d; stepsize sequence {γ_t}.
1: for t = 0, 1, 2, ..., T − 1 do
2:   for k = 1, ..., K in parallel do
3:     Compute the delayed gradient: g_k^t ← [∇f_{G(k), x_{i(t-K+k)}}(w^{t-K+k})]_{G(k)};
4:     Update the weights: w_{G(k)}^{t+1} ← w_{G(k)}^t − γ_t g_k^t;
5:   end for
6: end for
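As a toy illustration of the update rule in (8) and (9), the snippet below simulates only the staleness schedule of Algorithm 1: the gradient block of module k computed at iteration t − K + k is the one applied at iteration t, so the last module updates with delay 0 and the first with delay K − 1. The parameter groups and the stand-in gradient function are assumptions made for the example, and the parallel execution across modules is not shown.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
K = 3                                                        # number of modules
params = [rng.standard_normal(5) * 0.1 for _ in range(K)]    # toy weight groups w_{G(k)}
gamma = 0.1                                                  # fixed stepsize

def gradient_blocks(params, sample):
    """Stand-in for one backpropagation pass: returns one gradient block per module.
    In DDG these blocks become usable by the modules at different delays."""
    return [p + 0.01 * sample for p in params]               # illustrative only, not a real loss

# buffers[k] queues the gradient blocks module k has produced but not yet applied.
buffers = [deque() for _ in range(K)]

for t in range(20):
    sample = rng.standard_normal()
    blocks = gradient_blocks(params, sample)
    for k in range(K):
        delay = K - 1 - k                        # last module has delay 0, first has K - 1
        buffers[k].append(blocks[k])
        if len(buffers[k]) > delay:              # Eq. (8): no update until t - K + k >= 0
            stale_grad = buffers[k].popleft()    # block computed at iteration t - delay
            params[k] -= gamma * stale_grad      # Eq. (9)
```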

Moreover, we can also apply the delayed gradients to other variants of SGD, for example Adam in Algorithm 2. In each iteration, we update the weights and the moment vectors with delayed gradients. We analyze the convergence of Algorithm 1 in Section 4, which is the basis of the analysis for the other methods.

Algorithm 2 Adam-DDG
Require: Initial weights w^0 = [w_{G(1)}^0, ..., w_{G(K)}^0] ∈ R^d; stepsize γ; constant ε = 10^{-8}; exponential decay rates β_1 = 0.9 and β_2 = ; first moment vector m_{G(k)}^0 ← 0, ∀k ∈ {1, 2, ..., K}; second moment vector v_{G(k)}^0 ← 0, ∀k ∈ {1, 2, ..., K}.
1: for t = 0, 1, 2, ..., T − 1 do
2:   for k = 1, ..., K in parallel do
3:     Compute the delayed gradient: g_k^t ← [∇f_{G(k), x_{i(t-K+k)}}(w^{t-K+k})]_{G(k)};
4:     Update the biased first moment estimate: m_{G(k)}^{t+1} ← β_1 m_{G(k)}^t + (1 − β_1) g_k^t;
5:     Update the biased second moment estimate: v_{G(k)}^{t+1} ← β_2 v_{G(k)}^t + (1 − β_2) (g_k^t)^2;
6:     Compute the bias-corrected first moment estimate: m̂_{G(k)}^{t+1} ← m_{G(k)}^{t+1} / (1 − β_1^{t+1});
7:     Compute the bias-corrected second moment estimate: v̂_{G(k)}^{t+1} ← v_{G(k)}^{t+1} / (1 − β_2^{t+1});
8:     Update the weights: w_{G(k)}^{t+1} ← w_{G(k)}^t − γ m̂_{G(k)}^{t+1} / (√(v̂_{G(k)}^{t+1}) + ε);
9:   end for
10: end for

4. Convergence Analysis

In this section, we establish the convergence guarantees to critical points for Algorithm 1 when the problem is non-convex. The analysis shows that our method admits a similar convergence rate to vanilla stochastic gradient descent (Bottou et al., 2016). Throughout this paper, we make the following commonly used assumptions:

Assumption 1 (Lipschitz-continuous gradient). The gradient of f(w) is Lipschitz continuous with Lipschitz constant L > 0, such that for all w, v ∈ R^d:

‖∇f(w) − ∇f(v)‖_2 ≤ L ‖w − v‖_2.   (10)

Assumption 2 (Bounded variance). To bound the variance of the stochastic gradient, we assume that the second moment of the stochastic gradient is upper bounded, such that there exists a constant M ≥ 0 with, for any sample x_i and any w ∈ R^d:

‖∇f_{x_i}(w)‖_2^2 ≤ M.   (11)

Because the stochastic gradient is unbiased, E[∇f_{x_i}(w)] = ∇f(w), and because of the identity E‖∇f_{x_i}(w) − ∇f(w)‖_2^2 = E‖∇f_{x_i}(w)‖_2^2 − ‖∇f(w)‖_2^2, the variance of the stochastic gradient is guaranteed to be less than M.

Under Assumptions 1 and 2, we obtain the following lemma about the sequence of objective function values.

Lemma 1. Assume Assumptions 1 and 2 hold. In addition, let σ := max_t γ_{max{0, t-K+1}} / γ_t and M_K := KM + σK^4 M. The iterations in Algorithm 1 satisfy the following inequality for all t ∈ N:

E[f(w^{t+1})] ≤ f(w^t) − (γ_t / 2) ‖∇f(w^t)‖_2^2 + γ_t^2 L M_K.   (12)

From Lemma 1, we can observe that the expected decrease of the objective function is controlled by the stepsize γ_t and M_K. Therefore, we can guarantee that the values of the objective function are decreasing as long as the stepsize γ_t is small enough that −(γ_t / 2) ‖∇f(w^t)‖_2^2 + γ_t^2 L M_K is less than zero. Using the lemma above, we can analyze the convergence properties of Algorithm 1.

4.1. Fixed Stepsize γ_t

Firstly, we analyze the convergence of Algorithm 1 when γ_t is fixed and prove that the learned model converges sub-linearly to a neighborhood of the critical points.

Theorem 1. Assume Assumptions 1 and 2 hold and the fixed stepsize sequence {γ_t} satisfies γ_t = γ with γL ≤ 1 for all t ∈ {0, 1, ..., T − 1}. In addition, we assume w* to be the optimal solution to f(w) and let σ = 1 such that M_K = KM + K^4 M.
Then, the output of Algorithm 1 satisfies:

(1/T) Σ_{t=0}^{T-1} E‖∇f(w^t)‖_2^2 ≤ 2 (f(w^0) − f(w*)) / (γT) + 2γLM_K.   (13)
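For intuition, here is a brief sketch of how the bound (13) follows from Lemma 1: summing (12) over t = 0, ..., T − 1 with the fixed stepsize γ_t = γ, taking total expectations, telescoping the objective values and using f(w^T) ≥ f(w*) gives

(γ/2) Σ_{t=0}^{T-1} E‖∇f(w^t)‖_2^2 ≤ f(w^0) − f(w*) + Tγ^2 L M_K,

and dividing both sides by γT/2 yields (13).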

Figure 3. Training and testing curves regarding epochs for ResNet-8 on CIFAR-10 (panels correspond to split points at layers 1, 3, 5 and 7). Upper: loss function values regarding epochs; Bottom: Top 1 classification accuracies regarding epochs. We split the network into two modules such that there is only one split point in the network for DNI and DDG.

In Theorem 1, we can observe that when T → ∞, the average norm of the gradients is upper bounded by 2γLM_K. The number of modules K affects the value of the upper bound. Selecting a small stepsize γ allows us to reach a better neighborhood of the critical points; however, it also seriously decreases the speed of convergence.

4.2. Diminishing Stepsize γ_t

In this section, we prove that Algorithm 1 with diminishing stepsizes guarantees convergence to critical points for the non-convex problem.

Theorem 2. Assume Assumptions 1 and 2 hold and the diminishing stepsize sequence {γ_t} satisfies γ_t = γ_0 / (1 + t) with γ_t L ≤ 1 for all t ∈ {0, 1, ..., T − 1}. In addition, we assume w* to be the optimal solution to f(w) and let σ = K such that M_K = KM + K^5 M. Setting Γ_T = Σ_{t=0}^{T-1} γ_t, the output of Algorithm 1 satisfies:

(1/Γ_T) Σ_{t=0}^{T-1} γ_t E‖∇f(w^t)‖_2^2 ≤ 2 (f(w^0) − f(w*)) / Γ_T + (Σ_{t=0}^{T-1} γ_t^2) L M_K / Γ_T.   (14)

Corollary 1. Since γ_t = γ_0 / (t + 1), the stepsize requirements in (Robbins & Monro, 1951) are satisfied:

lim_{T→∞} Σ_{t=0}^{T-1} γ_t = ∞   and   lim_{T→∞} Σ_{t=0}^{T-1} γ_t^2 < ∞.   (15)

Therefore, according to Theorem 2, when T → ∞, the right-hand side of (14) converges to 0.

Corollary 2. Suppose w^s is chosen randomly from {w^t}_{t=0}^{T-1} with probabilities proportional to {γ_t}_{t=0}^{T-1}. According to Theorem 2, we can prove that Algorithm 1 guarantees convergence to critical points for the non-convex problem:

lim_{s→∞} E‖∇f(w^s)‖_2^2 = 0.   (16)

5. Experiments

In this section, we experiment with ResNet (He et al., 2016) on the image classification benchmark datasets CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). In Section 5.1, we evaluate our method by varying the positions and the number of split points in the network; in Section 5.2, we use our method to optimize deeper neural networks and show that its performance is as good as the performance of backpropagation; finally, we split and distribute ResNet-110 across GPUs in Section 5.3, with results showing that the proposed method achieves a speedup of about two times without loss of accuracy.

Implementation Details: We implement the DDG algorithm using the PyTorch library (Paszke et al., 2017). The trained network is split into K modules, where each module runs in a subprocess. The subprocesses are spawned using the multiprocessing package, such that we can fully leverage multiple processors on a given machine. Running modules in different subprocesses makes the communication difficult. To make the communication fast, we utilize the shared memory objects in the multiprocessing package. As in Figure 2, every two adjacent modules share a pair of an activation (h) and an error gradient (δ).
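As a rough sketch of this setup (an illustration under assumed sizes, not the actual implementation), the snippet below splits a stand-in sequential network into K layer-wise modules and allocates shared-memory buffers for the activation h and the error gradient δ exchanged by each pair of adjacent modules; the layer sizes, split position and buffer shapes are assumptions for the example, and the subprocess loop that would run each module is only indicated in a comment.

```python
import torch
import torch.nn as nn

# Stand-in for the real network (the paper uses ResNet); sizes are illustrative.
layers = [nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10)]

K = 2                                    # number of modules (one split point)
cuts = [0, 3, len(layers)]               # assumed split position
modules = [nn.Sequential(*layers[cuts[i]:cuts[i + 1]]) for i in range(K)]

# One pair of shared-memory buffers per split point: the activation h passed
# forward and the error gradient delta passed backward between adjacent modules.
batch_size, boundary_dim = 128, 32       # assumed tensor shape at the split point
shared_h = [torch.zeros(batch_size, boundary_dim).share_memory_() for _ in range(K - 1)]
shared_delta = [torch.zeros(batch_size, boundary_dim).share_memory_() for _ in range(K - 1)]

# In a full implementation, each module would run in its own subprocess, e.g.
#   torch.multiprocessing.Process(target=run_module, args=(modules[k], ...)),
# reading its input activation from shared_h[k - 1] and writing the error
# gradient for the previous module into shared_delta[k - 1].
```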

Figure 4. Training and testing curves regarding epochs for ResNet-8 on CIFAR-10 (panels correspond to split points at layers {1,3}, {1,3,5} and {1,3,5,7}). Upper: loss function values regarding epochs; Bottom: Top 1 classification accuracies regarding epochs. For DNI and DDG, the number of split points in the network ranges from 2 to 4.

5.1. Comparison of BP, DNI and DDG

In this section, we train ResNet-8 on CIFAR-10 on a single Titan X GPU. The architecture of ResNet-8 is given in Table 2. All experiments are run for 300 epochs and optimized using the Adam optimizer (Kingma & Ba, 2014) with a batch size of 128. The stepsize is initialized at . We augment the dataset with random cropping and random horizontal flipping, and normalize the images using the mean and standard deviation. There are three compared methods in this experiment:

- BP: the Adam optimizer in PyTorch, which uses the backpropagation algorithm with data parallelism (Rumelhart et al., 1988) to compute gradients.
- DNI: the decoupled neural interface (DNI) in (Jaderberg et al., 2016). Following (Jaderberg et al., 2016), the synthetic network is a stack of three convolutional layers with L 5x5 filters with resolution-preserving padding. The filter depth L is determined by the position of the DNI. We also input label information into the synthetic network to increase the final accuracy.
- DDG: the Adam optimizer using delayed gradients in Algorithm 2.

Table 2. Architectural details. Units denotes the number of residual units in each group. Each unit is a basic residual block without bottleneck. Channels indicates the number of filters used in each unit in each group.

Architecture   Units   Channels
ResNet-8
ResNet-56
ResNet-110

Impact of split position (depth). The position (depth) of the split points determines the number of layers using delayed gradients. Stale or synthetic gradients induce noise in the training process, affecting the convergence of the objective. Figure 3 exhibits the experimental results when there is only one split point with varying positions. From the first column, we can see that all compared methods have similar performances when the split point is at layer 1. DDG performs consistently well when we place the split point at the deeper positions 3, 5 or 7. On the contrary, the performance of DNI degrades as we vary the positions, and it cannot even converge when the split point is at layer 7.

Impact of the number of split points. From equation (7), we know that the maximum time delay is determined by the number of modules K. Theorem 2 also shows that K affects the convergence rate. In this experiment, we vary the number of split points in the network from 2 to 4 and plot the results in Figure 4. It is easy to observe that DDG performs as well as BP, regardless of the number of split points in the network. However, DNI is very unstable when we place more split points, and sometimes cannot even converge.

5.2. Optimizing Deeper Neural Networks

In this section, we employ DDG to optimize two very deep neural networks (ResNet-56 and ResNet-110) on CIFAR-10 and CIFAR-100. Each network is split into two modules at the center. We use SGD with a momentum of 0.9 and the stepsize is initialized to . Each model is trained for 300 epochs and the stepsize is divided by a factor of 10 at 1 and 225 epochs. The weight decay constant is set to . We perform the same data augmentation as in Section 5.1. Experiments are run on a single Titan X GPU.

Figure 7 presents the experimental results of BP and DDG. We do not compare with DNI because its performance is far worse when the models are deep. The figures in the first column present the convergence of the loss regarding epochs, showing that DDG and BP admit similar convergence rates. We can also observe that DDG converges faster when we compare the loss regarding computation time in the second column of Figure 7. In this experiment, the volatile GPU utilization is about % when we train the models with BP. Our method runs in two subprocesses such that it fully leverages the computing capacity of the GPU. We can draw similar conclusions when we compare the Top 1 accuracy in the third and fourth columns of Figure 7. In Table 3, we list the best Top 1 accuracy on the test data of CIFAR-10 and CIFAR-100. We can observe that DDG obtains comparable or better accuracy even when the network is deep.

Table 3. The best Top 1 classification accuracy (%) for ResNet-56 and ResNet-110 on the test data of CIFAR-10 and CIFAR-100.

Architecture   CIFAR-10 (BP)   CIFAR-10 (DDG)   CIFAR-100 (BP)   CIFAR-100 (DDG)
ResNet-56
ResNet-110

Figure 7. Training and testing curves for ResNet-56 and ResNet-110 on CIFAR-10 and CIFAR-100. Columns 1 and 2 present the loss function value regarding epochs and computation time, respectively; columns 3 and 4 present the Top 1 classification accuracy regarding epochs and computation time. For DDG, there is only one split point at the center of the network.

Figure 5. Training and testing loss curves for ResNet-110 on CIFAR-10 using 1, 2, 3 and 4 GPUs. (a) Loss function value regarding epochs. (b) Loss function value regarding computation time.

Figure 6. Computation time and the best Top 1 accuracy for ResNet-110 on the test data of CIFAR-10, for the settings BP #GPUs=1, DDG #GPUs=1, DDG #GPUs=2, DDG #GPUs=3 and DDG #GPUs=4; each bar is decomposed into forward and backward time, and the best Top 1 accuracies shown are between 93.38% and 93.41%. The leftmost bar denotes the computation time using the backpropagation algorithm on a GPU, where the forward time accounts for about 32%. We normalize the computation time of all optimization settings by the amount of time required by backpropagation.

5.3. Scaling the Number of GPUs

In this section, we split ResNet-110 into K modules and allocate them across K Titan X GPUs sequentially. We do not consider filter-wise model parallelism in this experiment. The selection of the parameters in this experiment is similar to Section 5.2. From Figure 5, we know that training the network on multiple GPUs does not affect the convergence rate. For comparison, we also count the computation time of the backpropagation algorithm on a single GPU. The computation time is worse when we run the backpropagation algorithm on multiple GPUs because of the communication overhead. In Figure 6, we can observe that the forward time only accounts for about 32% of the total computation time for the backpropagation algorithm. Therefore, backward locking is the main bottleneck. In Figure 6, it is also obvious that when we increase the number of GPUs from 2 to 4, our method reduces about 30% to % of the total computation time. In other words, DDG achieves a speedup of about 2 times without loss of accuracy when we train the networks across 4 GPUs.

6. Conclusion

In this paper, we propose the decoupled parallel backpropagation algorithm, which breaks the backward locking in the backpropagation algorithm using delayed gradients. We then apply decoupled parallel backpropagation to two stochastic methods for deep learning optimization. We also provide convergence analysis and prove that the proposed method guarantees convergence to critical points for the non-convex problem. Finally, we perform experiments on deep convolutional neural networks; the results verify that our method can accelerate the training significantly without loss of accuracy.

Acknowledgement

This work was partially supported by U.S. NIH R01 AG049371 and NSF grants IIS , IIS , DBI , IIS , IIS .

References

Balduzzi, D., Vanchinathan, H., and Buhmann, J. M. Kickback cuts backprop's red-tape: Biologically plausible credit assignment in neural networks. In AAAI.

Bengio, Y. et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. arXiv preprint.

Carreira-Perpinan, M. and Wang, W. Distributed optimization of deeply nested systems. In Artificial Intelligence and Statistics.

Czarnecki, W. M., Świrszcz, G., Jaderberg, M., Osindero, S., Vinyals, O., and Kavukcuoglu, K. Understanding synthetic gradients and decoupled neural interfaces. arXiv preprint.

Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Conference on Learning Theory.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Hinton, G., Srivastava, N., and Swersky, K. Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent.

Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten, L. Densely connected convolutional networks. arXiv preprint.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. arXiv preprint.

Johnson, J. Benchmarks for popular CNN models. https://github.com/jcjohnson/cnn-benchmarks.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553).

Nøkland, A. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch.

Qian, N. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1).

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics.

Rumelhart, D. E., Hinton, G. E., Williams, R. J., et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9.

Taylor, G., Burmeister, R., Xu, Z., Singh, B., Patel, A., and Goldstein, T. Training neural networks without gradients: A scalable ADMM approach. In International Conference on Machine Learning.

Telgarsky, M. Benefits of depth in neural networks. arXiv preprint.

Yadan, O., Adams, K., Taigman, Y., and Ranzato, M. Multi-GPU training of convnets. arXiv preprint.


More information

Safety Evaluation Model of Chemical Logistics Park Operation Based on Back Propagation Neural Network

Safety Evaluation Model of Chemical Logistics Park Operation Based on Back Propagation Neural Network 1513 A pubication of CHEMICAL ENGINEERING TRANSACTIONS VOL. 6, 017 Guest Editors: Fei Song, Haibo Wang, Fang He Copyright 017, AIDIC Servizi S.r.. ISBN 978-88-95608-60-0; ISSN 83-916 The Itaian Association

More information

SVM: Terminology 1(6) SVM: Terminology 2(6)

SVM: Terminology 1(6) SVM: Terminology 2(6) Andrew Kusiak Inteigent Systems Laboratory 39 Seamans Center he University of Iowa Iowa City, IA 54-57 SVM he maxima margin cassifier is simiar to the perceptron: It aso assumes that the data points are

More information

Supplement of Limited-memory Common-directions Method for Distributed Optimization and its Application on Empirical Risk Minimization

Supplement of Limited-memory Common-directions Method for Distributed Optimization and its Application on Empirical Risk Minimization Suppement of Limited-memory Common-directions Method for Distributed Optimization and its Appication on Empirica Risk Minimization Ching-pei Lee Po-Wei Wang Weizhu Chen Chih-Jen Lin I Introduction In this

More information

Primal and dual active-set methods for convex quadratic programming

Primal and dual active-set methods for convex quadratic programming Math. Program., Ser. A 216) 159:469 58 DOI 1.17/s117-15-966-2 FULL LENGTH PAPER Prima and dua active-set methods for convex quadratic programming Anders Forsgren 1 Phiip E. Gi 2 Eizabeth Wong 2 Received:

More information

New Efficiency Results for Makespan Cost Sharing

New Efficiency Results for Makespan Cost Sharing New Efficiency Resuts for Makespan Cost Sharing Yvonne Beischwitz a, Forian Schoppmann a, a University of Paderborn, Department of Computer Science Fürstenaee, 3302 Paderborn, Germany Abstract In the context

More information

Melodic contour estimation with B-spline models using a MDL criterion

Melodic contour estimation with B-spline models using a MDL criterion Meodic contour estimation with B-spine modes using a MDL criterion Damien Loive, Ney Barbot, Oivier Boeffard IRISA / University of Rennes 1 - ENSSAT 6 rue de Kerampont, B.P. 80518, F-305 Lannion Cedex

More information

MONTE CARLO SIMULATIONS

MONTE CARLO SIMULATIONS MONTE CARLO SIMULATIONS Current physics research 1) Theoretica 2) Experimenta 3) Computationa Monte Caro (MC) Method (1953) used to study 1) Discrete spin systems 2) Fuids 3) Poymers, membranes, soft matter

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness 1 Scheduabiity Anaysis of Deferrabe Scheduing Agorithms for Maintaining Rea-Time Data Freshness Song Han, Deji Chen, Ming Xiong, Kam-yiu Lam, Aoysius K. Mok, Krithi Ramamritham UT Austin, Emerson Process

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

Improved Min-Sum Decoding of LDPC Codes Using 2-Dimensional Normalization

Improved Min-Sum Decoding of LDPC Codes Using 2-Dimensional Normalization Improved Min-Sum Decoding of LDPC Codes sing -Dimensiona Normaization Juntan Zhang and Marc Fossorier Department of Eectrica Engineering niversity of Hawaii at Manoa Honouu, HI 968 Emai: juntan, marc@spectra.eng.hawaii.edu

More information

Fast Spectral Clustering via the Nyström Method

Fast Spectral Clustering via the Nyström Method Fast Spectra Custering via the Nyström Method Anna Choromanska, Tony Jebara, Hyungtae Kim, Mahesh Mohan 3, and Caire Monteeoni 3 Department of Eectrica Engineering, Coumbia University, NY, USA Department

More information

Some Measures for Asymmetry of Distributions

Some Measures for Asymmetry of Distributions Some Measures for Asymmetry of Distributions Georgi N. Boshnakov First version: 31 January 2006 Research Report No. 5, 2006, Probabiity and Statistics Group Schoo of Mathematics, The University of Manchester

More information

arxiv: v1 [math.fa] 23 Aug 2018

arxiv: v1 [math.fa] 23 Aug 2018 An Exact Upper Bound on the L p Lebesgue Constant and The -Rényi Entropy Power Inequaity for Integer Vaued Random Variabes arxiv:808.0773v [math.fa] 3 Aug 08 Peng Xu, Mokshay Madiman, James Mebourne Abstract

More information