An Effective Training Method For Deep Convolutional Neural Network


Yang Jiang 1*, Zeyang Dou 1*, Qun Hao 1,2, Jie Cao 1, Kun Gao 1, Xi Chen 3

1. School of Optoelectronics, Beijing Institute of Technology, Beijing, China; 2. Graduate School at Shenzhen, Tsinghua University, Tsinghua Campus, The University Town, Shenzhen, China; 3. BGI Research, Beishan Industrial Zone, Yantian District, Shenzhen, China. Correspondence: qhao@bit.edu.cn. *These authors contributed equally.

Abstract

In this paper, we propose the nonlinearity generation method to speed up and stabilize the training of deep convolutional neural networks. The proposed method modifies a family of activation functions into nonlinearity generators (NGs). NGs make the activation functions linear symmetric for their inputs to lower model capacity, and automatically introduce nonlinearity to enhance the capacity of the model during training. The proposed method can be considered an unusual form of regularization: the model parameters are obtained by training a relatively low-capacity model, which is relatively easy to optimize at the beginning, for only a few iterations, and these parameters are reused for the initialization of a higher-capacity model. We derive the upper and lower bounds of the variance of the weight variation, and show that the initial symmetric structure of NGs helps stabilize training. We evaluate the proposed method on different frameworks of convolutional neural networks over two object recognition benchmark tasks (CIFAR-10 and CIFAR-100). Experimental results showed that the proposed method allows us to (1) speed up the convergence of training, (2) allow for less careful weight initialization, (3) improve or at least maintain the performance of the model at negligible extra computational cost, and (4) easily train a very deep model.

Introduction

Convolutional neural networks (CNNs) have enabled the rapid development of a variety of applications, particularly ones related to visual recognition tasks [He et al. 2015, 2016]. There are two major reasons for this: the building of more powerful models, and the development of more efficient and robust training strategies. On the one hand, recent deep learning models are becoming increasingly complex owing to their increasing depth [He et al. 2016; Simonyan 2014; Szegedy 2015] and width [Zeiler 2014; Sermanet et al. 2013], and decreasing strides [Sermanet et al. 2013]; on the other hand, better generalization performance is obtained by using various regularization techniques [Ioffe 2015; Ba 2016; Salimans 2016], designing models of varying structure [He et al. 2016; Huang et al. 2016] and new nonlinear activation functions [Goodfellow et al. 2013; Agostinelli et al. 2014; Eisenach et al. 2016]. However, most of the aforementioned techniques follow the same underlying hypothesis: the model is highly nonlinear because many inputs directly fall into the nonlinear parts of the activation functions. Although the highly nonlinear structure improves model capacity, it leads to difficulties in training. Training problems have recently been partly addressed through carefully constructed initializations [Mishkin 2015; Salimans 2016] and batch normalization (BN) [Ioffe 2015]. However, these methods may lead to a loss of efficiency in the training of very deep neural networks. Even though ResNets outperform plain CNNs, they still require a warming-up trick to train a very deep network [He 2016]. Thus the challenges of training have not been completely solved. As a counterpart, deep neural networks with linear activation functions, which have relatively low model capacity, are relatively easy to optimize at the beginning of training, as we will show. Thus, we need to strike and maintain a balance between the difficulties in training and model capacity.
With regard to the possibility of combining the advantages of low-capacity and high-capacity models, one motivating example is the multigrid method [Bakhvalov 1966], used to accelerate the convergence of differential equation solvers using a hierarchy of discretizations. The main idea underlying the multigrid method is to speed up computation by using a coarse grid and interpolating a correction computed from it onto a fine grid. The model with low capacity can be considered a coarse approximation of the problem, and the high-capacity model corresponds to a fine approximation.

Similarly, unsupervised pre-training [Hinton 2006] and pre-training with shallow networks [Simonyan 2014] correspond to the use of models with relatively low capacity to coarsely fit the training data. The pre-trained parameters are then transferred to the complex model to carefully fit the dataset.

Figure 1: Examples of ReLU, LReLU and PReLU and their NG versions. (a) ReLU, LReLU and PReLU; (b) NG-ReLU, NG-LReLU and NG-PReLU, t = -1.

Figure 2: The values of t in layer 19 during the training procedure. The model is a 56-layer plain CNN.

Above all, it makes sense to combine the advantages of models of different capacity by first restricting capacity to coarsely fit the problem, and then endowing the model with greater capacity during the training procedure, hence enabling it to gradually perform better. Note that the symmetric structures of activation functions also make the training procedure stable, as we will show. Therefore we modify a family of activation functions, such as the Rectified Linear Unit (ReLU), Leaky ReLU (LReLU) and Parametric ReLU (PReLU), by introducing a trainable parameter t, which we call the nonlinearity generator (NG), to make the activation functions linear symmetric for the inputs at the initial stage. NGs then introduce nonlinearity during the training procedure, endowing the model with greater capacity. The introduced parameter can be easily incorporated into the backpropagation framework. The proposed method allows us to (1) speed up the convergence of training, (2) allow for less careful parameter initialization, (3) improve or at least maintain the performance of convolutional neural networks at negligible extra computational cost, and (4) easily train a very deep model.

Nonlinearity Generator Approach

We define the nonlinearity generator (NG) as follows:

NG(f(x_i)) = f(x_i - t_i) + t_i    (1)

where f is an activation function such as ReLU, LReLU or PReLU, x_i is the input of the activation function on the i-th node, and t_i is a trainable parameter controlling the linearity of the generator given the input distribution. Note that t_i is different for each node; we call equation (1) the element-wise version. We also consider a channel-wise variant: the different nodes in the same channel share the same t. If t increases, we say that the NG introduces greater nonlinearity, because this increases the probability that inputs of the activation functions fall into the nonlinear parts. If t is smaller than the minimum input, all inputs of the NG are in the linear area, making it a linear symmetric activation function for the inputs. Figure 1 compares the shapes of different activation functions (ReLU, LReLU and PReLU) and their NG versions (NG-ReLU, NG-LReLU and NG-PReLU). Given a preprocessed input image and some proper weight initialization, this property of the NG can guarantee that the model is almost linear at the initial stage. As we will show in the analysis, the capacity of this model is relatively low, making training relatively easy. Thus, difficulties in training are alleviated during the initial iterations.
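For concreteness, the channel-wise NG of Eq. (1), wrapped around ReLU, can be written in a few lines of PyTorch. This is a minimal sketch of ours, not the paper's released code; the class name NGReLU and the module structure are illustrative, and autograd differentiates through t automatically (cf. Eqs. (2)-(3) below).

import torch
import torch.nn as nn
import torch.nn.functional as F

class NGReLU(nn.Module):
    """Channel-wise NG-ReLU: NG(f(x)) = f(x - t) + t, one t per channel."""
    def __init__(self, num_channels, t0=-1.0):
        super().__init__()
        # A sufficiently negative t keeps all inputs in the linear region at
        # the start of training, giving the low-capacity initial model.
        self.t = nn.Parameter(torch.full((num_channels,), t0))

    def forward(self, x):
        t = self.t.view(1, -1, 1, 1)   # broadcast over (N, C, H, W)
        return F.relu(x - t) + t

# Drop-in replacement for nn.ReLU after a 16-channel convolution:
layer = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), NGReLU(16))
out = layer(torch.randn(2, 3, 32, 32))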

Figure 3: Training and test accuracy comparisons of different activation functions. (a) 56-layer ResNet with ReLUs and NG-ReLUs; (b) 56-layer ResNet with LReLUs and NG-LReLUs; (c) 56-layer ResNet with PReLUs and NG-PReLUs.

Figure 4: Weight variance comparisons of every layer. (a) 20-layer plain CNN; (b) 56-layer ResNet.

The parameter t can be easily optimized using the backpropagation algorithm. Consider the element-wise version for example; the derivative with respect to t_i simply follows the chain rule:

∂ε/∂t_i = (∂ε/∂f(x_i)) · (∂f(x_i)/∂t_i)    (2)

where ε represents the objective function, and

∂f(x_i)/∂t_i = 0 if x_i > t_i, and 1 if x_i ≤ t_i.    (3)

By using gradient descent, the NG can itself determine the degree of nonlinearity for each layer based on gradient information during the training process, endowing the model with greater capacity. Since the experiments have shown that the performances of the channel-wise and element-wise versions are comparable, we use the former because it introduces a very small number of extra parameters. We adopt the momentum method when updating the parameter t:

Δt_i = η Δt_i + γ ∂ε/∂t_i.    (4)
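To make Eqs. (2)-(4) concrete, here is a small NumPy sketch of the t-update for NG-ReLU. The variable names and the momentum and step-size values are illustrative assumptions; the paper's η and γ in Eq. (4) play the roles of mom and lr below.

import numpy as np

def dNG_dt(x, t):
    # Eq. (3): the derivative is 0 where x > t and 1 where x <= t.
    return np.where(x > t, 0.0, 1.0)

rng = np.random.default_rng(0)
x = rng.normal(size=8)           # inputs to one channel's activation
grad_out = rng.normal(size=8)    # upstream gradient dE/dNG(f(x))
t, delta_t = -1.0, 0.0           # the paper initializes t at -1

g = np.sum(grad_out * dNG_dt(x, t))   # Eq. (2); channel-wise t sums over nodes
mom, lr = 0.9, 0.01                   # assumed values for illustration
delta_t = mom * delta_t + lr * g      # Eq. (4): momentum accumulation
t -= delta_t                          # descend on the objective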

Analysis

NG Acts as a Regularizer

First, we use a toy example to show that an NG with strong nonlinearity improves the capacity of a model. We used a simple CNN without activation functions as a baseline to fit the test dataset CIFAR-10. The model had two convolutional layers, one dense layer without activation functions and a softmax layer. Each convolutional layer contained three 3×3 kernels, and the loss function was cross entropy. We used the stochastic gradient descent (SGD) optimizer to train the model. The initial learning rate was 1, and we divided it by 10 when the loss no longer decreased in value. The batch size was 128, and we adopted a momentum of 0.9, the same simple data augmentation as [He et al. 2016], and MSRA weight initialization [He et al. 2015]. We then added NG-ReLUs with different untrainable values of t for the two convolutional layers and made comparisons with the baseline model. Finally, we initially set t to -1 and made it trainable to test training performance. Table 1 shows the maximum training accuracies of the different models, where "None" means the baseline model without activation functions. We see that as t increased, so did the maximum training accuracy. For the NG with trainable t, the mean value of t increased by the end of training, and thus its performance was between that of t = -0.5 and t = -5. This toy example shows that the nonlinearity of the NG changes model capacity.

Table 1: Maximum training accuracy (%) for different values of t ("None" denotes the baseline model without activation functions).

Next, we experimentally show that a CNN with a less nonlinear NG is relatively easy to train. Our goal is to explore the critical depth, which is the depth beyond which the model does not converge. The test models were plain CNNs with NG-ReLUs, the different parameters t of which were untrainable. We then set t = -1 initially, made t a trainable parameter and tested the critical depth. The model structure was the same as [He et al. 2016] without BN, and the dataset was CIFAR-10. We used the same training strategy except that Xavier initialization [Glorot 2010] was used as the weight initialization for all the experiments. Table 2 shows the results.

Table 2: Critical depth of the plain CNN for different values of t.

We see that as t increased, the critical depth decreased, indicating increasing difficulty in training. As discussed previously, a greater value of t corresponds to higher model capacity, and thus models with lower capacity are relatively easy to train. Recalling the properties of the NG, it initially makes the model almost linear, and automatically enhances the model capacity during training. Thus, the NG can be considered an unusual form of regularization during the training procedure: parameters in the early stages are obtained by training a relatively low-capacity model for only a few iterations, and these parameters are reused for the initialization of a higher-capacity model. As all the inputs of the NG were in the linear area for a proper value of t, the model was almost linear in the initial stage; as the training continued, t was updated based on gradient information, hence endowing the model with greater capacity. We extracted t from the 56-layer plain CNN to see how it changed during the training procedure. Figure 2 shows the values of t in layer 19 for different epochs. We see that as the training proceeded, t became increasingly oscillatory, increasing the capacity of the model. An advantage of this strategy is that the model is less likely to overfit in the early stage of the training procedure, because we restrict the parameters to particular regions at first, following which they are expanded gradually to seek better parameters. As seen in Figure 3, although the training accuracies of ResNets with ReLUs, LReLUs and PReLUs increased steadily, their test accuracies oscillated. By comparison, the test accuracies of ResNets with NG-ReLUs, NG-LReLUs and NG-PReLUs were far more stable.

Symmetric Structure of the Activation Function Affects the Training Procedure

Many researchers have shown that an approximately symmetric structure of activation functions can speed up learning because it alleviates the mean shift problem. Mean shift correction pushes the off-diagonal blocks of the Fisher information matrix close to zero [He et al. 2015], rendering the SGD solver closer to natural gradient descent [Clevert 2015]. In this subsection, we further show that the variance of the weight variation may influence training as well. We have the following theorem.

Theorem 1: Given a fully-connected neural network with M layers, let W_l and Z_l be the weight and the input of layer l, and S_l = W_l Z_l. Then the variance of the weight variation ΔW_l has the following lower and upper bounds:

η² E²(∂ε/∂S_l) Var(Z_l) ≤ Var(ΔW_l) ≤ η² ( Var(∂ε/∂S_l) E²(Z_l) + Var(Z_l) Var(∂ε/∂S_l) + Var(Z_l) E²(∂ε/∂S_l) )    (5)

where η, ΔW_l and ε represent the learning rate, the weight variation in layer l and the objective function, respectively.

Proof: Please see the appendix.

This theorem shows that the variance of the weight variation in layer l is closely related to the expectation of the activations E(Z_l) and of the gradients E(∂ε/∂S_l). The asymmetric structure of an activation function may cause the mean shift problem [Clevert 2015]. Mean shift caused by the previous layer acts as a bias for the next layer. The more the units are correlated, the higher their mean shift [Clevert 2015].
The shift of activation expectations in turn may affect the expectation of the gradient. From Theorem 1, the mean shift problem makes the upper and lower bounds of the variance Var(ΔW_l) in different layers unstable, and thus raises the instability of the variation of the weights in different layers. The unstable variation of the weights would result in unstable weight variance during training, which hampers information flow [Glorot 2010], raising training problems when we train a very deep model.
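The bounds of Theorem 1 can be checked numerically. Below is a Monte-Carlo sketch that treats the activations Z and the gradients ∂ε/∂S as independent samples, as in the proof; the distribution parameters are arbitrary illustrations, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)
eta = 0.1
z = rng.normal(0.3, 1.0, 2000)   # activations with a mean shift
g = rng.normal(0.1, 0.5, 2000)   # gradients dE/dS

dw = -eta * np.outer(g, z)       # all m*n weight updates for one layer
var_dw = np.var(dw)
lower = eta**2 * np.mean(g)**2 * np.var(z)
upper = eta**2 * (np.var(g) * np.mean(z)**2
                  + np.var(z) * np.var(g)
                  + np.var(z) * np.mean(g)**2)
print(lower, var_dw, upper)      # lower <= Var(dW) <= upper

Pulling the means of Z and ∂ε/∂S toward zero, as the symmetric NG does, drives both bounds toward η² Var(Z) Var(∂ε/∂S), which keeps Var(ΔW) stable across layers.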

Figure 5: Learning behavior comparisons with different initializations on CIFAR-10. (a), (b), (c), (g): plain CNNs; (d), (e), (f), (h): ResNets.

To stabilize the training procedure, we want to keep the two bounds stable. Because the proposed NG is symmetric with respect to the origin in its linear area, it can pull E(Z_l) and E(∂ε/∂S_l) close to zero, making the training procedure more stable. Figure 4 shows comparisons of the weight variance for the 20-layer plain CNNs and 56-layer ResNets. Compared with ReLU, the weight variance using NG-ReLU was more stable, which supports our analysis above.

Experiments

We tested our method on two models (a plain CNN without BN and a ResNet, whose structures were the same as [He et al. 2016] and are reported in the appendix). Our GPU was a GTX 1080 Ti. We focused on fair performance comparisons of the models, not on state-of-the-art results. Thus, we used the same strategies during training. We modified the activation functions of the models, i.e., ReLU, LReLU and PReLU, into NG-ReLU, NG-LReLU and NG-PReLU, respectively, and compared the model performances with their counterparts and with scaled exponential linear units (SELUs). The datasets were CIFAR-10 (color images in 10 classes, 50k train and 10k test) and CIFAR-100 (color images in 100 classes, 50k train and 10k test). We used the same simple data augmentation method as [He et al. 2016] for CIFAR-10 and CIFAR-100. t was initialized to -1 for all experiments. For the plain CNNs, the initial learning rate was 1, and we divided it by 10 during the training procedure when the test error no longer decreased. For the ResNets, the initial learning rate was 0.1, and we divided it by 10 at 80, 120 and 160 epochs, and terminated training at 200 epochs. For both models, we used a batch size of 128, a weight decay of 0.0005 and a momentum of 0.9.
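A sketch of this ResNet training schedule in PyTorch follows. The stand-in model is illustrative (a real run would use the 56-layer ResNet with NG activations); the milestones, momentum and batch settings come from the text above.

import torch
import torch.nn as nn

model = nn.Sequential(  # stand-in for a ResNet-56 with NG activations
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 120, 160], gamma=0.1)  # divide lr by 10

for epoch in range(200):   # terminate training at 200 epochs
    pass                   # forward/backward passes over CIFAR-10 go here
    scheduler.step()       # advance the schedule once per epoch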

Note that we did not use other techniques, such as dropout [Srivastava et al. 2014], in order to reduce the effect of random factors on the experimental results to a minimum.

Learning Behavior

We evaluated our method on the plain CNNs and ResNets over CIFAR-10. The test models were 20-layer plain CNNs and 56-layer ResNets with different activation functions, i.e., ReLUs, LReLUs, PReLUs, their NG versions (NG-ReLUs, NG-LReLUs, NG-PReLUs) and SELUs. Three parameter initialization methods, Xavier initialization, MSRA initialization and orthogonal initialization [Saxe et al. 2013], were used to test the behavior of the models. Figure 5 shows the training accuracy comparisons of the different models. We see that the models with our method were insensitive to the different initializations, while the other models were not. In addition, the models with NGs converged much faster than their counterparts in all cases. For the ResNets, our methods were still robust to the initializations and converged much faster than their counterparts, while the other models were relatively sensitive to the initializations.

Model Performance

We tested and compared various models, including 44-layer and 56-layer plain CNNs and ResNets, on the CIFAR-10 and CIFAR-100 datasets. MSRA initialization was used. Note that the plain CNNs with ReLUs, LReLUs, PReLUs and SELUs did not converge when the depths were more than 20 layers, whereas the models with NGs still converged. Table 3 shows the test accuracy comparisons. Although our method first trained the models with relatively low capacity, they still produced better or at least comparable performance compared with their counterparts and SELUs in most cases.

Table 3: Test accuracy comparisons on different datasets ("-" means not convergent).

| Model | Dataset | ReLU | LReLU | PReLU | SELU | NG-ReLU | NG-LReLU | NG-PReLU |
| Plain-44 | CIFAR-10 | - | - | - | - | | 88.44% | 89.03% |
| Plain-56 | CIFAR-10 | - | - | - | - | | 87.56% | 87.95% |
| ResNet-44 | CIFAR-10 | | 92.71% | 93.3% | 92.66% | 93.48% | 93.18% | 93.33% |
| ResNet-56 | CIFAR-10 | | 92.18% | 93.8% | 92.57% | 93.90% | 93.56% | 93.49% |
| Plain-44 | CIFAR-100 | - | - | - | - | | 61.5% | 61.17% |
| Plain-56 | CIFAR-100 | - | - | - | - | | 58.30% | 61.38% |
| ResNet-44 | CIFAR-100 | | 71.9% | 79% | 71.11% | 71.8% | 71.69% | 71.99% |
| ResNet-56 | CIFAR-100 | | 71.34% | 72.3% | 79% | 71.90% | 72.06% | 72.8% |

Our method can be used with BN to further improve performance. We used CIFAR-10 as the test dataset and tested the performance of the plain CNNs with BN using ReLUs and NG-ReLUs, as shown in Table 4. The test accuracy using NG-ReLUs with BN was the highest in all experiments. Please see the appendix for more comparisons of the plain CNNs with BN.

Table 4: Test accuracy comparisons of plain CNNs with BN.

| Model depth | ReLU with BN | NG-ReLU | NG-ReLU with BN |
| 44 | | 89.58% | 89.98% |
| 56 | | 88.7% | 89.59% |

We also used a 20-layer wider ResNet, the structure of which is shown in the appendix, to test performance. The final test accuracy was 95.43%, better than the 1001-layer pre-activation ResNet in [He et al. 2016]. Although the NG introduces new training parameters, the extra computational cost is small. For the 56-layer plain CNN with BN, NG-ReLUs took 72 seconds on average to run an epoch, whereas ReLUs and PReLUs took 68 and 77 seconds on average. For the 56-layer ResNet, NG-ReLUs took 94 seconds on average per epoch, whereas ReLUs and PReLUs took 87 and 100 seconds on average per epoch.

Exploring Very Deep Models

We explored the critical depths of the plain CNNs with NGs. The model depth was gradually increased in integral multiples of 12 layers, in the same manner as [He et al. 2016]. The test dataset was CIFAR-10. Table 5 shows the results for three weight initializations. The critical depths for our method were much greater.

Table 5: Critical depth of different models.

| Model | Critical depth |
| SELU-MSRA | <20 |
| SELU-Xavier | 44 |
| SELU-Orthogonal | 68 |
| ReLU-MSRA | 20 |
| ReLU-Xavier | <20 |
| ReLU-Orthogonal | <20 |
| LReLU-MSRA | 20 |
| LReLU-Xavier | <20 |
| LReLU-Orthogonal | <20 |
| PReLU-MSRA | 20 |
| PReLU-Xavier | 20 |
| PReLU-Orthogonal | 20 |
| NG-ReLU-MSRA | 80 |
| NG-ReLU-Xavier | 116 |
| NG-ReLU-Orthogonal | 152 |
| NG-LReLU-MSRA | 68 |
| NG-LReLU-Xavier | 104 |
| NG-LReLU-Orthogonal | 128 |
| NG-PReLU-MSRA | 128 |
| NG-PReLU-Xavier | 128 |
| NG-PReLU-Orthogonal | 152 |

Conclusion

In this paper, we proposed the nonlinearity generation method, which begins training with a relatively low-capacity model and gradually improves model capacity. The proposed training method modifies a family of activation functions by introducing a trainable parameter t to make the activation functions linear symmetric for the inputs, which yields a model with relatively low capacity that is easy to optimize at the beginning. Nonlinearity is then introduced automatically during the training procedure to endow the model with greater capacity. The introduced parameters can be easily incorporated into training. The proposed method can be considered an unusual form of regularization during the training procedure: the parameters in the early stages are obtained by training a relatively low-capacity model for only a few iterations, and are reused for the initialization of a higher-capacity model. We derived the upper and lower bounds of the variance of the weight variation and showed that the symmetric structure of NGs helps stabilize training. Experiments showed that the proposed method speeds up the convergence of training, allows for less careful initialization, improves or at least maintains the performance of CNNs at negligible extra computational cost, and can be used with BN to further improve performance. Finally, we can easily train a very deep model with the proposed method.

References

Agostinelli F, Hoffman M, Sadowski P, et al. 2014. Learning activation functions to improve deep neural networks. arXiv preprint.

Ba J L, Kiros J R, Hinton G E. 2016. Layer normalization. arXiv preprint.

Bakhvalov N S. 1966. On the convergence of a relaxation method with natural constraints on the elliptic operator. USSR Computational Mathematics and Mathematical Physics, 6(5).

Clevert D A, Unterthiner T, Hochreiter S. 2015. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint.

Eisenach C, Wang Z, Liu H. 2016. Nonparametrically learning activation functions in deep neural nets. ICLR.

Glorot X, Bengio Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.

Goodfellow I J, Warde-Farley D, Mirza M, et al. 2013. Maxout networks. arXiv preprint.

He K, Zhang X, Ren S, et al. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision.

He K, Zhang X, Ren S, et al. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

He K, Zhang X, Ren S, et al. 2016. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision. Springer.

Hinton G E, Salakhutdinov R R. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786).

Huang G, Liu Z, Weinberger K Q, et al. 2016. Densely connected convolutional networks. arXiv preprint.

Ioffe S, Szegedy C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning.

LeCun Y A, Bottou L, Orr G B, et al. 2012. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg.

Mishkin D, Matas J. 2015. All you need is a good init. arXiv preprint.

Saxe A M, McClelland J L, Ganguli S. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint.

Salimans T, Kingma D P. 2016. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Proceedings of Advances in Neural Information Processing Systems.

Sermanet P, Eigen D, Zhang X, et al. 2013. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint.

Simonyan K, Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint.

Srivastava N, Hinton G E, Krizhevsky A, et al. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1).

Szegedy C, Liu W, Jia Y, et al. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 1-9.

Zeiler M D, Fergus R. 2014. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision. Springer.

Appendix

1. Proof of Theorem 1

Proof: Recall the weight update formula for layer l:

Δw_{l,ij} = -η (∂ε/∂s_{l,i}) z_{l,j}

where i and j denote the i-th and j-th nodes of layer l+1 and layer l respectively, s_{l,i} = Σ_j w_{l,ij} z_{l,j}, z_{l,j} = f(s_{l-1,j}), f(·) is the activation function, and Δw_{l,ij} is the variation of the weight w_{l,ij}. The variance of the weight variation is

Var(ΔW_l) = (η²/(mn)) Σ_i Σ_j ( (∂ε/∂s_{l,i}) z_{l,j} - (1/(mn)) Σ_t Σ_k (∂ε/∂s_{l,t}) z_{l,k} )²

where m and n are the numbers of nodes in layer l+1 and layer l respectively. Writing ḡ = (1/m) Σ_i ∂ε/∂s_{l,i} and z̄ = (1/n) Σ_j z_{l,j}, the double sum factorizes as

Var(ΔW_l) = η² ( ((1/m) Σ_i (∂ε/∂s_{l,i})²)((1/n) Σ_j z_{l,j}²) - ḡ² z̄² ).

Using the inequality ((1/n) Σ_k a_k)² ≤ (1/n) Σ_k a_k², we have (1/m) Σ_i (∂ε/∂s_{l,i})² ≥ ḡ², and hence

Var(ΔW_l) ≥ η² ḡ² ( (1/n) Σ_j z_{l,j}² - z̄² ) = η² Var(Z_l) E²(∂ε/∂S_l),

which is the lower bound. On the other hand, decomposing

(∂ε/∂s_{l,i}) z_{l,j} - ḡ z̄ = (∂ε/∂s_{l,i} - ḡ) z̄ + (∂ε/∂s_{l,i})(z_{l,j} - z̄),

and noting that the cross term vanishes when summed over j, we obtain

Var(ΔW_l) = η² ( Var(∂ε/∂S_l) E²(Z_l) + ((1/m) Σ_i (∂ε/∂s_{l,i})²) Var(Z_l) )
          ≤ η² ( Var(∂ε/∂S_l) E²(Z_l) + Var(Z_l) Var(∂ε/∂S_l) + Var(Z_l) E²(∂ε/∂S_l) ),

which is the upper bound. This completes the proof.

2. Structure details of the test models

Plain CNN: The plain CNNs have 44 and 56 layers, with the structure shown in Table 1. [3×3, n] ×z denotes a convolutional layer with kernel size 3×3 and n kernels, repeated z times.

Table 1: Plain CNN structure.

| | 44-layer CNN | 56-layer CNN | Output size |
| input | | | 32×32 |
| conv | [3×3, 16] | [3×3, 16] | 32×32 |
| stage 1 | [3×3, 16] ×14 | [3×3, 16] ×18 | 32×32 |
| | max pooling | max pooling | 16×16 |
| stage 2 | [3×3, 32] ×14 | [3×3, 32] ×18 | 16×16 |
| | max pooling | max pooling | 8×8 |
| stage 3 | [3×3, 64] ×14 | [3×3, 64] ×18 | 8×8 |
| | global average pooling, softmax | | |

ResNet: The ResNets have 20, 44 and 56 layers, with the structure shown in Table 2. [Block, 3×3, n] ×z denotes the identity block (the same block used for CIFAR-10 in [He et al. 2016]) with kernel size 3×3 and n kernels, repeated z times; downsampling is performed with a stride of 2.

Table 2: ResNet structure.

| | 44-layer ResNet | 56-layer ResNet | 20-layer wider ResNet | Output size |
| input conv | [3×3, 16] | [3×3, 16] | [3×3, 128] | 32×32 |
| stage 1 | [Block, 3×3, 16] ×7 | [Block, 3×3, 16] ×9 | [Block, 3×3, 128] ×3 | 32×32 |
| stage 2 | [Block, 3×3, 32] ×7 | [Block, 3×3, 32] ×9 | [Block, 3×3, 256] ×3 | 16×16 |
| stage 3 | [Block, 3×3, 64] ×7 | [Block, 3×3, 64] ×9 | [Block, 3×3, 512] ×3 | 8×8 |
| | global average pooling, softmax | | | |
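The plain-CNN stack of Table 1 could be assembled as in the following PyTorch sketch. The repeat count k is forced by the layer arithmetic (1 input conv + 3k convs + 1 classifier layer: k = 14 gives 44 layers, k = 18 gives 56); plain ReLU keeps the sketch self-contained, whereas the paper's models swap in NG activations.

import torch.nn as nn

def conv_stage(c_in, c_out, k):
    layers = [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU()]
    for _ in range(k - 1):
        layers += [nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU()]
    return layers

def plain_cnn(k=14, num_classes=10):
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # input conv
        *conv_stage(16, 16, k), nn.MaxPool2d(2),
        *conv_stage(16, 32, k), nn.MaxPool2d(2),
        *conv_stage(32, 64, k),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, num_classes))                  # softmax applied in the loss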

3. Experiments for plain CNNs with batch normalization

3.1 Learning Behavior

We tested the 56-layer plain CNNs with batch normalization (BN) and made comparisons on the CIFAR-10 dataset. We used the same training strategies as in the paper. From Fig. 1 we see that, similar to the experiments with the plain CNNs without BN, the NGs were insensitive to the weight initializations, while the other models exhibited relatively different behaviors with different initializations. In addition, the models with NGs converged faster than their counterparts in most cases. The NG also enables a larger learning rate (lr): as shown in Fig. 2, the models with NG-ReLUs converge with very large learning rates, while the models with ReLUs do not.

Figure 1: Learning behavior comparisons with different initializations on CIFAR-10.

Figure 2: Learning behaviors of 56-layer plain CNNs with BN using different learning rates. (a) lr = 1; (b) lr = 1.5.
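The next subsection compares per-layer weight variances (Fig. 3). A small helper like the following (illustrative, not from the paper) can extract them from any convolutional model; the model name in the usage comment is hypothetical.

import torch.nn as nn

def layer_weight_variances(model):
    # One variance per convolutional layer, in forward order.
    return [float(m.weight.var()) for m in model.modules()
            if isinstance(m, nn.Conv2d)]

# Usage: variances = layer_weight_variances(plain_cnn_56)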

We extracted the weight variance of every layer for the 56-layer plain CNNs and made comparisons. Figure 3 shows the comparisons. Compared with ReLUs, the weight variance using NG-ReLUs is more stable, which supports our analysis in the paper.

Figure 3: Weight variance comparison of every layer for the plain CNNs. The test model is a 56-layer plain CNN with BN.

3.2 Model Performance

We tested and compared 44-layer and 56-layer plain CNNs on the CIFAR-10 and SVHN datasets. MSRA initialization was used. Table 3 shows the test accuracy comparisons. Similar to the results of the plain CNNs without BN, the NGs yield slightly better or comparable results.

Table 3: Test accuracy comparisons on different datasets.

| Model | Dataset | ReLU | LReLU | PReLU | SELU | NG-ReLU | NG-LReLU | NG-PReLU |
| Plain-44 CNN | CIFAR-10 | | 89.01% | 89.3% | 89.09% | 89.48% | 89.06% | 89.70% |
| Plain-56 CNN | CIFAR-10 | | 86.0% | 89.13% | 88.30% | 89.40% | 88.90% | 89.2% |
| Plain-44 CNN | SVHN | 95.99% | 95.80% | 96.30% | 95.39% | 96.07% | 95.91% | 96.31% |
| Plain-56 CNN | SVHN | 95.79% | 95.50% | 96.14% | 95.59% | 96.1% | 95.84% | 96.10% |

3.3 Exploring Very Deep Models

Finally, we explored the critical depths of the plain CNNs with BN. The test dataset was CIFAR-10. We could not test the learning behaviors of the plain CNNs with more than 248 layers because of GPU memory constraints, so the largest depth was 248 layers in this subsection. Table 4 shows the results for three weight initializations. The critical depths of NG-ReLU are much greater than those of ReLU.

Table 4: Critical depth of different models.

| Model | Critical depth |
| SELU-MSRA | 224 |
| SELU-Xavier | 248 |
| SELU-Orthogonal | 236 |
| ReLU-MSRA | 104 |
| ReLU-Xavier | 80 |
| ReLU-Orthogonal | 116 |
| LReLU-MSRA | 80 |
| LReLU-Xavier | 80 |
| LReLU-Orthogonal | 116 |
| PReLU-MSRA | 200 |
| PReLU-Xavier | 188 |
| PReLU-Orthogonal | 224 |
| NG-ReLU-MSRA | 248 |
| NG-ReLU-Xavier | 248 |
| NG-ReLU-Orthogonal | 236 |
| NG-LReLU-MSRA | 248 |
| NG-LReLU-Xavier | 248 |
| NG-LReLU-Orthogonal | 248 |
| NG-PReLU-MSRA | 248 |
| NG-PReLU-Xavier | 248 |
| NG-PReLU-Orthogonal | 248 |



More information

arxiv: v1 [cs.ne] 8 Apr 2016

arxiv: v1 [cs.ne] 8 Apr 2016 Norm-preservng Orthogonal Permutaton Lnear Unt Actvaton Functons (OPLU) 1 Artem Chernodub 2 and Dmtr Nowck 3 Insttute of MMS of NASU, Center for Cybernetcs, 42 Glushkova ave., Kev, Ukrane 03187 Abstract.

More information

IDENTIFICATION OF NONLINEAR SYSTEM VIA SVR OPTIMIZED BY PARTICLE SWARM ALGORITHM

IDENTIFICATION OF NONLINEAR SYSTEM VIA SVR OPTIMIZED BY PARTICLE SWARM ALGORITHM Journa of Theoretca and Apped Informaton Technoogy th February 3. Vo. 48 No. 5-3 JATIT & LLS. A rghts reserved. ISSN: 99-8645 www.att.org E-ISSN: 87-395 IDENTIFICATION OF NONLINEAR SYSTEM VIA SVR OPTIMIZED

More information

A Dissimilarity Measure Based on Singular Value and Its Application in Incremental Discounting

A Dissimilarity Measure Based on Singular Value and Its Application in Incremental Discounting A Dssmarty Measure Based on Snguar Vaue and Its Appcaton n Incrementa Dscountng KE Xaou Department of Automaton, Unversty of Scence and Technoogy of Chna, Hefe, Chna Ema: kxu@ma.ustc.edu.cn MA Lyao Department

More information

Why feed-forward networks are in a bad shape

Why feed-forward networks are in a bad shape Why feed-forward networks are n a bad shape Patrck van der Smagt, Gerd Hrznger Insttute of Robotcs and System Dynamcs German Aerospace Center (DLR Oberpfaffenhofen) 82230 Wesslng, GERMANY emal smagt@dlr.de

More information

CSC 411 / CSC D11 / CSC C11

CSC 411 / CSC D11 / CSC C11 18 Boostng s a general strategy for learnng classfers by combnng smpler ones. The dea of boostng s to take a weak classfer that s, any classfer that wll do at least slghtly better than chance and use t

More information

Decentralized Adaptive Control for a Class of Large-Scale Nonlinear Systems with Unknown Interactions

Decentralized Adaptive Control for a Class of Large-Scale Nonlinear Systems with Unknown Interactions Decentrazed Adaptve Contro for a Cass of Large-Scae onnear Systems wth Unknown Interactons Bahram Karm 1, Fatemeh Jahangr, Mohammad B. Menhaj 3, Iman Saboor 4 1. Center of Advanced Computatona Integence,

More information

Open Problem: The landscape of the loss surfaces of multilayer networks

Open Problem: The landscape of the loss surfaces of multilayer networks JMLR: Workshop and Conference Proceedngs vol 4: 5, 5 8th Annual Conference on Learnng Theory Open Problem: The landscape of the loss surfaces of multlayer networks Anna Choromanska Courant Insttute of

More information

NODAL PRICES IN THE DAY-AHEAD MARKET

NODAL PRICES IN THE DAY-AHEAD MARKET NODAL PRICES IN THE DAY-AHEAD MARET Fred Murphy Tempe Unversty AEG Meetng, Washngton, DC Sept. 7, 8 What we cover Two-stage stochastc program for contngency anayss n the day-ahead aucton. Fnd the LMPs

More information

Boostrapaggregating (Bagging)

Boostrapaggregating (Bagging) Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod

More information

Multigradient for Neural Networks for Equalizers 1

Multigradient for Neural Networks for Equalizers 1 Multgradent for Neural Netorks for Equalzers 1 Chulhee ee, Jnook Go and Heeyoung Km Department of Electrcal and Electronc Engneerng Yonse Unversty 134 Shnchon-Dong, Seodaemun-Ku, Seoul 1-749, Korea ABSTRACT

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

Multilayer Perceptron (MLP)

Multilayer Perceptron (MLP) Multlayer Perceptron (MLP) Seungjn Cho Department of Computer Scence and Engneerng Pohang Unversty of Scence and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjn@postech.ac.kr 1 / 20 Outlne

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information