Simplified Stochastic Feedforward Neural Networks


Kimin Lee, Jaehyung Kim, Song Chong, Jinwoo Shin

April 12, 2017

arXiv:1704.03188v1 [cs.LG] 11 Apr 2017

Abstract

It has been believed that stochastic feedforward neural networks (SFNNs) have several advantages beyond deterministic deep neural networks (DNNs): they have more expressive power allowing multi-modal mappings and regularize better due to their stochastic nature. However, training large-scale SFNN is notoriously harder. In this paper, we aim at developing efficient training methods for SFNN, in particular using known architectures and pre-trained parameters of DNN. To this end, we propose a new intermediate stochastic model, called Simplified-SFNN, which can be built upon any baseline DNN and approximates certain SFNN by simplifying its upper latent units above stochastic ones. The main novelty of our approach is in establishing the connection between three models, i.e., DNN → Simplified-SFNN → SFNN, which naturally leads to an efficient training procedure of the stochastic models utilizing pre-trained parameters of DNN. Using several popular DNNs, we show how they can be effectively transferred to the corresponding stochastic models for both multi-modal and classification tasks on MNIST, TFD, CASIA, CIFAR-10, CIFAR-100 and SVHN datasets. In particular, we train a stochastic model of 28 layers and 36 million parameters, where training such a large-scale stochastic network is significantly challenging without using Simplified-SFNN.

1 Introduction

Recently, deterministic deep neural networks (DNNs) have demonstrated state-of-the-art performance on many supervised tasks, e.g., speech recognition [24] and object recognition [11]. One of the main components underlying these successes is the efficient training methods for large-scale DNNs, which include backpropagation [29], stochastic gradient descent [30], dropout/dropconnect [16, 10], batch/weight normalization [17, 46], and various activation functions [33, 28]. On the other hand, stochastic feedforward neural networks (SFNNs) [5] having random latent units are often necessary in order to model the complex stochastic natures of many real-world tasks, e.g., structured prediction [8] and image generation [41]. Furthermore, it is believed that SFNN has several advantages beyond DNN [9]: it has more expressive power for multi-modal learning and regularizes better for large-scale networks.

Training large-scale SFNN is notoriously hard since backpropagation is not directly applicable. Certain stochastic neural networks using continuous random units are known to be trainable efficiently using backpropagation with variational techniques and reparameterization tricks [32, 1]. On the other hand, training SFNN having discrete, i.e., binary or multi-modal, random units is more difficult since intractable probabilistic inference is involved, requiring too many random samples.

K. Lee, J. Kim, S. Chong and J. Shin are with the School of Electrical Engineering at Korea Advanced Institute of Science Technology, Republic of Korea. Authors' e-mails: kiminlee@kaist.ac.kr, jaehyungkim@kaist.ac.kr, songchong@kaist.edu, jinwoos@kaist.ac.kr

There have been several efforts toward developing efficient training methods for SFNN having binary random latent units [5, 38, 8, 35, 9, 34] (see Section 2.1 for more details). However, training an SFNN is still significantly slower than training a DNN of the same architecture; consequently, most prior works have considered a small number (at most 5 or so) of layers in SFNN. We aim for the same goal, but our direction is complementary to theirs. Instead of training an SFNN directly, we study whether pre-trained parameters from a DNN (or easier models) can be transferred to it, possibly with further low-cost fine-tuning. This approach can be attractive since one can utilize recent advances in DNN design and training. For example, one can design the network structure of SFNN following known specialized ones of DNN and use their pre-trained parameters.

To this end, we first try transferring pre-trained parameters of DNN using sigmoid activation functions to those of the corresponding SFNN directly. In our experiments, this heuristic works reasonably well. For multi-modal learning, SFNN under such a simple transformation outperforms DNN. Even for MNIST classification, the former performs similarly to the latter (see Section 2 for more details). However, it is questionable whether a similar strategy works in general, particularly for other, unbounded activation functions like ReLU [33], since SFNN has binary, i.e., bounded, random latent units. Moreover, it loses the regularization benefit of SFNN: it is believed that transferring parameters of stochastic models to DNN helps its regularization, but the opposite is unlikely.

Contribution. To address these issues, we propose a special form of stochastic neural network, named Simplified-SFNN, which is intermediate between SFNN and DNN and has the following properties. First, Simplified-SFNN can be built upon any baseline DNN, possibly having unbounded activation functions. The most significant part of our approach lies in providing rigorous network knowledge transferring [22] between Simplified-SFNN and DNN. In particular, we prove that parameters of DNN can be transformed to those of the corresponding Simplified-SFNN while preserving the performance, i.e., both represent the same mapping. Second, Simplified-SFNN approximates certain SFNN, better than DNN, by simplifying its upper latent units above stochastic ones using two different non-linear activation functions. Simplified-SFNN is much easier to train than SFNN while still maintaining its stochastic regularization effect. We also remark that SFNN is a Bayesian network, while Simplified-SFNN is not.

The above connection DNN → Simplified-SFNN → SFNN naturally suggests the following training procedure for both SFNN and Simplified-SFNN: train a baseline DNN first and then fine-tune its corresponding Simplified-SFNN initialized by the transformed DNN parameters. The pre-training stage accelerates the training task since DNN is faster to train than Simplified-SFNN. In addition, one can also utilize known DNN training techniques such as dropout and batch normalization for fine-tuning Simplified-SFNN. In our experiments, we train SFNN and Simplified-SFNN under the proposed strategy. They consistently outperform the corresponding DNN for both multi-modal and classification tasks, where the former and the latter are for measuring the model expressive power and the regularization effect, respectively. To the best of our knowledge, we are the first to confirm that SFNN indeed regularizes better than DNN. We also construct the stochastic models following the same network structures of popular DNNs including Lenet-5 [6], NIN [13], FCN [2] and WRN [39].
In particular, WRN (wide residual network) of 28 layers and 36 million parameters has shown state-of-the-art performance on the CIFAR-10 and CIFAR-100 classification datasets, and our stochastic models built upon WRN outperform the deterministic WRN on these datasets.

Figure 1: The generated samples from (a) sigmoid-DNN and (b) SFNN which uses the same parameters trained by sigmoid-DNN. One can note that SFNN can model the multiple modes in output space y around x = 0.4.

Inference Model | Network Structure | MNIST: Training NLL | MNIST: Training Error (%) | MNIST: Test Error (%) | Multi-modal: Test NLL
sigmoid-DNN | 2 hidden layers | 0 | 0 | 1.54 | 5.90
SFNN | 2 hidden layers | 0 | 0 | 1.56 | 1.564
sigmoid-DNN | 3 hidden layers | 0.00 | 0.03 | 1.84 | 4.880
SFNN | 3 hidden layers | 0.02 | 0.04 | 1.81 | 0.575
sigmoid-DNN | 4 hidden layers | 0 | 0.01 | 1.74 | 4.850
SFNN | 4 hidden layers | 0.003 | 0.03 | 1.73 | 0.39
ReLU-DNN | 2 hidden layers | 0.005 | 0.04 | 1.49 | 7.49
SFNN | 2 hidden layers | 0.819 | 4.50 | 5.73 | 2.678
ReLU-DNN | 3 hidden layers | 0 | 0 | 1.43 | 7.56
SFNN | 3 hidden layers | 1.174 | 16.14 | 17.83 | 4.468
ReLU-DNN | 4 hidden layers | 0 | 0 | 1.49 | 7.57
SFNN | 4 hidden layers | 1.13 | 13.13 | 14.64 | 1.470

Table 1: The performance of simple parameter transformations from DNN to SFNN on the MNIST and synthetic datasets, where each layer of the neural networks contains 800 and 50 hidden units for the two datasets, respectively. For all experiments, only the first hidden layer of the DNN is replaced by a stochastic one. We report negative log-likelihood (NLL) and classification error rates.

2 Simple Transformation from DNN to SFNN

2.1 Preliminaries for SFNN

Stochastic feedforward neural network (SFNN) is a hybrid model, which has both stochastic binary and deterministic hidden units. We first introduce SFNN with one stochastic hidden layer (and without deterministic hidden layers) for simplicity. Throughout this paper, we commonly denote the bias for unit i and the weight matrix of the l-th hidden layer by b_i^l and W^l, respectively. Then, the stochastic hidden layer in SFNN is defined as a binary random vector with N^1 units, i.e., h^1 ∈ {0,1}^{N^1}, drawn under the following distribution:

    P(h^1 \mid x) = \prod_{i=1}^{N^1} P(h_i^1 \mid x), \quad \text{where} \quad P(h_i^1 = 1 \mid x) = \sigma\left(W_i^1 x + b_i^1\right).    (1)

In the above, x is the input vector and σ(x) = 1/(1 + e^{-x}) is the sigmoid function. Our conditional distribution of the output y is defined as follows:

    P(y \mid x) = \mathbb{E}_{P(h^1 \mid x)}\left[P(y \mid h^1)\right] = \mathbb{E}_{P(h^1 \mid x)}\left[\mathcal{N}\left(y \mid W^2 h^1 + b^2, \sigma_y^2\right)\right],

where N(·) denotes the normal distribution with mean W^2 h^1 + b^2 and (fixed) variance σ_y^2. Therefore, P(y | x) can express a very complex, multi-modal distribution since it is a mixture of exponentially many normal distributions. The multi-layer extension is straightforward via a combination of stochastic and deterministic hidden layers (see [8, 9]). Furthermore, one can use any other output distribution, as in DNN, e.g., softmax for classification tasks.

There are two computational issues for training SFNN: computing expectations with respect to stochastic units in the forward pass and computing gradients in the backward pass. One can notice that both are computationally intractable since they require summations over exponentially many configurations of all stochastic units. First, in order to handle the issue in the forward pass, one can use the following Monte Carlo approximation for estimating the expectation: P(y | x) ≈ (1/M) Σ_{m=1}^{M} P(y | h^{(m)}), where h^{(m)} ∼ P(h^1 | x) and M is the number of samples. This random estimator is unbiased and has relatively low variance [8] since one can draw samples from the exact distribution. Next, in order to handle the issue in the backward pass, [5] proposed Gibbs sampling, but it is known to often mix poorly. [38] proposed a variational learning scheme based on the mean-field approximation, but it has additional parameters making the variational lower bound looser. More recently, several other techniques have been proposed, including unbiased estimators of the variational bound using importance sampling [8, 9] and biased/unbiased estimators of the gradient for approximating backpropagation [35, 9, 34].

2.2 Simple transformation from sigmoid-DNN and ReLU-DNN to SFNN

Despite the recent advances, training SFNN is still very slow compared to DNN due to the sampling procedures: in particular, it is notoriously hard to train SFNN when the network structure is deeper and wider. In order to handle these issues, we consider the following approximation:

    P(y \mid x) = \mathbb{E}_{P(h^1 \mid x)}\left[\mathcal{N}\left(y \mid W^2 h^1 + b^2, \sigma_y^2\right)\right]
      \approx \mathcal{N}\left(y \mid \mathbb{E}_{P(h^1 \mid x)}\left[W^2 h^1\right] + b^2, \sigma_y^2\right)
      = \mathcal{N}\left(y \mid W^2 \sigma\left(W^1 x + b^1\right) + b^2, \sigma_y^2\right).    (2)

Note that the above approximation corresponds to replacing stochastic units by deterministic ones whose hidden activation values equal the marginal distributions of the stochastic units, i.e., SFNN can be approximated by a DNN using sigmoid activation functions, say sigmoid-DNN. When there exist more latent layers above the stochastic one, one has to apply similar approximations to all of them, i.e., exchanging the orders of expectations and non-linear functions, in order to make DNN and SFNN equivalent. Therefore, instead of training SFNN directly, one can try transferring pre-trained parameters of sigmoid-DNN to those of the corresponding SFNN directly: train sigmoid-DNN instead of SFNN, and replace deterministic units by stochastic ones for the inference purpose. Although such a strategy looks somewhat crude, it was often observed in the literature that it works reasonably well for SFNN [9], and we also evaluate it as reported in Table 1. We also note that similar approximations appear in the context of dropout: it trains a stochastic model averaging exponentially many DNNs sharing parameters, but it also approximates a single DNN well.
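As a concrete illustration of this simple transformation, here is a minimal NumPy sketch (our own illustrative code; the helper names and the Monte Carlo loop are not from the paper) that runs SFNN inference with the parameters of a trained two-layer sigmoid-DNN, estimating the output expectation with samples as described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnn_predict(x, W1, b1, W2, b2):
    # Deterministic sigmoid-DNN inference with the same parameters, for comparison.
    return W2 @ sigmoid(W1 @ x + b1) + b2

def sfnn_predict_mc(x, W1, b1, W2, b2, n_samples=500, rng=None):
    # Reuse sigmoid-DNN parameters as SFNN parameters (the "simple transformation"):
    # draw binary h ~ P(h | x) with P(h_i = 1 | x) = sigmoid(W1 x + b1), Eq. (1),
    # and average the per-sample Gaussian means W2 h + b2, i.e. a Monte Carlo
    # estimate of E[y | x]; the full mixture would be used for likelihood values.
    rng = np.random.default_rng() if rng is None else rng
    p = sigmoid(W1 @ x + b1)
    ys = []
    for _ in range(n_samples):
        h = (rng.random(p.shape) < p).astype(float)
        ys.append(W2 @ h + b2)
    return np.mean(ys, axis=0)
```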

Now we investigate a similar transformation in the case when the DNN uses the unbounded ReLU activation function, say ReLU-DNN. Many recent deep networks are of the ReLU-DNN type because they mitigate the gradient vanishing problem, and their pre-trained parameters are often available. Although it is straightforward to build SFNN from sigmoid-DNN, it is less clear in this case since ReLU is unbounded. To handle this issue, we redefine the stochastic latent units of SFNN:

    P(h^1 \mid x) = \prod_{i=1}^{N^1} P(h_i^1 \mid x), \quad \text{where} \quad P(h_i^1 = 1 \mid x) = \min\left\{\alpha f\left(W_i^1 x + b_i^1\right), 1\right\}.    (3)

In the above, f(x) = max{x, 0} is the ReLU activation function and α is some hyper-parameter. A simple transformation can be defined similarly to the case of sigmoid-DNN via replacing deterministic units by stochastic ones. However, to preserve the parameter information of ReLU-DNN, one has to choose α such that α f(W_i^1 x + b_i^1) ≤ 1 and rescale the upper parameters W^2 as follows:

    \alpha \leftarrow \frac{1}{\max_{i,x} f\left(\widehat{W}_i^1 x + \hat{b}_i^1\right)}, \qquad \left(W^1, b^1\right) \leftarrow \left(\widehat{W}^1, \hat{b}^1\right), \qquad \left(W^2, b^2\right) \leftarrow \left(\widehat{W}^2 / \alpha, \hat{b}^2\right).    (4)

Then, applying similar approximations as in (2), i.e., exchanging the orders of expectations and non-linear functions, one can observe that ReLU-DNN and SFNN are equivalent.

We evaluate the performance of the simple transformations from DNN to SFNN on the MNIST dataset [6] and the synthetic dataset [36], where the former and the latter are popular datasets used for a classification task and a multi-modal (i.e., one-to-many mapping) prediction task, respectively. In all experiments reported in this paper, the softmax and the Gaussian with standard deviation σ_y = 0.05 are used for the output probability on classification and regression tasks, respectively. Only the first hidden layer of the DNN is replaced by a stochastic one, and we use 500 samples for estimating the expectations in the SFNN inference. As reported in Table 1, we observe that the simple transformation often works well for both tasks: the SFNN and sigmoid-DNN inferences (using the same parameters trained by sigmoid-DNN) perform similarly for the classification task, and the former significantly outperforms the latter for the multi-modal task (also see Figure 1). This suggests that the expensive SFNN training might not be necessary, depending on the targeted learning quality. However, in the case of ReLU, SFNN performs much worse than ReLU-DNN for the MNIST classification task under the parameter transformation.
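The ReLU case (3)-(4) evaluated above amounts to computing one normalizing constant over the training inputs and rescaling the upper weights; a minimal sketch, assuming a two-layer ReLU-DNN and using our own (hypothetical) helper names:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def simple_transform_relu_dnn(W1, b1, W2, b2, X):
    # Eq. (4): alpha = 1 / max_{i,x} f(W1 x + b1) over the input set X, so that
    # alpha * f(.) is a valid Bernoulli probability; W2 is rescaled by 1/alpha,
    # while (W1, b1) and b2 are kept as in the ReLU-DNN.
    alpha = 1.0 / np.max(relu(X @ W1.T + b1))
    return alpha, (W1, b1), (W2 / alpha, b2)

def sfnn_sample_hidden(x, W1, b1, alpha, rng):
    # Eq. (3): P(h_i = 1 | x) = min{alpha * f(W1 x + b1), 1}.
    p = np.minimum(alpha * relu(W1 @ x + b1), 1.0)
    return (rng.random(p.shape) < p).astype(float)
```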

3 Transformation from DNN to SFNN via Simplified-SFNN

In this section, we propose an advanced method to utilize the pre-trained parameters of DNN for training SFNN. As shown in the previous section, simple parameter transformations from DNN to SFNN do not clearly work in general, in particular for activation functions other than sigmoid. Moreover, training DNN does not utilize the stochastic regularizing effect, which is an important benefit of SFNN. To address these issues, we design an intermediate model, called Simplified-SFNN. The proposed model is a special form of stochastic neural network, which approximates certain SFNN by simplifying its upper latent units above stochastic ones. Then, we establish more rigorous connections between the three models, DNN → Simplified-SFNN → SFNN, which leads to an efficient training procedure of the stochastic models utilizing pre-trained parameters of DNN. In our experiments, we evaluate the strategy for various tasks and popular DNN architectures.

Figure 2: (a) Simplified-SFNN (top) and SFNN (bottom). The randomness of the stochastic layer propagates only to its upper layer in the case of Simplified-SFNN. (b) For the first 200 epochs, we train a baseline ReLU-DNN. Then, we train Simplified-SFNN initialized by the DNN parameters under transformation (8) with γ_2 = 50. We observe that training ReLU-DNN* directly does not reduce the test error, even though network knowledge transferring still holds between the baseline ReLU-DNN and the corresponding ReLU-DNN*. (c) As the value of γ_2 increases, the knowledge transferring loss, measured as (1/|D|)(1/N^l) Σ_x Σ_i | h_i^l(x) − ĥ_i^l(x) |, decreases.

3.1 Simplified-SFNN of two hidden layers and non-negative activation functions

For clarity of presentation, we first introduce Simplified-SFNN with two hidden layers and non-negative activation functions; its extensions to multiple layers and general activation functions are presented in Section 4. We also remark that we primarily describe fully-connected Simplified-SFNNs, but their convolutional versions can also be naturally defined.

In Simplified-SFNN of two hidden layers, we assume that the first and second hidden layers consist of stochastic binary hidden units and deterministic ones, respectively. As in (3), the first layer is defined as a binary random vector with N^1 units, i.e., h^1 ∈ {0,1}^{N^1}, drawn under the following distribution:

    P(h^1 \mid x) = \prod_{i=1}^{N^1} P(h_i^1 \mid x), \quad \text{where} \quad P(h_i^1 = 1 \mid x) = \min\left\{\alpha_1 f\left(W_i^1 x + b_i^1\right), 1\right\},    (5)

where x is the input vector, α_1 > 0 is a hyper-parameter for the first layer, and f : R → R_+ is some non-negative non-linear activation function with |f′(x)| ≤ 1 for all x ∈ R, e.g., the ReLU and sigmoid activation functions. Now the second layer is defined as the following deterministic vector with N^2 units, i.e., h^2(x) ∈ R^{N^2}:

    h^2(x) = \left[ f\left( \alpha_2 \left( \mathbb{E}_{P(h^1 \mid x)}\left[ s\left(W_i^2 h^1 + b_i^2\right) \right] - s(0) \right) \right) : \forall i \right],    (6)

where α_2 > 0 is a hyper-parameter for the second layer and s : R → R is a differentiable function with |s″(x)| ≤ 1 for all x ∈ R, e.g., the sigmoid and tanh functions. In our experiments, we use the sigmoid function for s(x). Here, one can note that the proposed model has the same computational issues as SFNN in the forward and backward passes due to the complex expectation. One can train Simplified-SFNN similarly to SFNN: we use the Monte Carlo approximation for estimating the expectation and the (biased) estimator of the gradient for approximating backpropagation inspired by [9] (see Section 3.3 for more details).
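The following NumPy sketch (ours, not the authors' implementation) computes the forward pass of this two-hidden-layer Simplified-SFNN, estimating the inner expectation of Eq. (6) with a small number of Monte Carlo samples; it assumes ReLU for f and the sigmoid for s, as in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def simplified_sfnn_forward(x, W1, b1, alpha1, W2, b2, alpha2,
                            f=relu, s=sigmoid, n_samples=20, rng=None):
    # First (stochastic) layer, Eq. (5): P(h1_i = 1 | x) = min{alpha1 * f(W1 x + b1), 1}.
    rng = np.random.default_rng() if rng is None else rng
    p1 = np.minimum(alpha1 * f(W1 @ x + b1), 1.0)
    samples = (rng.random((n_samples, p1.size)) < p1).astype(float)
    # Second (deterministic) layer, Eq. (6): the expectation E[s(W2 h1 + b2)]
    # is replaced by a Monte Carlo average over the sampled binary h1's.
    inner = s(samples @ W2.T + b2).mean(axis=0)
    h2 = f(alpha2 * (inner - s(0.0)))
    return h2
```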

Inference Model | Training Model | Network Structure | without BN & DO | with BN | with DO
sigmoid-DNN | sigmoid-DNN | 2 hidden layers | 1.54 | 1.57 | 1.25
SFNN | sigmoid-DNN | 2 hidden layers | 1.56 | 2.3 | 1.7
Simplified-SFNN | fine-tuned by Simplified-SFNN | 2 hidden layers | 1.51 | 1.5 | 1.11
sigmoid-DNN* | fine-tuned by Simplified-SFNN | 2 hidden layers | 1.48 (0.06) | 1.48 (0.09) | 1.14 (0.11)
SFNN | fine-tuned by Simplified-SFNN | 2 hidden layers | 1.51 | 1.57 | 1.11
ReLU-DNN | ReLU-DNN | 2 hidden layers | 1.49 | 1.25 | 1.12
SFNN | ReLU-DNN | 2 hidden layers | 5.73 | 3.47 | 1.74
Simplified-SFNN | fine-tuned by Simplified-SFNN | 2 hidden layers | 1.41 | 1.17 | 1.06
ReLU-DNN* | fine-tuned by Simplified-SFNN | 2 hidden layers | 1.32 (0.17) | 1.16 (0.09) | 1.05 (0.07)
SFNN | fine-tuned by Simplified-SFNN | 2 hidden layers | 2.63 | 1.34 | 1.51
ReLU-DNN | ReLU-DNN | 3 hidden layers | 1.43 | 1.34 | 1.24
SFNN | ReLU-DNN | 3 hidden layers | 17.83 | 4.15 | 1.49
Simplified-SFNN | fine-tuned by Simplified-SFNN | 3 hidden layers | 1.28 | 1.25 | 1.04
ReLU-DNN* | fine-tuned by Simplified-SFNN | 3 hidden layers | 1.27 (0.16) | 1.24 (0.10) | 1.03 (0.21)
SFNN | fine-tuned by Simplified-SFNN | 3 hidden layers | 1.56 | 1.28 | 1.16
ReLU-DNN | ReLU-DNN | 4 hidden layers | 1.49 | 1.34 | 1.30
SFNN | ReLU-DNN | 4 hidden layers | 14.64 | 3.85 | 2.17
Simplified-SFNN | fine-tuned by Simplified-SFNN | 4 hidden layers | 1.32 | 1.23 | 1.25
ReLU-DNN* | fine-tuned by Simplified-SFNN | 4 hidden layers | 1.29 (0.20) | 1.29 (0.05) | 1.25 (0.05)
SFNN | fine-tuned by Simplified-SFNN | 4 hidden layers | 3.44 | 1.89 | 1.56

Table 2: Classification test error rates [%] on MNIST, where each layer of the neural networks contains 800 hidden units. All Simplified-SFNNs are constructed by replacing the first hidden layer of a baseline DNN with a stochastic hidden layer. We also consider training DNN and fine-tuning Simplified-SFNN using batch normalization (BN) and dropout (DO). The performance improvements over the baseline DNN due to fine-tuning DNN parameters under Simplified-SFNN are given in brackets.

We are interested in transferring parameters of DNN to Simplified-SFNN in order to utilize the training benefits of DNN, since the former is much faster to train than the latter. To this end, we consider the following DNN whose l-th hidden layer is deterministic and defined as follows:

    \hat{h}^l(x) = \left[ \hat{h}_i^l(x) = f\left( \widehat{W}_i^l \hat{h}^{l-1}(x) + \hat{b}_i^l \right) : \forall i \right],    (7)

where ĥ^0(x) = x. As stated in the following theorem, we establish a rigorous way to initialize the parameters of Simplified-SFNN in order to transfer the knowledge stored in the DNN.

Theorem 1. Assume that both DNN and Simplified-SFNN with two hidden layers have the same network structure with non-negative activation function f. Given parameters {Ŵ^l, b̂^l : l = 1, 2} of the DNN and an input dataset D, choose those of Simplified-SFNN as follows:

    \left(\alpha_1, W^1, b^1\right) \leftarrow \left(\frac{1}{\gamma_1}, \widehat{W}^1, \hat{b}^1\right), \qquad \left(\alpha_2, W^2, b^2\right) \leftarrow \left(\frac{\gamma_1 \gamma_2}{s'(0)}, \frac{\widehat{W}^2}{\gamma_2}, \frac{\hat{b}^2}{\gamma_1 \gamma_2}\right),    (8)

where γ_1 = max_{i, x ∈ D} f(Ŵ_i^1 x + b̂_i^1) and γ_2 > 0 is any positive constant. Then, for all i and x ∈ D, it follows that

    \left| h_i^2(x) - \hat{h}_i^2(x) \right| \le \frac{\gamma_1 \left( \|\widehat{W}_i^2\|_1 + |\hat{b}_i^2|\, \gamma_1^{-1} \right)^2}{2\, s'(0)\, \gamma_2}.

The proof of the above theorem is presented in Section 5.1. Our proof is built upon the first-order Taylor expansion of the non-linear function s(x). Theorem 1 implies that one can make Simplified-SFNN represent the function values of DNN with bounded errors using a linear transformation. Furthermore, the errors can be made arbitrarily small by choosing large γ_2, i.e., lim_{γ_2 → ∞} | h_i^2(x) − ĥ_i^2(x) | = 0 for all i, x ∈ D. Figure 2(c) shows that the knowledge transferring loss decreases as γ_2 increases on MNIST classification. Based on this, we choose γ_2 = 50 commonly for all experiments.
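In code, transformation (8) is only a few lines. The sketch below (again our own, with hypothetical names) assumes ReLU for f and the sigmoid for s, so that s'(0) = 0.25, and produces Simplified-SFNN parameters from trained two-layer DNN parameters; the result can be fed directly to the forward-pass sketch of Section 3.1.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def transfer_dnn_to_simplified_sfnn(W1_hat, b1_hat, W2_hat, b2_hat,
                                    X, gamma2=50.0, s_prime_0=0.25, f=relu):
    # Theorem 1, Eq. (8): gamma1 is the maximum first-layer activation over the
    # input set X; gamma2 is the free constant (50 in the paper's experiments);
    # s_prime_0 = s'(0) (0.25 for the sigmoid s).
    gamma1 = np.max(f(X @ W1_hat.T + b1_hat))
    alpha1 = 1.0 / gamma1
    alpha2 = gamma1 * gamma2 / s_prime_0
    W1, b1 = W1_hat, b1_hat                      # first layer is copied as-is
    W2 = W2_hat / gamma2                         # upper weights shrunk by gamma2
    b2 = b2_hat / (gamma1 * gamma2)              # upper bias shrunk by gamma1 * gamma2
    return (alpha1, W1, b1), (alpha2, W2, b2)
```

By Theorem 1, the Simplified-SFNN initialized this way reproduces the baseline DNN's second-layer activations up to the stated error bound (and up to Monte Carlo noise in the forward estimate).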

Inference Model | Training Model | MNIST (2 hidden layers) | MNIST (3 hidden layers) | TFD (2 hidden layers) | TFD (3 hidden layers)
sigmoid-DNN | sigmoid-DNN | 1.409 | 1.70 | -0.064 | 0.005
SFNN | sigmoid-DNN | 0.644 | 1.076 | -0.461 | -0.401
Simplified-SFNN | fine-tuned by Simplified-SFNN | 1.474 | 1.757 | -0.071 | -0.08
SFNN | fine-tuned by Simplified-SFNN | 0.619 | 0.991 | -0.509 | -0.43
ReLU-DNN | ReLU-DNN | 1.747 | 1.741 | 1.71 | 1.3
SFNN | ReLU-DNN | -1.019 | -1.01 | 0.83 | 1.11
Simplified-SFNN | fine-tuned by Simplified-SFNN | 2.1 | 2.6 | 0.175 | 0.343
SFNN | fine-tuned by Simplified-SFNN | -1.90 | -1.061 | -0.380 | -0.193

Table 3: Test negative log-likelihood (NLL) on the MNIST and TFD datasets, where each layer of the neural networks contains 200 hidden units. All Simplified-SFNNs are constructed by replacing the first hidden layer of a baseline DNN with a stochastic hidden layer.

3.2 Why Simplified-SFNN?

Given a Simplified-SFNN model, the corresponding SFNN can be naturally defined by taking out the expectation in (6). As illustrated in Figure 2(a), the main difference between SFNN and Simplified-SFNN is that the randomness of the stochastic layer propagates only to its upper layer in the latter, i.e., the randomness of h^1 is averaged out at its upper units h^2 and does not propagate to h^3 or the output y. Hence, Simplified-SFNN is no longer a Bayesian network. This makes training Simplified-SFNN much easier than SFNN since random samples are not required at some layers,¹ and consequently the quality of gradient estimation can also be improved, in particular for unbounded activation functions. Furthermore, one can use the same approximation procedure to see that Simplified-SFNN approximates SFNN. However, since Simplified-SFNN still maintains binary random units, it applies the approximation steps later than DNN does. In summary, Simplified-SFNN is an intermediate model between DNN and SFNN, i.e., DNN → Simplified-SFNN → SFNN.

The above connection naturally suggests the following training procedure for both SFNN and Simplified-SFNN: train a baseline DNN first and then fine-tune its corresponding Simplified-SFNN initialized by the transformed DNN parameters. Finally, the fine-tuned parameters can be used for SFNN as well. We evaluate this strategy for MNIST classification. The MNIST dataset consists of 28 × 28 pixel greyscale images, each containing a digit 0 to 9, with 60,000 training and 10,000 test images. For this experiment, we do not use any data augmentation or pre-processing. The loss was minimized using the ADAM learning rule [7] with a mini-batch size of 128. We used an exponentially decaying learning rate. Hyper-parameters are tuned on the validation set consisting of the last 10,000 training images. All Simplified-SFNNs are constructed by replacing the first hidden layer of a baseline DNN with a stochastic hidden layer.

¹ For example, if one replaces the first feature maps in the fifth residual unit of Pre-ResNet having 164 layers [45] by stochastic ones, then the corresponding DNN, Simplified-SFNN and SFNN took 1 min 35 secs, 2 mins 5 secs and 16 mins 6 secs per training epoch, respectively, on our machine with one Intel CPU (Core i7-5820K, 6 cores @ 3.3GHz) and one NVIDIA GPU (GTX Titan X, 3072 CUDA cores). Here, we trained both stochastic models using the biased estimator [9] with 10 random samples on the CIFAR-10 dataset.

We first train a baseline DNN for the first 200 epochs, and the trained parameters of the DNN are used for initializing those of Simplified-SFNN. We then train Simplified-SFNN for 50 epochs. We choose the hyper-parameter γ_2 = 50 in the parameter transformation. All Simplified-SFNNs are trained with M = 20 samples at each epoch, and at test time we use 500 samples. Table 2 shows that SFNN under the two-stage training always performs better than SFNN under the simple transformation (4) from ReLU-DNN. More interestingly, Simplified-SFNN consistently outperforms its baseline DNN due to the stochastic regularizing effect, even when we train both models using dropout [16] and batch normalization [17]. This implies that the proposed stochastic model can be used for improving the performance of DNNs, and it can also be combined with other regularization methods such as dropout and batch normalization. In order to confirm the regularization effects, one can again approximate a trained Simplified-SFNN by a new deterministic DNN, which we call DNN* and which is different from its baseline DNN, under the following approximation at upper latent units above binary random units:

    \mathbb{E}_{P(h^l \mid x)}\left[ s\left(W_i^{l+1} h^l\right) \right] \approx s\left( \mathbb{E}_{P(h^l \mid x)}\left[ W_i^{l+1} h^l \right] \right) = s\left( W_i^{l+1} P\left(h^l = 1 \mid x\right) \right).    (9)

We found that DNN* using fine-tuned parameters of Simplified-SFNN also outperforms the baseline DNN, as shown in Table 2 and Figure 2(b).

3.3 Training Simplified-SFNN

The parameters of Simplified-SFNN can be learned using a variant of the backpropagation algorithm [29] in a similar manner to DNN. However, in contrast to DNN, there are two computational issues for Simplified-SFNN: computing expectations with respect to stochastic units in the forward pass and computing gradients in the backward pass. One can notice that both are intractable since they require summations over all possible configurations of all stochastic units. First, in order to handle the issue in the forward pass, we use the following Monte Carlo approximation for estimating the expectation:

    \mathbb{E}_{P(h^1 \mid x)}\left[ s\left(W_i^2 h^1 + b_i^2\right) \right] \approx \frac{1}{M} \sum_{m=1}^{M} s\left(W_i^2 h^{(m)} + b_i^2\right),

where h^{(m)} ∼ P(h^1 | x) and M is the number of samples. This random estimator is unbiased and has relatively low variance [8] since its accuracy does not depend on the dimensionality of h^1 and one can draw samples from the exact distribution. Next, in order to handle the issue in the backward pass, we use the following approximations inspired by [9]:

    \frac{\partial}{\partial W^2}\, \mathbb{E}_{P(h^1 \mid x)}\left[ s\left(W_i^2 h^1 + b_i^2\right) \right] \approx \frac{1}{M} \sum_{m} \frac{\partial}{\partial W^2}\, s\left(W_i^2 h^{(m)} + b_i^2\right),

    \frac{\partial}{\partial W^1}\, \mathbb{E}_{P(h^1 \mid x)}\left[ s\left(W_i^2 h^1 + b_i^2\right) \right] \approx \frac{1}{M} \sum_{m} W_i^2\, s'\left(W_i^2 h^{(m)} + b_i^2\right) \frac{\partial}{\partial W^1} P\left(h^1 = 1 \mid x\right),

where h^{(m)} ∼ P(h^1 | x) and M is the number of samples. In our experiments, we commonly choose M = 20.
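To make the two estimators concrete, the sketch below (our own NumPy code with hypothetical names; a real implementation would let an autodiff framework propagate through P(h^1 = 1 | x) automatically) returns the Monte Carlo forward estimate for one stochastic layer together with a biased estimate of its gradient with respect to the marginal probabilities p_j = P(h_j^1 = 1 | x), in the spirit of the backward approximation above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mc_expectation_and_biased_grads(x, W1, b1, alpha1, W2, b2, M=20, rng=None):
    """Monte Carlo forward estimate and biased backward estimate for one
    stochastic layer of Simplified-SFNN (Section 3.3, sketch).

    Returns:
      e:     (N2,)     ~ E_{P(h1|x)}[ s(W2 h1 + b2) ], estimated with M samples.
      dE_dp: (N2, N1)  biased estimate of d e_i / d p_j, where p_j = P(h1_j=1|x);
             the chain rule d p_j / d W1 is the ordinary deterministic gradient
             of min{alpha1 * f(W1 x + b1), 1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = np.minimum(alpha1 * np.maximum(W1 @ x + b1, 0.0), 1.0)   # P(h1 = 1 | x)
    H = (rng.random((M, p.size)) < p).astype(float)              # M binary samples
    pre = H @ W2.T + b2                                          # (M, N2)
    e = sigmoid(pre).mean(axis=0)                                # forward estimate
    # Biased gradient: average s'(.) over the sampled pre-activations and
    # multiply by the corresponding weight W2_ij.
    s_prime = sigmoid(pre) * (1.0 - sigmoid(pre))                # (M, N2)
    dE_dp = s_prime.mean(axis=0)[:, None] * W2                   # (N2, N1)
    return e, dE_dp
```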

4 Extensions of Simplified-SFNN

In this section, we describe how the network knowledge transferring between Simplified-SFNN and DNN, i.e., Theorem 1, generalizes to multiple layers and general activation functions.

4.1 Extension to multiple layers

A deeper Simplified-SFNN with L hidden layers can be defined similarly to the case of L = 2. We also establish network knowledge transferring between Simplified-SFNN and DNN with L hidden layers, as stated in the following theorem. Here, we assume that stochastic layers are not consecutive for simpler presentation, but the theorem is generalizable to consecutive stochastic layers.

Theorem 2. Assume that both DNN and Simplified-SFNN with L hidden layers have the same network structure with non-negative activation function f. Given parameters {Ŵ^l, b̂^l : l = 1, ..., L} of the DNN and an input dataset D, choose the same ones for Simplified-SFNN initially and modify them for each l-th stochastic layer and its upper layer as follows:

    \left(\alpha_l, \alpha_{l+1}, W^{l+1}, b^{l+1}\right) \leftarrow \left(\frac{1}{\gamma_l}, \frac{\gamma_l \gamma_{l+1}}{s'(0)}, \frac{\widehat{W}^{l+1}}{\gamma_{l+1}}, \frac{\hat{b}^{l+1}}{\gamma_l \gamma_{l+1}}\right),    (10)

where γ_l = max_{i, x ∈ D} f(Ŵ_i^l h^{l−1}(x) + b̂_i^l) and γ_{l+1} is any positive constant. Then, it follows that

    \lim_{\substack{\gamma_{l+1} \to \infty \\ \forall\, \text{stochastic hidden layer } l}} \left| h_i^L(x) - \hat{h}_i^L(x) \right| = 0, \qquad \forall i,\ x \in D.

The above theorem again implies that it is possible to transfer knowledge from DNN to Simplified-SFNN by choosing large γ_{l+1}. The proof of Theorem 2 is similar to that of Theorem 1 and is given in Section 5.2.

4.2 Extension to general activation functions

In this section, we describe an extended version of Simplified-SFNN which can utilize any activation function. To this end, we modify the definitions of the stochastic layers and their upper layers by introducing certain additional terms. If the l-th hidden layer is stochastic, then we slightly modify the original definition (5) as follows:

    P(h^l \mid x) = \prod_{i=1}^{N^l} P(h_i^l \mid x), \quad \text{with} \quad P(h_i^l = 1 \mid x) = \min\left\{\alpha_l \left( f\left(W_i^1 x + b_i^1\right) + \frac{1}{2} \right), 1\right\},

where f : R → R is a non-linear (possibly negative) activation function with |f′(x)| ≤ 1 for all x ∈ R. In addition, we re-define its upper layer as follows:

    h^{l+1}(x) = \left[ f\left( \alpha_{l+1} \left( \mathbb{E}_{P(h^l \mid x)}\left[ s\left(W_i^{l+1} h^l + b_i^{l+1}\right) \right] - s(0) - \frac{s'(0)}{2} \sum_j W_{ij}^{l+1} \right) \right) : \forall i \right],

where h^0(x) = x and s : R → R is a differentiable function with |s″(x)| ≤ 1 for all x ∈ R. Under this general Simplified-SFNN model, we also show that transferring network knowledge from DNN to Simplified-SFNN is possible, as stated in the following theorem. Here, we again assume that stochastic layers are not consecutive for simpler presentation.

Theorem 3. Assume that both DNN and Simplified-SFNN with L hidden layers have the same network structure with non-linear activation function f. Given parameters {Ŵ^l, b̂^l : l = 1, ..., L} of the DNN and an input dataset D, choose the same ones for Simplified-SFNN initially and modify them for each l-th stochastic layer and its upper layer as follows:

    \left(\alpha_l, \alpha_{l+1}, W^{l+1}, b^{l+1}\right) \leftarrow \left(\frac{1}{\gamma_l}, \frac{\gamma_l \gamma_{l+1}}{s'(0)}, \frac{\widehat{W}^{l+1}}{\gamma_{l+1}}, \frac{\hat{b}^{l+1}}{\gamma_l \gamma_{l+1}}\right),

where γ_l = max_{i, x ∈ D} f(Ŵ_i^l h^{l−1}(x) + b̂_i^l) and γ_{l+1} is any positive constant. Then, it follows that

    \lim_{\substack{\gamma_{l+1} \to \infty \\ \forall\, \text{stochastic hidden layer } l}} \left| h_i^L(x) - \hat{h}_i^L(x) \right| = 0, \qquad \forall i,\ x \in D.

We omit the proof of the above theorem since it is a somewhat direct adaptation of that of Theorem 2.

5 Proofs of Theorems

5.1 Proof of Theorem 1

First consider the first hidden layer, i.e., the stochastic layer. Let γ_1 = max_{i, x ∈ D} f(Ŵ_i^1 x + b̂_i^1) be the maximum value of the hidden units in the DNN. If we initialize the parameters (α_1, W^1, b^1) ← (1/γ_1, Ŵ^1, b̂^1), then the marginal distribution of each hidden unit becomes

    P\left(h_i^1 = 1 \mid x, W^1, b^1\right) = \min\left\{\alpha_1 f\left(\widehat{W}_i^1 x + \hat{b}_i^1\right), 1\right\} = \frac{1}{\gamma_1} f\left(\widehat{W}_i^1 x + \hat{b}_i^1\right), \qquad \forall i,\ x \in D.

Next, consider the second hidden layer. From Taylor's theorem, there exists a value z between 0 and x such that s(x) = s(0) + s'(0)x + R(x), where R(x) = s''(z)x^2/2!. Since we consider a binary random vector, i.e., h^1 ∈ {0,1}^{N^1}, one can write

    \mathbb{E}_{P(h^1 \mid x)}\left[ s\left(\beta_i(h^1)\right) \right] = \sum_{h^1} \left( s(0) + s'(0)\,\beta_i(h^1) + R\left(\beta_i(h^1)\right) \right) P(h^1 \mid x)
      = s(0) + s'(0)\left( W_i^2 P\left(h^1 = 1 \mid x\right) + b_i^2 \right) + \mathbb{E}_{P(h^1 \mid x)}\left[ R\left(\beta_i(h^1)\right) \right],    (11)

where β_i(h^1) := W_i^2 h^1 + b_i^2 is the incoming signal. From (6) and (11), for every hidden unit i, it follows that

    h_i^2\left(x; W^2, b^2\right) = f\left( \alpha_2 \left( s'(0)\left( \frac{1}{\gamma_1} W_i^2 \hat{h}^1(x) + b_i^2 \right) + \mathbb{E}_{P(h^1 \mid x)}\left[ R\left(\beta_i(h^1)\right) \right] \right) \right).    (12)

Since we assume that |f'(x)| ≤ 1, the following inequality holds:

    \left| h_i^2\left(x; W^2, b^2\right) - f\left( \alpha_2\, s'(0)\left( \frac{1}{\gamma_1} W_i^2 \hat{h}^1(x) + b_i^2 \right) \right) \right|
      \le \alpha_2\, \mathbb{E}_{P(h^1 \mid x)}\left[ \left| R\left(\beta_i(h^1)\right) \right| \right]
      \le \frac{\alpha_2}{2}\, \mathbb{E}_{P(h^1 \mid x)}\left[ \left( W_i^2 h^1 + b_i^2 \right)^2 \right],    (13)

where we use |s''(z)| < 1 for the last inequality. Therefore, it follows that

    \left| h_i^2\left(x; W^2, b^2\right) - \hat{h}_i^2\left(x; \widehat{W}^2, \hat{b}^2\right) \right| \le \frac{\gamma_1 \left( \|\widehat{W}_i^2\|_1 + |\hat{b}_i^2|\, \gamma_1^{-1} \right)^2}{2\, s'(0)\, \gamma_2}, \qquad \forall i,

since we set (α_2, W^2, b^2) ← (γ_1 γ_2 / s'(0), Ŵ^2/γ_2, b̂^2/(γ_1 γ_2)). This completes the proof of Theorem 1.

5.2 Proof of Theorem 2

For the proof of Theorem 2, we first state two key lemmas on error propagation in Simplified-SFNN.

Lemma 4. Assume that there exists some positive constant B such that

    \left| h_i^{l-1}(x) - \hat{h}_i^{l-1}(x) \right| \le B, \qquad \forall i,\ x \in D,

and the l-th hidden layer of Simplified-SFNN is a standard deterministic layer as defined in (7). Given parameters {Ŵ^l, b̂^l} of the DNN, choose the same ones for Simplified-SFNN. Then, the following inequality holds:

    \left| h_i^l(x) - \hat{h}_i^l(x) \right| \le B\, N^{l-1}\, \widehat{W}_{\max}^l, \qquad \forall i,\ x \in D,

where Ŵ_max^l = max_{i,j} |Ŵ_{ij}^l|.

Proof. See Section 5.3.

Lemma 5. Assume that there exists some positive constant B such that

    \left| h_i^{l-1}(x) - \hat{h}_i^{l-1}(x) \right| \le B, \qquad \forall i,\ x \in D,

and the l-th hidden layer of Simplified-SFNN is a stochastic layer. Given parameters {Ŵ^l, Ŵ^{l+1}, b̂^l, b̂^{l+1}} of the DNN, choose those of Simplified-SFNN as follows:

    \left(\alpha_l, \alpha_{l+1}, W^{l+1}, b^{l+1}\right) \leftarrow \left(\frac{1}{\gamma_l}, \frac{\gamma_l \gamma_{l+1}}{s'(0)}, \frac{\widehat{W}^{l+1}}{\gamma_{l+1}}, \frac{\hat{b}^{l+1}}{\gamma_l \gamma_{l+1}}\right),

where γ_l = max_{i, x ∈ D} f(Ŵ_i^l h^{l−1}(x) + b̂_i^l) and γ_{l+1} is any positive constant. Then, for all k and x ∈ D, it follows that

    \left| h_k^{l+1}(x) - \hat{h}_k^{l+1}(x) \right| \le B\, N^{l-1} N^l\, \widehat{W}_{\max}^l \widehat{W}_{\max}^{l+1} + \frac{\gamma_l \left( N^l\, \widehat{W}_{\max}^{l+1} + \hat{b}_{\max}^{l+1}\, \gamma_l^{-1} \right)^2}{2\, s'(0)\, \gamma_{l+1}},

where Ŵ_max^{l+1} = max_{i,j} |Ŵ_{ij}^{l+1}| and b̂_max^{l+1} = max_i |b̂_i^{l+1}|.

Proof. See Section 5.4.

Now assume that the l-th layer is the first stochastic hidden layer in Simplified-SFNN. Then, from Theorem 1, we have

    \left| h_i^{l+1}(x) - \hat{h}_i^{l+1}(x) \right| \le \frac{\gamma_l \left( N^l\, \widehat{W}_{\max}^{l+1} + \hat{b}_{\max}^{l+1}\, \gamma_l^{-1} \right)^2}{2\, s'(0)\, \gamma_{l+1}}, \qquad \forall i,\ x \in D.    (14)

According to Lemmas 4 and 5, the final error generated by the right-hand side of (14) is bounded by

    \tau_l\, \frac{\gamma_l \left( N^l\, \widehat{W}_{\max}^{l+1} + \hat{b}_{\max}^{l+1}\, \gamma_l^{-1} \right)^2}{2\, s'(0)\, \gamma_{l+1}},    (15)

where τ_l = ∏_{l'=l+2}^{L} N^{l'-1} Ŵ_max^{l'}. One can note that every error generated by each stochastic layer is bounded as in (15). Therefore, for all i and x ∈ D, it follows that

    \left| h_i^L(x) - \hat{h}_i^L(x) \right| \le \sum_{l\,:\,\text{stochastic hidden layer}} \tau_l\, \frac{\gamma_l \left( N^l\, \widehat{W}_{\max}^{l+1} + \hat{b}_{\max}^{l+1}\, \gamma_l^{-1} \right)^2}{2\, s'(0)\, \gamma_{l+1}}.

From the above inequality, we can conclude that

    \lim_{\substack{\gamma_{l+1} \to \infty \\ \forall\, \text{stochastic hidden layer } l}} \left| h_i^L(x) - \hat{h}_i^L(x) \right| = 0, \qquad \forall i,\ x \in D.

This completes the proof of Theorem 2.

5.3 Proof of Lemma 4

From the assumption, there exist constants ε_j with |ε_j| < B such that h_j^{l-1}(x) = ĥ_j^{l-1}(x) + ε_j for all j and x. By the definition of the standard deterministic layer, it follows that

    h_i^l(x) = f\left( \widehat{W}_i^l h^{l-1}(x) + \hat{b}_i^l \right) = f\left( \widehat{W}_i^l \hat{h}^{l-1}(x) + \widehat{W}_i^l \varepsilon + \hat{b}_i^l \right).

Since we assume that |f'(x)| ≤ 1, one can conclude that

    \left| h_i^l(x) - f\left( \widehat{W}_i^l \hat{h}^{l-1}(x) + \hat{b}_i^l \right) \right| \le \left| \widehat{W}_i^l \varepsilon \right| \le B \sum_j \left| \widehat{W}_{ij}^l \right| \le B\, N^{l-1}\, \widehat{W}_{\max}^l.

This completes the proof of Lemma 4.

5.4 Proof of Lemma 5

From the assumption, there exist constants ε_j^{l-1} with |ε_j^{l-1}| < B such that

    h_j^{l-1}(x) = \hat{h}_j^{l-1}(x) + \varepsilon_j^{l-1}, \qquad \forall j,\ x.    (16)

Let γ_l = max_{j, x ∈ D} f(Ŵ_j^l h^{l-1}(x) + b̂_j^l) be the maximum value of the hidden units. If we initialize the parameters (α_l, W^l, b^l) ← (1/γ_l, Ŵ^l, b̂^l), then the marginal distribution becomes

    P\left(h_j^l = 1 \mid x, W^l, b^l\right) = \min\left\{\alpha_l f\left(\widehat{W}_j^l h^{l-1}(x) + \hat{b}_j^l\right), 1\right\} = \frac{1}{\gamma_l} f\left(\widehat{W}_j^l h^{l-1}(x) + \hat{b}_j^l\right), \qquad \forall j,\ x.

From (16), it follows that

    P\left(h_j^l = 1 \mid x, W^l, b^l\right) = \frac{1}{\gamma_l} f\left(\widehat{W}_j^l \hat{h}^{l-1}(x) + \widehat{W}_j^l \varepsilon^{l-1} + \hat{b}_j^l\right), \qquad \forall j,\ x.

Similar to Lemma 4, there exist constants ε_j^l with |ε_j^l| < B N^{l-1} Ŵ_max^l such that

    P\left(h_j^l = 1 \mid x, W^l, b^l\right) = \frac{1}{\gamma_l}\left( \hat{h}_j^l(x) + \varepsilon_j^l \right), \qquad \forall j,\ x.    (17)

Next, consider the upper hidden layer of the stochastic layer. From Taylor's theorem, there exists a value z between 0 and t such that s(t) = s(0) + s'(0)t + R(t), where R(t) = s''(z)t^2/2!. Since we consider a binary random vector, i.e., h^l ∈ {0,1}^{N^l}, one can write

    \mathbb{E}_{P(h^l \mid x)}\left[ s\left(\beta_k(h^l)\right) \right] = \sum_{h^l} \left( s(0) + s'(0)\,\beta_k(h^l) + R\left(\beta_k(h^l)\right) \right) P(h^l \mid x)
      = s(0) + s'(0)\left( W_k^{l+1} P\left(h^l = 1 \mid x\right) + b_k^{l+1} \right) + \sum_{h^l} R\left(\beta_k(h^l)\right) P(h^l \mid x),

where β_k(h^l) = W_k^{l+1} h^l + b_k^{l+1} is the incoming signal. From (17) and the above equation, for every hidden unit k, we have

    h_k^{l+1}\left(x; W^{l+1}, b^{l+1}\right) = f\left( \alpha_{l+1} \left( s'(0)\left( \frac{1}{\gamma_l}\left( W_k^{l+1} \hat{h}^l(x) + W_k^{l+1} \varepsilon^l \right) + b_k^{l+1} \right) + \mathbb{E}_{P(h^l \mid x)}\left[ R\left(\beta_k(h^l)\right) \right] \right) \right).

Since we assume that |f'(x)| ≤ 1, the following inequality holds:

    \left| h_k^{l+1}\left(x; W^{l+1}, b^{l+1}\right) - f\left( \alpha_{l+1}\, s'(0)\left( \frac{1}{\gamma_l} W_k^{l+1} \hat{h}^l(x) + b_k^{l+1} \right) \right) \right|
      \le \frac{\alpha_{l+1}\, s'(0)}{\gamma_l} \left| W_k^{l+1} \varepsilon^l \right| + \alpha_{l+1}\, \mathbb{E}_{P(h^l \mid x)}\left[ \left| R\left(\beta_k(h^l)\right) \right| \right]
      \le \frac{\alpha_{l+1}\, s'(0)}{\gamma_l} \left| W_k^{l+1} \varepsilon^l \right| + \frac{\alpha_{l+1}}{2}\, \mathbb{E}_{P(h^l \mid x)}\left[ \left( W_k^{l+1} h^l + b_k^{l+1} \right)^2 \right],    (18)

where we use |s''(z)| < 1 for the last inequality. Therefore, it follows that

    \left| h_k^{l+1}(x) - \hat{h}_k^{l+1}(x) \right| \le B\, N^{l-1} N^l\, \widehat{W}_{\max}^l \widehat{W}_{\max}^{l+1} + \frac{\gamma_l \left( N^l\, \widehat{W}_{\max}^{l+1} + \hat{b}_{\max}^{l+1}\, \gamma_l^{-1} \right)^2}{2\, s'(0)\, \gamma_{l+1}},

since we set (α_{l+1}, W^{l+1}, b^{l+1}) ← (γ_l γ_{l+1}/s'(0), Ŵ^{l+1}/γ_{l+1}, b̂^{l+1}/(γ_l γ_{l+1})). This completes the proof of Lemma 5.

6 Experimental Results

We present several experimental results for both multi-modal and classification tasks on MNIST [6], the Toronto Face Database (TFD) [37], CASIA [3], CIFAR-10, CIFAR-100 [14] and SVHN [26]. The softmax and the Gaussian with standard deviation 0.05 are used as the output probability for the classification tasks and the multi-modal prediction tasks, respectively. In all experiments, we first train a baseline model, and the trained parameters are used for further fine-tuning those of Simplified-SFNN.

6.1 Multi-modal regression

We first verify that it is possible to learn one-to-many mappings via Simplified-SFNN on the TFD, MNIST and CASIA datasets. The TFD dataset consists of 48 × 48 pixel greyscale images, each containing a face image of 2900 individuals with 7 different expressions. Similar to [9], we use 124 individuals with at least 10 facial expressions as data. We randomly choose 100 individuals with 1403 images for training and the remaining 24 individuals with 326 images for the test. We take the mean of the face images per individual as the input and set the output as the different expressions of the same individual. The MNIST dataset consists of greyscale images, each containing a digit 0 to 9, with 60,000 training and 10,000 test images. For this experiment, each pixel of every digit image is binarized using its grey-scale value, similar to [9]. We take the upper half of the MNIST digit as the input and set the output as the lower half of it. We remark that both tasks are commonly performed in other recent works to test multi-modal learning using SFNN [9, 34]. The CASIA dataset consists of greyscale images, each containing a handwritten Chinese character. We use 10 offline isolated characters² produced by 345 writers as data. We randomly choose 300 writers with 3000 images for training and the remaining 45 writers with 450 images for testing. We take the radical character per writer as the input and set the output as either the related character or the radical character itself of the same writer (see Figure 5). Bicubic interpolation [4] is used for re-sizing all images to 32 × 32 pixels.

For both the TFD and MNIST datasets, we use fully-connected DNNs as the baseline models, similar to other works [9, 34]. All Simplified-SFNNs are constructed by replacing the first hidden layer of a baseline DNN with a stochastic hidden layer.

² We use 5 radical characters (e.g., 'wei', 'shi', 'shui', and 'mu' in Mandarin) and 5 related characters (e.g., 'hui', 'si', 'ben', 'ren' and 'qiu' in Mandarin).

Figure 3: The overall structures of (a) Lenet-5, (b) NIN, (c) WRN with 16 layers, and (d) WRN with 28 layers. The red feature maps correspond to the stochastic ones. In the case of WRN, we introduce one (v = 3) and two (v = 2, 3) stochastic feature maps.

The loss was minimized using the ADAM learning rule [7] with a mini-batch size of 128. We used an exponentially decaying learning rate. We train Simplified-SFNNs with M = 20 samples at each epoch, and in the test we use 500 samples. We use 200 hidden units for each layer of the neural networks in the two experiments. The learning rate is chosen from {0.005, 0.002, 0.001, ..., 0.0001}, and the best result is reported for both tasks. Table 3 reports the test negative log-likelihood on TFD and MNIST. One can note that SFNNs fine-tuned by Simplified-SFNN consistently outperform SFNN under the simple transformation.

For the CASIA dataset, we choose fully-convolutional network (FCN) models [2] as the baseline ones, which consist of convolutional layers followed by a fully-convolutional layer. In a similar manner to the case of fully-connected networks, one can define a stochastic convolution layer, which considers the input feature map as a binary random matrix and generates the output feature map as defined in (6). For this experiment, we use a baseline model which consists of convolutional layers followed by a fully-convolutional layer. The convolutional layers have 64, 128 and 256 filters, respectively. Each convolutional layer has a 4 × 4 receptive field applied with a stride of 2 pixels. All Simplified-SFNNs are constructed by replacing the first hidden feature maps of the baseline models with stochastic ones. The loss was minimized using the ADAM learning rule [7] with a mini-batch size of 128. We used an exponentially decaying learning rate. We train Simplified-SFNNs with M = 20 samples at each epoch, and in the test we use 100 samples due to memory limitations. One can note that SFNNs fine-tuned by Simplified-SFNN outperform SFNN under the simple transformation, as reported in Table 4 and Figure 5.

Figure 4: Generated samples for predicting the lower half of an MNIST digit given the upper half. (a) The original digits and the corresponding inputs. The generated samples from (b) sigmoid-DNN, (c) SFNN under the simple transformation, and (d) SFNN fine-tuned by Simplified-SFNN. We remark that SFNN fine-tuned by Simplified-SFNN can generate more varied samples (e.g., red rectangles) from the same inputs than SFNN under the simple transformation.

Inference Model | Training Model | 2 Conv Layers | 3 Conv Layers
FCN | FCN | 3.89 | 3.47
SFNN | FCN | 2.61 | 1.97
Simplified-SFNN | fine-tuned by Simplified-SFNN | 3.83 | 3.47
SFNN | fine-tuned by Simplified-SFNN | 2.45 | 1.81

Table 4: Test NLL on the CASIA dataset. All Simplified-SFNNs are constructed by replacing the first hidden feature maps of the baseline models with stochastic ones.

6.2 Classification

We also evaluate the regularization effect of Simplified-SFNN for the classification tasks on CIFAR-10, CIFAR-100 and SVHN. The CIFAR-10 and CIFAR-100 datasets consist of 50,000 training and 10,000 test images. They have 10 and 100 image classes, respectively. The SVHN dataset consists of 73,257 training and 26,032 test images.³ It consists of house numbers (digits 0 to 9) collected by Google Street View. Similar to [39], we pre-process the data using global contrast normalization and ZCA whitening. For these datasets, we design a convolutional version of Simplified-SFNN using convolutional neural networks such as Lenet-5 [6], NIN [13] and WRN [39]. All Simplified-SFNNs are constructed by replacing a hidden feature map of a baseline model, i.e., Lenet-5, NIN or WRN, with a stochastic one, as shown in Figure 3.

³ We do not use the extra SVHN dataset for training.

Figure 5: (a) The input-output Chinese characters. The generated samples from (b) FCN, (c) SFNN under the simple transformation, and (d) SFNN fine-tuned by Simplified-SFNN. Our model can generate the multi-modal outputs (red rectangle), while SFNN under the simple transformation cannot.

Figure 6: Test error rates of WRN* per training epoch on CIFAR-100, for the baseline WRN and for WRN* trained by Simplified-SFNN with one and with two stochastic layers. One can note that the performance gain is more significant when we introduce more stochastic layers.

We use WRN with 16 and 28 layers for the SVHN and CIFAR datasets, respectively, since they showed state-of-the-art performance as reported by [39]. In the case of WRN, we introduce up to two stochastic convolution layers. Similar to [39], the loss was minimized using stochastic gradient descent with Nesterov momentum. The mini-batch size is set to 128, and the weight decay is set to 0.0005. For 100 epochs, we first train the baseline models, i.e., Lenet-5, NIN and WRN, and the trained parameters are used for initializing those of the Simplified-SFNNs. All Simplified-SFNNs are trained with M = 5 samples, and the test error is measured only by the approximation (9). The test errors of the baseline models are measured after training them for 200 epochs, similar to [39]. All models are trained using dropout [16] and batch normalization [17].

Table 5 reports the classification error rates on CIFAR-10, CIFAR-100 and SVHN. Due to the regularization effects, Simplified-SFNNs consistently outperform their baseline DNNs. In particular, WRN* of 28 layers and 36 million parameters outperforms WRN by 0.08% on CIFAR-10 and 0.58% on CIFAR-100. Figure 6 shows that the error rate decreases more when we introduce more stochastic layers, but this increases the fine-tuning time-complexity of Simplified-SFNN.

Inference Model | Training Model | CIFAR-10 | CIFAR-100 | SVHN
Lenet-5 | Lenet-5 | 37.67 | 77.6 | 11.18
Lenet-5* | fine-tuned by Simplified-SFNN | 33.58 | 73.0 | 9.88
NIN | NIN | 9.51 | 32.66 | 3.1
NIN* | fine-tuned by Simplified-SFNN | 9.33 | 30.81 | 3.01
WRN | WRN | 4.22 (4.39) | 20.30 (20.04) | 3.5
WRN* | fine-tuned by Simplified-SFNN (one stochastic layer) | 4.21 | 19.98 | 3.09
WRN* | fine-tuned by Simplified-SFNN (two stochastic layers) | 4.14 | 19.72 | 3.06

Table 5: Test error rates [%] on CIFAR-10, CIFAR-100 and SVHN. The error rates for WRN are from our experiments, where the original ones reported in [39] are in brackets. Results with * are obtained using the horizontal flipping and random cropping augmentation.

7 Conclusion

In order to develop an efficient training method for large-scale SFNN, this paper proposes a new intermediate stochastic model, called Simplified-SFNN. We establish the connection between three models, i.e., DNN → Simplified-SFNN → SFNN, which naturally leads to an efficient training procedure for the stochastic models utilizing the pre-trained parameters and architectures of DNN. Using several popular DNNs including Lenet-5, NIN, FCN and WRN, we show how they can be effectively transferred to the corresponding stochastic models for both multi-modal and non-multi-modal tasks. We believe that our work brings a new angle for training stochastic neural networks, which would be of broader interest in many related applications.

References

[1] Ruiz, F.R., AUEB, M.T.R. and Blei, D. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems (NIPS), 2016.

[2] Long, J., Shelhamer, E. and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[3] Liu, C.L., Yin, F., Wang, D.H. and Wang, Q.F. CASIA online and offline Chinese handwriting databases. In International Conference on Document Analysis and Recognition (ICDAR), 2011.

[4] Keys, R. Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1981.

[5] Neal, R.M. Learning stochastic feedforward networks. Department of Computer Science, University of Toronto, 1990.

[6] LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

[7] Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[8] Tang, Y. and Salakhutdinov, R.R. Learning stochastic feedforward neural networks. In Advances in Neural Information Processing Systems (NIPS), 2013.

[9] Raiko, T., Berglund, M., Alain, G. and Dinh, L. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.

[10] Wan, L., Zeiler, M., Zhang, S., Cun, Y.L. and Fergus, R. Regularization of neural networks using DropConnect. In International Conference on Machine Learning (ICML), 2013.

[11] Krizhevsky, A., Sutskever, I. and Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

[12] Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A. and Bengio, Y. Maxout networks. In International Conference on Machine Learning (ICML), 2013.

[13] Lin, M., Chen, Q. and Yan, S. Network in network. In International Conference on Learning Representations (ICLR), 2014.

[14] Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

[15] Blundell, C., Cornebise, J., Kavukcuoglu, K. and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning (ICML), 2015.

[16] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[17] Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.

[18] Salakhutdinov, R. and Hinton, G.E. Deep Boltzmann machines. In Conference on Artificial Intelligence and Statistics (AISTATS), 2009.

[19] Neal, R.M. Connectionist learning of belief networks. Artificial Intelligence, 1992.

[20] Hinton, G.E., Osindero, S. and Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Computation, 2006.

[21] Poon, H. and Domingos, P. Sum-product networks: A new deep architecture. In Conference on Uncertainty in Artificial Intelligence (UAI), 2011.

[22] Chen, T., Goodfellow, I. and Shlens, J. Net2Net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.

[23] Maass, W. Networks of spiking neurons: the third generation of neural network models. Neural Networks, 1997.

[24] Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N. and Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.

[25] Zeiler, M.D. and Fergus, R. Stochastic pooling for regularization of deep convolutional neural networks. In International Conference on Learning Representations (ICLR), 2013.

[26] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B. and Ng, A.Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[27] Courbariaux, M., Bengio, Y. and David, J.P. BinaryConnect: Training deep neural networks with binary weights during propagations. arXiv preprint arXiv:1511.00363, 2015.

[28] Gulcehre, C., Moczulski, M., Denil, M. and Bengio, Y. Noisy activation functions. arXiv preprint arXiv:1603.00391, 2016.

[29] Rumelhart, D.E., Hinton, G.E. and Williams, R.J. Learning internal representations by error propagation. In Neurocomputing: Foundations of Research, MIT Press, 1988.

[30] Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 1951.

[31] Ba, J. and Frey, B. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2013.

[32] Kingma, D.P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[33] Nair, V. and Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), 2010.

[34] Gu, S., Levine, S., Sutskever, I. and Mnih, A. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176, 2015.

[35] Bengio, Y., Léonard, N. and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[36] Bishop, C.M. Mixture density networks. Aston University, 1994.

[37] Susskind, J.M., Anderson, A.K. and Hinton, G.E. The Toronto Face Database. Department of Computer Science, University of Toronto, Toronto, ON, Canada, Tech. Rep., 2010.

[38] Saul, L.K., Jaakkola, T. and Jordan, M.I. Mean field theory for sigmoid belief networks. Artificial Intelligence, 1996.

[39] Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[40] Huang, G., Sun, Y., Liu, Z., Sedra, D. and Weinberger, K.Q. Deep networks with stochastic depth. In European Conference on Computer Vision (ECCV), 2016.

[41] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.

[42] Zaremba, W. and Sutskever, I. Reinforcement learning neural Turing machines. arXiv preprint arXiv:1505.00521, 2015.

[43] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[44] Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.