Simplified Stochastic Feedforward Neural Networks


Kimin Lee, Jaehyung Kim, Song Chong, Jinwoo Shin

arXiv preprint, v1 [cs.LG], 11 April 2017

Abstract

It has been believed that stochastic feedforward neural networks (SFNNs) have several advantages beyond deterministic deep neural networks (DNNs): they have more expressive power allowing multi-modal mappings and regularize better due to their stochastic nature. However, training large-scale SFNN is notoriously harder. In this paper, we aim at developing efficient training methods for SFNN, in particular using known architectures and pre-trained parameters of DNN. To this end, we propose a new intermediate stochastic model, called Simplified-SFNN, which can be built upon any baseline DNN and approximates certain SFNN by simplifying its upper latent units above stochastic ones. The main novelty of our approach is in establishing the connection between three models, i.e., DNN -> Simplified-SFNN -> SFNN, which naturally leads to an efficient training procedure of the stochastic models utilizing pre-trained parameters of DNN. Using several popular DNNs, we show how they can be effectively transferred to the corresponding stochastic models for both multi-modal and classification tasks on the MNIST, TFD, CASIA, CIFAR-10, CIFAR-100 and SVHN datasets. In particular, we train a stochastic model of 28 layers and 36 million parameters, where training such a large-scale stochastic network is significantly challenging without using Simplified-SFNN.

1 Introduction

Recently, deterministic deep neural networks (DNNs) have demonstrated state-of-the-art performance on many supervised tasks, e.g., speech recognition [24] and object recognition [11]. One of the main components underlying these successes is the efficient training methods for large-scale DNNs, which include backpropagation [29], stochastic gradient descent [30], dropout/dropconnect [16, 10], batch/weight normalization [17, 46], and various activation functions [33, 28]. On the other hand, stochastic feedforward neural networks (SFNNs) [5] having random latent units are often necessary to model the complex stochastic nature of many real-world tasks, e.g., structured prediction [8] and image generation [41]. Furthermore, it is believed that SFNN has several advantages beyond DNN [9]: it has more expressive power for multi-modal learning and regularizes better for large-scale networks.

Training large-scale SFNN is notoriously hard since backpropagation is not directly applicable. Certain stochastic neural networks using continuous random units are known to be trainable efficiently using backpropagation with variational techniques and reparameterization tricks [32, 1]. On the other hand, training SFNN having discrete, i.e., binary or multi-modal, random units is more difficult since intractable probabilistic inference is involved, requiring too

K. Lee, J. Kim, S. Chong and J. Shin are with the School of Electrical Engineering at Korea Advanced Institute of Science and Technology (KAIST), Republic of Korea. Authors' e-mails: kiminlee@kaist.ac.kr, jaehyungkim@kaist.ac.kr, songchong@kaist.edu, jinwoos@kaist.ac.kr

many random samples. There have been several efforts toward developing efficient training methods for SFNN having binary random latent units [5, 38, 8, 35, 9, 34] (see Section 2.1 for more details). However, training an SFNN is still significantly slower than training a DNN of the same architecture; consequently, most prior works have considered only a small number (at most 5 or so) of layers in SFNN. We aim for the same goal, but our direction is complementary to them. Instead of training an SFNN directly, we study whether pre-trained parameters from a DNN (or easier models) can be transferred to it, possibly with further low-cost fine-tuning. This approach is attractive since one can utilize recent advances in DNN design and training. For example, one can design the network structure of SFNN following known specialized DNN architectures and use their pre-trained parameters.

To this end, we first try transferring pre-trained parameters of a DNN using sigmoid activation functions to those of the corresponding SFNN directly. In our experiments, this heuristic works reasonably well. For multi-modal learning, SFNN under such a simple transformation outperforms DNN. Even for MNIST classification, the former performs similarly to the latter (see Section 2 for more details). However, it is questionable whether a similar strategy works in general, particularly for other, unbounded activation functions like ReLU [33], since SFNN has binary, i.e., bounded, random latent units. Moreover, it loses the regularization benefit of SFNN: it is believed that transferring parameters of stochastic models to DNN helps its regularization, but the opposite is unlikely.

Contribution. To address these issues, we propose a special form of stochastic neural network, named Simplified-SFNN, which is intermediate between SFNN and DNN and has the following properties. First, Simplified-SFNN can be built upon any baseline DNN, possibly having unbounded activation functions. The most significant part of our approach lies in providing rigorous network knowledge transferring [22] between Simplified-SFNN and DNN. In particular, we prove that parameters of DNN can be transformed to those of the corresponding Simplified-SFNN while preserving the performance, i.e., both represent the same mapping. Second, Simplified-SFNN approximates certain SFNN, better than DNN, by simplifying its upper latent units above stochastic ones using two different non-linear activation functions. Simplified-SFNN is much easier to train than SFNN while still maintaining its stochastic regularization effect. We also remark that SFNN is a Bayesian network, while Simplified-SFNN is not.

The above connection DNN -> Simplified-SFNN -> SFNN naturally suggests the following training procedure for both SFNN and Simplified-SFNN: train a baseline DNN first and then fine-tune its corresponding Simplified-SFNN initialized by the transformed DNN parameters. The pre-training stage accelerates the training task since DNN is faster to train than Simplified-SFNN. In addition, one can also utilize known DNN training techniques such as dropout and batch normalization for fine-tuning Simplified-SFNN. In our experiments, we train SFNN and Simplified-SFNN under the proposed strategy. They consistently outperform the corresponding DNN for both multi-modal and classification tasks, where the former and the latter are for measuring the model expressive power and the regularization effect, respectively. To the best of our knowledge, we are the first to confirm that SFNN indeed regularizes better than DNN. We also construct the stochastic models following the same network structure of popular DNNs including Lenet-5 [6], NIN [13], FCN [2] and WRN [39].
In particular, WRN (wide residual network) of 28 layers and 36 million parameters has shown state-of-the-art performance on the CIFAR-10 and CIFAR-100 classification datasets, and our stochastic models built upon WRN outperform the deterministic WRN on those datasets.

Figure 1: The generated samples from (a) sigmoid-DNN and (b) SFNN which uses the same parameters trained by sigmoid-DNN. One can note that SFNN can model the multiple modes in output space y around x = 0.4.

Table 1: The performance of simple parameter transformations from DNN to SFNN on the MNIST and synthetic datasets, where each layer of the neural networks contains 800 and 50 hidden units for the two datasets, respectively. The rows compare sigmoid-DNN and SFNN, and ReLU-DNN and SFNN, each with 2, 3 and 4 hidden layers; we report training negative log-likelihood (NLL), training and test error rates (%) for MNIST classification, and test NLL for multi-modal learning. For all experiments, only the first hidden layer of the DNN is replaced by a stochastic one.

2 Simple Transformation from DNN to SFNN

2.1 Preliminaries for SFNN

A stochastic feedforward neural network (SFNN) is a hybrid model, which has both stochastic binary and deterministic hidden units. We first introduce SFNN with one stochastic hidden layer (and without deterministic hidden layers) for simplicity. Throughout this paper, we commonly denote the bias for unit i and the weight matrix of the l-th hidden layer by b_i^l and W^l, respectively. Then, the stochastic hidden layer in SFNN is defined as a binary random vector with N^1 units, i.e., h^1 in {0,1}^{N^1}, drawn under the following distribution:

    P(h^1 | x) = \prod_{i=1}^{N^1} P(h_i^1 | x),  where  P(h_i^1 = 1 | x) = \sigma(W_i^1 x + b_i^1).    (1)

In the above, x is the input vector and \sigma(x) = 1 / (1 + e^{-x}) is the sigmoid function. Our conditional distribution of the output y is defined as follows:

    P(y | x) = E_{P(h^1|x)} [ P(y | h^1) ] = E_{P(h^1|x)} [ N(y | W^2 h^1 + b^2, \sigma_y^2) ],

where N denotes the normal distribution with mean W^2 h^1 + b^2 and fixed variance \sigma_y^2. Therefore, P(y | x) can express a very complex, multi-modal distribution since it is a mixture of exponentially many normal distributions. The multi-layer extension is straightforward via a combination of stochastic and deterministic hidden layers (see [8, 9]). Furthermore, one can use any other output distribution, as for DNN, e.g., softmax for classification tasks.

There are two computational issues for training SFNN: computing expectations with respect to stochastic units in the forward pass and computing gradients in the backward pass. One can notice that both are computationally intractable since they require summations over exponentially many configurations of all stochastic units. First, in order to handle the issue in the forward pass, one can use the following Monte Carlo approximation for estimating the expectation:

    P(y | x) ~ (1/M) \sum_{m=1}^{M} P(y | h^{(m)}),  where h^{(m)} ~ P(h^1 | x)

and M is the number of samples. This random estimator is unbiased and has relatively low variance [8] since one can draw samples from the exact distribution. Next, in order to handle the issue in the backward pass, [5] proposed Gibbs sampling, but it is known that it often mixes poorly. [38] proposed variational learning based on the mean-field approximation, but it has additional parameters making the variational lower bound looser. More recently, several other techniques have been proposed, including unbiased estimators of the variational bound using importance sampling [8, 9] and biased/unbiased estimators of the gradient for approximating backpropagation [35, 9, 34].

2.2 Simple transformation from sigmoid-DNN and ReLU-DNN to SFNN

Despite the recent advances, training SFNN is still very slow compared to DNN due to the sampling procedures: in particular, it is notoriously hard to train SFNN when the network structure is deeper and wider. In order to handle these issues, we consider the following approximation:

    P(y | x) = E_{P(h^1|x)} [ N(y | W^2 h^1 + b^2, \sigma_y^2) ]
             ~ N( y | E_{P(h^1|x)} [ W^2 h^1 ] + b^2, \sigma_y^2 )
             = N( y | W^2 \sigma(W^1 x + b^1) + b^2, \sigma_y^2 ).    (2)

Note that the above approximation corresponds to replacing stochastic units by deterministic ones whose hidden activation values equal the marginal distributions of the stochastic units; i.e., SFNN can be approximated by a DNN using sigmoid activation functions, which we call sigmoid-DNN. When there exist more latent layers above the stochastic one, one has to apply similar approximations to all of them, i.e., exchanging the orders of expectations and non-linear functions, for making the DNN and SFNN equivalent. Therefore, instead of training SFNN directly, one can try transferring pre-trained parameters of sigmoid-DNN to those of the corresponding SFNN directly: train sigmoid-DNN instead of SFNN, and replace deterministic units by stochastic ones for the inference purpose. Although such a strategy looks somewhat crude, it was often observed in the literature that it works reasonably well for SFNN [9], and we also evaluate it as reported in Table 1. We also note that similar approximations appear in the context of dropout: it trains a stochastic model averaging exponentially many DNNs sharing parameters, but also approximates a single DNN well.
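To make the simple transformation concrete, the following sketch runs SFNN inference with one stochastic layer using parameters taken from a trained sigmoid-DNN, following Equation (1), the output model above, and the Monte Carlo estimator. It is a minimal illustration, not the authors' code: the arrays W1, b1, W2, b2 stand in for trained parameters, and the toy usage at the end only demonstrates the shapes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sfnn_predictive_samples(x, W1, b1, W2, b2, M=500, rng=None):
    """Draw M Monte Carlo samples of the output mean W2 h1 + b2,
    where h1 ~ P(h1 | x) with P(h1_i = 1 | x) = sigmoid(W1_i x + b1_i), Eq. (1)."""
    rng = np.random.default_rng() if rng is None else rng
    p = sigmoid(W1 @ x + b1)                          # marginals of the stochastic layer
    h = (rng.random((M, p.size)) < p).astype(float)   # M binary samples of h1
    return h @ W2.T + b2                              # one Gaussian mean per sample

def sigmoid_dnn_predict(x, W1, b1, W2, b2):
    """Deterministic sigmoid-DNN obtained by exchanging expectation and non-linearity, Eq. (2)."""
    return W2 @ sigmoid(W1 @ x + b1) + b2

# Toy usage with random parameters standing in for a trained sigmoid-DNN.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(50, 1)), np.zeros(50)
W2, b2 = rng.normal(size=(1, 50)), np.zeros(1)
x = np.array([0.4])
means = sfnn_predictive_samples(x, W1, b1, W2, b2, M=500, rng=rng)
print(means.mean(axis=0), sigmoid_dnn_predict(x, W1, b1, W2, b2))  # close on average

Averaging the sampled means recovers the sigmoid-DNN prediction, which is exactly the approximation in Equation (2); the spread of the individual samples is what lets the SFNN place probability mass on several output modes for the same input.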

Now we investigate a similar transformation in the case when the DNN uses the unbounded ReLU activation function, which we call ReLU-DNN. Many recent deep networks are of the ReLU-DNN type because ReLU mitigates the gradient vanishing problem, and their pre-trained parameters are often available. Although it is straightforward to build SFNN from sigmoid-DNN, it is less clear in this case since ReLU is unbounded. To handle this issue, we redefine the stochastic latent units of SFNN:

    P(h^1 | x) = \prod_{i=1}^{N^1} P(h_i^1 | x),  where  P(h_i^1 = 1 | x) = \min{ \alpha f(W_i^1 x + b_i^1), 1 }.    (3)

In the above, f(x) = max{x, 0} is the ReLU activation function and \alpha is a hyper-parameter. A simple transformation can be defined similarly to the case of sigmoid-DNN via replacing deterministic units by stochastic ones. However, to preserve the parameter information of ReLU-DNN, one has to choose \alpha such that \alpha f(\hat{W}_i^1 x + \hat{b}_i^1) <= 1 and rescale the upper parameters W^2 as follows:

    \alpha <- 1 / max_{i,x} f(\hat{W}_i^1 x + \hat{b}_i^1),   (W^1, b^1) <- (\hat{W}^1, \hat{b}^1),   (W^2, b^2) <- (\hat{W}^2 / \alpha, \hat{b}^2).    (4)

Then, applying similar approximations as in (2), i.e., exchanging the orders of expectations and non-linear functions, one can observe that ReLU-DNN and SFNN are equivalent.

We evaluate the performance of the simple transformations from DNN to SFNN on the MNIST dataset [6] and the synthetic dataset [36], where the former and the latter are popular datasets used for a classification task and a multi-modal (i.e., one-to-many mapping) prediction task, respectively. In all experiments reported in this paper, the softmax and Gaussian (with standard deviation \sigma_y = 0.05) distributions are used for the output probability on classification and regression tasks, respectively. Only the first hidden layer of the DNN is replaced by a stochastic layer, and we use 500 samples for estimating the expectations in the SFNN inference. As reported in Table 1, we observe that the simple transformation often works well for both tasks: the SFNN and sigmoid-DNN inferences using the same parameters trained by sigmoid-DNN perform similarly for the classification task, and the former significantly outperforms the latter for the multi-modal task (also see Figure 1). This suggests that the expensive SFNN training might not be necessary, depending on the targeted learning quality. However, in the case of ReLU, SFNN performs much worse than ReLU-DNN for the MNIST classification task under the parameter transformation.

3 Transformation from DNN to SFNN via Simplified-SFNN

In this section, we propose an advanced method to utilize the pre-trained parameters of DNN for training SFNN. As shown in the previous section, simple parameter transformations from DNN to SFNN do not clearly work in general, in particular for activation functions other than sigmoid. Moreover, training DNN does not utilize the stochastic regularizing effect, which is an important benefit of SFNN. To address these issues, we design an intermediate model, called Simplified-SFNN. The proposed model is a special form of stochastic neural network, which approximates certain SFNN by simplifying its upper latent units above stochastic ones. We then establish more rigorous connections between the three models, DNN -> Simplified-SFNN -> SFNN, which leads to an efficient training procedure of the stochastic models utilizing pre-trained parameters of DNN. In our experiments, we evaluate the strategy for various tasks and popular DNN architectures.
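The rescaling in Equations (3)-(4) is straightforward to implement once the maximum first-layer activation over the data is known. The sketch below shows one way to do it under the setting above; it assumes NumPy arrays holding the ReLU-DNN parameters (the hatted W1, b1, W2, b2) and a data matrix X, and the names are illustrative rather than taken from the paper.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_dnn_to_sfnn(W1_hat, b1_hat, W2_hat, b2_hat, X):
    """Simple transformation of Eq. (4): pick alpha so that
    alpha * f(W1_hat x + b1_hat) <= 1 on the data, and rescale the upper layer."""
    acts = relu(X @ W1_hat.T + b1_hat)        # first-layer activations over the dataset
    alpha = 1.0 / acts.max()                  # alpha <- 1 / max_{i,x} f(W1_i x + b1_i)
    W1, b1 = W1_hat, b1_hat                   # first (stochastic) layer keeps the DNN weights
    W2, b2 = W2_hat / alpha, b2_hat           # upper layer is rescaled by 1/alpha
    return alpha, (W1, b1), (W2, b2)

def sfnn_marginals(x, alpha, W1, b1):
    """Marginals of the redefined stochastic units, Eq. (3)."""
    return np.minimum(alpha * relu(W1 @ x + b1), 1.0)

# With these parameters, E[W2 h1] + b2 = W2_hat f(W1_hat x + b1_hat) + b2_hat on the data,
# so averaging SFNN samples recovers the ReLU-DNN mapping there.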

Figure 2: (a) Simplified-SFNN (top) and SFNN (bottom). The randomness of the stochastic layer propagates only to its upper layer in the case of Simplified-SFNN. (b) For the first 200 epochs, we train a baseline ReLU-DNN. Then, we train Simplified-SFNN initialized by the DNN parameters under transformation (8) with \gamma_2 = 50. We observe that training ReLU-DNN directly does not reduce the test error, while network knowledge transferring still holds between the baseline ReLU-DNN and the corresponding ReLU-DNN*. (c) As the value of \gamma_2 increases, the knowledge transferring loss, measured as (1/|D|) \sum_{x} (1/N^l) \sum_{i} | h_i^l(x) - \hat{h}_i^l(x) |, decreases.

3.1 Simplified-SFNN with two hidden layers and non-negative activation functions

For clarity of presentation, we first introduce Simplified-SFNN with two hidden layers and non-negative activation functions; its extensions to multiple layers and general activation functions are presented in Section 4. We also remark that we primarily describe fully-connected Simplified-SFNNs, but their convolutional versions can also be naturally defined. In the Simplified-SFNN of two hidden layers, we assume that the first and second hidden layers consist of stochastic binary hidden units and deterministic ones, respectively. As in (3), the first layer is defined as a binary random vector with N^1 units, i.e., h^1 in {0,1}^{N^1}, drawn under the following distribution:

    P(h^1 | x) = \prod_{i=1}^{N^1} P(h_i^1 | x),  where  P(h_i^1 = 1 | x) = \min{ \alpha^1 f(W_i^1 x + b_i^1), 1 },    (5)

where x is the input vector, \alpha^1 > 0 is a hyper-parameter for the first layer, and f : R -> R_+ is a non-negative non-linear activation function with |f'(x)| <= 1 for all x in R, e.g., the ReLU and sigmoid activation functions. The second layer is then defined as the following deterministic vector with N^2 units, i.e., h^2(x) in R^{N^2}:

    h^2(x) = [ f( \alpha^2 ( E_{P(h^1|x)} [ s(W_j^2 h^1 + b_j^2) ] - s(0) ) ) : for all j ],    (6)

where \alpha^2 > 0 is a hyper-parameter for the second layer and s : R -> R is a differentiable function with |s''(x)| <= 1 for all x in R, e.g., the sigmoid and tanh functions. In our experiments, we use the sigmoid function for s(x). Here, one can note that the proposed model has the same computational issues as SFNN in the forward and backward passes due to the complex expectation. One can train Simplified-SFNN similarly to SFNN: we use the Monte Carlo approximation for estimating the expectation and the biased estimator of the gradient for approximating backpropagation, inspired by [9] (see Section 3.3 for more details).
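As a concrete illustration of Equations (5)-(6), the following sketch computes the two-hidden-layer Simplified-SFNN forward pass with a Monte Carlo estimate of the inner expectation. It is a toy re-implementation under the definitions above, not the authors' code; the parameter names and the choice of ReLU for f and sigmoid for s follow the running example of this section.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simplified_sfnn_forward(x, params, M=20, rng=None):
    """Two-hidden-layer Simplified-SFNN forward pass, Eqs. (5)-(6).
    params = (alpha1, W1, b1, alpha2, W2, b2); f = ReLU, s = sigmoid."""
    alpha1, W1, b1, alpha2, W2, b2 = params
    rng = np.random.default_rng() if rng is None else rng
    # Stochastic first layer: marginals min{alpha1 * f(W1 x + b1), 1}, Eq. (5).
    p1 = np.minimum(alpha1 * relu(W1 @ x + b1), 1.0)
    h1 = (rng.random((M, p1.size)) < p1).astype(float)      # M binary samples of h1
    # Deterministic second layer: f(alpha2 * (E[s(W2 h1 + b2)] - s(0))), Eq. (6),
    # with the expectation replaced by its Monte Carlo estimate.
    expect_s = sigmoid(h1 @ W2.T + b2).mean(axis=0)
    h2 = relu(alpha2 * (expect_s - sigmoid(0.0)))
    return h2

Note that, unlike in SFNN, the sampled randomness of h^1 is averaged out inside h^2, so any layers above h^2 (and the output) only ever see a deterministic vector.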

Table 2: Classification test error rates [%] on MNIST, where each layer of the neural networks contains 800 hidden units. For 2, 3 and 4 hidden layers, the rows compare the baseline sigmoid-DNN or ReLU-DNN, SFNN using the baseline parameters, and Simplified-SFNN, DNN* and SFNN using parameters fine-tuned under Simplified-SFNN; the columns report results without batch normalization (BN) and dropout (DO), with BN, and with DO. All Simplified-SFNNs are constructed by replacing the first hidden layer of a baseline DNN with a stochastic hidden layer. The performance improvements over the baseline DNN due to fine-tuning DNN parameters under Simplified-SFNN are given in brackets.

We are interested in transferring parameters of DNN to Simplified-SFNN to utilize the training benefits of DNN, since the former is much faster to train than the latter. To this end, we consider the following DNN, whose l-th hidden layer is deterministic and defined as follows:

    \hat{h}^l(x) = [ \hat{h}_i^l(x) = f( \hat{W}_i^l \hat{h}^{l-1}(x) + \hat{b}_i^l ) : for all i ],  where \hat{h}^0(x) = x.    (7)

As stated in the following theorem, we establish a rigorous way to initialize the parameters of Simplified-SFNN in order to transfer the knowledge stored in DNN.

Theorem 1. Assume that both DNN and Simplified-SFNN with two hidden layers have the same network structure with non-negative activation function f. Given parameters {\hat{W}^l, \hat{b}^l : l = 1, 2} of DNN and input dataset D, choose those of Simplified-SFNN as follows:

    (\alpha^1, W^1, b^1) <- ( 1/\gamma_1, \hat{W}^1, \hat{b}^1 ),   (\alpha^2, W^2, b^2) <- ( \gamma_2 \gamma_1 / s'(0), \hat{W}^2 / \gamma_2, \hat{b}^2 / (\gamma_1 \gamma_2) ),    (8)

where \gamma_1 = max_{i, x in D} f(\hat{W}_i^1 x + \hat{b}_i^1) and \gamma_2 > 0 is any positive constant. Then, for all j and x in D, it follows that

    | h_j^2(x) - \hat{h}_j^2(x) | <= \gamma_1 ( ||\hat{W}_j^2||_1 + |\hat{b}_j^2| \gamma_1^{-1} )^2 / ( 2 s'(0) \gamma_2 ).

The proof of the above theorem is presented in Section 5.1. Our proof is built upon the first-order Taylor expansion of the non-linear function s(x). Theorem 1 implies that one can make

Simplified-SFNN represent the function values of DNN with bounded errors using a linear transformation. Furthermore, the errors can be made arbitrarily small by choosing large \gamma_2, i.e.,

    lim_{\gamma_2 -> infinity} | h_j^2(x) - \hat{h}_j^2(x) | = 0,  for all j, x in D.

Figure 2(c) shows that the knowledge transferring loss decreases as \gamma_2 increases on MNIST classification. Based on this, we choose \gamma_2 = 50 commonly for all experiments.

Table 3: Test negative log-likelihood (NLL) on the MNIST and TFD datasets, where each layer of the neural networks contains 200 hidden units. For 2 and 3 hidden layers on each dataset, the rows compare the baseline sigmoid-DNN or ReLU-DNN, SFNN using the baseline parameters, and Simplified-SFNN and SFNN using parameters fine-tuned under Simplified-SFNN. All Simplified-SFNNs are constructed by replacing the first hidden layer of a baseline DNN with a stochastic hidden layer.

3.2 Why Simplified-SFNN?

Given a Simplified-SFNN model, the corresponding SFNN can be naturally defined by taking out the expectation in (6). As illustrated in Figure 2(a), the main difference between SFNN and Simplified-SFNN is that the randomness of the stochastic layer propagates only to its upper layer in the latter, i.e., the randomness of h^1 is averaged out at its upper units h^2 and does not propagate to h^3 or the output y. Hence, Simplified-SFNN is no longer a Bayesian network. This makes training Simplified-SFNN much easier than SFNN since random samples are not required at some layers,(1) and consequently the quality of gradient estimations can also be improved, in particular for unbounded activation functions. Furthermore, one can use the same approximation procedure to see that Simplified-SFNN approximates SFNN. However, since Simplified-SFNN still maintains binary random units, it applies the approximation steps later than DNN does. In summary, Simplified-SFNN is an intermediate model between DNN and SFNN, i.e., DNN -> Simplified-SFNN -> SFNN.

(1) For example, if one replaces the first feature maps in the fifth residual unit of Pre-ResNet having 164 layers [45] by stochastic ones, then the corresponding DNN, Simplified-SFNN and SFNN take roughly 1.5 minutes, 2.5 minutes and 16.5 minutes per training epoch, respectively, on our machine with one Intel Core i7-5820K 6-core CPU @ 3.3 GHz and one NVIDIA GTX Titan X GPU (3072 CUDA cores). Here, we trained both stochastic models using the biased estimator [9] with 10 random samples on the CIFAR-10 dataset.

The above connection naturally suggests the following training procedure for both SFNN and Simplified-SFNN: train a baseline DNN first and then fine-tune its corresponding Simplified-SFNN initialized by the transformed DNN parameters. Finally, the fine-tuned parameters can be used for SFNN as well. We evaluate this strategy for MNIST classification. The MNIST dataset consists of 28 x 28 pixel greyscale images, each containing a digit 0 to 9, with 60,000 training and 10,000 test images. For this experiment, we do not use any data augmentation or pre-processing. The loss was minimized using the ADAM learning rule [7] with a mini-batch size of 128. We used an exponentially decaying learning rate. Hyper-parameters are tuned on the validation set consisting of the last 10,000 training images. All Simplified-SFNNs are constructed by replacing the first hidden layer of a baseline DNN with a stochastic hidden layer.
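The parameter transformation of Theorem 1, Equation (8), is mechanical, which is what makes the DNN -> Simplified-SFNN initialization cheap. The sketch below shows one way to compute it for a two-hidden-layer ReLU network; it assumes the trained DNN weights and the training inputs are available as NumPy arrays, and uses s = sigmoid so that s'(0) = 1/4.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def transform_dnn_to_simplified_sfnn(W1_hat, b1_hat, W2_hat, b2_hat, X, gamma2=50.0):
    """Initialize Simplified-SFNN parameters from DNN parameters, Eq. (8)."""
    s_prime_0 = 0.25                                   # s = sigmoid, s'(0) = 1/4
    gamma1 = relu(X @ W1_hat.T + b1_hat).max()         # gamma1 = max_{i, x in D} f(W1_i x + b1_i)
    alpha1 = 1.0 / gamma1
    W1, b1 = W1_hat, b1_hat                            # first (stochastic) layer keeps the DNN weights
    alpha2 = gamma2 * gamma1 / s_prime_0
    W2, b2 = W2_hat / gamma2, b2_hat / (gamma1 * gamma2)
    return (alpha1, W1, b1), (alpha2, W2, b2)

# Theorem 1 guarantees |h2(x) - h2_hat(x)| = O(1 / gamma2) on the dataset X,
# so larger gamma2 gives a Simplified-SFNN that matches the baseline DNN more closely.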

We first train a baseline DNN for the first 200 epochs, and the trained parameters of the DNN are used for initializing those of Simplified-SFNN. We then train Simplified-SFNN for 50 epochs. We choose the hyper-parameter \gamma_2 = 50 in the parameter transformation. All Simplified-SFNNs are trained with M = 20 samples at each epoch, and at test time we use 500 samples. Table 2 shows that SFNN under the two-stage training always performs better than SFNN under the simple transformation (4) from ReLU-DNN. More interestingly, Simplified-SFNN consistently outperforms its baseline DNN due to the stochastic regularizing effect, even when we train both models using dropout [16] and batch normalization [17]. This implies that the proposed stochastic model can be used for improving the performance of DNNs, and it can also be combined with other regularization methods such as dropout and batch normalization. In order to confirm the regularization effects, one can again approximate a trained Simplified-SFNN by a new deterministic DNN, which we call DNN* and which is different from its baseline DNN, under the following approximation at upper latent units above binary random units:

    E_{P(h^l|x)} [ s(W^{l+1} h^l) ] ~ s( E_{P(h^l|x)} [ W^{l+1} h^l ] ) = s( W^{l+1} P(h^l = 1 | x) ).    (9)

We found that DNN* using fine-tuned parameters of Simplified-SFNN also outperforms the baseline DNN, as shown in Table 2 and Figure 2(b).

3.3 Training Simplified-SFNN

The parameters of Simplified-SFNN can be learned using a variant of the backpropagation algorithm [29] in a similar manner to DNN. However, in contrast to DNN, there are two computational issues for Simplified-SFNN: computing expectations with respect to stochastic units in the forward pass and computing gradients in the backward pass. One can notice that both are intractable since they require summations over all possible configurations of all stochastic units. First, in order to handle the issue in the forward pass, we use the following Monte Carlo approximation for estimating the expectation:

    E_{P(h^1|x)} [ s(W_j^2 h^1 + b_j^2) ] ~ (1/M) \sum_{m=1}^{M} s( W_j^2 h^{(m)} + b_j^2 ),  where h^{(m)} ~ P(h^1 | x)

and M is the number of samples. This random estimator is unbiased and has relatively low variance [8] since its accuracy does not depend on the dimensionality of h^1 and one can draw samples from the exact distribution. Next, in order to handle the issue in the backward pass, we use the following approximation inspired by [9]:

    d/dW^2 E_{P(h^1|x)} [ s(W_j^2 h^1 + b_j^2) ] ~ (1/M) \sum_{m} s'( W_j^2 h^{(m)} + b_j^2 ) h^{(m)},

    d/dW^1 E_{P(h^1|x)} [ s(W_j^2 h^1 + b_j^2) ] ~ (1/M) \sum_{m} s'( W_j^2 h^{(m)} + b_j^2 ) W_j^2 (d/dW^1) P(h^1 = 1 | x),

where h^{(m)} ~ P(h^1 | x) and M is the number of samples. In our experiments, we commonly choose M = 20.
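At inference time, Equation (9) gives a cheap deterministic approximation of a trained Simplified-SFNN (the DNN* used above and in the experiments): the expectation over the binary units is pushed inside s by replacing h^l with its marginal probabilities. A minimal sketch for the two-hidden-layer case, with f = ReLU and s = sigmoid and with illustrative parameter names (the bias is included here, which is a notational choice of this sketch), is:

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnn_star_forward(x, alpha1, W1, b1, alpha2, W2, b2):
    """Deterministic DNN* approximation of a two-hidden-layer Simplified-SFNN, Eq. (9)."""
    p1 = np.minimum(alpha1 * relu(W1 @ x + b1), 1.0)   # P(h1 = 1 | x), the stochastic layer's marginals
    # E[s(W2 h1 + b2)] is approximated by s(W2 P(h1 = 1 | x) + b2).
    h2 = relu(alpha2 * (sigmoid(W2 @ p1 + b2) - sigmoid(0.0)))
    return h2

# Compared with the sampled forward pass of Section 3.1, this needs no random samples,
# which is why the classification experiments report test error with this approximation.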

4 Extensions of Simplified-SFNN

In this section, we describe how the network knowledge transferring between Simplified-SFNN and DNN, i.e., Theorem 1, generalizes to multiple layers and general activation functions.

4.1 Extension to multiple layers

A deeper Simplified-SFNN with L hidden layers can be defined similarly to the case of L = 2. We also establish network knowledge transferring between Simplified-SFNN and DNN with L hidden layers, as stated in the following theorem. Here, we assume that stochastic layers are not consecutive for simpler presentation, but the theorem is generalizable to consecutive stochastic layers.

Theorem 2. Assume that both DNN and Simplified-SFNN with L hidden layers have the same network structure with non-negative activation function f. Given parameters {\hat{W}^l, \hat{b}^l : l = 1, ..., L} of DNN and input dataset D, choose the same ones for Simplified-SFNN initially and modify them for each l-th stochastic layer and its upper layer as follows:

    (\alpha^l, \alpha^{l+1}, W^{l+1}, b^{l+1}) <- ( 1/\gamma_l, \gamma_{l+1} \gamma_l / s'(0), \hat{W}^{l+1} / \gamma_{l+1}, \hat{b}^{l+1} / (\gamma_l \gamma_{l+1}) ),    (10)

where \gamma_l = max_{i, x in D} f(\hat{W}_i^l h^{l-1}(x) + \hat{b}_i^l) and \gamma_{l+1} is any positive constant. Then, it follows that

    lim_{\gamma_{l+1} -> infinity, for all stochastic hidden layers l} | h_i^L(x) - \hat{h}_i^L(x) | = 0,  for all i, x in D.

The above theorem again implies that it is possible to transfer knowledge from DNN to Simplified-SFNN by choosing large \gamma_{l+1}. The proof of Theorem 2 is similar to that of Theorem 1 and is given in Section 5.2.

4.2 Extension to general activation functions

In this section, we describe an extended version of Simplified-SFNN which can utilize any activation function. To this end, we modify the definitions of the stochastic layers and their upper layers by introducing certain additional terms. If the l-th hidden layer is stochastic, then we slightly modify the original definition (5) as follows:

    P(h^l | x) = \prod_{i=1}^{N^l} P(h_i^l | x),  with  P(h_i^l = 1 | x) = \min{ \alpha^l f(W_i^l h^{l-1}(x) + b_i^l), 1 },

where f : R -> R is a non-linear (possibly negative) activation function with |f'(x)| <= 1 for all x in R. In addition, we re-define its upper layer h^{l+1}(x) by subtracting additional correction terms, involving s(0), s'(0) and W^{l+1}, from the argument of f in (6), so that the first-order term of the Taylor expansion of s again recovers the DNN computation; here h^0(x) = x and s : R -> R is a differentiable function with |s''(x)| <= 1 for all x in R. Under this general Simplified-SFNN model, we also show that transferring network knowledge from DNN to Simplified-SFNN is possible, as stated in the following theorem. Here, we again assume that stochastic layers are not consecutive for simpler presentation.
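To illustrate how the Section 4.1 transformation would be applied in practice, the sketch below walks over a list of trained DNN layers and applies Equation (10) at each layer marked stochastic, leaving the others untouched. It is only an organizational sketch under the paper's assumptions (f = ReLU, s = sigmoid, no consecutive stochastic layers, topmost layer not stochastic); the layer and parameter containers are hypothetical, and gamma_l is computed from the baseline DNN's activations as a proxy for the Simplified-SFNN activations, which is accurate in the large-gamma regime.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def transform_deep_dnn(layers, stochastic, X, gamma_next=50.0):
    """layers: list of dicts {'W': ..., 'b': ...} for a trained DNN (bottom to top).
    stochastic: list of booleans marking which hidden layers become stochastic.
    Returns per-layer {'alpha', 'W', 'b'} for the corresponding Simplified-SFNN, Eq. (10)."""
    s_prime_0 = 0.25                      # s = sigmoid
    out, H = [], X                        # H holds the running hidden activations over the data
    for l, layer in enumerate(layers):
        W, b = layer['W'], layer['b']
        pre = H @ W.T + b
        if stochastic[l]:
            gamma_l = relu(pre).max()     # gamma_l = max_{i, x in D} f(W_i^l h^{l-1}(x) + b_i^l)
            out.append({'alpha': 1.0 / gamma_l, 'W': W, 'b': b})
            # Rescale the layer right above the stochastic one, per Eq. (10).
            Wn, bn = layers[l + 1]['W'], layers[l + 1]['b']
            layers[l + 1] = {'W': Wn / gamma_next,
                             'b': bn / (gamma_l * gamma_next),
                             'alpha': gamma_next * gamma_l / s_prime_0}
        else:
            # Plain deterministic layers keep their weights (alpha = 1 acts as the identity scale).
            out.append({'alpha': layer.get('alpha', 1.0), 'W': W, 'b': b})
        H = relu(pre)                     # deterministic forward pass of the baseline DNN
    return out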

Theorem 3. Assume that both DNN and Simplified-SFNN with L hidden layers have the same network structure with non-linear activation function f. Given parameters {\hat{W}^l, \hat{b}^l : l = 1, ..., L} of DNN and input dataset D, choose the same ones for Simplified-SFNN initially and modify them for each l-th stochastic layer and its upper layer as follows:

    (\alpha^l, \alpha^{l+1}, W^{l+1}, b^{l+1}) <- ( 1/\gamma_l, \gamma_{l+1} \gamma_l / s'(0), \hat{W}^{l+1} / \gamma_{l+1}, \hat{b}^{l+1} / (\gamma_l \gamma_{l+1}) ),

where \gamma_l = max_{i, x in D} f(\hat{W}_i^l h^{l-1}(x) + \hat{b}_i^l) and \gamma_{l+1} is any positive constant. Then, it follows that

    lim_{\gamma_{l+1} -> infinity, for all stochastic hidden layers l} | h_i^L(x) - \hat{h}_i^L(x) | = 0,  for all i, x in D.

We omit the proof of the above theorem since it is a rather direct adaptation of that of Theorem 2.

5 Proofs of Theorems

5.1 Proof of Theorem 1

First consider the first hidden layer, i.e., the stochastic layer. Let \gamma_1 = max_{i, x in D} f(\hat{W}_i^1 x + \hat{b}_i^1) be the maximum value of the hidden units in DNN. If we initialize the parameters (\alpha^1, W^1, b^1) <- (1/\gamma_1, \hat{W}^1, \hat{b}^1), then the marginal distribution of each hidden unit becomes

    P(h_i^1 = 1 | x; W^1, b^1) = \min{ \alpha^1 f(\hat{W}_i^1 x + \hat{b}_i^1), 1 } = (1/\gamma_1) f(\hat{W}_i^1 x + \hat{b}_i^1),  for all i, x in D.    (11)

Next consider the second hidden layer. From Taylor's theorem, for each x there exists a value z between 0 and x such that s(x) = s(0) + s'(0) x + R(x), where R(x) = s''(z) x^2 / 2. Since we consider a binary random vector, i.e., h^1 in {0,1}^{N^1}, one can write

    E_{P(h^1|x)} [ s(\beta_j(h^1)) ] = \sum_{h^1} [ s(0) + s'(0) \beta_j(h^1) + R(\beta_j(h^1)) ] P(h^1 | x)
                                     = s(0) + s'(0) ( W_j^2 P(h^1 = 1 | x) + b_j^2 ) + E_{P(h^1|x)} [ R(\beta_j(h^1)) ],    (12)

where \beta_j(h^1) := W_j^2 h^1 + b_j^2 is the incoming signal. From (6), (11) and (12), for every hidden unit j it follows that

    h_j^2(x; W^2, b^2) = f( \alpha^2 s'(0) ( (1/\gamma_1) W_j^2 \hat{h}^1(x) + b_j^2 ) + \alpha^2 E_{P(h^1|x)} [ R(\beta_j(h^1)) ] ).

Since we assume that |f'(x)| <= 1, the following inequality holds:

    | h_j^2(x; W^2, b^2) - f( \alpha^2 s'(0) ( (1/\gamma_1) W_j^2 \hat{h}^1(x) + b_j^2 ) ) |

        <= \alpha^2 E_{P(h^1|x)} [ |R(\beta_j(h^1))| ] <= (\alpha^2 / 2) ( \sum_i |W_{ij}^2| E_{P(h^1|x)}[h_i^1] + |b_j^2| )^2,    (13)

where we use |s''(z)| <= 1 for the last inequality. Therefore, since we set (\alpha^2, W^2, b^2) <- ( \gamma_2 \gamma_1 / s'(0), \hat{W}^2 / \gamma_2, \hat{b}^2 / (\gamma_1 \gamma_2) ), the argument of f above equals \hat{W}_j^2 \hat{h}^1(x) + \hat{b}_j^2, and it follows that

    | h_j^2(x; W^2, b^2) - \hat{h}_j^2(x; \hat{W}^2, \hat{b}^2) | <= \gamma_1 ( ||\hat{W}_j^2||_1 + |\hat{b}_j^2| \gamma_1^{-1} )^2 / ( 2 s'(0) \gamma_2 ),  for all j, x in D.

This completes the proof of Theorem 1.

5.2 Proof of Theorem 2

For the proof of Theorem 2, we first state two key lemmas on error propagation in Simplified-SFNN.

Lemma 4. Assume that there exists some positive constant B such that | h_i^{l-1}(x) - \hat{h}_i^{l-1}(x) | <= B for all i, x in D, and the l-th hidden layer of Simplified-SFNN is a standard deterministic layer as defined in (7). Given parameters {\hat{W}^l, \hat{b}^l} of DNN, choose the same ones for Simplified-SFNN. Then, the following inequality holds:

    | h_i^l(x) - \hat{h}_i^l(x) | <= B N^{l-1} \hat{W}_max^l,  for all i, x in D,

where \hat{W}_max^l = max_{i,j} |\hat{W}_{ij}^l|.

Proof. See Section 5.3.

Lemma 5. Assume that there exists some positive constant B such that | h_i^{l-1}(x) - \hat{h}_i^{l-1}(x) | <= B for all i, x in D, and the l-th hidden layer of Simplified-SFNN is a stochastic layer. Given parameters {\hat{W}^l, \hat{W}^{l+1}, \hat{b}^l, \hat{b}^{l+1}} of DNN, choose those of Simplified-SFNN as follows:

    (\alpha^l, \alpha^{l+1}, W^{l+1}, b^{l+1}) <- ( 1/\gamma_l, \gamma_{l+1} \gamma_l / s'(0), \hat{W}^{l+1} / \gamma_{l+1}, \hat{b}^{l+1} / (\gamma_l \gamma_{l+1}) ),

where \gamma_l = max_{i, x in D} f(\hat{W}_i^l h^{l-1}(x) + \hat{b}_i^l) and \gamma_{l+1} is any positive constant. Then, for all k and x in D, it follows that

    | h_k^{l+1}(x) - \hat{h}_k^{l+1}(x) | <= B N^{l-1} N^l \hat{W}_max^l \hat{W}_max^{l+1} + \gamma_l ( N^l \hat{W}_max^{l+1} + b_max^{l+1} \gamma_l^{-1} )^2 / ( 2 s'(0) \gamma_{l+1} ),

where b_max^l = max_i |\hat{b}_i^l| and \hat{W}_max^l = max_{i,j} |\hat{W}_{ij}^l|.

Proof. See Section 5.4.

Assume that the l-th layer is the first stochastic hidden layer in Simplified-SFNN. Then, from Theorem 1, we have

    | h_k^{l+1}(x) - \hat{h}_k^{l+1}(x) | <= \gamma_l ( N^l \hat{W}_max^{l+1} + b_max^{l+1} \gamma_l^{-1} )^2 / ( 2 s'(0) \gamma_{l+1} ),  for all k, x in D.    (14)

According to Lemmas 4 and 5, the final error generated by the right-hand side of (14) is bounded by

    \tau_l \gamma_l ( N^l \hat{W}_max^{l+1} + b_max^{l+1} \gamma_l^{-1} )^2 / ( 2 s'(0) \gamma_{l+1} ),    (15)

where \tau_l = \prod_{l' = l+2}^{L} N^{l'-1} \hat{W}_max^{l'}. One can note that the error generated by each stochastic layer is bounded as in (15). Therefore, for all i, x in D, it follows that

    | h_i^L(x) - \hat{h}_i^L(x) | <= \sum_{l : stochastic hidden layer} \tau_l \gamma_l ( N^l \hat{W}_max^{l+1} + b_max^{l+1} \gamma_l^{-1} )^2 / ( 2 s'(0) \gamma_{l+1} ).

From the above inequality, we can conclude that

    lim_{\gamma_{l+1} -> infinity, for all stochastic hidden layers l} | h_i^L(x) - \hat{h}_i^L(x) | = 0,  for all i, x in D.

This completes the proof of Theorem 2.

5.3 Proof of Lemma 4

From the assumption, for each i there exists some constant \epsilon_i with |\epsilon_i| <= B such that h_i^{l-1}(x) = \hat{h}_i^{l-1}(x) + \epsilon_i, for all i, x. By the definition of the standard deterministic layer, it follows that

    h_i^l(x) = f( \hat{W}_i^l h^{l-1}(x) + \hat{b}_i^l ) = f( \hat{W}_i^l \hat{h}^{l-1}(x) + \hat{W}_i^l \epsilon + \hat{b}_i^l ).

Since we assume that |f'(x)| <= 1, one can conclude that

    | h_i^l(x) - f( \hat{W}_i^l \hat{h}^{l-1}(x) + \hat{b}_i^l ) | <= | \hat{W}_i^l \epsilon | <= B \sum_j |\hat{W}_{ij}^l| <= B N^{l-1} \hat{W}_max^l.

This completes the proof of Lemma 4.

5.4 Proof of Lemma 5

From the assumption, for each i there exists some constant \epsilon_i^{l-1} with |\epsilon_i^{l-1}| <= B such that

    h_i^{l-1}(x) = \hat{h}_i^{l-1}(x) + \epsilon_i^{l-1},  for all i, x.    (16)

Let \gamma_l = max_{i, x in D} f(\hat{W}_i^l h^{l-1}(x) + \hat{b}_i^l) be the maximum value of the hidden units. If we initialize the parameters (\alpha^l, W^l, b^l) <- (1/\gamma_l, \hat{W}^l, \hat{b}^l), then the marginal distribution becomes

    P(h_i^l = 1 | x; W^l, b^l) = \min{ \alpha^l f(\hat{W}_i^l h^{l-1}(x) + \hat{b}_i^l), 1 } = (1/\gamma_l) f(\hat{W}_i^l h^{l-1}(x) + \hat{b}_i^l),  for all i, x.

From (16), it follows that

    P(h_i^l = 1 | x; W^l, b^l) = (1/\gamma_l) f( \hat{W}_i^l \hat{h}^{l-1}(x) + \hat{W}_i^l \epsilon^{l-1} + \hat{b}_i^l ),  for all i, x.

Similarly to Lemma 4, there exists some constant \epsilon_i^l with |\epsilon_i^l| <= B N^{l-1} \hat{W}_max^l such that

    P(h_i^l = 1 | x; W^l, b^l) = (1/\gamma_l) ( \hat{h}_i^l(x) + \epsilon_i^l ),  for all i, x.    (17)

Next, consider the upper hidden layer of the stochastic layer. From Taylor's theorem, for each t there exists a value z between 0 and t such that s(t) = s(0) + s'(0) t + R(t), where R(t) = s''(z) t^2 / 2. Since we consider a binary random vector, i.e., h^l in {0,1}^{N^l}, one can write

    E_{P(h^l|x)} [ s(\beta_k(h^l)) ] = \sum_{h^l} [ s(0) + s'(0) \beta_k(h^l) + R(\beta_k(h^l)) ] P(h^l | x)
                                     = s(0) + s'(0) ( W_k^{l+1} P(h^l = 1 | x) + b_k^{l+1} ) + \sum_{h^l} R(\beta_k(h^l)) P(h^l | x),

where \beta_k(h^l) = W_k^{l+1} h^l + b_k^{l+1} is the incoming signal. From (17) and the above equation, for every hidden unit k we have

    h_k^{l+1}(x; W^{l+1}, b^{l+1}) = f( \alpha^{l+1} s'(0) ( (1/\gamma_l) ( W_k^{l+1} \hat{h}^l(x) + W_k^{l+1} \epsilon^l ) + b_k^{l+1} ) + \alpha^{l+1} E_{P(h^l|x)} [ R(\beta_k(h^l)) ] ).

Since we assume that |f'(x)| <= 1, the following inequality holds:

    | h_k^{l+1}(x; W^{l+1}, b^{l+1}) - f( \alpha^{l+1} s'(0) ( (1/\gamma_l) W_k^{l+1} \hat{h}^l(x) + b_k^{l+1} ) ) |

        <= ( \alpha^{l+1} s'(0) / \gamma_l ) | W_k^{l+1} \epsilon^l | + ( \alpha^{l+1} / 2 ) E_{P(h^l|x)} [ ( W_k^{l+1} h^l + b_k^{l+1} )^2 ],    (18)

where we use |s''(z)| <= 1 for the last inequality. Therefore, since we set (\alpha^{l+1}, W^{l+1}, b^{l+1}) <- ( \gamma_{l+1} \gamma_l / s'(0), \hat{W}^{l+1} / \gamma_{l+1}, \hat{b}^{l+1} / (\gamma_l \gamma_{l+1}) ), it follows that

    | h_k^{l+1}(x) - \hat{h}_k^{l+1}(x) | <= B N^{l-1} N^l \hat{W}_max^l \hat{W}_max^{l+1} + \gamma_l ( N^l \hat{W}_max^{l+1} + b_max^{l+1} \gamma_l^{-1} )^2 / ( 2 s'(0) \gamma_{l+1} ).

This completes the proof of Lemma 5.

6 Experimental Results

We present several experimental results for both multi-modal and classification tasks on MNIST [6], the Toronto Face Database (TFD) [37], CASIA [3], CIFAR-10, CIFAR-100 [14] and SVHN [26]. The softmax and Gaussian (with standard deviation 0.05) distributions are used as the output probability for the classification tasks and the multi-modal prediction, respectively. In all experiments, we first train a baseline model, and the trained parameters are used for further fine-tuning those of Simplified-SFNN.

6.1 Multi-modal regression

We first verify that it is possible to learn one-to-many mappings via Simplified-SFNN on the TFD, MNIST and CASIA datasets. The TFD dataset consists of greyscale face images of 900 individuals with 7 different expressions. Similar to [9], we use the 124 individuals with at least 10 facial expressions as data. We randomly choose 100 individuals with 1403 images for training and the remaining 24 individuals for the test. We take the mean of the face images per individual as the input and set the output as the different expressions of the same individual. The MNIST dataset consists of 28 x 28 pixel greyscale images, each containing a digit 0 to 9, with 60,000 training and 10,000 test images. For this experiment, each pixel of every digit image is binarized using its grey-scale value, similar to [9]. We take the upper half of the MNIST digit as the input and set the output as the lower half of it. We remark that both tasks are commonly performed in other recent works to test multi-modal learning using SFNN [9, 34].

The CASIA dataset consists of greyscale images, each containing a handwritten Chinese character. We use 10 offline isolated characters produced by 345 writers as data. We randomly choose 300 writers with 3000 images for training and the remaining 45 writers with 450 images for testing. We take the radical character per writer as the input and set the output as either the related character or the radical character itself of the same writer (see Figure 5).(2) Bicubic interpolation [4] is used for re-sizing all images to 32 x 32 pixels.

(2) We use 5 radical characters (e.g., shi, shu and mu in Mandarin) and 5 related characters (e.g., si, ben, ren and qu in Mandarin).

For both the TFD and MNIST datasets, we use fully-connected DNNs as the baseline models, similar to other works [9, 34]. All Simplified-SFNNs are constructed by replacing the first hidden layer of a baseline DNN with a stochastic hidden layer.
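The MNIST upper-half to lower-half task described above is easy to reproduce, which is useful for sanity-checking a multi-modal model before moving to TFD or CASIA. The sketch below builds that dataset from a standard MNIST array; it illustrates the data construction only, and the binarization threshold of 0.5 and the array layout are assumptions of this sketch, not taken from the paper.

import numpy as np

def make_half_digit_dataset(images, threshold=0.5):
    """images: array of shape (n, 28, 28) with grey-scale values in [0, 1].
    Returns (inputs, targets): the binarized upper halves and lower halves,
    flattened to vectors, for the one-to-many prediction task."""
    binary = (images >= threshold).astype(np.float32)   # binarize each pixel by its grey-scale value
    upper = binary[:, :14, :].reshape(len(images), -1)  # upper half of each digit -> network input
    lower = binary[:, 14:, :].reshape(len(images), -1)  # lower half of each digit -> regression target
    return upper, lower

# Different digits and writing styles can produce very similar upper halves, so a single
# input can correspond to several valid lower halves; this is exactly the one-to-many
# structure that SFNN and Simplified-SFNN are meant to capture and that a deterministic
# DNN averages away.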

Figure 3: The overall structures of (a) Lenet-5, (b) NIN, (c) WRN with 16 layers, and (d) WRN with 28 layers. The red feature maps correspond to the stochastic ones. In the case of WRN, we introduce one (v = 3) and two (v = 2, 3) stochastic feature maps.

The loss was minimized using the ADAM learning rule [7] with a mini-batch size of 128. We used an exponentially decaying learning rate. We train Simplified-SFNNs with M = 20 samples at each epoch, and at test time we use 500 samples. We use 200 hidden units for each layer of the neural networks in the two experiments. The learning rate is chosen from {0.005, 0.002, 0.001, ...}, and the best result is reported for both tasks. Table 3 reports the test negative log-likelihood (NLL) on TFD and MNIST. One can note that SFNNs fine-tuned by Simplified-SFNN consistently outperform SFNN under the simple transformation.

For the CASIA dataset, we choose fully-convolutional network (FCN) models [2] as the baseline ones, which consist of convolutional layers followed by a fully-convolutional layer. In a similar manner to the case of fully-connected networks, one can define a stochastic convolution layer, which considers the input feature map as a binary random matrix and generates

the output feature map as defined in (6). For this experiment, we use a baseline model which consists of convolutional layers followed by a fully-convolutional layer. The convolutional layers have 64, 128 and 256 filters, respectively. Each convolutional layer has a 4 x 4 receptive field applied with a stride of 2 pixels. All Simplified-SFNNs are constructed by replacing the first hidden feature maps of the baseline models with stochastic ones. The loss was minimized using the ADAM learning rule [7] with a mini-batch size of 128. We used an exponentially decaying learning rate. We train Simplified-SFNNs with M = 20 samples at each epoch, and at test time we use 100 samples due to memory limitations. One can note that SFNNs fine-tuned by Simplified-SFNN outperform SFNN under the simple transformation, as reported in Table 4 and Figure 5.

Figure 4: Generated samples for predicting the lower half of an MNIST digit given the upper half. (a) The original digits and the corresponding inputs. The generated samples from (b) sigmoid-DNN, (c) SFNN under the simple transformation, and (d) SFNN fine-tuned by Simplified-SFNN. We remark that SFNN fine-tuned by Simplified-SFNN can generate more varied samples (e.g., red rectangles) from the same inputs than SFNN under the simple transformation.

Table 4: Test NLL on the CASIA dataset, comparing, for 2 and 3 convolutional layers, the baseline FCN, SFNN using the FCN parameters, and Simplified-SFNN and SFNN fine-tuned under Simplified-SFNN. All Simplified-SFNNs are constructed by replacing the first hidden feature maps of the baseline models with stochastic ones.

6.2 Classification

We also evaluate the regularization effect of Simplified-SFNN for the classification tasks on CIFAR-10, CIFAR-100 and SVHN. The CIFAR-10 and CIFAR-100 datasets consist of 50,000 training and 10,000 test images; they have 10 and 100 image classes, respectively. The SVHN dataset consists of 73,257 training and 26,032 test images of house-number digits (0 to 9) collected by Google Street View; we do not use the extra SVHN dataset for training. Similar to [39], we pre-process the data using global contrast normalization and ZCA whitening. For these datasets, we design a convolutional version of Simplified-SFNN using convolutional neural networks such as Lenet-5 [6], NIN [13] and WRN [39]. All Simplified-SFNNs are constructed by replacing a hidden feature map

of a baseline model, i.e., Lenet-5, NIN and WRN, with a stochastic one, as shown in Figure 3. We use WRN with 16 and 28 layers for the SVHN and CIFAR datasets, respectively, since they showed state-of-the-art performance as reported by [39]. In the case of WRN, we introduce up to two stochastic convolution layers. Similar to [39], the loss was minimized using the stochastic gradient descent method with Nesterov momentum. The mini-batch size is set to 128, and weight decay is applied. For 100 epochs, we first train the baseline models, i.e., Lenet-5, NIN and WRN, and the trained parameters are used for initializing those of the Simplified-SFNNs. All Simplified-SFNNs are trained with M = 5 samples, and the test error is measured only by the approximation (9). The test errors of the baseline models are measured after training them for 200 epochs, similar to [39]. All models are trained using dropout [16] and batch normalization [17].

Table 5 reports the classification error rates on CIFAR-10, CIFAR-100 and SVHN. Due to the regularization effects, Simplified-SFNNs consistently outperform their baseline DNNs. In particular, WRN* of 28 layers and 36 million parameters outperforms WRN by 0.08% on CIFAR-10 and 0.58% on CIFAR-100. Figure 6 shows that the error rate decreases further when we introduce more stochastic layers, but this increases the fine-tuning time complexity of Simplified-SFNN.

Figure 5: (a) The input-output Chinese characters. The generated samples from (b) FCN, (c) SFNN under the simple transformation and (d) SFNN fine-tuned by Simplified-SFNN. Our model can generate the multi-modal outputs (red rectangle), while SFNN under the simple transformation cannot.

Figure 6: Test error rates of WRN per training epoch on CIFAR-100, comparing the baseline WRN with WRN* trained by Simplified-SFNN with one and with two stochastic layers. One can note that the performance gain is more significant when we introduce more stochastic layers.

7 Conclusion

In order to develop an efficient training method for large-scale SFNN, this paper proposes a new intermediate stochastic model, called Simplified-SFNN. We establish the connection between three models, i.e., DNN -> Simplified-SFNN -> SFNN, which naturally leads to an efficient training procedure of the stochastic models utilizing pre-trained parameters and architectures of DNN. Using several popular DNNs including Lenet-5, NIN, FCN and WRN, we show how they can be effectively transferred to the corresponding stochastic models for both multi-modal and non-multi-modal tasks. We believe that our work brings a new angle for training stochastic neural networks, which would be of

broader interest in many related applications.

Table 5: Test error rates [%] on CIFAR-10, CIFAR-100 and SVHN, comparing Lenet-5, NIN and WRN with the corresponding models fine-tuned by Simplified-SFNN (for WRN, with one and with two stochastic layers). The error rates for WRN are from our experiments, where the original ones reported in [39] are in brackets. Marked results are obtained using the horizontal flipping and random cropping augmentation.

References

[1] Ruiz, F. R., AUEB, M. T. R. and Blei, D. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems (NIPS), 2016.

[2] Long, J., Shelhamer, E. and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[3] Liu, C. L., Yin, F., Wang, D. H. and Wang, Q. F. CASIA online and offline Chinese handwriting databases. In International Conference on Document Analysis and Recognition (ICDAR), 2011.

[4] Keys, R. Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1981.

[5] Neal, R. M. Learning stochastic feedforward networks. Technical report, Department of Computer Science, University of Toronto.

[6] LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

[7] Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint, 2014.

[8] Tang, Y. and Salakhutdinov, R. R. Learning stochastic feedforward neural networks. In Advances in Neural Information Processing Systems (NIPS), 2013.

[9] Raiko, T., Berglund, M., Alain, G. and Dinh, L. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint, 2014.

[10] Wan, L., Zeiler, M., Zhang, S., Cun, Y. L. and Fergus, R. Regularization of neural networks using dropconnect. In International Conference on Machine Learning (ICML), 2013.

[11] Krizhevsky, A., Sutskever, I. and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

[12] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A. and Bengio, Y. Maxout networks. In International Conference on Machine Learning (ICML), 2013.

[13] Lin, M., Chen, Q. and Yan, S. Network in network. In International Conference on Learning Representations (ICLR), 2014.

[14] Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

[15] Blundell, C., Cornebise, J., Kavukcuoglu, K. and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning (ICML), 2015.

[16] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint, 2012.

[17] Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.

[18] Salakhutdinov, R. and Hinton, G. E. Deep Boltzmann machines. In Conference on Artificial Intelligence and Statistics (AISTATS), 2009.

[19] Neal, R. M. Connectionist learning of belief networks. Artificial Intelligence, 1992.

[20] Hinton, G. E., Osindero, S. and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 2006.

[21] Poon, H. and Domingos, P. Sum-product networks: A new deep architecture. In Conference on Uncertainty in Artificial Intelligence (UAI), 2011.

[22] Chen, T., Goodfellow, I. and Shlens, J. Net2Net: Accelerating learning via knowledge transfer. arXiv preprint, 2015.

[23] Maass, W. Networks of spiking neurons: the third generation of neural network models. Neural Networks, 1997.

[24] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N. and Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.

[25] Zeiler, M. D. and Fergus, R. Stochastic pooling for regularization of deep convolutional neural networks. In International Conference on Learning Representations (ICLR), 2013.

[26] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B. and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[27] Courbariaux, M., Bengio, Y. and David, J. P. BinaryConnect: Training deep neural networks with binary weights during propagations. arXiv preprint, 2015.

[28] Gulcehre, C., Moczulski, M., Denil, M. and Bengio, Y. Noisy activation functions. arXiv preprint, 2016.

[29] Rumelhart, D. E., Hinton, G. E. and Williams, R. J. Learning internal representations by error propagation. In Neurocomputing: Foundations of Research, MIT Press.

[30] Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 1951.

[31] Ba, J. and Frey, B. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2013.

[32] Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint, 2013.

[33] Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), 2010.

[34] Gu, S., Levine, S., Sutskever, I. and Mnih, A. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint, 2015.

[35] Bengio, Y., Leonard, N. and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint, 2013.

[36] Bishop, C. M. Mixture density networks. Technical report, Aston University, 1994.

[37] Susskind, J. M., Anderson, A. K. and Hinton, G. E. The Toronto Face Database. Department of Computer Science, University of Toronto, Toronto, ON, Canada, Tech. Rep., 2010.

[38] Saul, L. K., Jaakkola, T. and Jordan, M. I. Mean field theory for sigmoid belief networks. Artificial Intelligence, 1996.

[39] Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint, 2016.

[40] Huang, G., Sun, Y., Liu, Z., Sedra, D. and Weinberger, K. Q. Deep networks with stochastic depth. In European Conference on Computer Vision (ECCV), 2016.

[41] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.

[42] Zaremba, W. and Sutskever, I. Reinforcement learning neural Turing machines. arXiv preprint, 2015.

[43] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[44] Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2014.


More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS Avalable onlne at http://sck.org J. Math. Comput. Sc. 3 (3), No., 6-3 ISSN: 97-537 COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

More information

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin Proceedngs of the 007 Wnter Smulaton Conference S G Henderson, B Bller, M-H Hseh, J Shortle, J D Tew, and R R Barton, eds LOW BIAS INTEGRATED PATH ESTIMATORS James M Calvn Department of Computer Scence

More information

Research Article Green s Theorem for Sign Data

Research Article Green s Theorem for Sign Data Internatonal Scholarly Research Network ISRN Appled Mathematcs Volume 2012, Artcle ID 539359, 10 pages do:10.5402/2012/539359 Research Artcle Green s Theorem for Sgn Data Lous M. Houston The Unversty of

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30 STATS 306B: Unsupervsed Learnng Sprng 2014 Lecture 10 Aprl 30 Lecturer: Lester Mackey Scrbe: Joey Arthur, Rakesh Achanta 10.1 Factor Analyss 10.1.1 Recap Recall the factor analyss (FA) model for lnear

More information

Efficient Discriminative Convolution Using Fisher Weight Map

Efficient Discriminative Convolution Using Fisher Weight Map H. NAKAYAMA: EFFICIENT DISCRIMINATIVE CONVOLUTION USING FWM 1 Effcent Dscrmnatve Convoluton Usng Fsher Weght Map Hdek Nakayama http://www.nlab.c..u-tokyo.ac.jp/ Graduate School of Informaton Scence and

More information

A New Evolutionary Computation Based Approach for Learning Bayesian Network

A New Evolutionary Computation Based Approach for Learning Bayesian Network Avalable onlne at www.scencedrect.com Proceda Engneerng 15 (2011) 4026 4030 Advanced n Control Engneerng and Informaton Scence A New Evolutonary Computaton Based Approach for Learnng Bayesan Network Yungang

More information

Admin NEURAL NETWORKS. Perceptron learning algorithm. Our Nervous System 10/25/16. Assignment 7. Class 11/22. Schedule for the rest of the semester

Admin NEURAL NETWORKS. Perceptron learning algorithm. Our Nervous System 10/25/16. Assignment 7. Class 11/22. Schedule for the rest of the semester 0/25/6 Admn Assgnment 7 Class /22 Schedule for the rest of the semester NEURAL NETWORKS Davd Kauchak CS58 Fall 206 Perceptron learnng algorthm Our Nervous System repeat untl convergence (or for some #

More information

Boostrapaggregating (Bagging)

Boostrapaggregating (Bagging) Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information

Lecture 4: Universal Hash Functions/Streaming Cont d

Lecture 4: Universal Hash Functions/Streaming Cont d CSE 5: Desgn and Analyss of Algorthms I Sprng 06 Lecture 4: Unversal Hash Functons/Streamng Cont d Lecturer: Shayan Oves Gharan Aprl 6th Scrbe: Jacob Schreber Dsclamer: These notes have not been subjected

More information

Markov Chain Monte Carlo Lecture 6

Markov Chain Monte Carlo Lecture 6 where (x 1,..., x N ) X N, N s called the populaton sze, f(x) f (x) for at least one {1, 2,..., N}, and those dfferent from f(x) are called the tral dstrbutons n terms of mportance samplng. Dfferent ways

More information

Report on Image warping

Report on Image warping Report on Image warpng Xuan Ne, Dec. 20, 2004 Ths document summarzed the algorthms of our mage warpng soluton for further study, and there s a detaled descrpton about the mplementaton of these algorthms.

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Microwave Diversity Imaging Compression Using Bioinspired

Microwave Diversity Imaging Compression Using Bioinspired Mcrowave Dversty Imagng Compresson Usng Bonspred Neural Networks Youwe Yuan 1, Yong L 1, Wele Xu 1, Janghong Yu * 1 School of Computer Scence and Technology, Hangzhou Danz Unversty, Hangzhou, Zhejang,

More information

1 Motivation and Introduction

1 Motivation and Introduction Instructor: Dr. Volkan Cevher EXPECTATION PROPAGATION September 30, 2008 Rce Unversty STAT 63 / ELEC 633: Graphcal Models Scrbes: Ahmad Beram Andrew Waters Matthew Nokleby Index terms: Approxmate nference,

More information

Multilayer neural networks

Multilayer neural networks Lecture Multlayer neural networks Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Mdterm exam Mdterm Monday, March 2, 205 In-class (75 mnutes) closed book materal covered by February 25, 205 Multlayer

More information

CS294A Lecture notes. Andrew Ng

CS294A Lecture notes. Andrew Ng CS294A Lecture notes Andrew Ng Sparse autoencoder 1 Introducton Supervsed learnng s one of the most powerful tools of AI, and has led to automatc zp code recognton, speech recognton, self-drvng cars, and

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

Scalable Multi-Class Gaussian Process Classification using Expectation Propagation

Scalable Multi-Class Gaussian Process Classification using Expectation Propagation Scalable Mult-Class Gaussan Process Classfcaton usng Expectaton Propagaton Carlos Vllacampa-Calvo and Danel Hernández Lobato Computer Scence Department Unversdad Autónoma de Madrd http://dhnzl.org, danel.hernandez@uam.es

More information

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Pop-Click Noise Detection Using Inter-Frame Correlation for Improved Portable Auditory Sensing

Pop-Click Noise Detection Using Inter-Frame Correlation for Improved Portable Auditory Sensing Advanced Scence and Technology Letters, pp.164-168 http://dx.do.org/10.14257/astl.2013 Pop-Clc Nose Detecton Usng Inter-Frame Correlaton for Improved Portable Audtory Sensng Dong Yun Lee, Kwang Myung Jeon,

More information

Feature Selection & Dynamic Tracking F&P Textbook New: Ch 11, Old: Ch 17 Guido Gerig CS 6320, Spring 2013

Feature Selection & Dynamic Tracking F&P Textbook New: Ch 11, Old: Ch 17 Guido Gerig CS 6320, Spring 2013 Feature Selecton & Dynamc Trackng F&P Textbook New: Ch 11, Old: Ch 17 Gudo Gerg CS 6320, Sprng 2013 Credts: Materal Greg Welch & Gary Bshop, UNC Chapel Hll, some sldes modfed from J.M. Frahm/ M. Pollefeys,

More information

Non-linear Canonical Correlation Analysis Using a RBF Network

Non-linear Canonical Correlation Analysis Using a RBF Network ESANN' proceedngs - European Smposum on Artfcal Neural Networks Bruges (Belgum), 4-6 Aprl, d-sde publ., ISBN -97--, pp. 57-5 Non-lnear Canoncal Correlaton Analss Usng a RBF Network Sukhbnder Kumar, Elane

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

Multilayer Perceptrons and Backpropagation. Perceptrons. Recap: Perceptrons. Informatics 1 CG: Lecture 6. Mirella Lapata

Multilayer Perceptrons and Backpropagation. Perceptrons. Recap: Perceptrons. Informatics 1 CG: Lecture 6. Mirella Lapata Multlayer Perceptrons and Informatcs CG: Lecture 6 Mrella Lapata School of Informatcs Unversty of Ednburgh mlap@nf.ed.ac.uk Readng: Kevn Gurney s Introducton to Neural Networks, Chapters 5 6.5 January,

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Unified Subspace Analysis for Face Recognition

Unified Subspace Analysis for Face Recognition Unfed Subspace Analyss for Face Recognton Xaogang Wang and Xaoou Tang Department of Informaton Engneerng The Chnese Unversty of Hong Kong Shatn, Hong Kong {xgwang, xtang}@e.cuhk.edu.hk Abstract PCA, LDA

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

Multi-Conditional Learning for Joint Probability Models with Latent Variables

Multi-Conditional Learning for Joint Probability Models with Latent Variables Mult-Condtonal Learnng for Jont Probablty Models wth Latent Varables Chrs Pal, Xueru Wang, Mchael Kelm and Andrew McCallum Department of Computer Scence Unversty of Massachusetts Amherst Amherst, MA USA

More information

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method Appled Mathematcal Scences, Vol. 7, 0, no. 47, 07-0 HIARI Ltd, www.m-hkar.com Comparson of the Populaton Varance Estmators of -Parameter Exponental Dstrbuton Based on Multple Crtera Decson Makng Method

More information

Neural Networks & Learning

Neural Networks & Learning Neural Netorks & Learnng. Introducton The basc prelmnares nvolved n the Artfcal Neural Netorks (ANN) are descrbed n secton. An Artfcal Neural Netorks (ANN) s an nformaton-processng paradgm that nspred

More information

MAXIMUM A POSTERIORI TRANSDUCTION

MAXIMUM A POSTERIORI TRANSDUCTION MAXIMUM A POSTERIORI TRANSDUCTION LI-WEI WANG, JU-FU FENG School of Mathematcal Scences, Peng Unversty, Bejng, 0087, Chna Center for Informaton Scences, Peng Unversty, Bejng, 0087, Chna E-MIAL: {wanglw,

More information

BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu

BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS M. Krshna Reddy, B. Naveen Kumar and Y. Ramu Department of Statstcs, Osmana Unversty, Hyderabad -500 007, Inda. nanbyrozu@gmal.com, ramu0@gmal.com

More information

9 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations

9 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations Physcs 171/271 - Chapter 9R -Davd Klenfeld - Fall 2005 9 Dervaton of Rate Equatons from Sngle-Cell Conductance (Hodgkn-Huxley-lke) Equatons We consder a network of many neurons, each of whch obeys a set

More information

1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations

1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations Physcs 171/271 -Davd Klenfeld - Fall 2005 (revsed Wnter 2011) 1 Dervaton of Rate Equatons from Sngle-Cell Conductance (Hodgkn-Huxley-lke) Equatons We consder a network of many neurons, each of whch obeys

More information

Gaussian process classification: a message-passing viewpoint

Gaussian process classification: a message-passing viewpoint Gaussan process classfcaton: a message-passng vewpont Flpe Rodrgues fmpr@de.uc.pt November 014 Abstract The goal of ths short paper s to provde a message-passng vewpont of the Expectaton Propagaton EP

More information

Difference Equations

Difference Equations Dfference Equatons c Jan Vrbk 1 Bascs Suppose a sequence of numbers, say a 0,a 1,a,a 3,... s defned by a certan general relatonshp between, say, three consecutve values of the sequence, e.g. a + +3a +1

More information

Appendix B: Resampling Algorithms

Appendix B: Resampling Algorithms 407 Appendx B: Resamplng Algorthms A common problem of all partcle flters s the degeneracy of weghts, whch conssts of the unbounded ncrease of the varance of the mportance weghts ω [ ] of the partcles

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton

More information

Sparse Gaussian Processes Using Backward Elimination

Sparse Gaussian Processes Using Backward Elimination Sparse Gaussan Processes Usng Backward Elmnaton Lefeng Bo, Lng Wang, and Lcheng Jao Insttute of Intellgent Informaton Processng and Natonal Key Laboratory for Radar Sgnal Processng, Xdan Unversty, X an

More information

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests Smulated of the Cramér-von Mses Goodness-of-Ft Tests Steele, M., Chaselng, J. and 3 Hurst, C. School of Mathematcal and Physcal Scences, James Cook Unversty, Australan School of Envronmental Studes, Grffth

More information

Gaussian Mixture Models

Gaussian Mixture Models Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous

More information

Convergence of random processes

Convergence of random processes DS-GA 12 Lecture notes 6 Fall 216 Convergence of random processes 1 Introducton In these notes we study convergence of dscrete random processes. Ths allows to characterze phenomena such as the law of large

More information

Deep Learning. Boyang Albert Li, Jie Jay Tan

Deep Learning. Boyang Albert Li, Jie Jay Tan Deep Learnng Boyang Albert L, Je Jay Tan An Unrelated Vdeo A bcycle controller learned usng NEAT (Stanley) What do you mean, deep? Shallow Hdden Markov models ANNs wth one hdden layer Manually selected

More information

Regularized Discriminant Analysis for Face Recognition

Regularized Discriminant Analysis for Face Recognition 1 Regularzed Dscrmnant Analyss for Face Recognton Itz Pma, Mayer Aladem Department of Electrcal and Computer Engneerng, Ben-Guron Unversty of the Negev P.O.Box 653, Beer-Sheva, 845, Israel. Abstract Ths

More information

Hidden Markov Models

Hidden Markov Models Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,

More information

Department of Statistics University of Toronto STA305H1S / 1004 HS Design and Analysis of Experiments Term Test - Winter Solution

Department of Statistics University of Toronto STA305H1S / 1004 HS Design and Analysis of Experiments Term Test - Winter Solution Department of Statstcs Unversty of Toronto STA35HS / HS Desgn and Analyss of Experments Term Test - Wnter - Soluton February, Last Name: Frst Name: Student Number: Instructons: Tme: hours. Ads: a non-programmable

More information

A New Scrambling Evaluation Scheme based on Spatial Distribution Entropy and Centroid Difference of Bit-plane

A New Scrambling Evaluation Scheme based on Spatial Distribution Entropy and Centroid Difference of Bit-plane A New Scramblng Evaluaton Scheme based on Spatal Dstrbuton Entropy and Centrod Dfference of Bt-plane Lang Zhao *, Avshek Adhkar Kouch Sakura * * Graduate School of Informaton Scence and Electrcal Engneerng,

More information

Motion Perception Under Uncertainty. Hongjing Lu Department of Psychology University of Hong Kong

Motion Perception Under Uncertainty. Hongjing Lu Department of Psychology University of Hong Kong Moton Percepton Under Uncertanty Hongjng Lu Department of Psychology Unversty of Hong Kong Outlne Uncertanty n moton stmulus Correspondence problem Qualtatve fttng usng deal observer models Based on sgnal

More information

Why feed-forward networks are in a bad shape

Why feed-forward networks are in a bad shape Why feed-forward networks are n a bad shape Patrck van der Smagt, Gerd Hrznger Insttute of Robotcs and System Dynamcs German Aerospace Center (DLR Oberpfaffenhofen) 82230 Wesslng, GERMANY emal smagt@dlr.de

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

On mutual information estimation for mixed-pair random variables

On mutual information estimation for mixed-pair random variables On mutual nformaton estmaton for mxed-par random varables November 3, 218 Aleksandr Beknazaryan, Xn Dang and Haln Sang 1 Department of Mathematcs, The Unversty of Msssspp, Unversty, MS 38677, USA. E-mal:

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

Deep Learning for Causal Inference

Deep Learning for Causal Inference Deep Learnng for Causal Inference Vkas Ramachandra Stanford Unversty Graduate School of Busness 655 Knght Way, Stanford, CA 94305 Abstract In ths paper, we propose the use of deep learnng technques n econometrcs,

More information

8 Derivation of Network Rate Equations from Single- Cell Conductance Equations

8 Derivation of Network Rate Equations from Single- Cell Conductance Equations Physcs 178/278 - Davd Klenfeld - Wnter 2015 8 Dervaton of Network Rate Equatons from Sngle- Cell Conductance Equatons We consder a network of many neurons, each of whch obeys a set of conductancebased,

More information

Econ Statistical Properties of the OLS estimator. Sanjaya DeSilva

Econ Statistical Properties of the OLS estimator. Sanjaya DeSilva Econ 39 - Statstcal Propertes of the OLS estmator Sanjaya DeSlva September, 008 1 Overvew Recall that the true regresson model s Y = β 0 + β 1 X + u (1) Applyng the OLS method to a sample of data, we estmate

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

Neural Networks. Perceptrons and Backpropagation. Silke Bussen-Heyen. 5th of Novemeber Universität Bremen Fachbereich 3. Neural Networks 1 / 17

Neural Networks. Perceptrons and Backpropagation. Silke Bussen-Heyen. 5th of Novemeber Universität Bremen Fachbereich 3. Neural Networks 1 / 17 Neural Networks Perceptrons and Backpropagaton Slke Bussen-Heyen Unverstät Bremen Fachberech 3 5th of Novemeber 2012 Neural Networks 1 / 17 Contents 1 Introducton 2 Unts 3 Network structure 4 Snglelayer

More information

The Feynman path integral

The Feynman path integral The Feynman path ntegral Aprl 3, 205 Hesenberg and Schrödnger pctures The Schrödnger wave functon places the tme dependence of a physcal system n the state, ψ, t, where the state s a vector n Hlbert space

More information

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1 Random varables Measure of central tendences and varablty (means and varances) Jont densty functons and ndependence Measures of assocaton (covarance and correlaton) Interestng result Condtonal dstrbutons

More information

Semi-supervised Classification with Active Query Selection

Semi-supervised Classification with Active Query Selection Sem-supervsed Classfcaton wth Actve Query Selecton Jao Wang and Swe Luo School of Computer and Informaton Technology, Beng Jaotong Unversty, Beng 00044, Chna Wangjao088@63.com Abstract. Labeled samples

More information

Kernel Methods and SVMs Extension

Kernel Methods and SVMs Extension Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general

More information