arXiv: v1 [cs.NE] 8 Apr 2016

Norm-preserving Orthogonal Permutation Linear Unit Activation Functions (OPLU)¹

Artem Chernodub² and Dmitri Nowicki³

Institute of MMS of NASU, Center for Cybernetics, 42 Glushkova ave., Kiev, Ukraine

Abstract. We propose a novel activation function that implements piecewise orthogonal non-linear mappings based on permutations. It is straightforward to implement, computationally efficient, and has low memory requirements. We tested it on two toy problems for feedforward and recurrent networks, where it shows performance similar to tanh and ReLU. The OPLU activation function ensures norm preservation of the backpropagated gradients; therefore, it is potentially good for the training of deep, extra deep, and recurrent neural networks.

1 Introduction

Deep neural networks are becoming deeper and deeper. Early DNNs had 4-6 hidden layers [1], [2]. AlexNet, a winner of ILSVRC (Large Scale Visual Recognition Challenge), has 8 layers [3]. The 2014 winner, VGGNet, has 19 layers [4]; a recent winner, ResNet [5], has 152 layers. It seems that stacking more non-linear layers gives more freedom for obtaining better performance. The main problem of training deep feedforward networks is the vanishing/exploding gradients effect [6]. The same problem exists in recurrent networks, which essentially can be represented as deep networks with shared weights unfolded back through time. One can say that recurrent networks suffer from the vanishing/exploding gradients effect even more, because of the repeated multiplications by the same weight matrices during the forward and backward passes. Usually, architectural methods are used to prevent vanishing/exploding gradients in RNNs: NARX networks have short links between the output error and old weights unfolded back in time [7]; LSTM has a special structure of input and forget gates which produces a constant error carousel [8]; for Echo State networks [9], only the last feedforward layer is modified during training; and so on.
Greedy layer-wise pre-training of layers using RBMs and autoencoders made a revolution in the training of feedforward deep networks, but it is still computationally demanding. Smart initialization of DNN weights is a current topic of research [10], [11], [12], [13]. In [14], orthogonal initial conditions on weights for deep feedforward networks were proposed. It was hypothesized that orthogonality of weight matrices produces an effect similar to unsupervised pre-training, leading to faithful propagation of gradients and faster convergence of training. From the vanishing/exploding gradient perspective this makes sense, since an orthogonal matrix preserves the norm of backpropagated gradients. However, traditional activation functions break the orthogonality of the backpropagated flow, which prevents preservation of gradient norms.

Among the vast empirical research on improving DNN training methods, the search for good non-linear activation functions plays an important role, for example, ReLU [15], Maxout [16], ELU [17]. In this paper, we propose a novel non-linear activation function, which is intended to be used together with orthogonal weight matrices and may help the training of extremely deep neural networks.

2 Backpropagation Mechanics

Consider a multilayer perceptron (MLP) that has $N$ layers. The MLP's $n$-th layer receives the postsynaptic activation $z^{(n-1)}$ from the previous layer ($z^{(0)}$ is the input data vector $x$) and produces a new postsynaptic activation $z^{(n)}$:

$$a^{(n)} = z^{(n-1)} w^{(n)} + b^{(n)}, \quad z^{(n)} = f(a^{(n)}), \qquad (1)$$

where $w^{(n)}$ is a matrix of weights, $a^{(n)}$ is known as a pre-synaptic activation, and $f(\cdot)$ is a non-linear activation function. After processing all of the network's layers and producing the output $y$, the target error $E(y(w))$ is calculated according to the chosen error function $E(\cdot)$. To train the neural network using a gradient-based optimization algorithm, we have to calculate the derivatives of the error function with respect to the network's weights, $\partial E / \partial w^{(n)}$, $n = 1, \ldots, N$, for all of the network's layers. Standard chain rule-based backpropagation is a common choice for this task.
¹ Submitted to ICANN.
² a.chernodub@gmail.com
³ nowicki@nnteam.org.ua

Intermediate variables $\delta^{(n)} \equiv \partial E / \partial a^{(n)}$ are called local gradients or simply "deltas"; they are usually introduced for convenience. If the deltas are available for a specific layer $n$, then the corresponding immediate derivatives can be calculated easily: $\partial E / \partial w^{(n)} = (z^{(n-1)})^T \delta^{(n)}$. For the last layer, $\delta^{(N)}$ is an error residual; for the intermediate layers, the deltas are calculated incrementally according to the well-known backpropagation formula:

$$\delta_i^{(n-1)} = f'(a_i^{(n-1)}) \sum_j w_{ij}^{(n)} \delta_j^{(n)}. \qquad (2)$$

Let us write this equation in matrix form:

$$\delta^{(n-1)} = \delta^{(n)} (w^{(n)})^T \mathrm{diag}(f'(a^{(n-1)})), \qquad (3)$$

where diag converts a vector into a diagonal matrix. In particular, for the ReLU activation function $f(a_i) = \max(a_i, 0)$, if we denote $D^{(n)} \equiv \mathrm{diag}(f'(a^{(n-1)}))$, we get

$$z^{(n-1)} = a^{(n-1)} D^{(n)} \qquad (4)$$

for the forward pass and

$$\delta^{(n-1)} = \delta^{(n)} (w^{(n)})^T D^{(n)} \qquad (5)$$

for the backward pass, where the matrix $D^{(n)}$ contains either zeros or ones on the diagonal. Equation (3) may be rewritten using the Jacobian matrix $J^{(n)} \equiv \partial z^{(n)} / \partial z^{(n-1)}$:

$$\delta^{(n-1)} = \delta^{(n)} J^{(n)}, \qquad (6)$$

where

$$J^{(n)} = (w^{(n)})^T \mathrm{diag}(f'(a^{(n-1)})). \qquad (7)$$

This gives an intuitive explanation of the exploding/vanishing gradients problem that was proposed and deeply investigated in classic [18], [6] and modern papers [19], [20]. As follows from (6), the norm of the backpropagated deltas strongly depends on the norm of the Jacobians. Moreover, the deltas are actually multiplied by a product of Jacobians: $\delta^{(n-k)} = \delta^{(n)} J^{(n)} J^{(n-1)} \cdots J^{(n-k+1)}$. In practice, the norms of the Jacobians are more likely to be less than 1, because the norms of the two factors in (7) often tend to be less than 1. For the first factor, usually $\|(w^{(n)})^T\| < 1$, because large norms lead to non-robust behavior of the neural network. One can recall the popular weight-decay regularization for neural networks, which prevents the weights from growing during training. As for the second factor of (7), $D^{(n)} \equiv \mathrm{diag}(f'(a^{(n-1)}))$, its $L_2$ norm is equal to its largest absolute eigenvalue; in the standard case of a real-valued diagonal matrix, the norm is simply the largest absolute value among the elements of $f'(a^{(n-1)})$.
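As a rough illustration of equations (3) and (6), the sketch below pushes a delta backward through a stack of identical layers and records its $L_2$ norm after every step. The helper names (`backprop_delta`, `delta_norms`), the 2x2 rotation used as a weight matrix, the 0.9 scaling factor and the fixed derivative vectors are all assumptions chosen for the demonstration; none of them come from the paper's experiments.

```python
import math

def backprop_delta(delta, w, fprime):
    """One step of Eq. (3): delta_prev = delta @ w.T @ diag(fprime).

    delta  : deltas of layer n (length = number of columns of w)
    w      : weight matrix of layer n as a list of rows (inputs x outputs)
    fprime : f'(a) of the previous layer (length = number of rows of w)
    """
    return [fprime[i] * sum(w[i][j] * delta[j] for j in range(len(delta)))
            for i in range(len(w))]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def delta_norms(delta, w, fprime, depth):
    """Norm of the delta after 0, 1, ..., depth backward steps."""
    norms = [l2_norm(delta)]
    for _ in range(depth):
        delta = backprop_delta(delta, w, fprime)
        norms.append(l2_norm(delta))
    return norms

# Hypothetical weights: a rotation by 30 degrees, i.e. an orthogonal matrix.
c, s = math.cos(math.pi / 6), math.sin(math.pi / 6)
rotation = [[c, -s], [s, c]]
shrunk = [[0.9 * x for x in row] for row in rotation]  # all singular values 0.9

start = [1.0, 0.0]
linear = delta_norms(start, rotation, [1.0, 1.0], depth=10)  # f' = 1 everywhere
decayed = delta_norms(start, shrunk, [1.0, 1.0], depth=10)   # ||J|| = 0.9 per layer
masked = delta_norms(start, rotation, [1.0, 0.0], depth=10)  # ReLU-like 0/1 mask

print(linear[-1], decayed[-1], masked[-1])
```

With orthogonal weights and a derivative of 1 everywhere the norm stays constant, while either a weight norm below 1 or a 0/1 derivative mask makes it decay geometrically with depth in this toy setup.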
The maximum value of the derivative for tanh is 1, and for the sigmoid it is 1/4, so $\|D^{(n)}\| \le 1$ and $\|D^{(n)}\| \le 1/4$, respectively. As for the ReLU function, in most cases $\|D^{(n)}\| = 1$. Indeed, for ReLU, the largest element in $f'(a^{(n-1)})$ is not 1 if and only if all elements in $f'(a^{(n-1)})$ are zeros. At the same time, even if both factors in (7) have norm 1, this still does not guarantee norm 1 for $J^{(n)}$. For example, there exist matrices $A$ with $\|A\| = 1$ and $B$ with $\|B\| = 1$ for which $C = AB$ has $\|C\| = 0$. In practice, after passing through ReLUs, the norm of the gradient $\delta^{(n)}$ usually becomes smaller. A sufficient condition for strict preservation of the norm of the backpropagated gradients (6) is orthogonality of the Jacobian matrices (7).

In the theoretical work [14], a new class of random orthogonal initial conditions on weights for deep feedforward networks was proposed. It was hypothesized that such orthogonality of weight matrices produces an effect similar to unsupervised pre-training, leading to faithful propagation of gradients and faster convergence of training. At the same time, the theoretical analysis and experiments in that paper were provided for the linear case. In that case, the activation functions $f(\cdot)$ are linear and therefore the diagonal matrix in the Jacobian (7) becomes simply an identity matrix. Thereby, the Jacobian becomes simply a transposed weight matrix, $J = w_{rec}^T$. Orthogonal initialization of weights is an active area of research in the Deep Learning community [11], [12], [13]. In [21], a solution based on unitary matrices is proposed. Meanwhile, using commonly known non-linear activation functions breaks the orthogonality of the Jacobians even if the weight matrices are orthogonal, and prevents norm preservation in the backpropagation flow. An activation function that provides an orthogonal mapping in a standard neural network setup, where all neurons are independent of each other, is not known yet. The obvious solution $z = \mathrm{abs}(a)$, unfortunately, is not suitable because it is not a monotonic function and therefore shows poor convergence properties.
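The remark that two factors of norm 1 need not give a product of norm 1 can be checked numerically. The sketch below uses a closed-form spectral norm for 2x2 matrices; the matrices `A`, `B`, `Q` and `D` are illustrative choices made here (they are not taken from the paper), and the helper names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def spectral_norm_2x2(m):
    """L2 norm of a 2x2 matrix: square root of the largest eigenvalue of m^T m."""
    p = m[0][0] ** 2 + m[1][0] ** 2            # (m^T m)[0][0]
    r = m[0][1] ** 2 + m[1][1] ** 2            # (m^T m)[1][1]
    q = m[0][0] * m[0][1] + m[1][0] * m[1][1]  # (m^T m)[0][1]
    largest = 0.5 * (p + r + math.sqrt((p - r) ** 2 + 4.0 * q * q))
    return math.sqrt(largest)

# Assumed example: two projections onto different axes.
A = [[1.0, 0.0], [0.0, 0.0]]
B = [[0.0, 0.0], [0.0, 1.0]]
C = matmul(A, B)
print(spectral_norm_2x2(A), spectral_norm_2x2(B), spectral_norm_2x2(C))  # 1.0 1.0 0.0

# Assumed Jacobian of the form (7): orthogonal weights Q (rotation by 45 degrees)
# and a ReLU-like mask D = diag(1, 0). J has norm 1, yet J @ J is contractive.
c = s = math.sqrt(0.5)
Q = [[c, -s], [s, c]]
D = [[1.0, 0.0], [0.0, 0.0]]
Q_T = [list(col) for col in zip(*Q)]
J = matmul(Q_T, D)
JJ = matmul(J, J)
print(spectral_norm_2x2(J), spectral_norm_2x2(JJ))
```

In the second case the product of two such Jacobians has norm $\cos 45^\circ \approx 0.707$, even though the weight matrix is orthogonal and the mask has norm 1.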
In this work, we propose a novel activation function called Orthogonal Permutation Linear Units (OPLU) that ensures the orthogonality of the non-linear mapping and acts on neurons in a pair-wise manner.

3 Orthogonal Permutation Linear Unit activation function (OPLU)

An activation function produces a vector of postsynaptic values $z$ of the same dimensionality as the vector of presynaptic values $a$. Suppose we have a neural network layer with an even number of neurons. Then we may define a list of neuron pairs; for each pair of input presynaptic values $\{a_i, a_j\}$ we get a pair of neuron outputs $\{z_i, z_j\}$ according to the following rule:

$$\begin{pmatrix} z_i \\ z_j \end{pmatrix} = \begin{pmatrix} \max(a_i, a_j) \\ \min(a_i, a_j) \end{pmatrix}. \qquad (8)$$

In effect, we permute pairs of presynaptic values under certain conditions: $(z_i, z_j)^T = (a_i, a_j)^T$ if $a_i \ge a_j$, and $(z_i, z_j)^T = (a_j, a_i)^T$ otherwise (Fig. 1, left).

Fig. 1: Orthogonal Permutation Linear Unit (OPLU) activation function in action (left) and its derivative (right).

This 2D mapping has several interesting properties that make it promising for use as an activation function in neural networks. First, it is non-linear and continuous. Second, similarly to ReLU, it is a piecewise linear mapping: the forward (4) and backward (5) passes may be expressed as multiplication of the argument by the same matrix $D^{(n)}$. Third, this matrix is always orthogonal and, therefore, the OPLU activation function is norm-preserving. This is the most important and promising property, since now the activation function is not a cause of vanishing or exploding gradients. The orthogonality of $D^{(n)}$ is easily seen, since (8) is equal to multiplication of the vector of presynaptic values $a$ by one of two orthogonal 2x2 matrices, the identity matrix $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ or the permutation matrix $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, and a block-diagonal matrix whose blocks are all orthogonal matrices is also an orthogonal matrix. Finally, it is straightforward to implement, computationally efficient, and has low memory requirements. For an implementation in a low-level or middle-level language we do not even need to compute the real-valued outputs; all we need is to change integer pointers to the data values in memory.
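The pairwise rule (8) and its norm-preserving property can be sketched in a few lines. This is a minimal illustration rather than the authors' Caffe or MATLAB code: the function names `oplu_forward` and `oplu_backward`, the choice of neighbouring neurons as pairs, and the sample values are assumptions made here.

```python
import math

def oplu_forward(a, pairs):
    """Eq. (8): for each pair (i, j), z_i = max(a_i, a_j) and z_j = min(a_i, a_j).

    Returns the outputs z and, for each pair, whether the inputs were swapped.
    """
    z = list(a)
    swapped = []
    for i, j in pairs:
        swap = a[i] < a[j]
        if swap:
            z[i], z[j] = a[j], a[i]
        swapped.append(swap)
    return z, swapped

def oplu_backward(grad_z, pairs, swapped):
    """Route incoming gradients through the permutation chosen in the forward pass."""
    grad_a = list(grad_z)
    for (i, j), swap in zip(pairs, swapped):
        if swap:
            grad_a[i], grad_a[j] = grad_z[j], grad_z[i]
    return grad_a

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

a = [0.5, -1.2, 3.0, 3.5, -0.1, -0.7]
pairs = [(0, 1), (2, 3), (4, 5)]  # assumed pairing: neighbouring neurons
grad_z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

z, swapped = oplu_forward(a, pairs)
grad_a = oplu_backward(grad_z, pairs, swapped)
print(z)       # [0.5, -1.2, 3.5, 3.0, -0.1, -0.7]
print(grad_a)  # [1.0, 2.0, 4.0, 3.0, 5.0, 6.0]
print(l2_norm(a), l2_norm(z), l2_norm(grad_z), l2_norm(grad_a))
```

Because both passes only reorder values, the norms of the activations and of the gradients are unchanged, which is the property the text relies on.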
Init: split all of the layer's neurons into pairs $\{p_k\}^{(n)} = \{(i, j)\}^{(n)}$, $k = 1, \ldots, N_L/2$, where $N_L$ is the number of neurons in the layer, $n$ is the layer's number, and $i$, $j$ are neuron indexes.

Forward pass: for each pair of neurons $p_k = (i, j)$, $k = 1, \ldots, N_L/2$, calculate the next values:

$$\begin{pmatrix} z_i^{(n)} \\ z_j^{(n)} \end{pmatrix} = \begin{cases} \begin{pmatrix} a_i^{(n)} & a_j^{(n)} \end{pmatrix}^T & \text{if } a_i^{(n)} \ge a_j^{(n)}, \\ \begin{pmatrix} a_j^{(n)} & a_i^{(n)} \end{pmatrix}^T & \text{if } a_i^{(n)} < a_j^{(n)}. \end{cases}$$

Backward pass: for each pair of neurons $p_k = (i, j)$, $k = 1, \ldots, N_L/2$, calculate the previous deltas:

$$\begin{pmatrix} \delta_i^{(n)} \\ \delta_j^{(n)} \end{pmatrix} = \begin{cases} \begin{pmatrix} \tilde{\delta}_i^{(n+1)} & \tilde{\delta}_j^{(n+1)} \end{pmatrix}^T & \text{if } a_i^{(n)} \ge a_j^{(n)}, \\ \begin{pmatrix} \tilde{\delta}_j^{(n+1)} & \tilde{\delta}_i^{(n+1)} \end{pmatrix}^T & \text{if } a_i^{(n)} < a_j^{(n)}, \end{cases}$$

where $\tilde{\delta}_m^{(n+1)} = \sum_l w_{ml}^{(n+1)} \delta_l^{(n+1)}$, $m = 1, \ldots, N_L$.

Surely, it is possible to use permutations of orders higher than 2. However, this does not seem useful in practice, because high interconnectivity between neurons may lead to overfitting. Actually, the core idea of the popular regularization method dropout [22] is to prevent such interconnectivity as much as possible. We suppose that pair-wise interconnectivity between the neurons is a minimal price for strict orthogonality of the mapping's derivative.

4 Experiments

4.1 MNIST Problem

As a feasibility check, we first tried to train a feedforward network on the MNIST problem. We used the LeNet convolutional network. It is a standard out-of-the-box Caffe example; its architecture is [conv 5x5]-[pool max]-[conv 5x5]-[pool max]-[full connected]-[relu]-[full connected]-[softmax]. Nets were trained using the default parameters: the Stochastic Gradient Descent (SGD) algorithm, learning rate $\alpha = 10^{-2}$, momentum $\mu = 0.9$, 10,000 iterations. Initial weights were filled by the standard Xavier method [10]. We trained a set of 10 networks for each activation function. As we see from Table 1, the results for OPLU are very similar to those for tanh and ReLU. Surprisingly, we were able to exceed the 99% threshold without any tuning of the training hyper-parameters, which were optimized for the ReLU function.

Table 1: Classification accuracies on the MNIST problem for different activation functions.

        best     mean
TanH    99.16%   99.07%
ReLU    99.17%   99.10%
OPLU    99.16%   99.06%

Our Caffe implementation of the OPLU function is available for download here: oplu_caffe.git.

4.2 Adding problem

We trained a recurrent network on the Adding problem. It is a synthetic problem for testing the ability of a neural network to capture long-term dependencies in data [19], [12], [21]. The input consists of a sequence of random numbers, where two random positions (one in the beginning and one in the middle of the sequence) are marked. The model must predict the sum of the two marked random numbers after the entire sequence has been seen. We trained Simple Recurrent Networks (SRN) [19] with one hidden layer containing 100 units and a linear output layer.

Fig. 2: Mean norms of initial backpropagated gradients for SRN, horizon of BPTT $h = 100$. We see that the SRN with the OPLU activation function, initialized by a random orthogonal matrix (red color), has constant backpropagated gradients.

The gradients were obtained using the BPTT method. For the tanh and ReLU functions, the weights were initialized by the Xavier method [10]; for the OPLU case, the weights were initialized by random orthogonal matrices. To generate them we took the matrix exponential of random skew-symmetric matrices; we found by chance that such initialization works better than MATLAB's built-in orth() function. We trained the networks using SGD with $\alpha = 10^{-4}$, $\mu = 0.9$, and mini-batches of size 20.
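The orthogonal initialization described above (matrix exponential of a random skew-symmetric matrix) can be sketched without external libraries. This is an assumption-laden illustration, not the authors' MATLAB code: the function names, the Gaussian entries, the seed, and the scaling-and-squaring Taylor approximation of the exponential are all choices made here for a self-contained example.

```python
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def random_skew_symmetric(n, rng, scale=1.0):
    """Random matrix S with S^T = -S (zero diagonal); Gaussian entries are an assumption."""
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            v = rng.gauss(0.0, scale)
            s[i][j], s[j][i] = v, -v
    return s

def expm(a, squarings=8, terms=12):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    n = len(a)
    scaled = [[x / (2 ** squarings) for x in row] for row in a]
    result, term = identity(n), identity(n)
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, scaled)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def random_orthogonal(n, seed=0):
    """exp(S) is orthogonal whenever S is skew-symmetric."""
    rng = random.Random(seed)
    return expm(random_skew_symmetric(n, rng))

q = random_orthogonal(6)
q_t = [list(col) for col in zip(*q)]
qtq = matmul(q_t, q)
max_dev = max(abs(qtq[i][j] - (1.0 if i == j else 0.0))
              for i in range(6) for j in range(6))
print(max_dev)  # deviation of Q^T Q from the identity, expected to be tiny
```

With a numerical library available, a dedicated matrix-exponential routine would normally replace the hand-written `expm`; the pure-Python version is used here only to keep the example dependency-free.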
The dataset contains 20,000 samples for training, 1000 samples for validation and 10,000 samples for testing. The training process consists of 2000 epochs for $T \in \{30, 50, 70\}$ and 5000 epochs for $T = 100$; each epoch has 50 iterations. For the ReLU activation function we were not able to successfully train the SRN. It seems that the reason is the fast vanishing of gradients in this case (Fig. 2, green). OPLU shows good performance, similar to tanh, for the comparatively short sequences $T \in \{30, 50, 70\}$. It shows even better performance for the best exemplars from the sets, as well as faster convergence. However, for $T = 100$ we were not able to successfully train the SRN with OPLU; currently, we cannot accurately explain why. The MATLAB code for this experiment is available here: oplu_adding.git.
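For readers unfamiliar with the Adding problem, one sample can be generated roughly as follows. The text only says that one marked position lies in the beginning and one in the middle of the sequence, so the exact ranges used here (first tenth, and from there up to the midpoint), the uniform [0, 1) values, and the function name `adding_sample` are assumptions for illustration, not the authors' setup.

```python
import random

def adding_sample(length, rng):
    """Return (sequence, target): a list of (value, marker) pairs and the marked sum."""
    if length < 4:
        raise ValueError("length must be at least 4")
    values = [rng.random() for _ in range(length)]
    split = max(1, length // 10)
    first = rng.randrange(0, split)             # assumed "beginning": first tenth
    second = rng.randrange(split, length // 2)  # assumed "middle": up to the midpoint
    markers = [0.0] * length
    markers[first] = markers[second] = 1.0
    target = values[first] + values[second]
    return list(zip(values, markers)), target

rng = random.Random(0)
sequence, target = adding_sample(30, rng)
marked = [value for value, marker in sequence if marker == 1.0]
print(len(sequence), marked, target)
```

Each time step feeds the network a two-component input (value and marker), and the scalar target is compared with the output of the linear layer after the last step.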

Fig. 3: Validation MSE error curves for different lengths $T$ during training for the Adding problem. Tanh (blue color), OPLU (red color).

Table 2: Rates of success for different activations for the Adding problem.

            T = 30            T = 50            T = 70            T = 100
        best     mean     best     mean     best     mean     best     mean
TanH    99.24%   98.49%   98.90%   98.48%   99.31%   90.46%   98.44%   52.91%
OPLU    99.34%   98.83%   99.21%   98.70%   99.43%   81.58%   16.33%   15.36%

5 Conclusion

In this study we introduced a new type of piecewise-linear activation function. This function (OPLU) acts pairwise on the postsynaptic potentials of a network's layer, and its derivative is an orthogonal operator at every point. This approach is promising thanks to a strong and clear mathematical justification that guarantees strict norm preservation for an unlimited number of layers if their weight matrices are orthogonal. It is also interesting that an isomorphism could be established between the OPLU activation function and Maxout [16]. At the current stage of the research we have shown the feasibility of the OPLU activation for small problems with feedforward convolutional and simple recurrent networks. Exploring its potential and limitations for real-life problems is a subject of our future research.

References

1. G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science.
2. D.C. Ciresan, U. Meier, L.M. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12).
3. A. Krizhevsky, I. Sutskever, and G.E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
4. K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint.
5. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint.
6. Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks, 5(2).
7. R. Boné and H. Cardot. Advanced Methods for Time Series Prediction Using Recurrent Neural Networks, page 24. Intech, Croatia.
8. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8).
9. H. Jaeger. Long short-term memory in echo state networks: Details of a simulation study. Technical Report 27, Jacobs University.
10. X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
11. D. Mishkin and J. Matas. All you need is a good init. arXiv preprint, 2015.
12. Q.V. Le, N. Jaitly, and G.E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint.
13. P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint.
14. A.M. Saxe, J.L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint.
15. X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics.
16. I.J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv preprint.
17. D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint.
18. S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master's thesis, TU Munich.
19. R. Pascanu and Y. Bengio. On the difficulty of training recurrent neural networks. Technical report, Université de Montréal.
20. Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. Advances in optimizing recurrent networks. In ICASSP.
21. M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. arXiv preprint.
22. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 2014.
