Training Convolutional Neural Networks

Carlo Tomasi

November 26, 2018

1 The Soft-Max Simplex

Neural networks are typically designed to compute real-valued functions y = h(x) : R^d → R^e of their input x. When a classifier is needed, a soft-max function is used as the last layer, with e entries in its output vector p if there are e classes in the label space Y. The class corresponding to input x is then found as the arg max of p. Thus, the network can be viewed as a function

    p = f(x, w) : X → P

that transforms the data space X into the soft-max simplex P, the set of all nonnegative real-valued vectors p ∈ R^e whose entries add up to 1:

    P = {p ∈ R^e : p ≥ 0 and Σ_{i=1}^e p_i = 1}.

This set has dimension e − 1, and is the convex hull of the e columns of the identity matrix in R^e. Figure 1 shows the 1-simplex and the 2-simplex. (In geometry, simplices are named by their dimension, which is one less than the number of classes.)

The vector w in the expression above collects all the parameters of the neural network, that is, the gains and biases of all the neurons. More specifically, for a deep neural network with K layers indexed by k = 1, ..., K, we can write

    w = [w^(1) ; ... ; w^(K)]

where w^(k) is a vector collecting both gains and biases for layer k. If the arg max rule is used to compute the class,

    ŷ = h(x) = arg max_c p_c,

then the network has a low training risk if the transformed data points p fall in the decision regions

    P_c = {p : p_c ≥ p_j for j ≠ c}   for c = 1, ..., e.

These regions are convex, because their boundaries are defined by linear inequalities in the entries of p. Thus, when used for classification, the neural network can be viewed as learning a transformation of the original decision regions in X into the convex decision regions in the soft-max simplex.
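As a concrete illustration of this mapping (a hypothetical NumPy sketch; the activation values and the three-class setup are made up, not taken from the text), the following computes a soft-max output, verifies that it lies in the simplex P, and applies the arg max rule:

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum leaves the result unchanged (soft-max is
    # invariant to adding a constant to all entries) but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])   # made-up last-layer activations, e = 3 classes
p = softmax(z)

# p is a point of the 2-simplex: nonnegative entries that add up to 1
assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)

y_hat = int(np.argmax(p))        # the arg max rule picks the class
```

With these scores, p places most of its mass on the first class, so the arg max rule returns class 0 (in zero-based indexing).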
Figure 1: The 1-simplex for two classes (dark segment in the diagram on the left) and the 2-simplex for three classes (light triangle in the diagram on the right). The blue dot on the left and the blue line segments on the right are the boundaries of the decision regions. The boundaries meet at the unit point 1/e in e dimensions.

2 Loss

The risk L_T to be minimized to train a neural network is the average loss on a training set of input-output pairs T = {(x_1, y_1), ..., (x_N, y_N)}. The outputs y_n are categorical in a classification problem, and real-valued vectors in a regression problem. For a regression problem, the loss function is typically the quadratic loss, ℓ(y, y′) = ‖y − y′‖².

For classification, on the other hand, we would like the risk L_T(h) to be differentiable, in order to be able to use gradient descent methods. However, the arg max is a piecewise-constant function, and its derivatives are either zero or undefined (where the arg max changes value). The zero-one loss function has similar properties. To address these issues, a differentiable loss defined on f is used as a proxy for the zero-one loss defined on h. Specifically, the multi-class cross-entropy loss is used, which we studied in the context of logistic-regression classifiers. Its definition is repeated here for convenience:

    ℓ(y, p) = − log p_y.

Equivalently, if q(y) is the one-hot encoding of the true label y, the cross-entropy loss can also be written as follows:

    ℓ(y, p) = − Σ_{k=1}^e q_k(y) log p_k.

With these definitions, L_T is a piecewise-differentiable function, and one can use gradient or sub-gradient methods to compute the gradient of L_T with respect to the parameter vector w. Exceptions to differentiability are due to the use of the ReLU, which has a cusp at the origin, as the nonlinearity in neurons, as well as to the possible use of max-pooling. These exceptions are pointwise, and
are typically ignored in both the literature and the software packages used to minimize L_T. If desired, they could be addressed by either computing sub-gradients rather than gradients [3], or rounding out the cusps with differentiable joints.

As usual, once the loss has been settled on, the training risk is defined as the average loss over the training set, and expressed as a function of the parameters w of f:

    L_T(w) = (1/N) Σ_{n=1}^N ℓ_n(w)   where   ℓ_n(w) = ℓ(y_n, f(x_n, w)).   (1)

3 Back-Propagation

A local minimum for the risk L_T(w) is found by an iterative procedure that starts with some initial values w_0 for w, and then at step t performs the following operations:

1. Compute the gradient of the training risk, ∂L_T/∂w evaluated at w = w_t.

2. Take a step that reduces the value of L_T by moving in the direction of the negative gradient by a variant of the steepest descent method called Stochastic Gradient Descent (SGD), discussed in Section 4.

The gradient computation is called back-propagation and is described next. The computation of the n-th loss term ℓ_n(w) can be rewritten as follows:

    x^(0) = x_n
    x^(k) = f^(k)(W^(k) x^(k−1))   for k = 1, ..., K
    p = x^(K)
    ℓ_n = ℓ(y_n, p)

where (x_n, y_n) is the n-th training sample and f^(k) describes the function implemented by layer k.

Computation of the derivatives of the loss term ℓ_n(w) can be understood with reference to Figure 2. The term ℓ_n depends on the parameter vector w^(k) for layer k through the output x^(k) from that layer and nothing else, so that we can write

    ∂ℓ_n/∂w^(k) = (∂ℓ_n/∂x^(k)) (∂x^(k)/∂w^(k))   for k = K, ..., 1   (2)

and the first gradient on the right-hand side satisfies the backward recursion

    ∂ℓ_n/∂x^(k−1) = (∂ℓ_n/∂x^(k)) (∂x^(k)/∂x^(k−1))   for k = K, ..., 2   (3)

because ℓ_n depends on the output x^(k−1) from layer k − 1 only through the output x^(k) from layer k. The recursion (3) starts with

    ∂ℓ_n/∂x^(K) = ∂ℓ/∂p   (4)
Figure 2: Example data flow for the computation of the loss term ℓ_n for a neural network with K = 3 layers. When viewed from the loss term ℓ_n, the output x^(k) from layer k (pick for instance k = 2) is a bottleneck of information for both the parameter vector w^(k) for that layer and the output x^(k−1) from the previous layer (k = 1 in the example). This observation justifies the use of the chain rule for differentiation to obtain equations (2) and (3).

where p is the second argument to the loss function ℓ. In the equations above, the derivative of a function with respect to a vector is to be interpreted as the row vector of all derivatives. Let d_k be the dimensionality (number of entries) of x^(k), and j_k be the dimensionality of w^(k). The two matrices

    ∂x^(k)/∂w^(k) = [ ∂x_i^(k)/∂w_j^(k) ]   (a d_k × j_k matrix)   and   ∂x^(k)/∂x^(k−1) = [ ∂x_i^(k)/∂x_j^(k−1) ]   (a d_k × d_{k−1} matrix)   (5)

are the Jacobian matrices of the layer output x^(k) with respect to the layer parameters and inputs. Computation of the entries of these Jacobians is a simple exercise in differentiation, and is left to the Appendix.

The equations (2)-(5) are the basis for the back-propagation algorithm for the computation of the gradient of the training risk L_T(w) with respect to the parameter vector w of the neural network (Algorithm 1). The algorithm loops over the training samples. For each sample, it feeds the input x_n to the network to compute the layer outputs x^(k) for that sample and for all k = 1, ..., K, in this order. The algorithm temporarily stores all the values x^(k), because they are needed to compute the required derivatives. This initial volley of computation is called forward propagation (of the inputs). The algorithm then revisits the layers in reverse order while computing the derivatives in equation (4) first and then in equations (2) and (3), and concatenates the resulting K layer gradients into a single gradient ∂ℓ_n/∂w. This computation is called back-propagation (of the derivatives). The gradient of L_T(w) is the average (from equation (1)) of the gradients
computed for each of the samples:

    ∂L_T/∂w = (1/N) Σ_{n=1}^N ∂ℓ_n/∂w   where   ∂ℓ_n/∂w = [ ∂ℓ_n/∂w^(1) ; ... ; ∂ℓ_n/∂w^(K) ]

(here, the derivative with respect to w is read as a column vector of derivatives). This average vector can be accumulated (see the last assignment in Algorithm 1) as back-propagation progresses.

For succinctness, operations are expressed as matrix-vector computations in Algorithm 1. In practice, the matrices would be very sparse, and correlations and explicit loops over appropriate indices are used instead.
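Before the algorithm is stated in full, the recursions (2)-(4) can be made concrete with a small sketch (hypothetical NumPy code under simplifying assumptions: fully-connected layers with no biases, ReLU nonlinearities, a soft-max output with cross-entropy loss, and made-up sizes and data). The vector g plays the role of the row vector ∂ℓ_n/∂x^(k), and the resulting gradient is checked against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 3]            # made-up: 4 inputs, one hidden layer of 5, e = 3 classes
W = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
x_n, y_n = rng.standard_normal(4), 2      # one made-up training sample

def forward(W, x):
    xs = [x]                              # store all layer outputs x^(k)
    for k, Wk in enumerate(W):
        a = Wk @ xs[-1]
        if k < len(W) - 1:
            xs.append(np.maximum(a, 0.0))     # ReLU layer
        else:
            e = np.exp(a - a.max())
            xs.append(e / e.sum())            # soft-max layer
    return xs

def backward(W, xs, y):
    # For soft-max followed by cross-entropy, the derivative of the loss with
    # respect to the last-layer activations is p - q(y), a standard identity
    # that folds equation (4) and the soft-max Jacobian into one step.
    g = xs[-1].copy()
    g[y] -= 1.0
    grads = []
    for k in reversed(range(len(W))):
        grads.insert(0, np.outer(g, xs[k]))   # per-layer gradient, equation (2)
        g = W[k].T @ g                        # backward recursion, equation (3)
        if k > 0:
            g = g * (xs[k] > 0)               # ReLU Jacobian: diagonal of 0s and 1s
    return grads

xs = forward(W, x_n)
loss = -np.log(xs[-1][y_n])                   # cross-entropy loss term l_n
grads = backward(W, xs, y_n)

# Check one gradient entry against a finite difference
eps = 1e-6
W[0][1, 2] += eps
num = (-np.log(forward(W, x_n)[-1][y_n]) - loss) / eps
W[0][1, 2] -= eps
assert abs(num - grads[0][1, 2]) < 1e-4
```

The running average over samples in Algorithm 1 then accumulates these per-sample gradients into ∂L_T/∂w.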
Algorithm 1 Backpropagation

function ∇L_T = backprop(T, w = [w^(1), ..., w^(K)], ℓ)
    ∇L_T = zeros(size(w))
    for n = 1, ..., N do
        x^(0) = x_n
        for k = 1, ..., K do                    ▷ Forward propagation
            x^(k) ← f^(k)(x^(k−1), w^(k))       ▷ Compute and store layer outputs, to be used in back-propagation
        end for
        ∇ℓ_n = [ ]                              ▷ Initially empty contribution of the n-th sample to the loss gradient
        g = ∂ℓ(y_n, x^(K))/∂p                   ▷ g is ∂ℓ_n/∂x^(k)
        for k = K, ..., 1 do                    ▷ Back-propagation
            ∇ℓ_n ← [ g ∂x^(k)/∂w^(k) ; ∇ℓ_n ]  ▷ Derivatives are evaluated at w^(k) and x^(k)
            g ← g ∂x^(k)/∂x^(k−1)               ▷ Ditto (this update is not needed when k = 1)
        end for
        ∇L_T ← ((n − 1) ∇L_T + ∇ℓ_n)/n         ▷ Accumulate the average
    end for
end function

4 Stochastic Gradient Descent

In principle, a neural network can be trained by minimizing the training risk L_T(w) defined in equation (1) by any of a vast variety of numerical optimization methods [5, 2]. At one end of the spectrum, methods that make no use of gradient information take too many steps to converge. At the other end, methods that use second-order derivatives (the Hessian) to determine high-quality steps tend to be too expensive in terms of both space and time at each iteration, although some researchers advocate these types of methods [4]. By far the most widely used methods employ gradient information, computed by back-propagation [1]. Line search is too expensive, and the step size is therefore chosen according to some heuristic instead.

The momentum method [6, 8] starts from an initial value w_0 chosen at random and iterates as follows:

    v_{t+1} = µ_t v_t − α ∇L_T(w_t)
    w_{t+1} = w_t + v_{t+1}

The vector v_{t+1} is the step or velocity that is added to the old value w_t to compute the new value w_{t+1}. The scalar α > 0 is the learning rate that determines how fast to move in the direction opposite to the risk gradient ∇L_T(w_t), and the time-dependent scalar µ_t ∈ [0, 1] is the momentum coefficient. Gradient descent is obtained when µ_t = 0. Greater values of µ_t encourage steps in a consistent direction (since the new velocity v_{t+1} has a greater component in the direction of the old velocity v_t than if no momentum were present), and these steps accelerate descent when the gradient of L_T(w) is small, as is the case around
shallow minima. The value of µ_t is often varied according to some schedule like the one in Figure 3. The rationale for the increasing values over time is that momentum is more useful in later stages, in which the gradient magnitude is very small as w_t approaches the minimum.

The learning rate α is often fixed, and is a parameter of critical importance [9]. A rate that is too large leads to large steps that often overshoot, and a rate that is too small leads to very slow progress. In practice, an initial value of α is chosen by cross-validation to be some value much smaller than 1. Convergence can
Figure 3: A possible schedule [8] for the momentum coefficient µ_t. (The plotted schedule has the form µ_t = min(µ_max, ·) and grows from about 0.5 to about 0.9 over the first 3000 iterations.)

take between hours and weeks for typical applications, and the value of L_T is typically monitored through some user interface. When progress starts to saturate, the value of α is decreased (say, divided by 10).

Mini-Batches  The gradient of the risk L_T(w) is expensive to compute, and one tends to use as large a learning rate as possible so as to minimize the number of steps taken. One way to prevent the resulting overshooting would be to do online learning, in which each step

    v_{t+1} = µ_t v_t − α ∇ℓ_n(w_t)

(there is one such step for each training sample) is taken right away, rather than accumulated into the step

    v_{t+1} = µ_t v_t − α ∇L_T(w_t)

(no subscript n here). In contrast, using the latter step is called batch learning. Computing ∇ℓ_n is much less expensive (by a factor of N) than computing ∇L_T. In addition, and most importantly for convergence behavior, online learning breaks a single batch step into N small steps, after each of which the value of the risk is re-evaluated. As a result, the online steps can follow very curved paths, whereas a single batch step can only move in a fixed direction in parameter space. Because of this greater flexibility, online learning converges faster than batch learning for the same overall computational effort.

The small online steps, however, have high variance, because each of them is taken based on minimal amounts of data. One can improve convergence further by processing mini-batches of training data: accumulate the B gradients ∇ℓ_n from the data in one mini-batch into a single gradient, take the step, and move on to the next mini-batch. It turns out that small values of B achieve the best compromise between reducing variance and keeping steps flexible. Values of B around a few dozen are common.

Termination  When used outside learning, gradient descent is typically stopped when steps make little progress, as measured by step size ‖w_t − w_{t−1}‖ and/or decrease in function value L_T(w_{t−1}) − L_T(w_t). When training a
deep network, on the other hand, descent is often stopped earlier to improve generalization. Specifically, one monitors the zero-one risk (the error rate) of the classifier on a validation set, rather than the cross-entropy risk of the soft-max output on the training set, and stops when the validation-set error bottoms out, even if the training-set risk would continue to decrease. A different way to improve generalization, sometimes used in combination with early termination, is discussed in Section 5.

5 Dropout

Since deep nets have a large number of parameters, they would need impractically large training sets to avoid overfitting if no special measures are taken during training. Early termination, described at the end of the
previous section, is one such measure. In general, the best way to avoid overfitting in the presence of limited data would be to build one network for every possible setting of the parameters, compute the posterior probability of each setting given the training set, and then aggregate the nets into a single predictor that computes the average output weighted by the posterior probabilities. This approach, which is reminiscent of building a forest of trees, is obviously infeasible to implement for nontrivial nets.

One way to approximate this scheme in a computationally efficient way is called the dropout method [7]. Given a deep network to be trained, a dropout network is obtained by flipping a biased coin for each node of the original network and dropping that node if the flip turns out heads. Dropping the node means that all the weights and biases for that node are set to zero, so that the node becomes effectively inactive. One then trains the network by using mini-batches of training data, and performs one iteration of training on each mini-batch after turning off neurons independently, so that each neuron is retained with probability p. When training is done, all the weights in the network are multiplied by p, and this effectively averages the outputs of the nets with weights that depend on how often a unit participated in training. The value of p is typically set to 1/2, in which case dropping and retaining a unit are equally likely. Each dropout network can be viewed as a different network, and the dropout method effectively samples a large number of nets efficiently.
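The coin-flipping scheme can be sketched for the outputs of one layer (hypothetical NumPy code; the layer values and function names are illustrative assumptions, and dropping a unit is modeled by zeroing its output, which for a single unit has the same effect as zeroing its weights and bias):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                 # probability that a unit is retained

def dropout_train(h, p, rng):
    # Flip a biased coin per unit; dropped units contribute nothing
    # to this training iteration.
    mask = rng.random(h.shape) < p
    return h * mask

def dropout_test(h, p):
    # After training, scaling by p averages over the sampled sub-networks.
    return h * p

h = np.array([1.0, -2.0, 3.0, 0.5])     # made-up outputs of one layer
h_train = dropout_train(h, p, rng)      # a random subset of entries is zeroed
h_test = dropout_test(h, p)             # deterministic, scaled output
```

Each call to dropout_train samples a different sub-network, so successive mini-batches train different members of the implicit ensemble.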
Appendix: The Jacobians for Back-Propagation

If f^(k) is a point function, that is, if it is R → R, the individual entries of the Jacobian matrices (5) are easily found to be (reverting to matrix subscripts for the weights)

    ∂x_i^(k)/∂W_qj^(k) = δ_iq (df^(k)/da) x_j^(k−1)   and   ∂x_i^(k)/∂x_j^(k−1) = (df^(k)/da) W_ij^(k).

The Kronecker delta

    δ_iq = 1 if i = q, and 0 otherwise

in the first of the two expressions above reflects the fact that x_i^(k) depends only on the i-th activation, which is in turn the inner product of row i of W^(k) with x^(k−1). Because of this, the derivative of x_i^(k) with respect to entry W_qj^(k) is zero if this entry is not in that row, that is, when q ≠ i. The expression df^(k)/da is shorthand for the derivative of the activation function f^(k) with respect to its only argument a, evaluated for a = a_i^(k). For the ReLU activation function h_k = h,

    df^(k)/da = 1 for a ≥ 0, and 0 otherwise.

For the ReLU activation function followed by max-pooling, h_k(·) = π(h(·)), on the other hand, the value of the output at index i is computed from a window P(i) of activations, and only one of the activations (the one with the highest value) in the window is relevant to the output. (In case of a tie, we attribute the highest value in P(i) to one of the highest inputs, say, chosen at random.) Let then

    p_i^(k) = max_{q ∈ P(i)} h(a_q^(k))

be the value resulting from max-pooling over the window P(i) associated with output i of layer k. Furthermore, let

    q̂ = arg max_{q ∈ P(i)} h(a_q^(k))

be the index of the activation where that maximum is achieved, where for brevity we leave the dependence of q̂ on the activation index i and layer k implicit. Then,

    ∂x_i^(k)/∂W_qj^(k) = δ_qq̂ (df^(k)/da)|_{a=a_q̂^(k)} x_j^(k−1)   and   ∂x_i^(k)/∂x_j^(k−1) = (df^(k)/da)|_{a=a_q̂^(k)} W_q̂j^(k).
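For the point-function case, these entries can be checked numerically (a hypothetical NumPy sketch; the 3×4 layer, the random values, and the choice of a three-index array to hold ∂x_i^(k)/∂W_qj^(k) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))      # a made-up weight matrix W^(k)
x_prev = rng.standard_normal(4)      # layer input x^(k-1)
a = W @ x_prev                       # activations a^(k)
d = (a >= 0).astype(float)           # df^(k)/da for the ReLU, evaluated at a^(k)

# Jacobian with respect to the inputs: entry (i, j) is (df/da)_i * W_ij
J_x = d[:, None] * W

# Jacobian with respect to the weights: entry (i, q, j) is
# delta_iq * (df/da)_i * x_j^(k-1), stored as a 3-index array
J_W = np.einsum('iq,i,j->iqj', np.eye(3), d, x_prev)

# Check one column of J_x against a finite difference in x_prev[1]
eps = 1e-6
x2 = x_prev.copy()
x2[1] += eps
num = (np.maximum(W @ x2, 0.0) - np.maximum(a, 0.0)) / eps
assert np.allclose(num, J_x[:, 1], atol=1e-4)
```

In practice these Jacobians are never materialized as dense arrays; back-propagation applies them to the running row vector g, which is what the sparsity remark after Algorithm 1 refers to.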
References

[1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[3] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I: Fundamentals, volume 305. Springer Science & Business Media, 2013.

[4] J. Martens. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, pages 735-742, 2011.

[5] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, NY, 1999.

[6] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.

[7] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.

[8] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1139-1147, 2013.

[9] D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16:1429-1451, 2003.