SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning

Wei Wen, Yandan Wang, Feng Yan, Member, IEEE, Cong Xu, Chunpeng Wu, Yiran Chen, Fellow, IEEE, and Hai (Helen) Li, Fellow, IEEE

arXiv preprint, v3 [cs.LG], Dec 2018

Abstract: In Deep Learning, Stochastic Gradient Descent (SGD) is usually selected as the training method because of its efficiency; however, a problem in SGD has recently gained research interest: sharp minima in Deep Neural Networks (DNNs) have poor generalization, and large-batch SGD in particular tends to converge to sharp minima. It remains an open question whether escaping sharp minima can improve generalization. To answer this question, we propose the SmoothOut framework to smooth out sharp minima in DNNs and thereby improve generalization. In a nutshell, SmoothOut perturbs multiple copies of the DNN by noise injection and averages these copies. Injecting noise into SGD is widely used in the literature, but SmoothOut differs in several ways: (1) a de-noising process is applied before parameter updating; (2) noise strength is adapted to filter norm; (3) an alternative interpretation of the advantage of noise injection, from the perspective of sharpness and generalization; (4) usage of uniform noise instead of Gaussian noise. We prove that SmoothOut can eliminate sharp minima. Training multiple DNN copies is inefficient, so we further propose an unbiased stochastic SmoothOut which only introduces the overhead of noise injection and de-noising per batch. An adaptive variant of SmoothOut, AdaSmoothOut, is also proposed to further improve generalization. In a variety of experiments, SmoothOut and AdaSmoothOut consistently improve generalization in both small-batch and large-batch training on top of state-of-the-art solutions.

Index Terms: Deep Learning, Neural Networks, Sharp Minima, Generalization, SGD.

I. INTRODUCTION

Stochastic Gradient Descent (SGD) is the dominant optimization method used to train Deep Neural Networks (DNNs). However, the generalization of DNNs needs more understanding. Recently, one observation is that large-batch SGD has worse generalization than small-batch SGD [1][2][3]. The accuracy difference between small-batch training and large-batch training is the well-known generalization gap [2]. Reasons behind the generalization gap are still under active research. Hoffer et al. [4] hypothesize that the process of SGD is similar to a random walk on a random potential [5]. This hypothesis attributes the generalization gap to the limited number of parameter updates, and suggests training for more iterations. Learning Rate Scaling (LRS) was also proposed to match walk statistics and close the gap. Inspired by this hypothesis, practical techniques were proposed [3][6][7][8].

Another appealing hypothesis, which arouses recent research interest, is that generalization is attributed to the flatness of minima [9][10]; that is, flat minima have good generalization while sharp minima can worsen it. The hypothesis can be applied to both small-batch and large-batch SGD, but large-batch SGD tends to converge to sharper minima, ending up with the generalization gap. Sharp minima have bad generalization due to their high over-fitting to training data [9][10] and high sensitivity to noise [10]. Jastrzębski et al. [12] showed the connection between these two hypotheses: LRS motivated by the random walk leads to flatter minima and helps to improve generalization.

W. Wen is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, 27708 USA. Y. Wang is with the University of Pittsburgh. F. Yan is with the University of Nevada, Reno. C. Xu is with HP Labs. C. Wu, H. Li and Y. Chen are with Duke University. Manuscript for review.
Our approach is based on the second hypothesis, targeting escaping sharp minima for better generalization in both small-batch and large-batch SGD. Moreover, our approach can enhance techniques inspired by the first hypothesis and further improve generalization. Keskar et al. [2] attempted to escape sharp minima through data augmentation, conservative training and adversarial training. However, none of these attempts completely remedies the problem [2], leaving how to avoid sharp minima as an open question. We propose SmoothOut to smooth out sharp minima and guide the convergence of SGD to flatter regions. SmoothOut slightly perturbs the DNN function by noise injection or function reshaping, then averages all perturbed DNNs. Because sharp minima are sensitive to perturbation, a slight perturbation can result in a significant function increase at each sharp minimum, which means the averaged value will be high. In this way, sharp minima can be eliminated. Conversely, a small perturbation only influences the margin of each flat region and the flat bottom still aligns well with the original bottom. Averaging aligned bottoms can maintain the original minimum. Beyond this intuition, we prove that SmoothOut under uniform noise can eliminate sharp minima while maintaining flat minima. Note that we mainly use uniform noise for study as it is well motivated, but other noise types like Gaussian noise can fit into our SmoothOut framework. Moreover, training over many perturbed DNNs for averaging is computation intensive. We propose Stochastic SmoothOut, which injects noise per iteration during SGD. We prove that Stochastic SmoothOut is equivalent to the original SmoothOut in expectation. Adaptive SmoothOut, AdaSmoothOut, is also proposed to further improve generalization by adapting noise strength to filter norm. Our experiments show that SmoothOut and AdaSmoothOut can help to escape sharp minima and improve generalization. SmoothOut and AdaSmoothOut are easy to implement and our code is at

II. RELATED WORK

Sharp Minima and Generalization. Why deep neural networks generalize well still needs deeper understanding [13]. As aforementioned, one hypothesis is that SGD finds flat minima which can generalize well [9][2]. As Hochreiter et al. [9] pointed out, based on Minimum Description (Message) Length theory [11][14], a flatter minimum can be encoded in fewer bits, which indicates a simpler DNN model with better generalization. Alternative explanations are based on Bayesian learning [15][16]. Dinh et al. [17] further argue that current definitions of sharpness are problematic and a redefinition is required for explanation. However, Keskar et al. [2] indeed found that large-batch training sticks to sharp minima and has bad generalization. Different from previous work, we focus on new variants of SGD to escape sharp minima. We find that our method not only can escape sharp minima during large-batch training but also guide small-batch training to flatter ones, therefore improving generalization in both cases.

Chaudhari et al. [10] proposed Entropy-SGD, which maximizes local entropy to bias SGD toward flat minima ("wide valleys"). Local entropy was constructed by building connections between the Gibbs distribution and optimization problems. The gradient of local entropy was estimated by Langevin dynamics [18], which is computation intensive. Compared with Entropy-SGD, our SmoothOut is more efficient since noise injection and de-noising are the only overhead. This enables SmoothOut to scale to larger datasets like ImageNet [19]. Moreover, SmoothOut consistently improves generalization in all experiments, while Entropy-SGD achieved comparable generalization error. Izmailov et al. [20] proposed Stochastic Weight Averaging (SWA), which records the latest parameter points along the trajectory of SGD and then simply averages them to get the final optimum. Compared with SWA, SmoothOut performs stochastic averaging over perturbed models; SWA relies on pre-training to converge near to minima, while SmoothOut can train from scratch; moreover, it is unanswered whether SWA can improve generalization in large-batch SGD, while SmoothOut can improve it both experimentally and theoretically.

Noise Injection. Noise injection is a commonly used method in SGD [21][22][23][24][25][26] and Bayesian neural networks [27][28][29], where usually Gaussian noise is injected into parameters or gradients for exploration or distribution approximation. Differently, our method is motivated by eliminating sharp minima, which leads to some key differences: (1) a de-noising process is applied before parameter updating; (2) noise strength is adapted to filter norm; (3) an alternative interpretation of the advantage of noise injection, from the perspective of sharpness and generalization; (4) usage of uniform noise instead of Gaussian noise. Our experiments will show that uniform noise is superior to Gaussian noise. Moreover, our SmoothOut framework is agnostic to noise types: any noise type can fit into the framework and may achieve the goal. We adopt uniform noise as the major study because it is well motivated, as will be shown in Section III. Dropout [30] is a popular method to avoid over-fitting and include uncertainty [31] by randomly dropping neurons; however, large-batch training with Dropout still has the generalization gap, as shown experimentally. The reason is: as Keskar et al. [2] observed, a sharp region only expands in a small-dimensional subspace and most directions are flat; however, Dropout only perturbs a subspace such that the sharp directions cannot be frequently perturbed; conversely, our method effectively perturbs the whole space including sharp directions. We will explain the connections between Dropout and our method.

Large-batch SGD.
Large-batch SGD is loosely related work because SmoothOut is a general SGD approach. However, as sharp minima in large-batch SGD become severer [2] and accuracy loss is generally observed, an active line of research focuses on overcoming the generalization gap (accuracy loss). Hoffer et al. [4] suggest training more epochs; however, training more epochs consumes more time. Some heuristic techniques were proposed to close the gap without prolonging epochs. Those techniques include linear learning rate scaling [3], warm-up training [3][8], Layer-wise Adaptive Rate Scaling [7] and others [32]. However, without theoretical support, it is unclear to what extent those methods can generalize. For example, linear learning rate scaling and warm-up training cannot generalize to CIFAR-10 [4] and other architectures on ImageNet [7]. Compared with those techniques, our SmoothOut is an interpretable solution supported by the sharp minima hypothesis. More importantly, our experiments show that SmoothOut can further improve the accuracy when combined with those state-of-the-art techniques.

III. SmoothOut: PRINCIPLES, THEORY AND IMPLEMENTATION

We first introduce our SmoothOut method and its principles in Section III-A. To reduce computation complexity, Stochastic SmoothOut is proposed in Section III-B; we prove that Stochastic SmoothOut is an unbiased approximation of deterministic SmoothOut. Section III-C implements Stochastic SmoothOut in the back-propagation of DNNs. At last, an adaptive variant, AdaSmoothOut, is introduced.

A. Principles: Averaging Perturbed Models Smooths Out Sharp Minima

As [2] studied, sharp minima have large generalization gaps, because a small distortion/shift of the testing function from the training function can significantly increase the testing loss even though the current parameter is a minimum of the training function.¹ Our optimization goal is to encourage convergence to flat minima for more robust models. Our solution is derived from the sensitivity nature of sharp minima. We intentionally inject noise into the model to smooth out sharp minima. The concept is illustrated in Figure 1(a)(b). We define w as a point in the parameter space, C(w) as the training loss function and Ĉ(w; Θ) as a perturbation of C(w). Ĉ(w; Θ) is parameterized by both w and Θ, where Θ is a random vector to generate the perturbation. Instead of minimizing C(w), we propose to minimize

    C̄(w) = E{ Ĉ(w; Θ) } ≈ (1/N) Σ_{i=1}^{N} Ĉ(w; θ_i)    (1)

¹ Figure 1 in their paper [2] illustrates this conception.

Figure 1: Illustration and framework of SmoothOut. (a) The training loss function of the basis model w.r.t. parameter w, with a flat and a sharp minimum. (b) Each thin curve represents a perturbed model; there are 24 perturbations in total, but only four are plotted for cleaner visualization; the perturbation is done by slightly shifting the basis model in (a); the shift distance is randomly drawn from a uniform distribution; the green curve is the new model obtained by averaging over all perturbations, in which the flat minimum is maintained and the sharp one is smoothed out. (c) The first version of the proposed SmoothOut framework in Eq. (1), with N duplicated models perturbed by θ_1 ... θ_N and averaged. (d) The Stochastic SmoothOut, which randomly perturbs parameter w at each batch of data x_t.

to find the optimal w for C̄(w), where θ_i is a sample of Θ and N is the number of samples. For simplicity, we assume C(w) has one flat minimum w_f and one sharp minimum w_s, but the discussion can be generalized to C(w) with multiple flat and sharp minima. Our goal is to design an auxiliary function C̄(w) such that its minimum within the original flat region can approximate w_f, by satisfying the Flat Constraint

    || arg min_{w ∈ D(w_f, τ)} C̄(w) − w_f || ≤ ϕ,    (2)

and meanwhile the sharp minimum w_s is smoothed out, by satisfying the Sharp Constraint

    min_{D(w_s, ε)} C̄(w) ≥ max_{D(w_s, ε)} C(w) > min_{D(w_f, τ)} C̄(w),    (3)

where D(w, ς) represents a region around w, constrained as

    D(w, ς) = { w' ∈ R^m : |(w' − w)_i| ≤ ς, ∀ i ∈ {1...m} }.    (4)

When ϕ is small and τ is large, Inequality (2) ensures that the auxiliary function C̄(w) maintains the minimality of C(w) in the flat region; in the extreme case of ϕ = 0 and τ → ∞, the minimum of C̄(w) is exactly w_f. Conversely, near the original sharp region, Inequality (3) ensures that the minimality of C̄(w) is eliminated when ε is relatively large, because max_{D(w_s, ε)} C(w), the lower bound of C̄(w), increases rapidly as ε increases slightly around the sharp minimum; in the extreme case of ε → ∞, the lower bound is the maximum of C(w). In a nutshell, a good design of C̄(w) allows a small ϕ, a large τ and a large ε. In this way, the minimization of C̄(w) will skip w_s and converge to w_f. It is infeasible to find an optimal C̄(w) which minimizes ϕ and maximizes τ and ε, especially when C(w) is a deep neural network. However, we find that, under the Uniform Perturbation

    Ĉ(w; Θ) = C(w + Θ), where Θ_i ~ i.i.d. U(−a, a) and i ∈ {1, 2, ..., m},    (5)

C̄(w) can well serve the purpose. U(−a, a) is the uniform distribution within the range [−a, a]. In this case, we have abused Ĉ(w; a) as Ĉ(w) in the notation for simplicity. In Appendix A, we prove that, under Uniform Perturbation, appropriate ϕ, τ and ε can be found to satisfy the Flat Constraint and the Sharp Constraint:

Theorem 1. When C(w) is symmetric in D(w_f, τ) with τ > a > 0, the minimum of ϕ is 0 to satisfy the Flat Constraint when C̄(w) is generated under the Uniform Perturbation.

Theorem 2. Suppose C(w) is high dimensional (w ∈ R^m, m → ∞) and is symmetric and strictly monotonic in D(w_s, b) with b > a > 0; then there exists an ε < a such that the Sharp Constraint is satisfied when C̄(w) is generated under the Uniform Perturbation.

In the theorems, the symmetry is assumed only near minima, and the loss surface does not have to be symmetric in the whole space. By referring to the visualization of loss landscapes of neural nets in [33], it is reasonable to make this assumption near minima.
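To make the effect of Eq. (5) concrete, the following small NumPy sketch (an illustration added here, not part of the paper) builds a toy one-dimensional loss with one wide minimum and one narrow minimum, and compares C(w) with a Monte-Carlo estimate of C̄(w) under uniform perturbation; the toy function and all constants are arbitrary assumptions chosen only to mimic a flat and a sharp basin.

import numpy as np

rng = np.random.default_rng(0)

def C(w):
    # Toy 1-D loss: a wide (flat) basin near w = -2 and a narrow (sharp) dip near w = +2.
    flat = 1.0 - np.exp(-0.1 * (w + 2.0) ** 2)      # shallow, wide minimum
    sharp = -0.8 * np.exp(-50.0 * (w - 2.0) ** 2)   # deep, narrow minimum
    return flat + sharp

def C_bar(w, a=0.5, n=10000):
    # Monte-Carlo estimate of the smoothed loss E[C(w + theta)], theta ~ U(-a, a), Eqs. (1) and (5).
    theta = rng.uniform(-a, a, size=n)
    return C(w + theta).mean()

w_flat, w_sharp = -2.0, 2.0
print("flat  minimum: C =", round(C(w_flat), 3), " C_bar =", round(C_bar(w_flat), 3))
print("sharp minimum: C =", round(C(w_sharp), 3), " C_bar =", round(C_bar(w_sharp), 3))

Running it shows the smoothed loss staying close to the original value at the wide basin while rising substantially at the narrow one, which is exactly the behavior that Theorems 1 and 2 formalize.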
Besides the rigorous proof in Appendix A, SmoothOut can be explained from the perspective of signal processing: imagining the parameter space as the time domain and the function as a signal, the averaging is a low-pass filter which eliminates high-frequency signals (sharp regions) while maintaining low-frequency signals (flat regions). Figure 1(c) illustrates the framework of the proposed SmoothOut in SGD. All models share the same parameter w. Before training starts, the i-th model is independently perturbed by θ_i; during training, all θ_i are fixed and an identical batch of data is sent to all models for training. Because a large N is required for the approximation in Eq. (1), the computation complexity and memory usage will be very high, especially when C(w) is a deep neural network. In the next section, Stochastic SmoothOut is proposed to solve this issue.

B. Theory: Stochastic SmoothOut is Unbiased

To reduce the computation complexity and memory usage of SmoothOut in Figure 1(c), Stochastic SmoothOut is proposed in this section, as shown in Figure 1(d).

Figure 2: Notation: "SB": Small Batch (256); "LB": Large Batch (5000); "accu.": accuracy. (a) Loss and accuracy vs. α, which controls w along the direction from the SB minimum (w_f) to the LB minimum (w_s); (b) loss and accuracy under the influence of different strengths of noise. Dataset: CIFAR-10. Network: C_1 in [2] implemented by [4]. Optimizer: Adam with 0.001 initial learning rate.

Instead of using multiple perturbed models to learn from identical data, only one model is trained. At the t-th batch of training data x_t, the parameter w_t is first perturbed to w_t + θ_t and then x_t is fed into the model to calculate the loss function. We can prove that, in both frameworks, the outputs approximate C̄(w) without bias. Formally, in Figure 1(c), the expectation of the output is

    E_{θ_1...θ_N} { (1/N) Σ_{i=1}^{N} C(w + θ_i) } = E_Θ { C(w + Θ) } = C̄(w).    (6)

In online learning systems [34] like Figure 1(d), the data x_t is independently generated from a random distribution and its online loss is obtained by the model Q(x_t, w); the final loss function to minimize is the expectation of the online loss under the data distribution, i.e.,

    C(w) ≜ E_X { Q(x, w) }.    (7)

Therefore, in Figure 1(d), the expectation of the output is

    E { Q(x_t, w + θ_t) } = E_Θ { E_X { Q(x, w + Θ) } } = E_Θ { C(w + Θ) } = C̄(w).    (8)

Consequently, both frameworks in Figure 1 approximate C̄(w) = E{ Ĉ(w; Θ) } in Eq. (1), but Stochastic SmoothOut is much more computation efficient. The only overhead of Stochastic SmoothOut is noise injection and de-noising, as will be shown. In the following sections, without explicit clarification, SmoothOut will refer to the stochastic version in Figure 1(d).

The reason why SmoothOut can eliminate sharp minima is that C(w_s) is more sensitive to noise than C(w_f), and we expect C̄_a(w_s) to increase faster than C̄_a(w_f) as the noise strength a increases from 0. To verify this, we first train a DNN under a small batch size to get a flat minimum w_f; second, w = w_f is deployed into the framework in Figure 1(d); third, the whole training/validation dataset is fed to the framework in batches, and at each batch the parameter is perturbed to w = w_f + θ_t; finally, the losses are averaged over all batches to estimate C̄_a(w_f). The same process is done using a large batch size for the same DNN to estimate C̄_a(w_s). We scan a in a range to test the sensitivity of C̄_a(w_f) and C̄_a(w_s) to perturbation. Figure 2(a) visualizes the sharpness of C(w) around w_f and w_s, using the technique adopted in [2] which was originally proposed in [35]. In Figure 2(a), each point on the loss curve is (w', C(w')) where w' = α·w_s + (1−α)·w_f. The visualization is consistent with [2], which concluded that large-batch training converges to sharp minima. Figure 2(b) analyzes the sensitivity. For both the training and validation datasets, C̄_a(w_s) indeed increases faster than C̄_a(w_f) as a increases. The accuracy curves have a similar trend. Sensitivity analyses of more DNNs and more datasets are included in Appendix B. Therefore, a side outcome of this work is that we can use

    s_a = C̄_a(w_0) − C(w_0)    (9)

as a metric to measure the sharpness of C(w) at a minimum w_0. A larger s_a means a sharper minimum.
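As a sketch of how this sensitivity measurement and the sharpness metric in Eq. (9) could be computed, the following PyTorch-style snippet perturbs the parameters with U(−a, a) noise at each batch, averages the loss, restores the parameters, and subtracts the unperturbed loss; model, loss_fn and data_loader are assumed placeholders, and this is our reading of the procedure rather than the authors' released code.

import torch

@torch.no_grad()
def avg_loss(model, loss_fn, data_loader, a=0.0):
    # Estimate C_bar_a(w): average loss with parameters perturbed by U(-a, a) per batch.
    # With a = 0 this reduces to the plain loss C(w).
    total, count = 0.0, 0
    for x, y in data_loader:
        noises = []
        if a > 0:
            for p in model.parameters():                  # perturb: w <- w + theta
                theta = torch.empty_like(p).uniform_(-a, a)
                p.add_(theta)
                noises.append(theta)
        total += loss_fn(model(x), y).item() * x.size(0)
        count += x.size(0)
        for p, theta in zip(model.parameters(), noises):  # de-noise: restore the original w
            p.sub_(theta)
    return total / count

def sharpness(model, loss_fn, data_loader, a=0.375):
    # Eq. (9): s_a = C_bar_a(w_0) - C(w_0); larger values indicate a sharper minimum.
    return avg_loss(model, loss_fn, data_loader, a) - avg_loss(model, loss_fn, data_loader, 0.0)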
At last, under our framework, we can view Dropout as noise under a Bernoulli distribution that adapts its strength to the corresponding weight. Concretely, in Figure 1(d), θ_ti = −w_i with probability p and θ_ti = 0 with probability 1−p, where p is the dropout ratio. Under this view, Dropout fits into our framework, but it cannot guide the convergence away from sharp minima, because the strength of the noise θ_ti = −w_i is too large.

C. Implementation: Back-propagation with Perturbation and Denoising

In Figure 1(d), the gradient to update the parameter at iteration t is

    g_t = ∂Q(x_t, w + θ_t)/∂w |_{w=w_t} = ∇_w Q(x_t, w)|_{w=w_t+θ_t} · ∂(w + θ_t)/∂w.    (10)

Therefore, the parameter is updated as

    w_{t+1} = w_t − η_t ∇_w Q(x_t, w)|_{w=w_t+θ_t},    (11)

where η_t is the learning rate and the gradient is obtained by back-propagation when the parameter value is w_t + θ_t. Thus, SmoothOut can be implemented as Algorithm 1, as illustrated in Figure 3. This reveals a pitfall in implementation: the noise θ_t added to w_t must be removed (de-noised) before applying the gradient, which is also a key difference from existing noise injection approaches [21][22][23][24].

Algorithm 1: SmoothOut in Back Propagation
Input: Training dataset X, total iterations T, model Q(x, w) with initial parameter w_0
1: for t ∈ {0, ..., T−1} do
2:   Randomly sample a batch of data x_t from X i.i.d.
3:   Perturbation: ŵ_t = w_t + θ_t, where θ_ti ~ U(−a, a)
4:   Back-propagation: g_t = ∂Q(x_t, ŵ_t)/∂ŵ_t
5:   Denoising: w_t = ŵ_t − θ_t
6:   Updating: w_{t+1} = w_t − η_t g_t
7: end for
Output: The model Q(x, w) with final parameter w = w_T

Figure 3: SmoothOut in BP, with steps (1) Perturbation, (2) Back-propagation, (3) Denoising and (4) Updating within iteration t.

As shown in Figure 3, the only overhead of SmoothOut is adding and subtracting noise, which is much more efficient than training multiple DNNs as in Figure 1(c). Note that, although Algorithm 1 is proposed in the context of vanilla SGD, it can be extended to SGD variants by simply using the gradient g_t for momentum accumulation, learning rate adaptation, and so on.
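A minimal PyTorch sketch of one iteration of Algorithm 1, written here for illustration under the assumption of standard model/loss_fn/optimizer objects (it is not the authors' implementation): the noise is added before the forward/backward pass and removed before the optimizer applies the gradient.

import torch

def smoothout_step(model, loss_fn, optimizer, x_t, y_t, a=0.375):
    # One SmoothOut iteration: perturb, back-propagate, de-noise, then update (Algorithm 1).
    # Step 3 -- Perturbation: w_t <- w_t + theta_t with theta ~ U(-a, a).
    noises = []
    with torch.no_grad():
        for p in model.parameters():
            theta = torch.empty_like(p).uniform_(-a, a)
            p.add_(theta)
            noises.append(theta)

    # Step 4 -- Back-propagation at the perturbed parameters.
    optimizer.zero_grad()
    loss = loss_fn(model(x_t), y_t)
    loss.backward()

    # Step 5 -- De-noising: restore w_t before the gradient is applied.
    with torch.no_grad():
        for p, theta in zip(model.parameters(), noises):
            p.sub_(theta)

    # Step 6 -- Updating: any optimizer (vanilla SGD, momentum, ...) can consume g_t.
    optimizer.step()
    return loss.item()

Because the noise is subtracted before optimizer.step(), the same wrapper works unchanged with momentum SGD or other optimizers, matching the note above about extending Algorithm 1 beyond vanilla SGD.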

D. Adaptive SmoothOut (AdaSmoothOut)

Due to the fact that the weight distributions across layers vary a lot, adding noise with a constant strength to all weights may over-perturb the layers with small weights while under-perturbing others. The varying distribution is also the source of a problem in visualizing the sharpness, as pointed out in [33]. To overcome this, [33] proposed filter normalization and achieved more accurate visualization. Inspired by filter normalization, in SmoothOut, the noise added to a filter is linearly scaled by the l2 norm of the filter. In fully-connected layers, the noise is scaled per neuron, i.e., all input connections of each neuron form a vector and the noise is divided by the l2 norm of that vector. We call this Adaptive SmoothOut (AdaSmoothOut) because it adapts the strength of the noise to the filters instead of fixing the strength. Mathematically, suppose w^(i) is the vector of parameters in filter i and θ^(i) is the noise vector; then the adapted noise θ̂^(i) will be

    θ̂^(i) = a · ( ||w^(i)||_2 / ||θ^(i)||_2 ) · θ^(i),    (12)

where a controls the strength of the noise. Adaptive noise is another key difference from noise injection in previous work. Our ablation study will show that adaptive noise is more effective in improving generalization.
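A hedged sketch of the adaptive noise in Eq. (12), treating dimension 0 of a convolutional or fully-connected weight tensor as the filter (or neuron) index; the small epsilon guarding against division by zero is our own addition, not from the paper.

import torch

def adaptive_noise(weight, a=0.375):
    # Eq. (12): draw uniform noise and rescale it per filter so that its l2 norm
    # becomes a * ||w^(i)||_2 (one filter = one output channel / one neuron's input weights).
    theta = torch.empty_like(weight).uniform_(-1.0, 1.0)
    w_flat = weight.reshape(weight.shape[0], -1)           # filters as rows
    t_flat = theta.reshape(theta.shape[0], -1)
    scale = a * w_flat.norm(dim=1) / (t_flat.norm(dim=1) + 1e-12)
    return (t_flat * scale.unsqueeze(1)).reshape_as(weight)

In the training-step sketch above, this function would replace the plain uniform draw for convolutional and fully-connected weights, while other parameters could keep the fixed-strength noise.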
IV. EXPERIMENTS

We evaluate SmoothOut on the MNIST [36], CIFAR-10 [37], CIFAR-100 [37] and ImageNet [19] datasets. SmoothOut and AdaSmoothOut are evaluated in small-batch ("SB") SGD and large-batch ("LB") SGD. (C_ε, A)-sharpness [2] is utilized to measure the sharpness of a minimum, and it is solved using the L-BFGS-B algorithm [38]. In solving (C_ε, A)-sharpness, the full space (i.e., A = I_n) in the bounding box C_ε (with ε = 5·10^-4) is explored to find the maximum for measurement. As L-BFGS-B is an estimation algorithm and may fail to find the exact maximum value, variance in measurements is observed. We run 5 experiments for each measurement, and use the maximum as the final sharpness metric. Unlike [2], which averaged over 5 runs, ours is more reasonable because (C_ε, A)-sharpness is based on measuring the maximum value around the box. In training with SmoothOut, a is the only additional hyper-parameter to tune, and it controls the strength of the noise. a is very robust because of the width of flat minima. More concretely, a is 0.375 in all experiments of SmoothOut in Table I and Table II. We believe the value of a is network-architecture dependent (i.e., loss-function dependent). We cross-validate it in small-batch SGD and directly use it in large-batch SGD without further tuning, and it generalizes well and improves accuracy in both small-batch SGD and large-batch SGD.

A. Convergence to Flatter Minima

We first adopt the benchmarks of [2] to verify that SmoothOut can effectively guide both SB and LB SGD to flatter minima and thus improve the generalization (accuracy). The comparison is in Table I. Figure 4 visualizes and compares the sharpness of the baseline (C_3) and SmoothOut. Similar visualization results for F_1 and C_1 can be found in Appendix B. Note that Keskar et al. [2] did not target achieving state-of-the-art accuracy but studying the characteristics of minima, and we simply follow this purpose. Comparison on state-of-the-art models will be covered in Section IV-B. In Table I and Figure 4, we observe consistency among sharpness, visualization and generalization; that is, a smaller (C_ε, A)-sharpness corresponds to a flatter region in the visualization and a higher accuracy. More importantly, the results indicate that (1) compared with SB training, LB training converges to sharper minima with worse generalization, but SmoothOut can guide it to converge to flatter minima and closes the gap or even improves the accuracy; (2) the sharp minima problem also exists in SB training, as shown in Figure 4(a), but SmoothOut can reduce the sharpness and improve the accuracy; (3) the sharp minima problem is severer in LB training, such that SmoothOut can improve more.

Table I: Sharpness reduction and generalization improvement for the DNNs in [2] (F_1, C_1 and C_3 on MNIST, CIFAR-10 and CIFAR-100, at small and large batch sizes; for each setup, the table reports the (C_ε, A)-sharpness and accuracy of the baseline and of SmoothOut, together with the sharpness change and accuracy improvement).

Figure 4: Sharpness of the baseline and SmoothOut in (a) SB training and (b) LB training of C_3.

Figure 5: Sharpness visualization by filter normalization [33] using (a) the training dataset and (b) the validation dataset. The DNN is ResNet-44 trained on CIFAR-10 using the baseline [4] and AdaSmoothOut.

At last, we argue that the convergence of our method is stable although noise is injected; that is, different runs converge to similar accuracy under the same strength of injected noise. More specifically, for C_1 in Table I, the accuracy standard deviation is ±0.33%, ±0.2%, ±0.24% and ±0.3% in the small-batch baseline, small-batch SmoothOut, large-batch baseline and large-batch SmoothOut, respectively.

B. Improving Generalization on Top of State-of-the-art Solutions

In this section, we evaluate our method on state-of-the-art DNNs, including ResNet-44 on CIFAR-10 and CIFAR-100, and AlexNet [39] and ResNet-18 [40] on ImageNet. Table II summarizes all the results. As the generalization issue is severer in LB training, we focus on LB training in this section. There are many proposed techniques to relieve the generalization issue in LB training [4][3][32][8][7][6]; however, our method is orthogonal, and we simply apply SmoothOut on top of them to verify whether SmoothOut can be combined with those state-of-the-art solutions. We are not able to duplicate all of those techniques, but we select Learning Rate Scaling (LRS), Ghost Batch Normalization (GBN) and Training Longer (TL) [4][3] as the representatives. For LRS, [3] used linear LRS (i.e., the learning rate is scaled linearly w.r.t. the batch size), while [4] used square root LRS. The preferable LRS rule depends on the dataset and DNN [4]. In our experiments, linear LRS² is preferable for ResNet-44 on CIFAR-100 and square root LRS is preferable for the others. For TL, we simply double the training epochs for each learning rate. In the experiments of ResNet-44 on CIFAR-10 and CIFAR-100, we applied GBN, TL, and linear or square root LRS in the baselines, so that we can diversify the setups used to evaluate our method. In all setups, SmoothOut improves generalization on top of GBN, TL and LRS, verifying that our method is orthogonal to the state-of-the-art solutions. More importantly, the AdaSmoothOut variant has the best generalization in all experiments on CIFAR-10 and CIFAR-100, showing the necessity of adaptive noise.

² Warm-up pre-training is not adopted in our experiments for a neat evaluation.

Table II: SmoothOut and AdaSmoothOut improve the state-of-the-art baselines (ResNet-44 on CIFAR-10 and CIFAR-100, AlexNet and ResNet-18 on ImageNet; columns: DNN, dataset, batch size, epochs, LRS, method (Baseline / SmoothOut / AdaSmoothOut) and accuracy).

Table III: SmoothOut without de-noising, tested on CIFAR-10 (C_1 and ResNet-44 at small and large batch sizes); noise strength a is 0.375 in all experiments.

Table IV: Accuracy with and without de-noising, tested with ResNet-44 on CIFAR-10 with the same setting as in Table II.

Therefore, we choose AdaSmoothOut as the representative on ImageNet for faster development. As AdaSmoothOut is one type of regularization by stochastic model averaging, the regularizations by weight decay and dropout are not adopted in training on ImageNet, so that we can reduce the number of hyper-parameters. The top-1 accuracy of AlexNet in SB training is 56.5% with a batch size of 256; however, in LB training with a batch size of 16384, the accuracy drops to 47.64% if trained for the same number of epochs. TL indeed can improve the generalization to 54.24%. More importantly, AdaSmoothOut improves the accuracy in both cases, i.e., improving 4.89% when TL is not applied and improving 1.27% on top of TL. Last but not least, our method also achieves an improvement on ResNet-18 on ImageNet. At the end, we visualize the sharpness of minima by filter normalization visualization [33]; AdaSmoothOut indeed converges to a flatter region, as shown in Figure 5.

C. Ablation Study

1) The necessity of de-noising: One of our contributions is the de-noising process. We perform an ablation study by removing the de-noising process to test its necessity. We re-run all CIFAR-10 SmoothOut experiments in Table I and Table II, but without de-noising. We use the same noise strength for comparison. The results are summarized in Table III. Without de-noising, the accuracy significantly drops. The reason is straightforward: strong noise makes the original parameters and gradients less accurate and deteriorates convergence, but our gradients are exactly the gradients of the auxiliary function C̄(w) and the perturbed parameters are recovered before applying the gradients. For a fair comparison, we further carefully tune the noise strength in SmoothOut and AdaSmoothOut w/o de-noising to get near-optimal accuracy. More specifically, we have to decrease a, as SGD is more sensitive to noise when de-noising is not applied. The results are summarized in Table IV. Without de-noising, accuracy is lower. Note that the de-noising process is naturally generated by our framework and theory in Section III; without de-noising, the method does not fit into our framework and the optimization target is no longer the auxiliary function C̄(w).

2) Gaussian noise vs. uniform noise: Another contribution is the generic SmoothOut and AdaSmoothOut framework, which is agnostic to the type of noise. We mainly used uniform noise for study as it is well motivated, but any type of noise can fit into our framework. As Gaussian noise is broadly used in the literature [21][22][23][24][25][26][29], we perform an ablation study here by replacing uniform noise with Gaussian noise, for the purpose of verifying that our framework is noise-type agnostic and answering how performance changes when the noise type alters.

Table V: Comparison between uniform and Gaussian noise, tested on CIFAR-10 with the same setting as in Table II (Baseline, SmoothOut and AdaSmoothOut with uniform vs. Gaussian noise, plus uniform-noise-only and Gaussian-noise-only variants).

The results are in Table V, where, when injecting Gaussian noise, a is the standard deviation. Table V indicates that both uniform and Gaussian noise improve generalization in SmoothOut and AdaSmoothOut, verifying that they are agnostic to noise types; uniform noise is superior to Gaussian noise. An intuitive explanation is that the Gaussian distribution puts high probability on averaging over values near the minimum, and thus has a smaller probability of smoothing out sharp minima. In contrast, the uniform distribution treats values around the minimum evenly, and can eliminate the minimum when it is sharp. We do not aggressively conclude that uniform noise will always be superior in all settings, but leave noise selection as a building block when using our framework. A smaller generalization improvement is observed if noise is only injected into parameters without using our framework (as shown by the "noise only" experiments).

V. CONCLUSION

In this paper, we propose the SmoothOut and AdaSmoothOut framework to escape sharp minima during SGD training of Deep Neural Networks, for better generalization. SmoothOut and AdaSmoothOut build an auxiliary optimization function without sharp minima, utilizing noise injection. Although noise injection was broadly used in the literature, we interpret the advantage of noise injection from the new perspective of generalization and sharpness. Moreover, our framework advances in multiple ways: (1) de-noising is applied after noise injection; (2) noise strength is adaptive to filter norm in AdaSmoothOut; (3) uniform noise is mainly adopted for study and can be superior to Gaussian noise in some cases. A comprehensive ablation study is conducted to prove the necessity of those three advances. In the future, we will extend SmoothOut and AdaSmoothOut to Recurrent Neural Networks, attention-based models and very deep Convolutional Neural Networks.

APPENDIX A
PROOF OF THEOREM 1 AND THEOREM 2

Proof of Theorem 1:

Proof. As D(w, a) is defined as a box centering at w with size 2a, i.e.,

    D(w, a) = { w' ∈ R^m : |(w' − w)_i| ≤ a, ∀ i ∈ {1...m} },    (13)

then, under Uniform Perturbation,

    C̄(w) = E{ Ĉ(w; Θ) } = E{ C(w + Θ) } = (1/(2a)^m) ∫_{D(w,a)} C(w') dw'_1 ... dw'_m    (14)

and

    ∂C̄(w)/∂w_i = (1/(2a)^m) ∫_{D(w_{\i}, a)} ( C(w')|_{w'_i = w_i + a} − C(w')|_{w'_i = w_i − a} ) dw'_{\i},    (15)

where

    w_{\i} ≜ [w_1, ..., w_{i−1}, w_{i+1}, ..., w_m]^T ∈ R^{m−1}    (16)

and

    dw'_{\i} ≜ dw'_1 ... dw'_{i−1} dw'_{i+1} ... dw'_m.    (17)

When C(w) is symmetric about w_f in D(w_f, τ) such that, for every i, the cut along w_i = (w_f)_i + a and the cut along w_i = (w_f)_i − a give the same function in the subspace w_{\i}, then ∇C̄(w_f) = 0; that is, the Flat Constraint is satisfied with ϕ = 0. The optimal ϕ and τ are determined by the symmetry of the flat region. ϕ may be relaxed to a larger value when the symmetry is broken; however, within a flat region, a larger ϕ may only slightly increase C̄(w).

Proof of Theorem 2:

Proof. Suppose C^(s)_{ε_0} is the maximum value near the sharp minimum, i.e.,

    C^(s)_{ε_0} = max_{D(w_s, ε_0)} ( C(w) ).    (18)

As C(w) is strictly monotonic in D(w_s, b), we have, for all ε_0 < a < b,

    min_{D(w_s, a) \ D(w_s, ε_0)} ( C(w) ) > C^(s)_{ε_0},    (19)

where D(w_s, a) \ D(w_s, ε_0) is the set difference, denoting the domain within D(w_s, a) but outside of D(w_s, ε_0). Then, following the proof of Theorem 1, we have, for ε < b,

    min_{D(w_s, ε)} ( C̄(w) ) ≥ (1/(2a)^m) ( ((2a)^m − (2ε_0)^m) C^(s)_{ε_0} + (2ε_0)^m C(w_s) ) = (1 − (ε_0/a)^m) C^(s)_{ε_0} + (ε_0/a)^m C(w_s).    (20)

Because

    lim_{m→∞} ( (1 − (ε_0/a)^m) C^(s)_{ε_0} + (ε_0/a)^m C(w_s) ) = C^(s)_{ε_0},    (21)

in high dimensional models (like deep neural networks), we can find ε → ε_0 (left limit) to satisfy

    min_{D(w_s, ε)} ( C̄(w) ) > C^(s)_{ε_0} ≥ max_{D(w_s, ε)} ( C(w) ).    (22)

In the flat region,

    min_{D(w_f, τ)} ( C̄(w) ) = (1/(2a)^m) ∫_{D(w_f, a)} C(w') dw'_1 ... dw'_m ≤ (1/(2a)^m) (2a)^m max_{D(w_f, a)} ( C(w) ) = max_{D(w_f, a)} ( C(w) ) ≜ C^(f).    (23)

Assuming C(w_s) ≈ C(w_f), as a grows, C^(s)_{ε_0} increases fast in the sharp region while C^(f) increases slowly in the flat region; therefore, there exists an a such that

    C^(s)_{ε_0} > C^(f).    (24)

According to Inequalities (22), (23) and (24),

    min_{D(w_s, ε)} ( C̄(w) ) > max_{D(w_s, ε)} ( C(w) ) > max_{D(w_f, a)} ( C(w) ) ≥ min_{D(w_f, τ)} ( C̄(w) ),    (25)

which satisfies the Sharp Constraint.

APPENDIX B
SENSITIVITY ANALYSES AND SHARPNESS VISUALIZATION

We provide more sensitivity analyses in Figure 6 and Figure 7, tested on different DNNs and datasets. More sharpness comparisons between the baseline and SmoothOut are visualized in Figure 8 and Figure 9.

Figure 6: Notation: "SB": Small Batch (256); "LB": Large Batch (5000); "accu.": accuracy. (a) Loss and accuracy vs. α, which controls w along the direction from the SB minimum (w_f) to the LB minimum (w_s); (b) loss and accuracy under the influence of different strengths of noise. Dataset: CIFAR-100. Network: C_3. The optimizer is Adam with 0.001 initial learning rate.

Figure 7: Loss and accuracy of ResNet-44 under the influence of different strengths of noise on (a) CIFAR-10 and (b) CIFAR-100. The optimizer is SGD with momentum 0.9. Notation: "SB": Small Batch (128); "LB": Large Batch (2048 for CIFAR-10 and 1024 for CIFAR-100); "accu.": accuracy.

Figure 8: Sharpness of the baseline and SmoothOut in (a) SB training and (b) LB training of C_1.

Figure 9: Sharpness of the baseline and SmoothOut in (a) SB training and (b) LB training of F_1.

ACKNOWLEDGMENT

This work was supported in part by DOE and NSF grants. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DOE, NSF or their contractors.

REFERENCES

[1] G. Diamos, S. Sengupta, B. Catanzaro, M. Chrzanowski, A. Coates, E. Elsen, J. Engel, A. Hannun, and S. Satheesh, "Persistent RNNs: Stashing recurrent weights on-chip," in International Conference on Machine Learning, 2016.
[2] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, "On large-batch training for deep learning: Generalization gap and sharp minima," in International Conference on Learning Representations, 2017.
[3] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
[4] E. Hoffer, I. Hubara, and D. Soudry, "Train longer, generalize better: closing the generalization gap in large batch training of neural networks," in Advances in Neural Information Processing Systems, 2017.
[5] J.-P. Bouchaud and A. Georges, "Anomalous diffusion in disordered media: statistical mechanisms, models and physical applications," Physics Reports, vol. 195, no. 4-5, 1990.
[6] S. L. Smith, P.-J. Kindermans, and Q. V. Le, "Don't decay the learning rate, increase the batch size," in International Conference on Learning Representations, 2018.
[7] Y. You, I. Gitman, and B. Ginsburg, "Scaling SGD batch size to 32K for ImageNet training," arXiv preprint, 2017.

[8] T. Akiba, S. Suzuki, and K. Fukuda, "Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes," arXiv preprint arXiv:1711.04325, 2017.
[9] S. Hochreiter and J. Schmidhuber, "Flat minima," Neural Computation, vol. 9, no. 1, pp. 1-42, 1997.
[10] P. Chaudhari, A. Choromanska, S. Soatto, and Y. LeCun, "Entropy-SGD: Biasing gradient descent into wide valleys," in International Conference on Learning Representations, 2017.
[11] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, 1978.
[12] S. Jastrzębski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey, "Finding flatter minima with SGD," 2018.
[13] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," arXiv preprint arXiv:1611.03530, 2016.
[14] C. S. Wallace and D. M. Boulton, "An information measure for classification," The Computer Journal, vol. 11, no. 2, 1968.
[15] D. J. MacKay, "A practical Bayesian framework for backpropagation networks," Neural Computation, vol. 4, no. 3, 1992.
[16] S. L. Smith and Q. V. Le, "A Bayesian perspective on generalization and stochastic gradient descent," in Proceedings of the Second Workshop on Bayesian Deep Learning (NIPS 2017), 2017.
[17] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, "Sharp minima can generalize for deep nets," in International Conference on Machine Learning, 2017.
[18] M. Welling and Y. W. Teh, "Bayesian learning via stochastic gradient Langevin dynamics," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011.
[19] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[20] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, "Averaging weights leads to wider optima and better generalization," arXiv:1803.05407, 2018.
[21] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens, "Adding gradient noise improves learning for very deep networks," arXiv:1511.06807, 2015.
[22] H. Mobahi, "Training recurrent neural networks by diffusion," arXiv:1601.04114, 2016.
[23] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin et al., "Noisy networks for exploration," arXiv:1706.10295, 2017.
[24] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz, "Parameter space noise for exploration," arXiv:1706.01905, 2017.
[25] Y. Li and F. Liu, "Whiteout: Gaussian adaptive noise regularization in feedforward neural networks," arXiv:1612.01490, 2016.
[26] K. Ho, C.-S. Leung, and J. Sum, "On weight-noise-injection training," in International Conference on Neural Information Processing. Springer, 2008.
[27] R. M. Neal, Bayesian Learning for Neural Networks. Springer Science & Business Media, 2012, vol. 118.
[28] G. Hinton and D. van Camp, "Keeping neural networks simple by minimising the description length of the weights," in Proceedings of COLT-93, 1993.
[29] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv:1312.6114, 2013.
[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[31] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in International Conference on Machine Learning, 2016.
[32] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu et al., "Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes," arXiv:1807.11205, 2018.
[33] H. Li, Z. Xu, G. Taylor, and T. Goldstein, "Visualizing the loss landscape of neural nets," arXiv preprint arXiv:1712.09913, 2017.
[34] L. Bottou, "Online learning and stochastic approximations," On-line Learning in Neural Networks, vol. 17, no. 9, p. 142, 1998.
[35] I. J. Goodfellow, O. Vinyals, and A. M. Saxe, "Qualitatively characterizing neural network optimization problems," arXiv:1412.6544, 2014.
[36] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[37] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009.
[38] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, "A limited memory algorithm for bound constrained optimization," SIAM Journal on Scientific Computing, vol. 16, no. 5, pp. 1190-1208, 1995.
[39] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.


More information

P 3 (x) = f(0) + f (0)x + f (0) 2. x 2 + f (0) . In the problem set, you are asked to show, in general, the n th order term is a n = f (n) (0)

P 3 (x) = f(0) + f (0)x + f (0) 2. x 2 + f (0) . In the problem set, you are asked to show, in general, the n th order term is a n = f (n) (0) 1 Tylor polynomils In Section 3.5, we discussed how to pproximte function f(x) round point in terms of its first derivtive f (x) evluted t, tht is using the liner pproximtion f() + f ()(x ). We clled this

More information

Continuous Random Variables

Continuous Random Variables STAT/MATH 395 A - PROBABILITY II UW Winter Qurter 217 Néhémy Lim Continuous Rndom Vribles Nottion. The indictor function of set S is rel-vlued function defined by : { 1 if x S 1 S (x) if x S Suppose tht

More information

Math 270A: Numerical Linear Algebra

Math 270A: Numerical Linear Algebra Mth 70A: Numericl Liner Algebr Instructor: Michel Holst Fll Qurter 014 Homework Assignment #3 Due Give to TA t lest few dys before finl if you wnt feedbck. Exercise 3.1. (The Bsic Liner Method for Liner

More information

Student Activity 3: Single Factor ANOVA

Student Activity 3: Single Factor ANOVA MATH 40 Student Activity 3: Single Fctor ANOVA Some Bsic Concepts In designed experiment, two or more tretments, or combintions of tretments, is pplied to experimentl units The number of tretments, whether

More information

A New Grey-rough Set Model Based on Interval-Valued Grey Sets

A New Grey-rough Set Model Based on Interval-Valued Grey Sets Proceedings of the 009 IEEE Interntionl Conference on Systems Mn nd Cybernetics Sn ntonio TX US - October 009 New Grey-rough Set Model sed on Intervl-Vlued Grey Sets Wu Shunxing Deprtment of utomtion Ximen

More information

Predict Global Earth Temperature using Linier Regression

Predict Global Earth Temperature using Linier Regression Predict Globl Erth Temperture using Linier Regression Edwin Swndi Sijbt (23516012) Progrm Studi Mgister Informtik Sekolh Teknik Elektro dn Informtik ITB Jl. Gnesh 10 Bndung 40132, Indonesi 23516012@std.stei.itb.c.id

More information

Hybrid Group Acceptance Sampling Plan Based on Size Biased Lomax Model

Hybrid Group Acceptance Sampling Plan Based on Size Biased Lomax Model Mthemtics nd Sttistics 2(3): 137-141, 2014 DOI: 10.13189/ms.2014.020305 http://www.hrpub.org Hybrid Group Acceptnce Smpling Pln Bsed on Size Bised Lomx Model R. Subb Ro 1,*, A. Ng Durgmmb 2, R.R.L. Kntm

More information

The practical version

The practical version Roerto s Notes on Integrl Clculus Chpter 4: Definite integrls nd the FTC Section 7 The Fundmentl Theorem of Clculus: The prcticl version Wht you need to know lredy: The theoreticl version of the FTC. Wht

More information

Fig. 1. Open-Loop and Closed-Loop Systems with Plant Variations

Fig. 1. Open-Loop and Closed-Loop Systems with Plant Variations ME 3600 Control ystems Chrcteristics of Open-Loop nd Closed-Loop ystems Importnt Control ystem Chrcteristics o ensitivity of system response to prmetric vritions cn be reduced o rnsient nd stedy-stte responses

More information

Abstract inner product spaces

Abstract inner product spaces WEEK 4 Abstrct inner product spces Definition An inner product spce is vector spce V over the rel field R equipped with rule for multiplying vectors, such tht the product of two vectors is sclr, nd the

More information

Credibility Hypothesis Testing of Fuzzy Triangular Distributions

Credibility Hypothesis Testing of Fuzzy Triangular Distributions 666663 Journl of Uncertin Systems Vol.9, No., pp.6-74, 5 Online t: www.jus.org.uk Credibility Hypothesis Testing of Fuzzy Tringulr Distributions S. Smpth, B. Rmy Received April 3; Revised 4 April 4 Abstrct

More information

1B40 Practical Skills

1B40 Practical Skills B40 Prcticl Skills Comining uncertinties from severl quntities error propgtion We usully encounter situtions where the result of n experiment is given in terms of two (or more) quntities. We then need

More information

NUMERICAL INTEGRATION

NUMERICAL INTEGRATION NUMERICAL INTEGRATION How do we evlute I = f (x) dx By the fundmentl theorem of clculus, if F (x) is n ntiderivtive of f (x), then I = f (x) dx = F (x) b = F (b) F () However, in prctice most integrls

More information

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1 Exm, Mthemtics 471, Section ETY6 6:5 pm 7:4 pm, Mrch 1, 16, IH-115 Instructor: Attil Máté 1 17 copies 1. ) Stte the usul sufficient condition for the fixed-point itertion to converge when solving the eqution

More information

Bayesian Networks: Approximate Inference

Bayesian Networks: Approximate Inference pproches to inference yesin Networks: pproximte Inference xct inference Vrillimintion Join tree lgorithm pproximte inference Simplify the structure of the network to mkxct inferencfficient (vritionl methods,

More information

How do we solve these things, especially when they get complicated? How do we know when a system has a solution, and when is it unique?

How do we solve these things, especially when they get complicated? How do we know when a system has a solution, and when is it unique? XII. LINEAR ALGEBRA: SOLVING SYSTEMS OF EQUATIONS Tody we re going to tlk bout solving systems of liner equtions. These re problems tht give couple of equtions with couple of unknowns, like: 6 2 3 7 4

More information

Emission of K -, L - and M - Auger Electrons from Cu Atoms. Abstract

Emission of K -, L - and M - Auger Electrons from Cu Atoms. Abstract Emission of K -, L - nd M - uger Electrons from Cu toms Mohmed ssd bdel-rouf Physics Deprtment, Science College, UEU, l in 17551, United rb Emirtes ssd@ueu.c.e bstrct The emission of uger electrons from

More information

Lecture 19: Continuous Least Squares Approximation

Lecture 19: Continuous Least Squares Approximation Lecture 19: Continuous Lest Squres Approximtion 33 Continuous lest squres pproximtion We begn 31 with the problem of pproximting some f C[, b] with polynomil p P n t the discrete points x, x 1,, x m for

More information

Operations with Polynomials

Operations with Polynomials 38 Chpter P Prerequisites P.4 Opertions with Polynomils Wht you should lern: How to identify the leding coefficients nd degrees of polynomils How to dd nd subtrct polynomils How to multiply polynomils

More information

Estimation of Binomial Distribution in the Light of Future Data

Estimation of Binomial Distribution in the Light of Future Data British Journl of Mthemtics & Computer Science 102: 1-7, 2015, Article no.bjmcs.19191 ISSN: 2231-0851 SCIENCEDOMAIN interntionl www.sciencedomin.org Estimtion of Binomil Distribution in the Light of Future

More information

Research on Modeling and Compensating Method of Random Drift of MEMS Gyroscope

Research on Modeling and Compensating Method of Random Drift of MEMS Gyroscope 01 4th Interntionl Conference on Signl Processing Systems (ICSPS 01) IPCSIT vol. 58 (01) (01) IACSIT Press, Singpore DOI: 10.7763/IPCSIT.01.V58.9 Reserch on Modeling nd Compensting Method of Rndom Drift

More information

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams Chpter 4 Contrvrince, Covrince, nd Spcetime Digrms 4. The Components of Vector in Skewed Coordintes We hve seen in Chpter 3; figure 3.9, tht in order to show inertil motion tht is consistent with the Lorentz

More information

Numerical Integration

Numerical Integration Chpter 1 Numericl Integrtion Numericl differentition methods compute pproximtions to the derivtive of function from known vlues of the function. Numericl integrtion uses the sme informtion to compute numericl

More information

Review of Gaussian Quadrature method

Review of Gaussian Quadrature method Review of Gussin Qudrture method Nsser M. Asi Spring 006 compiled on Sundy Decemer 1, 017 t 09:1 PM 1 The prolem To find numericl vlue for the integrl of rel vlued function of rel vrile over specific rnge

More information

Online Short Term Load Forecasting by Fuzzy ARTMAP Neural Network

Online Short Term Load Forecasting by Fuzzy ARTMAP Neural Network Online Short Term Lod Forecsting by Fuzzy ARTMAP Neurl Network SHAHRAM JAVADI Electricl Engineering Deprtment AZAD University Tehrn Centrl Brnch Moshnir Power Electric Compny IRAN Abstrct: This pper presents

More information

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies Stte spce systems nlysis (continued) Stbility A. Definitions A system is sid to be Asymptoticlly Stble (AS) when it stisfies ut () = 0, t > 0 lim xt () 0. t A system is AS if nd only if the impulse response

More information

2008 Mathematical Methods (CAS) GA 3: Examination 2

2008 Mathematical Methods (CAS) GA 3: Examination 2 Mthemticl Methods (CAS) GA : Exmintion GENERAL COMMENTS There were 406 students who st the Mthemticl Methods (CAS) exmintion in. Mrks rnged from to 79 out of possible score of 80. Student responses showed

More information

Multi-Armed Bandits: Non-adaptive and Adaptive Sampling

Multi-Armed Bandits: Non-adaptive and Adaptive Sampling CSE 547/Stt 548: Mchine Lerning for Big Dt Lecture Multi-Armed Bndits: Non-dptive nd Adptive Smpling Instructor: Shm Kkde 1 The (stochstic) multi-rmed bndit problem The bsic prdigm is s follows: K Independent

More information

Minimum Energy State of Plasmas with an Internal Transport Barrier

Minimum Energy State of Plasmas with an Internal Transport Barrier Minimum Energy Stte of Plsms with n Internl Trnsport Brrier T. Tmno ), I. Ktnum ), Y. Skmoto ) ) Formerly, Plsm Reserch Center, University of Tsukub, Tsukub, Ibrki, Jpn ) Plsm Reserch Center, University

More information

Lecture 3 Gaussian Probability Distribution

Lecture 3 Gaussian Probability Distribution Introduction Lecture 3 Gussin Probbility Distribution Gussin probbility distribution is perhps the most used distribution in ll of science. lso clled bell shped curve or norml distribution Unlike the binomil

More information

Best Approximation. Chapter The General Case

Best Approximation. Chapter The General Case Chpter 4 Best Approximtion 4.1 The Generl Cse In the previous chpter, we hve seen how n interpolting polynomil cn be used s n pproximtion to given function. We now wnt to find the best pproximtion to given

More information

7.2 The Definite Integral

7.2 The Definite Integral 7.2 The Definite Integrl the definite integrl In the previous section, it ws found tht if function f is continuous nd nonnegtive, then the re under the grph of f on [, b] is given by F (b) F (), where

More information

and that at t = 0 the object is at position 5. Find the position of the object at t = 2.

and that at t = 0 the object is at position 5. Find the position of the object at t = 2. 7.2 The Fundmentl Theorem of Clculus 49 re mny, mny problems tht pper much different on the surfce but tht turn out to be the sme s these problems, in the sense tht when we try to pproimte solutions we

More information

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a).

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a). The Fundmentl Theorems of Clculus Mth 4, Section 0, Spring 009 We now know enough bout definite integrls to give precise formultions of the Fundmentl Theorems of Clculus. We will lso look t some bsic emples

More information

Week 10: Line Integrals

Week 10: Line Integrals Week 10: Line Integrls Introduction In this finl week we return to prmetrised curves nd consider integrtion long such curves. We lredy sw this in Week 2 when we integrted long curve to find its length.

More information

The Regulated and Riemann Integrals

The Regulated and Riemann Integrals Chpter 1 The Regulted nd Riemnn Integrls 1.1 Introduction We will consider severl different pproches to defining the definite integrl f(x) dx of function f(x). These definitions will ll ssign the sme vlue

More information

MAA 4212 Improper Integrals

MAA 4212 Improper Integrals Notes by Dvid Groisser, Copyright c 1995; revised 2002, 2009, 2014 MAA 4212 Improper Integrls The Riemnn integrl, while perfectly well-defined, is too restrictive for mny purposes; there re functions which

More information

Markscheme May 2016 Mathematics Standard level Paper 1

Markscheme May 2016 Mathematics Standard level Paper 1 M6/5/MATME/SP/ENG/TZ/XX/M Mrkscheme My 06 Mthemtics Stndrd level Pper 7 pges M6/5/MATME/SP/ENG/TZ/XX/M This mrkscheme is the property of the Interntionl Bcclurete nd must not be reproduced or distributed

More information

Probabilistic Investigation of Sensitivities of Advanced Test- Analysis Model Correlation Methods

Probabilistic Investigation of Sensitivities of Advanced Test- Analysis Model Correlation Methods Probbilistic Investigtion of Sensitivities of Advnced Test- Anlysis Model Correltion Methods Liz Bergmn, Mtthew S. Allen, nd Dniel C. Kmmer Dept. of Engineering Physics University of Wisconsin-Mdison Rndll

More information

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007 A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H Thoms Shores Deprtment of Mthemtics University of Nebrsk Spring 2007 Contents Rtes of Chnge nd Derivtives 1 Dierentils 4 Are nd Integrls 5 Multivrite Clculus

More information

Learning A Task-Specific Deep Architecture For Clustering

Learning A Task-Specific Deep Architecture For Clustering Lerning A Tsk-Specific Deep Architecture For Clustering Zhngyng Wng Shiyu Chng Jiyu Zhou Meng Wng Thoms S. Hung Abstrct While sprse coding-bsed clustering methods hve shown to be successful, their bottlenecks

More information

Electron Correlation Methods

Electron Correlation Methods Electron Correltion Methods HF method: electron-electron interction is replced by n verge interction E HF c E 0 E HF E 0 exct ground stte energy E HF HF energy for given bsis set HF Ec 0 - represents mesure

More information

Chapter 0. What is the Lebesgue integral about?

Chapter 0. What is the Lebesgue integral about? Chpter 0. Wht is the Lebesgue integrl bout? The pln is to hve tutoril sheet ech week, most often on Fridy, (to be done during the clss) where you will try to get used to the ides introduced in the previous

More information

13: Diffusion in 2 Energy Groups

13: Diffusion in 2 Energy Groups 3: Diffusion in Energy Groups B. Rouben McMster University Course EP 4D3/6D3 Nucler Rector Anlysis (Rector Physics) 5 Sept.-Dec. 5 September Contents We study the diffusion eqution in two energy groups

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 203 Outline Riemnn Sums Riemnn Integrls Properties Abstrct

More information

CBE 291b - Computation And Optimization For Engineers

CBE 291b - Computation And Optimization For Engineers The University of Western Ontrio Fculty of Engineering Science Deprtment of Chemicl nd Biochemicl Engineering CBE 9b - Computtion And Optimiztion For Engineers Mtlb Project Introduction Prof. A. Jutn Jn

More information

3.4 Numerical integration

3.4 Numerical integration 3.4. Numericl integrtion 63 3.4 Numericl integrtion In mny economic pplictions it is necessry to compute the definite integrl of relvlued function f with respect to "weight" function w over n intervl [,

More information