CO-511: Learning Theory — Spring 2017
Lecturer: Roi Livni
Lecture 16: The Backpropagation Algorithm

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

So far we have discussed convex learning problems. Convex learning problems are of particular interest mainly because they come with strong theoretical guarantees; for example, we can apply the SGD algorithm to obtain desirable learning rates. As it turns out, even though non-convex problems pose formidable challenges in theory, they often capture many interesting problems in practice. In this lecture we will discuss the task of training neural networks using the Stochastic Gradient Descent algorithm. Even though we cannot guarantee that this algorithm converges to an optimum, state-of-the-art results are often obtained by it, and it has become a benchmark algorithm for ML.

16.1 Neural Networks with smooth activation functions

Recall that given a graph $(V,E)$ and an activation function $\sigma$ we defined $\mathcal{N}_{(V,E),\sigma}$ to be the class of all neural networks implementable by the architecture $(V,E)$ with activation function $\sigma$ (see Lectures 5 and 6). Given a fixed architecture, a target function $f_{\omega,b} \in \mathcal{N}_{(V,E),\sigma}$ is parametrized by a set of weights $\omega : E \to \mathbb{R}$ and biases $b : V \to \mathbb{R}$. The empirical $0$-$1$ loss is given by

$$L^{(0,1)}(\omega,b) = \sum_{i=1}^{m} \ell_{0,1}\big(f^{(0,1)}_{\omega,b}(x^{(i)}),\, y_i\big),$$

where we add the superscript $(0,1)$ to note that we are considering a target function in the class $\mathcal{N}_{(V,E),\sigma_{\mathrm{sign}}}$. Of course, the aforementioned problem is non-differentiable (in fact, not continuous), so we cannot apply a GD-like method. We will therefore make two alterations to the architectures considered so far. First, instead of the $\sigma = \sigma_{\mathrm{sign}}$ that we considered so far, we will consider a different activation function, namely

$$\sigma(a) = \frac{1}{1 + e^{-a}}.$$

This means that each neuron now returns as output $v_i^{(t)} = \sigma(\langle \omega_i^{(t)}, v^{(t-1)}(x)\rangle + b_i^{(t)})$, which is a smooth function in its parameters. In turn, the function $f_{\omega,b}$ becomes smooth in its parameters (since it is a composition and addition of smooth functions).

Remark: Note that we care about smoothness in terms of $\omega$ and $b$!
While $f_{\omega,b}$ is a function of $x$, in training we consider the empirical loss as a function of the parameters, and we want to optimize over these. Of course, the target function now returns a real number rather than $0$ or $1$, so we also replace the $0$-$1$ loss with a surrogate convex loss function. For concreteness we let $\ell(a,y) = (a-y)^2$. We now obtain the differentiable empirical problem

$$\min_{\omega,b}\; L(\omega,b) = \sum_{i=1}^{m} \ell\big(f_{\omega,b}(x^{(i)}),\, y_i\big).$$
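As a concrete sketch of the pieces defined so far — the sigmoid neuron and the squared-loss empirical objective — consider the following Python fragment. All helper names are our own illustration, and the tiny "network" is just a single neuron:

```python
import math

def sigmoid(a):
    # sigma(a) = 1 / (1 + e^{-a}); smooth, unlike the sign activation
    return 1.0 / (1.0 + math.exp(-a))

def neuron_output(w, v_prev, b):
    # v_i^{(t)} = sigma(<w_i^{(t)}, v^{(t-1)}(x)> + b_i^{(t)}), smooth in w and b
    return sigmoid(sum(wi * vi for wi, vi in zip(w, v_prev)) + b)

def empirical_loss(predict, samples):
    # L(omega, b) = sum_i l(f_{omega,b}(x^{(i)}), y_i) with l(a, y) = (a - y)^2
    return sum((predict(x) - y) ** 2 for x, y in samples)

# A one-neuron "network" evaluated on two labeled examples:
samples = [([0.5, 0.25], 0.0), ([1.0, 0.0], 1.0)]
predict = lambda x: neuron_output([1.0, -2.0], x, 0.0)
print(empirical_loss(predict, samples))
```

Since $\sigma$ is differentiable everywhere, the loss above is a smooth function of the weights and the bias, which is exactly what the remark in the previous section emphasizes.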
To see that these alterations do not cause any loss in expressive power or generalization, we prove the following claim.

Claim 16.1. Let $(V,E)$ be a fixed feed-forward graph. Then for every sample:

1. $\inf_{\omega,b} L(\omega,b) \le \inf_{\omega,b} L^{(0,1)}(\omega,b)$.

2. For every $(\omega^*, b^*)$: $\sum_{i=1}^{m} \ell_{0,1}\big(\mathrm{sign}(f_{\omega^*,b^*}(x^{(i)})),\, y_i\big) \le L(\omega^*, b^*)$.

The first claim shows that we can achieve a solution that is competitive with the loss of the optimal neural network with the $0$-$1$ activation function. The second statement tells us that the $0$-$1$ rounding of the optimizer of $L$ will also have small $0$-$1$ loss. In other words, by minimizing the differentiable problem, we achieve a solution with small empirical $0$-$1$ loss.

Proof. For the first claim, note that $\lim_{a \to \infty} \sigma(a) = 1$ and $\lim_{a \to -\infty} \sigma(a) = 0$, hence

$$\lim_{c \to \infty} f_{c\omega,\, cb} = f^{(0,1)}_{\omega,b},$$

hence

$$\lim_{c \to \infty} L(c\omega,\, cb) \le L^{(0,1)}(\omega,b),$$

and the first statement holds. As to the second statement, it follows from the fact that $\ell$ is a surrogate loss function. $\square$

Thus we have turned the non-smooth problem into a differentiable one. This means that we can now try to apply a gradient descent method, similar to the SGD we used for convex problems. There are two issues to overcome:

1. Though the loss function might be convex, the ERM problem as a whole, given its dependence on the parameters, is non-convex. We have only shown that SGD converges when the ERM problem is convex in the parameters.

2. To perform SGD we still need to compute the gradient $\nabla f_{\omega,b}$, where the dependence on the parameters may be highly involved.

The first problem turns out to be a real issue, and indeed there is no guarantee that SGD will converge to a global optimum when the problem is essentially non-convex. In fact, even convergence to a local minimum is not guaranteed, though one can show that SGD will converge to a critical point (more accurately, to a point where $\|\nabla f_{\omega,b}\| \le \epsilon$), under certain smoothness assumptions. The problem is generally addressed by restarting the algorithm from different initialization points, with the hope that one of the instances will indeed converge to a sufficiently good point.
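The restart heuristic in the last paragraph can be sketched on a toy non-convex objective. The quartic below and all names are our own illustration (plain GD on a one-dimensional function, not the notes' training setup):

```python
import random

def gradient_descent(grad, x0, lr=0.1, steps=200):
    # Plain gradient descent from a given initialization point
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Toy non-convex objective f(x) = x^4 - 2x^2 + 0.5x with two local minima;
# its derivative is f'(x) = 4x^3 - 4x + 0.5
f = lambda x: x ** 4 - 2 * x ** 2 + 0.5 * x
grad = lambda x: 4 * x ** 3 - 4 * x + 0.5

# Restart from several random initializations and keep the best point found
random.seed(0)
best = min((gradient_descent(grad, random.uniform(-2, 2)) for _ in range(10)), key=f)
print(best, f(best))
```

Each run converges to some critical point, but which basin it lands in depends on the initialization; taking the best over restarts is exactly the hope expressed above.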
However, all hardness results we discussed so far still apply: for any method, if the network is expressive enough to represent, for example, intersections of halfspaces, then on some instances the method must fail. The second point is actually solvable, and we will next see how one can compute the gradient of the loss. This is known as the Backpropagation algorithm, which has become the workhorse of Machine Learning in the past few years.
16.1.1 A Few Remarks on NNs in practice

Before presenting the Backpropagation algorithm, it is worth discussing some simplifications we have made here compared to what is often used in practice:

The activation function. We are restricting our attention to a sigmoidal activation function. These have been used in the past, the general intuition being that they are a smoothing of the $0$-$1$ activation function. In reality, training with sigmoidal functions tends to get stuck: when the weights are very large, the derivative starts to behave roughly like that of the $0$-$1$ function, which means the gradients vanish. One suggested change is to use the ReLU activation function

$$\sigma_{\mathrm{relu}}(a) = \max(0, a).$$

Unlike the sigmoidal function, its derivative doesn't vanish whenever the input is positive. In terms of expressive power, one can express a sigmoidal-like function using $\sigma_{\mathrm{relu}}(a+1) - \sigma_{\mathrm{relu}}(a)$, so the overall expressivity of the network doesn't change (as long as we allow twice as many neurons at each layer, which is the same order of neurons).

Regularization. For generalization, we rely here on the generalization bound of $O(|E| \log |E|)$. In practice, the number of free parameters (weights and biases) tends to be far larger than the number of examples. Therefore some regularization is often employed on the weights (e.g. $\ell_2$ or $\ell_1$ regularization). There have also been other heuristics for regularizing neural networks, such as dropout, where, roughly, during training one zeroes out some weights during the update step. As we saw in a past lecture, SGD comes with its own generalization guarantees. Generalization bounds for SGD in non-convex optimization have recently been obtained [?], but these are not necessarily for the learning rates used in practice.

16.1.2 The Backpropagation Algorithm

We next discuss the Backpropagation algorithm, which computes $\nabla_{\omega,b}$ in linear time. To simplify notation, instead of carrying a bias term, let us assume that each layer $V^{(t)}$ contains a single neuron $v_0^{(t)}$ that always outputs the constant $1$. Thus the output of a neuron is given by $\sigma(\langle \omega_i, v^{(t-1)}\rangle)$, and we suppress the bias $b$ as an additional weight $\omega_{i,0}$. We next wish to compute the derivative $\frac{\partial f}{\partial \omega}$.
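The bias-as-weight convention can be sanity-checked in a few lines. This is a sketch with our own helper names: prepending a constant-$1$ neuron and storing the bias as the weight $\omega_{i,0}$ gives the same neuron output:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def with_bias(w, b, v_prev):
    # sigma(<w, v^{(t-1)}> + b)
    return sigmoid(sum(wi * vi for wi, vi in zip(w, v_prev)) + b)

def bias_as_weight(w, b, v_prev):
    # Prepend a constant-1 neuron v_0^{(t-1)} = 1 and store b as the weight w_{i,0}:
    # sigma(<(b, w), (1, v^{(t-1)})>)
    w_aug = [b] + w
    v_aug = [1.0] + v_prev
    return sigmoid(sum(wi * vi for wi, vi in zip(w_aug, v_aug)))

print(with_bias([0.3, -0.8], 0.25, [1.5, 0.5]))
print(bias_as_weight([0.3, -0.8], 0.25, [1.5, 0.5]))  # same value
```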
Now suppose neuron $v_i^{(t)}$ computes

$$v_i^{(t)}(x) = \sigma\big(u_i^{(t)}(x)\big), \qquad \text{where } u_i^{(t)}(x) = \langle \omega_i, v^{(t-1)}(x)\rangle.$$

Then using a simple chain rule we obtain that

$$\frac{\partial f}{\partial \omega_{i,j}} = \frac{\partial f}{\partial u_i^{(t)}} \cdot \frac{\partial u_i^{(t)}}{\partial \omega_{i,j}} = \frac{\partial f}{\partial u_i^{(t)}} \cdot v_j^{(t-1)}(x).$$

Thus, to compute the partial derivative with respect to a single weight, it is enough to compute $\frac{\partial f}{\partial u_i^{(t)}}$.
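For a single neuron $f = \sigma(\langle \omega, v^{(t-1)}\rangle)$, the chain-rule formula above can be checked against finite differences. A sketch (the function names are ours):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def d_sigmoid(a):
    # sigma'(a) = sigma(a) * (1 - sigma(a))
    s = sigmoid(a)
    return s * (1.0 - s)

def neuron_grad(w, v_prev):
    # For f = sigma(u) with u = <w, v_prev>:
    # df/dw_j = (df/du) * (du/dw_j) = sigma'(u) * v_prev[j]  -- the chain rule above
    u = sum(wi * vi for wi, vi in zip(w, v_prev))
    return [d_sigmoid(u) * vj for vj in v_prev]

def f(w, v_prev):
    return sigmoid(sum(wi * vi for wi, vi in zip(w, v_prev)))

# Finite-difference check of the analytic gradient
w, v = [0.4, -0.7], [1.0, 2.0]
eps = 1e-6
for j in range(len(w)):
    w_hi = w[:]; w_hi[j] += eps
    w_lo = w[:]; w_lo[j] -= eps
    numeric = (f(w_hi, v) - f(w_lo, v)) / (2 * eps)
    print(abs(numeric - neuron_grad(w, v)[j]))  # tiny finite-difference error
```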
So we focus on computing $\frac{\partial f}{\partial u_i^{(t)}}$. Suppose $f$ is a function of $u_1^{(t)}, \dots, u_m^{(t)}$, which are in turn functions of some variable $z$; then we have by the chain rule:

$$\frac{\partial f}{\partial z} = \sum_{j=1}^{m} \frac{\partial f}{\partial u_j^{(t)}} \cdot \frac{\partial u_j^{(t)}}{\partial z} \qquad (16.1)$$

Now if $z = u_j^{(t-1)}$ is the pre-activation of some neuron in a previous layer, the calculation of $\frac{\partial u_i^{(t)}}{\partial u_j^{(t-1)}}$ is easy for our choice of activation function: since

$$u_i^{(t)} = \sum_j \omega_{i,j}\, \sigma\big(u_j^{(t-1)}\big), \qquad \text{we have} \qquad \frac{\partial u_i^{(t)}}{\partial u_j^{(t-1)}} = \omega_{i,j}\, \sigma'\big(u_j^{(t-1)}\big).$$

Using Eq. 16.1 with $f = u^{(t)}$, we can recursively calculate all partial derivatives $\frac{\partial f}{\partial u^{(t')}}$ for $t' < t$, which in turn will give us $\frac{\partial f}{\partial \omega}$ as well.

The naive approach to calculating the gradient is to calculate inductively all derivatives of the form $\frac{\partial u_i^{(t)}}{\partial u_j^{(t')}}$ for $t' < t$, and then, using Eq. 16.1 with $f = u_\ell^{(t+1)}$, calculate all derivatives $\frac{\partial u_\ell^{(t+1)}}{\partial u_j^{(t')}}$. This calculation invokes each derivative a number of times proportional to $|E|$, the number of edges, so the overall calculation time is $O(|V| \cdot |E|)$. The backpropagation algorithm calculates the derivatives through dynamic programming and reduces the complexity to $O(|V| + |E|)$.

16.1.3 Backpropagation

We next consider an approach to calculating the partial derivatives that takes time $O(|V| + |E|)$.

Algorithm 1 Backpropagation
  Input: Graph $G(V,E)$ and parameters $\omega : E \to \mathbb{R}$.
  SET $T = \mathrm{depth}(G)$, i.e. $v^{(T)}$ is the output neuron.
  SET $m^{(T)} = 1$.
  for $t = T-1, \dots, 1$  % Start from the top layer and move toward the bottom layer do
    for $i = 1, \dots, |V^{(t)}|$  % Go over all neurons at layer $t$ do
      Neuron $v_i^{(t)}$ receives the messages $m_j^{(t+1)}(v_i^{(t)})$ and sums them up:
        $m_i^{(t)} = \sum_{j=1}^{|V^{(t+1)}|} m_j^{(t+1)}\big(v_i^{(t)}\big)$,
      then passes a message $m_i^{(t)}(v_j^{(t-1)}) = m_i^{(t)} \cdot \frac{\partial u_i^{(t)}}{\partial u_j^{(t-1)}} = m_i^{(t)}\, \omega_{i,j}\, \sigma'(u_j^{(t-1)})$ to each neuron at the lower level.
    end for
  end for

Claim 16.2. At each node $v_i^{(t)}$, the value $m_i^{(t)}$ is exactly $\frac{\partial f}{\partial u_i^{(t)}}$.
Proof. We prove the statement by induction. The message at the output neuron is $m^{(T)} = \frac{\partial f}{\partial u^{(T)}} = 1$. Next, for each neuron we have by induction:

$$m_i^{(t)} = \sum_{j=1}^{|V^{(t+1)}|} \frac{\partial f}{\partial u_j^{(t+1)}} \cdot \frac{\partial u_j^{(t+1)}}{\partial u_i^{(t)}} \qquad (16.2)$$

which by the chain rule gives the desired result. $\square$

In Backpropagation, each neuron does a number of computations proportional to its degree; overall, the number of calculations is proportional to twice the number of edges, which gives an overall number of calculations of $O(|V| + |E|)$.
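Algorithm 1 can be sketched in code for a fully connected layered network. This is our illustrative implementation, not the notes' own: `m[t][i]` holds the message $m_i^{(t)} = \partial f / \partial u_i^{(t)}$, computed top-down, and, following the convention $m^{(T)} = 1$, the network output $f$ is taken to be the top pre-activation $u^{(T)}$ of a single output neuron.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def d_sigmoid(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def forward(weights, x):
    # weights[t][i][j] is the weight on the edge from neuron j in layer t
    # to neuron i in layer t+1.  Returns (v, u): activations per layer and
    # pre-activations per non-input layer (u[0] is unused).
    v = [list(x)]
    u = [None]
    for W in weights:
        u_t = [sum(W[i][j] * v[-1][j] for j in range(len(v[-1])))
               for i in range(len(W))]
        u.append(u_t)
        v.append([sigmoid(a) for a in u_t])
    return v, u

def backprop(weights, v, u):
    T = len(weights)                 # number of non-input layers
    m = [None] * (T + 1)
    m[T] = [1.0]                     # m^{(T)} = 1: take f = u^{(T)} (single output)
    # Top-down message passing (Eq. 16.2):
    # m^{(t)}_i = sigma'(u^{(t)}_i) * sum_k m^{(t+1)}_k * w_{k,i}
    for t in range(T - 1, 0, -1):
        W = weights[t]               # edges from layer t to layer t+1
        m[t] = [d_sigmoid(u[t][i]) * sum(m[t + 1][k] * W[k][i]
                                         for k in range(len(W)))
                for i in range(len(u[t]))]
    # df/dw^{(t)}_{i,j} = m^{(t+1)}_i * v^{(t)}_j  (chain rule from Section 16.1.2)
    return [[[m[t + 1][i] * v[t][j] for j in range(len(v[t]))]
             for i in range(len(W))]
            for t, W in enumerate(weights)]
```

Each neuron touches each incident edge a constant number of times, matching the $O(|V| + |E|)$ bound; on small examples the output agrees with a finite-difference approximation of the gradient.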