1 Convex Optimization

Oliver Stephens
5 years ago

We will consider convex optimization problems, namely minimization problems where the objective is convex (we assume no constraints for now). Such problems often arise in machine learning. For example, the SVM optimization problem is convex. Recall that $f(w)$ is convex if for all $w_1, w_2$:

$$f(w_2) \geq f(w_1) + \nabla f(w_1) \cdot (w_2 - w_1) \tag{1}$$

An alternative condition for convexity is that for all $w_1, w_2$ it holds that:

$$\left(\nabla f(w_1) - \nabla f(w_2)\right) \cdot (w_1 - w_2) \geq 0 \tag{2}$$

1.1 Stochastic Gradient Descent

The derivation here follows [1], Chapter 3. Consider minimizing the function $f(w)$, which is given as a sum:

$$f(w) = \frac{1}{m} \sum_{i=1}^{m} f_i(w) \tag{3}$$

where the $f_i(w)$ are convex functions. Now consider the SGD algorithm defined as follows. We use $W_t$ for the $t$-th iterate, since it is a random variable.

- Set $W_1 = 0$.
- Sample $V_t$ such that $E[V_t \mid W_t = w_t] = \nabla f(w_t)$.
- Update $W_{t+1} = W_t - \eta V_t$.

In particular, $V_t$ can be obtained by sampling an index $j$ uniformly in $\{1, \ldots, m\}$ and returning $V_t = \nabla f_j(w_t)$. Its expected value is indeed $\nabla f(w_t)$, because averaging $\nabla f_j(w_t)$ over a uniform $j$ gives $\frac{1}{m} \sum_{j=1}^{m} \nabla f_j(w_t) = \nabla f(w_t)$ (see also the slides).

Now let us analyze the quality of the averaged vector $\bar{W} = \frac{1}{T} \sum_{t=1}^{T} W_t$. We will show the following result.

Theorem 1.1. Denote by $w^*$ the minimizer of $f(w)$ and assume $\|w^*\|_2 \leq B$. Also assume $\|V_t\|_2 \leq G$ for all $t$. Then, with step size $\eta = \frac{B}{G\sqrt{T}}$:

$$E\left[f(\bar{W}) - f(w^*)\right] \leq \frac{BG}{\sqrt{T}} \tag{4}$$

Proof.

$$f(\bar{W}) = f\left(\frac{1}{T} \sum_{t=1}^{T} W_t\right) \leq \frac{1}{T} \sum_{t=1}^{T} f(W_t) \tag{5}$$

where we have used Jensen's inequality, which says that for a convex function, the average of its values is at least its value at the average.

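We would like to compare this to the optimum, so:

$$f(\bar{W}) - f(w^*) \leq \frac{1}{T} \sum_{t=1}^{T} \left[f(W_t) - f(w^*)\right] \tag{6}$$

Using convexity we have:

$$f(W_t) - f(w^*) \leq \nabla f(W_t) \cdot (W_t - w^*) \tag{7}$$

Now, in SGD, the iterates are random variables, because of the randomness of the gradient estimate; this is why we denote them by $W_1, \ldots, W_T$. Thus, the above difference is a random variable, and we would like to say that it is small. Here we will show this in the expected sense. Denote by $V_1, \ldots, V_T$ the stochastic gradient estimates, as sampled during the algorithm. Note that each $V_t$ is sampled based on $W_t$ such that:

$$E[V_t \mid W_t = w_t] = \nabla f(w_t) \tag{8}$$

Similarly:

$$E[V_t \cdot W_t \mid W_t = w_t] = \nabla f(w_t) \cdot w_t \tag{9}$$

And:

$$E[V_t \cdot W_t] = E[\nabla f(W_t) \cdot W_t] \tag{10}$$

We can now take the expected value of Eq. (6), together with Eq. (7), to get:

$$E\left[f(\bar{W}) - f(w^*)\right] \leq \frac{1}{T} \sum_{t=1}^{T} E\left[\nabla f(W_t) \cdot (W_t - w^*)\right] \tag{11}$$

and using Eq. (10) (and Eq. (8) for the term involving $w^*$) this is equivalent to:

$$E\left[f(\bar{W}) - f(w^*)\right] \leq \frac{1}{T} \sum_{t=1}^{T} E\left[V_t \cdot (W_t - w^*)\right] \tag{12}$$

We now use the SGD update form to write:

$$\|W_{t+1} - w^*\|_2^2 = \|W_t - \eta V_t - w^*\|_2^2 = \|W_t - w^*\|_2^2 + \eta^2 \|V_t\|_2^2 - 2\eta (W_t - w^*) \cdot V_t$$

Rearranging we have:

$$V_t \cdot (W_t - w^*) = \frac{\|W_t - w^*\|_2^2 - \|W_{t+1} - w^*\|_2^2}{2\eta} + 0.5\,\eta \|V_t\|_2^2 \tag{13}$$

Substituting Eq. (13) into Eq. (12), and recalling that $W_1 = 0$, we get:

$$
\begin{aligned}
E\left[f(\bar{W}) - f(w^*)\right]
&\leq \frac{1}{T} \sum_{t=1}^{T} E\left[\frac{\|W_t - w^*\|_2^2 - \|W_{t+1} - w^*\|_2^2}{2\eta} + \frac{\eta}{2} \|V_t\|_2^2\right] \\
&= \frac{1}{2\eta T} \|w^*\|_2^2 - \frac{1}{2\eta T} E\left[\|W_{T+1} - w^*\|_2^2\right] + \frac{\eta}{2T} \sum_{t=1}^{T} E\left[\|V_t\|_2^2\right] \\
&\leq \frac{1}{2\eta T} \|w^*\|_2^2 + \frac{\eta}{2T} \sum_{t=1}^{T} E\left[\|V_t\|_2^2\right] \\
&\leq \frac{B^2}{2\eta T} + \frac{\eta G^2}{2}
\end{aligned}
$$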

where we used the fact that the series is telescoping, and for the last two inequalities we dropped a negative term and applied the bounds $\|w^*\|_2 \leq B$ and $\|V_t\|_2 \leq G$. Setting $\eta = \frac{B}{G\sqrt{T}}$ we get:

$$E\left[f(\bar{W}) - f(w^*)\right] \leq \frac{BG}{\sqrt{T}} \tag{14}$$

Namely, after $T$ iterations we have $O\left(\frac{1}{\sqrt{T}}\right)$ error.

2 Deep Learning

In deep learning we are interested in functions that correspond to sequences of linear transformations, each followed by a non-linearity. Formally, assume $x$ is our input, and denote $z_0 = x$. Now define the calculation recursively:

$$z_{t+1} = h(W_{t+1} z_t) \tag{15}$$

Here $h : \mathbb{R} \to \mathbb{R}$ is a non-linear function, applied to each coordinate. Some examples are:

$$h(z) = \mathrm{ReLU}(z) = \max[0, z]$$

$$h(z) = \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$

$$h(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

The output of the neural net depends on the task we want to perform. Let us focus for simplicity on binary classification, in which case the output is a scalar, obtained as the transformation:

$$z_L = h(w_L \cdot z_{L-1}) \tag{16}$$

Note that often this last transformation is taken to be linear. For ease of derivation we assume it is the same non-linear transformation as we had in the previous layers. The resulting classifier is then taken to be:

$$f(x; W) = \mathrm{sign}(z_L) \tag{17}$$

where $W$ stands for all the parameters of the model. Note that it would also make sense to add a bias term here so that classification is $\mathrm{sign}(z_L - b)$ (of course this will be absolutely required if $h$ has non-negative outputs).

2.1 Training Deep Learning by SGD and Backpropagation

Assume we have training data in the form of pairs $(x_i, y_i)$ with $y_i \in \{+1, -1\}$. Then we can follow the standard ERM approach of minimizing some approximation of the classification error. As we discussed earlier, two natural loss functions are the hinge loss and the logistic loss:

$$\ell_{\mathrm{hinge}}(y, z_L) = \max[1 - y z_L, 0]$$

$$\ell_{\mathrm{logistic}}(y, z_L) = \log\left(1 + e^{-y z_L}\right)$$

Note that the logistic loss goes to zero when $y$ and $z_L$ have the same sign and $|z_L| \to \infty$.

Introduce some notation:

- Denote $v_t = W_t z_{t-1}$ (equivalently, $v_{t+1} = W_{t+1} z_t$). Namely, these are the values of the neurons in layer $t$ before the activation function.
- Let $w_{t,i}$ denote the $i$-th row of the matrix $W_t$. So, these are the input weights of neuron $i$ in layer $t$.

The loss we want to minimize is then (we just use $\ell$ for the loss; it can be hinge, logistic or something else):

$$f(W) = \sum_{i} \ell(y_i, z_L(x_i, W)) \tag{18}$$

Note we could have included a regularization term, and this is indeed often used. The simplest and most effective way of minimizing the above is SGD. To implement it, we only need to calculate the gradient $\nabla_W \ell(x, y, W)$ of the loss on a single example. There are many software tools that can do this differentiation automatically. Here we derive the gradient, to better understand its structure. This gradient calculation, which is fundamental to any deep learning optimization, is called backpropagation for reasons that will soon be clear. We first recall the chain rule. Start with a simple case where:

$$f(x) = y(z(x)) \tag{19}$$

Then:

$$\frac{\partial f}{\partial x} = \frac{\partial y}{\partial z} \frac{\partial z}{\partial x} \tag{20}$$

Similarly in the multivariate case:

$$f(x) = y(z_1(x), \ldots, z_m(x)) \tag{21}$$

Then:

$$\frac{\partial f}{\partial x} = \sum_{i} \frac{\partial y}{\partial z_i} \frac{\partial z_i}{\partial x} \tag{22}$$

Now say we want the gradient with respect to $W_{t,i,j}$, or equivalently the $j$-th weight in the vector $w_{t,i}$. We recall that $w_{t,i}$ appears in the objective only through $z_{t,i}$. Thus we can use the chain rule to write:

$$\frac{\partial \ell(y, z_L(x, W))}{\partial W_{t,i,j}} = \frac{\partial \ell(y, z_L(x, W))}{\partial z_{t,i}} \frac{\partial z_{t,i}}{\partial W_{t,i,j}} \tag{23}$$

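Now recall:

$$z_{t,i} = h(w_{t,i} \cdot z_{t-1}) = h(v_{t,i}) \tag{24}$$

So we have:

$$\frac{\partial z_{t,i}}{\partial W_{t,i,j}} = h'(v_{t,i})\, z_{t-1,j} \tag{25}$$

Putting things together we have:

$$\frac{\partial \ell(y, z_L(x, W))}{\partial W_{t,i,j}} = \frac{\partial \ell(y, z_L(x, W))}{\partial z_{t,i}}\, h'(v_{t,i})\, z_{t-1,j} \tag{26}$$

Note that our result depends on the derivative of the loss with respect to $z_t$. We turn to show that this can be calculated recursively (from the top layer to the input).

$$\frac{\partial \ell}{\partial z_{t,i}} = \sum_{r} \frac{\partial \ell}{\partial z_{t+1,r}} \frac{\partial z_{t+1,r}}{\partial z_{t,i}} \tag{27}$$

Again using Eq. (24) we get:

$$\frac{\partial z_{t+1,r}}{\partial z_{t,i}} = h'(v_{t+1,r})\, W_{t+1,r,i} \tag{28}$$

So together with Eq. (27) we have:

$$\frac{\partial \ell}{\partial z_{t,i}} = \sum_{r} \frac{\partial \ell}{\partial z_{t+1,r}}\, h'(v_{t+1,r})\, W_{t+1,r,i} \tag{29}$$

To simplify things denote:

$$\delta_{t,r} = \frac{\partial \ell(y, z_L(x, W))}{\partial z_{t,r}} \tag{30}$$

And denote by $\delta_t$ the vector of all these values for a given $t$. Then the backpropagation calculation for the gradients at weights $W$ amounts to the following steps:

- Use $W$ to run the network forward on $x$ and calculate all $v_t$ values (i.e., the outputs of the linear calculation at each step).
- Recursively calculate $\delta_t$ using:

$$\delta_{t,i} = \sum_{r} h'(v_{t+1,r})\, \delta_{t+1,r}\, W_{t+1,r,i} \tag{31}$$

In vector form we would have (with $\circ$ the element-wise product):

$$\delta_t^\top = \left(h'(v_{t+1}) \circ \delta_{t+1}\right)^\top W_{t+1} \tag{32}$$

The base of the recursion is simply the derivative of the loss with respect to $z_L$ (note it is a scalar because there is only one $z_L$):

$$\delta_L = \ell'(y, z_L(x, W)) \tag{33}$$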

- Calculate the desired gradients via:

$$\frac{\partial \ell}{\partial W_{t,i,j}} = \delta_{t,i}\, h'(v_{t,i})\, z_{t-1,j} \tag{34}$$

Or in vector form (with $\circ$ the element-wise product):

$$\frac{\partial \ell}{\partial W_t} = \left(\delta_t \circ h'(v_t)\right) z_{t-1}^\top \tag{35}$$

You can now see why the algorithm is called backpropagation. After the forward pass, it starts from the last layer, calculates the gradient of the loss with respect to $z_L$, and then begins to propagate the errors backwards, so that each weight matrix $W_t$ gets a signal as to how it should change to decrease the error.

References

[1] E. Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325,
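
As a sanity check on the backpropagation steps of Section 2.1, the following is a minimal pure-Python sketch rather than a definitive implementation. It assumes the sigmoid for $h$ in every layer (including the last, as in the text), the logistic loss, and arbitrary layer sizes 3, 4, 5, 1. All function names (`forward`, `backprop`, `numerical_grads`, and so on) are hypothetical and chosen only for illustration. It compares the gradients from Eqs. (31), (33) and (34) against central finite differences.

```python
import math
import random

def h(v):                       # assumed activation: sigmoid
    return 1.0 / (1.0 + math.exp(-v))

def h_prime(v):                 # derivative of the sigmoid
    s = h(v)
    return s * (1.0 - s)

def loss(y, z_L):               # logistic loss log(1 + exp(-y z_L))
    return math.log(1.0 + math.exp(-y * z_L))

def loss_prime(y, z_L):         # derivative of the logistic loss w.r.t. z_L
    return -y / (1.0 + math.exp(y * z_L))

def forward(weights, x):
    """Return zs = [z_0, ..., z_L] and vs = [None, v_1, ..., v_L]."""
    zs, vs = [list(x)], [None]
    for W in weights:
        v = [sum(w_ij * z_j for w_ij, z_j in zip(row, zs[-1])) for row in W]
        vs.append(v)
        zs.append([h(v_i) for v_i in v])
    return zs, vs

def backprop(weights, x, y):
    """Gradient of the loss w.r.t. every W_t, computed from the top layer down."""
    zs, vs = forward(weights, x)
    L = len(weights)
    delta = [loss_prime(y, zs[L][0])]                  # Eq. (33): delta_L
    grads = [None] * L
    for t in range(L, 0, -1):                          # t = L, ..., 1
        W_t = weights[t - 1]
        local = [d * h_prime(v) for d, v in zip(delta, vs[t])]
        # Eq. (34): dl/dW_{t,i,j} = delta_{t,i} h'(v_{t,i}) z_{t-1,j}
        grads[t - 1] = [[l_i * z_j for z_j in zs[t - 1]] for l_i in local]
        # Eq. (31), shifted by one layer: delta_{t-1,i}
        delta = [sum(W_t[r][i] * local[r] for r in range(len(W_t)))
                 for i in range(len(zs[t - 1]))]
    return grads

def total_loss(weights, x, y):
    zs, _ = forward(weights, x)
    return loss(y, zs[-1][0])

def numerical_grads(weights, x, y, eps=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    grads = []
    for W in weights:
        g = [[0.0] * len(row) for row in W]
        for i, row in enumerate(W):
            for j in range(len(row)):
                old = row[j]
                row[j] = old + eps
                up = total_loss(weights, x, y)
                row[j] = old - eps
                down = total_loss(weights, x, y)
                row[j] = old                           # restore the weight
                g[i][j] = (up - down) / (2 * eps)
        grads.append(g)
    return grads

random.seed(0)
sizes = [3, 4, 5, 1]                                   # assumed layer widths
weights = [[[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x, y = [random.gauss(0, 1) for _ in range(sizes[0])], 1.0

analytic = backprop(weights, x, y)
numeric = numerical_grads(weights, x, y)
max_err = max(abs(a - n)
              for A, N in zip(analytic, numeric)
              for ra, rn in zip(A, N)
              for a, n in zip(ra, rn))
print("largest gap between backprop and finite differences:", max_err)
```

The finite-difference check needs two forward passes per weight, whereas `backprop` obtains all gradients from one forward and one backward pass. Each gradient it returns could then be plugged into the SGD update $W_{t+1} = W_t - \eta V_t$ from Section 1.1.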
