princeton univ. F'13 cos 521: Advanced Algorithm Design

Lecture 19: Going with the slope: offline, online, and randomly

Lecturer: Sanjeev Arora    Scribe:

This lecture is about gradient descent, a popular method for continuous optimization (especially nonlinear optimization). We start by recalling that allowing nonlinear constraints in optimization leads to NP-hard problems in general. For instance, the following single constraint can be used to force all variables to be 0/1:

    x_i(1 - x_i) = 0.

Notice that this constraint is nonconvex. We saw in earlier lectures that the Ellipsoid method can solve convex optimization problems efficiently under fairly general conditions, but it is slow in practice. Gradient descent is a popular alternative because it is simple and it gives some kind of meaningful result for both convex and nonconvex optimization. It tries to improve the function value by moving in a direction related to the gradient (i.e., the first derivative). For convex optimization it gives the global optimum under fairly general conditions; for nonconvex optimization it arrives at a local optimum.

Figure 1: For nonconvex functions, a local optimum may be different from the global optimum.

We will first study unconstrained gradient descent, where we simply optimize a function f(·). Recall that f is convex if f(λx + (1 - λ)y) ≤ λf(x) + (1 - λ)f(y) for all x, y and all λ ∈ [0, 1].

1  Gradient descent for convex functions: univariate case

As a warm-up, let us describe gradient descent in the univariate case, which will be familiar from freshman calculus. Recall the Taylor expansion of a function all of whose derivatives exist at x:
    f(x + η) = f(x) + ηf'(x) + (η²/2) f''(x) + (η³/3!) f'''(x) + ···    (1)

The function is convex if f''(x) ≥ 0 for all x. This means that f'(x) is an increasing function of x. The minimum is attained where f'(x) = 0, since f' keeps increasing to the left and right of that point; thus the global minimum is unique. The function is concave if f''(x) ≤ 0 for all x; such functions have a unique maximum.

Examples of convex functions: ax + b for any a, b ∈ ℝ; exp(ax) for any a ∈ ℝ; x^α for x ≥ 0 and α ≥ 1 or α ≤ 0. Another interesting example is the negative entropy: x log x for x ≥ 0.

Examples of concave functions: ax + b for any a, b ∈ ℝ; x^α for α ∈ [0, 1] and x ≥ 0; log x for x ≥ 0.

Figure 2: Concave and convex functions.

To minimize a convex function by gradient descent we start at some x_0 and at step i update x_i to x_{i+1} = x_i + ηf'(x_i) for some small η < 0. In other words, we move in the direction in which f decreases. If we ignore terms that involve η³ or higher, then

    f(x_{i+1}) = f(x_i) + ηf'(x_i) + (η²/2) f''(x_i),

and the best value for η (the one giving the most reduction in one step) is η = -f'(x_i)/f''(x_i), which gives

    f(x_{i+1}) = f(x_i) - (f'(x_i))² / (2f''(x_i)).

Thus the algorithm makes progress so long as f''(x_i) > 0. Convex functions that satisfy f''(x) > 0 for all x are called strongly convex. The above calculation is the main idea in Newton's method, which you may have seen in calculus. Proving convergence requires further assumptions.

2  Convex multivariate functions

A convex function on ℝⁿ, if it is differentiable, satisfies the following basic inequality, which says that the function lies above the tangent plane at any point:

    f(x + z) ≥ f(x) + ∇f(x)·z    for all x, z.    (2)
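As a quick numerical sanity check (a sketch added here, not part of the original notes), inequality (2) can be verified for a simple convex function such as f(x) = x_1² + e^{x_2}:

```python
import math

# Check the tangent-plane inequality (2): f(x + z) >= f(x) + grad f(x) . z
# for the convex function f(x) = x1^2 + exp(x2) at a few sample points.
def f(x):
    return x[0] ** 2 + math.exp(x[1])

def grad_f(x):
    return [2 * x[0], math.exp(x[1])]

def tangent_value(x, z):
    # first-order Taylor approximation of f at x, evaluated at x + z
    return f(x) + sum(g * zi for g, zi in zip(grad_f(x), z))

x = [1.0, 0.5]
for z in ([0.3, -0.2], [-1.0, 1.0], [2.0, 0.0]):
    xz = [xi + zi for xi, zi in zip(x, z)]
    assert f(xz) >= tangent_value(x, z)  # the tangent plane is a global underestimator
```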
Here ∇f(x) is the vector of first-order derivatives, whose i-th coordinate is ∂f/∂x_i; it is called the gradient. Sometimes we restate (2) equivalently as

    ∇f(x)·(y - x) ≤ f(y) - f(x)    for all x, y.    (3)

Figure 3: A differentiable convex function lies above the tangent plane f(x) + ∇f(x)·(y - x). (Figure taken from Boyd and Vandenberghe, Convex Optimization.)

If higher derivatives also exist, the multivariate Taylor expansion for an n-variate function f is

    f(x + y) = f(x) + ∇f(x)·y + (1/2) yᵀ ∇²f(x) y + ···.    (4)

Here ∇²f(x) denotes the n × n matrix whose (i, j) entry is ∂²f/∂x_i∂x_j; it is called the Hessian. It can be checked that f is convex iff the Hessian is positive semidefinite; this means yᵀ ∇²f(x) y ≥ 0 for all y.

Figure 4: The Hessian.

Example 1  The following are some examples of convex functions.

- Norms. Every ℓ_p norm is convex on ℝⁿ. The reason is that a norm satisfies the triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y.

- f(x) = log(e^{x_1} + e^{x_2} + ··· + e^{x_n}) is convex on ℝⁿ. This fact is used in practice as an analytic approximation of the max function, since

    max{x_1, ..., x_n} ≤ f(x) ≤ max{x_1, ..., x_n} + log n.
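The sandwich bound above is easy to verify numerically; here is a small sketch (not from the notes), using the standard max-subtraction trick for numerical stability:

```python
import math

# log-sum-exp: a smooth, convex approximation of max{x_1, ..., x_n}
def lse(xs):
    m = max(xs)  # subtract the max before exponentiating, for stability
    return m + math.log(sum(math.exp(x - m) for x in xs))

xs = [1.0, 2.5, -0.3, 2.4]
# max{x_i} <= lse(xs) <= max{x_i} + log n
assert max(xs) <= lse(xs) <= max(xs) + math.log(len(xs))
```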
It turns out this fact is at the root of the multiplicative weight update method; the algorithm for approximately solving LPs that we saw in Lecture 10 can be seen as doing a gradient descent on this function, where the x_i's are the slacks of the linear constraints. (For a linear constraint a·z ≤ b the slack is a·z - b.)

- f(x) = xᵀAx where A is positive semidefinite. Its Hessian is 2A.

Some important examples of concave functions are the geometric mean (∏_{i=1}^n x_i)^{1/n} and the log-determinant (defined for X ∈ ℝ^{n²} as log det(X), where X is interpreted as an n × n matrix). Many famous inequalities in mathematics (such as Cauchy-Schwarz) are derived using convex functions.

Example 2 (Linear equations with PSD constraint matrix)  In linear algebra you learnt that the method of choice for solving a system of equations Ax = b is Gaussian elimination. In many practical settings its O(n³) running time may be too high. Instead one does gradient descent on the function (1/2) xᵀAx - b·x, whose local optimum satisfies Ax = b. If A is positive semidefinite this function is also convex, since its Hessian is A. Actually, in real life these are optimized using more advanced methods such as conjugate gradient. Also, if A is diagonally dominant, a stronger condition than PSD, then Spielman and Teng (2003) have shown how to solve this problem in time that is near-linear in the number of nonzero entries.

Example 3 (Least squares)  In some settings we are given a set of points a_1, a_2, ..., a_m ∈ ℝⁿ and some data values b_1, b_2, ..., b_m taken at these points by some function of interest. We suspect that the unknown function is a line, except that the data values have a little error in them. One standard technique is to find a least-squares fit: a line that minimizes the sum of squared distances of the data points to the line. The objective is

    min ‖Ax - b‖²,

where A ∈ ℝ^{m×n} is the matrix whose rows are the a_i's. (We saw in an earlier lecture that the solution is also the first singular vector.) This objective is just xᵀAᵀAx - 2(Ax)·b + b·b, which is convex.

In the univariate case, gradient descent has a choice of only two directions to move in: right or left.
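Examples 2 and 3 both amount to gradient descent on a convex quadratic. Here is a minimal sketch of Example 2's approach (the matrix, step size, and iteration count below are made-up toy values): plain gradient descent on F(x) = (1/2) xᵀAx - b·x, whose gradient is Ax - b, so the minimizer solves Ax = b.

```python
# Gradient descent on F(x) = (1/2) x^T A x - b.x for a small
# symmetric positive definite A; grad F(x) = Ax - b.
def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def solve_psd(A, b, eta=0.1, steps=2000):
    x = [0.0] * len(b)
    for _ in range(steps):
        grad = [ax - bi for ax, bi in zip(matvec(A, x), b)]
        x = [xi - eta * g for xi, g in zip(x, grad)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]  # symmetric positive definite
b = [1.0, 2.0]
x = solve_psd(A, b)           # Ax should now be very close to b
```

As the notes say, in practice conjugate gradient or Spielman-Teng style solvers are preferred; this toy loop only illustrates the objective.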
In n dimensions, it can move in any direction in ℝⁿ. The most direct analog of the univariate method is to move diametrically opposite to the gradient. The most direct analogue of our univariate analysis would be to assume a lower bound on yᵀ ∇²f(x) y for all y (in other words, a lower bound on the eigenvalues of ∇²f). This will be explored in the homework. In the rest of the lecture we will only assume (2).

3  Gradient Descent for Constrained Optimization

As studied in previous lectures, constrained optimization consists of solving the following, where K is a convex set and f(·) is a convex function:

    min f(x)    s.t.  x ∈ K.
Example 4 (Spam classification via SVMs)  This example will run through the entire lecture. "Support Vector Machine" is the name in machine learning for a linear classifier; we saw these before in Lecture 5. Suppose we wish to train the classifier to classify emails as spam/nonspam. Each email is represented by a vector giving the frequencies of the various words in it (the "bag of words" model). Say X_1, X_2, ..., X_N are the emails, and for each i there is a corresponding bit y_i ∈ {-1, 1}, where y_i = 1 means X_i is spam. SVMs use a linear classifier to separate spam from nonspam. If spam were perfectly identifiable by a linear classifier, there would be a vector W such that W·X_i ≥ 1 if X_i is spam, and W·X_i ≤ -1 if not. In other words,

    1 - y_i W·X_i ≤ 0.    (5)

Of course, in practice a linear classifier makes errors, so we have to allow for the possibility that (5) is violated by some X_i's. Thus a more robust version of this problem is

    min_W  Σ_i Loss(1 - y_i W·X_i) + λ‖W‖²,    (6)

where Loss(·) is a function that penalizes unsatisfied constraints according to the amount by which they are unsatisfied, and λ > 0 is a scaling constant. (Note that W is the vector of variables.) The most obvious loss function would count the number of unsatisfied constraints, but that is nonconvex. For this lecture we focus on convex loss functions; the simplest is the hinge loss: Loss(t) = max{0, t}. Then the function in (6) is convex.

If x ∈ K is the current point and we use the gradient to step to x + Δx, then in general this new point will not be in K. Thus one needs to do a projection.

Definition 1  The projection of a point y on K is the x ∈ K that minimizes ‖y - x‖₂. (It is also possible to use norms other than ℓ₂ to define projections.) A projection oracle for the convex body K is a black box that, for every point y, returns its projection on K.

Often the convex sets used in applications are simple to project onto.

Example 5  If K is the unit ball, then the projection of a point y outside K is y/‖y‖.

Here is a simple algorithm for solving the constrained optimization problem. The algorithm only needs to access f via a gradient oracle and K via a projection oracle.
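To make objective (6) concrete, here is a small sketch; the emails, labels, weights, and λ below are made-up toy values, not from the notes.

```python
# Objective (6) with hinge loss: sum_i max(0, 1 - y_i * W.X_i) + lambda * ||W||^2
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def svm_objective(W, X, y, lam):
    hinge = sum(max(0.0, 1.0 - yi * dot(W, Xi)) for Xi, yi in zip(X, y))
    return hinge + lam * dot(W, W)

X = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]  # toy "bag of words" vectors
y = [1, -1, 1]                            # +1 = spam, -1 = nonspam
W = [2.0, -2.0]                           # satisfies (5) on all three examples
obj = svm_objective(W, X, y, lam=0.1)     # hinge terms vanish; only 0.1*||W||^2 = 0.8 remains
```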
Definition 2 (Gradient Oracle)  A gradient oracle for a function f is a black box that, for every point z, returns ∇f(z), the gradient evaluated at the point z. (Notice that this defines a linear function of the form g·x, where g is the vector of partial derivatives evaluated at z.)

The same value of η will be used throughout.

Gradient Descent for Constrained Optimization

    Let η = D/(G√T).
    Repeat for i = 0, 1, ..., T - 1:
        y^{(i+1)} ← x^{(i)} - η∇f(x^{(i)})
        x^{(i+1)} ← projection of y^{(i+1)} on K
    At the end output z = (1/T) Σ_i x^{(i)}.
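The loop above can be sketched in code. This is an assumed illustration, not the notes' own implementation: K is the unit ball (Example 5) and f(x) = ‖x - c‖² for a made-up target c chosen outside K, so the constraint actually binds.

```python
import math

def project_unit_ball(y):
    # projection oracle for K = unit ball: rescale points outside K
    n = math.sqrt(sum(v * v for v in y))
    return [v / n for v in y] if n > 1.0 else list(y)

def projected_gd(grad, x0, eta, T):
    x = list(x0)
    avg = [0.0] * len(x0)
    for _ in range(T):
        y = [xi - eta * g for xi, g in zip(x, grad(x))]  # gradient step
        x = project_unit_ball(y)                          # projection step
        avg = [a + xi / T for a, xi in zip(avg, x)]
    return avg  # output the average of the iterates, as in the algorithm

c = [2.0, 0.0]  # unconstrained minimizer lies outside the unit ball
grad = lambda x: [2 * (xi - ci) for xi, ci in zip(x, c)]
z = projected_gd(grad, [0.0, 0.0], eta=0.05, T=500)
# z should be close to [1, 0], the constrained minimizer
```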
Let us analyse this algorithm. Let x* be the point in K where the optimum is attained. Let G denote an upper bound on ‖∇f(x)‖ for any x ∈ K, and let D = max_{x,y∈K} ‖x - y‖ be the so-called diameter of K. To ensure that the output z satisfies f(z) ≤ f(x*) + ε, we will use T = 4D²G²/ε².

Since x^{(i+1)} is the projection of y^{(i+1)} on K, we have

    ‖x^{(i+1)} - x*‖² ≤ ‖y^{(i+1)} - x*‖²
                      = ‖x^{(i)} - x* - η∇f(x^{(i)})‖²
                      = ‖x^{(i)} - x*‖² + η²‖∇f(x^{(i)})‖² - 2η ∇f(x^{(i)})·(x^{(i)} - x*).

Reorganizing and using the definition of G, we obtain

    ∇f(x^{(i)})·(x^{(i)} - x*) ≤ (1/(2η)) (‖x^{(i)} - x*‖² - ‖x^{(i+1)} - x*‖²) + (η/2) G².

Using (3), we can lower bound the left-hand side by f(x^{(i)}) - f(x*). We conclude that

    f(x^{(i)}) - f(x*) ≤ (1/(2η)) (‖x^{(i)} - x*‖² - ‖x^{(i+1)} - x*‖²) + (η/2) G².    (7)

Now sum this inequality over the T steps and use the telescoping cancellations to obtain

    (1/T) Σ_i (f(x^{(i)}) - f(x*)) ≤ (1/(2ηT)) (‖x^{(0)} - x*‖² - ‖x^{(T)} - x*‖²) + (η/2) G².

Finally, by convexity f((1/T) Σ_i x^{(i)}) ≤ (1/T) Σ_i f(x^{(i)}), so we conclude that the point z = (1/T) Σ_i x^{(i)} satisfies

    f(z) ≤ f(x*) + D²/(2ηT) + (η/2) G².

Now set η = D/(G√T) to get an upper bound of DG/√T on the right-hand side. Since T = 4D²G²/ε², we see that f(z) ≤ f(x*) + ε.

4  Online Gradient Descent

Zinkevich noticed that the analysis of gradient descent applies to a much more general scenario. There is a convex set K given via a projection oracle. For i = 1, 2, ..., T, we are presented at step i a convex function f_i. At step i we have to put forth our guess solution x^{(i)} ∈ K; the catch is that we do not know the functions that will be presented in the future. So our online decisions have to be made so that, if x* is the point w that minimizes Σ_i f_i(w) (i.e., the point we would have chosen in hindsight after all the functions were revealed), then the following quantity (called the regret) stays small:

    Σ_i f_i(x^{(i)}) - Σ_i f_i(x*).
The above gradient descent algorithm can be adapted to this problem by replacing ∇f(x^{(i)}) with ∇f_i(x^{(i)}). This algorithm is called Online Gradient Descent. The earlier analysis works essentially unchanged, once we realize that the left-hand side of (7) is the regret for trial i. Summing over i gives the total regret on the left side, and the right-hand side is analysed and upper bounded as before. Thus we have shown:

Theorem 1 (Zinkevich 2003)  If D is the diameter of K and G is an upper bound on the norm of the gradient of any of the presented functions, and η is set to D/(G√T), then the regret per step after T steps is at most DG/√T.

Example 6 (Spam classification against adaptive adversaries)  We return to the spam classification problem of Example 4, with the new twist that the classifier must change over time, as spammers learn to evade the current classifier. Thus there is no fixed distribution of spam emails, and it is fruitless to train the classifier in one go. It is better to have it improve and adapt itself as new examples come up. At step t the optimum classifier may not be known and is presented via a gradient oracle; the function f_t just corresponds to the term in (6) for the latest email that was classified as spam/nonspam. Zinkevich's theorem implies that our sequence of classifiers has low regret.

5  Stochastic Gradient Descent

Stochastic gradient descent is a variant of the algorithm in Section 3 that works with convex functions presented using an even weaker notion: an expected gradient oracle. Given a point z, this oracle returns a linear function g·x + c drawn from a probability distribution D_z such that the expectation E_{(g,c)∼D_z}[g·x + c] is exactly the gradient of f at z.

Example 7 (Spam classification using SGD)  Returning to the spam classification problem of Example 4, we see that the function in (6) is a sum of many similar terms. If we randomly pick a single term and compute just its gradient (which is very quick to do!), then by linearity of expectation the expectation of this gradient is just the true gradient, up to scaling by the number of terms.
Thus we see that the expected gradient oracle may be a much faster computation than the gradient oracle (a million times faster if the number of email examples is a million!). In fact this setting is not atypical; often the convex function of interest is a sum of many similar terms.

Stochastic gradient descent just runs Online Gradient Descent (OGD). Let g_i·x be the gradient returned by the oracle at step i. Then we use this function, which is linear and hence convex, as f_i in the i-th step of OGD. Let z = (1/T) Σ_{i=1}^T x^{(i)}, and let x* be the point in K where f attains its minimum value.

Theorem 2  E[f(z)] ≤ f(x*) + DG/√T, where D is the diameter as before and G is an upper bound on the norm of any gradient vector ever output by the oracle.
Proof:

    f(z) - f(x*) ≤ (1/T) Σ_i (f(x^{(i)}) - f(x*))          (by convexity of f)
                 ≤ (1/T) Σ_i ∇f(x^{(i)})·(x^{(i)} - x*)    (using (2))
                 = (1/T) Σ_i E[g_i·(x^{(i)} - x*)]          (since the expected gradient is the true gradient)
                 = (1/T) E[Σ_i (f_i(x^{(i)}) - f_i(x*))]    (definition of f_i)

and the theorem now follows since the expression inside E[·] is just the regret, which is always upper bounded by the quantity given in Zinkevich's theorem; hence the same upper bound also holds for the expectation.

6  Hint of more advanced ideas

We covered some of these in the Friday special section. Gradient descent algorithms come in dozens of flavors. (The Boyd-Vandenberghe book is a good resource; Nesterov's lecture notes are terser but still have a lot of intuition.) Surprisingly, just going along the gradient (more precisely, in the diametrically opposite direction from the gradient) is not always the best strategy. The steepest descent direction is defined by quantifying the best decrease in the objective function obtainable via a step of unit length. The catch is that different norms can be used to define unit length. For example, if distance is measured using the ℓ₁ norm, then the best reduction happens by picking the largest coordinate of the gradient vector and reducing the corresponding coordinate of x (coordinate descent). The classical Newton method is the subcase where distance is measured using the ellipsoidal norm defined by the Hessian.

Gradient descent ideas underlie recent advances in algorithms for problems like Spielman-Teng style solvers for Laplacian systems, near-linear time approximation algorithms for maximum flow in undirected graphs, and Madry's faster algorithm for maximum weight matching.

Bibliography

1. Convex Optimization. S. Boyd and L. Vandenberghe. Cambridge University Press. (PDF available online.)
2. Introductory Lectures on Convex Optimization: A Basic Course. Y. Nesterov. Springer.
3. Lecture notes on online optimization. E. Hazan.
4. Lecture notes on online optimization. S. Bubeck.
More informationReport on Image warping
Report on Image warpng Xuan Ne, Dec. 20, 2004 Ths document summarzed the algorthms of our mage warpng soluton for further study, and there s a detaled descrpton about the mplementaton of these algorthms.
More informationLecture 3. Ax x i a i. i i
18.409 The Behavor of Algorthms n Practce 2/14/2 Lecturer: Dan Spelman Lecture 3 Scrbe: Arvnd Sankar 1 Largest sngular value In order to bound the condton number, we need an upper bound on the largest
More informationSupport Vector Machines CS434
Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? Intuton of Margn Consder ponts A, B, and C We
More informationBezier curves. Michael S. Floater. August 25, These notes provide an introduction to Bezier curves. i=0
Bezer curves Mchael S. Floater August 25, 211 These notes provde an ntroducton to Bezer curves. 1 Bernsten polynomals Recall that a real polynomal of a real varable x R, wth degree n, s a functon of the
More informationNeural networks. Nuno Vasconcelos ECE Department, UCSD
Neural networs Nuno Vasconcelos ECE Department, UCSD Classfcaton a classfcaton problem has two types of varables e.g. X - vector of observatons (features) n the world Y - state (class) of the world x X
More informationMaximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models
ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models
More informationModule 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur
Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:
More informationGlobal Sensitivity. Tuesday 20 th February, 2018
Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values
More informationCSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography
CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve
More informationFormulas for the Determinant
page 224 224 CHAPTER 3 Determnants e t te t e 2t 38 A = e t 2te t e 2t e t te t 2e 2t 39 If 123 A = 345, 456 compute the matrx product A adj(a) What can you conclude about det(a)? For Problems 40 43, use
More informationSpectral Graph Theory and its Applications September 16, Lecture 5
Spectral Graph Theory and ts Applcatons September 16, 2004 Lecturer: Danel A. Spelman Lecture 5 5.1 Introducton In ths lecture, we wll prove the followng theorem: Theorem 5.1.1. Let G be a planar graph
More informationCSE 252C: Computer Vision III
CSE 252C: Computer Vson III Lecturer: Serge Belonge Scrbe: Catherne Wah LECTURE 15 Kernel Machnes 15.1. Kernels We wll study two methods based on a specal knd of functon k(x, y) called a kernel: Kernel
More informationAdditional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty
Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,
More informationSELECTED SOLUTIONS, SECTION (Weak duality) Prove that the primal and dual values p and d defined by equations (4.3.2) and (4.3.3) satisfy p d.
SELECTED SOLUTIONS, SECTION 4.3 1. Weak dualty Prove that the prmal and dual values p and d defned by equatons 4.3. and 4.3.3 satsfy p d. We consder an optmzaton problem of the form The Lagrangan for ths
More informationLecture 17. Solving LPs/SDPs using Multiplicative Weights Multiplicative Weights
Lecture 7 Solvng LPs/SDPs usng Multplcatve Weghts In the last lecture we saw the Multplcatve Weghts (MW) algorthm and how t could be used to effectvely solve the experts problem n whch we have many experts
More informationNP-Completeness : Proofs
NP-Completeness : Proofs Proof Methods A method to show a decson problem Π NP-complete s as follows. (1) Show Π NP. (2) Choose an NP-complete problem Π. (3) Show Π Π. A method to show an optmzaton problem
More informationEnsemble Methods: Boosting
Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement
More informationSTAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 16
STAT 39: MATHEMATICAL COMPUTATIONS I FALL 218 LECTURE 16 1 why teratve methods f we have a lnear system Ax = b where A s very, very large but s ether sparse or structured (eg, banded, Toepltz, banded plus
More informationCS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016
CS 29-128: Algorthms and Uncertanty Lecture 17 Date: October 26, 2016 Instructor: Nkhl Bansal Scrbe: Mchael Denns 1 Introducton In ths lecture we wll be lookng nto the secretary problem, and an nterestng
More information14 Lagrange Multipliers
Lagrange Multplers 14 Lagrange Multplers The Method of Lagrange Multplers s a powerful technque for constraned optmzaton. Whle t has applcatons far beyond machne learnng t was orgnally developed to solve
More informationLecture 10: May 6, 2013
TTIC/CMSC 31150 Mathematcal Toolkt Sprng 013 Madhur Tulsan Lecture 10: May 6, 013 Scrbe: Wenje Luo In today s lecture, we manly talked about random walk on graphs and ntroduce the concept of graph expander,
More informationThe Expectation-Maximization Algorithm
The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.
More informationMATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2)
1/16 MATH 829: Introducton to Data Mnng and Analyss The EM algorthm (part 2) Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 20, 2016 Recall 2/16 We are gven ndependent observatons
More informationPerfect Competition and the Nash Bargaining Solution
Perfect Competton and the Nash Barganng Soluton Renhard John Department of Economcs Unversty of Bonn Adenauerallee 24-42 53113 Bonn, Germany emal: rohn@un-bonn.de May 2005 Abstract For a lnear exchange
More informationCS : Algorithms and Uncertainty Lecture 14 Date: October 17, 2016
CS 294-128: Algorthms and Uncertanty Lecture 14 Date: October 17, 2016 Instructor: Nkhl Bansal Scrbe: Antares Chen 1 Introducton In ths lecture, we revew results regardng follow the regularzed leader (FTRL.
More informationSupport Vector Machines CS434
Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? + + + + + + + + + Intuton of Margn Consder ponts
More informationStanford University Graph Partitioning and Expanders Handout 3 Luca Trevisan May 8, 2013
Stanford Unversty Graph Parttonng and Expanders Handout 3 Luca Trevsan May 8, 03 Lecture 3 In whch we analyze the power method to approxmate egenvalues and egenvectors, and we descrbe some more algorthmc
More informationLecture 11. minimize. c j x j. j=1. 1 x j 0 j. +, b R m + and c R n +
Topcs n Theoretcal Computer Scence May 4, 2015 Lecturer: Ola Svensson Lecture 11 Scrbes: Vncent Eggerlng, Smon Rodrguez 1 Introducton In the last lecture we covered the ellpsod method and ts applcaton
More informationLecture 12: Classification
Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna
More informationP exp(tx) = 1 + t 2k M 2k. k N
1. Subgaussan tals Defnton. Say that a random varable X has a subgaussan dstrbuton wth scale factor σ< f P exp(tx) exp(σ 2 t 2 /2) for all real t. For example, f X s dstrbuted N(,σ 2 ) then t s subgaussan.
More informationEigenvalues of Random Graphs
Spectral Graph Theory Lecture 2 Egenvalues of Random Graphs Danel A. Spelman November 4, 202 2. Introducton In ths lecture, we consder a random graph on n vertces n whch each edge s chosen to be n the
More informationThe Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD
he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s
More informationFall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede
Fall 0 Analyss of Expermental easurements B. Esensten/rev. S. Errede We now reformulate the lnear Least Squares ethod n more general terms, sutable for (eventually extendng to the non-lnear case, and also
More information1 The Mistake Bound Model
5-850: Advanced Algorthms CMU, Sprng 07 Lecture #: Onlne Learnng and Multplcatve Weghts February 7, 07 Lecturer: Anupam Gupta Scrbe: Bryan Lee,Albert Gu, Eugene Cho he Mstake Bound Model Suppose there
More informationSome modelling aspects for the Matlab implementation of MMA
Some modellng aspects for the Matlab mplementaton of MMA Krster Svanberg krlle@math.kth.se Optmzaton and Systems Theory Department of Mathematcs KTH, SE 10044 Stockholm September 2004 1. Consdered optmzaton
More informationMath 217 Fall 2013 Homework 2 Solutions
Math 17 Fall 013 Homework Solutons Due Thursday Sept. 6, 013 5pm Ths homework conssts of 6 problems of 5 ponts each. The total s 30. You need to fully justfy your answer prove that your functon ndeed has
More informationLagrange Multipliers Kernel Trick
Lagrange Multplers Kernel Trck Ncholas Ruozz Unversty of Texas at Dallas Based roughly on the sldes of Davd Sontag General Optmzaton A mathematcal detour, we ll come back to SVMs soon! subject to: f x
More informationLogistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI
Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton
More informationNumerical Heat and Mass Transfer
Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and
More information