1 Duality revisited. AM 221: Advanced Optimization Spring 2016


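AM 221: Advanced Optimization, Spring 2016
Prof. Yaron Singer
Section 7, Wednesday, Mar. 9th

1 Duality revisited

In this section, we will give a slightly different perspective on duality. Consider a constrained optimization program:

$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.} \quad x \in K$$

where $K$ is the feasible set of the program. Then it is clear that this problem is equivalent to:

$$\min_{x \in \mathbb{R}^n} \tilde{g}(x)$$

where $\tilde{g}$ is a function such that:

$$\tilde{g}(x) = \begin{cases} f(x) & \text{if } x \in K \\ +\infty & \text{otherwise} \end{cases} \tag{1}$$

Indeed, when $x \notin K$, $\tilde{g}(x) = +\infty$ and the minimum of $\tilde{g}$ will not be attained at this point. When $x \in K$, $f$ and $\tilde{g}$ coincide and have the same minimum. So by encoding the feasible set $K$ into the function $\tilde{g}$, we transformed a constrained program into an unconstrained one.

More specifically, consider the convex optimization program:

$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.} \quad g(x) \le a, \quad h(x) = b \tag{2}$$

where $f$, $g$ and $h$ are convex (for the equality constraint to define a convex set, $h$ is in practice taken to be affine). The following discussion can easily be adapted to an arbitrary number of constraints. Let us denote by $K$ the (convex) feasible set of this program. Then the function:

$$\tilde{g}(x) = \max_{\lambda \ge 0,\ \mu \in \mathbb{R}} \; f(x) + \lambda\big(g(x) - a\big) + \mu\big(h(x) - b\big)$$

satisfies the condition of Equation (1), where $\tilde{g}$ is written with a tilde to distinguish it from the constraint function $g$. Indeed, consider the case where $g(x) > a$: by taking $\lambda \to +\infty$ ($\lambda$ arbitrarily large), we see that $\tilde{g}(x) = +\infty$. Similarly, if $h(x) \ne b$, then either $h(x) > b$, in which case we obtain $\tilde{g}(x) = +\infty$ by taking $\mu \to +\infty$, or $h(x) < b$ and we obtain the same conclusion by taking $\mu \to -\infty$. Conversely, when $x \in K$ we have $g(x) - a \le 0$ and $h(x) - b = 0$, so the maximum is attained at $\lambda = 0$ and equals $f(x)$.

Let us denote by $L$ the function:

$$L(x, \lambda, \mu) = f(x) + \lambda\big(g(x) - a\big) + \mu\big(h(x) - b\big)$$

Then we have just shown that the primal problem (2) is equivalent to:

$$\min_{x \in \mathbb{R}^n} \; \max_{\lambda \ge 0,\ \mu \in \mathbb{R}} L(x, \lambda, \mu) \tag{3}$$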
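The claim that maximizing the Lagrangian over $\lambda \ge 0$ and $\mu$ encodes the feasible set can be checked numerically. The sketch below uses a hypothetical two-dimensional instance (the functions `f`, `g`, `h` and the values `a`, `b` are illustrative assumptions, not taken from the notes) and caps the multipliers at a finite value `cap` to mimic "arbitrarily large".

```python
# Hypothetical instance, chosen only for illustration:
#   f(x) = x1^2 + x2^2,  g(x) = x1 <= a,  h(x) = x1 + x2 = b
a, b = 1.0, 1.0

def f(x):
    return x[0] ** 2 + x[1] ** 2

def g(x):
    return x[0]

def h(x):
    return x[0] + x[1]

def lagrangian(x, lam, mu):
    return f(x) + lam * (g(x) - a) + mu * (h(x) - b)

def capped_max(x, cap):
    """Max of L(x, lam, mu) over 0 <= lam <= cap and -cap <= mu <= cap.

    L is affine in (lam, mu), so the maximum over this box is attained
    at one of its four corners.
    """
    return max(lagrangian(x, lam, mu)
               for lam in (0.0, cap)
               for mu in (-cap, cap))

feasible = (0.5, 0.5)          # g = 0.5 <= 1 and h = 1
violates_ineq = (2.0, -1.0)    # g = 2 > 1
violates_eq = (0.0, 0.0)       # h = 0 != 1

for cap in (1e1, 1e3, 1e6):
    print(cap,
          capped_max(feasible, cap),
          capped_max(violates_ineq, cap),
          capped_max(violates_eq, cap))
```

At the feasible point the capped maximum stays equal to `f(x)` whatever the cap, while at the two infeasible points it grows linearly with the cap, which is the finite-precision picture of the value $+\infty$ in Equation (1).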

But you should remember from class that we defined the Lagrange dual function as:

$$g(\lambda, \mu) = \min_{x \in \mathbb{R}^n} L(x, \lambda, \mu)$$

and the dual problem as the following maximization of the Lagrange dual function:

$$\max_{\lambda \ge 0,\ \mu \in \mathbb{R}} g(\lambda, \mu) = \max_{\lambda \ge 0,\ \mu \in \mathbb{R}} \; \min_{x \in \mathbb{R}^n} L(x, \lambda, \mu)$$

Observe the nice symmetry with Equation 3, where the only difference is the order of the min and max operations: the primal and dual problems both optimize the Lagrangian of the problem, in different orders. The weak duality theorem can be re-interpreted in this context as the following inequality:

$$\max_{\lambda \ge 0,\ \mu \in \mathbb{R}} \; \min_{x \in \mathbb{R}^n} L(x, \lambda, \mu) \;\le\; \min_{x \in \mathbb{R}^n} \; \max_{\lambda \ge 0,\ \mu \in \mathbb{R}} L(x, \lambda, \mu)$$

This inequality is true without any assumption on $L$ (hence on $f$, $g$ and $h$). The strong duality theorem states that the inequality is in fact an equality:

$$\max_{\lambda \ge 0,\ \mu \in \mathbb{R}} \; \min_{x \in \mathbb{R}^n} L(x, \lambda, \mu) \;=\; \min_{x \in \mathbb{R}^n} \; \max_{\lambda \ge 0,\ \mu \in \mathbb{R}} L(x, \lambda, \mu)$$

In other words, the order in which the parameters are optimized over does not matter (swapping the min and max operators is allowed). So Slater's condition seen in class can be interpreted as a sufficient condition under which the swapping of the min and max operators is valid.

Remark. The Lagrange dual function $g$, being a (pointwise) minimum of affine functions, is a concave function. Hence the dual problem of a convex minimization problem is a concave maximization problem.

Remark. The interpretation of the strong duality theorem as swapping a min and a max operator is reminiscent of Von Neumann's minimax theorem, which we used to define the value of two-player, zero-sum games. In fact, the strong duality theorem under Slater's condition can be seen as a generalization of Von Neumann's minimax theorem.

2 Taking duals

Let us consider the following convex optimization program:

$$\min_{x \in \mathbb{R}^n} \; -\sum_{i=1}^d \log\big(b_i - a_i^\top x\big) \quad \text{s.t.} \quad a_i^\top x - b_i \le 0, \quad 1 \le i \le d$$

The motivation is to find a feasible point $x$ of the set of linear inequalities $a_i^\top x \le b_i$ (defining a polytope) which is far from the boundary of the polytope. Note that as $a_i^\top x$ gets close to $b_i$, the value of the objective function converges to $+\infty$.

The goal of this section is to derive the dual of the above problem. Simplifying the primal is a good thing to do before computing the dual. Here, we can introduce extra variables $y_i = b_i - a_i^\top x$.
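To make the behaviour of this objective concrete, here is a minimal sketch that evaluates $-\sum_i \log(b_i - a_i^\top x)$ on a hypothetical polytope, the square $-1 \le x_1, x_2 \le 1$. The matrix `A`, the vector `b` and the function name `barrier_objective` are assumptions made for the example.

```python
import numpy as np

# Hypothetical polytope: the square -1 <= x1, x2 <= 1, written as a_i^T x <= b_i.
A = np.array([[1.0, 0.0],
              [-1.0, 0.0],
              [0.0, 1.0],
              [0.0, -1.0]])
b = np.ones(4)

def barrier_objective(x):
    """Return -sum_i log(b_i - a_i^T x), or +inf outside the open polytope."""
    slack = b - A @ x              # the slacks b_i - a_i^T x
    if np.any(slack <= 0):
        return np.inf
    return -np.sum(np.log(slack))

at_center = barrier_objective(np.array([0.0, 0.0]))
near_boundary = barrier_objective(np.array([0.999, 0.0]))
outside = barrier_objective(np.array([2.0, 0.0]))
print(at_center, near_boundary, outside)
```

The value is smallest at the center of the square, increases sharply as the point approaches the face $x_1 = 1$, and is treated as $+\infty$ outside the polytope, so minimizing it pushes $x$ away from the boundary.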

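With the extra variables $y_i = b_i - a_i^\top x$, the original problem can be rewritten:

$$\min_{x \in \mathbb{R}^n,\ y \in \mathbb{R}^d} \; -\sum_{i=1}^d \log(y_i) \quad \text{s.t.} \quad y_i \ge 0, \quad y_i = b_i - a_i^\top x, \quad 1 \le i \le d$$

The advantage is that the arguments of the logs are now simpler. We then rewrite the problem in min-max form by introducing a dual variable $\lambda_i$ for each inequality constraint and a dual variable $\mu_i$ for each equality constraint:

$$\min_{x \in \mathbb{R}^n,\ y \in \mathbb{R}^d} \; \max_{\lambda \ge 0,\ \mu \in \mathbb{R}^d} \; -\sum_{i=1}^d \log(y_i) - \sum_{i=1}^d \lambda_i y_i + \sum_{i=1}^d \mu_i \big(y_i - b_i + a_i^\top x\big)$$

The dual problem is obtained by swapping the min and the max:

$$\max_{\lambda \ge 0,\ \mu \in \mathbb{R}^d} \; \min_{x \in \mathbb{R}^n,\ y \in \mathbb{R}^d} \; -\sum_{i=1}^d \log(y_i) - \sum_{i=1}^d \lambda_i y_i + \sum_{i=1}^d \mu_i \big(y_i - b_i + a_i^\top x\big)$$

We now need to understand the Lagrange dual function:

$$g(\lambda, \mu) = \min_{x \in \mathbb{R}^n,\ y \in \mathbb{R}^d} \; -\sum_{i=1}^d \log(y_i) - \sum_{i=1}^d \lambda_i y_i + \sum_{i=1}^d \mu_i \big(y_i - b_i + a_i^\top x\big)$$

The first thing to note is that if $\sum_{i=1}^d \mu_i a_i \ne 0$, then the minimum is $-\infty$ (we can always make $x_j$ arbitrarily large or small for coordinates $j$ such that $\sum_{i=1}^d \mu_i a_{ij} \ne 0$). When $\sum_{i=1}^d \mu_i a_i = 0$, the function no longer depends on $x$ and we are left with optimizing over $y \in \mathbb{R}^d$. Stationary points are given by:

$$\frac{1}{y_i} = \mu_i - \lambda_i, \quad 1 \le i \le d$$

Substituting back, $-\log(y_i) = \log(\mu_i - \lambda_i)$ and $(\mu_i - \lambda_i)\, y_i = 1$ for each $i$, so the terms in $y$ contribute $\sum_{i=1}^d \log(\mu_i - \lambda_i) + d$. In summary, we have (with $\mu_i > \lambda_i$ so that the logarithms are defined):

$$g(\lambda, \mu) = \begin{cases} \sum_{i=1}^d \log(\mu_i - \lambda_i) - \mu^\top b + d & \text{if } \sum_{i=1}^d \mu_i a_i = 0 \\ -\infty & \text{otherwise} \end{cases}$$

And the dual problem can be rewritten as:

$$\max_{\lambda \ge 0,\ \mu \in \mathbb{R}^d} \; \sum_{i=1}^d \log(\mu_i - \lambda_i) - \mu^\top b + d \quad \text{s.t.} \quad \sum_{i=1}^d \mu_i a_i = 0$$

Finally, we observe that the objective function is decreasing in $\lambda_i$, so we should always take $\lambda_i$ as small as possible, i.e. equal to zero. The dual thus simplifies to:

$$\max_{\mu \in \mathbb{R}^d} \; \sum_{i=1}^d \log \mu_i - \mu^\top b + d \quad \text{s.t.} \quad \sum_{i=1}^d \mu_i a_i = 0$$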
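As a sanity check on the derivation, the primal and the simplified dual can be compared on a tiny instance. The sketch below assumes a hypothetical one-dimensional polytope, the interval $-1 \le x \le 1$ (so $d = 2$, $a_1 = 1$, $a_2 = -1$, $b_1 = b_2 = 1$), and uses a plain grid search rather than a proper solver.

```python
import numpy as np

# Hypothetical instance: constraints x <= 1 and -x <= 1.
a = np.array([1.0, -1.0])
b = np.array([1.0, 1.0])
d = len(b)

def primal_objective(x):
    return -np.sum(np.log(b - a * x))

def dual_objective(mu):
    return np.sum(np.log(mu)) - b @ mu + d

# Primal: grid search over the open interval (-1, 1).
xs = np.linspace(-0.999, 0.999, 1999)
primal_opt = min(primal_objective(x) for x in xs)

# Dual: the constraint mu_1 * a_1 + mu_2 * a_2 = 0 reads mu_1 = mu_2,
# so the feasible dual points are mu = (m, m) with m > 0.
ms = np.linspace(0.01, 5.0, 500)
dual_opt = max(dual_objective(np.array([m, m])) for m in ms)

print(primal_opt, dual_opt)
```

On this instance both optimal values are (numerically) zero, attained at $x = 0$ and $\mu = (1, 1)$, which is consistent with strong duality and with the constant term $d$ in the dual objective.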

3 Stochastic gradient descent

The stochastic gradient descent algorithm is a variant of the gradient descent algorithm where the gradient $\nabla f(x_t)$ of the function at iteration $t$ is replaced by a random variable $g_t$ such that:

$$\mathbb{E}[g_t] = \nabla f(x_t) \quad \text{and} \quad \mathbb{E}\big[\|g_t\|^2\big] \le G^2$$

In other words, $g_t$ is equal to the true gradient in expectation and has bounded expected norm. The algorithm is described in Algorithm 1.

Algorithm 1 Stochastic gradient descent
Require: $x_1$
1: for $t = 1$ to $T$ do
2:     $x_{t+1} \leftarrow x_t - \eta_t g_t$
3: end for
4: return $\bar{x}_T = \frac{1}{T} \sum_{t=1}^T x_t$

Note that, as opposed to standard gradient descent, the gradient is now stochastic and the algorithm is no longer strictly descending (the value of the current solution is not necessarily decreasing). This explains the last line of the algorithm where, instead of returning the solution at the last step, we return the average of all the solutions seen so far to smooth out the noise.

Theorem 1. With step size $\eta_t = \frac{D}{G\sqrt{t}}$, the solution $\bar{x}_T$ returned by Algorithm 1 is such that:

$$\mathbb{E}\big[f(\bar{x}_T)\big] - \min_{x \in \mathbb{R}^n} f(x) \le \frac{3GD}{2\sqrt{T}}$$

Proof. The proof I didn't finish in section was taken from the excellent book Online Convex Optimization by Elad Hazan, whose draft is available online. What is interesting is that the proof is a direct application of the theorem we saw in class giving regret bounds for gradient descent in online convex optimization.

Application to machine learning. At first glance, stochastic gradient descent might not seem very useful: how do we find a $g_t$ equal to the true gradient in expectation? Is it much simpler than computing the true gradient? It turns out that optimization problems coming from machine learning are very amenable to stochastic gradient descent.

A standard problem in machine learning is the following: given a data set $\{(x_i, y_i),\ 1 \le i \le n\}$, the goal is to find a model $f_w$ parametrized by $w$ which fits the data well. The fit of the model at data point $i$ is captured by a loss function $\ell(f_w(x_i), y_i)$. The optimization problem thus takes the following form:

$$\min_w \; \frac{1}{n} \sum_{i=1}^n \ell\big(f_w(x_i), y_i\big)$$

The true gradient of the objective function is given by:

$$\frac{1}{n} \sum_{i=1}^n \nabla_w \ell\big(f_w(x_i), y_i\big) \tag{4}$$
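Algorithm 1 can be sketched in a few lines. The example below is only an illustration under stated assumptions: the objective $f(x) = \frac{1}{2}\|x - c\|^2$, the Gaussian noise added to its gradient, and the constants `D` and `G` are all hypothetical choices (in particular, `D` and `G` are plausible guesses for this toy problem, not verified bounds).

```python
import numpy as np

def sgd(stochastic_grad, x1, T, D, G):
    """Algorithm 1: run T steps with eta_t = D / (G sqrt(t)), return the average iterate."""
    x = np.array(x1, dtype=float)
    running_sum = np.zeros_like(x)
    for t in range(1, T + 1):
        running_sum += x                  # accumulate x_t before it is updated
        eta = D / (G * np.sqrt(t))
        x = x - eta * stochastic_grad(x)  # x_{t+1} = x_t - eta_t * g_t
    return running_sum / T                # average of x_1, ..., x_T

# Toy problem (assumption): f(x) = 0.5 * ||x - c||^2, minimized at x = c.
rng = np.random.default_rng(0)
c = np.array([1.0, -2.0])

def noisy_grad(x):
    # True gradient (x - c) plus zero-mean noise, so E[g_t] equals the gradient.
    return (x - c) + rng.normal(size=x.shape)

x_bar = sgd(noisy_grad, x1=np.zeros(2), T=5000, D=3.0, G=4.0)
gap = 0.5 * np.sum((x_bar - c) ** 2)      # f(x_bar) - min f
print(x_bar, gap)
```

Individual iterates keep fluctuating because of the noise, but the averaged iterate returned on the last line ends up close to the minimizer `c`, which is the smoothing effect described above.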

Computing the true gradient of Equation 4 requires iterating over the $n$ data points, which is prohibitive for very large datasets. Instead, consider $g_w$ computed by the following algorithm:

Algorithm 2 Stochastic gradient
Require: $w$
1: $i \leftarrow$ pick uniformly at random between $1$ and $n$
2: return $g_w = \nabla_w \ell\big(f_w(x_i), y_i\big)$

$g_w$ is now a random vector. Each $i$ is chosen with probability $\frac{1}{n}$, hence the expected value of $g_w$ is given by:

$$\mathbb{E}[g_w] = \frac{1}{n} \sum_{i=1}^n \nabla_w \ell\big(f_w(x_i), y_i\big)$$

which is exactly the gradient computed in Equation 4. So $g_w$ can be used as a stochastic gradient in the stochastic gradient descent algorithm. Note that now, each iteration of the gradient descent only requires accessing one data point of the dataset, which induces huge performance gains!
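The unbiasedness of Algorithm 2 can be verified directly on a small example. The sketch below assumes a linear model $f_w(x) = w^\top x$ with squared loss $\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$ and synthetic data; the model, the loss and all names are illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))                       # synthetic features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def sample_gradient(w, i):
    """Gradient of 0.5 * (w^T x_i - y_i)^2 with respect to w."""
    return (X[i] @ w - y[i]) * X[i]

def full_gradient(w):
    """Equation 4: average of the per-sample gradients, touching all n points."""
    return X.T @ (X @ w - y) / n

def stochastic_gradient(w):
    """Algorithm 2: gradient at one uniformly chosen data point."""
    i = rng.integers(n)                           # uniform on {0, ..., n-1}
    return sample_gradient(w, i)

w = np.zeros(p)
# Under uniform sampling, E[g_w] is the plain average over i.
expected_g = np.mean([sample_gradient(w, i) for i in range(n)], axis=0)
print(np.allclose(expected_g, full_gradient(w)))
```

Each call to `stochastic_gradient` reads a single row of `X`, whereas `full_gradient` reads all `n` rows, which is where the per-iteration savings come from.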
