
CDS270, Maryam Fazel

Lecture 2: Topics from Optimization and Duality

- motivation: Network Utility Maximization (NUM) problem (application: Internet congestion control)
- basic unconstrained minimization
- duality theory
- Lagrange dual of NUM, decentralization

Motivation

network utility maximization (NUM) problem: consider a network with $S$ sources (users), each sending one flow at rate $x_s$ through path $L(s)$, with utility $U_s(x_s)$. there are $L$ links, each one with capacity $c_l$, shared by the sources in $S(l)$.

[figure: example network with source rates $x_1, \dots, x_4$ and link capacities $c_1, \dots, c_4$]

$$\text{maximize} \quad \sum_{s=1}^{S} U_s(x_s) \qquad \text{subject to} \quad \sum_{s \in S(l)} x_s \le c_l, \quad l = 1, \dots, L$$

optimization variable: $x \in \mathbf{R}^S$ (domain of $U_s$ is $x_s \ge 0$)

when $U_s(x_s)$ are concave functions, NUM is a convex optimization problem

- seminal paper [Kelly, Maulloo, Tan 98]
- a large class of congestion control mechanisms can be interpreted as distributed algorithms that solve NUM and its dual
- NUM useful as framework to understand network equilibrium and dynamics
- also a central problem in economics
- important foundation of data networks: layered structure (APP/TCP/IP/MAC/PHY); different parts of NUM relate to different layers, e.g., varying $R$, $c$ would involve IP and PHY layers (more later...)

TCP/AQM congestion control

[figure: TCP (Reno, Vegas, ...) and AQM (RED, DropTail, ...) blocks exchanging $x_s$ and $\lambda_l$]

- $x_s$: source rate, updated by TCP (Transmission Control Protocol)
- $\lambda_l$: link congestion measure, or 'price', updated by AQM (Active Queue Management)
- e.g., TCP Reno uses packet loss as congestion measure, TCP Vegas uses queueing delay

Topics we cover

- basic unconstrained minimization, gradient descent
- duality theory
  - Lagrange dual problem and its properties
  - strong duality and Slater's condition
  - interpretations of duality
  - KKT optimality conditions
  - theorems of alternatives
- Lagrange dual of NUM, decentralization

refs:
- Boyd & Vandenberghe, Convex Optimization, 2004. www.stanford.edu/~boyd/cvxbook. Chapters 5 and 9.1-9.3.
- Bertsekas, Nonlinear Programming, 1999. Chapter 1.

Unconstrained minimization

$$\text{minimize} \quad f(x)$$

$f : \mathbf{R}^n \to \mathbf{R}$, convex, differentiable

- minimizing sequence: $x^{(k)}$, $k \to \infty$, with $f(x^{(k)}) \to f^*$
- optimality condition: $\nabla f(x^*) = 0$, a set of nonlinear equations; usually no analytical solution
- more generally, if $\nabla^2 f(x) \succeq mI$, $m > 0$, then
  $$f(x) - f^* \le \frac{1}{2m} \|\nabla f(x)\|^2$$
  ... yields stopping criterion (if you know $m$)

Descent methods

given starting point $x \in \operatorname{dom} f$
repeat
  1. Compute a search direction $v$
  2. Line search. Choose step size $t > 0$
  3. Update. $x := x + tv$
until stopping criterion is satisfied

- descent method: $f(x^{(k+1)}) < f(x^{(k)})$
- since $f$ convex, $v$ must be a descent direction: $\nabla f(x^{(k)})^T v^{(k)} < 0$

[figure: level curves $f(x) = c_1$ and $f(x) = c_2$, with the point $x^{(k)}$, the gradient $\nabla f(x^{(k)})$ and a descent direction $v^{(k)}$]

choose $v$ (descent direction)
- $v^{(k)} = -\nabla f(x^{(k)})$ (gradient)
- $v^{(k)} = -H^{(k)} \nabla f(x^{(k)})$, $H^{(k)} = H^{(k)T} \succ 0$
- $v^{(k)} = -\nabla^2 f(x^{(k)})^{-1} \nabla f(x^{(k)})$ (Newton's)

choose $t$ (step size)
- exact line search: $t = \operatorname{argmin}_{s > 0} f(x + sv)$
- backtracking line search (picks step size that reduces $f$ 'enough')
- constant step size

(main idea extends to nondifferentiable $f$ using subgradients)
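The descent loop above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: it uses the negative gradient as search direction and a backtracking line search, and the parameters `alpha`, `beta`, `tol`, `max_iter` as well as the test function are assumptions chosen for the example.

```python
def backtracking_descent(f, grad, x, alpha=0.3, beta=0.5, tol=1e-6, max_iter=2000):
    """Gradient descent with backtracking line search (illustrative sketch)."""
    for _ in range(max_iter):
        g = grad(x)
        gnorm2 = sum(gi * gi for gi in g)
        if gnorm2 <= tol * tol:              # stopping criterion: small gradient norm
            break
        v = [-gi for gi in g]                # search direction: negative gradient
        fx = f(x)
        t = 1.0
        # shrink t until f decreases by at least alpha * t * ||grad f||^2
        while f([xi + t * vi for xi, vi in zip(x, v)]) > fx - alpha * t * gnorm2:
            t *= beta
        x = [xi + t * vi for xi, vi in zip(x, v)]   # update x := x + t v
    return x


# hypothetical test function: f(x) = (x1 - 1)^2 + 10 (x2 + 2)^2, minimized at (1, -2)
f = lambda x: (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2
grad = lambda x: [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]
x_min = backtracking_descent(f, grad, [0.0, 0.0])
```

Swapping in another direction rule (for example a scaled gradient or Newton step) only changes the line that computes `v`; the sufficient-decrease test then uses $-\nabla f(x)^T v$ in place of `gnorm2`.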

Gradient descent method

given starting point $x \in \operatorname{dom} f$
repeat
  1. Compute search direction $v = -\nabla f(x)$
  2. Line search. Choose step size $t$
  3. Update. $x := x + tv$
until stopping criterion is satisfied

- converges with exact or backtracking line search
- a simple (but conservative) convergence condition for $t$: suppose $f$ has bounded curvature, $\nabla^2 f(x) \preceq MI$; there exists $z$ on the line segment $[x, x + tv]$ s.t.
  $$f(x + tv) - f(x) = \nabla f(x)^T (tv) + \tfrac{1}{2} (tv)^T \nabla^2 f(z) (tv) \le -t \|\nabla f(x)\|^2 + \tfrac{1}{2} t^2 M \|\nabla f(x)\|^2,$$
  then $-t + \tfrac{1}{2} t^2 M < 0$, or $t < 2/M$, guarantees descent

Example

$$\text{minimize} \quad \tfrac{1}{2} (x_1^2 + M x_2^2)$$

where $M > 0$; optimal point is $x^* = 0$

- use exact line search; start at $x^{(0)} = (M, 1)$

[figure: iterates of gradient descent with exact line search on the level curves of the quadratic]

(condition number $\kappa = \max\{M, 1/M\}$ determines convergence rate)
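The role of the condition number in this example can be checked numerically. The sketch below runs gradient descent with exact line search on $f(x) = \tfrac{1}{2}(x_1^2 + M x_2^2)$ from $x^{(0)} = (M, 1)$; the value $M = 10$ and the iteration count are assumptions for illustration. For this particular starting point the objective shrinks by the constant factor $((M-1)/(M+1))^2$ per step, which is close to 1 when $M$ is far from 1.

```python
def exact_line_search_gd(M, num_iters):
    """Gradient descent with exact line search on f(x) = 0.5*(x1^2 + M*x2^2)."""
    f = lambda x1, x2: 0.5 * (x1 * x1 + M * x2 * x2)
    x1, x2 = float(M), 1.0                 # starting point x^(0) = (M, 1)
    values = [f(x1, x2)]
    for _ in range(num_iters):
        g1, g2 = x1, M * x2                # gradient of f at the current point
        # for a quadratic with Hessian Q = diag(1, M), the exact minimizer of
        # f(x - t*g) over t is t = g'g / g'Qg
        t = (g1 * g1 + g2 * g2) / (g1 * g1 + M * g2 * g2)
        x1, x2 = x1 - t * g1, x2 - t * g2
        values.append(f(x1, x2))
    return values


M = 10.0                                   # assumed value for the illustration
vals = exact_line_search_gd(M, 20)
ratios = [vals[k + 1] / vals[k] for k in range(20)]
predicted = ((M - 1.0) / (M + 1.0)) ** 2   # per-step reduction factor of f
```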

notes:
- convergence can be very slow, depends critically on condition number
- Newton's method (and variations) have much better convergence properties
- however, in networking problems, gradient descent proves very useful, since main concern here is decentralization... (details later)

how about constrained optimization problems? next topic: Lagrangian duality

Lagrangian

standard form optimization problem

$$\text{minimize} \quad f_0(x) \qquad \text{subject to} \quad f_i(x) \le 0, \quad i = 1, \dots, m$$

- optimal value $p^*$, domain $D$
- called primal problem (in context of duality)
- (for now) we don't assume convexity

Lagrangian $L : \mathbf{R}^{n+m} \to \mathbf{R}$

$$L(x, \lambda) = f_0(x) + \lambda_1 f_1(x) + \cdots + \lambda_m f_m(x)$$

- $\lambda_i$ called Lagrange multipliers or dual variables
- objective is augmented with weighted sum of constraint functions

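Lagrange dual function

(Lagrange) dual function $g : \mathbf{R}^m \to \mathbf{R} \cup \{-\infty\}$

$$g(\lambda) = \inf_x L(x, \lambda) = \inf_x \left( f_0(x) + \lambda_1 f_1(x) + \cdots + \lambda_m f_m(x) \right)$$

- minimum of augmented cost as function of weights
- can be $-\infty$ for some $\lambda$
- $g$ is concave (even if $f_i$ not convex!)

example: LP

$$\text{minimize} \quad c^T x \qquad \text{subject to} \quad a_i^T x - b_i \le 0, \quad i = 1, \dots, m$$

$$L(x, \lambda) = c^T x + \sum_{i=1}^m \lambda_i (a_i^T x - b_i) = -b^T \lambda + (A^T \lambda + c)^T x$$

hence

$$g(\lambda) = \begin{cases} -b^T \lambda & \text{if } A^T \lambda + c = 0 \\ -\infty & \text{otherwise} \end{cases}$$

Lower bound property

if $\lambda \succeq 0$ and $x$ is primal feasible, then $g(\lambda) \le f_0(x)$

proof: if $f_i(x) \le 0$ and $\lambda_i \ge 0$,

$$f_0(x) \ge f_0(x) + \sum_i \lambda_i f_i(x) \ge \inf_z \left( f_0(z) + \sum_i \lambda_i f_i(z) \right) = g(\lambda)$$

- $f_0(x) - g(\lambda)$ is called the duality gap of (primal feasible) $x$ and $\lambda \succeq 0$
- minimize over primal feasible $x$ to get, for any $\lambda \succeq 0$, $g(\lambda) \le p^*$
- $\lambda \in \mathbf{R}^m$ is dual feasible if $\lambda \succeq 0$ and $g(\lambda) > -\infty$
- dual feasible points yield lower bounds on optimal value!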
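The lower bound property for the LP example can be checked on a tiny instance. The data `A`, `b`, `c` and the sample points below are made up for illustration (three constraints $x_1 \le 1$, $x_2 \le 2$, $x_1 + x_2 \le 2.5$, objective $-x_1 - x_2$, so $p^* = -2.5$); the sketch only evaluates the dual function formula and compares it with primal objective values.

```python
# hypothetical LP data: minimize c'x subject to A x <= b
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # rows are a_i'
b = [1.0, 2.0, 2.5]
c = [-1.0, -1.0]


def f0(x):
    """Primal objective c'x."""
    return sum(cj * xj for cj, xj in zip(c, x))


def is_primal_feasible(x, tol=1e-12):
    return all(sum(aij * xj for aij, xj in zip(row, x)) <= bi + tol
               for row, bi in zip(A, b))


def g(lam, tol=1e-12):
    """LP dual function: -b'lam if A'lam + c = 0, and -infinity otherwise."""
    resid = [sum(A[i][j] * lam[i] for i in range(len(A))) + c[j]
             for j in range(len(c))]
    if any(abs(r) > tol for r in resid):
        return float("-inf")
    return -sum(bi * li for bi, li in zip(b, lam))


feasible_points = [[0.0, 0.0], [1.0, 1.5], [0.5, 2.0]]
dual_points = [[1.0, 1.0, 0.0], [0.5, 0.5, 0.5], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

# every lam >= 0 gives a lower bound on every feasible objective value
bounds_hold = all(g(lam) <= f0(x)
                  for lam in dual_points
                  for x in feasible_points if is_primal_feasible(x))
```

Here `[0.0, 0.0, 1.0]` attains the bound ($g = -2.5 = p^*$), while `[1.0, 0.0, 0.0]` violates $A^T \lambda + c = 0$ and only gives the trivial bound $-\infty$.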

Lagrange dual problem

let's find best lower bound on $p^*$:

$$\text{maximize} \quad g(\lambda) \qquad \text{subject to} \quad \lambda \succeq 0$$

- called (Lagrange) dual problem (associated with primal problem)
- always a convex problem, even if primal isn't!
- optimal value denoted $d^*$
- we always have $d^* \le p^*$ (called weak duality)
- $p^* - d^*$ is optimal duality gap

Strong duality & constraint qualifications

- for convex problems, we (usually) have strong duality: $d^* = p^*$
- (strong duality does not hold, in general, for nonconvex problems)
- when strong duality holds, dual optimal $\lambda^*$ serves as certificate of optimality for primal optimal point $x^*$
- many conditions or constraint qualifications guarantee strong duality for convex problems
- Slater's condition: if primal problem is strictly feasible (and convex), i.e., there exists $x \in \operatorname{relint} D$ with
  $$f_i(x) < 0, \quad i = 1, \dots, m,$$
  then we have $p^* = d^*$

Dual of linear program

(primal) LP

$$\text{minimize} \quad c^T x \qquad \text{subject to} \quad Ax \preceq b$$

- $n$ variables, $m$ inequality constraints

dual of LP is (after making implicit equality constraints explicit)

$$\text{maximize} \quad -b^T \lambda \qquad \text{subject to} \quad A^T \lambda + c = 0, \quad \lambda \succeq 0$$

- dual of LP is also an LP (in std LP format)
- $m$ variables, $n$ equality constraints, $m$ nonnegativity constraints

for LP we have strong duality except in one (pathological) case: primal and dual both infeasible ($p^* = +\infty$, $d^* = -\infty$)

Dual of quadratic program

primal QP

$$\text{minimize} \quad x^T P x \qquad \text{subject to} \quad Ax \preceq b$$

we assume $P \succ 0$ for simplicity

Lagrangian is $L(x, \lambda) = x^T P x + \lambda^T (Ax - b)$

$\nabla_x L(x, \lambda) = 0$ yields $x = -(1/2) P^{-1} A^T \lambda$, hence dual function is

$$g(\lambda) = -(1/4) \lambda^T A P^{-1} A^T \lambda - b^T \lambda$$

- concave quadratic function
- all $\lambda \succeq 0$ are dual feasible

dual of QP is

$$\text{maximize} \quad -(1/4) \lambda^T A P^{-1} A^T \lambda - b^T \lambda \qquad \text{subject to} \quad \lambda \succeq 0$$

... another QP
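The QP dual can be exercised on a small made-up instance: $P = \operatorname{diag}(1, 2)$ with the single constraint $-x_1 - x_2 \le -1$. The sketch below maximizes $g(\lambda)$ over $\lambda \succeq 0$ by projected gradient ascent and recovers $x$ from the minimizer of the Lagrangian. The data, the step size 0.5 and the iteration count are assumptions; the code relies on $P$ being diagonal so that no matrix inverse routine is needed.

```python
p_diag = [1.0, 2.0]          # hypothetical P = diag(1, 2), positive definite
A = [[-1.0, -1.0]]           # single constraint -x1 - x2 <= -1
b = [-1.0]


def lagrangian_minimizer(lam):
    """x(lam) = -(1/2) P^{-1} A' lam, valid here because P is diagonal."""
    return [-0.5 * sum(A[i][j] * lam[i] for i in range(len(A))) / p_diag[j]
            for j in range(len(p_diag))]


def residuals(x):
    """Constraint values A x - b."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) - b[i]
            for i in range(len(A))]


def dual_function(lam):
    """g(lam) = L(x(lam), lam)."""
    x = lagrangian_minimizer(lam)
    quad = sum(p * xj * xj for p, xj in zip(p_diag, x))
    return quad + sum(li * ri for li, ri in zip(lam, residuals(x)))


lam = [0.0]
for _ in range(200):
    # the gradient of g at lam is the constraint residual at x(lam)
    r = residuals(lagrangian_minimizer(lam))
    lam = [max(0.0, li + 0.5 * ri) for li, ri in zip(lam, r)]   # project onto lam >= 0

x_star = lagrangian_minimizer(lam)
primal_value = sum(p * xj * xj for p, xj in zip(p_diag, x_star))
dual_value = dual_function(lam)
```

For this instance the iteration settles at $\lambda = 4/3$, $x = (2/3, 1/3)$, and the primal and dual values agree at $2/3$, as strong duality predicts.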

Min-max & saddle-point interpretation

can express primal and dual problems in a more symmetric form:

$$\sup_{\lambda \succeq 0} \left( f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) \right) = \begin{cases} f_0(x) & \text{if } f_i(x) \le 0, \ i = 1, \dots, m \\ +\infty & \text{otherwise,} \end{cases}$$

so $p^* = \inf_x \sup_{\lambda \succeq 0} L(x, \lambda)$.

also by definition, $d^* = \sup_{\lambda \succeq 0} \inf_x L(x, \lambda)$.

- weak duality can be expressed as
  $$\sup_{\lambda \succeq 0} \inf_x L(x, \lambda) \le \inf_x \sup_{\lambda \succeq 0} L(x, \lambda)$$
- strong duality when equality holds. means can switch the order of minimization over $x$ and maximization over $\lambda \succeq 0$.
- if $x^*$ and $\lambda^*$ are primal and dual optimal and strong duality holds, they form a saddle-point for the Lagrangian (converse also true).

Economic (price) interpretation

$$\text{minimize} \quad f_0(x) \qquad \text{subject to} \quad f_i(x) \le 0, \quad i = 1, \dots, m.$$

- $f_0(x)$ is cost of operating firm at operating condition $x$; constraints give resource limits.
- suppose:
  - can violate $f_i(x) \le 0$ by paying additional cost of $\lambda_i$ (in dollars per unit violation), i.e., incur cost $\lambda_i f_i(x)$ if $f_i(x) > 0$
  - can sell unused portion of $i$th constraint at same price, i.e., gain profit $-\lambda_i f_i(x)$ if $f_i(x) < 0$
- total cost to firm to operate at $x$ at constraint prices $\lambda_i$:
  $$L(x, \lambda) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x)$$

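interpretations:

- dual function: $g(\lambda) = \inf_x L(x, \lambda)$ is optimal cost to firm at constraint prices $\lambda_i$
- weak duality: cost can be lowered if firm allowed to pay for violated constraints (and get paid for non-tight ones)
- duality gap: advantage to firm under this scenario
- strong duality: $\lambda^*$ give prices for which firm has no advantage in being allowed to violate constraints...
- dual optimal $\lambda^*$ called shadow prices for original problem

Geometric interpretation of duality

consider set

$$A = \{ (u, t) \in \mathbf{R}^{m+1} \mid \exists x, \ f_i(x) \le u_i, \ f_0(x) \le t \}$$

- $A$ is convex if $f_i$ are
- for $\lambda \succeq 0$,
  $$g(\lambda) = \inf \left\{ \begin{bmatrix} \lambda \\ 1 \end{bmatrix}^T \begin{bmatrix} u \\ t \end{bmatrix} \ \middle|\ \begin{bmatrix} u \\ t \end{bmatrix} \in A \right\}$$

[figure: the set $A$ in the $(u, t)$ plane with the supporting hyperplane $t + \lambda^T u = g(\lambda)$, whose normal is $(\lambda, 1)$ and which crosses the $t$-axis at $g(\lambda)$]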

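(Idea of) proof of Slater's theorem

problem convex, strictly feasible $\Longrightarrow$ strong duality

[figure: the set $A$ in the $(u, t)$ plane with a supporting hyperplane at $(0, p^*)$ whose normal is $(\lambda^*, 1)$]

- $(0, p^*)$ lies on the boundary of $A$, so there is a supporting hyperplane at $(0, p^*)$:
  $$(u, t) \in A \Longrightarrow \mu_0 (t - p^*) + \mu^T u \ge 0, \qquad \mu_0 \ge 0, \quad \mu \succeq 0, \quad (\mu, \mu_0) \ne 0$$
- strong duality $\Longleftrightarrow$ there is a supporting hyperplane with $\mu_0 > 0$: for $\lambda^* = \mu / \mu_0$, we have
  $$p^* \le t + \lambda^{*T} u \quad \forall (u, t) \in A, \qquad \text{hence} \quad p^* \le g(\lambda^*)$$
- Slater's condition: there exists $(u, t) \in A$ with $u \prec 0$; implies that all supporting hyperplanes at $(0, p^*)$ are non-vertical ($\mu_0 > 0$)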

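Sensitivity analysis via duality

define $p^*(u)$ as the optimal value of

$$\text{minimize} \quad f_0(x) \qquad \text{subject to} \quad f_i(x) \le u_i, \quad i = 1, \dots, m$$

[figure: the epigraph $\operatorname{epi} p^*$ over $u$, lying above the line $p^*(0) - \lambda^{*T} u$, which touches it at $u = 0$]

$\lambda^*$ gives lower bound on $p^*(u)$:

$$p^*(u) \ge p^*(0) - \sum_{i=1}^m \lambda_i^* u_i$$

- if $\lambda_i^*$ large: $u_i < 0$ greatly increases $p^*$
- if $\lambda_i^*$ small: $u_i > 0$ does not decrease $p^*$ too much

if $p^*(u)$ is differentiable, $\lambda_i^* = -\dfrac{\partial p^*(0)}{\partial u_i}$

$\lambda_i^*$ is sensitivity of $p^*$ w.r.t. $i$th constraint

Complementary slackness

suppose $x^*$, $\lambda^*$ are primal, dual feasible with zero duality gap (hence, they are primal, dual optimal)

$$f_0(x^*) = g(\lambda^*) = \inf_x \left( f_0(x) + \sum_{i=1}^m \lambda_i^* f_i(x) \right) \le f_0(x^*) + \sum_{i=1}^m \lambda_i^* f_i(x^*) \le f_0(x^*)$$

hence we have $\sum_{i=1}^m \lambda_i^* f_i(x^*) = 0$, and so

$$\lambda_i^* f_i(x^*) = 0, \quad i = 1, \dots, m$$

- called complementary slackness condition
- $i$th constraint inactive at optimum $\Longrightarrow$ $\lambda_i^* = 0$
- $\lambda_i^* > 0$ at optimum $\Longrightarrow$ $i$th constraint active at optimum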

KKT optimality conditions

suppose
- $f_i$ are differentiable
- $x^*$, $\lambda^*$ are (primal, dual) optimal, with zero duality gap

by complementary slackness we have

$$f_0(x^*) + \sum_i \lambda_i^* f_i(x^*) = \inf_x \left( f_0(x) + \sum_i \lambda_i^* f_i(x) \right)$$

i.e., $x^*$ minimizes $L(x, \lambda^*)$; therefore

$$\nabla f_0(x^*) + \sum_i \lambda_i^* \nabla f_i(x^*) = 0$$

so if $x^*$, $\lambda^*$ are (primal, dual) optimal, with zero duality gap, they satisfy

$$f_i(x^*) \le 0, \qquad \lambda_i^* \ge 0, \qquad \lambda_i^* f_i(x^*) = 0, \qquad \nabla f_0(x^*) + \sum_i \lambda_i^* \nabla f_i(x^*) = 0$$

the Karush-Kuhn-Tucker (KKT) optimality conditions

conversely, if the problem is convex and $x^*$, $\lambda^*$ satisfy KKT, then they are (primal, dual) optimal
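A one-variable worked example may help tie the KKT conditions to complementary slackness and sensitivity. The problem below is made up for illustration and is not from the lecture.

```latex
% illustrative problem (assumption): minimize x^2 subject to 1 - x <= 0
\begin{align*}
  &\text{minimize } x^2 \quad \text{subject to } f_1(x) = 1 - x \le 0
    && \Rightarrow\ x^* = 1,\ p^* = 1 \\
  &L(x, \lambda) = x^2 + \lambda (1 - x), \qquad
    \nabla_x L = 2x - \lambda = 0 \ \Rightarrow\ x = \lambda / 2 \\
  &g(\lambda) = \lambda - \tfrac{1}{4} \lambda^2
    && \Rightarrow\ \lambda^* = 2,\ d^* = 1 = p^* \\
  &\text{KKT at } (x^*, \lambda^*) = (1, 2): \quad
    1 - x^* = 0 \le 0, \quad \lambda^* = 2 \ge 0, \quad
    \lambda^* (1 - x^*) = 0, \quad 2 x^* - \lambda^* = 0 \\
  &\text{perturbed problem } 1 - x \le u \ (u \le 1): \quad
    p^*(u) = (1 - u)^2, \qquad
    \frac{\partial p^*(0)}{\partial u} = -2 = -\lambda^*
\end{align*}
```

The constraint is active at the optimum and its multiplier is positive, consistent with complementary slackness; the derivative of $p^*(u)$ at $u = 0$ matches $-\lambda^*$.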

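Generalized inequalities

$$\text{minimize} \quad f_0(x) \qquad \text{subject to} \quad f_i(x) \preceq_{K_i} 0, \quad i = 1, \dots, L$$

- $\preceq_{K_i}$ are generalized inequalities on $\mathbf{R}^{m_i}$
- $f_i : \mathbf{R}^n \to \mathbf{R}^{m_i}$ are $K_i$-convex

Lagrangian $L : \mathbf{R}^n \times \mathbf{R}^{m_1} \times \cdots \times \mathbf{R}^{m_L} \to \mathbf{R}$,

$$L(x, \lambda_1, \dots, \lambda_L) = f_0(x) + \lambda_1^T f_1(x) + \cdots + \lambda_L^T f_L(x)$$

dual function

$$g(\lambda_1, \dots, \lambda_L) = \inf_x \left( f_0(x) + \lambda_1^T f_1(x) + \cdots + \lambda_L^T f_L(x) \right)$$

$\lambda_i$ dual feasible if $\lambda_i \succeq_{K_i^*} 0$, $g(\lambda_1, \dots, \lambda_L) > -\infty$

lower bound property: if $x$ primal feasible and $(\lambda_1, \dots, \lambda_L)$ is dual feasible, then

$$g(\lambda_1, \dots, \lambda_L) \le f_0(x)$$

(hence, $g(\lambda_1, \dots, \lambda_L) \le p^*$)

dual problem

$$\text{maximize} \quad g(\lambda_1, \dots, \lambda_L) \qquad \text{subject to} \quad \lambda_i \succeq_{K_i^*} 0, \quad i = 1, \dots, L$$

- weak duality: $d^* \le p^*$ always
- strong duality: $d^* = p^*$ usually
- Slater condition: if primal is strictly feasible, i.e.,
  $$\exists x \in \operatorname{relint} D : \ f_i(x) \prec_{K_i} 0, \quad i = 1, \dots, L,$$
  then $d^* = p^*$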

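Example: semidefinite programming

$$\text{minimize} \quad c^T x \qquad \text{subject to} \quad F_0 + x_1 F_1 + \cdots + x_n F_n \preceq 0$$

Lagrangian (multiplier $Z = Z^T \in \mathbf{R}^{m \times m}$)

$$L(x, Z) = c^T x + \operatorname{Tr} Z (F_0 + x_1 F_1 + \cdots + x_n F_n)$$

dual function

$$g(Z) = \inf_x \left( c^T x + \operatorname{Tr} Z (F_0 + x_1 F_1 + \cdots + x_n F_n) \right) = \begin{cases} \operatorname{Tr} F_0 Z & \text{if } \operatorname{Tr} F_i Z + c_i = 0, \ i = 1, \dots, n \\ -\infty & \text{otherwise} \end{cases}$$

dual problem

$$\text{maximize} \quad \operatorname{Tr} F_0 Z \qquad \text{subject to} \quad \operatorname{Tr} F_i Z + c_i = 0, \ i = 1, \dots, n, \quad Z = Z^T \succeq 0$$

strong duality holds if there exists $x$ with $F_0 + x_1 F_1 + \cdots + x_n F_n \prec 0$

Theorem of alternatives

apply Lagrange duality theory to the feasibility problem:

$f_1, \dots, f_m$ convex with $\operatorname{dom} f_i = \mathbf{R}^n$

exactly one of the following is true:
  1. there exists $x$ with $f_i(x) < 0$, $i = 1, \dots, m$
  2. there exists $\lambda \succeq 0$, $\lambda \ne 0$, with $g(\lambda) = \inf_x \left( \lambda_1 f_1(x) + \cdots + \lambda_m f_m(x) \right) \ge 0$

- called alternatives
- use in practice: $\lambda$ that satisfies 2nd condition proves $f_i(x) < 0$ is infeasible

example: $f_i(x) = a_i^T x - b_i$
  1. there exists $x$ with $Ax \prec b$
  2. there exists $\lambda \succeq 0$, $\lambda \ne 0$, with $b^T \lambda \le 0$, $A^T \lambda = 0$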
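For the linear-inequality example, checking a candidate certificate $\lambda$ is a few arithmetic tests. The sketch below is illustrative only: the function name `certifies_infeasibility`, the tolerance, and both small systems are assumptions. The first system asks for $x_1 > 0$, $x_2 > 0$ and $x_1 + x_2 < -1$, which is clearly infeasible.

```python
def certifies_infeasibility(A, b, lam, tol=1e-12):
    """Check the 2nd alternative: lam >= 0, lam != 0, A'lam = 0, b'lam <= 0."""
    m, n = len(A), len(A[0])
    nonneg = all(li >= 0.0 for li in lam)
    nonzero = any(li > 0.0 for li in lam)
    at_lam = [sum(A[i][j] * lam[i] for i in range(m)) for j in range(n)]
    bt_lam = sum(bi * li for bi, li in zip(b, lam))
    return (nonneg and nonzero
            and all(abs(v) <= tol for v in at_lam)
            and bt_lam <= tol)


# hypothetical infeasible system A x < b:  x1 + x2 < -1,  -x1 < 0,  -x2 < 0
A_bad = [[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_bad = [-1.0, 0.0, 0.0]
proved_infeasible = certifies_infeasibility(A_bad, b_bad, [1.0, 1.0, 1.0])

# hypothetical feasible system:  x1 + x2 < 1,  -x1 < 0,  -x2 < 0
# (the same lam fails here because b'lam = 1 > 0)
b_ok = [1.0, 0.0, 0.0]
wrongly_certified = certifies_infeasibility(A_bad, b_ok, [1.0, 1.0, 1.0])
```

A failed check for one particular $\lambda$ does not by itself prove feasibility; it only means that this $\lambda$ is not a certificate.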

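proof. from convex duality:

primal problem (variables $x$, $t$)

$$\text{minimize} \quad t \qquad \text{subject to} \quad f_i(x) \le t, \quad i = 1, \dots, m$$

dual problem

$$\text{maximize} \quad g(\lambda) \qquad \text{subject to} \quad \lambda \succeq 0, \quad \mathbf{1}^T \lambda = 1$$

Slater's condition is satisfied, hence $p^* = d^*$

- 1st alternative $\Longleftrightarrow$ $p^* < 0$
- 2nd alternative $\Longleftrightarrow$ $p^* \ge 0$

Dual of NUM

back to the Network Utility Maximization problem we saw before,

primal:

$$\text{maximize} \quad \sum_s U_s(x_s) \qquad \text{subject to} \quad Rx \preceq c,$$

where $R$ is the routing matrix.

Lagrangian:

$$L(x, \lambda) = \sum_s U_s(x_s) + \sum_l \lambda_l \Big( c_l - \sum_s R_{ls} x_s \Big) = \sum_s \Big[ U_s(x_s) - \Big( \sum_l R_{ls} \lambda_l \Big) x_s \Big] + \sum_l c_l \lambda_l$$

$$D(\lambda) = \max_x L(x, \lambda) = \sum_s \max_{x_s} L_s(x_s, \lambda^s) + \sum_l c_l \lambda_l$$

where $L_s(x_s, \lambda^s) = U_s(x_s) - \lambda^s x_s$ and $\lambda^s = \sum_l R_{ls} \lambda_l$ is end-to-end path price for source $s$.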

dual:

$$\text{minimize} \quad \sum_s \max_{x_s \ge 0} \left( U_s(x_s) - \lambda^s x_s \right) + \sum_l c_l \lambda_l \qquad \text{subject to} \quad \lambda \succeq 0,$$

- additivity of total utility and flow constraints lead to decomposition into individual source terms: a form of dual decomposition
- this decomposition is called 'horizontal': across users in network
- each source $x_s$ can keep its utility private
- each source only needs to know $\lambda^s$ to solve subproblem $\max_{x_s \ge 0} (U_s(x_s) - \lambda^s x_s)$
- optimal prices $\lambda^*$ serve as a coordinating signal that aligns individual (selfish) optimality with social optimality

dual-based distributed algorithm

one way to solve iteratively: use (a variation on) gradient method for dual

source algorithm: single-variable optimization

$$x_s^*(\lambda^s) = \operatorname{argmax}_{x_s \ge 0} \Big[ U_s(x_s) - \Big( \sum_l R_{ls} \lambda_l(k) \Big) x_s \Big], \quad \forall s.$$

- if $U_s$ are strictly concave, twice continuously differentiable, $x_s^*(\lambda^s) = U_s'^{-1}(\lambda^s)$
- called demand function in microeconomics
- more later...

link algorithm:

if $U_s$ strictly concave, twice continuously differentiable,

$$\frac{\partial D}{\partial \lambda_l}(\lambda(k)) = c_l - \sum_s R_{ls} x_s^*(\lambda(k))$$

use gradient projection method:

$$\lambda_l(k+1) = \Big[ \lambda_l(k) - \alpha(k) \Big( c_l - \sum_s R_{ls} x_s^*(\lambda(k)) \Big) \Big]^+, \quad \forall l$$

- $\alpha(k)$ is step size, certain choices guarantee $\lambda(k) \to \lambda^*$ and $x^*(\lambda(k)) \to x^*$
- interpretation: balancing supply and demand through pricing
- link update is decentralized, each link using only local information
- a model for source-link dynamics
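The source and link updates can be simulated on a toy network. Everything specific below is an assumption made for illustration: two links with unit capacity, three sources (source 0 uses both links, sources 1 and 2 use one link each), log utilities $U_s(x_s) = \log x_s$ so that the demand function is $x_s = 1/\lambda^s$, a constant step size `ALPHA`, and a small floor `LAM_MIN` on the prices (instead of the exact projection onto $\lambda_l \ge 0$) to avoid dividing by zero.

```python
# hypothetical routing matrix: R[l][s] = 1 if source s uses link l
R = [[1, 1, 0],
     [1, 0, 1]]
c = [1.0, 1.0]        # link capacities (assumed)
ALPHA = 0.1           # constant step size (assumed)
LAM_MIN = 1e-6        # price floor, keeps 1/path_price finite (assumption)


def source_rates(lam):
    """Source algorithm for U_s = log: x_s = 1 / (end-to-end path price)."""
    rates = []
    for s in range(len(R[0])):
        path_price = sum(R[l][s] * lam[l] for l in range(len(R)))
        rates.append(1.0 / path_price)
    return rates


def link_update(lam, x):
    """Link algorithm: projected gradient step on the dual, one price per link."""
    new_lam = []
    for l in range(len(R)):
        load = sum(R[l][s] * x[s] for s in range(len(x)))   # locally observed traffic
        new_lam.append(max(LAM_MIN, lam[l] - ALPHA * (c[l] - load)))
    return new_lam


lam = [1.0, 1.0]
for _ in range(2000):
    x = source_rates(lam)      # each source reacts to its own path price
    lam = link_update(lam, x)  # each link reacts to its own load
x = source_rates(lam)
```

For this instance the iteration settles near $x = (1/3, 2/3, 2/3)$ and $\lambda = (1.5, 1.5)$: both links are fully used, and the two-link source pays the sum of the two link prices.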