Dual Decomposition
http://bicmr.pku.edu.cn/~wenzw/opt-2017-fall.html
Acknowledgement: these slides are based on Prof. Lieven Vandenberghe's lecture notes
Outline

1. conjugate function
2. introduction: dual methods
3. gradient and subgradient of conjugate
4. dual decomposition
5. network utility maximization
6. network flow optimization
Conjugate function

the conjugate of a function f is

    f*(y) = sup_{x ∈ dom f} (y^T x − f(x))

f* is closed and convex even if f is not

Fenchel's inequality

    f(x) + f*(y) ≥ x^T y   for all x, y

(extends the inequality x^T x/2 + y^T y/2 ≥ x^T y to non-quadratic convex f)
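As a quick numeric sanity check (an illustration added here, not from the slides), the definition and Fenchel's inequality can be verified for the scalar example f(x) = x²/2, whose conjugate is f*(y) = y²/2:

```python
import numpy as np

# approximate the conjugate f*(y) = sup_x (y*x - f(x)) on a grid,
# for the scalar example f(x) = x**2/2 with conjugate f*(y) = y**2/2
f = lambda x: 0.5 * x**2
xs = np.linspace(-10.0, 10.0, 200001)      # grid over dom f

def conj(y):
    return np.max(y * xs - f(xs))          # sup over the grid

for y in [-2.0, 0.0, 1.5, 3.0]:
    assert abs(conj(y) - 0.5 * y**2) < 1e-6   # matches f*(y) = y**2/2

# Fenchel's inequality: f(x) + f*(y) >= x*y for all x, y
rng = np.random.default_rng(0)
for x, y in rng.normal(size=(100, 2)):
    assert f(x) + 0.5 * y**2 >= x * y - 1e-12
```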
The second conjugate

    f**(x) = sup_{y ∈ dom f*} (x^T y − f*(y))

- f**(x) is closed and convex
- from Fenchel's inequality (x^T y − f*(y) ≤ f(x) for all y and x):

      f**(x) ≤ f(x)   for all x

  equivalently, epi f ⊆ epi f** (for any f)
- if f is closed and convex, then

      f**(x) = f(x)   for all x

  equivalently, epi f = epi f** (if f is closed and convex); proof on next page
Conjugates and subgradients

if f is closed and convex, then

    y ∈ ∂f(x)  ⟺  x ∈ ∂f*(y)  ⟺  x^T y = f(x) + f*(y)

proof: if y ∈ ∂f(x), then f*(y) = sup_u (y^T u − f(u)) = y^T x − f(x); hence, for all v,

    f*(v) = sup_u (v^T u − f(u))
          ≥ v^T x − f(x)
          = x^T (v − y) − f(x) + y^T x
          = f*(y) + x^T (v − y)        (1)

therefore, x is a subgradient of f* at y (x ∈ ∂f*(y))

the reverse implication x ∈ ∂f*(y) ⟹ y ∈ ∂f(x) follows from f** = f
Moreau decomposition

    prox_f(x) = argmin_u ( f(u) + (1/2)‖u − x‖₂² )

    x = prox_f(x) + prox_{f*}(x)   for all x

follows from the properties of conjugates and subgradients:

    u = prox_f(x)  ⟺  x − u ∈ ∂f(u)  ⟺  u ∈ ∂f*(x − u)  ⟺  x − u = prox_{f*}(x)

generalizes decomposition by orthogonal projection on subspaces:

    x = P_L(x) + P_{L⊥}(x)

if L is a subspace and L⊥ its orthogonal complement (this is the Moreau decomposition with f = I_L, f* = I_{L⊥})
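A quick numeric illustration (not from the slides), taking f = ‖·‖₁: then prox_f is soft-thresholding and, since f* is the indicator of the ℓ∞ unit ball, prox_{f*} is projection onto [−1, 1]^n:

```python
import numpy as np

# Moreau decomposition check for f = ||.||_1:
# prox_f is soft-thresholding; f* is the indicator of the
# l_inf unit ball, so prox_{f*} is projection onto [-1, 1]^n
def prox_l1(x):
    return np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)

def prox_l1_conj(x):
    return np.clip(x, -1.0, 1.0)

rng = np.random.default_rng(1)
x = rng.normal(scale=3.0, size=50)
# x = prox_f(x) + prox_{f*}(x)
assert np.allclose(x, prox_l1(x) + prox_l1_conj(x))
```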
Duality and conjugates

primal problem (A ∈ R^{m×n}, f and g convex)

    min  f(x) + g(Ax)

Lagrangian (after introducing a new variable y = Ax)

    L(x, y, z) = f(x) + g(y) + z^T (Ax − y)

dual function

    inf_{x,y} L(x, y, z) = inf_x (f(x) + z^T Ax) + inf_y (g(y) − z^T y)
                         = −f*(−A^T z) − g*(z)

dual problem

    max  −f*(−A^T z) − g*(z)
Examples

equality constraints: g is the indicator of {b}

    min  f(x)          max  −b^T z − f*(−A^T z)
    s.t. Ax = b

linear inequality constraints: g is the indicator of {y | y ≤ b}

    min  f(x)          max  −b^T z − f*(−A^T z)
    s.t. Ax ≤ b        s.t. z ≥ 0

norm regularization: g(y) = ‖y − b‖

    min  f(x) + ‖Ax − b‖      max  −b^T z − f*(−A^T z)
                              s.t. ‖z‖* ≤ 1

(‖·‖* is the dual norm)
Dual methods

apply a first-order method to the dual problem

    max  −f*(−A^T z) − g*(z)

reasons why the dual problem may be easier for a first-order method:
- the dual problem is unconstrained or has simple constraints
- the dual objective is differentiable or has a simple nondifferentiable term
- decomposition: exploit separable structure
(Sub-)gradients of conjugate function

assume f : R^n → R is closed and convex, with conjugate

    f*(y) = sup_x (y^T x − f(x))

subgradient
- f* is subdifferentiable on (at least) int dom f*
- maximizers in the definition of f*(y) are subgradients at y:

      y ∈ ∂f(x)  ⟺  y^T x − f(x) = f*(y)  ⟺  x ∈ ∂f*(y)

gradient: for strictly convex f, the maximizer in the definition is unique if it exists:

    ∇f*(y) = argmax_x (y^T x − f(x))   (if the maximum is attained)
Conjugate of strongly convex function

assume f is closed and strongly convex, with parameter µ > 0
- f*(y) is defined for all y (i.e., dom f* = R^n)
- f* is differentiable everywhere, with gradient

      ∇f*(y) = argmax_x (y^T x − f(x))

- ∇f* is Lipschitz continuous with constant 1/µ:

      ‖∇f*(y) − ∇f*(y′)‖₂ ≤ (1/µ) ‖y − y′‖₂   for all y, y′
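For a concrete instance (an illustration added here, not from the slides): with f(x) = x^T P x/2 + q^T x and P ≻ 0, the maximizer of y^T x − f(x) is x = P⁻¹(y − q), so ∇f*(y) = P⁻¹(y − q), and the Lipschitz bound with µ = λ_min(P) can be checked numerically:

```python
import numpy as np

# for the strongly convex quadratic f(x) = x^T P x / 2 + q^T x (P > 0),
# grad f*(y) = P^{-1}(y - q); its Lipschitz constant is 1/mu,
# where mu = lambda_min(P) is the strong convexity parameter of f
rng = np.random.default_rng(2)
B = rng.normal(size=(5, 5))
P = B @ B.T + np.eye(5)                  # symmetric positive definite
q = rng.normal(size=5)
mu = np.linalg.eigvalsh(P)[0]            # smallest eigenvalue of P

grad_fstar = lambda y: np.linalg.solve(P, y - q)

for _ in range(100):
    y, yp = rng.normal(size=(2, 5))
    lhs = np.linalg.norm(grad_fstar(y) - grad_fstar(yp))
    assert lhs <= (1.0 / mu) * np.linalg.norm(y - yp) + 1e-12
```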
proof: if f is strongly convex and closed
- y^T x − f(x) has a unique maximizer x for every y
- x maximizes y^T x − f(x) if and only if y ∈ ∂f(x); since

      y ∈ ∂f(x)  ⟺  x ∈ ∂f*(y) = {∇f*(y)}

  hence ∇f*(y) = argmax_x (y^T x − f(x))
- from convexity of f(x) − (µ/2) x^T x:

      (y − y′)^T (x − x′) ≥ µ ‖x − x′‖₂²   if y ∈ ∂f(x), y′ ∈ ∂f(x′)

  this is co-coercivity of ∇f* (which implies Lipschitz continuity):

      (y − y′)^T (∇f*(y) − ∇f*(y′)) ≥ µ ‖∇f*(y) − ∇f*(y′)‖₂²
Equality constraints

    P: min  f(x)        D: min  f*(−A^T z) + b^T z
       s.t. Ax = b

dual gradient ascent (assuming dom f = R^n):

    x̂ = argmin_x (f(x) + z^T Ax),    z⁺ = z + t(Ax̂ − b)

- x̂ is a subgradient of f* at −A^T z (i.e., x̂ ∈ ∂f*(−A^T z))
- b − Ax̂ is a subgradient of f*(−A^T z) + b^T z at z
- the method is of interest when the calculation of x̂ is inexpensive (for example, when f is separable)
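A minimal numeric sketch of dual gradient ascent (the problem data below are illustrative, not from the slides): for f(x) = ‖x‖²/2 + q^T x the inner minimization has a closed form, f is strongly convex with µ = 1, so the fixed step t = 1/‖A‖₂² is safe, and the iterates converge to the KKT solution:

```python
import numpy as np

# dual gradient ascent for  min (1/2)||x||^2 + q^T x  s.t.  Ax = b
# (mu = 1, so grad of the dual is Lipschitz with constant ||A||_2^2)
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
q = np.array([1.0, 0.0, -1.0])

t = 1.0 / np.linalg.norm(A, 2) ** 2   # spectral norm of A
z = np.zeros(2)
for _ in range(200):
    xhat = -(q + A.T @ z)             # argmin_x f(x) + z^T A x (closed form)
    z = z + t * (A @ xhat - b)        # ascent step on the dual

print(xhat)   # converges to the KKT solution x* = [0, 1, 1]
```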
Dual decomposition

convex problem with separable objective

    min  f_1(x_1) + f_2(x_2)
    s.t. A_1 x_1 + A_2 x_2 ≤ b

the constraint is a complicating or coupling constraint

dual problem

    max  −f_1*(−A_1^T z) − f_2*(−A_2^T z) − b^T z
    s.t. z ≥ 0

can be solved by (sub-)gradient projection if z ≥ 0 is the only constraint
Dual subgradient projection

subproblems: to calculate f_j*(−A_j^T z) and a (sub-)gradient for it, solve

    min_{x_j}  f_j(x_j) + z^T A_j x_j

the optimal value is −f_j*(−A_j^T z); the minimizer x̂_j is in ∂f_j*(−A_j^T z)

dual subgradient projection method

    x̂_j = argmin_{x_j} (f_j(x_j) + z^T A_j x_j),   j = 1, 2
    z⁺ = (z + t(A_1 x̂_1 + A_2 x̂_2 − b))₊

- the minimization problems over x_1, x_2 are independent
- the z-update is a projected subgradient step (u₊ = max{u, 0} elementwise)
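A toy instance of the method (illustrative data, not from the slides): two scalar quadratics coupled by one inequality constraint; each x_j-update is an independent closed-form minimization, and the price z converges to the optimal multiplier:

```python
# dual subgradient projection for the separable toy problem
#   min (x1 - 2)^2 + (x2 - 2)^2   s.t.  x1 + x2 <= 2
# (optimal point x1 = x2 = 1 with price z = 2)
b, t = 2.0, 0.5
z = 0.0
for _ in range(100):
    # argmin_xj (xj - 2)^2 + z * xj  has the closed form 2 - z/2;
    # the two updates are independent of each other
    x1 = 2.0 - z / 2.0
    x2 = 2.0 - z / 2.0
    z = max(z + t * (x1 + x2 - b), 0.0)   # projected subgradient step

print(x1, x2, z)   # converges to x1 = x2 = 1, z = 2
```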
Quadratic programming example

    min  Σ_{j=1}^r (x_j^T P_j x_j + q_j^T x_j)
    s.t. B_j x_j ≤ d_j,  j = 1, ..., r
         Σ_{j=1}^r A_j x_j ≤ b

- r = 10, variables x_j ∈ R^100, 10 coupling constraints (A_j ∈ R^{10×100})
- P_j ≻ 0; this implies the dual function has a Lipschitz continuous gradient
- subproblems: each iteration requires solving 10 decoupled QPs

      min_{x_j}  x_j^T P_j x_j + (q_j + A_j^T z)^T x_j
      s.t.       B_j x_j ≤ d_j
(figure) gradient projection and fast gradient projection, with a fixed step size (equal in the two methods); the plot shows the dual objective gap
Network utility maximization

network flows
- n flows, with fixed routes, in a network with m links
- variable x_j ≥ 0 denotes the rate of flow j
- flow utility is U_j : R → R, concave and increasing

capacity constraints
- the traffic y_i on link i is the sum of the flows passing through it
- y = Rx, where R is the routing matrix:

      R_ij = 1 if flow j passes over link i, 0 otherwise

- link capacity constraint: y ≤ c
Dual network utility maximization problem

    max  Σ_{j=1}^n U_j(x_j)
    s.t. Rx ≤ c

a convex problem; dual decomposition gives a decentralized method

dual problem

    min  c^T z + Σ_{j=1}^n (−U_j)*(−r_j^T z)
    s.t. z ≥ 0

- z_i is the price (per unit flow) for using link i
- r_j^T z is the sum of the prices along route j (r_j is the jth column of R)
(Sub-)gradients of dual function

dual objective

    f(z) = c^T z + Σ_{j=1}^n (−U_j)*(−r_j^T z)
         = c^T z + Σ_{j=1}^n sup_{x_j} (U_j(x_j) − (r_j^T z) x_j)

subgradient

    c − Rx̂ ∈ ∂f(z)   where   x̂_j = argmax_{x_j} (U_j(x_j) − (r_j^T z) x_j)

- if U_j is strictly concave, this is a gradient
- r_j^T z is the sum of the link prices along route j
- c − Rx̂ is the vector of link capacity margins for the flow x̂
Dual decomposition algorithm

given an initial link price vector z ≻ 0 (e.g., z = 1), repeat:
1. sum the link prices along each route: calculate λ_j = r_j^T z for j = 1, ..., n
2. optimize the flows (separately) using the flow prices:

       x̂_j = argmax_{x_j} (U_j(x_j) − λ_j x_j),   j = 1, ..., n

3. calculate the link capacity margins s = c − Rx̂
4. update the link prices using a projected (sub-)gradient step with step size t:

       z := (z − ts)₊

decentralized:
- to find λ_j and x̂_j, source j only needs to know the prices on its route
- to update s_i and z_i, link i only needs to know the flows that pass through it
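The algorithm can be sketched on a small example (the 2-link, 3-flow line network below is an illustration, not from the slides), with logarithmic utilities U_j = log, for which the flow update has the closed form x̂_j = 1/λ_j:

```python
import numpy as np

# decentralized NUM on a 2-link line network with 3 flows and log
# utilities (proportional fairness); flow 1 uses both links,
# flows 2 and 3 use one link each
R = np.array([[1.0, 1.0, 0.0],    # routing matrix: R[i, j] = 1 if
              [1.0, 0.0, 1.0]])   # flow j crosses link i
c = np.array([1.0, 1.0])          # link capacities
z = np.ones(2)                    # initial link prices
t = 0.2                           # fixed step size

for _ in range(1000):
    lam = R.T @ z                    # price of each route
    xhat = 1.0 / lam                 # argmax_x (log x - lam * x)
    s = c - R @ xhat                 # link capacity margins
    z = np.maximum(z - t * s, 0.0)   # projected (sub-)gradient step

print(xhat)   # the long flow gets rate 1/3, the short flows 2/3 each
```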
Single commodity network flow

network
- connected, directed graph with n links/arcs and m nodes
- node-arc incidence matrix A ∈ R^{m×n}:

      A_ij = 1 if arc j enters node i, −1 if arc j leaves node i, 0 otherwise

flow vector and external sources
- variable x_j denotes the flow (traffic) on arc j
- b_i is the external demand (or supply) of flow at node i (satisfies 1^T b = 0)
- flow conservation: Ax = b
Network flow optimization problem

    min  φ(x) = Σ_{j=1}^n φ_j(x_j)
    s.t. Ax = b

- φ is a separable sum of convex functions
- dual decomposition yields a decentralized solution method

dual problem (a_j is the jth column of A)

    max  −b^T z − Σ_{j=1}^n φ_j*(−a_j^T z)

- the dual variable z_i can be interpreted as the potential at node i
- y_j = −a_j^T z is the potential difference across arc j (potential at the start node minus potential at the end node)
(Sub-)gradients of dual function

negative dual objective

    f(z) = b^T z + Σ_{j=1}^n φ_j*(−a_j^T z)

subgradient

    b − Ax̂ ∈ ∂f(z)   where   x̂_j = argmin_{x_j} (φ_j(x_j) + (a_j^T z) x_j)

- this is a gradient if the functions φ_j are strictly convex
- if φ_j is differentiable, φ_j′(x̂_j) = −a_j^T z
Dual decomposition network flow algorithm

given an initial potential vector z, repeat:
1. determine the link flows from the potential differences y = −A^T z:

       x̂_j = argmin_{x_j} (φ_j(x_j) − y_j x_j),   j = 1, ..., n

2. compute the flow residual at each node: s := b − Ax̂
3. update the node potentials using a (sub-)gradient step with step size t:

       z := z − ts

decentralized:
- each flow is calculated from the potential difference across its arc
- each node potential is updated from the node's own flow surplus
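A small numeric sketch of the algorithm (the 3-node network and quadratic arc costs below are illustrative assumptions, not from the slides): with φ_j(x) = x²/2 the flow update is simply x̂_j = y_j, i.e. unit resistors, and the iteration recovers the current split of the corresponding resistor circuit:

```python
import numpy as np

# dual decomposition on a 3-node network (arcs 1->2, 2->3, 1->3)
# with quadratic arc costs phi_j(x) = x^2/2 (unit resistors);
# one unit of flow is injected at node 1 and extracted at node 3
A = np.array([[-1.0,  0.0, -1.0],   # node-arc incidence matrix
              [ 1.0, -1.0,  0.0],   # (+1 = arc enters, -1 = arc leaves)
              [ 0.0,  1.0,  1.0]])
b = np.array([-1.0, 0.0, 1.0])      # external sources, 1^T b = 0
z = np.zeros(3)                     # node potentials
t = 0.3

for _ in range(100):
    y = -A.T @ z                # potential differences across arcs
    xhat = y                    # argmin_x (x^2/2 - y_j * x) = y_j
    s = b - A @ xhat            # flow residual at each node
    z = z - t * s               # (sub-)gradient step

print(xhat)   # 1/3 on each arc of the two-hop path, 2/3 on the direct arc
```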
Electrical network interpretation

network flow optimality conditions (with differentiable φ_j)

    Ax = b,   y + A^T z = 0,   y_j = φ_j′(x_j),  j = 1, ..., n

interpretation: a network with node incidence matrix A and nonlinear resistors in the branches
- Kirchhoff current law (KCL): Ax = b; x_j is the current in branch j; b_i is the external current extracted at node i
- Kirchhoff voltage law (KVL): y + A^T z = 0; z_i is the potential at node i; y_j = −a_j^T z is the jth branch voltage
- current-voltage characteristics: y_j = φ_j′(x_j); for example, φ_j(x_j) = R_j x_j²/2 for a linear resistor R_j

the currents and potentials in the circuit are the optimal flows and dual variables
Example: minimum queueing delay

flow cost function and conjugate (c_j > 0 are the link capacities):

    φ_j(x_j) = x_j / (c_j − x_j)   (with dom φ_j = [0, c_j))

    φ_j*(y_j) = (√(c_j y_j) − 1)²  if y_j > 1/c_j,   0  if y_j ≤ 1/c_j

φ_j is differentiable except at x_j = 0:

    ∂φ_j(0) = (−∞, 1/c_j],   φ_j′(x_j) = c_j / (c_j − x_j)²   (0 < x_j < c_j)

φ_j* is differentiable:

    ∇φ_j*(y_j) = 0  if y_j ≤ 1/c_j,   c_j − √(c_j/y_j)  if y_j > 1/c_j
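The conjugate formula can be double-checked numerically (a verification sketch added here, with an illustrative capacity c = 2): maximizing y·x − φ(x) over a fine grid of [0, c) should reproduce (√(cy) − 1)² for y > 1/c and 0 otherwise:

```python
import numpy as np

# numeric check of the conjugate of the queueing-delay cost
# phi(x) = x / (c - x) on [0, c): for y > 1/c the conjugate is
# phi*(y) = (sqrt(c*y) - 1)^2, attained at x = c - sqrt(c/y);
# for y <= 1/c the supremum is 0 (attained at x = 0)
c = 2.0
xs = np.linspace(0.0, c - 1e-6, 2_000_001)
phi = xs / (c - xs)

def conj(y):
    return np.max(y * xs - phi)    # sup over the grid

for y in [1.0, 2.0, 5.0]:                      # all > 1/c = 0.5
    assert abs(conj(y) - (np.sqrt(c * y) - 1.0) ** 2) < 1e-4
assert abs(conj(0.4)) < 1e-12                  # y <= 1/c gives 0
```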
(figure) flow cost function φ_j and conjugate φ_j* (for c_j = 1), and their derivatives
References

- S. Boyd, course notes for EE364b: lectures and notes on decomposition
- M. Chiang, S. H. Low, A. R. Calderbank, and J. C. Doyle, Layering as optimization decomposition: a mathematical theory of network architectures, Proceedings of the IEEE (2007)
- D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods (1989)
- D. P. Bertsekas, Network Optimization: Continuous and Discrete Models (1998)
- L. S. Lasdon, Optimization Theory for Large Systems (1970)