Dual Methods
Lecturer: Ryan Tibshirani
Convex Optimization 10-725/36-725
Last time: proximal Newton method

Consider the problem

$$\min_x \; g(x) + h(x)$$

where $g, h$ are convex, $g$ is twice differentiable, and $h$ is "simple". Proximal Newton method: let $x^{(0)} \in \mathbb{R}^n$, and repeat for $k = 1, 2, 3, \ldots$:

$$v^{(k)} = \operatorname*{argmin}_v \; \nabla g(x^{(k-1)})^T v + \tfrac{1}{2} v^T \nabla^2 g(x^{(k-1)}) v + h(x^{(k-1)} + v)$$
$$x^{(k)} = x^{(k-1)} + t_k v^{(k)}$$

- Step sizes are typically chosen by backtracking
- Iterations here are typically very expensive (computing $v^{(k)}$ is typically a formidable task)
- But typically very few iterations are needed until convergence: under appropriate conditions, we get local quadratic convergence
Reminder: conjugate functions

Recall that given $f : \mathbb{R}^n \to \mathbb{R}$, the function

$$f^*(y) = \max_x \; y^T x - f(x)$$

is called its conjugate.

- Conjugates appear frequently in dual programs, since
  $$-f^*(y) = \min_x \; f(x) - y^T x$$
- If $f$ is closed and convex, then $f^{**} = f$. Also,
  $$x \in \partial f^*(y) \iff y \in \partial f(x) \iff x \in \operatorname*{argmin}_z \; f(z) - y^T z$$
- If $f$ is strictly convex, then
  $$\nabla f^*(y) = \operatorname*{argmin}_z \; f(z) - y^T z$$
Proof details

We will show that $x \in \partial f^*(y) \iff y \in \partial f(x)$, assuming that $f$ is convex and closed.

- Proof of "$\Leftarrow$": Suppose $y \in \partial f(x)$. Then $x \in M_y$, the set of maximizers of $y^T z - f(z)$ over $z$. But $f^*(y) = \max_z \; y^T z - f(z)$, and $\partial f^*(y) = \mathrm{cl}(\mathrm{conv}(\cup_{z \in M_y} \{z\}))$. Thus $x \in \partial f^*(y)$.
- Proof of "$\Rightarrow$": From what we showed above, if $x \in \partial f^*(y)$, then $y \in \partial f^{**}(x)$, but $f^{**} = f$.
- Clearly $y \in \partial f(x) \iff x \in \operatorname*{argmin}_z \; f(z) - y^T z$
- Lastly, if $f$ is strictly convex, then we know that $f(z) - y^T z$ has a unique minimizer over $z$, and this must be $\nabla f^*(y)$
Outline

Today:
- Dual (sub)gradient methods
- Dual decomposition
- Augmented Lagrangians
- A peek at ADMM
Dual (sub)gradient methods

Even if we can't derive the dual (conjugate) in closed form, we can still use dual-based subgradient or gradient methods.

Example: consider the problem

$$\min_x \; f(x) \quad \text{subject to} \quad Ax = b$$

Its dual problem is

$$\max_u \; -f^*(-A^T u) - b^T u$$

where $f^*$ is the conjugate of $f$. Defining $g(u) = -f^*(-A^T u) - b^T u$, note that

$$\partial g(u) = A \, \partial f^*(-A^T u) - b$$
Therefore, using what we know about conjugates,

$$\partial g(u) \ni Ax - b, \quad \text{where} \quad x \in \operatorname*{argmin}_z \; f(z) + u^T A z$$

The dual subgradient method (for maximizing the dual objective) starts with an initial dual guess $u^{(0)}$, and repeats for $k = 1, 2, 3, \ldots$:

$$x^{(k)} \in \operatorname*{argmin}_x \; f(x) + (u^{(k-1)})^T A x$$
$$u^{(k)} = u^{(k-1)} + t_k (A x^{(k)} - b)$$

Step sizes $t_k$, $k = 1, 2, 3, \ldots$, are chosen in standard ways.
Recall that if $f$ is strictly convex, then $f^*$ is differentiable, and so this becomes dual gradient ascent, which repeats for $k = 1, 2, 3, \ldots$:

$$x^{(k)} = \operatorname*{argmin}_x \; f(x) + (u^{(k-1)})^T A x$$
$$u^{(k)} = u^{(k-1)} + t_k (A x^{(k)} - b)$$

(The difference is that each $x^{(k)}$ is unique here.)

Again, step sizes $t_k$, $k = 1, 2, 3, \ldots$, are chosen in standard ways. Also, proximal gradients and acceleration can be applied as they usually would be.
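A minimal numerical sketch of dual gradient ascent, using an assumed example $f(x) = \tfrac{1}{2}\|x\|_2^2$ (strongly convex with parameter $d = 1$), so the $x$-update has the closed form $x = -A^T u$; the matrix, vector, and step-size choices below are illustrative, not prescribed by the lecture:

```python
import numpy as np

# Dual gradient ascent sketch for min (1/2)||x||_2^2 subject to Ax = b.
# f is strongly convex with parameter d = 1, so the x-update is x = -A^T u.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))
b = rng.standard_normal(3)

# Conservative fixed step: the dual gradient is Lipschitz with constant
# lambda_max(A A^T) / d, so t = d / lambda_max(A A^T) is safe.
t = 1.0 / np.linalg.eigvalsh(A @ A.T).max()

u = np.zeros(3)
for _ in range(5000):
    x = -A.T @ u                 # x-update: argmin_x f(x) + u^T A x
    u = u + t * (A @ x - b)      # dual gradient ascent step

# For this f, the limit is the minimum-norm solution of Ax = b
x_star = A.T @ np.linalg.solve(A @ A.T, b)
```

At convergence, `x` matches the minimum-norm solution `x_star` to high accuracy, and the equality constraint is (numerically) satisfied.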
Lipschitz gradients and strong convexity

Assume that $f$ is a closed and convex function. Then $f$ is strongly convex with parameter $d$ $\iff$ $\nabla f^*$ is Lipschitz with parameter $1/d$.

Proof of "$\Rightarrow$": Recall that if $g$ is strongly convex with parameter $d$ and minimizer $x$, then

$$g(y) \geq g(x) + \tfrac{d}{2} \|y - x\|_2^2, \quad \text{for all } y$$

Hence, defining $x_u = \nabla f^*(u)$ and $x_v = \nabla f^*(v)$ (the minimizers of $f(z) - u^T z$ and $f(z) - v^T z$, respectively),

$$f(x_v) - u^T x_v \geq f(x_u) - u^T x_u + \tfrac{d}{2} \|x_u - x_v\|_2^2$$
$$f(x_u) - v^T x_u \geq f(x_v) - v^T x_v + \tfrac{d}{2} \|x_u - x_v\|_2^2$$

Adding these together and rearranging gives $(u - v)^T (x_u - x_v) \geq d \|x_u - x_v\|_2^2$; applying Cauchy–Schwarz on the left and dividing through then shows

$$\|x_u - x_v\|_2 \leq \|u - v\|_2 / d$$
Convergence guarantees

The following results follow from combining the last fact with what we already know about gradient descent:

- If $f$ is strongly convex with parameter $d$, then dual gradient ascent with fixed step sizes $t_k = d$, $k = 1, 2, 3, \ldots$, converges at the rate $O(1/\epsilon)$
- If $f$ is strongly convex with parameter $d$ and $\nabla f$ is Lipschitz with parameter $L$, then dual gradient ascent with step sizes $t_k = 2/(1/d + 1/L)$, $k = 1, 2, 3, \ldots$, converges at the rate $O(\log(1/\epsilon))$

Note that these results describe the convergence of the dual objective to its optimal value.
Dual decomposition

Consider

$$\min_x \; \sum_{i=1}^B f_i(x_i) \quad \text{subject to} \quad Ax = b$$

Here $x = (x_1, \ldots, x_B) \in \mathbb{R}^n$ divides into $B$ blocks of variables, with each $x_i \in \mathbb{R}^{n_i}$. We can also partition $A$ accordingly:

$$A = [A_1 \; \cdots \; A_B], \quad \text{where } A_i \in \mathbb{R}^{m \times n_i}$$

A simple but powerful observation: in the calculation of the (sub)gradient, the minimization decomposes into $B$ separate problems:

$$x^+ \in \operatorname*{argmin}_x \; \sum_{i=1}^B f_i(x_i) + u^T A x \;\;\iff\;\; x_i^+ \in \operatorname*{argmin}_{x_i} \; f_i(x_i) + u^T A_i x_i, \quad i = 1, \ldots, B$$
Dual decomposition algorithm: repeat for $k = 1, 2, 3, \ldots$

$$x_i^{(k)} \in \operatorname*{argmin}_{x_i} \; f_i(x_i) + (u^{(k-1)})^T A_i x_i, \quad i = 1, \ldots, B$$
$$u^{(k)} = u^{(k-1)} + t_k \Big( \sum_{i=1}^B A_i x_i^{(k)} - b \Big)$$

Can think of these steps as:
- Broadcast: send $u$ to each of the $B$ processors; each optimizes in parallel to find $x_i$
- Gather: collect $A_i x_i$ from each processor, update the global dual variable $u$
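A small sketch of dual decomposition with two blocks, using an assumed separable objective $f_i(x_i) = \tfrac{1}{2}\|x_i - c_i\|_2^2$ so that each block update has the closed form $x_i = c_i - A_i^T u$; all problem data and the step size below are made up for illustration:

```python
import numpy as np

# Dual decomposition sketch for
#   min sum_i (1/2)||x_i - c_i||_2^2  subject to  A_1 x_1 + A_2 x_2 = b.
rng = np.random.default_rng(1)
m = 3
A = [rng.standard_normal((m, 4)), rng.standard_normal((m, 6))]
c = [rng.standard_normal(4), rng.standard_normal(6)]
b = rng.standard_normal(m)

AAT = sum(Ai @ Ai.T for Ai in A)
t = 1.0 / np.linalg.eigvalsh(AAT).max()   # conservative fixed step

u = np.zeros(m)
for _ in range(20000):
    # "Broadcast": each block solves its own small subproblem (in parallel)
    x = [ci - Ai.T @ u for Ai, ci in zip(A, c)]
    # "Gather": collect A_i x_i and update the global dual variable
    u = u + t * (sum(Ai @ xi for Ai, xi in zip(A, x)) - b)

residual = np.linalg.norm(sum(Ai @ xi for Ai, xi in zip(A, x)) - b)
```

Each block's subproblem depends only on its own $A_i$, $c_i$, and the shared $u$, which is exactly what makes the broadcast/gather pattern possible.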
Dual decomposition with inequality constraints

Consider

$$\min_x \; \sum_{i=1}^B f_i(x_i) \quad \text{subject to} \quad \sum_{i=1}^B A_i x_i \leq b$$

Dual decomposition (projected subgradient method): repeat for $k = 1, 2, 3, \ldots$

$$x_i^{(k)} \in \operatorname*{argmin}_{x_i} \; f_i(x_i) + (u^{(k-1)})^T A_i x_i, \quad i = 1, \ldots, B$$
$$u^{(k)} = \Big( u^{(k-1)} + t_k \Big( \sum_{i=1}^B A_i x_i^{(k)} - b \Big) \Big)_+$$

where $(u)_+$ denotes the positive part of $u$, i.e., $((u)_+)_i = \max\{0, u_i\}$, $i = 1, \ldots, m$.
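The same assumed toy objective works for the inequality-constrained version; the only change is the positive-part projection on the dual variable (again, all data and sizes here are illustrative):

```python
import numpy as np

# Projected dual subgradient sketch for
#   min sum_i (1/2)||x_i - c_i||_2^2  subject to  sum_i A_i x_i <= b.
rng = np.random.default_rng(2)
m = 2
A = [rng.standard_normal((m, 3)), rng.standard_normal((m, 3))]
c = [rng.standard_normal(3), rng.standard_normal(3)]
b = rng.standard_normal(m)

AAT = sum(Ai @ Ai.T for Ai in A)
t = 1.0 / np.linalg.eigvalsh(AAT).max()   # conservative fixed step

u = np.zeros(m)
for _ in range(50000):
    x = [ci - Ai.T @ u for Ai, ci in zip(A, c)]
    # Positive-part projection keeps the dual variable (the prices) nonnegative
    u = np.maximum(0.0, u + t * (sum(Ai @ xi for Ai, xi in zip(A, x)) - b))

# Slack should be nonnegative (up to tolerance) at convergence
slack = b - sum(Ai @ xi for Ai, xi in zip(A, x))
```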
Price coordination interpretation (Vandenberghe):

- We have $B$ units in a system; each unit chooses its own decision variable $x_i$ (how to allocate its goods)
- Constraints are limits on shared resources (rows of $A$); each component $u_j$ of the dual variable is the price of resource $j$
- Dual update:
  $$u_j^+ = (u_j - t s_j)_+, \quad j = 1, \ldots, m$$
  where $s = b - \sum_{i=1}^B A_i x_i$ are slacks
  - Increase price $u_j$ if resource $j$ is over-utilized, $s_j < 0$
  - Decrease price $u_j$ if resource $j$ is under-utilized, $s_j > 0$
  - Never let prices get negative
Augmented Lagrangian method (also known as the method of multipliers)

A disadvantage of dual ascent is that it requires strong conditions to ensure convergence. The augmented Lagrangian method, also called the method of multipliers, improves on this. We transform the primal problem:

$$\min_x \; f(x) + \tfrac{\rho}{2} \|Ax - b\|_2^2 \quad \text{subject to} \quad Ax = b$$

where $\rho > 0$ is a parameter. This is clearly equivalent to the original problem, and its objective is strongly convex when $A$ has full column rank. Now use dual gradient ascent: repeat for $k = 1, 2, 3, \ldots$

$$x^{(k)} = \operatorname*{argmin}_x \; f(x) + (u^{(k-1)})^T A x + \tfrac{\rho}{2} \|Ax - b\|_2^2$$
$$u^{(k)} = u^{(k-1)} + \rho (A x^{(k)} - b)$$
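A sketch of the method of multipliers on an assumed quadratic objective $f(x) = \tfrac{1}{2}\|x - c\|_2^2$, chosen so that the $x$-update reduces to a linear solve; note the dual step size is exactly $\rho$:

```python
import numpy as np

# Method of multipliers sketch for min (1/2)||x - c||_2^2 subject to Ax = b.
rng = np.random.default_rng(3)
A = rng.standard_normal((3, 5))
b = rng.standard_normal(3)
c = rng.standard_normal(5)

rho = 1.0
u = np.zeros(3)
M = np.eye(5) + rho * A.T @ A       # Hessian of the augmented objective
x = np.zeros(5)
for _ in range(200):
    # x-update: argmin_x f(x) + u^T A x + (rho/2)||Ax - b||_2^2
    x = np.linalg.solve(M, c - A.T @ u + rho * A.T @ b)
    u = u + rho * (A @ x - b)       # dual update with step size rho

feasibility = np.linalg.norm(A @ x - b)   # primal iterates approach feasibility
```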
Notice the step size choice $t_k = \rho$, $k = 1, 2, 3, \ldots$, in the dual algorithm. Why? Since $x^{(k)}$ minimizes $f(x) + (u^{(k-1)})^T A x + \tfrac{\rho}{2} \|Ax - b\|_2^2$ over $x$, we have

$$0 \in \partial f(x^{(k)}) + A^T \big( u^{(k-1)} + \rho (A x^{(k)} - b) \big) = \partial f(x^{(k)}) + A^T u^{(k)}$$

This is the stationarity condition for the original primal problem; one can show under mild conditions that $A x^{(k)} - b$ approaches zero (i.e., the primal iterates approach feasibility), hence in the limit the KKT conditions are satisfied and $x^{(k)}, u^{(k)}$ approach optimality.

- Advantage: much better convergence properties
- Disadvantage: we lose decomposability! (Separability is compromised by the augmented Lagrangian...)
Alternating direction method of multipliers

Alternating direction method of multipliers, or ADMM: the best of both worlds, i.e., we get strong convergence properties along with decomposability. Consider

$$\min_{x,z} \; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c$$

As before, we augment the objective:

$$\min_{x,z} \; f(x) + g(z) + \tfrac{\rho}{2} \|Ax + Bz - c\|_2^2 \quad \text{subject to} \quad Ax + Bz = c$$

for a parameter $\rho > 0$. We define the augmented Lagrangian

$$L_\rho(x, z, u) = f(x) + g(z) + u^T (Ax + Bz - c) + \tfrac{\rho}{2} \|Ax + Bz - c\|_2^2$$
ADMM repeats the steps, for $k = 1, 2, 3, \ldots$:

$$x^{(k)} = \operatorname*{argmin}_x \; L_\rho(x, z^{(k-1)}, u^{(k-1)})$$
$$z^{(k)} = \operatorname*{argmin}_z \; L_\rho(x^{(k)}, z, u^{(k-1)})$$
$$u^{(k)} = u^{(k-1)} + \rho (A x^{(k)} + B z^{(k)} - c)$$

Note that the usual method of multipliers would have replaced the first two steps by a joint minimization:

$$(x^{(k)}, z^{(k)}) = \operatorname*{argmin}_{x,z} \; L_\rho(x, z, u^{(k-1)})$$
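As a concrete sketch of these steps, here is ADMM for the lasso, $\min_x \tfrac{1}{2}\|Dx - y\|_2^2 + \lambda \|x\|_1$, written as $f(x) + g(z)$ subject to $x - z = 0$ (so $A = I$, $B = -I$, $c = 0$ in the template above); the data, $\lambda$, and $\rho$ are illustrative. The $x$-update is a linear solve and the $z$-update is soft-thresholding:

```python
import numpy as np

# ADMM sketch for the lasso: f(x) = (1/2)||Dx - y||_2^2, g(z) = lam*||z||_1,
# subject to x - z = 0.
rng = np.random.default_rng(4)
D = rng.standard_normal((20, 10))
y = rng.standard_normal(20)
lam, rho = 1.0, 1.0

x = np.zeros(10)
z = np.zeros(10)
u = np.zeros(10)
M = D.T @ D + rho * np.eye(10)      # fixed matrix, reused every iteration
for _ in range(2000):
    # x-update: minimize L_rho over x (a linear solve)
    x = np.linalg.solve(M, D.T @ y + rho * z - u)
    # z-update: minimize L_rho over z (soft-thresholding at lam/rho)
    v = x + u / rho
    z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
    # dual update with step size rho
    u = u + rho * (x - z)

primal_residual = np.linalg.norm(x - z)
```

In practice one would factor `M` once (e.g. via Cholesky) and reuse the factorization, since it does not change across iterations.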
Convergence guarantees

Under modest assumptions on $f, g$ (these do not require $A, B$ to be full rank), the ADMM iterates satisfy, for any $\rho > 0$:

- Residual convergence: $r^{(k)} = A x^{(k)} + B z^{(k)} - c \to 0$ as $k \to \infty$, i.e., the primal iterates approach feasibility
- Objective convergence: $f(x^{(k)}) + g(z^{(k)}) \to f^\star + g^\star$, where $f^\star + g^\star$ is the optimal primal objective value
- Dual convergence: $u^{(k)} \to u^\star$, where $u^\star$ is a dual solution

For details, see Boyd et al. (2010). Note that we do not generically get primal convergence, but this is true under more assumptions.

Convergence rate: not known in general; the theory is currently being developed, e.g., in Hong and Luo (2012), Deng and Yin (2012), Iutzeler et al. (2014), Nishihara et al. (2015). Roughly, it behaves like a first-order method (or a bit faster).
ADMM in scaled form

It is often easier to express the ADMM algorithm in scaled form, where we replace the dual variable $u$ by a scaled variable $w = u/\rho$. In this parametrization, the ADMM steps are:

$$x^{(k)} = \operatorname*{argmin}_x \; f(x) + \tfrac{\rho}{2} \|Ax + Bz^{(k-1)} - c + w^{(k-1)}\|_2^2$$
$$z^{(k)} = \operatorname*{argmin}_z \; g(z) + \tfrac{\rho}{2} \|Ax^{(k)} + Bz - c + w^{(k-1)}\|_2^2$$
$$w^{(k)} = w^{(k-1)} + A x^{(k)} + B z^{(k)} - c$$

Note that here the $k$th iterate $w^{(k)}$ is just given by a running sum of residuals:

$$w^{(k)} = w^{(0)} + \sum_{i=1}^k \big( A x^{(i)} + B z^{(i)} - c \big)$$
Example: alternating projections

Consider finding a point in the intersection of convex sets $C, D \subseteq \mathbb{R}^n$:

$$\min_x \; I_C(x) + I_D(x)$$

To get this into ADMM form, we express it as

$$\min_{x,z} \; I_C(x) + I_D(z) \quad \text{subject to} \quad x - z = 0$$

Each ADMM cycle involves two projections:

$$x^{(k)} = P_C\big( z^{(k-1)} - w^{(k-1)} \big)$$
$$z^{(k)} = P_D\big( x^{(k)} + w^{(k-1)} \big)$$
$$w^{(k)} = w^{(k-1)} + x^{(k)} - z^{(k)}$$

Like the classical alternating projections method, but more efficient.
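A sketch of this scheme for two assumed sets with easy projections, $C = \{x : x \geq 0\}$ and $D = \{x : \mathbf{1}^T x = 1\}$ (so the intersection is the probability simplex):

```python
import numpy as np

# ADMM alternating projections: find a point in C ∩ D, where
#   C = {x : x >= 0}  and  D = {x : sum(x) = 1}.
n = 5

def proj_C(v):
    # Euclidean projection onto the nonnegative orthant
    return np.maximum(v, 0.0)

def proj_D(v):
    # Euclidean projection onto the hyperplane {x : sum(x) = 1}
    return v + (1.0 - v.sum()) / len(v)

x = np.zeros(n)
z = np.zeros(n)
w = np.zeros(n)
for _ in range(1000):
    x = proj_C(z - w)      # x-update: project onto C
    z = proj_D(x + w)      # z-update: project onto D
    w = w + x - z          # scaled dual update

# x should now lie (numerically) in both C and D
```

Which intersection point the iterates reach depends on the initialization; here they converge to the uniform distribution over the five coordinates.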
References

- S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2010), "Distributed optimization and statistical learning via the alternating direction method of multipliers"
- W. Deng and W. Yin (2012), "On the global and linear convergence of the generalized alternating direction method of multipliers"
- M. Hong and Z. Luo (2012), "On the linear convergence of the alternating direction method of multipliers"
- F. Iutzeler, P. Bianchi, Ph. Ciblat, and W. Hachem (2014), "Linear convergence rate for distributed optimization with the alternating direction method of multipliers"
- R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. Jordan (2015), "A general analysis of the convergence of ADMM"
- L. Vandenberghe, Lecture Notes for EE 236C, UCLA, Spring 2011-2012