Dual Ascent
Ryan Tibshirani
Convex Optimization 10-725
Last time: coordinate descent

Consider the problem min_x f(x), where f(x) = g(x) + sum_{i=1}^n h_i(x_i), with g convex and differentiable and each h_i convex. Coordinate descent: let x^(0) ∈ R^n, and repeat, for k = 1, 2, 3, ...:

x_i^(k) = argmin_{x_i} f(x_1^(k), ..., x_{i-1}^(k), x_i, x_{i+1}^(k-1), ..., x_n^(k-1)), i = 1, ..., n

- Very simple and easy to implement
- Careful implementations can achieve state-of-the-art performance
- Scalable, e.g., don't need to keep the full data in memory
Reminder: conjugate functions

Recall that given f : R^n → R, the function

f*(y) = max_x y^T x − f(x)

is called its conjugate.

- Conjugates appear frequently in dual programs, since −f*(y) = min_x f(x) − y^T x
- If f is closed and convex, then f** = f. Also,

x ∈ ∂f*(y) ⟺ y ∈ ∂f(x) ⟺ x ∈ argmin_z f(z) − y^T z

- If f is strictly convex, then ∇f*(y) = argmin_z f(z) − y^T z
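As a concrete check of these facts, a standard worked example (the positive definite matrix Q is an assumption added here, not from the slides) is the quadratic f(x) = (1/2) x^T Q x:

```latex
f^*(y) \;=\; \max_x \; y^T x - \tfrac{1}{2} x^T Q x \;=\; \tfrac{1}{2}\, y^T Q^{-1} y,
```

with maximizer x = Q^{-1} y, so ∇f*(y) = Q^{-1} y = argmin_z f(z) − y^T z, consistent with the last bullet (f is strictly convex here, so the argmin is unique).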
Outline

Today:
- Dual ascent
- Dual decomposition
- Augmented Lagrangians
- A peek at ADMM
Dual ascent

Even if we can't derive the dual (conjugate) in closed form, we can still use dual-based gradient or subgradient methods.

Consider the problem

min_x f(x) subject to Ax = b

Its dual problem is

max_u −f*(−A^T u) − b^T u

where f* is the conjugate of f. Defining g(u) = −f*(−A^T u) − b^T u, note that

∂g(u) = A ∂f*(−A^T u) − b
Therefore, using what we know about conjugates,

Ax − b ∈ ∂g(u), where x ∈ argmin_z f(z) + u^T Az

The dual subgradient method (for maximizing the dual objective) starts with an initial dual guess u^(0), and repeats, for k = 1, 2, 3, ...:

x^(k) ∈ argmin_x f(x) + (u^(k−1))^T Ax
u^(k) = u^(k−1) + t_k (Ax^(k) − b)

Step sizes t_k, k = 1, 2, 3, ..., are chosen in standard ways.
Recall that if f is strictly convex, then f* is differentiable, and so this becomes dual gradient ascent, which repeats, for k = 1, 2, 3, ...:

x^(k) = argmin_x f(x) + (u^(k−1))^T Ax
u^(k) = u^(k−1) + t_k (Ax^(k) − b)

(The difference is that each x^(k) is unique here.) Again, step sizes t_k, k = 1, 2, 3, ..., are chosen in standard ways. Lastly, proximal gradients and acceleration can be applied as they would usually.
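The two steps above can be sketched on a toy instance: a minimal, hypothetical example with f(x) = (1/2) x^T Q x for diagonal positive definite Q (so f is strictly convex and the inner argmin has the closed form x = −Q^{-1} A^T u), and a step size chosen small enough by hand:

```python
import numpy as np

# Hypothetical instance: minimize (1/2) x^T Q x subject to Ax = b
Q = np.diag([1.0, 2.0, 4.0])     # strictly convex, so f* is differentiable
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
Qinv = np.linalg.inv(Q)

u = np.zeros(1)                  # dual variable u^(0)
t = 0.5                          # constant step size (assumed small enough)
for _ in range(50):
    x = Qinv @ (-A.T @ u)        # x^(k) = argmin_x f(x) + (u^(k-1))^T Ax
    u = u + t * (A @ x - b)      # dual gradient ascent step

print(x)          # -> approximately [4/7, 2/7, 1/7], the primal solution
print(A @ x - b)  # primal residual, approximately 0
```

The dual iterates contract toward u* = −4/7 here, so the primal iterates become feasible; note feasibility holds only in the limit, not at each iteration.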
Smoothness of f*

Assume that f is a closed and convex function. Then f is strongly convex with parameter m ⟺ ∇f* is Lipschitz with parameter 1/m.

Proof of "⟹": Recall that if g is strongly convex with minimizer x, then

g(y) ≥ g(x) + (m/2)||y − x||_2^2, for all y

Hence, defining x_u = ∇f*(u), x_v = ∇f*(v),

f(x_v) − u^T x_v ≥ f(x_u) − u^T x_u + (m/2)||x_u − x_v||_2^2
f(x_u) − v^T x_u ≥ f(x_v) − v^T x_v + (m/2)||x_u − x_v||_2^2

Adding these together, applying Cauchy-Schwarz, and rearranging shows that

||x_u − x_v||_2 ≤ (1/m)||u − v||_2
Proof of "⟸": for simplicity, call g = f* and L = 1/m. Since ∇g is Lipschitz with constant L, so is the gradient of g_x(z) = g(z) − ∇g(x)^T z, hence

g_x(z) ≤ g_x(y) + ∇g_x(y)^T (z − y) + (L/2)||z − y||_2^2

Minimizing each side over z, and rearranging, gives

(1/(2L))||∇g(x) − ∇g(y)||_2^2 ≤ g(y) − g(x) + ∇g(x)^T (x − y)

Exchanging the roles of x and y, and adding together, gives

(1/L)||∇g(x) − ∇g(y)||_2^2 ≤ (∇g(x) − ∇g(y))^T (x − y)

Let u = ∇g(x), v = ∇g(y); then x ∈ ∂g*(u), y ∈ ∂g*(v), and the above reads (x − y)^T (u − v) ≥ (1/L)||u − v||_2^2, implying the result.
Convergence guarantees

The following results follow from combining the last fact with what we already know about gradient descent:

- If f is strongly convex with parameter m, then dual gradient ascent with constant step size t_k = m converges at the sublinear rate O(1/ε)
- If f is strongly convex with parameter m and ∇f is Lipschitz with parameter L, then dual gradient ascent with step size t_k = 2/(1/m + 1/L) converges at the linear rate O(log(1/ε))

Note that these results describe convergence of the dual objective to its optimal value.
Dual decomposition

Consider

min_x sum_{i=1}^B f_i(x_i) subject to Ax = b

Here x = (x_1, ..., x_B) ∈ R^n divides into B blocks of variables, with each x_i ∈ R^{n_i}. We can also partition A accordingly:

A = [A_1 ... A_B], where A_i ∈ R^{m × n_i}

A simple but powerful observation: in the calculation of the (sub)gradient, the minimization decomposes into B separate problems:

x^+ ∈ argmin_x sum_{i=1}^B f_i(x_i) + u^T Ax
⟺ x_i^+ ∈ argmin_{x_i} f_i(x_i) + u^T A_i x_i, i = 1, ..., B
Dual decomposition algorithm: repeat, for k = 1, 2, 3, ...:

x_i^(k) ∈ argmin_{x_i} f_i(x_i) + (u^(k−1))^T A_i x_i, i = 1, ..., B
u^(k) = u^(k−1) + t_k (sum_{i=1}^B A_i x_i^(k) − b)

Can think of these steps as:
- Broadcast: send u to each of the B processors; each optimizes in parallel to find x_i
- Gather: collect A_i x_i from each processor, update the global dual variable u
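The broadcast/gather loop above can be sketched on a minimal, hypothetical instance with B = 2 blocks and f_i(x_i) = (1/2)||x_i||_2^2, so each block subproblem has the closed form x_i = −A_i^T u (the data and step size below are assumptions for illustration):

```python
import numpy as np

# Hypothetical data: two blocks, constraint A1 x1 + A2 x2 = b
A1 = np.eye(2)
A2 = np.array([[1.0, 1.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

u = np.zeros(2)
t = 0.2                                  # constant step size (assumed small enough)
for _ in range(300):
    # "Broadcast": each block solves its own subproblem (conceptually in parallel)
    x1 = -A1.T @ u
    x2 = -A2.T @ u
    # "Gather": update the global dual variable from the combined residual
    u = u + t * (A1 @ x1 + A2 @ x2 - b)

print(A1 @ x1 + A2 @ x2 - b)   # primal residual, approximately 0
```

In code the blocks run sequentially, but nothing couples them within an iteration, which is exactly what makes the method distributable.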
Dual decomposition with inequality constraints

Consider

min_x sum_{i=1}^B f_i(x_i) subject to sum_{i=1}^B A_i x_i ≤ b

Dual decomposition, i.e., the projected subgradient method:

x_i^(k) ∈ argmin_{x_i} f_i(x_i) + (u^(k−1))^T A_i x_i, i = 1, ..., B
u^(k) = (u^(k−1) + t_k (sum_{i=1}^B A_i x_i^(k) − b))_+

where u_+ denotes the positive part of u, i.e., (u_+)_i = max{0, u_i}, i = 1, ..., m
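A minimal sketch of the projected variant, on a hypothetical scalar instance with f_i(x_i) = (1/2) x_i^2 (so x_i = −u in closed form) and binding constraint x_1 + x_2 ≤ −1; the step size is an assumption:

```python
# Hypothetical two-block instance: min (1/2)(x1^2 + x2^2)  s.t.  x1 + x2 <= -1
b = -1.0
u = 0.0       # dual variable (a price), kept nonnegative
t = 0.2       # constant step size (assumed small enough)
for _ in range(200):
    x1 = -u                                 # block 1 subproblem (closed form)
    x2 = -u                                 # block 2 subproblem (closed form)
    u = max(0.0, u + t * (x1 + x2 - b))     # projected (positive-part) update

print((x1, x2, u))   # -> approximately (-0.5, -0.5, 0.5): constraint active, price u > 0
```

Here the constraint is active at the solution, so the price u settles at a strictly positive value; if the constraint were slack, the projection would drive u to 0.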
Price coordination interpretation (Vandenberghe):
- Have B units in a system; each unit chooses its own decision variable x_i (how to allocate its goods)
- Constraints are limits on shared resources (rows of A); each component u_j of the dual variable is the price of resource j
- Dual update: u_j^+ = (u_j − t s_j)_+, j = 1, ..., m, where s = b − sum_{i=1}^B A_i x_i are the slacks
- Increase price u_j if resource j is over-utilized (s_j < 0); decrease price u_j if resource j is under-utilized (s_j > 0); never let prices become negative
Augmented Lagrangian method (also known as: method of multipliers)

Disadvantage of dual ascent: it requires strong conditions to ensure convergence. This is improved by the augmented Lagrangian method, also called the method of multipliers. We transform the primal problem:

min_x f(x) + (ρ/2)||Ax − b||_2^2 subject to Ax = b

where ρ > 0 is a parameter. This is clearly equivalent to the original problem, and its objective is strongly convex when A has full column rank. Use dual gradient ascent:

x^(k) = argmin_x f(x) + (u^(k−1))^T Ax + (ρ/2)||Ax − b||_2^2
u^(k) = u^(k−1) + ρ(Ax^(k) − b)
Notice the step size choice t_k = ρ in the dual algorithm. Why? Since x^(k) minimizes f(x) + (u^(k−1))^T Ax + (ρ/2)||Ax − b||_2^2 over x, we have

0 ∈ ∂f(x^(k)) + A^T (u^(k−1) + ρ(Ax^(k) − b)) = ∂f(x^(k)) + A^T u^(k)

This is the stationarity condition for the original primal problem; under mild conditions, Ax^(k) − b → 0 as k → ∞ (the primal iterates become feasible), so the KKT conditions are satisfied in the limit and x^(k), u^(k) converge to solutions.

- Advantage: much better convergence properties
- Disadvantage: we lose decomposability! (Separability is ruined by the augmented Lagrangian...)
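The improved convergence can be sketched on a case where plain dual ascent would break down: a hypothetical linear objective f(x) = c^T x, which is not strictly convex (the inner argmin of dual ascent would be unbounded), but for which the ρ-term makes the x-update well posed when A has full column rank; all data below are assumptions:

```python
import numpy as np

# Hypothetical instance: min c^T x subject to Ax = b, A with full column rank
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = A @ np.array([1.0, 2.0])     # chosen so x* = (1, 2) is feasible
c = np.array([1.0, 1.0])
rho = 1.0

u = np.zeros(3)
for _ in range(100):
    # x^(k) = argmin_x c^T x + u^T Ax + (rho/2)||Ax - b||_2^2  (closed form:
    # set the gradient c + A^T u + rho A^T (Ax - b) to zero and solve for x)
    x = np.linalg.solve(rho * A.T @ A, rho * A.T @ b - c - A.T @ u)
    u = u + rho * (A @ x - b)    # step size fixed at rho

print(A @ x - b)   # primal residual, approximately 0
```

Note the x-update couples all coordinates through A^T A: this is the lost decomposability mentioned above.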
Alternating direction method of multipliers

The alternating direction method of multipliers, or ADMM, tries for the best of both worlds. Consider the problem

min_{x,z} f(x) + g(z) subject to Ax + Bz = c

As before, we augment the objective:

min_{x,z} f(x) + g(z) + (ρ/2)||Ax + Bz − c||_2^2 subject to Ax + Bz = c

for a parameter ρ > 0. We define the augmented Lagrangian

L_ρ(x, z, u) = f(x) + g(z) + u^T (Ax + Bz − c) + (ρ/2)||Ax + Bz − c||_2^2
ADMM repeats the steps, for k = 1, 2, 3, ...:

x^(k) = argmin_x L_ρ(x, z^(k−1), u^(k−1))
z^(k) = argmin_z L_ρ(x^(k), z, u^(k−1))
u^(k) = u^(k−1) + ρ(Ax^(k) + Bz^(k) − c)

Note that the usual method of multipliers would have replaced the first two steps by a joint minimization:

(x^(k), z^(k)) = argmin_{x,z} L_ρ(x, z, u^(k−1))
Convergence guarantees

Under modest assumptions on f, g (these do not require A, B to be full rank), the ADMM iterates satisfy, for any ρ > 0:

- Residual convergence: r^(k) = Ax^(k) + Bz^(k) − c → 0 as k → ∞, i.e., the primal iterates approach feasibility
- Objective convergence: f(x^(k)) + g(z^(k)) → f* + g*, where f* + g* is the optimal primal objective value
- Dual convergence: u^(k) → u*, where u* is a dual solution

For details, see Boyd et al. (2010). Note that we do not generically get primal convergence, but this is true under more assumptions.

Convergence rate: roughly, ADMM behaves like a first-order method. The theory is still being developed; see, e.g., Hong and Luo (2012), Deng and Yin (2012), Iutzeler et al. (2014), Nishihara et al. (2015).
Scaled form ADMM

Scaled form: denote w = u/ρ, so the augmented Lagrangian becomes

L_ρ(x, z, w) = f(x) + g(z) + (ρ/2)||Ax + Bz − c + w||_2^2 − (ρ/2)||w||_2^2

and the ADMM updates become

x^(k) = argmin_x f(x) + (ρ/2)||Ax + Bz^(k−1) − c + w^(k−1)||_2^2
z^(k) = argmin_z g(z) + (ρ/2)||Ax^(k) + Bz − c + w^(k−1)||_2^2
w^(k) = w^(k−1) + Ax^(k) + Bz^(k) − c

Note that here the kth iterate w^(k) is just a running sum of residuals:

w^(k) = w^(0) + sum_{i=1}^k (Ax^(i) + Bz^(i) − c)
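The scaled form can be verified by completing the square in the linear-plus-quadratic dual terms; writing r = Ax + Bz − c for the residual:

```latex
u^T r + \frac{\rho}{2}\|r\|_2^2
  \;=\; \frac{\rho}{2}\Big\|r + \frac{u}{\rho}\Big\|_2^2 - \frac{1}{2\rho}\|u\|_2^2
  \;=\; \frac{\rho}{2}\|r + w\|_2^2 - \frac{\rho}{2}\|w\|_2^2,
\qquad w = \frac{u}{\rho}
```

The trailing ||w||_2^2 term does not involve x or z, which is why it drops out of the x- and z-updates above.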
Example: alternating projections

Consider finding a point in the intersection of convex sets C, D ⊆ R^n:

min_x I_C(x) + I_D(x)

To get this into ADMM form, we express it as

min_{x,z} I_C(x) + I_D(z) subject to x − z = 0

Each ADMM cycle involves two projections:

x^(k) = P_C(z^(k−1) − w^(k−1))
z^(k) = P_D(x^(k) + w^(k−1))
w^(k) = w^(k−1) + x^(k) − z^(k)
Compare the classical alternating projections algorithm (von Neumann):

x^(k) = P_C(z^(k−1))
z^(k) = P_D(x^(k))

The difference is that ADMM utilizes a dual variable w to offset the projections. When (say) C is a linear subspace, the ADMM algorithm becomes

x^(k) = P_C(z^(k−1))
z^(k) = P_D(x^(k) + w^(k−1))
w^(k) = w^(k−1) + x^(k) − z^(k)

Initialized at z^(0) = y, this is equivalent to Dykstra's algorithm for finding the closest point in C ∩ D to y.
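The alternating-projections cycle is easy to sketch in scaled form; the two sets below (the nonnegative orthant and a halfspace in R^2) and the starting point are hypothetical choices for illustration:

```python
import numpy as np

def proj_C(v):
    """Projection onto C = {x : x >= 0} (nonnegative orthant)."""
    return np.maximum(v, 0.0)

def proj_D(v):
    """Projection onto the halfspace D = {x : x1 + x2 >= 2}."""
    a = np.array([1.0, 1.0])
    gap = 2.0 - a @ v
    return v if gap <= 0 else v + (gap / (a @ a)) * a

z = np.zeros(2)
w = np.zeros(2)              # scaled dual variable w = u / rho
for _ in range(100):
    x = proj_C(z - w)        # x-update: project, offset by -w
    z = proj_D(x + w)        # z-update: project, offset by +w
    w = w + x - z            # running sum of residuals

print(x)   # a point in C ∩ D; on this instance the iterates land on [1, 1]
```

On this particular instance the dual offset w returns to zero and x = z after a few cycles, so the limit point lies in both sets exactly.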
References

- S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2010), "Distributed optimization and statistical learning via the alternating direction method of multipliers"
- W. Deng and W. Yin (2012), "On the global and linear convergence of the generalized alternating direction method of multipliers"
- M. Hong and Z. Luo (2012), "On the linear convergence of the alternating direction method of multipliers"
- F. Iutzeler, P. Bianchi, Ph. Ciblat, and W. Hachem (2014), "Linear convergence rate for distributed optimization with the alternating direction method of multipliers"
- R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. Jordan (2015), "A general analysis of the convergence of ADMM"
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012