Alternating Direction Method of Multipliers Ryan Tibshirani Convex Optimization 10-725
Consider the problem Last time: dual ascent min x f(x) subject to Ax = b where f is strictly convex and closed. Denote Lagrangian L(x, u) = f(x) + u T (Ax b) Dual gradient ascent repeats, for k = 1, 2, 3,... x (k) = argmin x L(x, u (k 1) ) u (k) = u (k 1) + t k (Ax (k) b) Good: x update decomposes when f does. Bad: require stringent assumptions (strong convexity of f) to ensure convergence 2
Augmented Lagrangian method (also called method of multipliers) considers the modified problem, for a parameter ρ > 0, uses modified Lagrangian min x f(x) + ρ 2 Ax b 2 2 subject to Ax = b L ρ (x, u) = f(x) + u T (Ax b) + ρ 2 Ax b 2 2 and repeats, for k = 1, 2, 3,... x (k) = argmin x L ρ (x, u (k 1) ) u (k) = u (k 1) + ρ(ax (k) b) Good: better convergence properties. Bad: lose decomposability 3
Alternating direction method of multipliers Alternating direction method of multipliers or ADMM: combines the best of both methods. Consider a problem of the form: min x,z f(x) + g(z) subject to Ax + Bz = c We define augmented Lagrangian, for a parameter ρ > 0, L ρ (x, z, u) = f(x) + g(z) + u T (Ax + Bz c) + ρ 2 Ax + Bz c 2 2 We repeat, for k = 1, 2, 3,... x (k) = argmin x z (k) = argmin z L ρ (x, z (k 1), u (k 1) ) L ρ (x (k), z, u (k 1) ) u (k) = u (k 1) + ρ(ax (k) + Bz (k) c) 4
Convergence guarantees Under modest assumptions on f, g (these do not require A, B to be full rank), the ADMM iterates satisfy, for any ρ > 0: Residual convergence: r (k) = Ax (k) + Bz (k) c 0 as k, i.e., primal iterates approach feasibility Objective convergence: f(x (k) ) + g(z (k) ) f + g, where f + g is the optimal primal objective value Dual convergence: u (k) u, where u is a dual solution For details, see Boyd et al. (2010). Note that we do not generically get primal convergence, but this is true under more assumptions Convergence rate: roughly, ADMM behaves like first-order method. Theory still being developed, see, e.g., in Hong and Luo (2012), Deng and Yin (2012), Iutzeler et al. (2014), Nishihara et al. (2015) 5
Scaled form ADMM Scaled form: denote w = u/ρ, so augmented Lagrangian becomes L ρ (x, z, w) = f(x) + g(z) + ρ 2 Ax + Bz c + w 2 2 ρ 2 w 2 2 and ADMM updates become x (k) = argmin x z (k) = argmin z f(x) + ρ 2 Ax + Bz(k 1) c + w (k 1) 2 2 g(z) + ρ 2 Ax(k) + Bz c + w (k 1) 2 2 w (k) = w (k 1) + Ax (k) + Bz (k) c Note that here kth iterate w (k) is just a running sum of residuals: w (k) = w (0) + k ( Ax (i) + Bz (i) c ) i=1 6
Outline Today: Examples, practicalities Consensus ADMM Special decompositions 7
Connection to proximal operators Consider min f(x) + g(x) min x x,z f(x) + g(z) subject to x = z ADMM steps (equivalent to Douglas-Rachford, here): x (k) = prox f,1/ρ (z (k 1) w (k 1) ) z (k) = prox g,1/ρ (x (k) + w (k 1) ) w (k) = w (k 1) + x (k) z (k) where prox f,1/ρ is the proximal operator for f at parameter 1/ρ, and similarly for prox g,1/ρ In general, the update for block of variables reduces to prox update whenever the corresponding linear transformation is the identity 8
Example: lasso regression Given y R n, X R n p, recall the lasso problem: We can rewrite this as: 1 min β 2 y Xβ 2 2 + λ β 1 min β,α 1 2 y Xβ 2 2 + λ α 1 subject to β α = 0 ADMM steps: β (k) = (X T X + ρi) 1( X T y + ρ(α (k 1) w (k 1) ) ) α (k) = S λ/ρ (β (k) + w (k 1) ) w (k) = w (k 1) + β (k) α (k) 9
10 Notes: The matrix X T X + ρi is always invertible, regardless of X If we compute a factorization (say Cholesky) in O(p 3 ) flops, then each β update takes O(p 2 ) flops The α update applies the soft-thresolding operator S t, which recall is defined as x j t x > t [S t (x)] j = 0 t x t, j = 1,... p x j + t x < t ADMM steps are almost like repeated soft-thresholding of ridge regression coefficients
11 Comparison of various algorithms for lasso regression: 100 random instances with n = 200, p = 50 Suboptimality fk fstar 1e 10 1e 07 1e 04 1e 01 Coordinate desc Proximal grad Accel prox ADMM (rho=50) ADMM (rho=100) ADMM (rho=200) 0 10 20 30 40 50 60 Iteration k
12 Practicalities In practice, ADMM usually obtains a relatively accurate solution in a handful of iterations, but it requires a large number of iterations for a highly accurate solution (like a first-order method) Choice of ρ can greatly influence practical convergence of ADMM: ρ too large not enough emphasis on minimizing f + g ρ too small not enough emphasis on feasibility Boyd et al. (2010) give a strategy for varying ρ; it can work well in practice, but does not have convergence guarantees Like deriving duals, transforming a problem into one that ADMM can handle is sometimes a bit subtle, since different forms can lead to different algorithms
13 Example: group lasso regression Given y R n, X R n p, recall the group lasso problem: Rewrite as: min β,α ADMM steps: 1 min β 2 y Xβ 2 2 + λ 1 2 y Xβ 2 2 + λ G c g β g 2 g=1 G c g α g 2 subject to β α = 0 g=1 β (k) = (X T X + ρi) 1( X T y + ρ(α (k 1) w (k 1) ) ) α g (k) ( = R cgλ/ρ β (k) g + w g (k 1) ), g = 1,... G w (k) = w (k 1) + β (k) α (k)
14 Notes: The matrix X T X + ρi is always invertible, regardless of X If we compute a factorization (say Cholesky) in O(p 3 ) flops, then each β update takes O(p 2 ) flops The α update applies the group soft-thresolding operator R t, which recall is defined as ( R t (x) = 1 t ) x x 2 + Similar ADMM steps follow for a sum of arbitrary norms of as regularizer, provided we know prox operator of each norm ADMM algorithm can be rederived when groups have overlap (hard problem to optimize in general!). See Boyd et al. (2010)
15 Example: sparse subspace estimation Given S S p (typically S 0 is a covariance matrix), consider the sparse subspace estimation problem (Vu et al., 2013): max Y tr(sy ) λ Y 1 subject to Y F k where F k is the Fantope of order k, namely F k = {Y S p : 0 Y I, tr(y ) = k} Note that when λ = 0, the above problem is equivalent to ordinary principal component analysis (PCA) This above is an SDP and in principle solveable with interior point methods, though these can be complicated to implement and quite slow for large problem sizes
16 Rewrite as: min Y,Z tr(sy ) + I F k (Y ) + λ Z 1 subject to Y = Z ADMM steps are: Y (k) = P Fk (Z (k 1) W (k 1) + S/ρ) Z (k) = S λ/ρ (Y (k) + W (k 1) ) W (k) = W (k 1) + Y (k) Z (k) Here P Fk is Fantope projection operator, computed by clipping the eigendecomposition A = UΣU T, Σ = diag(σ 1,..., σ p ): P Fk (A) = UΣ θ U T, Σ θ = diag(σ 1 (θ),..., σ p (θ)) where each σ i (θ) = min{max{σ i θ, 0}, 1}, and p i=1 σ i(θ) = k
17 Example: sparse + low rank decomposition Given M R n m, consider the sparse plus low rank decomposition problem (Candes et al., 2009): min L,S subject to L tr + λ S 1 L + S = M ADMM steps: L (k) = S tr 1/ρ (M S(k 1) + W (k 1) ) S (k) = S l 1 λ/ρ (M L(k) + W (k 1) ) W (k) = W (k 1) + M L (k) S (k) where, to distinguish them, we use Sλ/ρ tr for matrix soft-thresolding and S l 1 λ/ρ for elementwise soft-thresolding
18 Example from Candes et al. (2009): (a) Original frames (b) Low-rank ˆL (c) Sparse Ŝ
19 Consensus ADMM Consider a problem of the form: min x B f i (x) i=1 The consensus ADMM approach begins by reparametrizing: min x 1,...x B,x B f i (x i ) subject to x i = x, i = 1,... B i=1 This yields the decomposable ADMM steps: x (k) i = argmin x i B x (k) = 1 B w (k) i i=1 = w (k 1) i f i (x i ) + ρ 2 x i x (k 1) + w (k 1) i 2 2, i = 1,... B ( x (k) i ) + w (k 1) i + x (k) i x (k), i = 1,... B
20 Write x = 1 B B i=1 x i and similarly for other variables. Not hard to see that w (k) = 0 for all iterations k 1 Hence ADMM steps can be simplified, by taking x (k) = x (k) : x (k) i w (k) i = argmin x i = w (k 1) i f i (x i ) + ρ 2 x i x (k 1) + w (k 1) i 2 2, i = 1,... B + x (k) i x (k), i = 1,... B To reiterate, the x i, i = 1,... B updates here are done in parallel Intuition: Try to minimize each f i (x i ), use (squared) l 2 regularization to pull each x i towards the average x If a variable x i is bigger than the average, then w i is increased So the regularization in the next step pulls x i even closer
21 General consensus ADMM Consider a problem of the form: min x B f i (a T i x + b i ) + g(x) i=1 For consensus ADMM, we again reparametrize: min x 1,...x B,x B f i (a T i x i + b i ) + g(x) subject to x i = x, i = 1,... B i=1 This yields the decomposable ADMM updates: x (k) i = argmin x i x (k) = argmin x w (k) i = w (k 1) i f i (a T i x i + b i ) + ρ 2 x i x (k 1) + w (k 1) i 2 2, Bρ 2 x x(k) w (k 1) 2 2 + g(x) + x (k) i x (k), i = 1,... B i = 1,... B
22 Notes: It is no longer true that w (k) = 0 at a general iteration k, so ADMM steps do not simplify as before To reiterate, the x i, i = 1,... B updates are done in parallel Each x i update can be thought of as a loss minimization on part of the data, with l 2 regularization The x update is a proximal operation in regularizer g The w update drives the individual variables into consensus A different initial reparametrization will give rise to a different ADMM algorithm See Boyd et al. (2010), Parikh and Boyd (2013) for more details on consensus ADMM, strategies for splitting up into subproblems, and implementation tips
23 Special decompositions ADMM can exhibit much faster convergence than usual, when we parametrize subproblems in a special way ADMM updates relate closely to block coordinate descent, in which we optimize a criterion in an alternating fashion across blocks of variables With this in mind, get fastest convergence when minimizing over blocks of variables leads to updates in nearly orthogonal directions Suggests we should design ADMM form (auxiliary constraints) so that primal updates de-correlate as best as possible This is done in, e.g., Ramdas and Tibshirani (2014), Wytock et al. (2014), Barbero and Sra (2014)
Example: 2d fused lasso Given an image Y Rd d, equivalently written as y Rn, recall the 2d fused lasso or 2d total variation denoising problem: min Θ X 1 ky Θk2F + λ Θi,j Θi+1,j + Θi,j Θi,j+1 2 i,j min θ 1 ky θk22 + λkdθk1 2 Here D Rm n is a 2d difference operator giving the appropriate differences (across horizontally and vertically adjacent positions) 164 Parallel and Distributed Algorithms Figure 5.2: Variables are black dots; the partitions P and Q are in orange and cyan. 24
25 First way to rewrite: min θ,z 1 2 y θ 2 2 + λ z 1 subject to θ = Dz Leads to ADMM steps: Notes: θ (k) = (I + ρd T D) 1( y + ρd T (z (k 1) + w (k 1) ) ) z (k) = S λ/ρ (Dθ (k) w (k 1) ) w (k) = w (k 1) + z (k 1) Dθ (k) The θ update solves linear system in I + ρl, with L = D T D the graph Laplacian matrix of the 2d grid, so this can be done efficiently, in roughly O(n) operations The z update applies soft thresholding operator S t Hence one entire ADMM cycle uses roughly O(n) operations
26 Second way to rewrite: min H,V subject to H = V 1 2 Y H 2 F + λ i,j Leads to ADMM steps: H (k),j V (k) i, ( (k 1) Y + ρ(v = FL 1d,j W (k 1) λ/(1+ρ) 1 + ρ ( = FL 1d λ/ρ H (k) i, + W (k 1) i, W (k) = W (k 1) + H (k) V (k) Notes: ( ) H i,j H i+1,j + V i,j V i,j+1,j ) ), i = 1,..., d ), j = 1,..., d Both H, V updates solve (sequence of) 1d fused lassos, where we write FL 1d 1 τ (a) = argmin x 2 a x 2 2 + τ d 1 i=1 x i x i+1
Critical: each 1d fused lasso solution can be computed exactly in O(d) operations with specialized algorithms (e.g., Johnson, 2013; Davies and Kovac, 2001) Distributed Hence one entire ADMM cycleparallel again usesand O(n) operations Algori 27
28 Comparison of 2d fused lasso algorithms: an image of dimension 300 200 (so n = 60, 000) 3 4 5 6 7 Data Exact solution
29 Two ADMM algorithms, (say) standard and specialized ADMM: f(k) fstar 1e+01 1e+02 1e+03 1e+04 1e+05 Standard Specialized 0 20 40 60 80 100 k
30 ADMM iterates visualized after k = 10, 30, 50, 100 iterations: 3 4 5 6 7 Standard ADMM Specialized ADMM 10 iterations 10 iterations
30 ADMM iterates visualized after k = 10, 30, 50, 100 iterations: 3 4 5 6 7 Standard ADMM Specialized ADMM 30 iterations 30 iterations
30 ADMM iterates visualized after k = 10, 30, 50, 100 iterations: 3 4 5 6 7 Standard ADMM Specialized ADMM 50 iterations 50 iterations
30 ADMM iterates visualized after k = 10, 30, 50, 100 iterations: 3 4 5 6 7 Standard ADMM Specialized ADMM 100 iterations 100 iterations
31 References A. Barbero and S. Sra (2014), Modular proximal optimization for multidimensional total-variation regularization S. Boyd and N. Parikh and E. Chu and B. Peleato and J. Eckstein (2010), Distributed optimization and statistical learning via the alternating direction method of multipliers E. Candes and X. Li and Y. Ma and J. Wright (2009), Robust principal component analysis? N. Parikh and S. Boyd (2013), Proximal algorithms A. Ramdas and R. Tibshirani (2014), Fast and flexible ADMM algorithms for trend filtering V. Vu and J. Cho and J. Lei and K. Rohe (2013), Fantope projection and selection: a near-optimal convex relaxation of sparse PCA M. Wytock and S. Sra. and Z. Kolter (2014), Fast Newton methods for the group fused lasso