Introduction to Alternating Direction Method of Multipliers

Yale Chang
Machine Learning Group Meeting
September 29, 2016
Formulation

\[
\min_{x,z}\; f(x) + g(z) \quad \text{s.t.}\quad Ax + Bz = c \tag{1}
\]

- $x \in \mathbb{R}^n$, $z \in \mathbb{R}^m$, $A \in \mathbb{R}^{p \times n}$, $B \in \mathbb{R}^{p \times m}$, $c \in \mathbb{R}^p$
- $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ and $g : \mathbb{R}^m \to \mathbb{R} \cup \{+\infty\}$ are closed, proper, and convex; equivalently, the epigraphs of $f$ and $g$ are closed nonempty convex sets ($\operatorname{epi} f = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : f(x) \le t\}$)
- $f$, $g$ can be non-differentiable and can take the value $+\infty$ (e.g., the indicator function $I(x) = 0$ if $x \in C$ and $I(x) = +\infty$ otherwise)
- We also assume strong duality holds (Slater's condition)
ADMM Iterations: Unscaled Form

The augmented Lagrangian of problem (1) is
\[
L_\rho(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + \tfrac{\rho}{2}\|Ax + Bz - c\|_2^2 \tag{2}
\]
The ADMM iterations are
\[
x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k) \tag{3}
\]
\[
z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k) \tag{4}
\]
\[
y^{k+1} = y^k + \rho(Ax^{k+1} + Bz^{k+1} - c) \tag{5}
\]
The algorithm state is $(z^k, y^k)$.
ADMM Iterations: Scaled Form

Let $r = Ax + Bz - c$ and $u = y/\rho$. The augmented Lagrangian can be written as
\[
L_\rho(x, z, y) = f(x) + g(z) + y^T r + \tfrac{\rho}{2}\|r\|_2^2
= f(x) + g(z) + \tfrac{\rho}{2}\|r + y/\rho\|_2^2 - \tfrac{1}{2\rho}\|y\|_2^2
= f(x) + g(z) + \tfrac{\rho}{2}\|r + u\|_2^2 - \tfrac{\rho}{2}\|u\|_2^2
\]
The ADMM iterations become
\[
x^{k+1} = \arg\min_x f(x) + \tfrac{\rho}{2}\|Ax + Bz^k - c + u^k\|_2^2 \tag{6}
\]
\[
z^{k+1} = \arg\min_z g(z) + \tfrac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2 \tag{7}
\]
\[
u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c \tag{8}
\]
In practice, we prefer the scaled form because it is shorter and easier to work with.
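As an illustration, here is a minimal sketch of the scaled-form iterations in Python. The subproblem solvers `prox_f` and `prox_g`, the fixed iteration count, and all names are assumptions for the sketch, not part of the slides:

```python
import numpy as np

def admm_scaled(prox_f, prox_g, A, B, c, rho=1.0, n_iter=100):
    """Generic scaled-form ADMM for min f(x) + g(z) s.t. Ax + Bz = c.

    prox_f(v) must return argmin_x f(x) + (rho/2)||Ax - v||^2,
    prox_g(v) must return argmin_z g(z) + (rho/2)||Bz - v||^2.
    """
    p, n = A.shape
    m = B.shape[1]
    x, z, u = np.zeros(n), np.zeros(m), np.zeros(p)
    for _ in range(n_iter):
        x = prox_f(c - B @ z - u)    # x-update, eq. (6)
        z = prox_g(c - A @ x - u)    # z-update, eq. (7)
        u = u + A @ x + B @ z - c    # scaled dual update, eq. (8)
    return x, z, u
```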
Stopping Criteria

The following inequality always holds:
\[
f(x^k) + g(z^k) - p^\star \le -(y^k)^T r^k + (x^k - x^\star)^T s^k \tag{9}
\]
where $p^\star = \inf\{f(x) + g(z) : Ax + Bz = c\}$ and $x^\star$ is the optimal value of $x$ at which $f(x) + g(z)$ achieves the optimal objective $p^\star$.
$r^k = Ax^k + Bz^k - c$ is called the primal residual, and $s^k = \rho A^T B(z^k - z^{k-1})$ is called the dual residual.
If $\|r^k\|_2 \le \epsilon^{\text{pri}}$ and $\|s^k\|_2 \le \epsilon^{\text{dual}}$ both hold, the iterations can stop.
There are suggestions on how to set $\epsilon^{\text{pri}}$ and $\epsilon^{\text{dual}}$ in practice; refer to Boyd's tutorial for details.
We will skip the proof of convergence for now.
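A sketch of the stopping check in Python, using the absolute-plus-relative tolerance combination suggested in Boyd's tutorial; the function name and default tolerances are assumptions:

```python
import numpy as np

def converged(A, B, c, x, z, z_prev, y, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Check the primal/dual residual stopping criteria."""
    r = A @ x + B @ z - c                 # primal residual r^k
    s = rho * A.T @ (B @ (z - z_prev))    # dual residual s^k
    p, n = A.shape
    # Tolerances: absolute part plus a relative part, following Boyd et al.
    eps_pri = np.sqrt(p) * eps_abs + eps_rel * max(
        np.linalg.norm(A @ x), np.linalg.norm(B @ z), np.linalg.norm(c))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(A.T @ y)
    return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual
```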
General Patterns: Proximal Operator

How do we solve each individual step in the iterations? The $x$-update step can be expressed as
\[
x^+ = \arg\min_x f(x) + (\rho/2)\|Ax - v\|_2^2 \tag{10}
\]
where $v = -Bz + c - u$ is a known constant in this step.
Case 1: Proximal Operator. Assume $A = I$, which appears frequently in examples; then
\[
x^+ = \arg\min_x f(x) + (\rho/2)\|x - v\|_2^2 = \operatorname{prox}_{f,\rho}(v)
\]
where $\operatorname{prox}_{f,\rho}$ is called the proximal operator of $f$ with penalty $\rho$. When the function $f$ is simple enough, the proximal operator can be evaluated analytically.
Proximal Operator: Examples

Indicator function: $f(x) = 0$ if $x \in C$ and $f(x) = +\infty$ otherwise, where $C$ is a closed nonempty convex set. Then the $x$-update is
\[
x^+ = \operatorname{prox}_{f,\rho}(v) = \Pi_C(v)
\]
where $\Pi_C$ denotes the projection onto the set $C$ (in the Euclidean norm).

1) Hyper-rectangle $C = \{x : l \le x \le u\}$ (recall the unit ball of $\ell_\infty$):
\[
(\Pi_C(v))_k = \begin{cases} l_k, & v_k \le l_k \\ v_k, & l_k \le v_k \le u_k \\ u_k, & v_k \ge u_k \end{cases}
\]
2) Affine set $C = \{x \in \mathbb{R}^n : Ax = b\}$ ($A \in \mathbb{R}^{m \times n}$):
\[
\Pi_C(v) = v - A^\dagger(Av - b) \quad (A^\dagger \text{ is the pseudoinverse of } A)
\]
3) Positive semidefinite cone $C = S^n_+$:
\[
\Pi_C(V) = \sum_{i=1}^n (\lambda_i)_+\, u_i u_i^T, \quad \text{where } V = \sum_{i=1}^n \lambda_i u_i u_i^T
\]
More examples can be found in Proximal Algorithms by Parikh and Boyd.
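Minimal NumPy sketches of the three projections above (function names are mine):

```python
import numpy as np

def proj_box(v, l, u):
    """Projection onto the hyper-rectangle {x : l <= x <= u}: elementwise clipping."""
    return np.clip(v, l, u)

def proj_affine(v, A, b):
    """Projection onto {x : Ax = b} via the pseudoinverse of A."""
    return v - np.linalg.pinv(A) @ (A @ v - b)

def proj_psd(V):
    """Projection onto the PSD cone: keep only nonnegative eigenvalues."""
    lam, U = np.linalg.eigh((V + V.T) / 2)   # symmetrize, then eigendecompose
    return U @ np.diag(np.maximum(lam, 0)) @ U.T
```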
General Patterns: Quadratic Objective

Suppose $f(x) = \tfrac{1}{2}x^T P x + q^T x + r$ ($P \in S^n_+$). The $x$-update becomes
\[
x^+ = \arg\min_x \tfrac{1}{2}x^T P x + q^T x + r + \tfrac{\rho}{2}\|Ax - v\|_2^2
= \arg\min_x \tfrac{1}{2}x^T (P + \rho A^T A)x - (\rho A^T v - q)^T x
= (P + \rho A^T A)^{-1}(\rho A^T v - q)
\]
Here we assume $F = P + \rho A^T A$ is invertible. Computing $x^+$ requires solving the linear system $Fx = g$ ($g = \rho A^T v - q$).
Exploiting sparsity: when $F$ is sparse, the computational cost can be reduced effectively.
Caching: when multiple linear systems with the same coefficient matrix $F$ need to be solved, the factorization of $F$ can be computed once and saved for later use.
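A sketch of the caching idea for this quadratic $x$-update, using SciPy's Cholesky routines; it assumes $P + \rho A^T A$ is positive definite, and the names are mine:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_x_update(P, A, q, rho):
    """Factor F = P + rho*A^T A once; reuse the factorization for every x-update."""
    F = cho_factor(P + rho * A.T @ A)
    def x_update(v):
        # Solve F x = rho*A^T v - q using the cached factorization
        return cho_solve(F, rho * A.T @ v - q)
    return x_update
```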
General Patterns: Smooth Objective Terms

When $f$ is smooth, general iterative methods can be used to carry out the $x$-minimization step; examples include gradient descent, the conjugate gradient method, L-BFGS, etc. The presence of the quadratic penalty term tends to improve the convergence of these iterative algorithms.
There are a few techniques to speed up the convergence of ADMM iterations:
- Early termination: terminating the $x$- or $z$-updates early can result in more ADMM iterations, but at lower cost per iteration, giving an overall improvement in efficiency.
- Warm start: initialize the iterative method using the solution obtained from the previous ADMM iteration, as in the sketch below.
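A sketch of a warm-started $x$-update for a smooth $f$ using a general-purpose solver (SciPy's L-BFGS); the objective `f`, its gradient `grad_f`, and the function name are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def x_update_smooth(f, grad_f, A, v, rho, x_prev):
    """Minimize f(x) + (rho/2)||Ax - v||^2, warm-started at the previous iterate."""
    obj = lambda x: f(x) + 0.5 * rho * np.sum((A @ x - v) ** 2)
    grad = lambda x: grad_f(x) + rho * A.T @ (A @ x - v)
    res = minimize(obj, x0=x_prev, jac=grad, method="L-BFGS-B")
    return res.x
```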
General Patterns: Decomposition

Block separability: suppose $x \in \mathbb{R}^n$ can be partitioned into $N$ subvectors and that $f$ is separable w.r.t. this partition, i.e., $f(x) = f_1(x_1) + \cdots + f_N(x_N)$, where $x_i \in \mathbb{R}^{n_i}$ and $\sum_{i=1}^N n_i = n$. If the quadratic term $\|Ax\|_2^2$ is also separable w.r.t. the partition, then the augmented Lagrangian $L_\rho$ is separable, and the $x$-update can be carried out in parallel.
One example is $f(x) = \lambda\|x\|_1$ ($\lambda > 0$) with $A = I$. In this case the $x_i$-update is
\[
x_i^+ = \arg\min_{x_i} \lambda|x_i| + \tfrac{\rho}{2}(x_i - v_i)^2.
\]
Using subdifferential calculus, the solution is $x_i^+ = S_{\lambda/\rho}(v_i)$, where the soft-thresholding operator $S$ is defined as
\[
S_\kappa(a) = \begin{cases} a - \kappa, & a > \kappa \\ 0, & |a| \le \kappa \\ a + \kappa, & a < -\kappa \end{cases}
\]
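The soft-thresholding operator is one line of NumPy (a sketch, vectorized over all entries at once):

```python
import numpy as np

def soft_threshold(a, kappa):
    """S_kappa(a): shrink each entry of a toward zero by kappa."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)
```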
Apply ADMM to Constrained Convex Optimization

The generic constrained convex optimization problem is
\[
\min_x f(x) \quad \text{s.t.}\quad x \in C \tag{11}
\]
where $x \in \mathbb{R}^n$ and $f$ and $C$ are convex. Its ADMM form can be written as
\[
\min_{x,z} f(x) + g(z) \quad \text{s.t.}\quad x - z = 0 \tag{12}
\]
where $g$ is the indicator function of $C$. The augmented Lagrangian is $L_\rho(x, z, u) = f(x) + g(z) + \tfrac{\rho}{2}\|x - z + u\|_2^2$, so the scaled form of ADMM is
\[
x^{k+1} = \arg\min_x f(x) + \tfrac{\rho}{2}\|x - z^k + u^k\|_2^2 = \operatorname{prox}_{f,\rho}(z^k - u^k)
\]
\[
z^{k+1} = \Pi_C(x^{k+1} + u^k)
\]
\[
u^{k+1} = u^k + x^{k+1} - z^{k+1}
\]
Note that $f$ need not be smooth here.
Example: Convex Feasibility

Find a point in the intersection of two closed nonempty convex sets $C$ and $D$. In ADMM form, the problem can be written as
\[
\min_{x,z} f(x) + g(z) \quad \text{s.t.}\quad x - z = 0
\]
where $f$ is the indicator function of $C$ and $g$ is the indicator function of $D$. The scaled form of ADMM is
\[
x^{k+1} = \Pi_C(z^k - u^k)
\]
\[
z^{k+1} = \Pi_D(x^{k+1} + u^k)
\]
\[
u^{k+1} = u^k + x^{k+1} - z^{k+1}
\]
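A sketch of these feasibility iterations, taking the two projection maps as arguments (the function names and fixed iteration count are assumptions):

```python
import numpy as np

def admm_feasibility(proj_C, proj_D, x0, n_iter=200):
    """Find a point in C ∩ D via the scaled ADMM iterations on this slide."""
    x, z, u = x0.copy(), x0.copy(), np.zeros_like(x0)
    for _ in range(n_iter):
        x = proj_C(z - u)
        z = proj_D(x + u)
        u = u + x - z
    return x, z
```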
Example: Linear and Quadratic Programming

\[
\min_x \tfrac{1}{2}x^T P x + q^T x \quad \text{s.t.}\quad Ax = b,\; x \ge 0
\]
where $x \in \mathbb{R}^n$ and $P \in S^n_+$; when $P = 0$, this reduces to a linear program. The ADMM form is
\[
\min_{x,z} f(x) + g(z) \quad \text{s.t.}\quad x - z = 0
\]
where $f(x) = \tfrac{1}{2}x^T P x + q^T x$ with $\operatorname{dom} f = \{x : Ax = b\}$, and $g$ is the indicator function of the nonnegative orthant $\mathbb{R}^n_+$. The scaled form of ADMM has iterations
\[
x^{k+1} = \arg\min_{x : Ax = b} f(x) + \tfrac{\rho}{2}\|x - z^k + u^k\|_2^2
\]
\[
z^{k+1} = \Pi_{\mathbb{R}^n_+}(x^{k+1} + u^k) = (x^{k+1} + u^k)_+
\]
\[
u^{k+1} = u^k + x^{k+1} - z^{k+1}
\]
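A sketch of these QP iterations. The equality-constrained $x$-update is solved here through the KKT system of its quadratic subproblem and the KKT matrix is factored once; this particular solve strategy and the names are mine, not from the slide:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def admm_qp(P, q, A, b, rho=1.0, n_iter=100):
    """ADMM (scaled form) for min 1/2 x'Px + q'x  s.t. Ax = b, x >= 0."""
    n = P.shape[0]
    m = A.shape[0]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    # KKT matrix of the equality-constrained x-update, factored once and reused
    K = lu_factor(np.block([[P + rho * np.eye(n), A.T],
                            [A, np.zeros((m, m))]]))
    for _ in range(n_iter):
        rhs = np.concatenate([rho * (z - u) - q, b])
        x = lu_solve(K, rhs)[:n]          # x-update with Ax = b enforced
        z = np.maximum(x + u, 0.0)        # projection onto R^n_+
        u = u + x - z
    return x
```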
Example: $\ell_1$-Norm Problems

A lot of machine learning algorithms involve minimizing a loss function with a regularization term or side constraints. Both the loss function and the regularization term can be nonsmooth. We use $\ell_1$-norm problems to show how to apply ADMM to solve such problems:
\[
\min_x l(x) + \lambda\|x\|_1 \tag{13}
\]
where $l$ is any convex loss function. The ADMM form is
\[
\min_{x,z} l(x) + g(z) \quad \text{s.t.}\quad x - z = 0
\]
where $g(z) = \lambda\|z\|_1$. The ADMM iterations are
\[
x^{k+1} = \arg\min_x l(x) + \tfrac{\rho}{2}\|x - z^k + u^k\|_2^2 = \operatorname{prox}_{l,\rho}(z^k - u^k)
\]
\[
z^{k+1} = S_{\lambda/\rho}(x^{k+1} + u^k)
\]
\[
u^{k+1} = u^k + x^{k+1} - z^{k+1}
\]
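For the special case $l(x) = \tfrac{1}{2}\|Dx - b\|_2^2$ (the lasso), the $x$-update is a cached linear solve and the $z$-update is soft thresholding; a sketch, where the data matrix $D$, vector $b$, and all names are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def admm_lasso(D, b, lam, rho=1.0, n_iter=200):
    """ADMM for min 1/2||Dx - b||^2 + lam*||x||_1 (least-squares loss l)."""
    n = D.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    F = cho_factor(D.T @ D + rho * np.eye(n))   # factor once, reuse every iteration
    Dtb = D.T @ b
    for _ in range(n_iter):
        x = cho_solve(F, Dtb + rho * (z - u))                # x-update (prox of l)
        w = x + u
        z = np.sign(w) * np.maximum(np.abs(w) - lam / rho, 0)  # S_{lam/rho}
        u = u + x - z
    return z
```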
Example: Sparse Inverse Covariance Selection

Given a dataset of samples from a zero-mean Gaussian distribution in $\mathbb{R}^n$, $a_i \sim \mathcal{N}(0, \Sigma)$, $i = 1, \dots, N$: $(\Sigma^{-1})_{ij}$ is zero if and only if the $i$-th and $j$-th components of the random variable are conditionally independent given the other variables.
Let $X = \Sigma^{-1}$ and let the objective be the negative log-likelihood of the data (with $S$ the empirical covariance of the $a_i$) plus $\ell_1$ regularization on $X$:
\[
\min_X \operatorname{Tr}(SX) - \log\det X + \lambda\|X\|_1
\]
where $X \in S^n_+$, $\|\cdot\|_1$ is defined elementwise, and the domain of $\log\det X$ is $S^n_{++}$, the set of symmetric positive definite $n \times n$ matrices. The ADMM steps are
\[
X^{k+1} = \arg\min_{X \in S^n_{++}} \operatorname{Tr}(SX) - \log\det X + \tfrac{\rho}{2}\|X - Z^k + U^k\|_F^2
\]
\[
Z^{k+1} = \arg\min_Z \lambda\|Z\|_1 + \tfrac{\rho}{2}\|X^{k+1} - Z + U^k\|_F^2
\]
\[
U^{k+1} = U^k + X^{k+1} - Z^{k+1}
\]
Example: Sparse Inverse Covariance Selection (continued)

The $Z$-minimization step can be solved by elementwise soft thresholding:
\[
Z^{k+1}_{ij} = S_{\lambda/\rho}(X^{k+1}_{ij} + U^k_{ij})
\]
The $X$-minimization step also has an analytical solution, using the first-order optimality condition
\[
S - X^{-1} + \rho(X - Z^k + U^k) = 0.
\]
We will construct a matrix $X$ satisfying $\rho X - X^{-1} = \rho(Z^k - U^k) - S$ and $X \succ 0$. Let the right-hand side be factorized as $Q\Lambda Q^T$ ($\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$, $Q^T Q = Q Q^T = I$). Then $\rho\tilde{X} - \tilde{X}^{-1} = \Lambda$, where $\tilde{X} = Q^T X Q$. We can construct $\tilde{X}$ by finding diagonal entries $\tilde{X}_{ii}$ that satisfy $\rho\tilde{X}_{ii} - 1/\tilde{X}_{ii} = \lambda_i$, which leads to
\[
\tilde{X}_{ii} = \frac{\lambda_i + \sqrt{\lambda_i^2 + 4\rho}}{2\rho}
\]
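A sketch of this analytical $X$-update via the eigendecomposition construction above ($S$, $Z^k$, $U^k$, $\rho$ are the quantities on the slide; the function name is mine):

```python
import numpy as np

def x_update_cov_selection(S, Z, U, rho):
    """Analytical X-update: solve rho*X - X^{-1} = rho*(Z - U) - S with X > 0."""
    lam, Q = np.linalg.eigh(rho * (Z - U) - S)            # RHS = Q diag(lam) Q^T
    x_tilde = (lam + np.sqrt(lam**2 + 4 * rho)) / (2 * rho)
    return Q @ np.diag(x_tilde) @ Q.T                     # X = Q * X_tilde * Q^T
```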
What Have We Learned?

- If $f$, $g$ are smooth, the updates can always be solved using general iterative methods.
- If $f$, $g$ are simple enough that a closed-form proximal operator evaluation is feasible, the updates can be carried out with analytical formulas. More examples of proximal operator evaluations are available in the paper Proximal Algorithms by Parikh and Boyd.
- Techniques like early termination and warm start can be used to speed up convergence.
- We have focused on non-distributed versions of these problems; their distributed versions need to be studied for parallelization.