Introduction to Alternating Direction Method of Multipliers


Yale Chang, Machine Learning Group Meeting, September 29, 2016

Formulation

$$\min_{x,z}\; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c \qquad (1)$$

where $x \in \mathbb{R}^n$, $z \in \mathbb{R}^m$, $A \in \mathbb{R}^{p \times n}$, $B \in \mathbb{R}^{p \times m}$, $c \in \mathbb{R}^p$.

$f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ and $g: \mathbb{R}^m \to \mathbb{R} \cup \{+\infty\}$ are closed, proper, and convex; equivalently, the epigraphs of $f$ and $g$ are closed nonempty convex sets ($\operatorname{epi} f = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : f(x) \le t\}$).

$f$ and $g$ can be non-differentiable and can take the value $+\infty$ (e.g. the indicator function $I(x) = 0$ if $x \in C$ and $I(x) = +\infty$ otherwise).

We also assume that strong duality holds (Slater's condition).

ADMM Iterations: Unscaled Form

The augmented Lagrangian of problem (1) is
$$L_\rho(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2 \qquad (2)$$

The ADMM iterations are
$$x^{k+1} = \arg\min_x\; L_\rho(x, z^k, y^k) \qquad (3)$$
$$z^{k+1} = \arg\min_z\; L_\rho(x^{k+1}, z, y^k) \qquad (4)$$
$$y^{k+1} = y^k + \rho(Ax^{k+1} + Bz^{k+1} - c) \qquad (5)$$

The algorithm state is $(z^k, y^k)$.

ADMM Iterations: Scaled Form

Let $r = Ax + Bz - c$ and $u = y/\rho$. The augmented Lagrangian can be written as
$$L_\rho(x, z, y) = f(x) + g(z) + y^T r + \frac{\rho}{2}\|r\|_2^2$$
$$= f(x) + g(z) + \frac{\rho}{2}\|r + y/\rho\|_2^2 - \frac{1}{2\rho}\|y\|_2^2$$
$$= f(x) + g(z) + \frac{\rho}{2}\|r + u\|_2^2 - \frac{\rho}{2}\|u\|_2^2$$

The ADMM iterations become
$$x^{k+1} = \arg\min_x\; f(x) + \frac{\rho}{2}\|Ax + Bz^k - c + u^k\|_2^2 \qquad (6)$$
$$z^{k+1} = \arg\min_z\; g(z) + \frac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2 \qquad (7)$$
$$u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c \qquad (8)$$

In practice, we prefer the scaled form because it is shorter and easier to work with.
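To make the iteration pattern concrete, here is a minimal sketch of the scaled-form loop in Python (not from the slides); `x_update` and `z_update` are assumed to be user-supplied solvers for steps (6) and (7).

```python
import numpy as np

def admm_scaled(x_update, z_update, A, B, c, max_iter=500):
    """Generic scaled-form ADMM loop (a minimal sketch; the callables
    x_update and z_update are assumed to solve steps (6) and (7))."""
    p, n = A.shape
    m = B.shape[1]
    x, z, u = np.zeros(n), np.zeros(m), np.zeros(p)
    for k in range(max_iter):
        x = x_update(z, u)          # x^{k+1}: minimize f(x) + rho/2 ||Ax + Bz^k - c + u^k||^2
        z = z_update(x, u)          # z^{k+1}: minimize g(z) + rho/2 ||Ax^{k+1} + Bz - c + u^k||^2
        u = u + A @ x + B @ z - c   # scaled dual update (8)
    return x, z, u
```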

Stopping Criteria

The following inequality always holds:
$$f(x^k) + g(z^k) - p^\star \le -(y^k)^T r^k + (x^k - x^\star)^T s^k \qquad (9)$$

where $p^\star = \inf\{f(x) + g(z) \mid Ax + Bz = c\}$ and $x^\star$ is the value of $x$ at which $f(x) + g(z)$ achieves the optimal objective $p^\star$.

$r^k = Ax^k + Bz^k - c$ is called the primal residual, and $s^k = \rho A^T B(z^k - z^{k-1})$ is called the dual residual.

If $\|r^k\|_2 \le \epsilon^{\text{pri}}$ and $\|s^k\|_2 \le \epsilon^{\text{dual}}$ both hold, the iterations can stop.

There are suggestions on how to set $\epsilon^{\text{pri}}$ and $\epsilon^{\text{dual}}$ in practice; refer to Boyd's tutorial for details.

We skip the proof of convergence for now.
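A sketch of the residual check is below (not from the slides); the absolute/relative tolerance rule follows the suggestion in Boyd's tutorial, and the function name is illustrative.

```python
import numpy as np

def converged(A, B, c, x, z, z_prev, u, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Check the primal/dual residual stopping criterion with
    absolute/relative tolerances in the spirit of Boyd's tutorial."""
    r = A @ x + B @ z - c                 # primal residual r^k
    s = rho * A.T @ B @ (z - z_prev)      # dual residual s^k
    p, n = A.shape
    eps_pri = np.sqrt(p) * eps_abs + eps_rel * max(
        np.linalg.norm(A @ x), np.linalg.norm(B @ z), np.linalg.norm(c))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(rho * A.T @ u)
    return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual
```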

General Patterns: Proximal Operator

How do we solve each individual step in the iterations? The x-update step can be expressed as
$$x^+ = \arg\min_x\; f(x) + \frac{\rho}{2}\|Ax - v\|_2^2 \qquad (10)$$
where $v = -Bz + c - u$ is a known constant in this step.

Case 1: Proximal operator. Assume $A = I$, which appears frequently in examples. Then
$$x^+ = \arg\min_x\; f(x) + \frac{\rho}{2}\|x - v\|_2^2 = \mathbf{prox}_{f,\rho}(v)$$
where $\mathbf{prox}_{f,\rho}$ is called the proximal operator of $f$ with penalty $\rho$. When the function $f$ is simple enough, the proximal operator can be evaluated analytically.

Proximal Operator: Examples

Indicator function: $f(x) = 0$ if $x \in C$ and $f(x) = +\infty$ otherwise, where $C$ is a closed nonempty convex set. Then the x-update is
$$x^+ = \mathbf{prox}_{f,\rho}(v) = \Pi_C(v)$$
where $\Pi_C$ denotes the projection onto $C$ (in the Euclidean norm).

1) Hyper-rectangle $C = \{x \mid l \le x \le u\}$ (recall the unit ball of $\ell_\infty$):
$$(\Pi_C(v))_k = \begin{cases} l_k, & v_k \le l_k \\ v_k, & l_k \le v_k \le u_k \\ u_k, & v_k \ge u_k \end{cases}$$

2) Affine set $C = \{x \in \mathbb{R}^n \mid Ax = b\}$ with $A \in \mathbb{R}^{m \times n}$:
$$\Pi_C(v) = v - A^\dagger(Av - b)$$
where $A^\dagger$ is the pseudoinverse of $A$.

3) Positive semidefinite cone $C = S_+^n$:
$$\Pi_C(V) = \sum_{i=1}^n (\lambda_i)_+ u_i u_i^T, \quad \text{where } V = \sum_{i=1}^n \lambda_i u_i u_i^T$$

More examples can be found in Proximal Algorithms by Parikh and Boyd.
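Minimal NumPy sketches of these three projections (function names are illustrative, not from the slides):

```python
import numpy as np

def project_box(v, l, u):
    """Projection onto the hyper-rectangle {x : l <= x <= u} (componentwise clipping)."""
    return np.clip(v, l, u)

def project_affine(v, A, b):
    """Projection onto the affine set {x : Ax = b} via the pseudoinverse."""
    return v - np.linalg.pinv(A) @ (A @ v - b)

def project_psd(V):
    """Projection onto the PSD cone: keep only the nonnegative eigenvalues."""
    lam, Q = np.linalg.eigh((V + V.T) / 2)    # symmetrize for numerical safety
    return (Q * np.maximum(lam, 0)) @ Q.T     # Q diag((lambda)_+) Q^T
```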

General Patterns: Quadratic Objective

Suppose $f(x) = \frac{1}{2}x^T P x + q^T x + r$ with $P \in S_+^n$. The x-update becomes
$$x^+ = \arg\min_x\; \frac{1}{2}x^T P x + q^T x + r + \frac{\rho}{2}\|Ax - v\|_2^2$$
$$= \arg\min_x\; \frac{1}{2}x^T(P + \rho A^T A)x - (\rho A^T v - q)^T x$$
$$= (P + \rho A^T A)^{-1}(\rho A^T v - q)$$

Here we assume $F = P + \rho A^T A$ is invertible. Computing $x^+$ requires solving the linear system $Fx = g$ with $g = \rho A^T v - q$.

Exploiting sparsity: when $F$ is sparse, the computational cost can be reduced substantially.

Caching: when multiple linear systems with the same coefficient matrix $F$ need to be solved, the factorization of $F$ can be computed once and saved for later use.
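A minimal sketch of the caching idea, assuming $F = P + \rho A^T A$ is positive definite so a Cholesky factorization applies (names are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_x_update(P, A, rho, q):
    """Factor F = P + rho * A^T A once, then reuse it for every x-update."""
    F = P + rho * A.T @ A
    chol = cho_factor(F)              # cached Cholesky factorization of F
    def x_update(v):
        g = rho * A.T @ v - q
        return cho_solve(chol, g)     # cheap back-substitution per iteration
    return x_update
```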

General Patterns: Smooth Objective Terms

When $f$ is smooth, general iterative methods can be used to carry out the x-minimization step; examples include gradient descent, the conjugate gradient method, and L-BFGS. The presence of the quadratic penalty term tends to improve the convergence of these iterative algorithms.

There are a few techniques to speed up the convergence of ADMM iterations:

Early termination: terminating the x- or z-updates early can result in more ADMM iterations but a lower cost per iteration, giving an overall improvement in efficiency.

Warm start: initialize the iterative method using the solution obtained from the previous iteration.
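A sketch of a warm-started x-update using L-BFGS via SciPy; the callables `f_val`, `f_grad` and the function name are assumptions, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize

def smooth_x_update(f_val, f_grad, A, v, rho, x_prev):
    """Solve the x-update iteratively, warm-started at the previous x."""
    obj = lambda x: f_val(x) + 0.5 * rho * np.sum((A @ x - v) ** 2)
    grad = lambda x: f_grad(x) + rho * A.T @ (A @ x - v)
    res = minimize(obj, x_prev, jac=grad, method="L-BFGS-B")   # warm start at x_prev
    return res.x
```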

General Patterns: Decomposition

Block separability: suppose $x \in \mathbb{R}^n$ can be partitioned into $N$ subvectors $x = (x_1, \dots, x_N)$ and that $f$ is separable with respect to this partition, i.e. $f(x) = f_1(x_1) + \dots + f_N(x_N)$, where $x_i \in \mathbb{R}^{n_i}$ and $\sum_{i=1}^N n_i = n$. If the quadratic term $\|Ax\|_2^2$ is also separable with respect to the partition, then the augmented Lagrangian $L_\rho$ is separable, and the x-update can be carried out in parallel.

One example is $f(x) = \lambda\|x\|_1$ ($\lambda > 0$) with $A = I$. In this case the $x_i$-update is
$$x_i^+ = \arg\min_{x_i}\; \lambda|x_i| + \frac{\rho}{2}(x_i - v_i)^2.$$
Using subdifferential calculus, the solution is $x_i^+ = S_{\lambda/\rho}(v_i)$, where the soft-thresholding operator $S_\kappa$ is defined as
$$S_\kappa(a) = \begin{cases} a - \kappa, & a > \kappa \\ 0, & |a| \le \kappa \\ a + \kappa, & a < -\kappa \end{cases}$$
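The soft-thresholding operator in NumPy (a minimal sketch; the helper name is illustrative):

```python
import numpy as np

def soft_threshold(a, kappa):
    """Elementwise soft-thresholding S_kappa, the prox of kappa * ||.||_1."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

# e.g. the parallel x-update for f(x) = lambda * ||x||_1 with A = I:
# x_plus = soft_threshold(v, lam / rho)
```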

Apply ADMM to Constrained Convex Optimization

The generic constrained convex optimization problem is
$$\min_x\; f(x) \quad \text{s.t.} \quad x \in C \qquad (11)$$
where $x \in \mathbb{R}^n$ and $f$ and $C$ are convex. Its ADMM form can be written as
$$\min_{x,z}\; f(x) + g(z) \quad \text{s.t.} \quad x - z = 0 \qquad (12)$$
where $g$ is the indicator function of $C$. The augmented Lagrangian is $L_\rho(x, z, u) = f(x) + g(z) + \frac{\rho}{2}\|x - z + u\|_2^2$, so the scaled form of ADMM is
$$x^{k+1} = \arg\min_x\; f(x) + \frac{\rho}{2}\|x - z^k + u^k\|_2^2 = \mathbf{prox}_{f,\rho}(z^k - u^k)$$
$$z^{k+1} = \Pi_C(x^{k+1} + u^k)$$
$$u^{k+1} = u^k + x^{k+1} - z^{k+1}$$
Note that $f$ need not be smooth here.

Example: Convex Feasibility

Find a point in the intersection of two closed nonempty convex sets $C$ and $D$. In ADMM form, the problem can be written as
$$\min_{x,z}\; f(x) + g(z) \quad \text{s.t.} \quad x - z = 0$$
where $f$ is the indicator function of $C$ and $g$ is the indicator function of $D$. The scaled form of ADMM is
$$x^{k+1} = \Pi_C(z^k - u^k)$$
$$z^{k+1} = \Pi_D(x^{k+1} + u^k)$$
$$u^{k+1} = u^k + x^{k+1} - z^{k+1}$$
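As a toy instance (not on the slide), take $C$ to be the box $[0,1]^n$ and $D$ the hyperplane $\{x : a^T x = b\}$, both of which have analytic projections:

```python
import numpy as np

def find_intersection(a, b, n, iters=200):
    """ADMM for convex feasibility between the box [0,1]^n and {x : a^T x = b}."""
    proj_C = lambda v: np.clip(v, 0.0, 1.0)               # projection onto the box
    proj_D = lambda v: v - (a @ v - b) / (a @ a) * a      # projection onto the hyperplane
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    for _ in range(iters):
        x = proj_C(z - u)
        z = proj_D(x + u)
        u = u + x - z
    return x

x = find_intersection(a=np.ones(5), b=2.5, n=5)   # entries in [0,1] summing (approximately) to 2.5
```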

Example: Linear and Quadratic Programming

$$\min_x\; \frac{1}{2}x^T P x + q^T x \quad \text{s.t.} \quad Ax = b, \; x \ge 0$$
where $x \in \mathbb{R}^n$ and $P \in S_+^n$; when $P = 0$, this reduces to linear programming. The ADMM form is
$$\min_{x,z}\; f(x) + g(z) \quad \text{s.t.} \quad x - z = 0$$
where $f(x) = \frac{1}{2}x^T P x + q^T x$ with $\operatorname{dom} f = \{x \mid Ax = b\}$, and $g$ is the indicator function of the nonnegative orthant $\mathbb{R}_+^n$. The scaled form of ADMM has iterations
$$x^{k+1} = \arg\min_{x: Ax = b}\; f(x) + \frac{\rho}{2}\|x - z^k + u^k\|_2^2$$
$$z^{k+1} = \Pi_{\mathbb{R}_+^n}(x^{k+1} + u^k) = (x^{k+1} + u^k)_+$$
$$u^{k+1} = u^k + x^{k+1} - z^{k+1}$$
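The x-update here is an equality-constrained QP, which can be solved through its KKT system; below is a minimal sketch, assuming the KKT matrix is nonsingular (names are illustrative).

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def qp_admm(P, q, A, b, rho=1.0, iters=300):
    """ADMM for min 1/2 x^T P x + q^T x  s.t.  Ax = b, x >= 0 (sketch)."""
    n, m = P.shape[0], A.shape[0]
    # KKT matrix of the equality-constrained x-update, factored once and reused
    KKT = np.block([[P + rho * np.eye(n), A.T],
                    [A, np.zeros((m, m))]])
    lu, piv = lu_factor(KKT)
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    for _ in range(iters):
        rhs = np.concatenate([rho * (z - u) - q, b])
        x = lu_solve((lu, piv), rhs)[:n]   # x-update (drop the multiplier block)
        z = np.maximum(x + u, 0.0)         # z-update: projection onto R^n_+
        u = u + x - z                      # scaled dual update
    return z                               # z satisfies the nonnegativity constraint
```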

Example: $\ell_1$-Norm Problems

Many machine learning algorithms involve minimizing a loss function with a regularization term or side constraints, and both the loss function and the regularization term can be nonsmooth. We use $\ell_1$-norm problems to show how to apply ADMM to such problems:
$$\min_x\; l(x) + \lambda\|x\|_1 \qquad (13)$$
where $l$ is any convex loss function. The ADMM form is
$$\min_{x,z}\; l(x) + g(z) \quad \text{s.t.} \quad x - z = 0$$
where $g(z) = \lambda\|z\|_1$. The ADMM iterations are
$$x^{k+1} = \arg\min_x\; l(x) + \frac{\rho}{2}\|x - z^k + u^k\|_2^2 = \mathbf{prox}_{l,\rho}(z^k - u^k)$$
$$z^{k+1} = S_{\lambda/\rho}(x^{k+1} + u^k)$$
$$u^{k+1} = u^k + x^{k+1} - z^{k+1}$$
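As a concrete instance of (13) that is not on the slide, take $l(x) = \frac{1}{2}\|Dx - b\|_2^2$ (the lasso), whose prox reduces to a linear solve; `D`, `b`, and `lam` are illustrative inputs.

```python
import numpy as np

def lasso_admm(D, b, lam, rho=1.0, iters=500):
    """ADMM for min 0.5*||D x - b||^2 + lam*||x||_1 (sketch of problem (13))."""
    n = D.shape[1]
    F = D.T @ D + rho * np.eye(n)      # x-update matrix; could be factored once and cached
    Dtb = D.T @ b
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    soft = lambda a, k: np.sign(a) * np.maximum(np.abs(a) - k, 0.0)
    for _ in range(iters):
        x = np.linalg.solve(F, Dtb + rho * (z - u))   # prox of the quadratic loss
        z = soft(x + u, lam / rho)                    # soft-thresholding z-update
        u = u + x - z
    return z
```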

Example: Sparse Inverse Covariance Selection

Given a dataset of samples from a zero-mean Gaussian distribution in $\mathbb{R}^n$, $a_i \sim \mathcal{N}(0, \Sigma)$, $i = 1, \dots, N$, the entry $(\Sigma^{-1})_{ij}$ is zero if and only if the $i$-th and $j$-th components of the random variable are conditionally independent given the other variables.

Let $X = \Sigma^{-1}$ and take as objective the negative log-likelihood of the data with $\ell_1$ regularization on $X$:
$$\min_{X \in S_+^n}\; \operatorname{Tr}(SX) - \log\det X + \lambda\|X\|_1$$
where $S$ is the empirical covariance of the samples, $\|\cdot\|_1$ is defined elementwise, and the domain of $\log\det X$ is $S_{++}^n$, the set of symmetric positive definite $n \times n$ matrices. The ADMM steps are
$$X^{k+1} = \arg\min_{X \in S_{++}^n}\; \operatorname{Tr}(SX) - \log\det X + \frac{\rho}{2}\|X - Z^k + U^k\|_F^2$$
$$Z^{k+1} = \arg\min_Z\; \lambda\|Z\|_1 + \frac{\rho}{2}\|X^{k+1} - Z + U^k\|_F^2$$
$$U^{k+1} = U^k + X^{k+1} - Z^{k+1}$$

Example: Sparse Inverse Covariance Selection (continued)

The $Z$-minimization step can be solved by elementwise soft thresholding:
$$Z_{ij}^{k+1} = S_{\lambda/\rho}(X_{ij}^{k+1} + U_{ij}^k)$$

The $X$-minimization step also has an analytical solution. The first-order optimality condition is
$$S - X^{-1} + \rho(X - Z^k + U^k) = 0.$$
We construct a matrix $X$ satisfying $\rho X - X^{-1} = \rho(Z^k - U^k) - S$ and $X \succ 0$. Let the right-hand side be eigendecomposed as $Q\Lambda Q^T$ with $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ and $Q^T Q = QQ^T = I$; then $\rho\tilde{X} - \tilde{X}^{-1} = \Lambda$, where $\tilde{X} = Q^T X Q$. We can construct a diagonal $\tilde{X}$ by finding entries $\tilde{X}_{ii}$ that satisfy $\rho\tilde{X}_{ii} - 1/\tilde{X}_{ii} = \lambda_i$, which gives
$$\tilde{X}_{ii} = \frac{\lambda_i + \sqrt{\lambda_i^2 + 4\rho}}{2\rho}$$
The update is then $X^{k+1} = Q\tilde{X}Q^T$.
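A minimal sketch of these two updates ($S$ is the empirical covariance; `lam`, `rho`, and the function name are illustrative):

```python
import numpy as np

def covsel_admm(S, lam, rho=1.0, iters=200):
    """ADMM for sparse inverse covariance selection (sketch of the updates above)."""
    n = S.shape[0]
    X = np.eye(n); Z = np.eye(n); U = np.zeros((n, n))
    soft = lambda a, k: np.sign(a) * np.maximum(np.abs(a) - k, 0.0)
    for _ in range(iters):
        # X-update: eigendecompose rho*(Z - U) - S and rescale the eigenvalues
        lam_i, Q = np.linalg.eigh(rho * (Z - U) - S)
        x_tilde = (lam_i + np.sqrt(lam_i ** 2 + 4 * rho)) / (2 * rho)
        X = (Q * x_tilde) @ Q.T            # Q diag(x_tilde) Q^T
        # Z-update: elementwise soft thresholding
        Z = soft(X + U, lam / rho)
        U = U + X - Z
    return Z
```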

What Have We Learned?

If $f$ and $g$ are smooth, the updates can always be solved using general iterative methods.

If $f$ and $g$ are simple enough that a closed-form proximal operator evaluation is feasible, the updates can be carried out using analytical formulas. More examples of proximal operator evaluations are available in the paper Proximal Algorithms by Parikh and Boyd.

Techniques like early termination and warm starting can be used to speed up convergence.

We have focused on non-distributed versions of these problems; their distributed versions need to be studied for parallelization.