Alternating Direction Method of Multipliers Ryan Tibshirani Convex Optimization 10-725

Last time: dual ascent

Consider the problem
$$\min_x f(x) \quad \text{subject to } Ax = b$$
where $f$ is strictly convex and closed. Denote the Lagrangian
$$L(x, u) = f(x) + u^T(Ax - b)$$
Dual gradient ascent repeats, for $k = 1, 2, 3, \ldots$:
$$x^{(k)} = \operatorname{argmin}_x \; L(x, u^{(k-1)})$$
$$u^{(k)} = u^{(k-1)} + t_k (Ax^{(k)} - b)$$
Good: the $x$ update decomposes when $f$ does. Bad: requires stringent assumptions (strong convexity of $f$) to ensure convergence.
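To make the updates concrete, here is a minimal numpy sketch of dual gradient ascent for a strongly convex quadratic $f(x) = \frac{1}{2}x^TQx + c^Tx$ (an illustrative choice, not from the slides), so that the $x$-minimization is a linear solve; the problem data and step size rule below are made up for illustration.

```python
import numpy as np

# Dual gradient ascent sketch for  min_x (1/2) x^T Q x + c^T x  subject to  Ax = b
# (quadratic f is an illustrative assumption so argmin_x L(x, u) is a linear solve)
rng = np.random.default_rng(0)
n, m = 10, 4
Q = np.diag(rng.uniform(1.0, 5.0, n))          # strictly convex f
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

L_dual = np.linalg.norm(A @ np.linalg.solve(Q, A.T), 2)  # Lipschitz constant of dual gradient
t = 1.0 / L_dual                                          # safe fixed step size t_k = t
u = np.zeros(m)
for k in range(1000):
    x = np.linalg.solve(Q, -(c + A.T @ u))     # x^(k) = argmin_x L(x, u^(k-1))
    u = u + t * (A @ x - b)                    # u^(k) = u^(k-1) + t_k (A x^(k) - b)

print("primal residual:", np.linalg.norm(A @ x - b))
```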

Augmented Lagrangian method (also called method of multipliers) considers the modified problem, for a parameter $\rho > 0$,
$$\min_x f(x) + \frac{\rho}{2}\|Ax - b\|_2^2 \quad \text{subject to } Ax = b$$
uses the modified Lagrangian
$$L_\rho(x, u) = f(x) + u^T(Ax - b) + \frac{\rho}{2}\|Ax - b\|_2^2$$
and repeats, for $k = 1, 2, 3, \ldots$:
$$x^{(k)} = \operatorname{argmin}_x \; L_\rho(x, u^{(k-1)})$$
$$u^{(k)} = u^{(k-1)} + \rho(Ax^{(k)} - b)$$
Good: better convergence properties. Bad: we lose decomposability.
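For comparison, a sketch of the method of multipliers on the same illustrative quadratic setup as above: the only changes are the $\frac{\rho}{2}\|Ax - b\|_2^2$ term inside the $x$-update and the fixed dual step size $\rho$.

```python
import numpy as np

# Method of multipliers on the same illustrative quadratic setup as above:
# x^(k) minimizes L_rho(x, u^(k-1)), i.e. solves (Q + rho A^T A) x = -(c + A^T u) + rho A^T b
rng = np.random.default_rng(0)
n, m, rho = 10, 4, 1.0
Q = np.diag(rng.uniform(1.0, 5.0, n))
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

u = np.zeros(m)
for k in range(50):
    x = np.linalg.solve(Q + rho * A.T @ A, -(c + A.T @ u) + rho * A.T @ b)
    u = u + rho * (A @ x - b)                  # dual step size is exactly rho

print("primal residual:", np.linalg.norm(A @ x - b))
```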

Alternating direction method of multipliers

Alternating direction method of multipliers, or ADMM, combines the best of both methods. Consider a problem of the form:
$$\min_{x,z} f(x) + g(z) \quad \text{subject to } Ax + Bz = c$$
We define the augmented Lagrangian, for a parameter $\rho > 0$,
$$L_\rho(x, z, u) = f(x) + g(z) + u^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2$$
We repeat, for $k = 1, 2, 3, \ldots$:
$$x^{(k)} = \operatorname{argmin}_x \; L_\rho(x, z^{(k-1)}, u^{(k-1)})$$
$$z^{(k)} = \operatorname{argmin}_z \; L_\rho(x^{(k)}, z, u^{(k-1)})$$
$$u^{(k)} = u^{(k-1)} + \rho(Ax^{(k)} + Bz^{(k)} - c)$$

Convergence guarantees

Under modest assumptions on $f, g$ (these do not require $A, B$ to be full rank), the ADMM iterates satisfy, for any $\rho > 0$:
- Residual convergence: $r^{(k)} = Ax^{(k)} + Bz^{(k)} - c \to 0$ as $k \to \infty$, i.e., the primal iterates approach feasibility
- Objective convergence: $f(x^{(k)}) + g(z^{(k)}) \to f^\star + g^\star$, where $f^\star + g^\star$ is the optimal primal objective value
- Dual convergence: $u^{(k)} \to u^\star$, where $u^\star$ is a dual solution
For details, see Boyd et al. (2010). Note that we do not generically get primal convergence, but this is true under more assumptions.
Convergence rate: roughly, ADMM behaves like a first-order method. The theory is still being developed; see, e.g., Hong and Luo (2012), Deng and Yin (2012), Iutzeler et al. (2014), Nishihara et al. (2015).

Scaled form ADMM

Scaled form: denote $w = u/\rho$, so the augmented Lagrangian becomes
$$L_\rho(x, z, w) = f(x) + g(z) + \frac{\rho}{2}\|Ax + Bz - c + w\|_2^2 - \frac{\rho}{2}\|w\|_2^2$$
and the ADMM updates become
$$x^{(k)} = \operatorname{argmin}_x \; f(x) + \frac{\rho}{2}\|Ax + Bz^{(k-1)} - c + w^{(k-1)}\|_2^2$$
$$z^{(k)} = \operatorname{argmin}_z \; g(z) + \frac{\rho}{2}\|Ax^{(k)} + Bz - c + w^{(k-1)}\|_2^2$$
$$w^{(k)} = w^{(k-1)} + Ax^{(k)} + Bz^{(k)} - c$$
Note that here the $k$th iterate $w^{(k)}$ is just a running sum of residuals:
$$w^{(k)} = w^{(0)} + \sum_{i=1}^k \left( Ax^{(i)} + Bz^{(i)} - c \right)$$

Outline

Today:
- Examples, practicalities
- Consensus ADMM
- Special decompositions

Connection to proximal operators

Consider
$$\min_x f(x) + g(x) \iff \min_{x,z} f(x) + g(z) \quad \text{subject to } x = z$$
ADMM steps (equivalent to Douglas-Rachford, here):
$$x^{(k)} = \operatorname{prox}_{f, 1/\rho}(z^{(k-1)} - w^{(k-1)})$$
$$z^{(k)} = \operatorname{prox}_{g, 1/\rho}(x^{(k)} + w^{(k-1)})$$
$$w^{(k)} = w^{(k-1)} + x^{(k)} - z^{(k)}$$
where $\operatorname{prox}_{f, 1/\rho}$ is the proximal operator for $f$ at parameter $1/\rho$, and similarly for $\operatorname{prox}_{g, 1/\rho}$.
In general, the update for a block of variables reduces to a prox update whenever the corresponding linear transformation is the identity.
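A small sketch of these prox-form updates, with an illustrative choice (an assumption, not from the slides) of $f(x) = \frac{1}{2}\|x - a\|_2^2$ and $g(x) = \lambda\|x\|_1$, for which both proximal operators are available in closed form and the true minimizer is the soft-thresholded vector $S_\lambda(a)$:

```python
import numpy as np

# ADMM on min_x f(x) + g(x) via the split x = z, written with proximal operators.
# Illustrative choice: f(x) = 0.5*||x - a||^2, g(x) = lam*||x||_1,
# so the exact minimizer is the soft-thresholded vector S_lam(a).
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

a = np.array([3.0, -0.5, 0.2, -2.0, 1.5])
lam, rho = 1.0, 1.0

prox_f = lambda v: (a + rho * v) / (1.0 + rho)   # prox_{f,1/rho}(v)
prox_g = lambda v: soft_threshold(v, lam / rho)  # prox_{g,1/rho}(v)

x = z = w = np.zeros_like(a)
for k in range(100):
    x = prox_f(z - w)
    z = prox_g(x + w)
    w = w + x - z

print(z)                       # ADMM iterate
print(soft_threshold(a, lam))  # exact solution, for comparison
```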

Example: lasso regression

Given $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, recall the lasso problem:
$$\min_\beta \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
We can rewrite this as:
$$\min_{\beta, \alpha} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\alpha\|_1 \quad \text{subject to } \beta - \alpha = 0$$
ADMM steps:
$$\beta^{(k)} = (X^TX + \rho I)^{-1}\left( X^Ty + \rho(\alpha^{(k-1)} - w^{(k-1)}) \right)$$
$$\alpha^{(k)} = S_{\lambda/\rho}(\beta^{(k)} + w^{(k-1)})$$
$$w^{(k)} = w^{(k-1)} + \beta^{(k)} - \alpha^{(k)}$$

Notes:
- The matrix $X^TX + \rho I$ is always invertible, regardless of $X$
- If we compute a factorization (say Cholesky) in $O(p^3)$ flops, then each $\beta$ update takes $O(p^2)$ flops
- The $\alpha$ update applies the soft-thresholding operator $S_t$, which recall is defined as
$$[S_t(x)]_j = \begin{cases} x_j - t & x_j > t \\ 0 & -t \le x_j \le t \\ x_j + t & x_j < -t \end{cases} \qquad j = 1, \ldots, p$$
- ADMM steps are almost like repeated soft-thresholding of ridge regression coefficients
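The lasso ADMM steps translate directly into a few lines of numpy; a sketch on simulated data (the data, $\lambda$, $\rho$, and iteration count are illustrative), caching the Cholesky factorization of $X^TX + \rho I$ so each $\beta$ update is just a pair of triangular solves:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# ADMM for the lasso, following the beta/alpha/w updates above.
# The Cholesky factor of X^T X + rho I is computed once (O(p^3)) and reused,
# so each beta update costs O(p^2). Data here are simulated for illustration.
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
n, p, lam, rho = 200, 50, 1.0, 10.0
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[:5] = 2.0
y = X @ beta_true + 0.5 * rng.standard_normal(n)

chol = cho_factor(X.T @ X + rho * np.eye(p))   # factor once
Xty = X.T @ y
alpha = w = np.zeros(p)
for k in range(200):
    beta = cho_solve(chol, Xty + rho * (alpha - w))
    alpha = soft_threshold(beta + w, lam / rho)
    w = w + beta - alpha

print("nonzeros in alpha:", np.count_nonzero(alpha))
```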

Comparison of various algorithms for lasso regression: 100 random instances with $n = 200$, $p = 50$. [Figure: suboptimality $f^{(k)} - f^\star$ versus iteration $k$, comparing coordinate descent, proximal gradient, accelerated proximal gradient, and ADMM with $\rho = 50, 100, 200$]

Practicalities

In practice, ADMM usually obtains a relatively accurate solution in a handful of iterations, but it requires a large number of iterations for a highly accurate solution (like a first-order method).
The choice of $\rho$ can greatly influence practical convergence of ADMM:
- $\rho$ too large: not enough emphasis on minimizing $f + g$
- $\rho$ too small: not enough emphasis on feasibility
Boyd et al. (2010) give a strategy for varying $\rho$; it can work well in practice, but does not have convergence guarantees.
Like deriving duals, transforming a problem into one that ADMM can handle is sometimes a bit subtle, since different forms can lead to different algorithms.

Example: group lasso regression

Given $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, recall the group lasso problem:
$$\min_\beta \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{g=1}^G c_g \|\beta_g\|_2$$
Rewrite as:
$$\min_{\beta, \alpha} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{g=1}^G c_g \|\alpha_g\|_2 \quad \text{subject to } \beta - \alpha = 0$$
ADMM steps:
$$\beta^{(k)} = (X^TX + \rho I)^{-1}\left( X^Ty + \rho(\alpha^{(k-1)} - w^{(k-1)}) \right)$$
$$\alpha_g^{(k)} = R_{c_g\lambda/\rho}\left( \beta_g^{(k)} + w_g^{(k-1)} \right), \quad g = 1, \ldots, G$$
$$w^{(k)} = w^{(k-1)} + \beta^{(k)} - \alpha^{(k)}$$

Notes:
- The matrix $X^TX + \rho I$ is always invertible, regardless of $X$
- If we compute a factorization (say Cholesky) in $O(p^3)$ flops, then each $\beta$ update takes $O(p^2)$ flops
- The $\alpha$ update applies the group soft-thresholding operator $R_t$, which recall is defined as
$$R_t(x) = \left(1 - \frac{t}{\|x\|_2}\right)_+ x$$
- Similar ADMM steps follow for a sum of arbitrary norms as regularizer, provided we know the prox operator of each norm
- The ADMM algorithm can be rederived when groups have overlap (a hard problem to optimize in general!). See Boyd et al. (2010)
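A sketch of the group soft-thresholding operator $R_t$ and the resulting $\alpha$-update, assuming non-overlapping groups given as index arrays (the groups, weights, and numbers here are illustrative); the $\beta$ and $w$ updates are exactly as in the lasso example:

```python
import numpy as np

# Group soft-thresholding R_t and the group lasso alpha-update above.
# Groups are taken as contiguous index blocks for illustration.
def group_soft_threshold(x, t):
    nrm = np.linalg.norm(x)
    return np.zeros_like(x) if nrm == 0 else max(1.0 - t / nrm, 0.0) * x

def alpha_update(beta, w, groups, weights, lam, rho):
    alpha = np.empty_like(beta)
    for g, c_g in zip(groups, weights):          # g is an index array for one group
        alpha[g] = group_soft_threshold(beta[g] + w[g], c_g * lam / rho)
    return alpha

# Example: p = 6 coefficients in G = 2 groups of size 3, with weights c_g = sqrt(|g|)
groups = [np.arange(0, 3), np.arange(3, 6)]
weights = [np.sqrt(len(g)) for g in groups]
beta = np.array([0.1, -0.2, 0.05, 3.0, -2.5, 1.0])
w = np.zeros(6)
print(alpha_update(beta, w, groups, weights, lam=1.0, rho=1.0))
```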

Example: sparse subspace estimation

Given $S \in \mathbb{S}^p$ (typically $S \succeq 0$ is a covariance matrix), consider the sparse subspace estimation problem (Vu et al., 2013):
$$\max_Y \operatorname{tr}(SY) - \lambda \|Y\|_1 \quad \text{subject to } Y \in \mathcal{F}_k$$
where $\mathcal{F}_k$ is the Fantope of order $k$, namely
$$\mathcal{F}_k = \{Y \in \mathbb{S}^p : 0 \preceq Y \preceq I, \ \operatorname{tr}(Y) = k\}$$
Note that when $\lambda = 0$, the above problem is equivalent to ordinary principal component analysis (PCA).
The above is an SDP and in principle solvable with interior point methods, though these can be complicated to implement and quite slow for large problem sizes.

Rewrite as:
$$\min_{Y,Z} -\operatorname{tr}(SY) + I_{\mathcal{F}_k}(Y) + \lambda \|Z\|_1 \quad \text{subject to } Y = Z$$
ADMM steps are:
$$Y^{(k)} = P_{\mathcal{F}_k}(Z^{(k-1)} - W^{(k-1)} + S/\rho)$$
$$Z^{(k)} = S_{\lambda/\rho}(Y^{(k)} + W^{(k-1)})$$
$$W^{(k)} = W^{(k-1)} + Y^{(k)} - Z^{(k)}$$
Here $P_{\mathcal{F}_k}$ is the Fantope projection operator, computed by clipping the eigendecomposition $A = U\Sigma U^T$, $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_p)$:
$$P_{\mathcal{F}_k}(A) = U\Sigma_\theta U^T, \quad \Sigma_\theta = \operatorname{diag}(\sigma_1(\theta), \ldots, \sigma_p(\theta))$$
where each $\sigma_i(\theta) = \min\{\max\{\sigma_i - \theta, 0\}, 1\}$, and $\theta$ is chosen so that $\sum_{i=1}^p \sigma_i(\theta) = k$.
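A sketch of the Fantope projection $P_{\mathcal{F}_k}$: eigendecompose, then shift and clip the eigenvalues into $[0, 1]$, choosing the threshold $\theta$ so that the clipped values sum to $k$. The slides only specify the clipping rule; finding $\theta$ by bisection is one simple choice made here.

```python
import numpy as np

# Fantope projection P_{F_k}(A): eigendecompose A, then shift-and-clip the
# eigenvalues so they lie in [0, 1] and sum to k; theta is found by bisection.
def fantope_projection(A, k, tol=1e-10):
    sigma, U = np.linalg.eigh((A + A.T) / 2.0)
    clip = lambda theta: np.clip(sigma - theta, 0.0, 1.0)
    lo, hi = sigma.min() - 1.0, sigma.max()      # clip(lo).sum() >= k >= clip(hi).sum()
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if clip(mid).sum() > k:                  # clipped sum decreases in theta
            lo = mid
        else:
            hi = mid
    return U @ np.diag(clip((lo + hi) / 2.0)) @ U.T

# Sanity check: the projection has trace k and eigenvalues in [0, 1]
A = np.diag([3.0, 2.0, 0.5, -1.0])
Y = fantope_projection(A, k=2)
print(np.trace(Y), np.linalg.eigvalsh(Y))
```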

Example: sparse + low rank decomposition

Given $M \in \mathbb{R}^{n \times m}$, consider the sparse plus low rank decomposition problem (Candes et al., 2009):
$$\min_{L,S} \|L\|_{\mathrm{tr}} + \lambda \|S\|_1 \quad \text{subject to } L + S = M$$
ADMM steps:
$$L^{(k)} = S^{\mathrm{tr}}_{1/\rho}(M - S^{(k-1)} + W^{(k-1)})$$
$$S^{(k)} = S^{\ell_1}_{\lambda/\rho}(M - L^{(k)} + W^{(k-1)})$$
$$W^{(k)} = W^{(k-1)} + M - L^{(k)} - S^{(k)}$$
where, to distinguish them, we use $S^{\mathrm{tr}}_{\lambda/\rho}$ for matrix soft-thresholding and $S^{\ell_1}_{\lambda/\rho}$ for elementwise soft-thresholding.
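A sketch of these updates in numpy, with $S^{\mathrm{tr}}$ implemented by soft-thresholding the singular values; the matrix $M$, $\lambda$, $\rho$, and iteration count below are made up for illustration.

```python
import numpy as np

# ADMM for min ||L||_tr + lam*||S||_1 subject to L + S = M, following the updates above:
# matrix soft-thresholding (on singular values) for L, elementwise for S.
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def svt(A, t):                                   # matrix soft-thresholding S^tr_t
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(soft_threshold(s, t)) @ Vt

rng = np.random.default_rng(0)
n, m, lam, rho = 40, 30, 0.2, 1.0
M = rng.standard_normal((n, 2)) @ rng.standard_normal((2, m))   # rank-2 component...
M[rng.random((n, m)) < 0.05] += 10.0                            # ...plus sparse corruptions

L = S = W = np.zeros((n, m))
for k in range(100):
    L = svt(M - S + W, 1.0 / rho)
    S = soft_threshold(M - L + W, lam / rho)
    W = W + M - L - S

print("rank of L:", np.linalg.matrix_rank(L, tol=1e-6),
      "nonzeros in S:", np.count_nonzero(np.abs(S) > 1e-6))
```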

[Figure: example from Candes et al. (2009): (a) original frames, (b) low-rank $\hat{L}$, (c) sparse $\hat{S}$]

Consensus ADMM

Consider a problem of the form:
$$\min_x \sum_{i=1}^B f_i(x)$$
The consensus ADMM approach begins by reparametrizing:
$$\min_{x_1, \ldots, x_B, x} \sum_{i=1}^B f_i(x_i) \quad \text{subject to } x_i = x, \ i = 1, \ldots, B$$
This yields the decomposable ADMM steps:
$$x_i^{(k)} = \operatorname{argmin}_{x_i} \; f_i(x_i) + \frac{\rho}{2}\|x_i - x^{(k-1)} + w_i^{(k-1)}\|_2^2, \quad i = 1, \ldots, B$$
$$x^{(k)} = \frac{1}{B}\sum_{i=1}^B \left( x_i^{(k)} + w_i^{(k-1)} \right)$$
$$w_i^{(k)} = w_i^{(k-1)} + x_i^{(k)} - x^{(k)}, \quad i = 1, \ldots, B$$

Write $\bar{x} = \frac{1}{B}\sum_{i=1}^B x_i$ and similarly for other variables. It is not hard to see that $\bar{w}^{(k)} = 0$ for all iterations $k \ge 1$.
Hence the ADMM steps can be simplified, by taking $x^{(k)} = \bar{x}^{(k)}$:
$$x_i^{(k)} = \operatorname{argmin}_{x_i} \; f_i(x_i) + \frac{\rho}{2}\|x_i - \bar{x}^{(k-1)} + w_i^{(k-1)}\|_2^2, \quad i = 1, \ldots, B$$
$$w_i^{(k)} = w_i^{(k-1)} + x_i^{(k)} - \bar{x}^{(k)}, \quad i = 1, \ldots, B$$
To reiterate, the $x_i$, $i = 1, \ldots, B$ updates here are done in parallel.
Intuition:
- Try to minimize each $f_i(x_i)$, and use (squared) $\ell_2$ regularization to pull each $x_i$ towards the average $\bar{x}$
- If a variable $x_i$ is bigger than the average, then $w_i$ is increased
- So the regularization in the next step pulls $x_i$ even closer
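A sketch of the simplified consensus ADMM steps, with the illustrative choice (an assumption, not from the slides) $f_i(x_i) = \frac{1}{2}\|y_i - X_ix_i\|_2^2$ on block $i$ of the data, so each $x_i$ update is a ridge-type linear solve; the block updates are written as a loop here but could run in parallel.

```python
import numpy as np

# Consensus ADMM sketch: the global least squares problem min_x sum_i 0.5*||y_i - X_i x||^2
# is split over B data blocks (an illustrative choice of f_i).
rng = np.random.default_rng(0)
B, n_i, p, rho = 4, 50, 10, 1.0
x_true = rng.standard_normal(p)
Xs = [rng.standard_normal((n_i, p)) for _ in range(B)]
ys = [X @ x_true + 0.1 * rng.standard_normal(n_i) for X in Xs]

xs = [np.zeros(p) for _ in range(B)]
ws = [np.zeros(p) for _ in range(B)]
xbar = np.zeros(p)
for k in range(200):
    for i in range(B):                         # block updates (parallelizable)
        rhs = Xs[i].T @ ys[i] + rho * (xbar - ws[i])
        xs[i] = np.linalg.solve(Xs[i].T @ Xs[i] + rho * np.eye(p), rhs)
    xbar = np.mean(xs, axis=0)
    for i in range(B):
        ws[i] = ws[i] + xs[i] - xbar

full_ls = np.linalg.lstsq(np.vstack(Xs), np.concatenate(ys), rcond=None)[0]
print("error vs. full least squares:", np.linalg.norm(xbar - full_ls))
```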

General consensus ADMM

Consider a problem of the form:
$$\min_x \sum_{i=1}^B f_i(a_i^Tx + b_i) + g(x)$$
For consensus ADMM, we again reparametrize:
$$\min_{x_1, \ldots, x_B, x} \sum_{i=1}^B f_i(a_i^Tx_i + b_i) + g(x) \quad \text{subject to } x_i = x, \ i = 1, \ldots, B$$
This yields the decomposable ADMM updates:
$$x_i^{(k)} = \operatorname{argmin}_{x_i} \; f_i(a_i^Tx_i + b_i) + \frac{\rho}{2}\|x_i - x^{(k-1)} + w_i^{(k-1)}\|_2^2, \quad i = 1, \ldots, B$$
$$x^{(k)} = \operatorname{argmin}_x \; \frac{B\rho}{2}\|x - \bar{x}^{(k)} - \bar{w}^{(k-1)}\|_2^2 + g(x)$$
$$w_i^{(k)} = w_i^{(k-1)} + x_i^{(k)} - x^{(k)}, \quad i = 1, \ldots, B$$

Notes:
- It is no longer true that $\bar{w}^{(k)} = 0$ at a general iteration $k$, so the ADMM steps do not simplify as before
- To reiterate, the $x_i$, $i = 1, \ldots, B$ updates are done in parallel
- Each $x_i$ update can be thought of as a loss minimization on part of the data, with $\ell_2$ regularization
- The $x$ update is a proximal operation in the regularizer $g$
- The $w$ update drives the individual variables into consensus
- A different initial reparametrization will give rise to a different ADMM algorithm
See Boyd et al. (2010), Parikh and Boyd (2013) for more details on consensus ADMM, strategies for splitting up into subproblems, and implementation tips.

Special decompositions

ADMM can exhibit much faster convergence than usual when we parametrize the subproblems in a special way.
ADMM updates relate closely to block coordinate descent, in which we optimize a criterion in an alternating fashion across blocks of variables.
With this in mind, we get the fastest convergence when minimizing over blocks of variables leads to updates in nearly orthogonal directions. This suggests we should design the ADMM form (auxiliary constraints) so that the primal updates de-correlate as much as possible.
This is done in, e.g., Ramdas and Tibshirani (2014), Wytock et al. (2014), Barbero and Sra (2014).

Example: 2d fused lasso

Given an image $Y \in \mathbb{R}^{d \times d}$, equivalently written as $y \in \mathbb{R}^n$, recall the 2d fused lasso or 2d total variation denoising problem:
$$\min_\Theta \frac{1}{2}\|Y - \Theta\|_F^2 + \lambda \sum_{i,j} \left( |\Theta_{i,j} - \Theta_{i+1,j}| + |\Theta_{i,j} - \Theta_{i,j+1}| \right)$$
$$\iff \min_\theta \frac{1}{2}\|y - \theta\|_2^2 + \lambda \|D\theta\|_1$$
Here $D \in \mathbb{R}^{m \times n}$ is a 2d difference operator giving the appropriate differences (across horizontally and vertically adjacent positions).
[Figure: variables are black dots; the partitions P and Q are in orange and cyan]

First way to rewrite:
$$\min_{\theta, z} \frac{1}{2}\|y - \theta\|_2^2 + \lambda \|z\|_1 \quad \text{subject to } z = D\theta$$
Leads to ADMM steps:
$$\theta^{(k)} = (I + \rho D^TD)^{-1}\left( y + \rho D^T(z^{(k-1)} + w^{(k-1)}) \right)$$
$$z^{(k)} = S_{\lambda/\rho}(D\theta^{(k)} - w^{(k-1)})$$
$$w^{(k)} = w^{(k-1)} + z^{(k)} - D\theta^{(k)}$$
Notes:
- The $\theta$ update solves a linear system in $I + \rho L$, with $L = D^TD$ the graph Laplacian matrix of the 2d grid, so this can be done efficiently, in roughly $O(n)$ operations
- The $z$ update applies the soft-thresholding operator $S_t$
- Hence one entire ADMM cycle uses roughly $O(n)$ operations
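A sketch of this first parametrization with scipy.sparse, building the grid difference operator $D$ explicitly and factoring $I + \rho D^TD$ once up front; a general sparse factorization is used here as a simple stand-in for the roughly $O(n)$ grid solve mentioned above, and the image, $\lambda$, and $\rho$ are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import factorized

# "Standard" ADMM for the 2d fused lasso, following the theta/z/w updates above.
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def grid_diff_operator(d):
    D1 = sp.diags([-np.ones(d - 1), np.ones(d - 1)], [0, 1], shape=(d - 1, d))
    I = sp.eye(d)
    return sp.vstack([sp.kron(I, D1),        # horizontal differences (row-major ravel)
                      sp.kron(D1, I)]).tocsc()  # vertical differences

d, lam, rho = 30, 1.0, 1.0
rng = np.random.default_rng(0)
Y = np.zeros((d, d)); Y[10:20, 10:20] = 5.0      # piecewise-constant image plus noise
y = (Y + rng.standard_normal((d, d))).ravel()

D = grid_diff_operator(d)
solve = factorized((sp.eye(d * d) + rho * (D.T @ D)).tocsc())   # factor I + rho*D^T D once
z = w = np.zeros(D.shape[0])
for k in range(100):
    theta = solve(y + rho * D.T @ (z + w))
    z = soft_threshold(D @ theta - w, lam / rho)
    w = w + z - D @ theta

print("objective:", 0.5 * np.sum((y - theta) ** 2) + lam * np.sum(np.abs(D @ theta)))
```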

Second way to rewrite:
$$\min_{H,V} \frac{1}{2}\|Y - H\|_F^2 + \lambda \sum_{i,j} \left( |H_{i,j} - H_{i+1,j}| + |V_{i,j} - V_{i,j+1}| \right) \quad \text{subject to } H = V$$
Leads to ADMM steps:
$$H_{\cdot,j}^{(k)} = \mathrm{FL}^{1d}_{\lambda/(1+\rho)}\left( \frac{Y_{\cdot,j} + \rho\big(V_{\cdot,j}^{(k-1)} - W_{\cdot,j}^{(k-1)}\big)}{1 + \rho} \right), \quad j = 1, \ldots, d$$
$$V_{i,\cdot}^{(k)} = \mathrm{FL}^{1d}_{\lambda/\rho}\left( H_{i,\cdot}^{(k)} + W_{i,\cdot}^{(k-1)} \right), \quad i = 1, \ldots, d$$
$$W^{(k)} = W^{(k-1)} + H^{(k)} - V^{(k)}$$
Notes:
- Both the $H, V$ updates solve a (sequence of) 1d fused lassos, where we write
$$\mathrm{FL}^{1d}_\tau(a) = \operatorname{argmin}_x \frac{1}{2}\|a - x\|_2^2 + \tau \sum_{i=1}^{d-1} |x_i - x_{i+1}|$$

- Critical: each 1d fused lasso solution can be computed exactly in $O(d)$ operations with specialized algorithms (e.g., Johnson, 2013; Davies and Kovac, 2001)
- Hence one entire ADMM cycle again uses $O(n)$ operations

Comparison of 2d fused lasso algorithms: an image of dimension $300 \times 200$ (so $n = 60{,}000$). [Figure: the data and the exact solution]

Two ADMM algorithms, (say) standard and specialized ADMM: [Figure: $f(k) - f^\star$ versus iteration $k$ for the standard and specialized algorithms]

ADMM iterates visualized after $k = 10, 30, 50, 100$ iterations: [Figure: standard ADMM vs. specialized ADMM iterates, shown after 10, 30, 50, and 100 iterations]

References

- A. Barbero and S. Sra (2014), "Modular proximal optimization for multidimensional total-variation regularization"
- S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2010), "Distributed optimization and statistical learning via the alternating direction method of multipliers"
- E. Candes, X. Li, Y. Ma, and J. Wright (2009), "Robust principal component analysis?"
- N. Parikh and S. Boyd (2013), "Proximal algorithms"
- A. Ramdas and R. Tibshirani (2014), "Fast and flexible ADMM algorithms for trend filtering"
- V. Vu, J. Cho, J. Lei, and K. Rohe (2013), "Fantope projection and selection: a near-optimal convex relaxation of sparse PCA"
- M. Wytock, S. Sra, and Z. Kolter (2014), "Fast Newton methods for the group fused lasso"