Distributed Optimization via Alternating Direction Method of Multipliers


Distributed Optimization via Alternating Direction Method of Multipliers

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato
Stanford University

ITMANET, Stanford, January 2011

Outline

- precursors: dual decomposition, method of multipliers
- alternating direction method of multipliers
- applications/examples
- conclusions/big picture

Dual problem

convex equality constrained optimization problem:

    minimize    $f(x)$
    subject to  $Ax = b$

- Lagrangian: $L(x,y) = f(x) + y^T(Ax - b)$
- dual function: $g(y) = \inf_x L(x,y)$
- dual problem: maximize $g(y)$
- recover $x^\star = \operatorname{argmin}_x L(x, y^\star)$
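As a concrete illustration (not on the slide), take $f(x) = \tfrac{1}{2}\|x\|_2^2$; the inner minimization has a closed form and the dual is an unconstrained concave quadratic:

\[
\begin{aligned}
&\text{Lagrangian:} && L(x,y) = \tfrac{1}{2}\|x\|_2^2 + y^T(Ax - b)\\
&\text{minimizer over } x\text{:} && x = -A^T y\\
&\text{dual function:} && g(y) = -\tfrac{1}{2}\|A^T y\|_2^2 - b^T y\\
&\text{recovery:} && x^\star = -A^T y^\star
\end{aligned}
\]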

Dual ascent

gradient method for dual problem: $y^{k+1} = y^k + \alpha^k \nabla g(y^k)$

$\nabla g(y^k) = A\tilde{x} - b$, where $\tilde{x} = \operatorname{argmin}_x L(x, y^k)$

dual ascent method is

    $x^{k+1} := \operatorname{argmin}_x L(x, y^k)$    // x-minimization
    $y^{k+1} := y^k + \alpha^k (Ax^{k+1} - b)$    // dual update

works, with lots of strong assumptions
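A minimal sketch of dual ascent (an illustrative toy, not from the slides), assuming a quadratic $f(x) = \tfrac{1}{2}x^T P x + q^T x$ with $P \succ 0$ so the x-minimization has a closed form; the step size is chosen ad hoc for this toy problem:

```python
# A toy sketch of dual ascent (illustrative, not from the slides), assuming
# f(x) = (1/2) x^T P x + q^T x with P positive definite, so the
# x-minimization has a closed form; the step size alpha is chosen ad hoc.
import numpy as np

def dual_ascent(P, q, A, b, alpha=0.05, iters=500):
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        # x-minimization: solve grad_x L = P x + q + A^T y = 0
        x = np.linalg.solve(P, -(q + A.T @ y))
        # dual update along grad g(y) = A x - b
        y = y + alpha * (A @ x - b)
    return x, y

# tiny randomly generated instance
rng = np.random.default_rng(0)
n, m = 5, 2
M = rng.standard_normal((n, n))
P = M @ M.T + np.eye(n)              # positive definite
q, b = rng.standard_normal(n), rng.standard_normal(m)
A = rng.standard_normal((m, n))
x, y = dual_ascent(P, q, A, b)
print("primal residual ||Ax - b|| =", np.linalg.norm(A @ x - b))
```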

Dual decomposition

suppose $f$ is separable:

    $f(x) = f_1(x_1) + \cdots + f_N(x_N), \quad x = (x_1, \ldots, x_N)$

then $L$ is separable in $x$:

    $L(x,y) = L_1(x_1,y) + \cdots + L_N(x_N,y) - y^T b, \qquad L_i(x_i,y) = f_i(x_i) + y^T A_i x_i$

x-minimization in dual ascent splits into $N$ separate minimizations

    $x_i^{k+1} := \operatorname{argmin}_{x_i} L_i(x_i, y^k)$

which can be carried out in parallel

Dual decomposition

dual decomposition (Everett, Dantzig, Wolfe, Benders 1960-65)

    $x_i^{k+1} := \operatorname{argmin}_{x_i} L_i(x_i, y^k), \quad i = 1, \ldots, N$
    $y^{k+1} := y^k + \alpha^k \left( \sum_{i=1}^N A_i x_i^{k+1} - b \right)$

- scatter $y^k$; update $x_i$ in parallel; gather $A_i x_i^{k+1}$
- solve a large problem by iteratively solving subproblems (in parallel)
- dual variable update provides coordination
- works, with lots of assumptions; often slow
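A hypothetical sketch of the scatter/update/gather pattern, assuming separable quadratic terms $f_i(x_i) = \tfrac{1}{2}\|x_i\|_2^2 + q_i^T x_i$ and a column-block partition $Ax = \sum_i A_i x_i$ (names and setup are illustrative, not from the slides):

```python
# A hypothetical dual-decomposition sketch, assuming separable quadratic terms
# f_i(x_i) = (1/2)||x_i||^2 + q_i^T x_i and a column-block partition of A,
# so that Ax = sum_i A_i x_i; each x_i-update could run on its own processor.
import numpy as np

def dual_decomposition(q_blocks, A_blocks, b, alpha=0.05, iters=500):
    y = np.zeros(b.shape[0])
    for _ in range(iters):
        # parallelizable x_i-updates; for this f_i the minimizer of
        # L_i(x_i, y) = f_i(x_i) + y^T A_i x_i is x_i = -(q_i + A_i^T y)
        x_blocks = [-(q_i + A_i.T @ y) for q_i, A_i in zip(q_blocks, A_blocks)]
        # gather A_i x_i, then the coordinating dual update
        r = sum(A_i @ x_i for A_i, x_i in zip(A_blocks, x_blocks)) - b
        y = y + alpha * r
    return x_blocks, y
```

The dual update is the only step that couples the blocks, which is what makes the x-updates embarrassingly parallel.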

Method of multipliers

a method to robustify dual ascent

use augmented Lagrangian (Hestenes, Powell 1969), $\rho > 0$:

    $L_\rho(x,y) = f(x) + y^T(Ax - b) + (\rho/2)\|Ax - b\|_2^2$

method of multipliers (Hestenes, Powell; analysis in Bertsekas 1982):

    $x^{k+1} := \operatorname{argmin}_x L_\rho(x, y^k)$
    $y^{k+1} := y^k + \rho (Ax^{k+1} - b)$

(note specific dual update step length $\rho$)
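Continuing the same toy quadratic problem from the dual-ascent sketch, here is a method-of-multipliers sketch; the only changes are the $\rho$-augmented x-update and the fixed dual step length $\rho$:

```python
# The same toy quadratic problem as in the dual-ascent sketch, now solved with
# the method of multipliers; the only changes are the rho-augmented x-update
# and the fixed dual step length rho.
import numpy as np

def method_of_multipliers(P, q, A, b, rho=1.0, iters=100):
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        # x-minimization of L_rho: (P + rho A^T A) x = -(q + A^T (y - rho b))
        x = np.linalg.solve(P + rho * A.T @ A, -(q + A.T @ (y - rho * b)))
        # dual update with step length rho
        y = y + rho * (A @ x - b)
    return x, y
```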

Method of multipliers

- good news: converges under much more relaxed conditions ($f$ can be nondifferentiable, take on the value $+\infty$, ...)
- bad news: quadratic penalty destroys splitting of the x-update, so we can't do decomposition

Alternating direction method of multipliers

- a method with the good robustness of the method of multipliers that can support decomposition
- "robust dual decomposition" or "decomposable method of multipliers"
- proposed by Gabay, Mercier, Glowinski, Marrocco in 1976

Alternating direction method of multipliers

ADMM problem form (with $f$, $g$ convex):

    minimize    $f(x) + g(z)$
    subject to  $Ax + Bz = c$

two sets of variables, with separable objective

    $L_\rho(x,z,y) = f(x) + g(z) + y^T(Ax + Bz - c) + (\rho/2)\|Ax + Bz - c\|_2^2$

ADMM:

    $x^{k+1} := \operatorname{argmin}_x L_\rho(x, z^k, y^k)$    // x-minimization
    $z^{k+1} := \operatorname{argmin}_z L_\rho(x^{k+1}, z, y^k)$    // z-minimization
    $y^{k+1} := y^k + \rho (Ax^{k+1} + Bz^{k+1} - c)$    // dual update
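As one concrete instance (a common lasso-style splitting, not the example used in the talk), take $f(x) = \tfrac{1}{2}\|Dx - d\|_2^2$, $g(z) = \lambda\|z\|_1$, $A = I$, $B = -I$, $c = 0$; the x-update is then a linear solve and the z-update is soft thresholding. A hedged sketch:

```python
# A hedged ADMM sketch (not the example used in the talk) for the splitting
#   minimize (1/2)||D x - d||_2^2 + lam ||z||_1   subject to  x - z = 0,
# i.e. f(x) = (1/2)||Dx - d||^2, g(z) = lam ||z||_1, A = I, B = -I, c = 0.
import numpy as np

def soft_threshold(v, kappa):
    # proximal operator of kappa * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(D, d, lam, rho=1.0, iters=200):
    n = D.shape[1]
    x, z, y = np.zeros(n), np.zeros(n), np.zeros(n)
    DtD, Dtd = D.T @ D, D.T @ d
    for _ in range(iters):
        # x-minimization: (D^T D + rho I) x = D^T d + rho z - y
        x = np.linalg.solve(DtD + rho * np.eye(n), Dtd + rho * z - y)
        # z-minimization: soft thresholding of x + y/rho at level lam/rho
        z = soft_threshold(x + y / rho, lam / rho)
        # dual update
        y = y + rho * (x - z)
    return x, z
```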

Alternating direction method of multipliers

- if we minimized over $x$ and $z$ jointly, this reduces to the method of multipliers
- instead, we do one pass of a Gauss-Seidel method
- we get splitting since we minimize over $x$ with $z$ fixed, and vice versa

Convergence

assume (very little!)

- $f$, $g$ convex, closed, proper
- $L_0$ has a saddle point

then ADMM converges:

- iterates approach feasibility: $Ax^k + Bz^k - c \to 0$
- objective approaches optimal value: $f(x^k) + g(z^k) \to p^\star$

Related algorithms

- operator splitting methods (Douglas, Peaceman, Rachford, Lions, Mercier, ... 1950s, 1979)
- proximal point algorithm (Rockafellar 1976)
- Dykstra's alternating projections algorithm (1983)
- Spingarn's method of partial inverses (1985)
- Rockafellar-Wets progressive hedging (1991)
- proximal methods (Rockafellar, many others, 1976-present)
- Bregman iterative methods (2008-present)

most of these are special cases of the proximal point algorithm

Consensus optimization

want to solve a problem with $N$ objective terms

    minimize $\sum_{i=1}^N f_i(x)$

e.g., $f_i$ is the loss function for the $i$th block of training data

ADMM form:

    minimize    $\sum_{i=1}^N f_i(x_i)$
    subject to  $x_i - z = 0$

- $x_i$ are local variables
- $z$ is the global variable
- $x_i - z = 0$ are consistency or consensus constraints

Consensus optimization via ADMM

    $L_\rho(x,z,y) = \sum_{i=1}^N \left( f_i(x_i) + y_i^T(x_i - z) + (\rho/2)\|x_i - z\|_2^2 \right)$

ADMM:

    $x_i^{k+1} := \operatorname{argmin}_{x_i} \left( f_i(x_i) + y_i^{kT}(x_i - z^k) + (\rho/2)\|x_i - z^k\|_2^2 \right)$
    $z^{k+1} := \frac{1}{N} \sum_{i=1}^N \left( x_i^{k+1} + (1/\rho) y_i^k \right)$
    $y_i^{k+1} := y_i^k + \rho (x_i^{k+1} - z^{k+1})$
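A brief justification (not spelled out on the slide) of the simplification used next: averaging the updates over $i$ gives

\[
z^{k+1} = \bar{x}^{k+1} + \tfrac{1}{\rho}\,\bar{y}^{k}, \qquad
\bar{y}^{k+1} = \bar{y}^{k} + \rho\left(\bar{x}^{k+1} - z^{k+1}\right) = 0,
\]

so after the first dual update $\bar{y}^k = 0$ and the $z$-update reduces to the average $z^{k+1} = \bar{x}^{k+1}$.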

Consensus optimization via ADMM

using $\sum_{i=1}^N y_i^k = 0$, the algorithm simplifies to

    $x_i^{k+1} := \operatorname{argmin}_{x_i} \left( f_i(x_i) + y_i^{kT}(x_i - \bar{x}^k) + (\rho/2)\|x_i - \bar{x}^k\|_2^2 \right)$
    $y_i^{k+1} := y_i^k + \rho (x_i^{k+1} - \bar{x}^{k+1})$

where $\bar{x}^k = (1/N)\sum_{i=1}^N x_i^k$

in each iteration:

- gather $x_i^k$ and average to get $\bar{x}^k$
- scatter the average $\bar{x}^k$ to processors
- update $y_i^k$ locally (in each processor, in parallel)
- update $x_i$ locally
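A minimal consensus-ADMM sketch, under the assumption that each local term is quadratic, $f_i(x) = \tfrac{1}{2}\|D_i x - d_i\|_2^2$, so the local updates have closed form (the block structure and names are illustrative, not from the slides):

```python
# A minimal consensus-ADMM sketch, assuming quadratic local terms
# f_i(x) = (1/2)||D_i x - d_i||^2 so the local x_i-updates have closed form;
# the block structure and names here are illustrative, not from the slides.
import numpy as np

def consensus_admm(D_blocks, d_blocks, rho=1.0, iters=100):
    n = D_blocks[0].shape[1]
    N = len(D_blocks)
    x = [np.zeros(n) for _ in range(N)]
    y = [np.zeros(n) for _ in range(N)]
    xbar = np.zeros(n)
    for _ in range(iters):
        # local x_i-updates (parallelizable):
        #   (D_i^T D_i + rho I) x_i = D_i^T d_i - y_i + rho xbar
        x = [np.linalg.solve(D.T @ D + rho * np.eye(n), D.T @ d - yi + rho * xbar)
             for D, d, yi in zip(D_blocks, d_blocks, y)]
        # gather and average
        xbar = sum(x) / N
        # local dual updates
        y = [yi + rho * (xi - xbar) for yi, xi in zip(y, x)]
    return xbar
```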

Statistical interpretation

- $f_i$ is negative log-likelihood for parameter $x$ given $i$th data block
- $x_i^{k+1}$ is MAP estimate under prior $\mathcal{N}(\bar{x}^k + (1/\rho)y_i^k, \rho I)$
- prior mean is previous iteration's consensus shifted by "price" of processor $i$ disagreeing with previous consensus
- processors only need to support a Gaussian MAP method
- type or number of data in each block not relevant
- consensus protocol yields global maximum-likelihood estimate

Consensus classification

- data (examples) $(a_i, b_i)$, $i = 1, \ldots, N$, $a_i \in \mathbf{R}^n$, $b_i \in \{-1, +1\}$
- linear classifier $\operatorname{sign}(a^T w + v)$, with weight $w$, offset $v$
- margin for $i$th example is $b_i(a_i^T w + v)$; want margin to be positive
- loss for $i$th example is $l(b_i(a_i^T w + v))$
- $l$ is loss function (hinge, logistic, probit, exponential, ...)
- choose $w$, $v$ to minimize $\frac{1}{N}\sum_{i=1}^N l(b_i(a_i^T w + v)) + r(w)$
- $r(w)$ is regularization term ($\ell_2$, $\ell_1$, ...)
- split data and use ADMM consensus to solve (see the sketch below)
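A hedged sketch of the consensus approach to classification, substituting the smooth logistic loss for the hinge loss so each local update can be handled by a generic smooth solver (scipy's BFGS); the data split, regularization weight `lam`, and helper names are illustrative assumptions:

```python
# A hedged sketch of consensus classification, substituting the smooth
# logistic loss for the hinge loss so each local update can be handled by a
# generic smooth solver (scipy's BFGS). The data split, regularization
# weight lam, and helper names below are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

def local_loss(u, A, b, lam):
    # u = (w, v); average logistic loss log(1 + exp(-margin)) plus l2 on w
    w, v = u[:-1], u[-1]
    margins = b * (A @ w + v)
    return np.mean(np.logaddexp(0.0, -margins)) + lam * (w @ w)

def consensus_classifier(blocks, lam=1e-3, rho=1.0, iters=50):
    # blocks is a list of (A_j, b_j) pairs: one chunk of examples per processor
    n = blocks[0][0].shape[1]
    N = len(blocks)
    u = [np.zeros(n + 1) for _ in range(N)]    # local (w, v) per block
    y = [np.zeros(n + 1) for _ in range(N)]    # local dual variables
    ubar = np.zeros(n + 1)
    for _ in range(iters):
        # local updates (one per processor, in parallel in principle)
        u = [minimize(lambda t: local_loss(t, A, b, lam) + yi @ (t - ubar)
                      + (rho / 2) * np.sum((t - ubar) ** 2),
                      ui, method="BFGS").x
             for (A, b), ui, yi in zip(blocks, u, y)]
        ubar = sum(u) / N                      # gather and average
        y = [yi + rho * (ui - ubar) for yi, ui in zip(y, u)]
    return ubar[:-1], ubar[-1]                 # consensus weight w and offset v
```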

Consensus SVM example

- hinge loss $l(u) = (1 - u)_+$ with $\ell_2$ regularization
- baby problem with $n = 2$, $N = 400$ to illustrate
- examples split into 20 groups, in the worst possible way: each group contains only positive or negative examples

Iteration 1 [figure]

Iteration 5 [figure]

Iteration 40 [figure]

Reference

Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers (Boyd, Parikh, Chu, Peleato, Eckstein), available at Boyd's web site