Math 273a: Optimization Overview of First-Order Optimization Algorithms


Math 273a: Optimization
Overview of First-Order Optimization Algorithms
Wotao Yin, Department of Mathematics, UCLA
(online discussions on piazza.com)

Typical flow of numerical optimization

Optimization Problem → standard forms (LS, LP, SOCP, ...) → Algorithm → problem-specific improvements → Solution

Linear programming (LP)

General form:
    minimize_x   c^T x + c_0
    subject to   a_i^T x <= b_i,  i = 1, ..., m
                 a_j^T x = b_j,   j = 1, ..., m'

Every LP can be turned into the standard form
    minimize_x   c^T x
    subject to   Ax = b,  x >= 0.

- no analytic formula for solutions
- reliable algorithms and software packages (e.g., CPLEX, Gurobi)
- computation time roughly proportional to n^2 m if m <= n, less with structured data
- a mature technology (unless A is huge)

Example: basis pursuit

Goal: recover a sparse solution in [l, u] to Ax = b.

Model: the l1-minimization problem
    minimize_x   ||x||_1
    subject to   Ax = b,  l <= x <= u

[Figure: original sparse signal and recovered signal over 200 samples; the two coincide closely.]

Original model:
    minimize_x   ||x||_1
    subject to   Ax = b,  l <= x <= u

LP formulation: split x = x+ - x- with x+, x- >= 0, so ||x||_1 = 1^T x+ + 1^T x-:
    minimize_{x+, x-}   1^T x+ + 1^T x-
    subject to   [A  -A] (x+; x-) = b
                 l <= [I  -I] (x+; x-) <= u
                 (x+; x-) >= 0

Another optimization flow

Optimization Problem → simple parts (coordinates / operators) → coordinate / splitting algorithm → Solution

Coordinate descent and operator splitting

They provide a simple yet powerful approach to deriving algorithms that are
- applicable to a large class of problems (those with friendly structures)
- easy to implement (often just several lines of code)
- (nearly) state-of-the-art on medium/large-scale problems¹
- highly scalable (can go stochastic, parallel, distributed, asynchronous)
- backed by convergence and complexity guarantees

¹ They may not be the best choice for problems of small/medium scale.

Monotone operator splitting pipeline

1. recognize the simple parts in your problem
2. reformulate it as a monotone inclusion, e.g., 0 ∈ (A_1 + A_2)x  (each A_i contains simple parts of your problem)
3. apply an operator-splitting scheme, e.g.,
       0 ∈ (A_1 + A_2)x  ⇔  x = T_{A_1} T_{A_2}(x) =: T(x)
4. run the iteration z^{k+1} = z^k + λ(Tz^k - z^k),  λ ∈ (0, 1]
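Step 4 above is the relaxed (Krasnosel'skii-Mann) fixed-point iteration. A minimal sketch, with a toy averaged operator T (a shrink toward the point (1, 1)) chosen purely for illustration; the operator, starting point, and relaxation λ are all assumptions, not from the slides:

```python
# Krasnosel'skii-Mann iteration z^{k+1} = z^k + lam*(T(z^k) - z^k)
import numpy as np

def km_iteration(T, z0, lam=0.5, iters=200):
    z = np.asarray(z0, dtype=float)
    for _ in range(iters):
        z = z + lam * (T(z) - z)   # move a fraction lam toward T(z)
    return z

# toy averaged operator with fixed point (1, 1)
T = lambda z: 0.5 * (z - np.array([1.0, 1.0])) + np.array([1.0, 1.0])
z_star = km_iteration(T, np.zeros(2))   # converges to the fixed point
```

Once an operator T is assembled from the simple parts, this loop is the entire algorithm; the rest of the lecture is about how to build T.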

Operator splitting: the big three

The big three two-operator splitting schemes, for 0 ∈ Ax + Bx:

- forward-backward (Mercier '79): (maximally monotone) + (cocoercive)
- Douglas-Rachford (Lions-Mercier '79): (maximally monotone) + (maximally monotone)
- forward-backward-forward (Tseng '00): (maximally monotone) + (Lipschitz & monotone)

All of these schemes are built from forward operators and backward operators.

Forward-backward splitting

Require: A monotone, B single-valued monotone.

Forward-backward splitting (FBS) operator (Lions-Mercier '79):
    T_FBS := (I + γA)^{-1} (I - γB),   where γ > 0

- reduces to the forward operator if A = 0, and to the backward operator if B = 0
- equivalent conditions:
      0 ∈ Ax + Bx
  ⇔  x - γBx ∈ x + γAx
  ⇔  (I - γB)x ∈ (I + γA)x
  ⇔  (I + γA)^{-1}(I - γB)x = x
  ⇔  x = T_FBS(x)

FBS iteration: x^{k+1} = T_FBS(x^k)

- converges if B is β-cocoercive and γ < 2β (details come later)
- convergence rates are known
- typical usage:
    - minimize smooth + proximable
    - minimize smooth + proximable∘linear + proximable/constraints
    - decentralized consensus minimization

Douglas-Rachford splitting

Require: A, B both monotone, possibly set-valued.

Define the reflective resolvent
    R_γT := 2(I + γT)^{-1} - I.

Douglas-Rachford splitting (DRS) operator (Lions-Mercier '79):
    T_DRS := (1/2) I + (1/2) R_γA R_γB

- note: switching A and B gives a different DRS operator
- introduce z ∈ (I + γA)x; then x = J_γA(z)
- equivalent conditions: 0 ∈ Ax + Bx  ⇔  z = T_DRS(z),  x = J_γB(z)
- applications: alternating projection, ADMM, distributed optimization

Peaceman-Rachford splitting

Peaceman-Rachford splitting (PRS) operator:
    T_PRS := R_γA R_γB

Relaxed PRS operator, λ ∈ (0, 1]:
    T_PRS^λ := (1 - λ)I + λ R_γA R_γB

- T_PRS^{1/2} recovers T_DRS
- T_PRS^λ with λ ∈ (0, 1) requires a weaker condition to converge than T_PRS, but the latter tends to be faster when it converges

Operator splitting: direct applications

Constrained minimization

C is a convex set. f is a proper closed convex function.
    minimize_x   f(x)
    subject to   x ∈ C

Equivalent condition: 0 ∈ N_C(x) + ∂f(x)
(N_C(x) = ∂ι_C(x), the subdifferential of C's indicator function)

If f is Lipschitz differentiable, then apply forward-backward splitting:
    x^{k+1} = proj_C (I - γ∇f) x^k
which recovers the projected gradient method.
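A minimal sketch of the projected gradient method as an instance of FBS, on the toy problem minimize 0.5*||x - c||^2 over the box C = [0, 1]^n; the objective, the box, and the step size are illustrative assumptions:

```python
# projected gradient = forward (gradient) step followed by backward (projection) step
import numpy as np

def projected_gradient(c, gamma=0.5, iters=100):
    x = np.zeros_like(c)
    for _ in range(iters):
        x = x - gamma * (x - c)      # forward step: x - gamma * grad f(x)
        x = np.clip(x, 0.0, 1.0)     # backward step: projection onto C = [0,1]^n
    return x

c = np.array([2.0, -1.0, 0.3])
x_star = projected_gradient(c)       # componentwise clip of c onto [0, 1]
```

Here ∇f(x) = x - c is 1-Lipschitz, so any γ < 2 satisfies the cocoercivity condition from the FBS slide.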

Constrained minimization, cont.

If f is non-differentiable, then apply Douglas-Rachford splitting (DRS):
    z^{k+1} = ( (1/2) I + (1/2) (2prox_γf - I)(2proj_C - I) ) z^k
(recover x^k = proj_C z^k)

Dual DRS approach: introduce x - y = 0 and apply ADMM to
    minimize_{x,y}   f(x) + ι_C(y)
    subject to   x - y = 0.
(Indicator function: ι_C(y) = 0 if y ∈ C, and +∞ otherwise.)

Equivalence: the ADMM iteration = the DRS iteration.

Regularized least squares

    minimize_x   r(x) + (1/2)||Kx - b||²,   with f(x) := (1/2)||Kx - b||²

- K: linear operator; b: input data (observation)
- r: enforces a structure on x; examples: l2², l1, sorted l1, l2, TV, nuclear norm, ...

Equivalent condition:
    0 ∈ ∂r(x) + ∇f(x)  ⇔  x = (I + γ∂r)^{-1} (I - γ∇f)x,   where (I + γ∂r)^{-1} = prox_γr

Forward-backward splitting iteration: x^{k+1} = prox_γr (I - γ∇f) x^k

Example: LASSO (basis pursuit denoising)

Tibshirani '96: find a sparse vector x such that Lx ≈ b by solving
    minimize_x   λ||x||_1 + (1/2)||Lx - b||²,
with r(x) := λ||x||_1 and f(x) := (1/2)||Lx - b||².

Simple parts:
- proximable function r(x) := λ||x||_1:
      arg min_x { r(x) + (1/2γ)||x - y||² } = sign(y) max{0, |y| - λγ}  (componentwise)
- smooth function f(x) := (1/2)||Lx - b||², with ∇f(x) = L^T(Lx - b)

Apply forward-backward splitting to 0 ∈ Ax + Bx (A = ∂r, B = ∇f):
    x^{k+1} = (I + γA)^{-1}(I - γB) x^k =: T x^k

Realization:
    (I - γB)x^k = x^k - γ∇f(x^k)
    (I + γA)^{-1}(y) = sign(y) max{0, |y| - λγ}

Therefore, x^{k+1} = (I + γA)^{-1}(I - γB)x^k is realized as
    y^k     = x^k - γ L^T(Lx^k - b)
    x^{k+1} = sign(y^k) max{0, |y^k| - λγ}
which recovers the Iterative Soft-Thresholding Algorithm (ISTA).

If the matrix L is huge, we can randomly sample rows, columns, or blocks of L at each iteration, or distribute the storage of L and run (asynchronous) parallel algorithms.
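The two-line realization above can be sketched directly. A minimal ISTA on random data; the problem sizes, λ, and iteration count are illustrative assumptions:

```python
# ISTA for the LASSO: gradient step on f, then soft-thresholding (prox of lam*||.||_1)
import numpy as np

def ista(L, b, lam, iters=2000):
    gamma = 1.0 / np.linalg.norm(L, 2) ** 2       # gamma < 2/||L||^2 ensures convergence
    x = np.zeros(L.shape[1])
    for _ in range(iters):
        y = x - gamma * L.T @ (L @ x - b)          # forward (gradient) step
        x = np.sign(y) * np.maximum(0.0, np.abs(y) - lam * gamma)  # soft-threshold
    return x

rng = np.random.default_rng(0)
L = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:3] = [3.0, -2.0, 1.5]  # sparse ground truth
b = L @ x_true
x_hat = ista(L, b, lam=0.1)
```

With this step size ISTA decreases the objective monotonically, so x_hat improves on the zero initializer.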

Multi-function minimization

f_1, ..., f_N : H → (-∞, ∞] are proper closed convex functions.

Product-space trick for
    minimize_x   f_1(x) + ... + f_N(x):

- introduce copies x^(i) ∈ H of x; let x = (x^(1), ..., x^(N)) ∈ H^N
- define the set C = {x : x^(1) = ... = x^(N)}
- equivalent problem in H^N:
      minimize_x   ι_C(x) + Σ_{i=1}^N f_i(x^(i))
- then apply a two-operator splitting scheme

- if all f_i are smooth, apply forward-backward splitting and evaluate the ∇f_i in parallel
- if all f_i are proximable, apply Douglas-Rachford splitting and evaluate the prox_{f_i} in parallel (alternatively, one can apply Eckstein-Svaiter projective splitting)
- if some f_i are smooth and the others are proximable, we cannot mix the two schemes; we need to apply a three-operator splitting (combining, e.g., FBS and DRS)

Alternating projection: project, project

C_1 and C_2 are closed convex sets. Let dist(x, C) = min{ ||x - y|| : y ∈ C }.

Equivalent problem:
    minimize_x   (1/2) dist²(x, C_1) + (1/2) dist²(x, C_2)

Apply Peaceman-Rachford splitting:
    z^{k+1} = proj_{C_1} proj_{C_2} (z^k)

[Figure: iterates z^0, z^1, z^2, z^3 bouncing between C_1 and C_2 toward their intersection.]

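The project-project iteration above can be sketched on a toy pair of sets; C_1 = {x : sum(x) = 1} (a line) and C_2 = {x : x >= 0} (the nonnegative orthant) are illustrative choices with closed-form projections:

```python
# alternating projections z^{k+1} = proj_{C1} proj_{C2} z^k
import numpy as np

def proj_line(x):                      # projection onto {x : sum(x) = 1}
    return x + (1.0 - x.sum()) / x.size

def proj_orthant(x):                   # projection onto {x : x >= 0}
    return np.maximum(x, 0.0)

z = np.array([3.0, -2.0])
for _ in range(100):
    z = proj_line(proj_orthant(z))     # converges to a point in C1 ∩ C2
```

For this starting point the iterates converge to (1, 0), which lies in both sets.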

Operator splitting: dual applications

Lagrange duality

Original problem:
    minimize_x   f(x)   subject to   Ax = b.

Relaxations:
- Lagrangian: L(x; w) := f(x) + w^T(Ax - b)
- augmented Lagrangian: L(x; w, γ) := f(x) + w^T(Ax - b) + (γ/2)||Ax - b||²

(Negative) dual function, which is convex (even if f is not):
    d(w) = - min_x L(x; w)

(Negative) dual problem:
    minimize_w   d(w)

Duality: why?

Convex (and also some nonconvex) problems have two perspectives: the primal problem and the dual problem.

Duality brings us:
- an alternative or relaxed problem that provides lower bounds
- certificates of optimality or infeasibility
- economic interpretations

Duality + operator splitting:
- decouples the linear constraints that couple the variables
- gives rise to parallel and distributed algorithms

Forward operator on the dual function

Suppose that d(w) is differentiable (equivalently, f(x) is strictly convex).

Dual forward operator based on d:
    w+ = (I - γ∇d)w

The operator can be evaluated by minimizing the Lagrangian:
    x+ = arg min_{x'} L(x'; w)
    w+ = w - γ(b - Ax+)

- the dual forward operator can be evaluated without involving d at all
- property: b - Ax+ = ∇d(w)

Dual backward operator

Dual backward operator based on d:
    w+ = (I + γ∂d)^{-1} w = arg min_{w'} { d(w') + (1/2γ)||w' - w||² }

w+ can be computed by minimizing the augmented Lagrangian:
    x+ ∈ arg min_{x'} L(x'; w, γ)
    w+ = w - γ(b - Ax+)

- the dual backward operator can also be evaluated without involving d at all
- property: b - Ax+ ∈ ∂d(w+)

Dual algorithms (no splitting yet)

Original problem:
    minimize_x   f(x)   subject to   Ax = b.

If f is strongly convex (thus d is Lipschitz differentiable), apply the dual gradient iteration:
    x^{k+1} = arg min_{x'} L(x'; w^k)
    w^{k+1} = w^k - γ(b - Ax^{k+1})

If f is merely convex (d may not be differentiable), apply the dual proximal-point algorithm (PPA):
    x^{k+1} ∈ arg min_{x'} L(x'; w^k, γ)
    w^{k+1} = w^k - γ(b - Ax^{k+1})

- this iteration has a more complicated subproblem but is also more stable
- neither algorithm explicitly involves the dual function d
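A minimal sketch of the dual gradient iteration for the strongly convex toy problem minimize 0.5*||x||² subject to Ax = b, whose Lagrangian minimizer has the closed form x = -A^T w; the data, step size, and iteration count are illustrative assumptions:

```python
# dual gradient iteration: minimize the Lagrangian in x, then a gradient step in w
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)

gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # here d is ||A||^2-Lipschitz differentiable
w = np.zeros(3)
for _ in range(20000):
    x = -A.T @ w                          # x^{k+1} = argmin_x L(x; w^k)
    w = w - gamma * (b - A @ x)           # dual gradient step

# x converges to the least-norm solution of Ax = b
```

The primal iterate recovers the minimum-norm solution, i.e. pinv(A) @ b, even though d never appears in the code.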

Monotropic program

Definition:
    minimize_{x_1, ..., x_m}   f_1(x_1) + ... + f_m(x_m)
    subject to   A_1 x_1 + ... + A_m x_m = b.

- x_1, ..., x_m are separable in the objective but coupled in the constraints
- the dual problem has the form
      minimize_w   d_1(w) + ... + d_m(w)
  where
      d_i(w) := - min_{x_i} { f_i(x_i) + w^T( A_i x_i - (1/m) b ) }.
- each d_i only involves x_i, f_i, and A_i; they are connected by the dual variable w

Examples of monotropic programs

- linear programs
- min{ f(x) : Ax ∈ C }  ⇔  min_{x,y} { f(x) + ι_C(y) : Ax - y = 0 }
- consensus problem: min{ f_1(x_1) + ... + f_m(x_m) : Ax = 0 }, where Ax = 0 encodes x_1 = ... = x_m; the structure of A enables distributed computing
- exchange problems

Dual (Lagrangian) decomposition

    minimize_{x_1, ..., x_m}   f_1(x_1) + ... + f_m(x_m)
    subject to   A_1 x_1 + ... + A_m x_m = b.

The variables x_1, ..., x_m are decoupled in the Lagrangian
    L(x_1, ..., x_m; w) = Σ_{i=1}^m { f_i(x_i) + w^T( A_i x_i - (1/m) b ) }
(but not in the augmented Lagrangian, since it includes the coupling term (γ/2)||A_1 x_1 + ... + A_m x_m - b||²)

Let A = [A_1 ... A_m] and x = (x_1; ...; x_m). The dual gradient iteration is
    x^{k+1} = arg min_{x'} L(x'; w^k)
    w^{k+1} = w^k - γ(b - Ax^{k+1})

The first step reduces to m independent subproblems
    x_i^{k+1} = arg min_{x_i'} f_i(x_i') + (w^k)^T ( A_i x_i' - (1/m) b ),   i = 1, ..., m,
which can be solved in parallel.

- this decomposition requires strongly convex f_1, ..., f_m (equivalently, Lipschitz differentiable d_1, ..., d_m)
- dual PPA doesn't have this requirement, but its first step doesn't decouple either

Dual forward-backward splitting

Original problem:
    minimize_{x_1, x_2}   f_1(x_1) + f_2(x_2)
    subject to   A_1 x_1 + A_2 x_2 = b.

Require: strongly convex f_1 (thus Lipschitz-differentiable d_1), convex f_2.

FBS iteration: w^{k+1} = prox_{γ d_2} (I - γ∇d_1) w^k

FBS expressed in terms of the original problem's components:
    x_1^{k+1} = arg min_{x_1'} f_1(x_1') + (w^k)^T A_1 x_1'
    x_2^{k+1} ∈ arg min_{x_2'} f_2(x_2') + (w^k)^T A_2 x_2' + (γ/2)||A_1 x_1^{k+1} + A_2 x_2' - b||²
    w^{k+1} = w^k - γ(b - A_1 x_1^{k+1} - A_2 x_2^{k+1})

We have recovered Tseng's Alternating Minimization Algorithm.

Dual Douglas-Rachford splitting

Original problem:
    minimize_{x_1, x_2}   f_1(x_1) + f_2(x_2)
    subject to   A_1 x_1 + A_2 x_2 = b.

f_1, f_2 are convex; there is no strong-convexity requirement.

DRS iteration: z^{k+1} = ( (1/2) I + (1/2)(2prox_{γ d_2} - I)(2prox_{γ d_1} - I) ) z^k

DRS expressed in terms of the original problem's components:
    x_1^{k+1} ∈ arg min_{x_1'} f_1(x_1') + (w^k)^T A_1 x_1' + (γ/2)||A_1 x_1' + A_2 x_2^k - b||²
    x_2^{k+1} ∈ arg min_{x_2'} f_2(x_2') + (w^k)^T A_2 x_2' + (γ/2)||A_1 x_1^{k+1} + A_2 x_2' - b||²
    w^{k+1} = w^k - γ(b - A_1 x_1^{k+1} - A_2 x_2^{k+1})

This recovers the Alternating Direction Method of Multipliers (ADMM).
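A compact ADMM sketch for the lasso written in this two-block form, with f_1(x) = 0.5*||Lx - b||², f_2(z) = λ||z||_1, A_1 = I, A_2 = -I, and right-hand side 0 (i.e. the constraint x - z = 0). The data sizes, penalty γ, and iteration count are illustrative assumptions:

```python
# two-block ADMM: quadratic x-subproblem, soft-threshold z-subproblem, multiplier step
import numpy as np

rng = np.random.default_rng(2)
L = rng.standard_normal((30, 50))
x_true = np.zeros(50); x_true[:2] = [2.0, -1.0]
b = L @ x_true
lam, gamma = 0.1, 1.0

x = np.zeros(50); z = np.zeros(50); w = np.zeros(50)
M = np.linalg.inv(L.T @ L + gamma * np.eye(50))   # cached x-update system
for _ in range(1000):
    x = M @ (L.T @ b - w + gamma * z)             # f1 block (quadratic)
    v = x + w / gamma
    z = np.sign(v) * np.maximum(0.0, np.abs(v) - lam / gamma)  # f2 block (soft-threshold)
    w = w + gamma * (x - z)                       # w^{k+1} = w^k - gamma*(0 - x + z)

# at convergence x ≈ z, and z solves the lasso
```

Unlike ISTA, neither subproblem needs a Lipschitz gradient step; the quadratic solve and the prox carry all the work.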

Dual operator splitting for monotropic programming

- the problem has a separable objective and coupling linear constraints
- each iteration: separate f_i subproblems + a multiplier update
- Lagrangian x_i-subproblems require strongly convex f_i and can be solved in parallel
- augmented-Lagrangian x_i-subproblems do not require strongly convex f_i but are solved in sequence (they can also be solved in parallel with more advanced techniques)

Operator splitting: primal-dual applications

Proximable∘linear composition

Problem: minimize (proximable∘linear) + proximable + smooth:
    minimize_x   r_1(Lx) + r_2(x) + f(x)

Equivalent condition:
    0 ∈ (L^T ∂r_1 L + ∂r_2 + ∇f)(x)

Decouple r_1 from L: introduce the dual variable
    y ∈ ∂r_1(Lx)  ⇔  Lx ∈ ∂r_1*(y).

Equivalent condition:
    0 ∈ [ 0    L^T ][x]  +  [ ∂r_2(x)  ]  +  [ ∇f(x) ]
        [ -L    0  ][y]     [ ∂r_1*(y) ]     [   0   ]

Equivalent condition (copied from the last slide):
    0 ∈ [ 0    L^T ][x]  +  [ ∂r_2(x)  ]  +  [ ∇f(x) ]
        [ -L    0  ][y]     [ ∂r_1*(y) ]     [   0   ]
where A collects the first two terms, B the last, and the primal-dual variable is z = (x; y).

Apply forward-backward splitting to 0 ∈ Az + Bz:
    z^{k+1} = J_γA (I - γB) z^k  ⇔  z^{k+1} = (I + γA)^{-1}(I - γB) z^k

Solve:
    x^{k+1} + γL^T y^{k+1} + γ∂r_2(x^{k+1}) ∋ x^k - γ∇f(x^k)
    y^{k+1} - γL x^{k+1} + γ∂r_1*(y^{k+1}) ∋ y^k

Issue: both x^{k+1} and y^{k+1} appear in both equations!

Solution: introduce the metric
    U = [  I    -γL^T ]
        [ -γL     I   ]
and apply forward-backward splitting to 0 ∈ U^{-1}Az + U^{-1}Bz:
    z^{k+1} = J_{γU^{-1}A} (I - γU^{-1}B) z^k  ⇔  solve  U z^{k+1} + γA z^{k+1} ∋ U z^k - γB z^k:

    x^{k+1} - γL^T y^{k+1} + γL^T y^{k+1} + γ∂r_2(x^{k+1}) ∋ x^k - γL^T y^k - γ∇f(x^k)
    y^{k+1} - γL x^{k+1} - γL x^{k+1} + γ∂r_1*(y^{k+1}) ∋ y^k - γL x^k

(As in Gaussian elimination, y^{k+1} is cancelled from the first equation.)

Strategy: obtain x^{k+1} from the first equation; plug x^{k+1} in as a constant in the second equation and then obtain y^{k+1}.

Final iteration:
    x^{k+1} = prox_{γ r_2} ( x^k - γL^T y^k - γ∇f(x^k) )
    y^{k+1} = prox_{γ r_1*} ( y^k + γL(2x^{k+1} - x^k) )

Nice properties:
- applies L and ∇f explicitly
- solves proximal-point subproblems of r_1 and r_2
- convergence follows from standard forward-backward splitting
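The final iteration above can be sketched for 1D total-variation denoising: f(x) = 0.5*||x - b||², r_2 = 0 (so its prox is the identity), r_1 = λ||·||_1, and L = D, the first-difference operator, so prox of γr_1* is a clip to [-λ, λ]. The signal, λ, and step size γ are illustrative assumptions:

```python
# primal-dual iteration: explicit gradient of f, explicit D, prox of r1* as a clip
import numpy as np

n, lam, gamma = 50, 1.0, 0.1
rng = np.random.default_rng(3)
b = np.concatenate([np.zeros(25), np.ones(25)]) + 0.1 * rng.standard_normal(n)

D = np.eye(n - 1, n, 1) - np.eye(n - 1, n)   # first-difference operator
x, y = b.copy(), np.zeros(n - 1)
for _ in range(2000):
    x_new = x - gamma * (D.T @ y) - gamma * (x - b)          # prox of r2 = 0 is identity
    y = np.clip(y + gamma * D @ (2 * x_new - x), -lam, lam)  # prox_{gamma r1*}
    x = x_new

# x is a piecewise-flat denoised version of b
```

The small γ comfortably satisfies the step-size condition for this scheme (||D||² ≤ 4 in 1D).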

Example: total variation (TV) deblurring

Rudin-Osher-Fatemi '92:
    minimize_u   (1/2)||Ku - b||² + λ||Du||_1
    subject to   0 <= u <= 255

Four simple parts (smooth + proximable∘linear + proximable):
- smooth function: f(u) = (1/2)||Ku - b||²
- linear operator: D
- proximable function: r_1 = λ||·||_1
- proximable function: r_2 = ι_[0,255] (equivalent to the constraints 0 <= u <= 255)

(The steps below are sophisticated but routine; we skip the details.)

Equivalent condition:
    0 ∈ [ ∂r_2    D^T  ][u]  +  [ ∇f(u) ]
        [ -D     ∂r_1* ][w]     [   0   ]
(where w is the auxiliary (dual) variable)

Forward-backward splitting algorithm under a special metric:
    u^{k+1} = proj_{[0,255]^n} ( u^k - γD^T w^k - γ∇f(u^k) )
    w^{k+1} = prox_{γ r_1*} ( w^k + γD(2u^{k+1} - u^k) )

- every step is simple to implement
- parallel, distributed, and stochastic algorithms are also available

Three-operator splitting

A three-operator splitting scheme

Motivation: all existing monotone splitting schemes reduce to one of the big three:
- forward-backward (1970s)
- Douglas-Rachford (1970s)
- forward-backward-forward (2000)

Benefits of a multi-operator splitting scheme:
- saves extra variables; potential savings in memory and CPU time
- fewer tricks, increased flexibility
- improves the theoretical understanding of operator splitting

Challenges: show that
- the fixed points of the operator T encode a solution
- the iteration z^{k+1} = z^k + λ(Tz^k - z^k) converges

A three-operator splitting scheme

Require: A, B maximally monotone, C cocoercive.

Davis and Yin '15:
    T_DY := I - J_γB + J_γA ∘ (2J_γB - I - γC ∘ J_γB)
(evaluating T_DY z evaluates J_γA, J_γB, and C only once each)

- reduces to backward-forward splitting (BFS) if A = 0, to FBS if B = 0, and to DRS if C = 0
- equivalent conditions: 0 ∈ Ax + Bx + Cx  ⇔  z = T_DY(z), x = J_γB(z)

Let C be β-cocoercive and choose γ ∈ (0, 2β). Then z^{k+1} = T_DY z^k can be implemented as:
    1. x_B^k = J_γB(z^k)
    2. x_A^k = J_γA(2x_B^k - z^k - γC x_B^k)
    3. z^{k+1} = z^k + (x_A^k - x_B^k)

- J_γA, J_γB, and γC are evaluated once per iteration
- the intermediate variables converge: x_A^k → x* and x_B^k → x*
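The three steps above can be sketched on a toy three-part problem: projecting a point c onto the probability simplex, split as ι_{x>=0} (operator A), ι_{sum(x)=1} (operator B), and the gradient of 0.5*||x - c||² (cocoercive operator C). The point c and step γ are illustrative assumptions:

```python
# Davis-Yin iteration: resolvent of B, then resolvent of A on the reflected point
import numpy as np

c = np.array([0.5, 0.2, -0.1])
gamma = 1.0                                   # the gradient is 1-cocoercive, so gamma < 2

J_B = lambda z: z + (1.0 - z.sum()) / z.size  # resolvent of B: project onto {sum = 1}
J_A = lambda z: np.maximum(z, 0.0)            # resolvent of A: project onto {x >= 0}
C = lambda x: x - c                           # forward operator: grad of 0.5*||x - c||^2

z = np.zeros(3)
for _ in range(500):
    xB = J_B(z)                               # step 1
    xA = J_A(2 * xB - z - gamma * C(xB))      # step 2
    z = z + (xA - xB)                         # step 3

x_hat = J_B(z)                                # primal iterate x_B at the fixed point
```

Each of the three operators is touched exactly once per iteration; for this data x_hat is the simplex projection of c.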

Three-operator natural applications

Nonnegative matrix completion: y = A(X) + w, w ~ N(0, σ):
    minimize_{X ∈ R^{d×m}}   ||y - A(X)||_F² + µ||X||_* + ι_+(X)

Smooth optimization with linear and box constraints:
    minimize_{x ∈ H}   f(x)
    subject to   Lx = b;  0 <= x <= b.
This covers kernelized SVMs and quadratic programs.

Three-set split feasibility problem

Find x ∈ H such that x ∈ C_1 ∩ C_2 and Lx ∈ C_3.

Applications: nonnegative semidefinite programs and conic programs through homogeneous self-dual embedding.

Reformulation:
    minimize_x   ι_{C_1}(x) + ι_{C_2}(x) + (1/2) dist²(Lx, C_3)

Dual Davis-Yin splitting

Original problem, m >= 3:
    minimize_x   f_1(x_1) + ... + f_m(x_m)
    subject to   A_1 x_1 + ... + A_m x_m = b.

Require: strongly convex f_1, ..., f_{m-2}; convex f_{m-1} and f_m.

Davis-Yin '15 iteration: z^{k+1} = T_DY z^k applied to (d_1 + ... + d_{m-2}) + d_{m-1} + d_m.

Expressed in terms of the (augmented) Lagrangian:
    x_i^{k+1} = arg min_{x_i'} f_i(x_i') + (w^k)^T A_i x_i',   i = 1, ..., m-2, in parallel
    x_{m-1}^{k+1} ∈ arg min_{x'} f_{m-1}(x') + (w^k)^T A_{m-1} x' + (γ/2)||Σ_{i=1}^{m-2} A_i x_i^{k+1} + A_{m-1} x' + A_m x_m^k - b||²
    x_m^{k+1} ∈ arg min_{x'} f_m(x') + (w^k)^T A_m x' + (γ/2)||Σ_{i=1}^{m-1} A_i x_i^{k+1} + A_m x' - b||²
    w^{k+1} = w^k - γ( b - Σ_{i=1}^m A_i x_i^{k+1} )

m = 2 recovers ADMM; m = 3 gives the simplest 3-block ADMM to date.

Steps to build an operator-splitting algorithm

Problem → Simple Parts → Standard Operators → Algorithm

Summary

- monotone operator splitting is a powerful computational tool
- the first three-operator splitting scheme was introduced

Not covered:
- additive / block-separable structures that enable parallel, distributed, and stochastic algorithms
- convergence analysis: objective error f^k - f*, point error ||z^k - z*||²
- accelerated rates by averaging or extrapolation
- coordinate descent algorithms