Math 273a: Optimization Overview of First-Order Optimization Algorithms Wotao Yin Department of Mathematics, UCLA online discussions on piazza.com 1 / 9
Typical flow of numerical optimization

Optimization Problem → Standard forms (LS, LP, SOCP, ...) → Algorithm → Problem-specific improvements → Solution
Linear programming (LP)

General form:

    minimize_x   c^T x + c_0
    subject to   a_i^T x ≤ b_i,   i = 1, ..., m
                 ā_j^T x = b̄_j,   j = 1, ..., m̄

Every LP can be turned into the standard form

    minimize_x   c^T x
    subject to   Ax = b,  x ≥ 0.

- no analytic formula for solutions
- reliable algorithms and software packages (e.g., CPLEX, Gurobi)
- computation time roughly proportional to n^2 m if m ≤ n, less with structured data
- a mature technology (unless A is huge)
Example: basis pursuit

Goal: recover a sparse solution in [l, u] to Ax = b.

Model: ℓ1-minimization problem

    minimize_x   ||x||_1
    subject to   Ax = b,  l ≤ x ≤ u
[Figure: original signal and recovered signal, 200 samples]
Original model:

    minimize_x   ||x||_1
    subject to   Ax = b,  l ≤ x ≤ u

LP formulation (split x = x⁺ − x⁻):

    minimize_{x⁺, x⁻}   1^T x⁺ + 1^T x⁻
    subject to   [A  −A] [x⁺; x⁻] = b
                 (x⁺, x⁻) ≥ 0
                 l ≤ [I  −I] [x⁺; x⁻] ≤ u
Another optimization flow

Optimization Problem → simple parts (coordinates / operators) → coordinate / splitting algorithm → Solution
Coordinate descent and operator splitting algorithms provide a simple yet powerful approach to deriving algorithms that are

- applicable to a large class of problems (because those problems have friendly structures)
- easy to implement (often just several lines of code)
- (nearly) state-of-the-art on medium/large-scale problems¹
- highly scalable (can go stochastic, parallel, distributed, asynchronous)
- backed by convergence and complexity guarantees

¹ They may not be the best choice for problems of small/medium scale.
Monotone operator splitting pipeline

1. recognize the simple parts in your problem
2. reformulate as a monotone inclusion: e.g., 0 ∈ (A_1 + A_2)x (each A_i contains simple parts of your problem)
3. apply an operator-splitting scheme: e.g.,

       0 ∈ (A_1 + A_2)x  ⇔  x = T_{A_1} T_{A_2}(x) =: T(x)

4. run the iteration z^{k+1} = z^k + λ(Tz^k − z^k),  λ ∈ (0, 1]
Operator splitting: The big three
The big three two-operator splitting schemes for 0 ∈ Ax + Bx:

- forward-backward (Mercier 79): (maximally monotone) + (cocoercive)
- Douglas-Rachford (Lions-Mercier 79): (maximally monotone) + (maximally monotone)
- forward-backward-forward (Tseng 00): (maximally monotone) + (Lipschitz & monotone)

All the schemes are built from forward operators and backward operators.
Forward-backward splitting

Require: A monotone, B single-valued monotone.

Forward-backward splitting (FBS) operator (Lions-Mercier 79):

    T_FBS := (I + γA)^{-1} (I − γB),   where γ > 0

It reduces to the forward operator if A = 0, and to the backward operator if B = 0.

Equivalent conditions:

    0 ∈ Ax + Bx
    ⇔ x − γBx ∈ x + γAx
    ⇔ (I − γB)x ∈ (I + γA)x
    ⇔ (I + γA)^{-1}(I − γB)x = x
    ⇔ x = T_FBS(x)
FBS iteration: x^{k+1} = T_FBS(x^k)

- converges if B is β-cocoercive and γ < 2β (covered later)
- convergence rates are known
- typical usage:
  - minimize smooth + proximable
  - minimize smooth + proximable ∘ linear + proximable/constraints
  - decentralized consensus minimization
Douglas-Rachford splitting

Require: A, B both monotone, possibly set-valued.

Define the reflective resolvent R_{γT} := 2(I + γT)^{-1} − I.

Douglas-Rachford splitting (DRS) operator (Lions-Mercier 79):

    T_DRS := (1/2) I + (1/2) R_{γA} R_{γB}

Note: switching A and B gives a different DRS operator.

Introduce z ∈ (I + γB)x, so x = J_{γB}(z). Equivalent conditions:

    0 ∈ Ax + Bx  ⇔  z = T_DRS(z),  x = J_{γB} z

(derivation not finished here)

Applications: alternating projection, ADMM, distributed optimization.
Peaceman-Rachford splitting (PRS) operator:

    T_PRS := R_{γA} R_{γB}

Relaxed PRS operator, λ ∈ (0, 1]:

    T_PRS^λ := (1 − λ) I + λ R_{γA} R_{γB}

T_PRS^{1/2} recovers T_DRS. For λ ∈ (0, 1), T_PRS^λ requires a weaker condition to converge than T_PRS, but the latter tends to be faster when it converges.
Operator splitting: Direct applications
Constrained minimization

C is a convex set; f is a proper closed convex function.

    minimize_x   f(x)
    subject to   x ∈ C

Equivalent condition: 0 ∈ N_C(x) + ∂f(x), where N_C(x) = ∂ι_C(x), the subdifferential of C's indicator function.
If f is Lipschitz differentiable, then applying forward-backward splitting,

    x^{k+1} = proj_C (I − γ∇f) x^k,

recovers the projected gradient method. [picture]
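As a concrete sketch of the projected-gradient iteration above, here is a minimal NumPy implementation on a toy box-constrained least-squares problem; the data A, b, the box [0, 1], the sizes, and the iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

# Toy instance: minimize (1/2)||Ax - b||^2 subject to 0 <= x <= 1
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def proj_box(x, lo=0.0, hi=1.0):
    # backward step: projection onto the box C = [lo, hi]^n
    return np.clip(x, lo, hi)

lip = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of grad f
gamma = 1.0 / lip                    # step size; FBS needs gamma < 2/lip
x = np.zeros(5)
for _ in range(500):
    grad = A.T @ (A @ x - b)         # forward step uses grad f(x)
    x = proj_box(x - gamma * grad)   # x^{k+1} = proj_C (I - gamma*grad f) x^k
```

At convergence the iterate is a fixed point of the map, i.e., x = proj_C(x − γ∇f(x)), which is the optimality condition for the constrained problem.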
Constrained minimization, cont.

If f is non-differentiable, then apply Douglas-Rachford splitting (DRS):

    z^{k+1} = ( (1/2) I + (1/2) (2prox_{γf} − I)(2proj_C − I) ) z^k    (recover x^k = proj_C z^k)

Dual DRS approach: introduce y with x − y = 0 and apply ADMM to

    minimize_{x,y}   f(x) + ι_C(y)
    subject to   x − y = 0.

(indicator function: ι_C(y) = 0 if y ∈ C, and ∞ otherwise)

Equivalence: the ADMM iteration = the DRS iteration.
Regularized least squares

    minimize_x   r(x) + (1/2)||Kx − b||²,    f(x) := (1/2)||Kx − b||²

K: linear operator; b: input data (observation); r: enforces a structure on x.

Examples of r: ℓ2², ℓ1, sorted ℓ1, ℓ2, TV, nuclear norm, ...

Equivalent condition:

    0 ∈ ∂r(x) + ∇f(x)  ⇔  x = (I + γ∂r)^{-1} (I − γ∇f) x,    where (I + γ∂r)^{-1} = prox_{γr}

Forward-backward splitting iteration: x^{k+1} = prox_{γr} (I − γ∇f) x^k
Example: LASSO (basis pursuit denoising)

Tibshirani 96: find a sparse vector x such that Lx ≈ b by solving

    minimize_x   λ||x||_1 + (1/2)||Lx − b||²,    r(x) := λ||x||_1,  f(x) := (1/2)||Lx − b||²

Two simple parts:

- proximable function r(x) := λ||x||_1:

      argmin_x { r(x) + (1/(2γ)) ||x − y||² } = sign(y) ⊙ max{0, |y| − λγ}

- smooth function f(x) := (1/2)||Lx − b||² with ∇f(x) = L^T(Lx − b)

Apply forward-backward splitting to 0 ∈ Ax + Bx with A := ∂r and B := ∇f:

    x^{k+1} = (I + γA)^{-1}(I − γB) x^k = T x^k
Realization:

    (I − γB)x^k = x^k − γ∇f(x^k)
    (I + γA)^{-1}(y) = sign(y) ⊙ max{0, |y| − λγ}

Therefore x^{k+1} = (I + γA)^{-1}(I − γB)x^k is realized as

    y^k = x^k − γL^T(Lx^k − b)
    x^{k+1} = sign(y^k) ⊙ max{0, |y^k| − λγ}

which recovers the Iterative Soft-Thresholding Algorithm (ISTA).

If the matrix L is huge, we can randomly sample rows, columns, or blocks of L at each iteration, or distribute the storage of L and run (asynchronous) parallel algorithms.
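The two-step realization above is only a few lines in practice. Here is a minimal NumPy sketch of ISTA on synthetic LASSO data; the matrix sizes, seed, λ, and iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

# Synthetic LASSO instance: recover a sparse x from b = L @ x_true
rng = np.random.default_rng(1)
L = rng.standard_normal((30, 60))
x_true = np.zeros(60)
x_true[[3, 17, 42]] = [1.5, -2.0, 1.0]
b = L @ x_true
lam = 0.1

def soft(y, t):
    # prox of t*||.||_1: componentwise soft-thresholding
    return np.sign(y) * np.maximum(0.0, np.abs(y) - t)

gamma = 1.0 / np.linalg.norm(L, 2) ** 2   # step size < 2/||L||^2
x = np.zeros(60)
for _ in range(2000):
    y = x - gamma * L.T @ (L @ x - b)     # forward (gradient) step
    x = soft(y, lam * gamma)              # backward (shrinkage) step
```

Since ISTA monotonically decreases the objective, the final iterate is strictly better than the starting point x = 0.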
Multi-function minimization

f_1, ..., f_N : H → (−∞, +∞] are proper closed convex functions.

Product-space trick for

    minimize_x   f_1(x) + ··· + f_N(x)

- introduce copies x^{(i)} ∈ H of x; let x = (x^{(1)}, ..., x^{(N)}) ∈ H^N
- define the set C = {x : x^{(1)} = ··· = x^{(N)}}
- equivalent problem in H^N:

      minimize_x   ι_C(x) + Σ_{i=1}^N f_i(x^{(i)})

- then apply a two-operator splitting scheme
- if all f_i are smooth, apply forward-backward splitting and evaluate the ∇f_i
- if all f_i are proximable, apply Douglas-Rachford splitting and evaluate the prox_{f_i} in parallel (alternatively, one can apply Eckstein-Svaiter projective splitting)
- if some f_i are smooth and the others are proximable, a two-operator scheme cannot mix them; we need a three-operator splitting (e.g., one combining FBS and DRS)
Alternating projection: project, project

C_1 and C_2 are closed convex sets. Let dist(x, C) = min{ ||x − y|| : y ∈ C }.

Equivalent problem:

    minimize   (1/2) dist²(x, C_1) + (1/2) dist²(x, C_2).

Apply Peaceman-Rachford splitting: z^{k+1} = proj_{C_1} proj_{C_2}(z^k).

[Figure: iterates z^0, z^1, z^2, z^3 alternating between C_1 and C_2]
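A minimal NumPy sketch of this project-project iteration between two toy sets; the hyperplane, the nonnegative orthant, and the starting point are illustrative assumptions, not from the slides.

```python
import numpy as np

# C1 = {x : x_1 + x_2 = 2} (hyperplane), C2 = nonnegative orthant
a = np.array([1.0, 1.0])
t = 2.0

def proj_C1(x):
    # projection onto the hyperplane a^T x = t
    return x + (t - a @ x) / (a @ a) * a

def proj_C2(x):
    return np.maximum(x, 0.0)

z = np.array([-3.0, 5.0])
for _ in range(100):
    z = proj_C1(proj_C2(z))   # z^{k+1} = proj_C1 proj_C2 (z^k)

# for this starting point, z converges to (0, 2) in C1 ∩ C2
```

Here the error halves at every iteration, so 100 iterations reach the intersection point to machine precision.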
Operator splitting: Dual applications
Lagrange duality

Original problem:

    minimize_x   f(x)
    subject to   Ax = b.

Relaxations:

- Lagrangian: L(x; w) := f(x) + w^T(Ax − b)
- augmented Lagrangian: L(x; w, γ) := f(x) + w^T(Ax − b) + (γ/2)||Ax − b||²

(Negative) dual function, which is convex (even if f is not):

    d(w) = − min_x L(x; w)

(Negative) dual problem:

    minimize_w   d(w)
Duality: why?

Convex (and also some nonconvex) problems have two perspectives: the primal problem and the dual problem.

Duality brings us:
- an alternative or relaxed problem
- lower bounds
- certificates of optimality or infeasibility
- economic interpretations

Duality + operator splitting:
- decouples the linear constraints that couple the variables
- gives rise to parallel and distributed algorithms
Forward operator on the dual function

Suppose that d(w) is differentiable (equivalently, f(x) is strictly convex).

Dual forward operator based on ∇d:

    w⁺ = (I − γ∇d) w

The operator can be evaluated by minimizing the Lagrangian:

    x⁺ = argmin_{x'} L(x'; w)
    w⁺ = w − γ(b − Ax⁺)

so the dual forward operator can be evaluated without involving d at all.

Property: b − Ax⁺ = ∇d(w).
Dual backward operator

Dual backward operator based on ∂d:

    w⁺ = (I + γ∂d)^{-1} w = argmin_{w'} { d(w') + (1/(2γ)) ||w' − w||² }

w⁺ can be computed by minimizing the augmented Lagrangian:

    x⁺ ∈ argmin_{x'} L(x'; w, γ)
    w⁺ = w − γ(b − Ax⁺)

so the dual backward operator can also be evaluated without involving d at all.

Property: b − Ax⁺ ∈ ∂d(w⁺).
Dual algorithms (no splitting yet)

Original problem: minimize_x f(x) subject to Ax = b.

If f is strongly convex (thus d is Lipschitz differentiable), apply the dual gradient iteration:

    x^{k+1} = argmin_{x'} L(x'; w^k)
    w^{k+1} = w^k − γ(b − Ax^{k+1})

If f is merely convex (d may not be differentiable), apply the dual PPA:

    x^{k+1} ∈ argmin_{x'} L(x'; w^k, γ)
    w^{k+1} = w^k − γ(b − Ax^{k+1})

This iteration has a more complicated subproblem but is also more stable. Neither algorithm explicitly involves the dual function d.
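A minimal NumPy sketch of the dual gradient iteration on a toy strongly convex problem where the Lagrangian minimization has a closed form; the data, sizes, and step size are illustrative assumptions, not from the slides. For minimize (1/2)||x||² subject to Ax = b, the x-step is argmin_x L(x; w) = −A^T w.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 8))
b = rng.standard_normal(3)

gamma = 0.5 / np.linalg.norm(A @ A.T, 2)   # step below 1/Lipschitz of grad d
w = np.zeros(3)
for _ in range(5000):
    x = -A.T @ w                 # x^{k+1} = argmin_x L(x; w^k), closed form here
    w = w - gamma * (b - A @ x)  # multiplier (dual gradient) update
```

The constraint residual b − Ax^{k+1} is exactly the dual gradient, so driving it to zero solves the dual and yields a feasible primal point.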
Monotropic program

Definition:

    minimize_{x_1,...,x_m}   f_1(x_1) + ··· + f_m(x_m)
    subject to   A_1 x_1 + ··· + A_m x_m = b.

x_1, ..., x_m are separable in the objective but coupled in the constraints.

The dual problem has the form

    minimize_w   d_1(w) + ··· + d_m(w)

where

    d_i(w) := − min_{x_i} { f_i(x_i) + w^T( A_i x_i − (1/m) b ) }.

Each d_i only involves x_i, f_i, and A_i; they are connected by the dual variable w.
Examples of monotropic programs

- linear programs
- min{ f(x) : Ax ∈ C }  ⇔  min_{x,y} { f(x) + ι_C(y) : Ax − y = 0 }
- consensus problem min{ f_1(x_1) + ··· + f_m(x_m) : Ax = 0 }, where Ax = 0 ⇔ x_1 = ··· = x_m; the structure of A enables distributed computing
- exchange problems
Dual (Lagrangian) decomposition

    minimize_{x_1,...,x_m}   f_1(x_1) + ··· + f_m(x_m)
    subject to   A_1 x_1 + ··· + A_m x_m = b.

The variables x_1, ..., x_m are decoupled in the Lagrangian

    L(x_1, ..., x_m; w) = Σ_{i=1}^m { f_i(x_i) + w^T(A_i x_i − (1/m) b) }

(but not in the augmented Lagrangian, which includes the term (γ/2)||A_1 x_1 + ··· + A_m x_m − b||²).
Let A = [A_1 ··· A_m] and x = [x_1; ...; x_m]. The dual gradient iteration

    x^{k+1} = argmin_{x'} L(x'; w^k)
    w^{k+1} = w^k − γ(b − Ax^{k+1})

has a first step that reduces to m independent subproblems

    x_i^{k+1} = argmin_{x_i'}  f_i(x_i') + w^{kT}(A_i x_i' − (1/m) b),   i = 1, ..., m,

which can be solved in parallel.

This decomposition requires strongly convex f_1, ..., f_m (equivalently, Lipschitz differentiable d_1, ..., d_m). (Dual PPA doesn't have this requirement, but its first step doesn't decouple either.)
Dual forward-backward splitting

Original problem:

    minimize_{x_1,x_2}   f_1(x_1) + f_2(x_2)
    subject to   A_1 x_1 + A_2 x_2 = b.

Require: strongly convex f_1 (thus Lipschitz differentiable d_1), convex f_2.

FBS iteration: z^{k+1} = prox_{γd_2} (I − γ∇d_1) z^k.

FBS expressed in terms of the original problem's components:

    x_1^{k+1} = argmin_{x_1'}  f_1(x_1') + w^{kT} A_1 x_1'
    x_2^{k+1} ∈ argmin_{x_2'}  f_2(x_2') + w^{kT} A_2 x_2' + (γ/2)||A_1 x_1^{k+1} + A_2 x_2' − b||²
    w^{k+1} = w^k − γ(b − A_1 x_1^{k+1} − A_2 x_2^{k+1})

We have recovered Tseng's Alternating Minimization Algorithm.
Dual Douglas-Rachford splitting

Original problem:

    minimize_{x_1,x_2}   f_1(x_1) + f_2(x_2)
    subject to   A_1 x_1 + A_2 x_2 = b.

f_1, f_2 are convex; no strong-convexity requirement.

DRS iteration: z^{k+1} = ( (1/2) I + (1/2) (2prox_{γd_2} − I)(2prox_{γd_1} − I) ) z^k.

DRS expressed in terms of the original problem's components:

    x_1^{k+1} ∈ argmin_{x_1'}  f_1(x_1') + w^{kT} A_1 x_1' + (γ/2)||A_1 x_1' + A_2 x_2^k − b||²
    x_2^{k+1} ∈ argmin_{x_2'}  f_2(x_2') + w^{kT} A_2 x_2' + (γ/2)||A_1 x_1^{k+1} + A_2 x_2' − b||²
    w^{k+1} = w^k − γ(b − A_1 x_1^{k+1} − A_2 x_2^{k+1})

This recovers the Alternating Direction Method of Multipliers (ADMM).
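A minimal NumPy sketch of this ADMM iteration on a toy instance where both subproblems have closed forms; the data p, q, b and the penalty γ are illustrative assumptions, not from the slides. Here f_1(x_1) = (1/2)||x_1 − p||², f_2(x_2) = (1/2)||x_2 − q||², and A_1 = A_2 = I.

```python
import numpy as np

# minimize 1/2||x1 - p||^2 + 1/2||x2 - q||^2  subject to  x1 + x2 = b
p = np.array([1.0, -2.0])
q = np.array([0.5, 3.0])
b = np.array([2.0, 2.0])
gamma = 1.0

x2 = np.zeros(2)
w = np.zeros(2)
for _ in range(200):
    # x1-subproblem: argmin f1(x) + w^T x + (gamma/2)||x + x2^k - b||^2
    x1 = (p - w + gamma * (b - x2)) / (1 + gamma)
    # x2-subproblem uses the fresh x1^{k+1}
    x2 = (q - w + gamma * (b - x1)) / (1 + gamma)
    # multiplier update
    w = w - gamma * (b - x1 - x2)

# exact solution: x1* = p - nu, x2* = q - nu with nu = (p + q - b)/2
```

For this quadratic instance the iteration contracts by a factor 1/2 per step, so 200 iterations recover the solution to machine precision.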
Dual operator splitting for monotropic programming

- the problem has a separable objective and coupling linear constraints
- each iteration: separate f_i subproblems + a multiplier update
- Lagrangian x_i-subproblems require strongly convex f_i and can be solved in parallel
- augmented-Lagrangian x_i-subproblems do not require strongly convex f_i but are solved in sequence (they can also be solved in parallel with more advanced techniques)
Operator splitting: Primal-dual applications
Proximable ∘ linear composition

Problem: minimize proximable ∘ linear + proximable + smooth:

    minimize_x   r_1(Lx) + r_2(x) + f(x)

Equivalent condition:

    0 ∈ (L^T ∘ ∂r_1 ∘ L + ∂r_2 + ∇f) x

To decouple r_1 from L, introduce a dual variable y ∈ ∂r_1(Lx) ⇔ Lx ∈ ∂r_1*(y).

Equivalent condition:

    0 ∈ [[0, L^T], [−L, 0]] [x; y] + [∂r_2(x); ∂r_1*(y)] + [∇f(x); 0]
Equivalent condition (copied from the last slide):

    0 ∈ ( [[0, L^T], [−L, 0]] z + [∂r_2(x); ∂r_1*(y)] ) + [∇f(x); 0]

where Az is the skew matrix term plus the subdifferentials, Bz := [∇f(x); 0], and z := [x; y] is the primal-dual variable.

Apply forward-backward splitting to 0 ∈ Az + Bz:

    z^{k+1} = J_{γA} (I − γB) z^k,   i.e.,   z^{k+1} = (I + γA)^{-1} (I − γB) z^k

which means solving

    x^{k+1} + γL^T y^{k+1} + γ∂r_2(x^{k+1}) ∋ x^k − γ∇f(x^k)
    y^{k+1} − γL x^{k+1} + γ∂r_1*(y^{k+1}) ∋ y^k

Issue: both x^{k+1} and y^{k+1} appear in both equations!
Solution: introduce the metric

    U = [[I, −γL^T], [−γL, I]]

and apply forward-backward splitting to 0 ∈ U^{-1}Az + U^{-1}Bz:

    z^{k+1} = J_{γU^{-1}A} (I − γU^{-1}B) z^k

i.e., solve U z^{k+1} + γA z^{k+1} ∋ U z^k − γB z^k:

    x^{k+1} − γL^T y^{k+1} + γL^T y^{k+1} + γ∂r_2(x^{k+1}) ∋ x^k − γL^T y^k − γ∇f(x^k)
    y^{k+1} − γL x^{k+1} − γL x^{k+1} + γ∂r_1*(y^{k+1}) ∋ y^k − γL x^k

(like Gaussian elimination, y^{k+1} is cancelled from the first equation)
Strategy: obtain x^{k+1} from the first equation; plug x^{k+1} in as a constant into the second equation and then obtain y^{k+1}.

Final iteration:

    x^{k+1} = prox_{γr_2} ( x^k − γL^T y^k − γ∇f(x^k) )
    y^{k+1} = prox_{γr_1*} ( y^k + γL(2x^{k+1} − x^k) )

Nice properties:
- applies L and ∇f explicitly
- solves only proximal-point subproblems of r_1 and r_2
- convergence follows from standard forward-backward splitting
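A minimal NumPy sketch of the final x/y iteration on a toy 1-D "TV + nonnegativity" denoising problem; the data c, λ, γ, and the difference operator D standing in for L are illustrative assumptions, not from the slides. Here r_1 = λ||·||_1, r_2 = ι_{x ≥ 0}, and f(x) = (1/2)||x − c||².

```python
import numpy as np

c = np.array([1.0, 1.1, -0.2, 3.0, 2.9])
lam, gamma = 0.1, 0.2
D = np.diff(np.eye(5), axis=0)     # first-difference operator, playing the role of L

x = np.zeros(5)
y = np.zeros(4)
for _ in range(5000):
    x_old = x
    # x-step: prox of gamma*r2 (indicator of x >= 0) is a projection
    x = np.maximum(0.0, x - gamma * D.T @ y - gamma * (x - c))
    # y-step: prox of gamma*r1* is projection onto the l-inf ball of radius lam
    y = np.clip(y + gamma * D @ (2 * x - x_old), -lam, lam)
```

Note that the conjugate of λ||·||_1 is the indicator of the ℓ∞-ball of radius λ, so the y-step is just a clip; both proximal steps are elementwise and cheap.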
Example: total variation (TV) deblurring

Rudin-Osher-Fatemi 92:

    minimize_u   (1/2)||Ku − b||² + λ||Du||_1
    subject to   0 ≤ u ≤ 255

Four simple parts: smooth + proximable ∘ linear + proximable:
- smooth function: f(u) = (1/2)||Ku − b||²
- linear operator: D
- proximable function: r_1 = λ||·||_1
- proximable function: r_2 = ι_{[0,255]} (equivalent to the constraints 0 ≤ u ≤ 255)
(The steps below are sophisticated but routine. We skip the details.)

Equivalent condition:

    0 ∈ [[0, D^T], [−D, 0]] [u; w] + [∂r_2(u); ∂r_1*(w)] + [∇f(u); 0]

(where w is the auxiliary (dual) variable)

Forward-backward splitting algorithm under a special metric:

    u^{k+1} = proj_{[0,255]^n} ( u^k − γD^T w^k − γ∇f(u^k) )
    w^{k+1} = prox_{γr_1*} ( w^k + γD(2u^{k+1} − u^k) )

Every step is simple to implement; parallel, distributed, and stochastic algorithms are also available.
Three-operator splitting
A three-operator splitting scheme

Motivation: all existing monotone splitting schemes reduce to one of the big three:
- forward-backward (1970s)
- Douglas-Rachford (1970s)
- forward-backward-forward (2000)

Benefits of a multi-operator splitting scheme:
- saves extra variables: potential savings in memory and CPU time
- fewer tricks, increased flexibility
- improves our theoretical understanding of operator splitting

Challenge: design an operator T so that its fixed points encode solutions and the iteration z^{k+1} = z^k + λ(Tz^k − z^k) converges.
A three-operator splitting scheme

Require: A, B maximally monotone, C cocoercive.

Davis and Yin 15:

    T_DY := I − J_{γB} + J_{γA} ∘ (2J_{γB} − I − γC ∘ J_{γB})

(evaluating T_DY z evaluates J_{γA}, J_{γB}, and C only once each)

It reduces to BFS (backward-forward) if A = 0, FBS if B = 0, and DRS if C = 0.

Equivalent conditions:

    0 ∈ Ax + Bx + Cx  ⇔  z = T_DY(z),  x = J_{γB} z
Let C be β-cocoercive and choose γ ∈ (0, 2β). Then z^{k+1} = T_DY z^k can be implemented as:

1. x_B^k = J_{γB}(z^k)
2. x_A^k = J_{γA}(2x_B^k − z^k − γC x_B^k)
3. z^{k+1} = z^k + (x_A^k − x_B^k)

J_{γA}, J_{γB}, and γC are each evaluated once per iteration, and the intermediate variables converge: x_A^k → x* and x_B^k → x*.
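A minimal NumPy sketch of steps 1-3 above on a toy problem with three simple parts; all data are illustrative assumptions, not from the slides. We minimize (1/2)||x − c||² (so C = ∇f is 1-cocoercive) over the box [0, 1]³ (operator B) intersected with the hyperplane 1^T x = 1 (operator A); both resolvents are projections.

```python
import numpy as np

c = np.array([2.0, -1.0, 0.5])
a = np.ones(3)
t = 1.0
gamma = 1.0          # gamma in (0, 2*beta) with beta = 1

def J_A(x):
    # resolvent of the hyperplane's normal cone: affine projection onto a^T x = t
    return x + (t - a @ x) / (a @ a) * a

def J_B(x):
    # resolvent of the box's normal cone: projection onto [0, 1]^3
    return np.clip(x, 0.0, 1.0)

z = np.zeros(3)
for _ in range(1000):
    xB = J_B(z)                                 # step 1
    xA = J_A(2 * xB - z - gamma * (xB - c))     # step 2, with C x = x - c
    z = z + (xA - xB)                           # step 3

# x_B^k = J_B(z) converges to the KKT solution, here (1, 0, 0)
```

The solution (1, 0, 0) can be verified by hand from the KKT conditions: x_i = clip(c_i − ν, 0, 1) with ν = 0.5 satisfies the sum constraint.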
Three-operator splitting: natural applications

Nonnegative matrix completion: observe y = A(X) + w with noise w ~ N(0, σ):

    minimize_{X ∈ R^{d×m}}   ||y − A(X)||_F² + μ||X||_* + ι_+(X)

Smooth optimization with linear and box constraints:

    minimize_{x ∈ H}   f(x)
    subject to   Lx = b;  0 ≤ x ≤ b.

Covers kernelized SVMs and quadratic programs.
Three-set split feasibility problem

Find x ∈ H such that x ∈ C_1 ∩ C_2 and Lx ∈ C_3.

Applications: nonnegative semi-definite programs and conic programs through homogeneous self-dual embedding.

Reformulation:

    minimize_x   ι_{C_1}(x) + ι_{C_2}(x) + (1/2) dist²(Lx, C_3)
Dual Davis-Yin splitting

Original problem, m ≥ 3:

    minimize_{x_1,...,x_m}   f_1(x_1) + ··· + f_m(x_m)
    subject to   A_1 x_1 + ··· + A_m x_m = b.

Require: strongly convex f_1, ..., f_{m−2}; convex f_{m−1} and f_m.

Davis-Yin 15 iteration: z^{k+1} = T_DY z^k for (d_1 + ··· + d_{m−2}) + d_{m−1} + d_m.

Expressed in terms of the (augmented) Lagrangian:

    x_i^{k+1} = argmin_{x_i'}  f_i(x_i') + w^{kT} A_i x_i',   i = 1, ..., m−2, in parallel
    x_{m−1}^{k+1} ∈ argmin_{x'}  f_{m−1}(x') + w^{kT} A_{m−1} x' + (γ/2)||A_{m−1} x' + Σ_{i=1}^{m−2} A_i x_i^{k+1} + A_m x_m^k − b||²
    x_m^{k+1} ∈ argmin_{x'}  f_m(x') + w^{kT} A_m x' + (γ/2)||A_m x' + Σ_{i=1}^{m−1} A_i x_i^{k+1} − b||²
    w^{k+1} = w^k − γ( b − Σ_{i=1}^m A_i x_i^{k+1} )

m = 2 recovers ADMM; m = 3 gives the simplest three-block ADMM to date.
Steps to build an operator-splitting algorithm

Problem → Simple Parts → Standard Operators → Algorithm
Summary

- monotone operator splitting is a powerful computational tool
- the first three-operator splitting scheme is introduced
- not covered:
  - additive / block-separable structures that enable parallel, distributed, and stochastic algorithms
  - convergence analysis: objective error f^k − f*, point error ||z^k − z*||²
  - accelerated rates by averaging or extrapolation
  - coordinate descent algorithms