Operator Splitting for Parallel and Distributed Optimization


1 Operator Splitting for Parallel and Distributed Optimization Wotao Yin (UCLA Math) ShanghaiTech, SSDS '15, June 23, 2015 URL: alturl.com/2z7tv 1 / 60

2 What is splitting? Sun Tzu (400 BC); Caesar: divide-and-conquer. splitting in computing: break a problem into separate parts, solve the separate parts, and combine the sub-solutions in a controlled fashion 2 / 60

3 Monotone operator-splitting pipeline 1. recognize the simple parts in your problem 2. build an equivalent monotone-inclusion problem: 0 ∈ Ax (the simple parts are separately placed in A) 3. apply an operator-splitting scheme: 0 ∈ Ax ⟺ z = Tz (the simple parts become sub-operators of T) 4. run the iteration z^{k+1} = z^k + λ(Tz^k − z^k), λ ∈ (0, 1] (known as the Krasnosel'skiĭ–Mann (KM) iteration) 3 / 60

4 Example: LASSO (basis pursuit denoising) Tibshirani '96: two simple parts: smooth + simple. minimize (1/2)‖Ax − b‖² + λ‖x‖₁; smooth function: f(x) = (1/2)‖Ax − b‖²; simple nonsmooth function: r(x) = λ‖x‖₁. equivalent condition: 0 ∈ ∇f(x) + ∂r(x). forward-backward splitting algorithm: x^{k+1} = prox_{γr}(x^k − γ∇f(x^k)) =: Tx^k, also known as the Iterative Soft-Thresholding Algorithm (ISTA) 4 / 60
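A minimal numpy sketch of the ISTA iteration above; the data sizes, the regularization weight, and the step-size rule γ = 1/L are illustrative choices, not taken from the slides.

import numpy as np

def soft_threshold(x, t):
    # proximal map of t*||.||_1 (componentwise soft-thresholding)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, num_iters=500):
    # x^{k+1} = prox_{gamma*lam*||.||_1}( x^k - gamma * A^T (A x^k - b) )
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
    gamma = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)                            # forward (gradient) step
        x = soft_threshold(x - gamma * grad, gamma * lam)   # backward (prox) step
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(40)
print(ista(A, b, lam=0.1)[:8])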

5 Example: total variation (TV) deblurring Rudin-Osher-Fatemi '92: minimize_u (1/2)‖Ku − b‖² + λ‖Du‖₁ subject to 0 ≤ u ≤ 255. simple parts: smooth + simple linear + simple; smooth function: f(u) = (1/2)‖Ku − b‖²; linear operator: D; simple nonsmooth function: r₁ = λ‖·‖₁; simple indicator function: r₂ = ι_{[0,255]} 5 / 60

6 (We only show the results, skipping the details) equivalent condition: 0 ∈ [∂r₂, Dᵀ; −D, ∂r₁*][u; w] + [∇f(u); 0] (where w is the auxiliary (dual) variable). forward-backward splitting algorithm under a special metric: u^{k+1} = proj_{[0,255]^n}(u^k − γDᵀw^k − γ∇f(u^k)); w^{k+1} = prox_{γr₁*}(w^k + γD(2u^{k+1} − u^k)). every step is simple to implement 6 / 60

7 Simple parts 7 / 60

8 Simple parts linear maps (e.g., matrices, finite differences, orthogonal transforms) differentiable functions (non-differentiable) functions that have simple proximal maps sets that are easy to project to Abstraction: we look for monotone maps which have certain simple operators 8 / 60

9 Monotone map A : H → H is monotone if ⟨Ax − Ay, x − y⟩ ≥ 0, ∀x, y ∈ H; extend to set-valued A : H → 2^H: ⟨p − q, x − y⟩ ≥ 0, ∀x, y ∈ H, p ∈ Ax, q ∈ Ay. examples: a positive semi-definite linear map; a skew-symmetric linear map: ⟨Ax, x⟩ = 0, ∀x ∈ H; ∂f: the subdifferential of a proper closed convex function f 9 / 60

10 Steps to build an operator-splitting algorithm Problem → Monotone Maps → Standard Operators → Algorithm 10 / 60

11 Forward operator require: A is monotone and either Lipschitz or cocoercive¹. definition: fwd_{γA} := (I − γA). examples: forward Euler for ẏ + g(t, y) = 0: y^{t+1} = y^t − h·g(t, y^t) = (I − h·g(t, ·))y^t; gradient descent for min f(x): x^{k+1} = x^k − γ∇f(x^k) = (I − γ∇f)x^k. equivalent conditions: 0 = Ax ⟺ x = (I − γA)x. ¹A is β-cocoercive if ⟨Ax − Ay, x − y⟩ ≥ β‖Ax − Ay‖², ∀x, y ∈ H. If A is β-cocoercive, then A is (1/β)-Lipschitz; the reverse is generally untrue. 11 / 60

12 Backward operator require: A is monotone. definition: J_{γA} := (I + γA)⁻¹. equivalent conditions: 0 ∈ Ax ⟺ x ∈ x + γAx ⟺ x = J_{γA}x. even if A is set-valued, J_{γA} is single-valued. examples: regularized matrix inversion; backward Euler; the proximal map, including projection as a special case 12 / 60
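A small illustration, not from the slides, of the backward operator J_{γA} for a positive semi-definite (hence monotone) linear map A; the matrix size and γ are arbitrary choices. Iterating the resolvent is the proximal point algorithm and drives the iterate toward a zero of A.

import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M.T @ M                      # positive semi-definite, hence monotone, linear map
gamma = 0.7

def backward(x):
    # resolvent J_{gamma A} x = (I + gamma A)^{-1} x  (regularized matrix inversion)
    return np.linalg.solve(np.eye(5) + gamma * A, x)

x = rng.standard_normal(5)
for _ in range(200):             # proximal point iteration x <- J_{gamma A} x
    x = backward(x)
print("||Ax|| after 200 backward steps:", np.linalg.norm(A @ x))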

13 Proximal map require: a proper closed convex function f. definition: prox_{γf}(y) = argmin_x f(x) + (1/(2γ))‖x − y‖². the minimizer x* must satisfy: 0 ∈ γ∂f(x*) + (x* − y) ⟺ x* = (I + γ∂f)⁻¹(y) = J_{γ∂f}(y). therefore, prox_{γf} ≡ J_{γ∂f} ≡ (I + γ∂f)⁻¹ 13 / 60
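A minimal sketch, not from the slides, of two proximal maps used throughout the deck: soft-thresholding as the prox of γ‖·‖₁, and box projection as the prox of the indicator ι_{[0,255]}; the test vector is arbitrary.

import numpy as np

def prox_l1(y, gamma):
    # prox_{gamma*||.||_1}(y): componentwise soft-thresholding
    return np.sign(y) * np.maximum(np.abs(y) - gamma, 0.0)

def prox_box(y, lo=0.0, hi=255.0):
    # prox of the indicator of [lo, hi]^n, i.e., projection onto the box
    return np.clip(y, lo, hi)

y = np.array([-3.0, -0.2, 0.5, 2.0, 300.0])
print(prox_l1(y, gamma=1.0))   # [-2.  -0.   0.   1.  299.]
print(prox_box(y))             # [  0.    0.    0.5   2.  255. ]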

14 Reflective backward operator require: A is monotone. definition: R_{γA} := 2J_{γA} − I; it reflects x through J_{γA}x by adding J_{γA}x − x. examples: mirror or reflective projection: refl_C = 2proj_C − I; reflective proximal map: for a closed proper convex function f, R_{γ∂f} = 2J_{γ∂f} − I = 2prox_{γf} − I 14 / 60

15 Operator splitting 15 / 60

16 Steps to build an operator-splitting algorithm Problem → Monotone Maps → Standard Operators → Algorithm 16 / 60

17 The big three two-operator splitting schemes for 0 ∈ Ax + Bx: forward-backward (Mercier '79) for (maximally monotone) + (cocoercive); Douglas-Rachford (Lions-Mercier '79) for (maximally monotone) + (maximally monotone); forward-backward-forward (Tseng '00) for (maximally monotone) + (Lipschitz & monotone). all the schemes are built from forward operators and backward operators 17 / 60

18 Forward-backward splitting require: A maximally monotone, B cocoercive (thus single-valued). forward-backward splitting (FBS) operator (Lions-Mercier '79): T_FBS := J_{γA} ∘ (I − γB); reduces to the forward operator if A = 0 and the backward operator if B = 0. equivalent conditions: 0 ∈ Ax + Bx ⟺ x = T_FBS(x). backward-forward splitting (BFS) operator: T_BFS := (I − γB) ∘ J_{γA}; then 0 ∈ Ax + Bx ⟺ z = T_BFS(z), x = J_{γA}z 18 / 60

19 Regularized least squares minimize_x r(x) + (1/2)‖Kx − b‖², with f(x) := (1/2)‖Kx − b‖². K: linear operator; b: input data (observation); r: enforces a structure on x, e.g., ℓ₂², ℓ₁, sorted ℓ₁, ℓ₂, TV, nuclear norm, ... equivalent condition: 0 ∈ ∂r(x) + ∇f(x). forward-backward splitting iteration: x^{k+1} = prox_{γr}(I − γ∇f)x^k 19 / 60

20 Constrained minimization C is a convex set; f is a proper closed convex function. minimize_x f(x) subject to x ∈ C. equivalent condition: 0 ∈ N_C(x) + ∂f(x). if f is Lipschitz differentiable, then apply forward-backward splitting: x^{k+1} = proj_C(I − γ∇f)x^k, which recovers the projected gradient method 20 / 60
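A projected-gradient sketch, not from the slides, for minimize (1/2)‖Kx − b‖² subject to x ∈ C = [0, 1]^n; the data and box are illustrative.

import numpy as np

rng = np.random.default_rng(2)
K = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
gamma = 1.0 / np.linalg.norm(K, 2) ** 2      # step size <= 1/L

x = np.zeros(10)
for _ in range(300):
    grad = K.T @ (K @ x - b)                 # forward step on f(x) = (1/2)||Kx - b||^2
    x = np.clip(x - gamma * grad, 0.0, 1.0)  # backward step = projection onto C
print(x)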

21 Douglas-Rachford splitting require: A, B both monotone. Douglas-Rachford splitting (DRS) operator (Lions-Mercier '79): T_DRS := (1/2)I + (1/2)R_{γA} ∘ R_{γB}; note: switching A and B gives a different DRS operator. (relaxed) Peaceman-Rachford splitting (PRS) operator, λ ∈ (0, 1]: T^λ_PRS := (1 − λ)I + λR_{γA} ∘ R_{γB}; also, T_PRS = T^1_PRS. equivalent conditions: 0 ∈ Ax + Bx ⟺ z = T^λ_PRS(z), x = J_{γB}z 21 / 60
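A generic Douglas-Rachford sketch, not from the slides: z^{k+1} = z^k + J_{γA}(2J_{γB}z^k − z^k) − J_{γB}z^k, which is T_DRS = (1/2)I + (1/2)R_{γA} ∘ R_{γB} applied to z^k. The two proximal maps and the test problem below (minimize ‖x‖₁ + (1/2)‖x − c‖²) are illustrative choices.

import numpy as np

def drs(prox_f, prox_g, z0, gamma=1.0, num_iters=300):
    z = z0.copy()
    for _ in range(num_iters):
        x = prox_g(z, gamma)                 # x^k = J_{gamma B} z^k
        y = prox_f(2.0 * x - z, gamma)       # J_{gamma A} applied to the reflection
        z = z + y - x                        # same as (1/2) z + (1/2) R_A R_B z
    return prox_g(z, gamma)                  # x = J_{gamma B} z encodes the solution

c = np.array([3.0, -0.4, 0.2, -5.0])
prox_l1 = lambda v, g: np.sign(v) * np.maximum(np.abs(v) - g, 0.0)
prox_ls = lambda v, g: (v + g * c) / (1.0 + g)     # prox of (1/2)||x - c||^2
print(drs(prox_l1, prox_ls, np.zeros(4)))          # soft-threshold of c by 1: [2, 0, 0, -4]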

22 Constrained minimization, cont. if f is non-differentiable, then apply Douglas-Rachford splitting (DRS) (where x^k = proj_C z^k): z^{k+1} = ((1/2)I + (1/2)(2prox_{γf} − I) ∘ (2proj_C − I))z^k. dual approach: introduce x − y = 0 and apply ADMM to minimize_{x,y} f(x) + ι_C(y) subject to x − y = 0 (indicator function: ι_C(y) = 0 if y ∈ C, and +∞ otherwise). equivalence: the ADMM iteration = the DRS iteration 22 / 60

23 Multi-function minimization f₁, ..., f_N : H → (−∞, ∞] are proper closed convex functions. product-space trick: minimize_x f₁(x) + ··· + f_N(x); introduce copies x^{(i)} ∈ H of x; let x = (x^{(1)}, ..., x^{(N)}) ∈ H^N; let C = {x : x^{(1)} = ··· = x^{(N)}}. equivalent problem in H^N: minimize_x ι_C(x) + Σ_{i=1}^N f_i(x^{(i)}); then apply a two-operator splitting scheme 23 / 60
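A sketch, not from the slides, of the product-space trick with DRS: the projection onto the consensus set C averages the copies, and the prox of the separable sum acts copy by copy. The three quadratics f_i(x) = (1/2)‖x − c_i‖² are illustrative; the minimizer of their sum is the mean of the c_i.

import numpy as np

def consensus_drs(proxes, dim, gamma=1.0, num_iters=200):
    N = len(proxes)
    Z = np.zeros((N, dim))                          # one copy z^{(i)} per function
    for _ in range(num_iters):
        X = np.tile(Z.mean(axis=0), (N, 1))         # proj_C: average the copies
        Y = np.stack([proxes[i](2 * X[i] - Z[i], gamma) for i in range(N)])
        Z = Z + Y - X                               # DRS update, copy by copy
    return Z.mean(axis=0)

cs = [np.array([1.0, 0.0]), np.array([0.0, 3.0]), np.array([2.0, 2.0])]
proxes = [lambda v, g, c=c: (v + g * c) / (1 + g) for c in cs]
print(consensus_drs(proxes, dim=2))                 # approx [1.0, 1.667]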

24 Operator splitting: Dual application 24 / 60

25 Duality convex (and some nonconvex) optimization problems have two perspectives: the primal problem and the dual problem. duality brings us: alternative or relaxed problems, lower bounds; certificates for optimality or infeasibility; economic interpretations. duality + operator splitting: decouples the objective-function and constraint components; gives rise to parallel and distributed algorithms 25 / 60

26 Lagrange duality original problem: minimize_x f(x) subject to Ax = b. relaxations: Lagrangian: L(x; w) := f(x) + wᵀ(Ax − b); augmented Lagrangian: L(x; w, γ) := f(x) + wᵀ(Ax − b) + (γ/2)‖Ax − b‖². dual function, which is convex (even if f is not): d(w) = −min_x L(x; w). dual problem: minimize_w d(w) 26 / 60

27 Monotropic program definition: minimize_{x₁,...,x_m} f₁(x₁) + ··· + f_m(x_m) subject to A₁x₁ + ··· + A_mx_m = b, where f_i(x_i) may include ι_{C_i}(x_i) for a constraint x_i ∈ C_i. x₁, ..., x_m are separable in the objective and coupled in the constraints. the dual problem has the form minimize_w d₁(w) + ··· + d_m(w), where d_i(w) := −min_{x_i} [f_i(x_i) + wᵀ(A_ix_i − (1/m)b)] 27 / 60

28 Examples of monotropic programs linear programs; min{f(x) : Ax ∈ C} ⟺ min_{x,y}{f(x) + ι_C(y) : Ax − y = 0}; consensus problem min{f₁(x₁) + ··· + f_n(x_n) : Ax = 0}, where Ax = 0 ⟺ x₁ = ··· = x_n and the structure of A enables distributed computing; exchange problem 28 / 60

29 Dual forward-backward splitting original problem: minimize_{x₁,x₂} f₁(x₁) + f₂(x₂) subject to A₁x₁ + A₂x₂ = b. require: strongly convex f₁ (thus d₁ has a Lipschitz gradient). FBS iteration: z^{k+1} = prox_{γd₂}(I − γ∇d₁)z^k. expressed in terms of the (augmented) Lagrangian: x₁^{k+1} = argmin_{x₁} f₁(x₁) + w^{kᵀ}A₁x₁; x₂^{k+1} ∈ argmin_{x₂} f₂(x₂) + w^{kᵀ}A₂x₂ + (γ/2)‖A₁x₁^{k+1} + A₂x₂ − b‖²; w^{k+1} = w^k − γ(b − A₁x₁^{k+1} − A₂x₂^{k+1}). we have recovered Tseng's Alternating Minimization Algorithm 29 / 60

30 Dual Douglas-Rachford splitting original problem (no strong-convexity requirement): minimize_{x₁,x₂} f₁(x₁) + f₂(x₂) subject to A₁x₁ + A₂x₂ = b. DRS iteration: z^{k+1} = ((1/2)I + (1/2)(2prox_{γd₂} − I)(2prox_{γd₁} − I))z^k. expressed in terms of the augmented Lagrangian: x₁^{k+1} ∈ argmin_{x₁} f₁(x₁) + w^{kᵀ}A₁x₁ + (γ/2)‖A₁x₁ + A₂x₂^k − b‖²; x₂^{k+1} ∈ argmin_{x₂} f₂(x₂) + w^{kᵀ}A₂x₂ + (γ/2)‖A₁x₁^{k+1} + A₂x₂ − b‖²; w^{k+1} = w^k − γ(b − A₁x₁^{k+1} − A₂x₂^{k+1}). we recover the Alternating Direction Method of Multipliers (ADMM) 30 / 60
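An ADMM sketch, not from the slides, for the splitting minimize λ‖x₁‖₁ + (1/2)‖Kx₂ − d‖² subject to x₁ − x₂ = 0 (so A₁ = I, A₂ = −I, b = 0), matching the x₁/x₂/w updates above; the data, λ, and γ are illustrative.

import numpy as np

rng = np.random.default_rng(3)
K = rng.standard_normal((40, 60))
d = rng.standard_normal(40)
lam, gamma = 0.1, 1.0

x1 = np.zeros(60); x2 = np.zeros(60); w = np.zeros(60)
KtK_reg = K.T @ K + gamma * np.eye(60)              # formed once, reused every iteration
for _ in range(300):
    v = x2 - w / gamma
    x1 = np.sign(v) * np.maximum(np.abs(v) - lam / gamma, 0.0)   # x1-subproblem (prox)
    x2 = np.linalg.solve(KtK_reg, K.T @ d + w + gamma * x1)      # x2-subproblem (least squares)
    w = w + gamma * (x1 - x2)                                     # multiplier update
print("constraint violation:", np.linalg.norm(x1 - x2))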

31 Operator splitting: Primal-dual application 31 / 60

32 Nonsmooth linear composition problem: minimize (nonsmooth ∘ linear) + nonsmooth + smooth: minimize_x r₁(Lx) + r₂(x) + f(x). equivalent condition: 0 ∈ (Lᵀ∂r₁L + ∂r₂ + ∇f)x. decouple r₁ from L: introduce a dual variable y ∈ ∂r₁(Lx) ⟺ Lx ∈ ∂r₁*(y). equivalent condition: 0 ∈ [0, Lᵀ; −L, 0][x; y] + [∂r₂(x); ∂r₁*(y)] + [∇f(x); 0] 32 / 60

33 equivalent condition (copied from the last slide): 0 ∈ ([0, Lᵀ; −L, 0][x; y] + [∂r₂(x); ∂r₁*(y)]) + [∇f(x); 0] =: Az + Bz, with primal-dual variable z = [x; y]. apply forward-backward splitting to 0 ∈ Az + Bz: z^{k+1} = J_{γA} ∘ F_{γB} z^k, i.e., z^{k+1} = (I + γA)⁻¹(I − γB)z^k; solve: x^{k+1} + γLᵀy^{k+1} + γ∂r₂(x^{k+1}) ∋ x^k − γ∇f(x^k) and y^{k+1} − γLx^{k+1} + γ∂r₁*(y^{k+1}) ∋ y^k. issue: both x^{k+1} and y^{k+1} appear in both equations! 33 / 60

34 solution: introduce the metric U = [I, −γLᵀ; −γL, I]. apply forward-backward splitting to 0 ∈ U⁻¹Az + U⁻¹Bz: z^{k+1} = J_{γU⁻¹A}(I − γU⁻¹B)z^k, i.e., z^{k+1} = (I + γU⁻¹A)⁻¹(I − γU⁻¹B)z^k; solve Uz^{k+1} + γAz^{k+1} ∋ Uz^k − γBz^k: x^{k+1} − γLᵀy^{k+1} + γLᵀy^{k+1} + γ∂r₂(x^{k+1}) ∋ x^k − γLᵀy^k − γ∇f(x^k) and y^{k+1} − γLx^{k+1} − γLx^{k+1} + γ∂r₁*(y^{k+1}) ∋ y^k − γLx^k (like Gaussian elimination, y^{k+1} is cancelled from the first equation) 34 / 60

35 strategy: obtain x^{k+1} from the first equation; plug x^{k+1} in as a constant into the second equation and then obtain y^{k+1}. final iteration: x^{k+1} = prox_{γr₂}(x^k − γLᵀy^k − γ∇f(x^k)); y^{k+1} = prox_{γr₁*}(y^k + γL(2x^{k+1} − x^k)). nice properties: apply L and ∇f explicitly; solve proximal-point subproblems of r₁ and r₂; convergence follows from standard forward-backward splitting 35 / 60
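A small instance, not from the slides, of the final primal-dual iteration above: minimize (1/2)‖x − d‖² + λ‖Dx‖₁ subject to x ≥ 0, with L = D the 1-D difference operator, r₂ = ι_{x≥0} (its prox is the projection), r₁ = λ‖·‖₁ (so prox_{γr₁*} is a clip to [−λ, λ]), and f(x) = (1/2)‖x − d‖². The signal, λ, and the step size γ = 0.1 (a safe choice for this instance) are illustrative.

import numpy as np

n = 50
rng = np.random.default_rng(4)
d = np.concatenate([np.zeros(20), np.ones(20), np.zeros(10)]) + 0.05 * rng.standard_normal(n)
D = np.eye(n - 1, n, 1) - np.eye(n - 1, n)       # first-difference operator L
lam, gamma = 0.5, 0.1

x, y = np.zeros(n), np.zeros(n - 1)
for _ in range(1000):
    x_new = np.maximum(x - gamma * D.T @ y - gamma * (x - d), 0.0)   # prox of iota_{x>=0}
    y = np.clip(y + gamma * D @ (2 * x_new - x), -lam, lam)          # prox of gamma*r1^*
    x = x_new
print(np.round(x[15:25], 2))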

36 Example: total variation (TV) deblurring Rudin-Osher-Fatemi '92: minimize_u (1/2)‖Ku − b‖² + λ‖Du‖₁ subject to 0 ≤ u ≤ 255. simple parts: smooth + simple linear + simple; smooth function: f(u) = (1/2)‖Ku − b‖²; linear operator: D; simple nonsmooth function: r₁ = λ‖·‖₁; simple indicator function: r₂ = ι_{[0,255]} 36 / 60

37 (We only show the results, skipping the details) equivalent condition: 0 ∈ [∂r₂, Dᵀ; −D, ∂r₁*][u; w] + [∇f(u); 0] (where w is the auxiliary (dual) variable). forward-backward splitting algorithm under a special metric: u^{k+1} = proj_{[0,255]^n}(u^k − γDᵀw^k − γ∇f(u^k)); w^{k+1} = prox_{γr₁*}(w^k + γD(2u^{k+1} − u^k)). every step is simple to implement 37 / 60

38 Operator splitting: Parallel and distributed applications 38 / 60

39 Huge matrix A background: you wish to distribute a huge matrix A in your problem. three schemes to distribute A: scheme 1: row blocks, A = [A_{(1)}; ...; A_{(M)}]; scheme 2: column blocks, A = [A₁, ..., A_N]; scheme 3: both, blocks A_{i,j} for i = 1, ..., M and j = 1, ..., N 39 / 60

40 choose a scheme based on the structures of the functions/operators. example: convex and smooth f; convex r (possibly nonsmooth): minimize_x r(x) + f(Ax). choose scheme 1 for separable f, that is, f(y) = Σ_{i=1}^M f_i(y_i). apply forward-backward splitting: x^{k+1} = prox_{γr}(x^k − γAᵀ∇f(Ax^k)) = prox_{γr}(x^k − γ Σ_{i=1}^M A_{(i)}ᵀ∇f_i(A_{(i)}x^k)); we can distribute/parallelize the terms A_{(i)}ᵀ∇f_i(A_{(i)}x^k) 40 / 60
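A sketch, not from the slides, of scheme 1: the row blocks A_{(i)} are handled separately and the partial gradients A_{(i)}ᵀ∇f_i(A_{(i)}x) are summed, which is the map-reduce step that can be distributed. Here the f_i are least-squares blocks and r = λ‖·‖₁; the block count, data, and step size are illustrative.

import numpy as np

rng = np.random.default_rng(5)
M_blocks, rows_per_block, n = 4, 25, 60
blocks = [rng.standard_normal((rows_per_block, n)) for _ in range(M_blocks)]   # A_(1..M)
bs = [rng.standard_normal(rows_per_block) for _ in range(M_blocks)]
lam = 0.1
gamma = 1.0 / np.linalg.norm(np.vstack(blocks), 2) ** 2     # 1/L for the full A

def partial_grad(i, x):
    # A_(i)^T grad f_i(A_(i) x) with f_i(y) = (1/2)||y - b_i||^2; would run on worker i
    return blocks[i].T @ (blocks[i] @ x - bs[i])

x = np.zeros(n)
for _ in range(500):
    g = sum(partial_grad(i, x) for i in range(M_blocks))    # reduce over the M blocks
    v = x - gamma * g
    x = np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)   # prox_{gamma r}
print("nonzeros:", int(np.count_nonzero(np.abs(x) > 1e-6)))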

41 choose scheme 2 for separable r, that is, prox_{γr}(y) = (prox_{γr₁}(y₁), ..., prox_{γr_N}(y_N)). apply forward-backward splitting x^{k+1} = prox_{γr}(x^k − γAᵀ∇f(Ax^k)): cache g = ∇f(Ax^k); update x_j^{k+1} = prox_{γr_j}(x_j^k − γA_jᵀg) for j = 1, ..., N; broadcast g = ∇f(Σ_{j=1}^N A_jx_j^{k+1}). we distribute the computing of A_jx_j^{k+1} = A_j prox_{γr_j}(x_j^k − γA_jᵀg); this is also known as parallel coordinate descent 41 / 60

42 choose scheme 3 for separable f and r (we skip the details). Remarks: parallel computing is enabled by the separability structure; try not to introduce extra variables, for the sake of convergence speed; parallel algorithms can be further accelerated by asynchronous parallelism 42 / 60

43 Duality and structure trade assume scheme 1: A = [A_{(1)}; ...; A_{(M)}]. primal problem: convex nonsmooth r₁, ..., r_M; strongly convex f: minimize_w Σ_{i=1}^M r_i(A_{(i)}w) + f(w); structure: nonsmooth ∘ linear + strongly convex. let f*(y) = sup_x ⟨y, x⟩ − f(x) denote the convex conjugate of f. dual problem: minimize_y f*(−Σ_{i=1}^M A_{(i)}ᵀy_i) + Σ_{i=1}^M r_i*(y_i); dual structure: nonsmooth + smooth ∘ linear 43 / 60

44 Example: support vector machine (SVM) given N sample-label pairs (x_i, y_i) where x_i ∈ R^d and y_i ∈ {−1, 1}. primal problem: minimize_{w,b} Σ_{i=1}^N ι_{R₊}(y_i(x_{(i)}ᵀw − b) − 1) + ‖w‖². dual problem: maximize_{α ∈ R^N} quadratic(α; X) + nonsmooth(α). apply scheme 1 44 / 60

45 Operator splitting: Decentralized applications 45 / 60

46 Decentralized computing n agents in a connected network G = (V, E) with bi-directional links E; each agent i has a private function f_i. problem: find a consensus solution x: minimize_{x ∈ R^p} f(x) := Σ_{i=1}^n f_i(x_i) subject to x_i = x_j, ∀i, j. challenges: no center, only between-neighbor communication. benefits: fault tolerance, no long-distance communication, privacy 46 / 60

47 Decentralized ADMM decentralized consensus optimization problem: minimize_{x_i, i∈V} Σ_{i∈V} f_i(x_i) subject to x_i = x_j, ∀(i, j) ∈ E. ADMM reformulation: minimize_{x_i, i∈V, y_{ij}, (i,j)∈E} Σ_{i∈V} f_i(x_i) subject to x_i = y_{ij}, x_j = y_{ij}, ∀(i, j) ∈ E. ADMM (Douglas-Rachford splitting) alternates between two steps: each agent i updates x_i while the related y_{ij} are fixed; each pair of agents (i, j) updates y_{ij} and the dual variables while x_i, x_j are fixed 47 / 60

48 Primal-dual splitting problem: minimize_{x_i, i∈V} Σ_{i∈V}(r_i(x_i) + f_i(x_i)) subject to Wx = x, where the r_i are convex and the f_i are convex and smooth. the mixing matrix W ∈ R^{n×n}: w_{ij} ≠ 0 only if i = j or agents i, j are neighbors; symmetric, W = Wᵀ; doubly stochastic, W1 = 1; thus I − W ⪰ 0. consensus: x_i = x_j, ∀(i, j) ∈ E ⟺ Wx = x, where x stacks all x_iᵀ. equivalent problem: minimize_x r(x) + f(x) = Σ_{i∈V}(r_i(x_i) + f_i(x_i)) subject to Wx = x 48 / 60

49 let VᵀV = (1/2)(I − W). equivalent problem (KKT conditions): 0 ∈ [∂r, Vᵀ; −V, 0][x; q] + [∇f(x); 0]. applying forward-backward splitting with a special metric, skipping the details and eliminating q, we obtain x^{k+3/2} = Wx^{k+1} + x^{k+1/2} − (1/2)(W + I)x^k − α[∇f(x^{k+1}) − ∇f(x^k)]; x^{k+2} = argmin_x r(x) + (1/(2α))‖x − x^{k+3/2}‖²_F. this recovers the PG-EXTRA decentralized algorithm 49 / 60

50 Operator-splitting analysis 50 / 60

51 How to analyze splitting algorithms? problem: find x such that 0 ∈ (A + B)x (or 0 ∈ (A + B + C)x). iteration: z^{k+1} = T(z^k). require: a fixed point z* of T encodes a solution x*, and ‖z^{k+1} − z*‖² ≤ ‖z^k − z*‖² − C‖z^{k+1} − z^k‖². sufficiency: T is α-averaged, α ∈ (0, 1) 51 / 60

52 Averaged operator weaker than contractive operators; stronger than nonexpansive operators. T is α-averaged, α ∈ (0, 1), if for any z, z̄ ∈ H: ‖Tz − Tz̄‖² ≤ ‖z − z̄‖² − ((1−α)/α)‖(I − T)z − (I − T)z̄‖². assume z^{k+1} = Tz^k and z̄ = Tz̄; then ‖z^{k+1} − z̄‖² ≤ ‖z^k − z̄‖² − ((1−α)/α)‖z^{k+1} − z^k‖². consequences: z^{k+1} − z^k → 0; boundedness of {z^k}; a subsequence z^{k_j} ⇀ z weakly (by demiclosedness and monotonicity); z^k ⇀ z weakly and z = Tz; ‖z^{k+1} − z^k‖² = o(1/k) (Davis-Y. '15) 52 / 60

53 Examples A is monotone ⟹ J_{γA} := (I + γA)⁻¹ is (1/2)-averaged² and R_{γA} = 2J_{γA} − I is nonexpansive. T is nonexpansive ⟹ T₁ = (1 − α)I + αT is α-averaged, α ∈ (0, 1). A is β-cocoercive ⟹ F_{γA} := I − γA is (γ/(2β))-averaged. Baillon-Haddad: if f is convex and ∇f is (1/β)-Lipschitz, then I − γ∇f is (γ/(2β))-averaged. ²also known as firmly nonexpansive 53 / 60

54 Composition properties T₁, T₂ are nonexpansive ⟹ T₁ ∘ T₂ is nonexpansive; T₁, T₂ are averaged ⟹ T₁ ∘ T₂ is averaged 54 / 60

55 Consequences assume A, B are monotone and B is β-cocoercive, γ ∈ (0, 2β). FBS: J_{γA} ∘ F_{γB} is averaged. PRS: (2J_{γA} − I) ∘ (2J_{γB} − I) is nonexpansive. DRS: (1/2)I + (1/2)(2J_{γA} − I) ∘ (2J_{γB} − I) is (1/2)-averaged 55 / 60

56 A three-operator splitting scheme require: A, B maximally monotone, C cocoercive. Davis and Yin '15: T_DYS := I − J_{γB} + J_{γA} ∘ (2J_{γB} − I − γC ∘ J_{γB}) (evaluating T_DYS z evaluates J_{γA}, J_{γB}, and C only once each); reduces to BFS if A = 0, to FBS if B = 0, and to DRS if C = 0. equivalent conditions: 0 ∈ Ax + Bx + Cx ⟺ z = T_DYS(z), x = J_{γB}z 56 / 60
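A Davis-Yin sketch, not from the slides, for minimize ι_{x≥0}(x) + λ‖x‖₁ + (1/2)‖Kx − d‖², taking B as the normal cone of {x ≥ 0} (resolvent = projection), A = ∂(λ‖·‖₁) (resolvent = soft-thresholding), and C = the gradient of the quadratic (cocoercive); the data, λ, and γ ∈ (0, 2β) are illustrative.

import numpy as np

rng = np.random.default_rng(6)
K = rng.standard_normal((30, 50))
d = rng.standard_normal(30)
lam = 0.1
gamma = 1.0 / np.linalg.norm(K, 2) ** 2     # gamma in (0, 2*beta) with beta = 1/||K||^2

z = np.zeros(50)
for _ in range(1000):
    xB = np.maximum(z, 0.0)                                      # J_{gamma B} z
    u = 2 * xB - z - gamma * K.T @ (K @ xB - d)                  # 2 J_B z - z - gamma C(J_B z)
    xA = np.sign(u) * np.maximum(np.abs(u) - gamma * lam, 0.0)   # J_{gamma A} u
    z = z - xB + xA                                              # z <- T_DYS z
x = np.maximum(z, 0.0)
print("min entry:", x.min())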

57 Abstraction: KM (Krasnosel'skiĭ–Mann) iteration require: T is nonexpansive (1-Lipschitz). choose λ > 0 so that T_λ = (1 − λ)I + λT is an averaged operator. KM iteration: z^{k+1} = T_λ z^k. special cases: J_{γA}, (I − γA), T_FBS, T_BFS, T_DRS, T^λ_PRS, and T_DYS (the step-size of any cocoercive map therein must be bounded). convergence: if FixT ≠ ∅, then z^k ⇀ z* ∈ FixT. divergence: if FixT = ∅, then (z^k)_{k≥0} goes unbounded 57 / 60
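A generic KM sketch, not from the slides: z^{k+1} = (1 − λ)z^k + λT(z^k) for a nonexpansive T. The example T below, a projection composed with a gradient step (so an averaged map), is an illustrative stand-in for any of the splitting operators listed above.

import numpy as np

def km(T, z0, lam=0.5, num_iters=500):
    z = z0.copy()
    for _ in range(num_iters):
        z = (1.0 - lam) * z + lam * T(z)     # KM iteration with relaxation lam
    return z

c = np.array([3.0, 4.0])
def T(z, gamma=0.5):
    # FBS operator: gradient step on f(z) = (1/2)||z - c||^2, then project onto the unit ball
    w = z - gamma * (z - c)
    return w / max(1.0, np.linalg.norm(w))

print(km(T, np.zeros(2)))                    # approx c/||c|| = [0.6, 0.8]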

58 Open question Find an operator-splitting scheme for 0 ∈ (T₁ + ··· + T_m)x, m ≥ 4. require: no use of auxiliary variables; convergence is guaranteed under monotone T_i's 58 / 60

59 Summary monotone operator splitting is a set of powerful and elegant tools for many problems in signal processing, machine learning, computer vision, etc.; they give rise to parallel, distributed, and decentralized algorithms; under the hood: fixed-point and nonexpansive-operator theory. not covered: convergence rates of the objective error f^k − f* and the point error ‖z^k − z*‖²; accelerated rates by averaging and extrapolation 59 / 60

60 Thank you! References: H. Bauschke and P. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer. N. Komodakis and J.-C. Pesquet. Playing with duality: an overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Processing Magazine. D. Davis and W. Yin. Convergence rate analysis of several splitting schemes. UCLA CAM 14-51. D. Davis and W. Yin. A three-operator splitting method and its acceleration. UCLA CAM 15-13. 60 / 60
