Dual Proximal Gradient Method
http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html
Acknowledgement: these slides are based on Prof. Lieven Vandenberghe's lecture notes
Outline
1. proximal gradient method applied to the dual
2. examples
3. alternating minimization method
Dual methods
- subgradient method: slow, step size selection difficult
- gradient method: requires a differentiable dual cost function
  - often the dual cost is not differentiable, or has a nontrivial domain
  - the dual can be smoothed by adding a small strongly convex term to the primal
- augmented Lagrangian method: equivalent to gradient ascent on a smoothed dual problem
  - however, smoothing destroys separable structure
- proximal gradient method (this lecture): dual cost split in two terms
  - one term is differentiable with Lipschitz continuous gradient
  - the other term has an inexpensive prox-operator
Composite structure in the dual

primal and dual problems:
\[
\min_x\; f(x) + g(Ax) \qquad\qquad \max_z\; -f^*(-A^Tz) - g^*(z)
\]
the dual has the right structure for the proximal gradient method if
- the prox-operator of $g$ (or $g^*$) is cheap (closed form or simple algorithm)
- $f$ is strongly convex, i.e., $f(x) - (\mu/2)x^Tx$ is convex

strong convexity of $f$ implies that $f^*(-A^Tz)$ has Lipschitz continuous gradient with constant $L = \|A\|_2^2/\mu$:
\[
\|A\nabla f^*(-A^Tu) - A\nabla f^*(-A^Tv)\|_2 \le \frac{\|A\|_2^2}{\mu}\,\|u - v\|_2
\]
because $\nabla f^*$ is Lipschitz continuous with constant $1/\mu$
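To make the Lipschitz constant concrete, here is a quadratic special case (an added illustration, not on the original slide): take $f(x) = \frac{1}{2}x^TQx$ with $Q \succeq \mu I$. Then
\[
f^*(y) = \frac{1}{2}y^TQ^{-1}y, \qquad \nabla f^*(y) = Q^{-1}y,
\]
so $\nabla f^*$ is Lipschitz continuous with constant $\|Q^{-1}\|_2 = 1/\lambda_{\min}(Q) \le 1/\mu$, and composing with $-A^T$ on the inside and $A$ on the outside scales the constant by at most $\|A\|_2^2$, giving $L = \|A\|_2^2/\mu$.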
Dual proximal gradient update
\[
z^+ = \mathrm{prox}_{tg^*}\bigl(z + tA\nabla f^*(-A^Tz)\bigr)
\]
equivalent expression in terms of $f$:
\[
z^+ = \mathrm{prox}_{tg^*}(z + tA\hat{x}), \qquad \hat{x} = \operatorname*{argmin}_x\,\bigl(f(x) + z^TAx\bigr)
\]
- if $f$ is separable, the calculation of $\hat{x}$ decomposes into independent problems
- step size $t$ constant or from a backtracking line search
- accelerated proximal gradient methods can be used
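A minimal sketch of this update (an added example, assuming the hypothetical instance $f(x) = \frac{1}{2}\|x-a\|_2^2$, so $\mu = 1$, and $g = \|\cdot\|_1$, for which $\mathrm{prox}_{tg^*}$ is projection onto the box $[-1,1]^m$):

```python
import numpy as np

# Sketch of the dual proximal gradient iteration for a hypothetical instance:
# f(x) = 0.5*||x - a||^2 (strongly convex, mu = 1) and g = ||.||_1, so g* is
# the indicator of the unit infinity-norm ball and prox_{t g*} is clipping.
rng = np.random.default_rng(0)
m, n = 20, 50
A = rng.standard_normal((m, n))
a = rng.standard_normal(n)

t = 1.0 / np.linalg.norm(A, 2) ** 2   # step size t <= mu / ||A||_2^2
z = np.zeros(m)
for k in range(200):
    # x-hat = argmin_x f(x) + z^T A x  (closed form for this quadratic f)
    x_hat = a - A.T @ z
    # z^+ = prox_{t g*}(z + t A x-hat) = projection onto [-1, 1]^m
    z = np.clip(z + t * (A @ x_hat), -1.0, 1.0)

print("primal objective:",
      0.5 * np.sum((x_hat - a) ** 2) + np.sum(np.abs(A @ x_hat)))
```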
Alternating minimization interpretation

the Moreau decomposition gives an alternate expression for the $z$-update
\[
z^+ = z + t(A\hat{x} - \hat{y})
\]
where
\[
\hat{x} = \operatorname*{argmin}_x\,\bigl(f(x) + z^TAx\bigr), \qquad
\hat{y} = \mathrm{prox}_{t^{-1}g}(z/t + A\hat{x})
        = \operatorname*{argmin}_y\,\Bigl(g(y) + z^T(A\hat{x} - y) + \frac{t}{2}\|A\hat{x} - y\|_2^2\Bigr)
\]
each iteration is therefore an alternating minimization of:
- the Lagrangian $f(x) + g(y) + z^T(Ax - y)$ over $x$
- the augmented Lagrangian $f(x) + g(y) + z^T(Ax - y) + \frac{t}{2}\|Ax - y\|_2^2$ over $y$
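A quick numerical check (added; same hypothetical quadratic/$\ell_1$ instance as in the sketch above, where $\mathrm{prox}_{t^{-1}g}$ is soft-thresholding with threshold $1/t$) that the alternating-minimization form of the update coincides with the prox form:

```python
import numpy as np

# Verify z + t*(A x_hat - y_hat) == prox_{t g*}(z + t A x_hat) for g = ||.||_1.
rng = np.random.default_rng(1)
m, n = 20, 50
A = rng.standard_normal((m, n))
a = rng.standard_normal(n)
z = rng.standard_normal(m)
t = 0.5 / np.linalg.norm(A, 2) ** 2

x_hat = a - A.T @ z                          # argmin of the Lagrangian over x
v = z / t + A @ x_hat
y_hat = np.sign(v) * np.maximum(np.abs(v) - 1.0 / t, 0.0)   # prox_{g/t}(v)
z_alt = z + t * (A @ x_hat - y_hat)          # alternating-minimization form
z_prox = np.clip(z + t * (A @ x_hat), -1.0, 1.0)            # dual prox form

print(np.allclose(z_alt, z_prox))            # True, by Moreau decomposition
```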
Outline
1. proximal gradient method applied to the dual
2. examples
3. alternating minimization method
Regularized norm approximation
\[
\min_x\; f(x) + \|Ax - b\|
\]
(with $f$ strongly convex); a special case of page 4 with $g(y) = \|y - b\|$:
\[
g^*(z) = \begin{cases} b^Tz & \|z\|_* \le 1 \\ +\infty & \text{otherwise} \end{cases}
\qquad
\mathrm{prox}_{tg^*}(z) = P_C(z - tb)
\]
where $C$ is the unit norm ball for the dual norm $\|\cdot\|_*$

dual gradient projection update:
\[
\hat{x} = \operatorname*{argmin}_x\,\bigl(f(x) + z^TAx\bigr), \qquad
z^+ = P_C\bigl(z + t(A\hat{x} - b)\bigr)
\]
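A hedged sketch of this update (added example; assumes $f(x) = \frac{1}{2}\|x-a\|_2^2$ and the Euclidean norm $\|Ax-b\|_2$, so the dual-norm ball $C$ is the unit Euclidean ball):

```python
import numpy as np

# Dual gradient projection for f(x) = 0.5*||x - a||^2 + ||Ax - b||_2
# (hypothetical instance); C is the unit Euclidean norm ball.
rng = np.random.default_rng(2)
m, n = 30, 80
A = rng.standard_normal((m, n))
a = rng.standard_normal(n)
b = rng.standard_normal(m)

def proj_ball(z):
    """Projection onto the unit Euclidean norm ball."""
    nz = np.linalg.norm(z)
    return z if nz <= 1.0 else z / nz

t = 1.0 / np.linalg.norm(A, 2) ** 2   # t <= mu / ||A||_2^2 with mu = 1
z = np.zeros(m)
for k in range(500):
    x_hat = a - A.T @ z               # argmin_x f(x) + z^T A x
    z = proj_ball(z + t * (A @ x_hat - b))

print("objective:",
      0.5 * np.sum((x_hat - a) ** 2) + np.linalg.norm(A @ x_hat - b))
```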
Example
\[
\min_x\; f(x) + \sum_{i=1}^{p} \|B_ix\|_2
\]
(with $f$ strongly convex); a special case of page 4 with
\[
g(y_1,\ldots,y_p) = \sum_{i=1}^{p}\|y_i\|_2, \qquad
A = \begin{bmatrix} B_1^T & B_2^T & \cdots & B_p^T \end{bmatrix}^T
\]
dual gradient projection update:
\[
\hat{x} = \operatorname*{argmin}_x\,\Bigl(f(x) + \Bigl(\sum_{i=1}^{p} B_i^Tz_i\Bigr)^T x\Bigr), \qquad
z_i^+ = P_{C_i}(z_i + tB_i\hat{x}), \quad i = 1,\ldots,p
\]
where $C_i$ is the unit Euclidean norm ball in $\mathbf{R}^{m_i}$, if $B_i \in \mathbf{R}^{m_i \times n}$
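The same pattern with block projections; in this added sketch (hypothetical sizes, $f(x) = \frac{1}{2}\|x-a\|_2^2$) the $p$ projections are independent and could run in parallel:

```python
import numpy as np

# Blockwise dual update for f(x) = 0.5*||x - a||^2 + sum_i ||B_i x||_2.
rng = np.random.default_rng(3)
n, p, mi = 40, 5, 3
B = [rng.standard_normal((mi, n)) for _ in range(p)]
a = rng.standard_normal(n)
A = np.vstack(B)                          # stacked matrix, for the step size
t = 1.0 / np.linalg.norm(A, 2) ** 2
z = [np.zeros(mi) for _ in range(p)]

for k in range(300):
    # x-hat solves min f(x) + (sum_i B_i^T z_i)^T x
    x_hat = a - sum(Bi.T @ zi for Bi, zi in zip(B, z))
    # the p projections onto the unit 2-norm balls C_i are independent
    for i in range(p):
        v = z[i] + t * (B[i] @ x_hat)
        z[i] = v / max(1.0, np.linalg.norm(v))

print(0.5 * np.sum((x_hat - a) ** 2)
      + sum(np.linalg.norm(Bi @ x_hat) for Bi in B))
```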
Numerical example
\[
f(x) = \frac{1}{2}\|Cx - d\|_2^2
\]
with randomly generated $C \in \mathbf{R}^{2000\times 1000}$, $B_i \in \mathbf{R}^{10\times 1000}$, $p = 500$

[figure: dual objective suboptimality versus iteration $k$ for the gradient method and FISTA, on a log scale from $10^0$ down to $10^{-6}$ over roughly 900 iterations]
Minimization over intersection of convex sets
\[
\begin{aligned}
\min\;& f(x) \\
\text{s.t.}\;& x \in C_1 \cap \cdots \cap C_m
\end{aligned}
\]
- $f$ strongly convex; e.g., $f(x) = \|x - a\|_2^2$ for projecting $a$ on the intersection
- the sets $C_i$ are closed, convex, and easy to project onto

this is a special case of page 4 with $g$ a sum of indicators:
\[
g(y_1,\ldots,y_m) = I_{C_1}(y_1) + \cdots + I_{C_m}(y_m), \qquad
A = \begin{bmatrix} I & \cdots & I \end{bmatrix}^T
\]
dual proximal gradient update:
\[
\hat{x} = \operatorname*{argmin}_x\,\bigl(f(x) + (z_1 + \cdots + z_m)^Tx\bigr), \qquad
z_i^+ = z_i + t\hat{x} - tP_{C_i}(z_i/t + \hat{x}), \quad i = 1,\ldots,m
\]
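An added sketch for a concrete instance: projecting a point onto the intersection of a box and a Euclidean ball (both sets chosen purely for illustration), with $f(x) = \|x - a\|_2^2$, so $\mu = 2$ and $\|A\|_2^2 = m$:

```python
import numpy as np

# Project a onto the intersection of a box and a ball via the dual update.
rng = np.random.default_rng(4)
n = 10
a = rng.standard_normal(n) * 3.0

proj = [
    lambda v: np.clip(v, -1.0, 1.0),                  # C_1: box [-1, 1]^n
    lambda v: v / max(1.0, np.linalg.norm(v) / 2.0),  # C_2: ball of radius 2
]
m = len(proj)

t = 2.0 / m          # t <= mu / ||A||_2^2 = 2/m for f(x) = ||x - a||^2
z = [np.zeros(n) for _ in range(m)]
for k in range(500):
    x_hat = a - 0.5 * sum(z)          # argmin f(x) + (z_1 + ... + z_m)^T x
    z = [zi + t * x_hat - t * P(zi / t + x_hat) for zi, P in zip(z, proj)]

print(x_hat)   # approximately the projection of a onto the intersection
```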
Decomposition of separable problems
\[
\min\; \sum_{j=1}^{n} f_j(x_j) + \sum_{i=1}^{m} g_i(A_{i1}x_1 + \cdots + A_{in}x_n)
\]
each $f_j$ is strongly convex; each $g_i$ has an inexpensive prox-operator

dual proximal gradient update:
\[
\hat{x}_j = \operatorname*{argmin}_{x_j}\,\Bigl(f_j(x_j) + \sum_{i=1}^{m} z_i^TA_{ij}x_j\Bigr), \quad j = 1,\ldots,n
\]
\[
z_i^+ = \mathrm{prox}_{tg_i^*}\Bigl(z_i + t\sum_{j=1}^{n} A_{ij}\hat{x}_j\Bigr), \quad i = 1,\ldots,m
\]
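An added sketch emphasizing that both update families decompose; it assumes scalar blocks $f_j(x_j) = \frac{1}{2}(x_j - a_j)^2$ and $g_i = |\cdot|$ (hypothetical choices, for which $\mathrm{prox}_{tg_i^*}$ clips to $[-1,1]$):

```python
import numpy as np

# Doubly decomposed dual update with scalar blocks: the n x_j-updates and
# the m z_i-updates are each independent and could run in parallel.
rng = np.random.default_rng(5)
m, n = 15, 25
Aij = rng.standard_normal((m, n))   # the scalars A_{ij} collected in a matrix
a = rng.standard_normal(n)

t = 1.0 / np.linalg.norm(Aij, 2) ** 2
z = np.zeros(m)
for k in range(300):
    x_hat = a - Aij.T @ z                           # n independent x_j-updates
    z = np.clip(z + t * (Aij @ x_hat), -1.0, 1.0)   # m independent z_i-updates

print(0.5 * np.sum((x_hat - a) ** 2) + np.sum(np.abs(Aij @ x_hat)))
```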
Outline
1. proximal gradient method applied to the dual
2. examples
3. alternating minimization method
Primal problem with separable structure

composite problem with separable $f$:
\[
\min\; f_1(x_1) + f_2(x_2) + g(A_1x_1 + A_2x_2)
\]
we assume $f_1$ strongly convex, but not necessarily $f_2$

dual problem:
\[
\max\; -f_1^*(-A_1^Tz) - f_2^*(-A_2^Tz) - g^*(z)
\]
- the first term is differentiable with Lipschitz continuous gradient
- the prox-operator of $h(z) = f_2^*(-A_2^Tz) + g^*(z)$ was discussed; an explicit expression follows on the next page
Dual proximal gradient method
\[
z^+ = \mathrm{prox}_{th}\bigl(z + tA_1\nabla f_1^*(-A_1^Tz)\bigr)
\]
equivalent form using $f_1$:
\[
z^+ = \mathrm{prox}_{th}(z + tA_1\hat{x}_1), \qquad \hat{x}_1 = \operatorname*{argmin}_{x_1}\,\bigl(f_1(x_1) + z^TA_1x_1\bigr)
\]
the prox-operator of $h(z) = f_2^*(-A_2^Tz) + g^*(z)$ is given by
\[
\mathrm{prox}_{th}(w) = w + t(A_2\hat{x}_2 - \hat{y})
\]
where $\hat{x}_2, \hat{y}$ minimize an augmented Lagrangian:
\[
(\hat{x}_2, \hat{y}) = \operatorname*{argmin}_{x_2,\,y}\,\Bigl(f_2(x_2) + g(y) + \frac{t}{2}\|A_2x_2 - y + w/t\|_2^2\Bigr)
\]
Proof of $\mathrm{prox}_{th}(w) = w + t(A_2\hat{x}_2 - \hat{y})$

with $h(z) = f_2^*(-A_2^Tz) + g^*(z)$, the conjugate of $h$ is
\[
\begin{aligned}
h^*(y) &= \sup_z\,\bigl(y^Tz - f_2^*(-A_2^Tz) - g^*(z)\bigr) \\
&= \sup_{z,w}\,\bigl(y^Tz - f_2^*(w) - g^*(z)\bigr) \quad \text{s.t. } w = -A_2^Tz \\
&= \inf_v\,\sup_{z,w}\,\bigl(y^Tz - f_2^*(w) - g^*(z) + v^T(w + A_2^Tz)\bigr) \\
&= \inf_v\,\bigl(f_2(v) + g(A_2v + y)\bigr)
\end{aligned}
\]
the Moreau decomposition gives $w = \mathrm{prox}_{th}(w) + t\,\mathrm{prox}_{t^{-1}h^*}(w/t)$; computing $\mathrm{prox}_{t^{-1}h^*}(w/t)$ means solving
\[
\min_y\; t^{-1}h^*(y) + \frac{1}{2}\|y - w/t\|_2^2
\;\Longleftrightarrow\;
\min_{y,v}\; f_2(v) + g(A_2v + y) + \frac{t}{2}\|y - w/t\|_2^2
\]
\[
\Longleftrightarrow\;
\min_{u,v}\; f_2(v) + g(u) + \frac{t}{2}\|u - A_2v - w/t\|_2^2 \qquad \text{(using } u = A_2v + y\text{)}
\]
hence $\mathrm{prox}_{th}(w) = w - t\hat{y} = w - t(\hat{u} - A_2\hat{v}) = w + t(A_2\hat{x}_2 - \hat{y})$ in the notation of the previous page
Alternating minimization method

starting at some initial $z$, repeat the following iteration:
1. minimize the Lagrangian over $x_1$:
\[
\hat{x}_1 = \operatorname*{argmin}_{x_1}\,\bigl(f_1(x_1) + z^TA_1x_1\bigr)
\]
2. minimize the augmented Lagrangian over $x_2, y$:
\[
(\hat{x}_2, \hat{y}) = \operatorname*{argmin}_{x_2,\,y}\,\Bigl(f_2(x_2) + g(y) + \frac{t}{2}\|A_1\hat{x}_1 + A_2x_2 - y + z/t\|_2^2\Bigr)
\]
3. update the dual variable:
\[
z^+ = z + t(A_1\hat{x}_1 + A_2\hat{x}_2 - \hat{y})
\]
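An added sketch of this iteration on a hypothetical instance where $g$ is the indicator of $\{b\}$, so the problem is $\min f_1(x_1) + f_2(x_2)$ subject to $A_1x_1 + A_2x_2 = b$ and step 2 has a closed form (quadratic $f_1, f_2$ chosen for illustration):

```python
import numpy as np

# Alternating minimization for f_1(x1) = 0.5*||x1 - a1||^2,
# f_2(x2) = 0.5*||x2 - a2||^2, g = indicator of {b}.
rng = np.random.default_rng(6)
m, n1, n2 = 10, 20, 15
A1, A2 = rng.standard_normal((m, n1)), rng.standard_normal((m, n2))
a1, a2 = rng.standard_normal(n1), rng.standard_normal(n2)
b = rng.standard_normal(m)

t = 1.0 / np.linalg.norm(A1, 2) ** 2   # t <= mu_1 / ||A1||_2^2 with mu_1 = 1
z = np.zeros(m)
for k in range(1000):
    # step 1: minimize the (plain) Lagrangian over x1
    x1 = a1 - A1.T @ z
    # step 2: minimize the augmented Lagrangian over (x2, y); g forces y = b,
    # so only a quadratic in x2 remains (solved in closed form)
    y = b
    rhs = a2 - t * A2.T @ (A1 @ x1 - b + z / t)
    x2 = np.linalg.solve(np.eye(n2) + t * A2.T @ A2, rhs)
    # step 3: dual update
    z = z + t * (A1 @ x1 + A2 @ x2 - y)

print("constraint residual:", np.linalg.norm(A1 @ x1 + A2 @ x2 - b))
```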
Comparison with augmented Lagrangian method

augmented Lagrangian method (for the problem on page 14):
1. compute the minimizer $\hat{x}_1, \hat{x}_2, \hat{y}$ of the augmented Lagrangian
\[
f_1(x_1) + f_2(x_2) + g(y) + \frac{t}{2}\|A_1x_1 + A_2x_2 - y + z/t\|_2^2
\]
2. update the dual variable: $z^+ = z + t(A_1\hat{x}_1 + A_2\hat{x}_2 - \hat{y})$

differences with the alternating minimization method:
- more general: the AL method does not require strong convexity of $f_1$
- the quadratic penalty in step 1 destroys separability between $x_1$ and $x_2$
References

alternating minimization method:
- P. Tseng, Applications of a splitting algorithm to decomposition in convex programming and variational inequalities, SIAM Journal on Control and Optimization (1991)
- P. Tseng, Further applications of a splitting algorithm to decomposition in variational inequalities and convex programming, Mathematical Programming (1990)