6. Proximal gradient method

L. Vandenberghe, EE236C (Spring 2016)

- motivation
- proximal mapping
- proximal gradient method with fixed step size
- proximal gradient method with line search

Proximal mapping (6-2)

the proximal mapping (prox-operator) of a convex function $h$ is defined as

$$\mathrm{prox}_h(x) = \operatorname*{argmin}_u \left( h(u) + \frac{1}{2}\|u - x\|_2^2 \right)$$

examples

- $h(x) = 0$: $\mathrm{prox}_h(x) = x$
- $h(x) = I_C(x)$ (indicator function of $C$): $\mathrm{prox}_h$ is projection on $C$

  $$\mathrm{prox}_h(x) = \operatorname*{argmin}_{u \in C} \|u - x\|_2^2 = P_C(x)$$

- $h(x) = \|x\|_1$: $\mathrm{prox}_h$ is the soft-threshold (shrinkage) operation

  $$\mathrm{prox}_h(x)_i = \begin{cases} x_i - 1 & x_i \geq 1 \\ 0 & |x_i| \leq 1 \\ x_i + 1 & x_i \leq -1 \end{cases}$$
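The three example prox-operators can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture: the function names are made up for this sketch, the set $C$ is assumed to be a box so that the projection is a simple clip, and `prox_l1` takes an optional threshold `t` (with `t = 1` reproducing the formula above).

```python
import numpy as np

def prox_zero(x):
    """prox of h(x) = 0: the identity map."""
    return np.array(x, dtype=float)

def prox_box(x, lo, hi):
    """prox of the indicator of the box {u : lo <= u <= hi} (assumed choice of C):
    Euclidean projection, which for a box is a componentwise clip."""
    return np.clip(x, lo, hi)

def prox_l1(x, t=1.0):
    """prox of h(x) = t * ||x||_1: componentwise soft-thresholding at level t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.array([-2.5, -0.3, 0.0, 0.7, 3.0])
print(prox_zero(x))
print(prox_box(x, 0.0, 1.0))
print(prox_l1(x))
```

Entries of `x` with magnitude at most 1 are set to zero by `prox_l1`, and the remaining entries are shrunk toward zero by 1.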

Proximal gradient method (6-3)

unconstrained optimization with objective split in two components

$$\text{minimize} \quad f(x) = g(x) + h(x)$$

- $g$ convex, differentiable, $\operatorname{dom} g = \mathbf{R}^n$
- $h$ convex with inexpensive prox-operator (many examples in lecture 9)

proximal gradient algorithm

$$x^{(k)} = \mathrm{prox}_{t_k h}\left( x^{(k-1)} - t_k \nabla g(x^{(k-1)}) \right)$$

$t_k > 0$ is step size, constant or determined by line search
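The iteration can be sketched as a short loop. This is a minimal sketch with a constant step size; the function `proximal_gradient`, its arguments, and the toy problem ($g(x) = \frac{1}{2}\|x - c\|_2^2$, $h(x) = \|x\|_1$) are assumptions chosen for illustration, not part of the lecture.

```python
import numpy as np

def proximal_gradient(grad_g, prox_h, x0, t, num_iters=100):
    """Iterate x <- prox_{t h}(x - t * grad_g(x)) with a constant step t.

    grad_g(x) returns the gradient of g at x.
    prox_h(u, t) returns prox_{t h}(u).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = prox_h(x - t * grad_g(x), t)
    return x

# Toy problem (assumed): g(x) = 0.5 * ||x - c||^2, h(x) = ||x||_1.
c = np.array([3.0, -0.5, 0.2, -2.0])
grad_g = lambda x: x - c
prox_h = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

x_hat = proximal_gradient(grad_g, prox_h, np.zeros(4), t=1.0, num_iters=50)
print(x_hat)
```

For this toy problem the gradient of $g$ has Lipschitz constant 1, so `t = 1.0` is the step $1/L$, and the minimizer is the soft-thresholded vector `c`.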

Interpretation (6-4)

$$x^+ = \mathrm{prox}_{th}(x - t \nabla g(x))$$

from definition of proximal mapping:

$$x^+ = \operatorname*{argmin}_u \left( h(u) + \frac{1}{2t}\|u - x + t \nabla g(x)\|_2^2 \right)$$

$$\phantom{x^+} = \operatorname*{argmin}_u \left( h(u) + g(x) + \nabla g(x)^T (u - x) + \frac{1}{2t}\|u - x\|_2^2 \right)$$

$x^+$ minimizes $h(u)$ plus a simple quadratic local model of $g(u)$ around $x$
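The step between the two argmin expressions is an expansion of the square. Written out (a short derivation added for clarity, using only the symbols defined on this page):

```latex
\begin{align*}
\frac{1}{2t}\,\|u - x + t\nabla g(x)\|_2^2
  &= \frac{1}{2t}\,\|u - x\|_2^2
   + \nabla g(x)^T (u - x)
   + \frac{t}{2}\,\|\nabla g(x)\|_2^2 .
\end{align*}
```

The term $\frac{t}{2}\|\nabla g(x)\|_2^2$ and the constant $g(x)$ do not depend on $u$, so dropping the first and adding the second leaves the minimizer unchanged.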

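Examples (6-5)

$$\text{minimize} \quad g(x) + h(x)$$

gradient method: special case with $h(x) = 0$

$$x^+ = x - t \nabla g(x)$$

gradient projection method: special case with $h(x) = I_C(x)$

$$x^+ = P_C(x - t \nabla g(x))$$

[Figure: the set $C$, the point $x$, the gradient step $x - t \nabla g(x)$, and its projection $x^+$ on $C$.]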

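Examples, continued (6-6)

soft-thresholding: special case with $h(x) = \|x\|_1$

$$x^+ = \mathrm{prox}_{th}(x - t \nabla g(x))$$

where

$$\mathrm{prox}_{th}(u)_i = \begin{cases} u_i - t & u_i \geq t \\ 0 & -t \leq u_i \leq t \\ u_i + t & u_i \leq -t \end{cases}$$

[Figure: $\mathrm{prox}_{th}(u)_i$ plotted against $u_i$, with breakpoints at $-t$ and $t$.]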

Outline

- introduction
- proximal mapping
- proximal gradient method with fixed step size
- proximal gradient method with line search

Proximal mapping (6-7)

if $h$ is convex and closed (has a closed epigraph), then

$$\mathrm{prox}_h(x) = \operatorname*{argmin}_u \left( h(u) + \frac{1}{2}\|u - x\|_2^2 \right)$$

exists and is unique for all $x$

- will be studied in more detail in lecture 9
- from optimality conditions of minimization in the definition:

$$u = \mathrm{prox}_h(x) \quad \Longleftrightarrow \quad x - u \in \partial h(u) \quad \Longleftrightarrow \quad h(z) \geq h(u) + (x - u)^T (z - u) \quad \forall z$$

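Projection on closed convex set (6-8)

proximal mapping of indicator function $I_C$ is Euclidean projection on $C$

$$\mathrm{prox}_{I_C}(x) = \operatorname*{argmin}_{u \in C} \|u - x\|_2^2 = P_C(x)$$

subgradient characterization

$$u = P_C(x) \quad \Longleftrightarrow \quad x - u \in N_C(u) \quad \Longleftrightarrow \quad (x - u)^T (z - u) \leq 0 \quad \forall z \in C$$

[Figure: a point $x$ and its projection $P_C(x)$ on the set $C$.]

we will see that proximal mappings have many properties of projections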

Nonexpansiveness (6-9)

if $u = \mathrm{prox}_h(x)$, $v = \mathrm{prox}_h(y)$, then

$$(u - v)^T (x - y) \geq \|u - v\|_2^2$$

$\mathrm{prox}_h$ is firmly nonexpansive, or co-coercive with constant 1

- follows from characterization of page 6-7 and monotonicity (page 4-10)

$$x - u \in \partial h(u), \quad y - v \in \partial h(v) \quad \Longrightarrow \quad (x - u - y + v)^T (u - v) \geq 0$$

- implies (from Cauchy-Schwarz inequality)

$$\|\mathrm{prox}_h(x) - \mathrm{prox}_h(y)\|_2 \leq \|x - y\|_2$$

$\mathrm{prox}_h$ is nonexpansive, or Lipschitz continuous with constant 1
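Both inequalities are easy to spot-check numerically. The sketch below does so for the soft-thresholding operator on random pairs of points; the choice of operator, the random test points, and the variable names are assumptions for illustration, and a numerical check of this kind is not a proof.

```python
import numpy as np

def prox_l1(x, t=1.0):
    """Soft-thresholding: prox of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(0)
min_firm_gap = np.inf   # smallest observed (u-v)^T(x-y) - ||u-v||^2, expected >= 0
max_lip_gap = -np.inf   # largest observed ||u-v|| - ||x-y||, expected <= 0

for _ in range(1000):
    x = 3.0 * rng.normal(size=5)
    y = 3.0 * rng.normal(size=5)
    u, v = prox_l1(x), prox_l1(y)
    min_firm_gap = min(min_firm_gap, (u - v) @ (x - y) - (u - v) @ (u - v))
    max_lip_gap = max(max_lip_gap, np.linalg.norm(u - v) - np.linalg.norm(x - y))

print(min_firm_gap, max_lip_gap)
```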

Convergence of proximal gradient method (6-10)

to minimize $g + h$, choose $x^{(0)}$ and repeat

$$x^{(k)} = \mathrm{prox}_{t_k h}\left( x^{(k-1)} - t_k \nabla g(x^{(k-1)}) \right), \quad k \geq 1$$

assumptions

- $g$ convex with $\operatorname{dom} g = \mathbf{R}^n$; $\nabla g$ Lipschitz continuous with constant $L$:

$$\|\nabla g(x) - \nabla g(y)\|_2 \leq L \|x - y\|_2 \quad \forall x, y$$

- $h$ is closed and convex (so that $\mathrm{prox}_{th}$ is well defined)
- optimal value $f^\star$ is finite and attained at $x^\star$ (not necessarily unique)

convergence result: $1/k$ rate convergence with fixed step size $t_k = 1/L$

Gradient map (6-11)

$$G_t(x) = \frac{1}{t}\left( x - \mathrm{prox}_{th}(x - t \nabla g(x)) \right)$$

$G_t(x)$ is the negative step in the proximal gradient update

$$x^+ = \mathrm{prox}_{th}(x - t \nabla g(x)) = x - t G_t(x)$$

- $G_t(x)$ is not a gradient or subgradient of $f = g + h$
- from subgradient definition of prox-operator (page 6-7),

$$G_t(x) \in \nabla g(x) + \partial h(x - t G_t(x))$$

- $G_t(x) = 0$ if and only if $x$ minimizes $f(x) = g(x) + h(x)$
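A small sketch of the gradient map, with a check that it vanishes at a minimizer and that $x - t G_t(x)$ reproduces the proximal gradient update. The function name `gradient_map` and the toy problem ($g(x) = \frac{1}{2}\|x - c\|_2^2$, $h(x) = \|x\|_1$, whose minimizer is the soft-thresholded `c`) are assumptions made for this illustration.

```python
import numpy as np

def soft_threshold(u, t):
    """prox of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def gradient_map(x, t, grad_g, prox_h):
    """G_t(x) = (x - prox_{t h}(x - t * grad_g(x))) / t."""
    return (x - prox_h(x - t * grad_g(x), t)) / t

# Toy problem (assumed): g(x) = 0.5 * ||x - c||^2, h(x) = ||x||_1.
c = np.array([3.0, -0.5, 0.2, -2.0])
grad_g = lambda x: x - c
x_star = soft_threshold(c, 1.0)      # known minimizer of g + h for this toy problem

t = 0.5
G_at_star = gradient_map(x_star, t, grad_g, soft_threshold)   # zero vector

x = np.zeros(4)
G = gradient_map(x, t, grad_g, soft_threshold)
x_plus = x - t * G                   # same point as the proximal gradient update
print(G_at_star, x_plus)
```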

Consequences of Lipschitz assumption (6-12)

recall upper bound (p. 1-12) for convex $g$ with Lipschitz continuous gradient

$$g(y) \leq g(x) + \nabla g(x)^T (y - x) + \frac{L}{2}\|y - x\|_2^2 \quad \forall x, y$$

- substitute $y = x - t G_t(x)$:

$$g(x - t G_t(x)) \leq g(x) - t \nabla g(x)^T G_t(x) + \frac{t^2 L}{2}\|G_t(x)\|_2^2$$

- if $0 < t \leq 1/L$, then

$$g(x - t G_t(x)) \leq g(x) - t \nabla g(x)^T G_t(x) + \frac{t}{2}\|G_t(x)\|_2^2 \qquad (1)$$

A global inequality (6-13)

if the inequality (1) holds, then for all $z$,

$$f(x - t G_t(x)) \leq f(z) + G_t(x)^T (x - z) - \frac{t}{2}\|G_t(x)\|_2^2 \qquad (2)$$

proof: (define $v = G_t(x) - \nabla g(x)$)

$$f(x - t G_t(x)) \leq g(x) - t \nabla g(x)^T G_t(x) + \frac{t}{2}\|G_t(x)\|_2^2 + h(x - t G_t(x))$$

$$\leq g(z) + \nabla g(x)^T (x - z) - t \nabla g(x)^T G_t(x) + \frac{t}{2}\|G_t(x)\|_2^2 + h(z) + v^T (x - z - t G_t(x))$$

$$= g(z) + h(z) + G_t(x)^T (x - z) - \frac{t}{2}\|G_t(x)\|_2^2$$

line 2 follows from convexity of $g$ and $h$, and $v \in \partial h(x - t G_t(x))$

Progress in one iteration (6-14)

$$x^+ = x - t G_t(x)$$

- inequality (2) with $z = x$ shows the algorithm is a descent method:

$$f(x^+) \leq f(x) - \frac{t}{2}\|G_t(x)\|_2^2$$

- inequality (2) with $z = x^\star$:

$$f(x^+) - f^\star \leq G_t(x)^T (x - x^\star) - \frac{t}{2}\|G_t(x)\|_2^2$$

$$= \frac{1}{2t}\left( \|x - x^\star\|_2^2 - \|x - x^\star - t G_t(x)\|_2^2 \right)$$

$$= \frac{1}{2t}\left( \|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2 \right) \qquad (3)$$

(hence, $\|x^+ - x^\star\|_2 \leq \|x - x^\star\|_2$, i.e., distance to optimal set decreases)
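The first equality in the chain leading to (3) is a completion of the square. Spelled out (a short derivation added for clarity, writing $G$ for $G_t(x)$):

```latex
\begin{align*}
\|x - x^\star - tG\|_2^2
  &= \|x - x^\star\|_2^2 - 2t\,G^T (x - x^\star) + t^2 \|G\|_2^2 , \\
\frac{1}{2t}\left( \|x - x^\star\|_2^2 - \|x - x^\star - tG\|_2^2 \right)
  &= G^T (x - x^\star) - \frac{t}{2}\,\|G\|_2^2 .
\end{align*}
```

The second equality in the chain then substitutes $x^+ = x - tG$.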

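Analysis for fixed step size (6-15)

add inequalities (3) for $x = x^{(i-1)}$, $x^+ = x^{(i)}$, $t = t_i = 1/L$

$$\sum_{i=1}^k \left( f(x^{(i)}) - f^\star \right) \leq \frac{1}{2t} \sum_{i=1}^k \left( \|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2 \right)$$

$$= \frac{1}{2t}\left( \|x^{(0)} - x^\star\|_2^2 - \|x^{(k)} - x^\star\|_2^2 \right)$$

$$\leq \frac{1}{2t}\|x^{(0)} - x^\star\|_2^2$$

since $f(x^{(i)})$ is nonincreasing,

$$f(x^{(k)}) - f^\star \leq \frac{1}{k} \sum_{i=1}^k \left( f(x^{(i)}) - f^\star \right) \leq \frac{1}{2kt}\|x^{(0)} - x^\star\|_2^2$$

conclusion: reaches $f(x^{(k)}) - f^\star \leq \epsilon$ after $O(1/\epsilon)$ iterations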
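The $1/k$ bound can be checked on a problem whose solution is known in closed form. The sketch below assumes a separable box-constrained quadratic (diagonal $A$, box $[0,1]^n$), chosen only because its minimizer is a componentwise clip; the data and variable names are made up for this illustration.

```python
import numpy as np

# Assumed test problem: minimize 0.5 * sum(a_i x_i^2) + b^T x  subject to 0 <= x <= 1.
a = np.array([1.0, 2.0, 5.0, 10.0])      # diagonal of A (all positive)
b = np.array([-3.0, 1.0, -2.0, -4.0])

def f(x):
    return 0.5 * np.sum(a * x**2) + b @ x

# Separable problem: each coordinate is minimized by clipping -b_i / a_i to [0, 1].
x_star = np.clip(-b / a, 0.0, 1.0)
f_star = f(x_star)

t = 1.0 / a.max()                         # fixed step 1/L with L = lambda_max(A)
x = np.zeros_like(a)
r0_sq = np.sum((x - x_star) ** 2)         # ||x^(0) - x*||^2

gaps, bounds = [], []
for k in range(1, 201):
    x = np.clip(x - t * (a * x + b), 0.0, 1.0)   # projected gradient step
    gaps.append(f(x) - f_star)
    bounds.append(r0_sq / (2 * k * t))

print(gaps[0], bounds[0])
print(gaps[-1], bounds[-1])
```

On this easy problem the observed gap falls far below the bound, which is a worst-case guarantee rather than a prediction of typical behavior.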

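Quadratic program with box constraints (6-16)

$$\text{minimize} \quad \tfrac{1}{2} x^T A x + b^T x \qquad \text{subject to} \quad 0 \preceq x \preceq \mathbf{1}$$

[Figure: relative error $(f(x^{(k)}) - f^\star)/|f^\star|$ versus iteration $k$.]

$n = 3000$; fixed step size $t = 1/\lambda_{\max}(A)$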
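A small-scale sketch of this experiment, using projection on the box as the prox-operator. The problem size ($n = 50$ rather than 3000), the random positive definite $A$, the random $b$, and the iteration count are assumptions for illustration; they are not the data behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                    # assumed size, much smaller than the lecture's n
M = rng.normal(size=(n, n))
A = M.T @ M / n + 0.1 * np.eye(n)         # symmetric positive definite (assumed data)
b = rng.normal(size=n)

def f(x):
    return 0.5 * x @ A @ x + b @ x

t = 1.0 / np.linalg.eigvalsh(A)[-1]       # fixed step 1 / lambda_max(A)
x = np.zeros(n)
history = [f(x)]
for k in range(2000):
    # gradient step on the quadratic, then projection on the box 0 <= x <= 1
    x = np.clip(x - t * (A @ x + b), 0.0, 1.0)
    history.append(f(x))

# At a solution, x is a fixed point of the projected gradient update.
residual = np.linalg.norm(x - np.clip(x - t * (A @ x + b), 0.0, 1.0))
print(history[0], history[-1], residual)
```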

1-norm regularized least-squares (6-17)

$$\text{minimize} \quad \frac{1}{2}\|Ax - b\|_2^2 + \|x\|_1$$

[Figure: relative error $(f(x^{(k)}) - f^\star)/f^\star$ versus iteration $k$.]

randomly generated real matrix $A$; step $t_k = 1/L$ with $L = \lambda_{\max}(A^T A)$
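For this problem the proximal gradient method becomes an iterative soft-thresholding scheme. The sketch below assumes a small random instance ($40 \times 20$) and a fixed iteration count, both chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 40, 20                              # assumed dimensions, for illustration only
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

def soft_threshold(u, t):
    """prox of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + np.sum(np.abs(x))

L = np.linalg.eigvalsh(A.T @ A)[-1]        # Lipschitz constant of the gradient of g
t = 1.0 / L
x = np.zeros(n)
history = [f(x)]
for k in range(2000):
    grad = A.T @ (A @ x - b)               # gradient of 0.5 * ||Ax - b||^2
    x = soft_threshold(x - t * grad, t)
    history.append(f(x))

# Optimality requires every entry of A^T (A x - b) to lie in [-1, 1].
max_corr = np.max(np.abs(A.T @ (A @ x - b)))
print(history[-1], max_corr)
```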

Line search (6-18)

- the analysis for fixed step size (page 6-12) starts with the inequality

$$g(x - t G_t(x)) \leq g(x) - t \nabla g(x)^T G_t(x) + \frac{t}{2}\|G_t(x)\|_2^2 \qquad (1)$$

  this inequality is known to hold for $0 < t \leq 1/L$
- if $L$ is not known, we can satisfy (1) by a backtracking line search: start at some $t := \hat{t} > 0$ and backtrack ($t := \beta t$) until (1) holds
- step size $t$ selected by the line search satisfies $t \geq t_{\min} = \min\{\hat{t}, \beta/L\}$
- requires one evaluation of $g$ and $\mathrm{prox}_{th}$ per line search iteration

several other types of line search work
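The backtracking rule can be sketched as follows. The function `backtracking_step`, its default parameters, the cap on the number of backtracks, and the small floating-point slack in the test of (1) are assumptions added for this sketch; the lasso instance is made-up test data.

```python
import numpy as np

def backtracking_step(x, g, grad_g, prox_h, t_hat=1.0, beta=0.5, max_backtracks=60):
    """One proximal gradient step; t is shrunk by beta until inequality (1) holds."""
    gx, dgx = g(x), grad_g(x)
    slack = 1e-12 * max(1.0, abs(gx))      # assumed guard against rounding error
    t = t_hat
    for _ in range(max_backtracks):        # assumed cap; avoids an endless loop
        x_plus = prox_h(x - t * dgx, t)    # one prox evaluation per trial step
        G = (x - x_plus) / t               # gradient map G_t(x)
        if g(x_plus) <= gx - t * (dgx @ G) + 0.5 * t * (G @ G) + slack:
            break
        t *= beta
    return x_plus, t

# Assumed test instance: g(x) = 0.5 * ||Ax - b||^2, h(x) = ||x||_1.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))
b = rng.normal(size=30)
g = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_g = lambda x: A.T @ (A @ x - b)
prox_h = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)
f = lambda x: g(x) + np.sum(np.abs(x))

x = np.zeros(10)
steps, history = [], [f(x)]
for _ in range(100):
    x, t = backtracking_step(x, g, grad_g, prox_h)
    steps.append(t)
    history.append(f(x))

L = np.linalg.eigvalsh(A.T @ A)[-1]
print(min(steps), min(1.0, 0.5 / L))       # accepted steps versus t_min
```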

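Line search, continued (6-19)

example: line search for projected gradient method

$$x^+ = P_C(x - t \nabla g(x)) = x - t G_t(x)$$

[Figure: the set $C$ and the point $x$, with the trial points $x - \hat{t} \nabla g(x)$, $x - \beta \hat{t} \nabla g(x)$ and their projections $P_C(x - \hat{t} \nabla g(x))$, $P_C(x - \beta \hat{t} \nabla g(x))$, $P_C(x - \beta^2 \hat{t} \nabla g(x))$.]

backtrack until $x - t G_t(x)$ satisfies sufficient decrease inequality (1)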

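Analysis with line search (6-20)

from p. 6-14, if (1) holds in iteration $i$, then $f(x^{(i)}) < f(x^{(i-1)})$ and

$$f(x^{(i)}) - f^\star \leq \frac{1}{2t_i}\left( \|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2 \right)$$

$$\leq \frac{1}{2t_{\min}}\left( \|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2 \right)$$

- adding inequalities for $i = 1$ to $i = k$ gives

$$\sum_{i=1}^k \left( f(x^{(i)}) - f^\star \right) \leq \frac{1}{2t_{\min}}\|x^{(0)} - x^\star\|_2^2$$

- since $f(x^{(i)})$ is nonincreasing, obtain similar $1/k$ bound as for fixed $t_i$:

$$f(x^{(k)}) - f^\star \leq \frac{1}{2k t_{\min}}\|x^{(0)} - x^\star\|_2^2$$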

References (6-21)

convergence analysis of proximal gradient method

- A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences (2009)
- A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery, in: Y. Eldar and D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications (2009)
