6. Proximal gradient method


L. Vandenberghe   EE236C (Spring 2016)

6. Proximal gradient method

• motivation
• proximal mapping
• proximal gradient method with fixed step size
• proximal gradient method with line search

6-1

Proximal mapping

the proximal mapping (or prox-operator) of a convex function h is defined as

    prox_h(x) = argmin_u ( h(u) + (1/2) \|u - x\|_2^2 )

Examples

• h(x) = 0: prox_h(x) = x

• h(x) is the indicator function of a closed convex set C: prox_h is projection on C

      prox_h(x) = argmin_{u \in C} \|u - x\|_2^2 = P_C(x)

• h(x) = \|x\|_1: prox_h is the soft-threshold (shrinkage) operation

      prox_h(x)_i = x_i - 1   if x_i \ge 1
                    0         if |x_i| \le 1
                    x_i + 1   if x_i \le -1

Proximal gradient method    6-2
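
The three examples translate directly into code. The following is a small illustrative sketch (not part of the original slides; NumPy assumed), with a box [lo, hi]^n standing in for a generic closed convex set C:

```python
import numpy as np

def prox_zero(x):
    # h(x) = 0: the prox-operator is the identity
    return x

def prox_box(x, lo, hi):
    # h = indicator of the box C = {x : lo <= x <= hi}: prox_h is projection onto C
    return np.clip(x, lo, hi)

def prox_l1(x, t=1.0):
    # h(x) = t * ||x||_1: componentwise soft-thresholding (t = 1 gives the case above)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# quick check of the soft-threshold formula on a few values
print(prox_l1(np.array([-2.5, -0.3, 0.0, 0.7, 1.4])))   # [-1.5 -0.  0.  0.  0.4]
```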

Proximal gradient method

unconstrained optimization with the objective split in two components

    minimize  f(x) = g(x) + h(x)

• g convex, differentiable, with dom g = R^n
• h convex with inexpensive prox-operator (many examples in lecture 8)

Proximal gradient algorithm

    x^{(k)} = prox_{t_k h} ( x^{(k-1)} - t_k \nabla g(x^{(k-1)}) )

• t_k > 0 is the step size, constant or determined by line search
• can start at an infeasible x^{(0)} (however x^{(k)} \in dom f = dom h for k \ge 1)

Proximal gradient method    6-3
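
A minimal sketch of this iteration with a fixed step size (illustrative; grad_g and prox_h are hypothetical callables supplied by the caller):

```python
import numpy as np

def proximal_gradient(grad_g, prox_h, x0, t, num_iters=500):
    """Fixed-step proximal gradient method for minimizing g(x) + h(x).

    grad_g(x)    -- gradient of the differentiable term g
    prox_h(u, t) -- prox-operator of t*h evaluated at u
    x0           -- starting point (may be infeasible for h)
    t            -- constant step size (t <= 1/L guarantees convergence)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = prox_h(x - t * grad_g(x), t)   # x(k) = prox_{t h}(x(k-1) - t grad g(x(k-1)))
    return x
```

With prox_h set to the soft-threshold operator from the previous sketch, this reproduces the soft-thresholding iteration of page 6-6.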

Interpretation

    x^+ = prox_{th}(x - t \nabla g(x))

from the definition of the proximal mapping:

    x^+ = argmin_u ( h(u) + (1/(2t)) \|u - x + t \nabla g(x)\|_2^2 )
        = argmin_u ( h(u) + g(x) + \nabla g(x)^T (u - x) + (1/(2t)) \|u - x\|_2^2 )

x^+ minimizes h(u) plus a simple quadratic local model of g(u) around x

Proximal gradient method    6-4

Examples

    minimize  g(x) + h(x)

Gradient method: special case with h(x) = 0

    x^+ = x - t \nabla g(x)

Gradient projection method: special case with h(x) = \delta_C(x) (indicator of C)

    x^+ = P_C(x - t \nabla g(x))

[figure: the gradient step x - t \nabla g(x) and its projection x^+ = P_C(x - t \nabla g(x)) onto the set C]

Proximal gradient method    6-5

Examples

Soft-thresholding: special case with h(x) = \|x\|_1

    x^+ = prox_{th}(x - t \nabla g(x))

where

    (prox_{th}(u))_i = u_i - t   if u_i \ge t
                       0         if |u_i| \le t
                       u_i + t   if u_i \le -t

[figure: graph of (prox_{th}(u))_i as a function of u_i: zero on [-t, t], slope 1 outside]

Proximal gradient method    6-6

Outline

• introduction
• proximal mapping
• proximal gradient method with fixed step size
• proximal gradient method with line search

Proximal mapping

if h is convex and closed (has a closed epigraph), then

    prox_h(x) = argmin_u ( h(u) + (1/2) \|u - x\|_2^2 )

exists and is unique for all x (will be studied in more detail in lecture 8)

from the optimality conditions of the minimization in the definition:

    u = prox_h(x)  ⟺  x - u \in \partial h(u)
                   ⟺  h(z) \ge h(u) + (x - u)^T (z - u)   for all z

Proximal gradient method    6-7
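
As an aside (not in the slides), the subgradient characterization is easy to check numerically for h(x) = \|x\|_1, where \partial h(u)_i = {sign(u_i)} if u_i \ne 0 and [-1, 1] if u_i = 0; the sketch below assumes NumPy:

```python
import numpy as np

def prox_l1(x):
    # prox-operator of h(x) = ||x||_1 (soft-threshold with level 1)
    return np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(scale=2.0, size=10)
u = prox_l1(x)
g = x - u                                                 # should be a subgradient of ||.||_1 at u

ok_nonzero = np.allclose(g[u != 0], np.sign(u[u != 0]))   # g_i = sign(u_i) where u_i != 0
ok_zero = np.all(np.abs(g[u == 0]) <= 1.0)                # |g_i| <= 1 where u_i = 0
print(ok_nonzero and ok_zero)                             # True
```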

Projection on closed convex set

the proximal mapping of the indicator function \delta_C is the Euclidean projection on C:

    prox_{\delta_C}(x) = argmin_{u \in C} \|u - x\|_2^2 = P_C(x)

    u = P_C(x)  ⟺  (x - u)^T (z - u) \le 0   for all z \in C

[figure: a point x, its projection u = P_C(x) onto the convex set C]

we will see that proximal mappings have many properties of projections

Proximal gradient method    6-8

Firm nonexpansiveness

proximal mappings are firmly nonexpansive (co-coercive with constant 1):

    (prox_h(x) - prox_h(y))^T (x - y) \ge \|prox_h(x) - prox_h(y)\|_2^2

follows from page 6-7: if u = prox_h(x) and v = prox_h(y), then

    x - u \in \partial h(u),    y - v \in \partial h(v)

combining this with monotonicity of the subdifferential (page 4-9) gives

    (x - u - y + v)^T (u - v) \ge 0

a weaker property is nonexpansiveness (Lipschitz continuity with constant 1):

    \|prox_h(x) - prox_h(y)\|_2 \le \|x - y\|_2

this follows from firm nonexpansiveness and the Cauchy-Schwarz inequality

Proximal gradient method    6-9
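
A quick numerical illustration (not in the slides; NumPy assumed) of both properties for the \ell_1 prox-operator on random pairs of points:

```python
import numpy as np

def prox_l1(x):
    return np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = 3.0 * rng.normal(size=(2, 20))
    u, v = prox_l1(x), prox_l1(y)
    # firm nonexpansiveness: (u - v)^T (x - y) >= ||u - v||_2^2
    assert (u - v) @ (x - y) >= (u - v) @ (u - v) - 1e-10
    # nonexpansiveness: ||u - v||_2 <= ||x - y||_2
    assert np.linalg.norm(u - v) <= np.linalg.norm(x - y) + 1e-10
print("firm nonexpansiveness verified on 1000 random pairs")
```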

Outline

• introduction
• proximal mapping
• proximal gradient method with fixed step size
• proximal gradient method with line search

Assumptions

    minimize  f(x) = g(x) + h(x)

• h is closed and convex (so that prox_{th} is well defined)
• g is differentiable with dom g = R^n
• there exist constants m \ge 0 and L > 0 such that the functions

      g(x) - (m/2) x^T x,    (L/2) x^T x - g(x)

  are convex
• the optimal value f^\star is finite and attained at x^\star (not necessarily unique)

Proximal gradient method    6-10

Implications of assumptions on g

Lower bound: convexity of the function g(x) - (m/2) x^T x implies (page 1-18)

    g(y) \ge g(x) + \nabla g(x)^T (y - x) + (m/2) \|y - x\|_2^2   for all x, y    (1)

if m = 0, this means g is convex; if m > 0, g is strongly convex (lecture 1)

Upper bound: convexity of the function (L/2) x^T x - g(x) implies (page 1-12)

    g(y) \le g(x) + \nabla g(x)^T (y - x) + (L/2) \|y - x\|_2^2   for all x, y    (2)

this is equivalent to Lipschitz continuity and co-coercivity of the gradient (lecture 1)

Proximal gradient method    6-11
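
For a convex quadratic g(x) = (1/2) x^T A x + b^T x the constants can be taken as m = \lambda_{min}(A) and L = \lambda_{max}(A); the sketch below (illustrative, NumPy assumed) checks the bounds (1) and (2) at random points:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
M = rng.normal(size=(n, n))
A = M.T @ M + 0.1 * np.eye(n)              # positive definite Hessian
b = rng.normal(size=n)

g = lambda x: 0.5 * x @ A @ x + b @ x
grad = lambda x: A @ x + b
m, L = np.linalg.eigvalsh(A)[[0, -1]]      # smallest and largest eigenvalue of A

for _ in range(100):
    x, y = rng.normal(size=(2, n))
    lin = g(x) + grad(x) @ (y - x)         # first-order expansion around x
    d = np.sum((y - x) ** 2)
    assert lin + 0.5 * m * d <= g(y) + 1e-8    # lower bound (1)
    assert g(y) <= lin + 0.5 * L * d + 1e-8    # upper bound (2)
print("bounds (1) and (2) hold with m = %.2f, L = %.2f" % (m, L))
```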

Gradient map

    G_t(x) = (1/t) ( x - prox_{th}(x - t \nabla g(x)) )

• G_t(x) is the negative step in the proximal gradient update

      x^+ = prox_{th}(x - t \nabla g(x)) = x - t G_t(x)

• G_t(x) is not a gradient or subgradient of f = g + h; from the subgradient definition of the prox-operator (page 6-7),

      G_t(x) \in \nabla g(x) + \partial h(x - t G_t(x))

• G_t(x) = 0 if and only if x minimizes f(x) = g(x) + h(x)

Proximal gradient method    6-12
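
An illustrative sketch (not part of the slides; NumPy assumed) for the scalar problem minimize (1/2)(x - a)^2 + \lambda |x|, whose minimizer is the soft-threshold of a; the gradient map vanishes there for every t > 0:

```python
import numpy as np

def soft(u, tau):
    # prox-operator of tau * |.|
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def gradient_map(x, t, grad_g, prox_th):
    # G_t(x) = (x - prox_{t h}(x - t * grad g(x))) / t
    return (x - prox_th(x - t * grad_g(x), t)) / t

a, lam = 2.0, 0.5
grad_g = lambda x: x - a                     # g(x) = 0.5 * (x - a)^2
prox_th = lambda u, t: soft(u, lam * t)      # prox of t * lam * |.|

x_star = soft(a, lam)                        # closed-form minimizer (here 1.5)
for t in (0.1, 0.5, 1.0):
    print(t, gradient_map(x_star, t, grad_g, prox_th))   # G_t(x_star) = 0.0 for each t
```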

Consequences of quadratic bounds on g

substitute y = x - t G_t(x) in the bounds (1) and (2): for all t,

    (mt^2/2) \|G_t(x)\|_2^2 \le g(x - t G_t(x)) - g(x) + t \nabla g(x)^T G_t(x) \le (Lt^2/2) \|G_t(x)\|_2^2

• if 0 < t \le 1/L, then the upper bound implies

      g(x - t G_t(x)) \le g(x) - t \nabla g(x)^T G_t(x) + (t/2) \|G_t(x)\|_2^2    (3)

• if inequality (3) is satisfied and t G_t(x) \ne 0, then mt \le 1

• if inequality (3) is satisfied, then for all z,

      f(x - t G_t(x)) \le f(z) + G_t(x)^T (x - z) - (t/2) \|G_t(x)\|_2^2 - (m/2) \|x - z\|_2^2    (4)

  (proof on next page)

Proximal gradient method    6-13

Proof of (4):

    f(x - t G_t(x))
      \le g(x) - t \nabla g(x)^T G_t(x) + (t/2) \|G_t(x)\|_2^2 + h(x - t G_t(x))
      \le g(z) - \nabla g(x)^T (z - x) - (m/2) \|z - x\|_2^2 - t \nabla g(x)^T G_t(x) + (t/2) \|G_t(x)\|_2^2
            + h(z) - (G_t(x) - \nabla g(x))^T (z - x + t G_t(x))
      = g(z) + h(z) + G_t(x)^T (x - z) - (t/2) \|G_t(x)\|_2^2 - (m/2) \|x - z\|_2^2

• in the first step we add h(x - t G_t(x)) to both sides of inequality (3)
• in the next step we use the lower bound on g(z) from (1) and

      G_t(x) - \nabla g(x) \in \partial h(x - t G_t(x))    (see page 6-12)

Proximal gradient method    6-14

Progress in one iteration

for a step size t that satisfies inequality (3), define x^+ = x - t G_t(x)

• inequality (4) with z = x shows that the algorithm is a descent method:

      f(x^+) \le f(x) - (t/2) \|G_t(x)\|_2^2

• inequality (4) with z = x^\star shows that

      f(x^+) - f^\star \le G_t(x)^T (x - x^\star) - (t/2) \|G_t(x)\|_2^2 - (m/2) \|x - x^\star\|_2^2
                       = (1/(2t)) ( \|x - x^\star\|_2^2 - \|x - x^\star - t G_t(x)\|_2^2 ) - (m/2) \|x - x^\star\|_2^2
                       = (1/(2t)) ( (1 - mt) \|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2 )    (5)
                       \le (1/(2t)) ( \|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2 )    (6)

Proximal gradient method    6-15
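
The descent property can be observed directly. The sketch below (illustrative; random 1-norm regularized least-squares data, NumPy assumed) checks f(x^+) \le f(x) - (t/2) \|G_t(x)\|_2^2 along the iteration with t = 1/L:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 40, 60
A = rng.normal(size=(p, n)); b = rng.normal(size=p); lam = 1.0
t = 1.0 / np.linalg.eigvalsh(A.T @ A)[-1]          # t = 1/L

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
grad_g = lambda x: A.T @ (A @ x - b)
prox = lambda u, s: np.sign(u) * np.maximum(np.abs(u) - lam * s, 0.0)

x = rng.normal(size=n)
for _ in range(50):
    x_plus = prox(x - t * grad_g(x), t)
    G = (x - x_plus) / t                           # gradient map G_t(x)
    assert f(x_plus) <= f(x) - 0.5 * t * (G @ G) + 1e-9   # descent inequality
    x = x_plus
print("descent inequality holds at every tested iterate")
```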

Analysis for fixed step size

add the inequalities (6) for x = x^{(i-1)}, x^+ = x^{(i)}, t = t_i = 1/L:

    \sum_{i=1}^k ( f(x^{(i)}) - f^\star )
        \le (1/(2t)) \sum_{i=1}^k ( \|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2 )
        = (1/(2t)) ( \|x^{(0)} - x^\star\|_2^2 - \|x^{(k)} - x^\star\|_2^2 )
        \le (1/(2t)) \|x^{(0)} - x^\star\|_2^2

since f(x^{(i)}) is nonincreasing,

    f(x^{(k)}) - f^\star \le (1/k) \sum_{i=1}^k ( f(x^{(i)}) - f^\star ) \le (1/(2kt)) \|x^{(0)} - x^\star\|_2^2

Proximal gradient method    6-16
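
The 1/k bound can be illustrated numerically. In the sketch below (illustrative; NumPy assumed) the unknown f^\star and x^\star are replaced by surrogates from a long run, so the comparison is approximate rather than a proof:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 50, 100
A = rng.normal(size=(p, n)); b = rng.normal(size=p); lam = 1.0
t = 1.0 / np.linalg.eigvalsh(A.T @ A)[-1]              # fixed step t = 1/L

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def step(x):
    z = x - t * A.T @ (A @ x - b)                      # gradient step on g
    return np.sign(z) * np.maximum(np.abs(z) - lam * t, 0.0)   # prox of t*lam*||.||_1

x0 = np.zeros(n)
x, fvals = x0.copy(), []
for _ in range(5000):
    x = step(x)
    fvals.append(f(x))
f_star, x_star = fvals[-1], x                          # surrogates for f*, x*

for k in (10, 100, 1000):
    bound = np.sum((x0 - x_star) ** 2) / (2 * k * t)
    print(k, fvals[k - 1] - f_star, "<=", bound)       # gap after k steps vs. 1/(2kt) bound
```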

Distance to optimal set

from (5) and f(x^+) \ge f^\star, the distance to the optimal set does not increase:

    \|x^+ - x^\star\|_2^2 \le (1 - mt) \|x - x^\star\|_2^2 \le \|x - x^\star\|_2^2

for fixed step size t_k = 1/L:

    \|x^{(k)} - x^\star\|_2^2 \le c^k \|x^{(0)} - x^\star\|_2^2,    c = 1 - m/L

i.e., linear convergence if g is strongly convex (m > 0)

Proximal gradient method    6-17

Example: quadratic program with box constraints

    minimize    (1/2) x^T A x + b^T x
    subject to  0 \le x \le 1

[figure: relative error (f(x^{(k)}) - f^\star)/f^\star versus iteration number k]

n = 3000; fixed step size t = 1/\lambda_{max}(A)

Proximal gradient method    6-18
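
A sketch of the method for this problem class (illustrative; randomly generated data at a smaller n than the figure's n = 3000; NumPy assumed), where the prox step is projection onto the box:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300                                   # small stand-in for the n = 3000 instance
M = rng.normal(size=(n, n))
A = M.T @ M                               # positive semidefinite A
b = rng.normal(size=n)

t = 1.0 / np.linalg.eigvalsh(A)[-1]       # fixed step size t = 1/lambda_max(A)

x = np.zeros(n)
for _ in range(2000):
    x = np.clip(x - t * (A @ x + b), 0.0, 1.0)   # gradient projection step onto [0, 1]^n
print("objective:", 0.5 * x @ A @ x + b @ x)
```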

Example: 1-norm regularized least-squares

    minimize  (1/2) \|Ax - b\|_2^2 + \|x\|_1

[figure: relative error (f(x^{(k)}) - f^\star)/f^\star versus iteration number k]

randomly generated A; step size t_k = 1/L with L = \lambda_{max}(A^T A)

Proximal gradient method    6-19
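
A compact sketch (illustrative; randomly generated data, not the instance behind the figure; NumPy assumed) of the proximal gradient iteration for this problem:

```python
import numpy as np

rng = np.random.default_rng(6)
p, n = 200, 500
A = rng.normal(size=(p, n))
b = rng.normal(size=p)

L = np.linalg.eigvalsh(A.T @ A)[-1]            # Lipschitz constant of grad g
t = 1.0 / L

def soft(u, tau):
    # soft-threshold: prox-operator of tau * ||.||_1
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

x = np.zeros(n)
for _ in range(3000):
    x = soft(x - t * A.T @ (A @ x - b), t)     # prox step for h(x) = ||x||_1
print("objective:", 0.5 * np.sum((A @ x - b) ** 2) + np.sum(np.abs(x)))
```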

Outline

• introduction
• proximal mapping
• proximal gradient method with fixed step size
• proximal gradient method with line search

Line search

the analysis for fixed step size (page 6-13) starts with the inequality

    g(x - t G_t(x)) \le g(x) - t \nabla g(x)^T G_t(x) + (t/2) \|G_t(x)\|_2^2    (3)

• this inequality is known to hold for 0 < t \le 1/L
• if L is not known, we can satisfy (3) by a backtracking line search:
  start at some t := \hat{t} > 0 and backtrack (t := \beta t) until (3) holds
• the step size t selected by the line search satisfies t \ge t_{min} = min{ \hat{t}, \beta/L }
• requires one evaluation of g and one evaluation of prox_{th} per line search iteration

several other types of line search work

Proximal gradient method    6-20
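
A sketch of the backtracking rule (illustrative; g, grad_g and prox_h are hypothetical callables supplied by the user), which shrinks t by a factor \beta until (3) holds:

```python
import numpy as np

def prox_gradient_step_with_linesearch(x, g, grad_g, prox_h, t_hat=1.0, beta=0.5):
    """One proximal gradient step with backtracking on condition (3).

    g(x), grad_g(x) -- value and gradient of the smooth term
    prox_h(u, t)    -- prox-operator of t*h
    t_hat           -- initial trial step size
    beta            -- backtracking factor in (0, 1)
    """
    gx, dgx = g(x), grad_g(x)
    t = t_hat
    while True:
        x_plus = prox_h(x - t * dgx, t)
        G = (x - x_plus) / t                   # gradient map G_t(x)
        # sufficient decrease condition (3)
        if g(x_plus) <= gx - t * dgx @ G + 0.5 * t * (G @ G):
            return x_plus, t
        t *= beta
```

Each pass through the loop costs one evaluation of g and one prox evaluation, as noted above; one common variant reuses the returned t as the trial step for the next iteration.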

Example: line search for gradient projection method

    x^+ = P_C(x - t \nabla g(x)) = x - t G_t(x)

[figure: the trial points x - \hat{t} \nabla g(x) and x - \beta\hat{t} \nabla g(x) and their projections P_C(x - \hat{t} \nabla g(x)), P_C(x - \beta\hat{t} \nabla g(x)) onto the set C]

backtrack until P_C(x - t \nabla g(x)) satisfies the sufficient decrease inequality (3)

Proximal gradient method    6-21

Analysis with line search

from page 6-15, if (3) holds in iteration i, then f(x^{(i)}) < f(x^{(i-1)}) and

    f(x^{(i)}) - f^\star \le (1/(2 t_i)) ( \|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2 )
                         \le (1/(2 t_{min})) ( \|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2 )

adding the inequalities for i = 1 to i = k gives

    \sum_{i=1}^k ( f(x^{(i)}) - f^\star ) \le (1/(2 t_{min})) \|x^{(0)} - x^\star\|_2^2

since f(x^{(i)}) is nonincreasing, we obtain a similar 1/k bound as for fixed t_i:

    f(x^{(k)}) - f^\star \le (1/(2 k t_{min})) \|x^{(0)} - x^\star\|_2^2

Proximal gradient method    6-22

Distance to optimal set

from page 6-15, if (3) holds in iteration i, then

    \|x^{(i)} - x^\star\|_2^2 \le (1 - m t_i) \|x^{(i-1)} - x^\star\|_2^2
                              \le (1 - m t_{min}) \|x^{(i-1)} - x^\star\|_2^2
                              = c \|x^{(i-1)} - x^\star\|_2^2

therefore

    \|x^{(k)} - x^\star\|_2^2 \le c^k \|x^{(0)} - x^\star\|_2^2,    c = 1 - m t_{min} = max{ 1 - \beta m/L, 1 - m \hat{t} }

hence linear convergence if m > 0

Proximal gradient method    6-23

Summary

the proximal gradient method minimizes sums of differentiable and nondifferentiable convex functions

    f(x) = g(x) + h(x)

• useful when the nondifferentiable term h is simple (has an inexpensive prox-operator)
• convergence properties are similar to those of the standard gradient method (h(x) = 0)
• less general but faster than the subgradient method

Proximal gradient method    6-24

References

Convergence analysis:

• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences (2009).
• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery, in: Y. Eldar and D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications (2009).
• Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (2004).
• B. T. Polyak, Introduction to Optimization (1987).

Proximal gradient method    6-25
