The proximal mapping
http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html
Acknowledgement: these slides are based on Prof. Lieven Vandenberghe's lecture notes
Outline
1. Closed function
2. Conjugate function
3. Proximal mapping
Closed set
a set C is closed if it contains its boundary:
    x_k ∈ C, x_k → x  =>  x ∈ C
operations that preserve closedness:
- the intersection of (finitely or infinitely many) closed sets is closed
- the union of a finite number of closed sets is closed
- inverse image under a linear mapping: {x : Ax ∈ C} is closed if C is closed
Image under linear mapping
the image of a closed set under a linear mapping is not necessarily closed
example (C is closed, AC = {Ax : x ∈ C} is open):
    C = {(x_1, x_2) ∈ R^2_+ : x_1 x_2 >= 1},   A = [1 0],   AC = R_++
sufficient condition: AC is closed if C is closed and convex and C does not have a recession direction in the nullspace of A, i.e.,
    Ay = 0,  x̂ ∈ C,  x̂ + αy ∈ C for all α >= 0   =>   y = 0
in particular, this holds for any A if C is bounded
Closed function
definition: a function is closed if its epigraph is a closed set
examples:
- f(x) = -log(1 - x^2) with dom f = {x : |x| < 1}
- f(x) = x log x with dom f = R_+ and f(0) = 0
- the indicator function of a closed set C: f(x) = 0 if x ∈ C = dom f
not closed:
- f(x) = x log x with dom f = R_++, or with dom f = R_+ and f(0) = 1
- the indicator function of a set C if C is not closed
Properties
sublevel sets: f is closed if and only if all its sublevel sets are closed
minimum: if f is closed with bounded sublevel sets then it has a minimizer
Weierstrass: suppose that the set D ⊆ E (a finite-dimensional vector space over R) is nonempty and closed, and that all sublevel sets of the continuous function f : D → R are bounded; then f has a global minimizer
common operations on convex functions that preserve closedness:
- sum: f + g is closed if f and g are closed (and dom f ∩ dom g ≠ ∅)
- composition with an affine mapping: f(Ax + b) is closed if f is closed
- supremum: sup_α f_α(x) is closed if each function f_α is closed
Outline
1. Closed function
2. Conjugate function
3. Proximal mapping
Conjugate function
the conjugate of a function f is
    f*(y) = sup_{x ∈ dom f} (y^T x - f(x))
f* is closed and convex even if f is not
Fenchel's inequality:
    f(x) + f*(y) >= x^T y   for all x, y
(extends the inequality x^T x/2 + y^T y/2 >= x^T y to non-quadratic convex f)
Quadratic function
    f(x) = (1/2) x^T A x + b^T x + c
strictly convex case (A ≻ 0):
    f*(y) = (1/2)(y - b)^T A^{-1} (y - b) - c
general convex case (A ⪰ 0):
    f*(y) = (1/2)(y - b)^T A^† (y - b) - c,   dom f* = range(A) + b
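As a quick numerical sanity check (a NumPy sketch, not part of the original slides; the data below is randomly generated for illustration), the closed-form conjugate of a strictly convex quadratic can be compared against the supremum y^T x - f(x), which is attained where grad f(x) = y, i.e. at x* = A^{-1}(y - b):

```python
import numpy as np

# Illustrative check of f*(y) = (1/2)(y - b)^T A^{-1} (y - b) - c
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)        # positive definite by construction
b = rng.standard_normal(3)
c = 0.7
y = rng.standard_normal(3)

# closed-form conjugate
fstar_closed = 0.5 * (y - b) @ np.linalg.solve(A, y - b) - c

# supremum of y^T x - f(x), attained at x* = A^{-1}(y - b)
xstar = np.linalg.solve(A, y - b)
fstar_sup = y @ xstar - (0.5 * xstar @ A @ xstar + b @ xstar + c)

assert np.isclose(fstar_closed, fstar_sup)
```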
Negative entropy and negative logarithm
negative entropy:
    f(x) = sum_{i=1}^n x_i log x_i,   f*(y) = sum_{i=1}^n e^{y_i - 1}
negative logarithm:
    f(x) = -sum_{i=1}^n log x_i,   f*(y) = -sum_{i=1}^n log(-y_i) - n
matrix logarithm:
    f(X) = -log det X (dom f = S^n_++),   f*(Y) = -log det(-Y) - n
Indicator function and norm
indicator of a convex set C: the conjugate is the support function of C
    f(x) = 0 if x ∈ C, +∞ if x ∉ C;   f*(y) = sup_{x ∈ C} y^T x
norm: the conjugate is the indicator of the unit dual norm ball
    f(x) = ||x||;   f*(y) = 0 if ||y||_* <= 1, +∞ if ||y||_* > 1
(proof below)
proof: recall the definition of the dual norm:
    ||y||_* = sup_{||x|| <= 1} x^T y
to evaluate f*(y) = sup_x (y^T x - ||x||) we distinguish two cases
- if ||y||_* <= 1, then (by definition of the dual norm)
      y^T x <= ||x||  for all x
  and equality holds if x = 0; therefore sup_x (y^T x - ||x||) = 0
- if ||y||_* > 1, there exists an x with ||x|| <= 1, x^T y > 1; then
      f*(y) >= y^T (tx) - ||tx|| = t(y^T x - ||x||)
  and the right-hand side goes to infinity as t → ∞
The second conjugate
    f**(x) = sup_{y ∈ dom f*} (x^T y - f*(y))
f**(x) is closed and convex
from Fenchel's inequality (x^T y - f*(y) <= f(x) for all y and x):
    f**(x) <= f(x)  for all x
equivalently, epi f ⊆ epi f** (for any f)
if f is closed and convex, then
    f**(x) = f(x)  for all x
equivalently, epi f = epi f** (if f is closed and convex); proof below
proof (f** = f if f is closed and convex): by contradiction
suppose (x, f**(x)) ∉ epi f; then there is a strict separating hyperplane:
    [a; b]^T [z - x; s - f**(x)] <= c < 0   for all (z, s) ∈ epi f
for some a, b, c with b <= 0 (b > 0 gives a contradiction as s → ∞)
- if b < 0, define y = a/(-b) and maximize the left-hand side over (z, s) ∈ epi f:
      f*(y) - y^T x + f**(x) <= c/(-b) < 0
  this contradicts Fenchel's inequality
- if b = 0, choose ŷ ∈ dom f* and add a small multiple of (ŷ, -1) to (a, b):
      [a + εŷ; -ε]^T [z - x; s - f**(x)] <= c + ε(f*(ŷ) - x^T ŷ + f**(x)) < 0
  now apply the argument for b < 0
Conjugates and subgradients
if f is closed and convex, then
    y ∈ ∂f(x)   <=>   x ∈ ∂f*(y)   <=>   x^T y = f(x) + f*(y)
proof: if y ∈ ∂f(x), then f*(y) = sup_u (y^T u - f(u)) = y^T x - f(x); hence for all v,
    f*(v) = sup_u (v^T u - f(u))
          >= v^T x - f(x)
          = x^T (v - y) - f(x) + y^T x
          = f*(y) + x^T (v - y)
therefore, x is a subgradient of f* at y (x ∈ ∂f*(y))
the reverse implication x ∈ ∂f*(y) => y ∈ ∂f(x) follows from f** = f
Some calculus rules
separable sum:
    f(x_1, x_2) = g(x_1) + h(x_2)   =>   f*(y_1, y_2) = g*(y_1) + h*(y_2)
scalar multiplication (for α > 0):
    f(x) = α g(x)   =>   f*(y) = α g*(y/α)
addition of an affine function:
    f(x) = g(x) + a^T x + b   =>   f*(y) = g*(y - a) - b
infimal convolution:
    f(x) = inf_{u+v=x} (g(u) + h(v))   =>   f*(y) = g*(y) + h*(y)
Outline
1. Closed function
2. Conjugate function
3. Proximal mapping
Proximal mapping
    prox_f(x) = argmin_u ( f(u) + (1/2) ||u - x||_2^2 )
if f is closed and convex then prox_f(x) exists and is unique for all x
- existence: f(u) + (1/2)||u - x||_2^2 is closed with bounded sublevel sets
- uniqueness: f(u) + (1/2)||u - x||_2^2 is strictly (in fact, strongly) convex
subgradient characterization:
    u = prox_f(x)   <=>   x - u ∈ ∂f(u)
we are interested in functions f for which prox_{tf} is inexpensive
Examples
quadratic function (A ⪰ 0):
    f(x) = (1/2) x^T A x + b^T x + c,   prox_{tf}(x) = (I + tA)^{-1}(x - tb)
Euclidean norm: f(x) = ||x||_2
    prox_{tf}(x) = (1 - t/||x||_2) x  if ||x||_2 >= t,   0 otherwise
logarithmic barrier:
    f(x) = -sum_{i=1}^n log x_i,   prox_{tf}(x)_i = (x_i + sqrt(x_i^2 + 4t))/2,   i = 1, ..., n
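The three formulas above can be verified numerically through the optimality condition t·∂f(u) + (u - x) ∋ 0. A minimal NumPy sketch (the test data is illustrative, not from the slides):

```python
import numpy as np

t = 0.5
x = np.array([1.5, -0.3, 2.0])

# quadratic f(x) = (1/2) x^T A x + b^T x: prox is a linear solve
A = np.diag([1.0, 2.0, 3.0])
b = np.array([0.1, -0.2, 0.3])
u = np.linalg.solve(np.eye(3) + t * A, x - t * b)
assert np.allclose(t * (A @ u + b) + u - x, 0)       # t*grad f(u) + u - x = 0

# Euclidean norm f(x) = ||x||_2 (here ||x||_2 > t, so the shrinkage branch applies)
u = (1 - t / np.linalg.norm(x)) * x
# optimality: x - u = t * u/||u||_2, the gradient of ||.||_2 at u != 0
assert np.allclose(x - u, t * u / np.linalg.norm(u))

# logarithmic barrier f(x) = -sum log x_i, componentwise formula
xp = np.array([0.5, 1.0, 2.0])
u = (xp + np.sqrt(xp**2 + 4 * t)) / 2
assert np.allclose(t * (-1 / u) + u - xp, 0)         # t*grad f(u) + u - x = 0
```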
Some simple calculus rules
separable sum:
    f(x, y) = g(x) + h(y),   prox_f(x, y) = (prox_g(x), prox_h(y))
scaling and translation of the argument: with λ ≠ 0,
    f(x) = g(λx + a),   prox_f(x) = (1/λ)(prox_{λ^2 g}(λx + a) - a)
scaling: with λ > 0,
    f(x) = λ g(x/λ),   prox_f(x) = λ prox_{λ^{-1} g}(x/λ)
Addition to linear or quadratic function
linear function:
    f(x) = g(x) + a^T x,   prox_f(x) = prox_g(x - a)
quadratic function: with μ > 0,
    f(x) = g(x) + (μ/2)||x - a||_2^2,   prox_f(x) = prox_{θg}(θx + (1 - θ)a),   where θ = 1/(1 + μ)
Moreau decomposition
    x = prox_f(x) + prox_{f*}(x)   for all x
follows from the properties of conjugates and subgradients:
    u = prox_f(x)   <=>   x - u ∈ ∂f(u)
                    <=>   u ∈ ∂f*(x - u)
                    <=>   x - u = prox_{f*}(x)
generalizes the decomposition by orthogonal projection onto subspaces:
    x = P_L(x) + P_{L⊥}(x)
if L is a subspace and L⊥ its orthogonal complement (this is the Moreau decomposition with f = I_L, f* = I_{L⊥})
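The decomposition is easy to see concretely with f = ||·||_1: then f* is the indicator of the l_∞ unit ball, prox_f is soft-thresholding, and prox_{f*} is clipping to [-1, 1]. A small sketch (example vector chosen for illustration):

```python
import numpy as np

# Moreau decomposition x = prox_f(x) + prox_{f*}(x), with f = ||.||_1
x = np.array([2.0, -0.4, 0.9, -3.0])

prox_f = np.sign(x) * np.maximum(np.abs(x) - 1, 0)   # soft-thresholding
prox_fstar = np.clip(x, -1, 1)                       # projection onto {||y||_inf <= 1}

assert np.allclose(prox_f + prox_fstar, x)
```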
Extended Moreau decomposition
for λ > 0:
    x = prox_{λf}(x) + λ prox_{λ^{-1} f*}(x/λ)   for all x
proof: apply the Moreau decomposition to λf:
    x = prox_{λf}(x) + prox_{(λf)*}(x)
      = prox_{λf}(x) + λ prox_{λ^{-1} f*}(x/λ)
the second line uses (λf)*(y) = λ f*(y/λ)
Composition with affine mapping
for general A, the prox-operator of f(x) = g(Ax + b) does not follow easily from the prox-operator of g
however, if AA^T = (1/α)I, we have
    prox_f(x) = (I - αA^T A)x + αA^T (prox_{α^{-1} g}(Ax + b) - b)
example: f(x_1, ..., x_m) = g(x_1 + x_2 + ... + x_m)
    prox_f(x_1, ..., x_m)_i = x_i - (1/m) sum_{j=1}^m x_j + (1/m) prox_{mg}( sum_{j=1}^m x_j )
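To illustrate the example (a sketch with hypothetical data, taking scalar blocks and g = |·| so that prox_{mg} is soft-thresholding by m), the result can be checked against the subgradient characterization x - u ∈ ∂f(u):

```python
import numpy as np

# prox of f(x) = g(x_1 + ... + x_m) with g = |.|; here A = [1 ... 1],
# so AA^T = m and alpha = 1/m in the formula above
m = 4
x = np.array([2.0, 1.0, 0.5, 3.0])
s = x.sum()
prox_mg = np.sign(s) * max(abs(s) - m, 0)    # prox of m*g at s (soft-threshold)
u = x - (s - prox_mg) / m

# optimality check: x - u must be a subgradient of f at u,
# i.e. equal to sign(sum(u)) * (1, ..., 1) when sum(u) != 0
assert np.allclose(x - u, np.sign(u.sum()) * np.ones(m))
```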
proof: u = prox_f(x) is the solution of the optimization problem, with variables u and y,
    min_{u,y}   g(y) + (1/2)||u - x||_2^2
    s.t.        Au + b = y
eliminate u using the expression
    u = x + A^T (AA^T)^{-1} (y - b - Ax) = (I - αA^T A)x + αA^T (y - b)
the optimal y is the minimizer of
    g(y) + (α^2/2) ||A^T (y - b - Ax)||_2^2 = g(y) + (α/2) ||y - b - Ax||_2^2
the solution is y = prox_{α^{-1} g}(Ax + b)
Projection on affine sets
hyperplane: C = {x : a^T x = b} (with a ≠ 0)
    P_C(x) = x + ((b - a^T x)/||a||_2^2) a
affine set: C = {x : Ax = b} (with A ∈ R^{p×n} and rank(A) = p)
    P_C(x) = x + A^T (AA^T)^{-1} (b - Ax)
inexpensive if p << n, or if AA^T = I, ...
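Both formulas are direct to implement; a short NumPy sketch (random data for illustration) confirming that the projected points satisfy the constraints:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(5)

# hyperplane {x : a^T x = b}
a = rng.standard_normal(5)
b = 2.0
p = x + (b - a @ x) / (a @ a) * a
assert np.isclose(a @ p, b)                  # p lies on the hyperplane

# affine set {x : A x = b2}, A with full row rank p << n
A = rng.standard_normal((2, 5))
b2 = rng.standard_normal(2)
p2 = x + A.T @ np.linalg.solve(A @ A.T, b2 - A @ x)
assert np.allclose(A @ p2, b2)               # p2 satisfies A p2 = b2
```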
Projection on simple polyhedral sets
halfspace: C = {x : a^T x <= b} (with a ≠ 0)
    P_C(x) = x + ((b - a^T x)/||a||_2^2) a   if a^T x > b
    P_C(x) = x                               if a^T x <= b
rectangle: C = [l, u] = {x : l <= x <= u}
    P_C(x)_i = l_i if x_i <= l_i,   x_i if l_i <= x_i <= u_i,   u_i if x_i >= u_i
nonnegative orthant: C = R^n_+
    P_C(x) = x_+   (x_+ is the componentwise maximum of 0 and x)
probability simplex: C = {x : 1^T x = 1, x >= 0}
    P_C(x) = (x - λ1)_+
where λ is the solution of the equation
    1^T (x - λ1)_+ = sum_{k=1}^n max{0, x_k - λ} = 1
generalization: C = {x : a^T x = b, l <= x <= u}
    P_C(x) = P_[l,u](x - λa)
where λ is the solution of a^T P_[l,u](x - λa) = b
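The scalar equation in λ is monotone, so it can be solved by bisection. A minimal sketch for the probability simplex (the helper name and test vector are illustrative):

```python
import numpy as np

def project_simplex(x):
    """Project x onto {x : 1^T x = 1, x >= 0} by solving
    sum_k max(0, x_k - lam) = 1 for lam with bisection."""
    lo, hi = x.min() - 1.0, x.max()      # bracket: sum >= n at lo, sum = 0 at hi
    for _ in range(100):
        lam = (lo + hi) / 2
        if np.maximum(x - lam, 0).sum() > 1:
            lo = lam                      # lam too small: shift right
        else:
            hi = lam
    return np.maximum(x - (lo + hi) / 2, 0)

p = project_simplex(np.array([0.3, 1.2, -0.5, 0.8]))
assert np.isclose(p.sum(), 1) and (p >= 0).all()
```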
Projection on norm balls
Euclidean ball: C = {x : ||x||_2 <= 1}
    P_C(x) = x/||x||_2  if ||x||_2 > 1,   x  if ||x||_2 <= 1
1-norm ball: C = {x : ||x||_1 <= 1}
    P_C(x)_k = x_k - λ  if x_k > λ,   0  if -λ <= x_k <= λ,   x_k + λ  if x_k < -λ
λ = 0 if ||x||_1 <= 1; otherwise λ is the solution of the equation
    sum_{k=1}^n max{|x_k| - λ, 0} = 1
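As with the simplex, the equation for λ is monotone in λ and yields to bisection; the projection itself is then a soft-thresholding. A sketch (function name and radius parameter are illustrative additions):

```python
import numpy as np

def project_l1_ball(x, radius=1.0):
    """Project x onto {x : ||x||_1 <= radius}: soft-threshold by the lam
    solving sum_k max(|x_k| - lam, 0) = radius (bisection)."""
    if np.abs(x).sum() <= radius:
        return x.copy()                   # already inside the ball
    lo, hi = 0.0, np.abs(x).max()
    for _ in range(100):
        lam = (lo + hi) / 2
        if np.maximum(np.abs(x) - lam, 0).sum() > radius:
            lo = lam
        else:
            hi = lam
    lam = (lo + hi) / 2
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0)

p = project_l1_ball(np.array([2.0, -1.0, 0.2]))
assert np.isclose(np.abs(p).sum(), 1.0)   # lands on the ball's boundary
```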
Projection on simple cones
second-order cone: C = {(x, t) ∈ R^{n+1} : ||x||_2 <= t}
    P_C(x, t) = (x, t)   if ||x||_2 <= t
    P_C(x, t) = (0, 0)   if ||x||_2 <= -t
    P_C(x, t) = ((t + ||x||_2)/(2||x||_2)) (x, ||x||_2)   if -||x||_2 < t < ||x||_2, x ≠ 0
positive semidefinite cone: C = S^n_+
    P_C(X) = sum_{i=1}^n max{0, λ_i} q_i q_i^T
if X = sum_{i=1}^n λ_i q_i q_i^T is the eigenvalue decomposition of X
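Both projections are a few lines of NumPy; a sketch with hand-picked test inputs (the helper names are illustrative):

```python
import numpy as np

def project_soc(x, t):
    """Project (x, t) onto the second-order cone {(x, t) : ||x||_2 <= t}."""
    nx = np.linalg.norm(x)
    if nx <= t:
        return x, t                       # already in the cone
    if nx <= -t:
        return np.zeros_like(x), 0.0      # in the polar cone: project to origin
    coef = (nx + t) / (2 * nx)
    return coef * x, coef * nx

def project_psd(X):
    """Project a symmetric matrix onto S^n_+ by clipping eigenvalues at 0."""
    lam, Q = np.linalg.eigh(X)
    return (Q * np.maximum(lam, 0)) @ Q.T

x, t = project_soc(np.array([3.0, 4.0]), 1.0)      # ||x||_2 = 5 > |t|
assert np.isclose(np.linalg.norm(x), t)            # lands on the cone boundary

P = project_psd(np.array([[1.0, 0.0], [0.0, -2.0]]))
assert np.allclose(P, [[1.0, 0.0], [0.0, 0.0]])    # negative eigenvalue clipped
```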
Support function
the conjugate of the support function of a closed convex set is the indicator function:
    f(x) = S_C(x) = sup_{y ∈ C} x^T y,   f*(y) = I_C(y)
prox-operator of the support function: apply the Moreau decomposition
    prox_{tf}(x) = x - t prox_{t^{-1} f*}(x/t) = x - t P_C(x/t)
example: f(x) is the sum of the r largest components of x:
    f(x) = x_[1] + ... + x_[r] = S_C(x),   C = {y : 0 <= y <= 1, 1^T y = r}
the prox-operator of f is easily evaluated via projection on C
Norms
the conjugate of a norm is the indicator function of the dual norm ball:
    f(x) = ||x||,   f*(y) = I_B(y)   (B = {y : ||y||_* <= 1})
prox-operator of a norm: apply the Moreau decomposition
    prox_{tf}(x) = x - t prox_{t^{-1} f*}(x/t) = x - t P_B(x/t) = x - P_{tB}(x)
a useful formula for prox_{tf} when projection on tB = {x : ||x||_* <= t} is cheap
examples: ||·||_1, ||·||_2
Distance to a point
distance (in a general norm): f(x) = ||x - a||
prox-operator: from the translation rule, with g(x) = ||x||,
    prox_{tf}(x) = a + prox_{tg}(x - a)
                 = a + (x - a) - t P_B((x - a)/t)
                 = x - P_{tB}(x - a)
B is the unit ball for the dual norm
Euclidean distance to a set
Euclidean distance (to a closed convex set C):
    d(x) = inf_{y ∈ C} ||x - y||_2
prox-operator of the distance:
    prox_{td}(x) = θ P_C(x) + (1 - θ)x,   θ = t/d(x) if d(x) >= t,   θ = 1 otherwise
prox-operator of the squared distance: for f(x) = d(x)^2/2,
    prox_{tf}(x) = (1/(1+t)) x + (t/(1+t)) P_C(x)
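A numerical sketch of the distance prox, taking C = R^n_+ for illustration (so P_C is componentwise clipping) and a point with d(x) >= t, checked against the subgradient optimality condition for u ∉ C:

```python
import numpy as np

# prox of the Euclidean distance d(x) to C = R^n_+ via the formula above
t = 0.5
x = np.array([1.0, -2.0, 0.5])
Pc = np.maximum(x, 0)
d = np.linalg.norm(x - Pc)               # distance of x to C; here d >= t
theta = t / d
u = theta * Pc + (1 - theta) * x         # prox_{td}(x)

# optimality: x - u = t * (u - P_C(u)) / d(u) since u is outside C
Pu = np.maximum(u, 0)
assert np.allclose(x - u, t * (u - Pu) / np.linalg.norm(u - Pu))
```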
proof (expression for prox_{td}(x)):
if u = prox_{td}(x) ∉ C, then from the definition and the subgradient of d,
    x - u = (t/d(u)) (u - P_C(u))
this implies P_C(u) = P_C(x), d(x) >= t, and u is a convex combination of x and P_C(x)
if u ∈ C minimizes d(u) + (1/(2t))||u - x||_2^2, then u = P_C(x)
proof (expression for prox_{tf}(x) when f(x) = d(x)^2/2):
    prox_{tf}(x) = argmin_u ( (1/2) d(u)^2 + (1/(2t)) ||u - x||_2^2 )
                 = argmin_u inf_{v ∈ C} ( (1/2) ||u - v||_2^2 + (1/(2t)) ||u - x||_2^2 )
the optimal u as a function of v is
    u = (t/(t+1)) v + (1/(t+1)) x
the optimal v minimizes
    (1/2) ||(t/(t+1)) v + (1/(t+1)) x - v||_2^2 + (1/(2t)) ||(t/(t+1)) v + (1/(t+1)) x - x||_2^2 = (1/(2(1+t))) ||v - x||_2^2
over C, i.e., v = P_C(x)