Coordinate Update Algorithm Short Course: Proximal Operators and Algorithms
Instructor: Wotao Yin (UCLA Math)
Summer 2016
Why proximal?
Newton's method: for $C^2$-smooth, unconstrained problems; allows only modest problem sizes.
Gradient method: for $C^1$-smooth, unconstrained problems; allows large problem sizes and parallel implementations.
Proximal method: for smooth and nonsmooth, constrained and unconstrained problems; exploits problem structure; allows large problem sizes and parallel implementations.
Newton's algorithm uses a low-level (explicit) operation:
$$x^{k+1} \gets x^k - \lambda \big(\nabla^2 f(x^k)\big)^{-1} \nabla f(x^k)$$
The gradient algorithm uses a low-level (explicit) operation:
$$x^{k+1} \gets x^k - \lambda \nabla f(x^k)$$
The proximal-point algorithm uses a high-level (implicit) operation:
$$x^{k+1} \gets \mathrm{prox}_{\lambda f}(x^k)$$
Well-known algorithms are special cases.
The proximal operator $\mathrm{prox}_{\lambda f}$ is itself an optimization problem, used either standalone or as a subproblem. It is simple for structured $f$, and many such structured functions exist.
Notation and Assumptions
$f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is a closed, proper, convex function (why? to ensure $\mathrm{prox}_{\lambda f}$ is well defined and unique)
proper refers to $\mathrm{dom}\, f \neq \emptyset$
closed refers to $\mathrm{epi}\, f$ being a closed set
$\mathbb{R}^n$ can be extended to a general Hilbert space $\mathcal{H}$
$f$ can take the value $\infty$; this saves writing "$x \in \mathrm{dom}\, f$" in many settings
an operator maps $\mathbb{R}^n$ to $\mathbb{R}^n$; it is also called a mapping or a map
indicator function of a set $C$:
$$\iota_C(x) := \begin{cases} 0, & \text{if } x \in C, \\ \infty, & \text{otherwise.} \end{cases}$$
$\arg\min_x f(x)$ is the set of minimizers of $f$; if the minimizer is unique, it denotes the minimizer
Definition
The proximal operator $\mathrm{prox}_f : \mathbb{R}^n \to \mathbb{R}^n$ of a function $f$ is defined by
$$\mathrm{prox}_f(v) = \arg\min_{x \in \mathbb{R}^n} \left( f(x) + \tfrac{1}{2}\|x - v\|^2 \right)$$
The (scaled) proximal operator $\mathrm{prox}_{\lambda f} : \mathbb{R}^n \to \mathbb{R}^n$ is defined by
$$\mathrm{prox}_{\lambda f}(v) = \arg\min_{x \in \mathbb{R}^n} \left( f(x) + \tfrac{1}{2\lambda}\|x - v\|^2 \right)$$
Both subproblems are strongly convex with unique minimizers, so "$=$" is used.
The Moreau envelope
$$f_\lambda(v) := \min_{x \in \mathbb{R}^n} \left( f(x) + \tfrac{1}{2\lambda}\|x - v\|^2 \right)$$
is a differentiable function, with $\nabla f_\lambda(v) = \tfrac{1}{\lambda}(v - x^*)$, where $x^* = \mathrm{prox}_{\lambda f}(v)$.
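For smooth $f$, the definition can be checked numerically by minimizing the strongly convex subproblem directly. A minimal Python sketch (the helper name prox_numeric and the use of scipy.optimize.minimize are illustrative choices, not course material):

```python
import numpy as np
from scipy.optimize import minimize

def prox_numeric(f, v, lam=1.0):
    """Evaluate prox_{lam*f}(v) by minimizing f(x) + (1/(2*lam))*||x - v||^2,
    a strongly convex problem with a unique minimizer (f assumed smooth here)."""
    obj = lambda x: f(x) + np.sum((x - v) ** 2) / (2.0 * lam)
    return minimize(obj, x0=np.asarray(v, dtype=float)).x

# Example: f(x) = ||x||^2 has the closed form prox_{lam*f}(v) = v / (1 + 2*lam)
v = np.array([3.0, -4.0])
print(prox_numeric(lambda x: np.sum(x**2), v, lam=0.5))  # approx v / 2
```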
Proximal of indicator function is projection
Consider a nonempty closed convex set $C$:
$$\mathrm{prox}_{\iota_C}(x) = \arg\min_y \left( \iota_C(y) + \tfrac{1}{2}\|y - x\|^2 \right) = \arg\min_{y \in C} \tfrac{1}{2}\|y - x\|^2 =: \mathrm{proj}_C(x)$$
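Two standard projections in this closed-form family, as a short sketch (function names are illustrative):

```python
import numpy as np

def proj_box(x, l, u):
    """Projection onto the box {l <= x <= u}, i.e., prox of its indicator."""
    return np.clip(x, l, u)

def proj_l2_ball(x, r):
    """Projection onto the Euclidean ball {y : ||y||_2 <= r}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

x = np.array([2.0, -3.0])
print(proj_box(x, -1.0, 1.0))  # [ 1. -1.]
print(proj_l2_ball(x, 1.0))    # x / ||x||_2
```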
1D illustration (figure omitted): $f(x) + \frac{1}{2\lambda}\|x - v\|^2$ lies above $f(x)$; unless $v \in \arg\min f$, $\mathrm{prox}_{\lambda f}(v)$ moves away from $v$ and $f\big(\mathrm{prox}_{\lambda f}(v)\big) < f(v)$
Proximal parameter
Tuning $\lambda$ in
$$\mathrm{prox}_{\lambda f}(v) = \arg\min_{x \in \mathbb{R}^n} \left( f(x) + \tfrac{1}{2\lambda}\|x - v\|^2 \right):$$
as $\lambda \to \infty$: $\mathrm{prox}_{\lambda f}(v) \to \mathrm{proj}_{\arg\min f}(v)$
as $\lambda \to 0$: $\mathrm{prox}_{\lambda f}(v) \to \mathrm{proj}_{\mathrm{dom}\, f}(v)$, where
$$\mathrm{proj}_{\mathrm{dom}\, f}(v) = \arg\min_{x \in \mathbb{R}^n} \left\{ \tfrac{1}{2}\|v - x\|^2 : f(x) \text{ is finite} \right\}$$
$(\mathrm{prox}_{\lambda f}(v) - v)$ is generally nonlinear in $\lambda$, so $\lambda$ acts as a nonlinear step size
$\mathrm{prox}_{\lambda f}$ is a soft projection
The path $\{\mathrm{prox}_{\lambda f}(v) : \lambda > 0\} \subseteq \mathrm{dom}\, f$
$\mathrm{prox}_{\lambda f}(v)$ lies between $\mathrm{proj}_{\mathrm{dom}\, f}(v)$ and $\mathrm{proj}_{\arg\min f}(v)$
The paths generated by different $v$ may overlap or join
If $v \in \arg\min f$, then $\mathrm{prox}_{\lambda f}(v) = v$ is a fixed point
Examples
Examples
proximal of a linear function: let $a \in \mathbb{R}^n$, $b \in \mathbb{R}$, and $f(x) := a^T x + b$. Then
$$\mathrm{prox}_{\lambda f}(v) := \arg\min_{x \in \mathbb{R}^n} \left( (a^T x + b) + \tfrac{1}{2\lambda}\|x - v\|^2 \right)$$
has the first-order optimality condition
$$a + \tfrac{1}{\lambda}\big(\mathrm{prox}_{\lambda f}(v) - v\big) = 0 \quad\Longrightarrow\quad \mathrm{prox}_{\lambda f}(v) = v - \lambda a$$
application: proximal of the linear approximation of $f$. Let $f^{(1)}(x) = f(x^0) + \langle \nabla f(x^0), x - x^0 \rangle$. Then
$$\mathrm{prox}_{\lambda f^{(1)}}(x^0) = x^0 - \lambda \nabla f(x^0),$$
a gradient step with step size $\lambda$ (see the sketch below).
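A small sketch of the gradient-step reading (the quadratic test function is an illustrative choice):

```python
import numpy as np

# A prox step on the linearization f^{(1)} of f at x0 equals a gradient step.
# Test with f(x) = 0.5*||x||^2, whose gradient is grad_f(x) = x.
grad_f = lambda x: x
x0, lam = np.array([1.0, -2.0]), 0.3

a = grad_f(x0)              # slope of the linearization at x0
prox_step = x0 - lam * a    # closed form: prox_{lam*f^{(1)}}(x0) = x0 - lam*a
print(prox_step)            # the gradient step [0.7, -1.4]
```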
Examples
proximal of a quadratic function: let $A$ be a symmetric positive semidefinite matrix, $b \in \mathbb{R}^n$, and $f(x) := \frac{1}{2}x^T A x - b^T x + c$. Then
$$\mathrm{prox}_{\lambda f}(v) := \arg\min_{x \in \mathbb{R}^n} \left( f(x) + \tfrac{1}{2\lambda}\|x - v\|^2 \right)$$
has the first-order optimality condition (writing $v^* := \mathrm{prox}_{\lambda f}(v)$):
$$(A v^* - b) + \tfrac{1}{\lambda}(v^* - v) = 0$$
$$v^* = (\lambda A + I)^{-1}(\lambda b + v)$$
$$v^* = (\lambda A + I)^{-1}(\lambda b + \lambda A v + v - \lambda A v)$$
$$v^* = v + \big(A + \tfrac{1}{\lambda} I\big)^{-1}(b - A v)$$
It recovers an iterative refinement method for $A x = b$ (see the sketch below).
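A direct implementation of the closed form, as a minimal sketch:

```python
import numpy as np

def prox_quadratic(A, b, v, lam):
    """prox_{lam*f}(v) for f(x) = 0.5*x^T A x - b^T x + c,
    via the closed form (lam*A + I)^{-1} (lam*b + v)."""
    return np.linalg.solve(lam * A + np.eye(len(v)), lam * b + v)

A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
v = np.zeros(2)
print(prox_quadratic(A, b, v, lam=1.0))  # [1/3, 1/2]: a step toward A^{-1} b = [0.5, 1.0]
```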
application: proximal of the quadratic approximation of $f$. Let
$$f^{(2)}(x) = f(x^0) + \langle \nabla f(x^0), x - x^0 \rangle + \tfrac{1}{2}(x - x^0)^T \nabla^2 f(x^0)(x - x^0) =: \tfrac{1}{2}x^T A x - b^T x + c$$
where
$$A = \nabla^2 f(x^0), \qquad b = \nabla^2 f(x^0)\, x^0 - \nabla f(x^0).$$
By letting $v = x^0$, we get
$$\mathrm{prox}_{\lambda f^{(2)}}(x^0) = x^0 - \big(\nabla^2 f(x^0) + \tfrac{1}{\lambda} I\big)^{-1} \nabla f(x^0),$$
which recovers the modified-Hessian Newton or Levenberg–Marquardt method (a sketch follows).
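A minimal sketch of this damped Newton step (the helper name and the toy gradient/Hessian are illustrative):

```python
import numpy as np

def damped_newton_step(grad, hess, x0, lam):
    """One prox step on the quadratic model of f at x0:
    x0 - (hess(x0) + (1/lam)*I)^{-1} grad(x0), a Levenberg-Marquardt-type step."""
    H, g = hess(x0), grad(x0)
    return x0 - np.linalg.solve(H + np.eye(len(x0)) / lam, g)

# For f(x) = 0.5*||x||^2 (grad = x, hess = I), the step maps x0 to x0/(1 + lam)
x0 = np.array([1.0, -2.0])
print(damped_newton_step(lambda x: x, lambda x: np.eye(2), x0, lam=1.0))  # x0 / 2
```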
Examples
proximal of $\ell_1$-norm: let $f(x) = \|x\|_1$. Then
$$\mathrm{prox}_{\lambda f}(v) := \arg\min_{x \in \mathbb{R}^n} \left( \|x\|_1 + \tfrac{1}{2\lambda}\|x - v\|^2 \right)$$
The subgradient optimality condition (writing $v^* := \mathrm{prox}_{\lambda f}(v)$):
$$0 \in \partial f(v^*) + \tfrac{1}{\lambda}(v^* - v) \quad\Longleftrightarrow\quad v - v^* \in \lambda\, \partial f(v^*)$$
Recall $\partial f(x) = \partial|x_1| \times \cdots \times \partial|x_n|$. Hence, it reduces to component-wise subproblems:
$$v_i - v_i^* \in \lambda\, \partial |v_i^*|$$
proximal of $\ell_1$-norm (cont.). Three cases:
$v_i^* > 0$: then $v_i - v_i^* = \lambda$, so $v_i^* = v_i - \lambda$
$v_i^* < 0$: then $v_i - v_i^* = -\lambda$, so $v_i^* = v_i + \lambda$
$v_i^* = 0$: then $v_i = v_i - v_i^* \in [-\lambda, \lambda]$
Rewriting these conditions in terms of $v$:
if $v_i > \lambda$, then $v_i^* = v_i - \lambda$
if $v_i < -\lambda$, then $v_i^* = v_i + \lambda$
if $v_i \in [-\lambda, \lambda]$, then $v_i^* = 0$
$\mathrm{prox}_{\lambda f}$ is the shrinkage (or element-wise soft-thresholding) operator:
$$\mathrm{shrink}(v, \lambda)_i = \max(|v_i| - \lambda, 0)\, \frac{v_i}{|v_i|}$$
In Matlab: max(abs(v)-lambda,0).*sign(v)
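The same one-liner as a Python sketch:

```python
import numpy as np

def soft_threshold(v, lam):
    """Element-wise soft-thresholding: the prox of lam*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

print(soft_threshold(np.array([3.0, -0.5, 1.0]), 1.0))  # [ 2. -0.  0.]
```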
Examples
proximal of $\ell_2$-norm: let $f(x) = \|x\|_2$. Then
$$\mathrm{prox}_{\lambda f}(x) = \max(\|x\|_2 - \lambda, 0)\, \frac{x}{\|x\|_2} = x - \mathrm{proj}_{B_2(0,\lambda)}(x),$$
with the special convention $0/0 = 0$ if $x = 0$.
general pattern: proximal of $\ell_p$-norm: suppose $p^{-1} + q^{-1} = 1$ with $p, q \in [1, \infty]$. Then
$$\mathrm{prox}_{\lambda \|\cdot\|_p}(x) = x - \mathrm{proj}_{B_q(0,\lambda)}(x)$$
Useful for obtaining the proximals of the $\ell_\infty$-norm and the $\ell_{2,1}$-norm
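A sketch of the resulting block soft-thresholding operator:

```python
import numpy as np

def prox_l2(x, lam):
    """prox of lam*||.||_2 (block soft-thresholding), using the 0/0 = 0 convention."""
    nrm = np.linalg.norm(x)
    return np.zeros_like(x) if nrm <= lam else (1.0 - lam / nrm) * x

print(prox_l2(np.array([3.0, 4.0]), 1.0))  # scales x by (5 - 1)/5: [2.4, 3.2]
```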
Examples
unitarily invariant matrix norms are vector norms applied to the singular values:
Frobenius norm: $\ell_2$ of the singular values
nuclear norm: $\ell_1$ of the singular values
$\ell_2$-operator (spectral) norm: $\ell_\infty$ of the singular values
note: the spectral norm is the square root of the max eigenvalue of $A^T A$, i.e., the max singular value; for asymmetric matrices this may not equal the max eigenvalue of $A$
notation: let $\|\cdot\|$ be a unitarily invariant matrix norm, and let $\|\cdot\|'$ be the corresponding vector norm (applied to the singular values)
proximals of unitarily invariant matrix norms:
$$X^* = \mathrm{prox}_{\lambda\|\cdot\|}(A) := \arg\min_X \|X\| + \tfrac{1}{2\lambda}\|X - A\|_F^2$$
computation steps:
1. SVD: $A = U \mathrm{diag}(\sigma) V^T$
2. proximal: $\sigma^* \gets \arg\min_s \|s\|' + \tfrac{1}{2\lambda}\|s - \sigma\|^2$
3. return: $X^* \gets U \mathrm{diag}(\sigma^*) V^T$
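For the nuclear norm, step 2 is the $\ell_1$ prox (soft-thresholding), giving singular-value thresholding. A minimal sketch:

```python
import numpy as np

def prox_nuclear(A, lam):
    """prox of lam*||.||_* (nuclear norm): SVD, soft-threshold the
    singular values, then rebuild the matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

A = np.array([[3.0, 0.0], [0.0, 1.0]])
print(prox_nuclear(A, 1.0))  # singular values (3, 1) -> (2, 0)
```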
Proximable functions
definition: a function $f : \mathbb{R}^n \to \mathbb{R}$ is proximable if $\mathrm{prox}_{\gamma f}$ can be computed in $O(n)$ or $O(n\, \mathrm{polylog}(n))$ time
examples:
norms: $\ell_1$, $\ell_2$, $\ell_{2,1}$, $\ell_\infty$, ...
separable functions/constraints: $\|x\|_0$, $l \le x \le u$
standard simplex: $\{x \in \mathbb{R}^n : \mathbf{1}^T x = 1,\ x \ge 0\}$
...
In general, $f, g$ both proximable does not imply that $f + g$ is proximable. But there are exceptions. If $f + g$ is proximable, we can simplify operator splitting.
$f + g$ proximable functions
Let $\circ$ denote operator composition: for example,
$$(\mathrm{prox}_f \circ\, \mathrm{prox}_g)(x) := \mathrm{prox}_f\big(\mathrm{prox}_g(x)\big)$$
rule 1: if $f : \mathbb{R} \to \mathbb{R}$ is convex and $f'(0) = 0$, then the scalar function $f + |\cdot|$ is proximable:
$$\mathrm{prox}_{f + |\cdot|} = \mathrm{prox}_f \circ\, \mathrm{prox}_{|\cdot|}$$
Application: the elastic net regularizer $\frac{1}{2}\|x\|_2^2 + \alpha\|x\|_1$ (a sketch follows)
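Applying rule 1 component-wise to the elastic net: first the $\ell_1$ prox (soft-thresholding), then the prox of the quadratic term (a simple shrinkage). A sketch, assuming the scaled decomposition $\mathrm{prox}_{\lambda(\frac{1}{2}(\cdot)^2 + \alpha|\cdot|)} = \mathrm{prox}_{\frac{\lambda}{2}(\cdot)^2} \circ \mathrm{prox}_{\lambda\alpha|\cdot|}$:

```python
import numpy as np

def prox_elastic_net(v, lam, alpha):
    """prox of lam*(0.5*||x||_2^2 + alpha*||x||_1): soft-threshold by
    lam*alpha, then shrink by 1/(1 + lam)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam * alpha, 0.0) / (1.0 + lam)

print(prox_elastic_net(np.array([2.0, -0.1]), lam=1.0, alpha=0.5))  # [0.75, -0.]
```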
rule 2: $g$ is a 1-homogeneous function if $g(\alpha x) = \alpha g(x)$ for all $\alpha \ge 0$. Examples: $\ell_1$, $\ell_\infty$, $\iota_{\{x \ge 0\}}$, $\iota_{\{x \le 0\}}$.
If $g$ is a 1-homogeneous function, then $\|\cdot\|_2^2 + g$ is proximable:
$$\mathrm{prox}_{\|\cdot\|_2^2 + g} = \mathrm{prox}_{\|\cdot\|_2^2} \circ\, \mathrm{prox}_g$$
rule 3: the 1D discrete total variation is
$$\mathrm{TV}(x) := \sum_{i=1}^{n-1} |x_{i+1} - x_i|$$
$f$ is component prox-monotonic if, for all $x \in \mathbb{R}^n$ and $i, j \in \{1, \ldots, n\}$,
$$x_i < x_j \;\Longrightarrow\; \big(\mathrm{prox}_f(x)\big)_i \le \big(\mathrm{prox}_f(x)\big)_j, \qquad x_i = x_j \;\Longrightarrow\; \big(\mathrm{prox}_f(x)\big)_i = \big(\mathrm{prox}_f(x)\big)_j$$
Examples: $\ell_1$, $\ell_2$, $\ell_\infty$, $\iota_{\{x \ge l\}}$, $\iota_{\{x \le u\}}$, $\iota_{[l,u]}$.
If $f$ is component prox-monotonic, then $f + \mathrm{TV}$ is proximable:
$$\mathrm{prox}_{f + \mathrm{TV}} = \mathrm{prox}_f \circ\, \mathrm{prox}_{\mathrm{TV}}$$
Application: the fused LASSO regularizer $\alpha\|x\|_1 + \mathrm{TV}(x)$
Properties
Separable Sum
Proposition
For a separable function $f(x, y) = \varphi(x) + \psi(y)$,
$$\mathrm{prox}_{\lambda f}(v, w) = \big(\mathrm{prox}_{\lambda \varphi}(v),\ \mathrm{prox}_{\lambda \psi}(w)\big)$$
We have observed this with the proximal of $\|x\|_1 := \sum_{i=1}^n |x_i|$
Can be used to derive the proximal of $\|x\|_{2,1} := \sum_{i=1}^p \|x_{(i)}\|_2$ (as in non-overlapping group LASSO); a sketch follows
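A sketch of the $\ell_{2,1}$ prox built from the separable-sum rule and the $\ell_2$ prox (the group encoding by index arrays is an illustrative choice):

```python
import numpy as np

def prox_group_l21(v, groups, lam):
    """prox of lam*||.||_{2,1} over non-overlapping groups: apply the
    l2-norm prox (block soft-thresholding) to each group independently."""
    out = np.empty_like(v)
    for idx in groups:
        block = v[idx]
        nrm = np.linalg.norm(block)
        out[idx] = 0.0 if nrm <= lam else (1.0 - lam / nrm) * block
    return out

v = np.array([3.0, 4.0, 0.3, 0.4])
groups = [np.array([0, 1]), np.array([2, 3])]
print(prox_group_l21(v, groups, lam=1.0))  # [2.4, 3.2, 0., 0.]
```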
Proximal fixed point
Theorem (fixed point = minimizer)
Let $\lambda > 0$. A point $x^* \in \mathbb{R}^n$ is a minimizer of $f$ if, and only if, $\mathrm{prox}_{\lambda f}(x^*) = x^*$.
Proof.
$\Rightarrow$: Let $x^* \in \arg\min f(x)$. Then for any $x \in \mathbb{R}^n$,
$$f(x) + \tfrac{1}{2\lambda}\|x - x^*\|^2 \ge f(x^*) + \tfrac{1}{2\lambda}\|x^* - x^*\|^2.$$
Thus, $x^* = \arg\min_x \big( f(x) + \tfrac{1}{2\lambda}\|x - x^*\|^2 \big)$, so $x^* = \mathrm{prox}_{\lambda f}(x^*)$.
$\Leftarrow$: Let $x^* = \mathrm{prox}_{\lambda f}(x^*)$. Then, by the subgradient optimality condition,
$$0 \in \partial f(x^*) + \tfrac{1}{\lambda}(x^* - x^*) = \partial f(x^*).$$
Thus, $0 \in \partial f(x^*)$, and $x^* \in \arg\min f(x)$.
Proximal operator and resolvent
Definition
For a monotone operator $T$, $(I + \lambda T)^{-1}$ is the (well-defined) resolvent of $T$.
Proposition
$\mathrm{prox}_{\lambda f} = (I + \lambda\, \partial f)^{-1}$.
Informal proof.
$$x \in (I + \lambda\, \partial f)^{-1}(v) \;\Longleftrightarrow\; v \in (I + \lambda\, \partial f)(x) \;\Longleftrightarrow\; v \in x + \lambda\, \partial f(x)$$
$$\Longleftrightarrow\; 0 \in x - v + \lambda\, \partial f(x) \;\Longleftrightarrow\; 0 \in \tfrac{1}{\lambda}(x - v) + \partial f(x)$$
$$\Longleftrightarrow\; x = \arg\min_{x \in \mathbb{R}^n} \left( f(x) + \tfrac{1}{2\lambda}\|x - v\|^2 \right)$$
Proximal-Point Algorithm
Proximal-point algorithm (PPA)
iteration: $x^{k+1} \gets \mathrm{prox}_{\lambda f}(x^k)$
seldom used directly to minimize $f$, because computing $\mathrm{prox}_{\lambda f}$ is as difficult as the original problem
recovers the method of multipliers or augmented Lagrangian method (later lecture)
has iterate convergence properties (a generic loop is sketched below)
$\lambda$ can be relaxed to take values in an interval of $\mathbb{R}_{++}$
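A generic PPA loop, as a sketch (the $\ell_1$ example reuses the soft-thresholding prox derived earlier):

```python
import numpy as np

def ppa(prox, x0, lam=1.0, iters=50):
    """Proximal-point algorithm: repeatedly apply x <- prox_{lam*f}(x).
    `prox(x, lam)` must evaluate prox_{lam*f}(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = prox(x, lam)
    return x

# With f = ||.||_1, the iterates reach the minimizer 0 in finitely many steps
prox_l1 = lambda x, lam: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
print(ppa(prox_l1, np.array([5.0, -2.0]), lam=1.0, iters=10))  # [ 0. -0.]
```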
Proximal is firmly nonexpansive
definition: a map $T$ is nonexpansive if $\|T(x) - T(y)\| \le \|x - y\|$ for all $x, y$
definition: a map $T$ is firmly nonexpansive if
$$\|T(x) - T(y)\|^2 \le \|x - y\|^2 - \|(x - T(x)) - (y - T(y))\|^2, \quad \forall x, y$$
A key property in the development and analysis of first-order algorithms!
Proposition
For proper closed convex $f$ and $\lambda > 0$, $\mathrm{prox}_{\lambda f}$ is firmly nonexpansive.
Proof.
Take arbitrary $x, y$, and let $x^* := \mathrm{prox}_{\lambda f}(x)$ and $y^* := \mathrm{prox}_{\lambda f}(y)$. By the subgradient optimality conditions,
$$(x - x^*) \in \lambda\, \partial f(x^*), \qquad (y - y^*) \in \lambda\, \partial f(y^*).$$
Since $\partial f$ is monotone, i.e., $\langle p - q,\ x^* - y^* \rangle \ge 0$ for all $p \in \partial f(x^*)$ and $q \in \partial f(y^*)$, we have
$$\langle (x - x^*) - (y - y^*),\ x^* - y^* \rangle \ge 0,$$
which is equivalent to
$$\|x^* - y^*\|^2 \le \|x - y\|^2 - \|(x - x^*) - (y - y^*)\|^2.$$
PPA convergence properties (without proofs)
Assume: convex $f$; a minimizer $x^*$ exists. Since $\mathrm{prox}_{\lambda f}$ is firmly nonexpansive:
$x^k \to x^*$ (weakly, in an infinite-dimensional Hilbert space $\mathcal{H}$)
the above remains true under summable errors in computing $\mathrm{prox}_{\lambda f}(x^k)$
fixed-point residual rate: $\|\mathrm{prox}_{\lambda f}(x^k) - x^k\|^2 = o(1/k^2)$
objective rate: $f(x^k) - f(x^*) = o(1/k)$
Assume $f$ is strongly convex: there exists $C > 0$ such that
$$\langle p - q,\ x - y \rangle \ge C\|x - y\|^2, \quad \forall x, y \text{ and } p \in \partial f(x),\ q \in \partial f(y).$$
Then $\mathrm{prox}_{\lambda f}$ is a contraction:
$$\|\mathrm{prox}_{\lambda f}(x) - \mathrm{prox}_{\lambda f}(y)\|^2 \le \frac{1}{1 + 2\lambda C}\|x - y\|^2,$$
and (assuming $x^*$ exists) thus
$$\|x^{k+1} - x^*\|^2 \le \frac{1}{1 + 2\lambda C}\|x^k - x^*\|^2 \le \frac{1}{(1 + 2\lambda C)^{k+1}}\|x^0 - x^*\|^2.$$
Therefore, $x^k \to x^*$ linearly.
PPA interpretations
destination-subgradient-descent interpretation:
$$x^{k+1} = \mathrm{prox}_{\lambda f}(x^k) \;\Longleftrightarrow\; x^{k+1} = (I + \lambda\, \partial f)^{-1}(x^k) \;\Longleftrightarrow\; x^k \in (I + \lambda\, \partial f)\, x^{k+1}$$
$$\Longleftrightarrow\; x^k \in x^{k+1} + \lambda\, \partial f(x^{k+1}) \;\Longleftrightarrow\; x^{k+1} = x^k - \lambda\, \tilde{\nabla} f(x^{k+1})$$
where $\tilde{\nabla} f(x^{k+1}) \in \partial f(x^{k+1})$.
interpretation: descent along the negative subgradient at the destination $x^{k+1}$
compare: a subgradient at the origin $x^k$ is not necessarily a descent direction
dual interpretation: let $y^{k+1} = \tilde{\nabla} f(x^{k+1}) \in \partial f(x^{k+1})$. Substituting the formula for $x^{k+1}$, we get
$$y^{k+1} \in \partial f(x^k - \lambda y^{k+1}).$$
Computing $\mathrm{prox}_{\lambda f}(x^k)$ is equivalent to solving for a subgradient at the descent destination. Related to the Moreau decomposition (in a later lecture).
approximate-gradient interpretation: assume $f$ is twice differentiable. Then, as $\lambda \to 0$,
$$\mathrm{prox}_{\lambda f}(x) = (I + \lambda \nabla f)^{-1}(x) = x - \lambda \nabla f(x) + o(\lambda)$$
disappearing-Tikhonov-regularization interpretation:
$$x^{k+1} \gets \mathrm{prox}_{\lambda f}(x^k) = \arg\min_{x \in \mathbb{R}^n} \left( f(x) + \tfrac{1}{2\lambda}\|x - x^k\|_2^2 \right)$$
The second term is a regularization: $x^{k+1}$ should stay close to $x^k$. The regularization goes away as $x^k$ converges.
Bregman iterative regularization:
$$x^{k+1} \gets \arg\min_x\ \tfrac{1}{\lambda} D_r^p(x; x^k) + f(x)$$
given a proper closed convex function $r$ and a subgradient $p(x^k) \in \partial r(x^k)$, where
$$D_r^p(x; x^k) := r(x) - r(x^k) - \langle p, x - x^k \rangle.$$
PPA is the special case corresponding to setting $r = \frac{1}{2}\|\cdot\|_2^2$ (a quick check follows).
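A quick check of the special case: with $r = \frac{1}{2}\|\cdot\|_2^2$, the only subgradient is $p(x^k) = x^k$, so
$$D_r^p(x; x^k) = \tfrac{1}{2}\|x\|^2 - \tfrac{1}{2}\|x^k\|^2 - \langle x^k,\ x - x^k \rangle = \tfrac{1}{2}\|x - x^k\|^2,$$
and the Bregman iteration reduces to $x^{k+1} \gets \arg\min_x f(x) + \frac{1}{2\lambda}\|x - x^k\|^2 = \mathrm{prox}_{\lambda f}(x^k)$.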
Summary
The proximal operator is easy to understand
Is a standard tool for nonsmooth/constrained optimization
Gives a fixed-point optimality condition
PPA is more stable than gradient descent
Sits at a high level of abstraction
Is in closed form for many functions
Not covered
Usage in operator splitting
Proximals of dual functions, computed by minimizing an augmented Lagrangian
Proximals of nonconvex functions