Proximal Operator and Proximal Algorithms (Lecture notes of UCLA 285J Fall 2016)
Instructor: Wotao Yin. April 29, 2017.

Given a function $f$, the proximal operator maps an input point $x$ to the minimizer of $f$ restricted to a small proximity to $x$. The proximal operator has a simple definition yet has long been a powerful tool in optimization, giving rise to a variety of proximal algorithms such as the proximal-point algorithm, the prox-gradient algorithm, as well as many algorithms involving linearization and/or splitting. The proximal operator of a convex function involves the subgradients of the function, so the function does not need to be differentiable. Hence, proximal algorithms handle both differentiable and nondifferentiable functions. In comparison, Newton's algorithm requires a $C^2$ function, and gradient algorithms need $C^1$ functions. Along with convex duality, proximal algorithms can solve problems with linear constraints. In fact, the Method of Multipliers, or the Augmented Lagrangian Method, is a special case of the proximal-point algorithm.

What is unique about the proximal operator is its implicitness: it computes the subgradient of a function at the output point, and the subgradient and the output point determine each other. (In comparison, in Newton's algorithm and in gradient algorithms, the Hessian and the gradient are evaluated at the input point and determine the output point.) Hence, computing the proximal operator may require inverting a matrix or evaluating a so-called resolvent. While this can be a disadvantage for some functions, a large number of extremely useful functions, such as the $\ell_1$, $\ell_2$, $\ell_\infty$ norms, have closed-form proximal operators! Because the proximal operator is implicit, it is very stable; for example, the proximal-point algorithm converges for any positive step size. The implicitness also makes proximal algorithms popular choices for certain structured nonconvex problems.
Proximal algorithms are often used for optimization problems with structures such as large sums, block separability, linear constraints, and linear transforms. Coordinate descent and operator splitting techniques often decompose a problem into simple subproblems that are easily solved by proximal algorithms; therefore, the proximal operator gives rise to parallel and distributed algorithms.

The proximal operator also has interesting mathematical properties. It is a generalization of the projection and has a soft-projection interpretation. Just as the projections onto complementary linear subspaces produce an orthogonal decomposition of a point, the proximal operators of a convex function and its convex conjugate yield the Moreau decomposition of a point. Furthermore, the proximal operator provides an optimality condition: a function is minimized at a point if, and only if, the proximal operator of the function evaluated at that point returns the same point. It is also a common part of the fixed-point optimality conditions for more complicated optimization problems. For convex functions, the proximal operator enjoys the important firmly nonexpansive property, which plays a central role in monotone operator theory and the operator splitting method. This property leads to sequence convergence and lets us assemble simple operators into an algorithm that solves difficult problems. Consequently, proximal operators are frequently used to handle simple or structured functions in operator-splitting algorithms.

Notation

To ensure the proximal operator is well defined and gives a unique output, we consider functions $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ that are proper (not everywhere $\infty$), closed (the epigraph is a closed set; equivalently, all level sets are closed), and convex. Except in examples, the results of this chapter generalize from $\mathbb{R}^n$ to a possibly infinite-dimensional Hilbert space. Including $\infty$ in the range of $f$ lets us avoid writing the constraint $x \in \operatorname{dom} f$. The set of minimizers of $f$ is denoted by $\operatorname{argmin} f = \{x \in \mathbb{R}^n : f(x) = \min_y f(y)\}$.

Definition

Definition 1. Given a proper closed convex function $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ and a scaling parameter $\lambda > 0$, the proximal operator is the mapping from $\mathbb{R}^n$ to $\mathbb{R}^n$ defined by
$$\operatorname{prox}_{\lambda f}(x) := \operatorname*{argmin}_{y \in \mathbb{R}^n}\; f(y) + \frac{1}{2\lambda}\|y - x\|^2. \qquad (1)$$

Lemma 1. For any $\lambda > 0$ and $x$, $\operatorname{prox}_{\lambda f}(x)$ exists and is unique if $f$ is proper closed convex.
Proof. Since $f$ is proper closed convex, it is lower bounded by an affine function; therefore, $F(y) := f(y) + \frac{1}{2\lambda}\|y - x\|^2$ is coercive, i.e., $\lim_{\|y\| \to \infty} F(y) = \infty$. Take a minimizing sequence $(y_m)$ such that $\lim_{m \to \infty} F(y_m) = \inf_{y \in \mathbb{R}^n} F(y)$. Since $F$ is coercive, this sequence is bounded, so it has a subsequence that converges to a cluster point, say $y_{m_k} \to y^\star$. Since $F$ is closed,
$$F(y^\star) \le \lim_{k \to \infty} F(y_{m_k}) = \inf_{y \in \mathbb{R}^n} F(y).$$
Therefore, $y^\star$ is a minimizer of $F$; it is unique since $F$ is strongly convex. ∎

We use $\lambda f$ as the proximal subscript, but we prefer separating $f$ and $\lambda$ in the minimization problem in (1). In fact, the definition does not change if, instead of (1), we let
$$\operatorname{prox}_{\lambda f}(x) := \operatorname*{argmin}_{y}\; \lambda f(y) + \frac{1}{2}\|y - x\|^2.$$
However, the separated form yields the Moreau envelope, or Moreau–Yosida approximation:
$$\hat f(x) := \min_{y \in \mathbb{R}^n}\; f(y) + \frac{1}{2\lambda}\|y - x\|^2.$$
The function $\hat f$ approximates $f$ but is everywhere differentiable, even if $f$ is not. It is easy to show that
$$\nabla \hat f(x) = \frac{1}{\lambda}\big(x - \operatorname{prox}_{\lambda f}(x)\big). \qquad (2)$$
Exercise: prove (2), including existence and uniqueness. (Add a 1D illustration of $f(x)$, $f(y) + \frac{1}{2\lambda}\|y - x\|^2$, and $\hat f(x)$.)

Soft projection

Let $C$ be a nonempty closed convex set. Recall the indicator function
$$\delta_C(x) := \begin{cases} 0, & \text{if } x \in C,\\ \infty, & \text{otherwise.}\end{cases}$$
Let $\lambda > 0$. By definition,
$$\operatorname{prox}_{\lambda\delta_C}(x) = \operatorname*{argmin}_{y}\; \delta_C(y) + \frac{1}{2\lambda}\|y - x\|^2 = \operatorname*{argmin}\{\|y - x\| : y \in C\} = \operatorname{proj}_C(x).$$
The proximal operator of $C$'s indicator function is just the projection onto $C$; the scaling parameter $\lambda$ does not make any difference. The indicator function has the special property $\operatorname{argmin} \delta_C = \operatorname{dom} \delta_C = C$. In general, a proper convex function $f$ satisfies $\operatorname{argmin} f \subseteq \operatorname{dom} f$. Suppose that $\operatorname{argmin} f$ is nonempty. For a given $x \in \mathbb{R}^n$,
- as $\lambda \uparrow \infty$, $\operatorname{prox}_{\lambda f}(x)$ approaches $\operatorname{proj}_{\operatorname{argmin} f}(x)$;
- as $\lambda \downarrow 0$, $\operatorname{prox}_{\lambda f}(x)$ approaches $\operatorname{proj}_{\operatorname{dom} f}(x)$.

(Add a 2D illustration.)

Examples

Linear function

Given $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$, let
$$f(x) := \langle a, x\rangle + b.$$
The proximal operator of this linear function is
$$\operatorname{prox}_{\lambda f}(x) = \operatorname*{argmin}_{y \in \mathbb{R}^n}\; (a^T y + b) + \frac{1}{2\lambda}\|y - x\|^2.$$
The first-order optimality condition is obtained by differentiating the minimization objective in $y$, yielding
$$a + \frac{1}{\lambda}\big(\operatorname{prox}_{\lambda f}(x) - x\big) = 0 \iff \operatorname{prox}_{\lambda f}(x) = x - \lambda a. \qquad (3)$$
The proximal map subtracts $\lambda a$ from the input point, i.e., it moves $\lambda$ units along the negative normal direction. As an application, let $f^{(1)}$ be the linear (1st-order) approximation of a differentiable function $f$ at the point $x$, namely,
$$f^{(1)}(y) := f(x) + \langle \nabla f(x),\, y - x\rangle.$$
Following (3), we obtain
$$\operatorname{prox}_{\lambda f^{(1)}}(x) = x - \lambda \nabla f(x),$$
which is the gradient descent step with step size $\lambda$.

Quadratic function

Can we recover Newton's algorithm with the proximal operator of the quadratic approximation? We will see that it almost does! Let $A \in \mathbb{R}^{n \times n}$ be a symmetric positive semidefinite matrix and $b \in \mathbb{R}^n$ be a vector. Consider the quadratic function
$$f(x) := \tfrac{1}{2}\langle x, Ax\rangle - \langle b, x\rangle.$$
The proximal operator of $f$ is
$$\operatorname{prox}_{\lambda f}(x) = \operatorname*{argmin}_{y \in \mathbb{R}^n}\; \tfrac{1}{2}\langle y, Ay\rangle - \langle b, y\rangle + \frac{1}{2\lambda}\|y - x\|^2.$$
By differentiating the minimization objective in $y$, we obtain the first-order optimality condition:
$$(Ay - b) + \frac{1}{\lambda}(y - x) = 0 \;\iff\; y = (\lambda A + I)^{-1}(\lambda b + x) \;\iff\; y = x + (A + \lambda^{-1} I)^{-1}(b - Ax).$$
Therefore,
$$\operatorname{prox}_{\lambda f}(x) = x + (A + \lambda^{-1} I)^{-1}(b - Ax). \qquad (4)$$
Consider the least-squares problem
$$\text{minimize}_x\; \tfrac{1}{2}\|Bx - c\|_2^2,$$
where $B \in \mathbb{R}^{m \times n}$ and $c \in \mathbb{R}^m$. By letting $A = B^T B$ and $b = B^T c$, we recover from (4) the iterative refinement algorithm:
$$x^{k+1} \leftarrow x^k + (A + \lambda^{-1} I)^{-1}(b - Ax^k).$$
As another application, let us take a $C^2$ function $f$ and define its quadratic (2nd-order) approximation at a point $x$:
$$f^{(2)}(y) := f(x) + \langle \nabla f(x),\, y - x\rangle + \tfrac{1}{2}\langle y - x,\, \nabla^2 f(x)(y - x)\rangle.$$
With $A = \nabla^2 f(x)$ and $b = \nabla^2 f(x)\,x - \nabla f(x)$, we simplify $f^{(2)}$ to
$$f^{(2)}(y) = \tfrac{1}{2}\langle y, Ay\rangle - \langle b, y\rangle + c,$$
where $c$ collects all $y$-independent terms, which do not affect the evaluation of the proximal operator. Following (4), we get
$$\operatorname{prox}_{\lambda f^{(2)}}(x) = x - \big(\nabla^2 f(x) + \lambda^{-1} I\big)^{-1}\nabla f(x).$$
The iteration $x^{k+1} \leftarrow \operatorname{prox}_{\lambda f^{(2)}}(x^k)$ recovers the modified-Hessian Newton algorithm, which is also known as the Levenberg–Marquardt method.
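The quadratic-prox formula (4) can be sanity-checked numerically. The following NumPy sketch is ours (function name and test data are not from the notes); it evaluates the formula and verifies the first-order optimality condition:

```python
import numpy as np

def prox_quadratic(x, A, b, lam):
    """Prox of f(x) = 0.5*<x, Ax> - <b, x>: x + (A + lam^{-1} I)^{-1} (b - A x)."""
    n = x.shape[0]
    return x + np.linalg.solve(A + np.eye(n) / lam, b - A @ x)

# Least-squares instance: f(x) = 0.5*||Bx - c||^2 with A = B^T B, b = B^T c.
rng = np.random.default_rng(0)
B = rng.standard_normal((8, 5))
c = rng.standard_normal(8)
A, b = B.T @ B, B.T @ c
x = rng.standard_normal(5)
y = prox_quadratic(x, A, b, lam=0.5)

# The minimizer y must satisfy (A y - b) + (y - x)/lam = 0.
assert np.allclose((A @ y - b) + (y - x) / 0.5, 0)
```

Iterating `prox_quadratic` on the same data is exactly the iterative refinement update above.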
$\ell_1$-norm

Let $f(x) = \|x\|_1$. The point $y^\star = \operatorname{prox}_{\lambda f}(x)$ must minimize $\|y\|_1 + \frac{1}{2\lambda}\|y - x\|^2$. Therefore, it satisfies
$$0 \in \partial\|y^\star\|_1 + \frac{1}{\lambda}(y^\star - x) \;\iff\; x - y^\star \in \lambda\,\partial\|y^\star\|_1. \qquad (5)$$
Recall that, since $\|\cdot\|_1$ is separable, its subdifferential simplifies to $\partial\|y\|_1 = \partial|y_1| \times \cdots \times \partial|y_n|$. Therefore, condition (5) reduces to the component-wise conditions
$$x_i - y_i^\star \in \lambda\,\partial|y_i^\star|, \quad i = 1, \dots, n,$$
which mean that the graph of $x_i - y_i$ intersects that of $\lambda\,\partial|y_i|$. From the plot, we observe that
$$y_i^\star = \begin{cases} x_i - \lambda, & \text{if } x_i > \lambda,\\ x_i + \lambda, & \text{if } x_i < -\lambda,\\ 0, & \text{otherwise.}\end{cases}$$
Since $y_i^\star$ preserves the sign of $x_i$ and reduces its absolute value by $\lambda$, and $y_i^\star = 0$ if $|x_i| \le \lambda$, we also have $y_i^\star = \operatorname{sign}(x_i)\max\{0, |x_i| - \lambda\}$. Putting this together for all $i = 1, \dots, n$ yields
$$\operatorname{prox}_{\lambda\|\cdot\|_1}(x) = y^\star = \operatorname{sign}(x)\max\{0, |x| - \lambda\}, \quad \text{applied component-wise.}$$
Therefore, the $\ell_1$ proximal operator earns the names shrinkage and soft-thresholding. In Matlab, the computation can be written in one line as

y = sign(x).*max(0,abs(x)-lambda);

$\ell_2$ norm

Let $f(x) = \|x\|_2$,
which is a non-separable function. Recall its subdifferential
$$\partial f(x) = \begin{cases} \big\{x/\|x\|_2\big\}, & \text{if } x \ne 0,\\ \{p : \|p\|_2 \le 1\}, & \text{otherwise.}\end{cases}$$
Since $f$ is differentiable at all but one point, we can apply the assumption trick. Let $\lambda > 0$. First, assume that $y^\star = \operatorname{prox}_{\lambda\|\cdot\|_2}(x) \ne 0$. Then, $y^\star$ must satisfy
$$0 = \frac{y^\star}{\|y^\star\|_2} + \frac{1}{\lambda}(y^\star - x). \qquad (6)$$
Write $x$ in polar coordinates $(r_x, \theta_x)$, where $r_x = \|x\|_2$. From (6), $x - y^\star$ is a positive multiple of $y^\star/\|y^\star\|_2$; since $y^\star/\|y^\star\|_2$ and $y^\star$ have the same angle, the angle of $y^\star$ must equal the angle of $x$, or its negative. Therefore, it must hold that $y^\star = \alpha x$ for some $\alpha \in \mathbb{R}$. Substituting this into (6) yields
$$0 = \operatorname{sign}(\alpha) + \frac{(\alpha - 1)\,r_x}{\lambda}.$$
Hence,
$$\alpha = \begin{cases} \dfrac{r_x - \lambda}{r_x}, & \text{if } r_x > \lambda,\\ 0, & \text{otherwise,}\end{cases}$$
and $y^\star = \alpha x$. We have assumed $y^\star \ne 0$, but it is easy to verify that when $r_x \le \lambda$, $\frac{1}{\lambda}(x - 0) \in \{p : \|p\|_2 \le 1\}$ and thus $y^\star = \operatorname{prox}_{\lambda\|\cdot\|_2}(x) = 0$. Therefore,
$$\operatorname{prox}_{\lambda\|\cdot\|_2}(x) = \begin{cases} \dfrac{\|x\|_2 - \lambda}{\|x\|_2}\,x, & \text{if } \|x\|_2 > \lambda,\\ 0, & \text{otherwise.}\end{cases} \qquad (7)$$
In Matlab, the computation can be written in one line as

y = (max(0,norm(x)-lambda)/(norm(x)+eps))*x;

where eps is added to avoid division by zero.

$\ell_{p,q}$ norm

This norm is used to impose properties on groups of variables. For a matrix $A \in \mathbb{R}^{m \times n}$, its $\ell_{p,q}$ norm is
$$\|A\|_{p,q} = \Bigg(\sum_{j=1}^n \Big(\sum_{i=1}^m |a_{ij}|^p\Big)^{q/p}\Bigg)^{1/q}.$$
The most common example is the $\ell_{2,1}$ norm, used in the Group Lasso model:
$$\|A\|_{2,1} = \sum_{j=1}^n \Big(\sum_{i=1}^m |a_{ij}|^2\Big)^{1/2} = \sum_{j=1}^n \|A_j\|_2,$$
where $A_j$ is the $j$th column of $A$. Therefore, $\|A\|_{2,1}$ separates into the sum of the $\ell_2$ norms of the columns. To take advantage of this property, we write $X = [X_1\; X_2\; \cdots\; X_n]$, where $X_i \in \mathbb{R}^m$ is the $i$th column of $X$. Then,
$$\operatorname{prox}_{\lambda\|\cdot\|_{2,1}}(X) = \operatorname*{argmin}_{Y \in \mathbb{R}^{m \times n}}\; \|Y\|_{2,1} + \frac{1}{2\lambda}\|Y - X\|_F^2 = \big[\operatorname{prox}_{\lambda\|\cdot\|_2}(X_1), \dots, \operatorname{prox}_{\lambda\|\cdot\|_2}(X_n)\big].$$
Here, each $\operatorname{prox}_{\lambda\|\cdot\|_2}(X_i)$, $i = 1, \dots, n$, is given by (7).

$\ell_\infty$ norm

$\operatorname{prox}_{\lambda\|\cdot\|_\infty}$ can be derived in two ways: either directly from the definition of the proximal operator and $\|\cdot\|_\infty$, or following the Moreau decomposition in the next section. TODO: add the direct approach, which is based on $\partial\|\cdot\|_\infty$. It should reduce to the problem of finding $t = \|y^\star\|_\infty$ such that
$$\sum_{i :\, |x_i| > t} \big(|x_i| - t\big) = \lambda. \qquad (8)$$
Given the solution $t$ to (8),
$$\operatorname{prox}_{\lambda\|\cdot\|_\infty}(x) = \operatorname{sign}(x)\min\{|x|, t\}, \quad \text{component-wise.}$$

Unitary-invariant matrix norms

We call a matrix norm $|||\cdot|||$ unitary-invariant if $|||X||| = |||UXV|||$ for any matrix $X \in \mathbb{C}^{m \times n}$ and unitary matrices $U \in \mathbb{C}^{m \times m}$, $V \in \mathbb{C}^{n \times n}$. Since singular values are rotationally (unitarily) invariant, all singular-value-based matrix norms are unitary-invariant. Common examples are
- the nuclear norm $|||\cdot|||_*$: the $\ell_1$ norm, or sum, of the singular values;
- the Frobenius norm $|||\cdot|||_F$: the $\ell_2$ norm of the singular values;
- the $\ell_2$-operator norm $|||\cdot|||_2$: the $\ell_\infty$ norm, or maximum, of the singular values.
They are called the Schatten-$p$ norms for $p = 1, 2, \infty$, respectively. Let $|||\cdot|||$ be a unitary-invariant matrix norm and $\|\cdot\|$ be its corresponding norm on the singular values. Consider the matrix proximal operator
$$Y^\star = \operatorname{prox}_{\lambda|||\cdot|||}(X) := \operatorname*{argmin}_{Y}\; |||Y||| + \frac{1}{2\lambda}\|Y - X\|_F^2.$$
One can show that the solution $Y^\star$ must share the singular-vector factors, which are unitary matrices, with the input matrix $X$. Hence, the steps to compute $\operatorname{prox}_{\lambda|||\cdot|||}(X)$ are:
1. apply the SVD to $X$: $X = U\operatorname{diag}(\sigma)V^*$, where $\sigma$ is the vector of singular values of $X$ and $\operatorname{diag}(\sigma)$ is its diagonal matrix;
2. compute the vector proximal operator: $\sigma^\star = \operatorname{prox}_{\lambda\|\cdot\|}(\sigma)$;
3. form the solution: $Y^\star = U\operatorname{diag}(\sigma^\star)V^*$.

Moreau decomposition

Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ be a proper closed convex function and $\lambda > 0$. Let $f^*$ be the convex conjugate, or the Legendre transform, of $f$:
$$f^*(u) = \sup_v\; \langle v, u\rangle - f(v). \qquad (9)$$
In this chapter, we leave this definition unexplained; another chapter is dedicated to convex duality and the properties of convex conjugacy. The Moreau decomposition applies to any point $x \in \mathbb{R}^n$ and decomposes it as
$$x = y + \lambda z, \quad \text{where } y = \operatorname{prox}_{\lambda f}(x),\; z = \operatorname{prox}_{\lambda^{-1} f^*}(\lambda^{-1} x).$$

Complementary linear subspaces

Let $S$ be a linear subspace of $\mathbb{R}^n$ and $S^\perp$ be its complementary (orthogonal) subspace. Then, ... the decomposition reduces to
$$x = \operatorname{proj}_S(x) + \operatorname{proj}_{S^\perp}(x).$$
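The three SVD-based steps above can be sketched in NumPy for the nuclear norm, whose singular-value norm is the $\ell_1$ norm, so step 2 is soft-thresholding (this sketch and its names are ours, not from the notes):

```python
import numpy as np

def prox_l1(v, lam):
    # Soft-thresholding: prox of lam*||.||_1, derived earlier in these notes.
    return np.sign(v) * np.maximum(0.0, np.abs(v) - lam)

def prox_nuclear(X, lam):
    """Prox of lam*|||X|||_* by the three steps: SVD, soft-threshold
    the singular values, then reassemble with the same unitary factors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(prox_l1(s, lam)) @ Vt

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))
Y = prox_nuclear(X, lam=0.8)

# The singular values of the output are the soft-thresholded ones of the input.
sX = np.linalg.svd(X, compute_uv=False)
sY = np.linalg.svd(Y, compute_uv=False)
assert np.allclose(sY, np.maximum(0.0, sX - 0.8))
```

This operator is often called singular-value thresholding in the low-rank recovery literature.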
Cone and polar cone

...

$\ell_p$-norm and $\ell_q$-ball

Let $p, q \in [1, \infty]$ be such that $\frac{1}{p} + \frac{1}{q} = 1$. By definition,
$$\|\cdot\|_p^*(u) = \sup_v\; \langle v, u\rangle - \|v\|_p = \begin{cases} 0, & \text{if } \|u\|_q \le 1,\\ \infty, & \text{otherwise.}\end{cases}$$
Hence,
$$\|\cdot\|_p^*(\cdot) = \delta_{\|\cdot\|_q \le 1}(\cdot). \qquad (10)$$
Let us compute the projection onto the $\ell_q$-ball of radius $\alpha > 0$:
$$B_q^\alpha := \{x : \|x\|_q \le \alpha\}.$$
Obviously, $B_q^\alpha = \alpha\,\{x : \|x\|_q \le 1\}$, and thus $\delta_{B_q^\alpha}(\cdot) = \delta_{\|\cdot\|_q \le 1}(\cdot/\alpha)$. Therefore, by applying the Moreau decomposition with (10) and $\lambda = \alpha$, we obtain
$$\operatorname{proj}_{B_q^\alpha}(x) = \operatorname{prox}_{\delta_{B_q^\alpha}}(x) = x - \alpha\,\operatorname{prox}_{\|\cdot\|_p}(x/\alpha).$$
Applying this identity to the $\ell_1$, $\ell_2$, $\ell_\infty$ balls, we arrive at
1. $\operatorname{proj}_{\|\cdot\|_1 \le \alpha}(x) = x - \alpha\,\operatorname{prox}_{\|\cdot\|_\infty}(x/\alpha)$;
2. $\operatorname{proj}_{\|\cdot\|_2 \le \alpha}(x) = x - \alpha\,\operatorname{prox}_{\|\cdot\|_2}(x/\alpha)$;
3. $\operatorname{proj}_{\|\cdot\|_\infty \le \alpha}(x) = x - \alpha\,\operatorname{prox}_{\|\cdot\|_1}(x/\alpha)$.

Projection to box constraints

Two vectors $l \in (\{-\infty\} \cup \mathbb{R})^n$ and $u \in (\mathbb{R} \cup \{\infty\})^n$ define the box set
$$S = [l, u] = \{x \in \mathbb{R}^n : l_i \le x_i \le u_i\} \subseteq \mathbb{R}^n. \quad \dots$$
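The third ball-projection identity above can be verified numerically: projecting onto an $\ell_\infty$-ball via the $\ell_1$ prox must agree with component-wise clipping. A minimal NumPy sketch (names and test data are ours):

```python
import numpy as np

def prox_l1(v, lam):
    # Soft-thresholding: prox of lam*||.||_1.
    return np.sign(v) * np.maximum(0.0, np.abs(v) - lam)

def proj_linf_ball(x, alpha):
    """Projection onto {x : ||x||_inf <= alpha} via the Moreau identity
    proj(x) = x - alpha * prox_{||.||_1}(x / alpha)."""
    return x - alpha * prox_l1(x / alpha, 1.0)

x = np.array([3.0, -0.5, 1.2, -4.0])
p = proj_linf_ball(x, alpha=1.0)

# The l_inf-ball projection is component-wise clipping to [-alpha, alpha].
assert np.allclose(p, np.clip(x, -1.0, 1.0))
```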
Projection to subspaces and affine sets

$S = \{x \in \mathbb{R}^n : x_1 = x_2 = \cdots = x_n\}$. ...

$S = \{x \in \mathbb{R}^n : \langle a, x\rangle + b = 0\}$. ...

$S = \{x \in \mathbb{R}^n : \langle a, x\rangle + b = 0,\; \langle c, x\rangle + d = 0\}$. ...

Total variation

Consider $x \in \mathbb{R}^n$, ... In two or more dimensional spaces, ... graph-cut, max-flow ...

Function under a linear transform

Let $A \in \mathbb{R}^{m \times n}$. Consider the function
$$h(x) = f(Ax).$$
Assume that $AA^T = I$. Here, if $A$ is a square matrix, $A$ is called an orthogonal or orthonormal matrix; if $A$ is rectangular, we call it a frame. Since $AA^T = I$, we have $\operatorname{rank}(A) = m$ and $\|y - x\|_2 = \|Ay - Ax\|_2$ for any $x, y$, and the linear transform $T : x \mapsto Ax$ is surjective. Therefore,
$$\operatorname{prox}_{\lambda h}(y) = \operatorname*{argmin}_{x \in \mathbb{R}^n}\; f(Ax) + \frac{1}{2\lambda}\|y - x\|^2 = \operatorname*{argmin}_{x \in \mathbb{R}^n}\; f(Ax) + \frac{1}{2\lambda}\|Ay - Ax\|^2$$
and, with the change of variable $z := Ax$,
$$= A^T\Big(\operatorname*{argmin}_{z \in \mathbb{R}^m}\; f(z) + \frac{1}{2\lambda}\|Ay - z\|^2\Big) = A^T\operatorname{prox}_{\lambda f}(Ay).$$
Here, the change of variable from $x \in \mathbb{R}^n$ to $z := Ax \in \mathbb{R}^m$ relies on the fact that $T : x \mapsto Ax$ is surjective. From the solution $z^\star$, we can recover the solution $x^\star = A^T z^\star$, since $A^T z^\star = A^T A x^\star = x^\star$.

Proximable functions

Definition 2. A function $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is proximable if $\operatorname{prox}_{\gamma f}$ can be computed in $O(n)$ or $O(n\,\mathrm{polylog}(n))$ time.

Common proximable functions include
1. norms: $\ell_1$, $\ell_2$, $\ell_{2,1}$, $\ell_\infty$;
2. separable functions and indicator functions of separable constraints;
3. the indicator function of the standard simplex $\{x \in \mathbb{R}^n : \mathbf{1}^T x = 1,\; x \ge 0\}$;
TODO: add more.

This section studies part 2 and some summative proximable functions.

Separable sum

Proposition 1. For a separable function
$$f(x) = \sum_{i=1}^n f_i(x_i),$$
it holds that
$$\operatorname{prox}_{\lambda f}(x) = \big(\operatorname{prox}_{\lambda f_1}(x_1), \dots, \operatorname{prox}_{\lambda f_n}(x_n)\big).$$

Summative proximable functions

In general, even if $f$ and $g$ are both proximable, $h = f + g$ may not be proximable. In operator splitting algorithms, one may then have to deal with $f$ and $g$ in two subproblems. However, there are exceptions, which we call summative proximable functions.

Definition 3. We call $h = f + g$ a summative proximable function if
$$\operatorname{prox}_{\lambda h}(x) = \operatorname{prox}_{\lambda f}\big(\operatorname{prox}_{\lambda g}(x)\big) \quad \forall x \in \mathbb{R}^n,\; \lambda > 0.$$

Examples of summative proximable functions:
1. In $\mathbb{R}$, let $f : \mathbb{R} \to \mathbb{R}$ be convex and satisfy $f(0) = 0$. Then, the function $f + |\cdot|$ is summative proximable. An example is the elastic-net regularization function $\|x\|_1 + \frac{1}{2\mu}\|x\|_2^2$, which is component-wise separable.
2. In $\mathbb{R}^n$, if $g$ is a homogeneous function, i.e., $g(\alpha x) = \alpha\,g(x)$ for all $\alpha \ge 0$, then the function $\|\cdot\|_2 + g$ is summative proximable. Special cases of homogeneous functions include norms (e.g., the $\ell_p$-norm, $p \in [1, \infty]$) and the indicator functions $\delta_{\{x \ge 0\}}$ and $\delta_{\{x \le 0\}}$.
3. In $\mathbb{R}^n$, let $f$ be a prox-monotone function, meaning that for all $x \in \mathbb{R}^n$ and $i, j = 1, \dots, n$, $\operatorname{prox}_f$ satisfies
$$x_i \ge x_j \;\Rightarrow\; \operatorname{prox}_f(x)_i \ge \operatorname{prox}_f(x)_j \quad\text{and}\quad x_i = x_j \;\Rightarrow\; \operatorname{prox}_f(x)_i = \operatorname{prox}_f(x)_j,$$
and let $g$ be the 1D total variation $g(x) = \sum_{i=1}^{n-1} |x_{i+1} - x_i|$; then $f + g$ is summative proximable. Examples of prox-monotone functions include $\ell_1$, $\ell_2$, $\ell_\infty$, and the box indicator $\delta_{[l,u]}$, $l, u \in \mathbb{R}^n$. The function $\alpha\|x\|_1 + g(x)$ is called the Fused Lasso regularizer.
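A minimal numerical check of Definition 3, taking $f = \|\cdot\|_2$ and the homogeneous $g = \|\cdot\|_1$ as in example 2 (this sketch, its names, and the test point are ours): we form the composition and verify the optimality condition of $h = \|\cdot\|_2 + \|\cdot\|_1$ at the output, which certifies that the composition is indeed the prox of the sum at this point.

```python
import numpy as np

def prox_l1(v, lam):
    return np.sign(v) * np.maximum(0.0, np.abs(v) - lam)

def prox_l2(v, lam):
    n = np.linalg.norm(v)
    return (max(0.0, n - lam) / n) * v if n > 0 else 0.0 * v

# Composition prox_{lam f} ∘ prox_{lam g} with f = ||.||_2, g = ||.||_1.
lam = 0.3
x = np.array([2.0, -1.5, 1.0])
y = prox_l2(prox_l1(x, lam), lam)

# Optimality of h = ||.||_2 + ||.||_1 at y (all components nonzero here):
# x - y must equal lam * (y/||y||_2 + sign(y)).
assert np.allclose(x - y, lam * (y / np.linalg.norm(y) + np.sign(y)))
```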
Proximal fixed-point optimality conditions

Theorem 1. Let $\lambda > 0$. The point $x^\star \in \mathbb{R}^n$ is a minimizer of a proper closed convex function $f$ if, and only if, $x^\star = \operatorname{prox}_{\lambda f}(x^\star)$.

Proof. "$\Rightarrow$": Let $x^\star \in \operatorname{argmin} f$. Then, for any $x \in \mathbb{R}^n$,
$$f(x) + \frac{1}{2\lambda}\|x - x^\star\|^2 \ge f(x^\star) = f(x^\star) + \frac{1}{2\lambda}\|x^\star - x^\star\|^2.$$
Thus, $x^\star = \operatorname*{argmin}_x f(x) + \frac{1}{2\lambda}\|x - x^\star\|^2 = \operatorname{prox}_{\lambda f}(x^\star)$.
"$\Leftarrow$": Let $x^\star = \operatorname{prox}_{\lambda f}(x^\star)$. By the subgradient optimality condition,
$$0 \in \partial f(x^\star) + \frac{1}{\lambda}(x^\star - x^\star) = \partial f(x^\star).$$
Thus, $0 \in \partial f(x^\star)$, and $x^\star \in \operatorname{argmin} f$. ∎

Proximal-point algorithm

The proximal-point algorithm (PPA) refers to the iteration
$$x^{k+1} \leftarrow \operatorname{prox}_{\lambda f}(x^k), \qquad (11)$$
where $\lambda > 0$ is the step size. Although it is seldom used as an algorithm to minimize $f$, it recovers the Method of Multipliers (Augmented Lagrangian Method) and others. The step size can vary with the iteration within a closed interval, namely, $\lambda_k \in [l, u]$ for $0 < l \le u < \infty$.

Subgradient-descent interpretation

Although a negative subgradient may not be a descent direction, we will see that, in the PPA, the subgradient evaluated at the new point $x^{k+1}$ ensures function-value descent. The PPA iteration (11) satisfies
$$x^{k+1} = \operatorname{prox}_{\lambda f}(x^k) \;\iff\; 0 \in \lambda\,\partial f(x^{k+1}) + x^{k+1} - x^k \;\iff\; x^{k+1} = x^k - \lambda\,\tilde\nabla f(x^{k+1}),$$
where $\tilde\nabla f(x^{k+1}) \in \partial f(x^{k+1})$ is a subgradient. It is uniquely determined by $\operatorname{prox}_{\lambda f}(x^k)$ even if $\partial f(x^{k+1})$ has more than one element. Let us compare $f(x^k)$ and $f(x^{k+1})$. By the definition of the subgradient,
$$f(x^k) \ge f(x^{k+1}) + \big\langle \tilde\nabla f(x^{k+1}),\, x^k - x^{k+1}\big\rangle = f(x^{k+1}) + \frac{1}{\lambda}\|x^{k+1} - x^k\|^2.$$
Hence, unless $\|x^{k+1} - x^k\| = 0$, in which case $\tilde\nabla f(x^{k+1}) = 0$ and thus $x^{k+1}$ is optimal, the function value is always decreased.
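The descent property of the PPA can be observed on the simple example $f(x) = |x|$, whose proximal operator is scalar soft-thresholding; any positive step size works. This sketch and its names are ours:

```python
import numpy as np

def prox_abs(x, lam):
    # Prox of lam*|.|: soft-thresholding in one dimension.
    return np.sign(x) * max(0.0, abs(x) - lam)

# PPA on f(x) = |x|, whose minimizer is 0.
lam, x = 0.25, 3.1
values = [abs(x)]
for _ in range(20):
    x = prox_abs(x, lam)
    values.append(abs(x))

assert x == 0.0                                         # reaches the minimizer
assert all(a >= b for a, b in zip(values, values[1:]))  # monotone descent
```

Note that a subgradient step on $|x|$ with fixed step size would oscillate around 0, while the (implicit) proximal step lands exactly on the minimizer and stays there.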
Dual interpretation

Let $y^{k+1} = \tilde\nabla f(x^{k+1}) \in \partial f(x^{k+1})$; then we have
$$y^{k+1} \in \partial f(x^k - \lambda y^{k+1}).$$
Therefore, computing $\operatorname{prox}_{\lambda f}(x^k)$ is equivalent to solving for a subgradient at the descent destination.

Diminishing Tikhonov-regularization interpretation

The PPA iteration reads
$$x^{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^n}\Big(f(x) + \frac{1}{2\lambda}\|x - x^k\|_2^2\Big).$$
The second term can be considered a regularization term, which keeps $x^{k+1}$ close to $x^k$. Because the regularization is not anchored at a fixed point but uses the current point $x^k$, the amount of regularization goes away as $x^k$ converges.

Bregman iterative regularization interpretation

Bregman iterative regularization refers to the iteration
$$x^{k+1} \leftarrow \operatorname*{argmin}_{x \in \mathbb{R}^n}\; \frac{1}{\lambda} D_r(x; x^k) + f(x), \qquad (12)$$
where $D_r(\cdot\,; x^k)$ is the Bregman distance (or Bregman divergence) function induced by a proper closed convex (possibly nonsmooth) function $r$. Specifically,
$$D_r(x; x^k) := r(x) - r(x^k) - \langle p,\, x - x^k\rangle,$$
where $p \in \partial r(x^k)$ is a given subgradient. Since $p$ determines the Bregman distance, we sometimes write $D_r(x; x^k, p)$. By the definition of the subgradient, $D_r(x; x^k) \ge 0$ for all $x \in \mathbb{R}^n$, and it tends to be smaller when $x$ is closer to $x^k$. Although it is called a distance, it generally violates the conditions of a mathematical distance.

The following Bregman distances are often used.
1. The squared Euclidean distance $D(x; y) = \frac{1}{2}\|x - y\|_2^2$ is induced by $r(\cdot) = \frac{1}{2}\|\cdot\|_2^2$.
2. The Kullback–Leibler divergence
$$D(x; y) = \sum_{i=1}^n \Big(x_i \log\frac{x_i}{y_i} - x_i + y_i\Big)$$
is induced by $r(x) = \sum_{i=1}^n \big(x_i\log(x_i) - x_i\big)$. This Bregman distance measures the difference between two probability densities $x, y$.
3. The $\ell_1$ Bregman distance is induced by $r(\cdot) = \|\cdot\|_1$; it is used in compressed sensing. The total-variation Bregman distance is used in image reconstruction. Note that, due to the existence of multiple subgradients, these two Bregman distances are not defined until a subgradient is specified. Typically, one picks $p = 0$ at the beginning and, afterwards, the subgradient that appears in the optimality condition of the previous iteration.

The PPA iteration is a special case of (12) corresponding to the convex function $r(\cdot) = \frac{1}{2}\|\cdot\|_2^2$.

Convergence of the proximal-point algorithm

Several more complicated algorithms (including the alternating direction method of multipliers, or ADMM, and a variety of primal-dual methods) are special cases of the PPA. They correspond to particular proximal or resolvent operators. Hence, analyzing the convergence of the PPA is fundamental to the study of first-order optimization algorithms and operator splitting methods. The analysis approach that we take below underlies the analysis of many other algorithms.

Let us assume that $f$ is proper closed convex and $\operatorname{argmin} f$ is nonempty (but possibly not a singleton). We first study the convergence of the sequence $\{x^k\}$.

Definition 4. Consider an operator $T : \mathbb{R}^n \to \mathbb{R}^n$.
1. The operator $R := I - T$ is called the (fixed-point) residual operator, and $R(x) = x - T(x)$ is called the (fixed-point) residual at $x$.
2. The operator $T$ is called firmly nonexpansive if
$$\|T(x) - T(y)\|^2 \le \|x - y\|^2 - \|R(x) - R(y)\|^2 \quad \forall x, y \in \operatorname{dom} T. \qquad (13)$$
3. The operator $T$ is called strictly contractive, or $\alpha$-contractive for $\alpha \in (0, 1)$, if
$$\|T(x) - T(y)\| \le \alpha\|x - y\| \quad \forall x, y \in \operatorname{dom} T.$$
Through simple algebra, one can show that $T$ is firmly nonexpansive if, and only if, $R = I - T$ is firmly nonexpansive. We introduce the firmly nonexpansive operator because it leads to sequence convergence, and the proximal operator of a convex function is firmly nonexpansive.
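The firm-nonexpansiveness inequality in Definition 4 can be spot-checked numerically for a concrete proximal operator, e.g., soft-thresholding (the prox of $\lambda\|\cdot\|_1$). This sketch is ours:

```python
import numpy as np

def T(v, lam=0.7):
    # Prox of lam*||.||_1 (soft-thresholding): firmly nonexpansive.
    return np.sign(v) * np.maximum(0.0, np.abs(v) - lam)

rng = np.random.default_rng(2)
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    Rx, Ry = x - T(x), y - T(y)       # fixed-point residuals R = I - T
    lhs = np.linalg.norm(T(x) - T(y)) ** 2 + np.linalg.norm(Rx - Ry) ** 2
    assert lhs <= np.linalg.norm(x - y) ** 2 + 1e-12
```

The inequality holds coordinate-wise here because both $T$ and $R = I - T$ are monotone nondecreasing in each scalar coordinate, so the cross term in $\|x - y\|^2 = \|T(x)-T(y) + R(x)-R(y)\|^2$ is nonnegative.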
Proposition 2. For any proper closed convex function $f$ and $\lambda > 0$, $\operatorname{prox}_{\lambda f}$ is firmly nonexpansive. If $f$ is also strongly convex, then $\operatorname{prox}_{\lambda f}$ is contractive.

Proof. (To be added.)

As long as $f$ is convex and has a minimizer (not necessarily unique), the PPA converges to a minimizer.

Theorem 2. For any proper closed convex function $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ that has a minimizer and any $\lambda > 0$, the PPA iteration (11) produces a sequence $\{x^k\}$ that converges to some $x^\star \in \operatorname{argmin} f$.

Proof. Pick an arbitrary $x^\star \in \operatorname{argmin} f$. Applying (13) with $T = \operatorname{prox}_{\lambda f}$, $x = x^k$, and $y = x^\star$ yields
$$\|x^{k+1} - x^\star\|^2 \le \|x^k - x^\star\|^2 - \|\operatorname{prox}_{\lambda f}(x^k) - x^k\|^2, \qquad (14)$$
from which we conclude:
1. $\|x^k - x^\star\| \le \|x^0 - x^\star\|$ for all $k$; thus, the sequence $(x^k)_{k \ge 0}$ is bounded and has a convergent subsequence $(x^j)_{j \in J} \subseteq (x^k)_{k \ge 0}$ with limit
$$\bar x := \lim_{j \in J} x^j; \qquad (15)$$
2. summing (14) in a telescopic fashion gives $\sum_{k=0}^\infty \|\operatorname{prox}_{\lambda f}(x^k) - x^k\|^2 < \infty$ and, thus,
$$\lim_{k}\|\operatorname{prox}_{\lambda f}(x^k) - x^k\| = 0. \qquad (16)$$
Since $\operatorname{prox}_{\lambda f}(x)$ is continuous in $x$, so is $\|\operatorname{prox}_{\lambda f}(x) - x\|$. Based on this continuity, from (15) and (16),
$$\|\operatorname{prox}_{\lambda f}(\bar x) - \bar x\| = \lim_{j \in J}\|\operatorname{prox}_{\lambda f}(x^j) - x^j\| = 0.$$
Therefore, $\operatorname{prox}_{\lambda f}(\bar x) - \bar x = 0$ and, by Theorem 1, $\bar x \in \operatorname{argmin} f$. Recall that $x^\star \in \operatorname{argmin} f$ is arbitrary. By letting $x^\star = \bar x$ in (14), we get $\|x^{k+1} - \bar x\|^2 \le \|x^k - \bar x\|^2$. For each $k \ge 0$, define $j_k = \max\{j \in J : j \le k\}$. As $j_k \le k$, $\|x^k - \bar x\| \le \|x^{j_k} - \bar x\|$. Because $\{j_k : k \ge 0\} = J$, we get
$$\lim_k \|x^k - \bar x\| \le \lim_k \|x^{j_k} - \bar x\| = \lim_{j \in J}\|x^j - \bar x\| = 0. \qquad \blacksquare$$

TODO: add convergence rates:
1. $\|x^{k+1} - x^k\|$ is monotonically nonincreasing;
2. $\|x^{k+1} - x^k\|^2 = o(1/k)$;
3. $\|x^{k+1} - x^k\|^2 = o(1/k^2)$, using the monotonicity and summability of $f(x^k) - f^\star$.

On the other hand, if $f$ is strongly convex, then the PPA converges to its unique minimizer at a linear (geometric) rate. This is a direct result of the Banach fixed-point theorem.

Proximal operators of nonconvex functions

In general, a nonconvex function does not have a subdifferential, and the minimization problem defining the proximal operator may have more than one stationary point. The proximal operator of a nonconvex function is, therefore, computed on a case-by-case basis.

$\ell_0$ norm

The $\ell_0$ function counts the number of nonzeros of the input, that is,
$$\|x\|_0 := \big|\{i : x_i \ne 0\}\big|.$$
Given a vector $x \in \mathbb{R}^n$, sort its components by magnitude so that $|x_{[1]}| \ge |x_{[2]}| \ge \cdots \ge |x_{[n]}|$, where $x_{[i]}$ is the component of $x$ with the $i$th largest absolute value (not necessarily equal to $x_i$). Let us compute
$$\operatorname{prox}_{\lambda\ell_0}(x) := \operatorname*{argmin}_{y \in \mathbb{R}^n}\; \|y\|_0 + \frac{1}{2\lambda}\|y - x\|_2^2.$$
Since the value of a nonzero component $y_i$ does not affect $\|y\|_0$, we only need to decide the set of nonzero components of the solution $y^\star$; if $y_i^\star$ is nonzero, it must equal $x_i$. In addition, suppose that $y^\star$ has $p$ nonzero components; then $\|y^\star\|_0 = p$ is fixed, and $\frac{1}{2\lambda}\|y^\star - x\|_2^2$ reaches its minimum if we identify the $p$ largest-magnitude components of $x$, copy them into the corresponding components of $y^\star$, and threshold the remaining $n - p$ components to 0. Therefore, the problem is simplified to figuring out $p$. For $p = \|y^\star\|_0 = 0, 1, \dots, n$, the values of
$$f_p = \min_{y :\, \|y\|_0 = p}\; \|y\|_0 + \frac{1}{2\lambda}\|y - x\|_2^2$$
are, respectively,
$$f_0 = \frac{1}{2\lambda}\sum_{i=1}^n x_{[i]}^2, \quad f_1 = 1 + \frac{1}{2\lambda}\sum_{i=2}^n x_{[i]}^2, \quad \dots, \quad f_{n-1} = (n-1) + \frac{1}{2\lambda}x_{[n]}^2, \quad f_n = n + 0.$$
The difference
$$f_i - f_{i-1} = 1 - \frac{1}{2\lambda}x_{[i]}^2, \quad i = 1, \dots, n,$$
is monotonically increasing in $i$. Let
$$i^\star = \max\{i : f_i - f_{i-1} < 0\} = \max\big\{i : |x_{[i]}| > \sqrt{2\lambda}\big\}$$
and, if the maximum is not attained (the set is empty), let $i^\star = 0$. Then, the minimal value $f^\star$ is
$$f_{i^\star} = f_0 + \sum_{i=1}^{i^\star}(f_i - f_{i-1}) = i^\star + \frac{1}{2\lambda}\sum_{i=i^\star+1}^n x_{[i]}^2,$$
which is attained at
$$y^\star = \operatorname{prox}_{\lambda\ell_0}(x), \quad \text{where } y_i^\star = \begin{cases} 0, & \text{if } |x_i| < \sqrt{2\lambda},\\ 0 \text{ or } x_i, & \text{if } |x_i| = \sqrt{2\lambda},\\ x_i, & \text{otherwise.}\end{cases}$$
Therefore, $\operatorname{prox}_{\lambda\ell_0}$ is also called the (hard) thresholding operator.

$\ell_{1/2}$ and $\ell_{2/3}$ quasi-norms

(TODO)

Uncovered topics

- Proximal-based operator-splitting algorithms, such as the prox-gradient algorithm (a future topic)
- Dual proximal algorithms (a future topic)
- Analysis of existence, continuity, boundedness, etc. (references: ...)

History notes

In 1937, von Neumann showed that any unitary-invariant matrix norm can be written as $|||X||| = g(\sigma_X)$, where $\sigma_X$ is the vector of singular values of $X$ and $g$ is a symmetric gauge function. (TODO: add more)
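The hard-thresholding operator derived in the $\ell_0$ section can be sketched in NumPy as follows (names and test data are ours; at the tie $|x_i| = \sqrt{2\lambda}$, where both 0 and $x_i$ are optimal, this sketch arbitrarily chooses 0):

```python
import numpy as np

def prox_l0(x, lam):
    """Hard thresholding: keep components with |x_i| > sqrt(2*lam), zero the rest."""
    y = x.copy()
    y[np.abs(x) <= np.sqrt(2.0 * lam)] = 0.0
    return y

x = np.array([3.0, -0.5, 1.1, -2.2, 0.9])
lam = 0.5                      # threshold sqrt(2*lam) = 1.0
y = prox_l0(x, lam)

def objective(z):
    return np.count_nonzero(z) + np.sum((z - x) ** 2) / (2 * lam)

assert np.allclose(y, [3.0, 0.0, 1.1, -2.2, 0.0])
# The thresholded point beats, e.g., x itself and the zero vector.
assert objective(y) <= objective(x) and objective(y) <= objective(np.zeros(5))
```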
Exercises

1. Consider the function $r : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ given by
$$r(x) = \begin{cases} -\log x, & \text{if } x > 0,\\ \infty, & \text{otherwise.}\end{cases}$$
Show that for all $y \in \mathbb{R}$ and $\lambda > 0$,
$$\operatorname{prox}_{\lambda r}(y) = \frac{y + \sqrt{y^2 + 4\lambda}}{2}.$$

2. Consider the function $r : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ given by
$$r(x) = \begin{cases} x, & \text{if } x \ge 0,\\ 0, & \text{otherwise.}\end{cases}$$
Derive the formula of $\operatorname{prox}_{\lambda r}(y)$, where $y \in \mathbb{R}$ and $\lambda > 0$.

3. Consider the function $r : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ given by
$$r(x) = \begin{cases} x, & \text{if } x \ge 0,\\ \infty, & \text{otherwise.}\end{cases}$$
Derive the formula of $\operatorname{prox}_{\lambda r}(y)$, where $y \in \mathbb{R}$ and $\lambda > 0$.

4. Consider the weighted 1-norm
$$\|x\|_{1,w} = \sum_{i=1}^n w_i |x_i|.$$
Derive the formula of $\operatorname{prox}_{\lambda r}(y)$ for $r = \|\cdot\|_{1,w}$, where $y \in \mathbb{R}^n$ and $\lambda > 0$.

5. Consider the weighted 2-norm
$$\|x\|_{2,w} = \Big(\sum_{i=1}^n w_i |x_i|^2\Big)^{1/2}.$$
Derive the formula of $\operatorname{prox}_{\lambda r}(y)$ for $r = \|\cdot\|_{2,w}$, where $y \in \mathbb{R}^n$ and $\lambda > 0$.

6. Given a proper function $g : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ and its proximal mapping $\operatorname{prox}_{\lambda g}$, derive the proximal mapping $\operatorname{prox}_{\gamma f}$ for the function $f(x) = \alpha\,g(x/\beta)$, where $\alpha, \beta > 0$ are given.
7. Given a proper function $g : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ and its proximal mapping $\operatorname{prox}_{\lambda g}$ for all $\lambda > 0$, derive the proximal mapping $\operatorname{prox}_{\gamma f}$ of the function
$$f(x) = g(x) + \frac{1}{2\alpha}\|x - x^0\|_2^2,$$
where $\alpha > 0$ and $x^0 \in \mathbb{R}^n$ are given.

8. Given a proper function $g : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ and its proximal mapping $\operatorname{prox}_{\lambda g}$ for all $\lambda > 0$, derive the proximal mapping $\operatorname{prox}_{\gamma f}$ of the function $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$:
$$f(x) = g(x_1 + x_2 + \cdots + x_n).$$

9. Given a proper function $g : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ and its proximal mapping $\operatorname{prox}_{\lambda g}$ for all $\lambda > 0$, derive the proximal mapping of the function $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$:
$$f(x) = g(Ax + b),$$
where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ are given and $AA^T = \alpha I$ for some $\alpha > 0$.

10. Define the set $D = \{x \in \mathbb{R}^n : x_1 = x_2 = \cdots = x_n\}$. Given a proper function $g : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ and its proximal mapping $\operatorname{prox}_{\lambda g}$ for all $\lambda > 0$, derive the proximal mapping $\operatorname{prox}_{\gamma f}$ of the function $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$:
$$f(x) = \delta_D(x) + \sum_{i=1}^n g(x_i).$$

11. Let $f, r : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ be proper closed convex functions, and assume $f$ is continuously differentiable. Show that
$$x^\star \in \operatorname*{argmin}\big(r(x) + f(x)\big) \;\iff\; x^\star = \operatorname{prox}_{\lambda r}\big(x^\star - \lambda \nabla f(x^\star)\big),$$
where $\lambda > 0$. Is this still true if $r$ is nonconvex?
More informationOptimality Conditions for Constrained Optimization
72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)
More informationDual Proximal Gradient Method
Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method
More informationConvex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version
Convex Optimization Theory Chapter 5 Exercises and Solutions: Extended Version Dimitri P. Bertsekas Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com
More informationLECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE
LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization
More informationPart 1a: Inner product, Orthogonality, Vector/Matrix norm
Part 1a: Inner product, Orthogonality, Vector/Matrix norm September 19, 2018 Numerical Linear Algebra Part 1a September 19, 2018 1 / 16 1. Inner product on a linear space V over the number field F A map,
More information1 Sparsity and l 1 relaxation
6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the
More informationExtreme Abridgment of Boyd and Vandenberghe s Convex Optimization
Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The
More informationConvex Analysis Notes. Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE
Convex Analysis Notes Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE These are notes from ORIE 6328, Convex Analysis, as taught by Prof. Adrian Lewis at Cornell University in the
More informationConvex Functions. Pontus Giselsson
Convex Functions Pontus Giselsson 1 Today s lecture lower semicontinuity, closure, convex hull convexity preserving operations precomposition with affine mapping infimal convolution image function supremum
More informationC*-algebras - a case study
- a case study Definition Suppose that H is a Hilbert space. A C -algebra is an operator-norm closed -subalgebra of BpHq. are closed under ultraproducts and subalgebras so they should be captured by continous
More informationAlgorithms for Nonsmooth Optimization
Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization
More informationFunctional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...
Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................
More informationSplitting methods for decomposing separable convex programs
Splitting methods for decomposing separable convex programs Philippe Mahey LIMOS - ISIMA - Université Blaise Pascal PGMO, ENSTA 2013 October 4, 2013 1 / 30 Plan 1 Max Monotone Operators Proximal techniques
More informationIterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem
Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem Charles Byrne (Charles Byrne@uml.edu) http://faculty.uml.edu/cbyrne/cbyrne.html Department of Mathematical Sciences
More informationSequential Unconstrained Minimization: A Survey
Sequential Unconstrained Minimization: A Survey Charles L. Byrne February 21, 2013 Abstract The problem is to minimize a function f : X (, ], over a non-empty subset C of X, where X is an arbitrary set.
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationLecture 2: Linear Algebra Review
EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1
More informationOn the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,
Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationA Brief Review on Convex Optimization
A Brief Review on Convex Optimization 1 Convex set S R n is convex if x,y S, λ,µ 0, λ+µ = 1 λx+µy S geometrically: x,y S line segment through x,y S examples (one convex, two nonconvex sets): A Brief Review
More informationProximal Methods for Optimization with Spasity-inducing Norms
Proximal Methods for Optimization with Spasity-inducing Norms Group Learning Presentation Xiaowei Zhou Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology
More informationFunctional Analysis I
Functional Analysis I Course Notes by Stefan Richter Transcribed and Annotated by Gregory Zitelli Polar Decomposition Definition. An operator W B(H) is called a partial isometry if W x = X for all x (ker
More informationCSCI : Optimization and Control of Networks. Review on Convex Optimization
CSCI7000-016: Optimization and Control of Networks Review on Convex Optimization 1 Convex set S R n is convex if x,y S, λ,µ 0, λ+µ = 1 λx+µy S geometrically: x,y S line segment through x,y S examples (one
More informationAn introduction to some aspects of functional analysis
An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms
More informationMonotone Operator Splitting Methods in Signal and Image Recovery
Monotone Operator Splitting Methods in Signal and Image Recovery P.L. Combettes 1, J.-C. Pesquet 2, and N. Pustelnik 3 2 Univ. Pierre et Marie Curie, Paris 6 LJLL CNRS UMR 7598 2 Univ. Paris-Est LIGM CNRS
More informationLinear Algebra Massoud Malek
CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product
More informationAn Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods
An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This
More informationConvex Optimization & Lagrange Duality
Convex Optimization & Lagrange Duality Chee Wei Tan CS 8292 : Advanced Topics in Convex Optimization and its Applications Fall 2010 Outline Convex optimization Optimality condition Lagrange duality KKT
More informationIntroduction to Alternating Direction Method of Multipliers
Introduction to Alternating Direction Method of Multipliers Yale Chang Machine Learning Group Meeting September 29, 2016 Yale Chang (Machine Learning Group Meeting) Introduction to Alternating Direction
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57
More informationComputational Statistics and Optimisation. Joseph Salmon Télécom Paristech, Institut Mines-Télécom
Computational Statistics and Optimisation Joseph Salmon http://josephsalmon.eu Télécom Paristech, Institut Mines-Télécom Plan Duality gap and stopping criterion Back to gradient descent analysis Forward-backward
More informationOptimality Conditions for Nonsmooth Convex Optimization
Optimality Conditions for Nonsmooth Convex Optimization Sangkyun Lee Oct 22, 2014 Let us consider a convex function f : R n R, where R is the extended real field, R := R {, + }, which is proper (f never
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationMath 273a: Optimization Convex Conjugacy
Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper
More informationSubgradient Projectors: Extensions, Theory, and Characterizations
Subgradient Projectors: Extensions, Theory, and Characterizations Heinz H. Bauschke, Caifang Wang, Xianfu Wang, and Jia Xu April 13, 2017 Abstract Subgradient projectors play an important role in optimization
More informationStructural and Multidisciplinary Optimization. P. Duysinx and P. Tossings
Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be
More informationFrank-Wolfe Method. Ryan Tibshirani Convex Optimization
Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)
More informationMATH 260 Class notes/questions January 10, 2013
MATH 26 Class notes/questions January, 2 Linear transformations Last semester, you studied vector spaces (linear spaces) their bases, dimension, the ideas of linear dependence and linear independence Now
More informationExistence and Approximation of Fixed Points of. Bregman Nonexpansive Operators. Banach Spaces
Existence and Approximation of Fixed Points of in Reflexive Banach Spaces Department of Mathematics The Technion Israel Institute of Technology Haifa 22.07.2010 Joint work with Prof. Simeon Reich General
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationChapter 2 Convex Analysis
Chapter 2 Convex Analysis The theory of nonsmooth analysis is based on convex analysis. Thus, we start this chapter by giving basic concepts and results of convexity (for further readings see also [202,
More informationChapter 2: Preliminaries and elements of convex analysis
Chapter 2: Preliminaries and elements of convex analysis Edoardo Amaldi DEIB Politecnico di Milano edoardo.amaldi@polimi.it Website: http://home.deib.polimi.it/amaldi/opt-14-15.shtml Academic year 2014-15
More information3.10 Lagrangian relaxation
3.10 Lagrangian relaxation Consider a generic ILP problem min {c t x : Ax b, Dx d, x Z n } with integer coefficients. Suppose Dx d are the complicating constraints. Often the linear relaxation and the
More informationNonsmooth optimization: conditioning, convergence, and semi-algebraic models
Nonsmooth optimization: conditioning, convergence, and semi-algebraic models Adrian Lewis ORIE Cornell International Congress of Mathematicians Seoul, August 2014 1/16 Outline I Optimization and inverse
More informationSTAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5
STAT 39: MATHEMATICAL COMPUTATIONS I FALL 17 LECTURE 5 1 existence of svd Theorem 1 (Existence of SVD) Every matrix has a singular value decomposition (condensed version) Proof Let A C m n and for simplicity
More informationConvex Optimization M2
Convex Optimization M2 Lecture 3 A. d Aspremont. Convex Optimization M2. 1/49 Duality A. d Aspremont. Convex Optimization M2. 2/49 DMs DM par email: dm.daspremont@gmail.com A. d Aspremont. Convex Optimization
More informationMath Introduction to Numerical Analysis - Class Notes. Fernando Guevara Vasquez. Version Date: January 17, 2012.
Math 5620 - Introduction to Numerical Analysis - Class Notes Fernando Guevara Vasquez Version 1990. Date: January 17, 2012. 3 Contents 1. Disclaimer 4 Chapter 1. Iterative methods for solving linear systems
More informationB553 Lecture 5: Matrix Algebra Review
B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations
More informationProposition 42. Let M be an m n matrix. Then (32) N (M M)=N (M) (33) R(MM )=R(M)
RODICA D. COSTIN. Singular Value Decomposition.1. Rectangular matrices. For rectangular matrices M the notions of eigenvalue/vector cannot be defined. However, the products MM and/or M M (which are square,
More informationLinear Programming. Larry Blume Cornell University, IHS Vienna and SFI. Summer 2016
Linear Programming Larry Blume Cornell University, IHS Vienna and SFI Summer 2016 These notes derive basic results in finite-dimensional linear programming using tools of convex analysis. Most sources
More informationConvex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014
Convex Optimization Dani Yogatama School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA February 12, 2014 Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12,
More informationLagrangian-Conic Relaxations, Part I: A Unified Framework and Its Applications to Quadratic Optimization Problems
Lagrangian-Conic Relaxations, Part I: A Unified Framework and Its Applications to Quadratic Optimization Problems Naohiko Arima, Sunyoung Kim, Masakazu Kojima, and Kim-Chuan Toh Abstract. In Part I of
More informationDivision of the Humanities and Social Sciences. Supergradients. KC Border Fall 2001 v ::15.45
Division of the Humanities and Social Sciences Supergradients KC Border Fall 2001 1 The supergradient of a concave function There is a useful way to characterize the concavity of differentiable functions.
More informationLecture 5 : Projections
Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationGEOMETRIC APPROACH TO CONVEX SUBDIFFERENTIAL CALCULUS October 10, Dedicated to Franco Giannessi and Diethard Pallaschke with great respect
GEOMETRIC APPROACH TO CONVEX SUBDIFFERENTIAL CALCULUS October 10, 2018 BORIS S. MORDUKHOVICH 1 and NGUYEN MAU NAM 2 Dedicated to Franco Giannessi and Diethard Pallaschke with great respect Abstract. In
More informationBASICS OF CONVEX ANALYSIS
BASICS OF CONVEX ANALYSIS MARKUS GRASMAIR 1. Main Definitions We start with providing the central definitions of convex functions and convex sets. Definition 1. A function f : R n R + } is called convex,
More informationSparse Optimization Lecture: Dual Methods, Part I
Sparse Optimization Lecture: Dual Methods, Part I Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know dual (sub)gradient iteration augmented l 1 iteration
More informationLecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem
Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem Michael Patriksson 0-0 The Relaxation Theorem 1 Problem: find f := infimum f(x), x subject to x S, (1a) (1b) where f : R n R
More informationDual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725
Dual methods and ADMM Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given f : R n R, the function is called its conjugate Recall conjugate functions f (y) = max x R n yt x f(x)
More informationDedicated to Michel Théra in honor of his 70th birthday
VARIATIONAL GEOMETRIC APPROACH TO GENERALIZED DIFFERENTIAL AND CONJUGATE CALCULI IN CONVEX ANALYSIS B. S. MORDUKHOVICH 1, N. M. NAM 2, R. B. RECTOR 3 and T. TRAN 4. Dedicated to Michel Théra in honor of
More information5. Duality. Lagrangian
5. Duality Convex Optimization Boyd & Vandenberghe Lagrange dual problem weak and strong duality geometric interpretation optimality conditions perturbation and sensitivity analysis examples generalized
More informationConstrained optimization
Constrained optimization DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Compressed sensing Convex constrained
More informationGEORGIA INSTITUTE OF TECHNOLOGY H. MILTON STEWART SCHOOL OF INDUSTRIAL AND SYSTEMS ENGINEERING LECTURE NOTES OPTIMIZATION III
GEORGIA INSTITUTE OF TECHNOLOGY H. MILTON STEWART SCHOOL OF INDUSTRIAL AND SYSTEMS ENGINEERING LECTURE NOTES OPTIMIZATION III CONVEX ANALYSIS NONLINEAR PROGRAMMING THEORY NONLINEAR PROGRAMMING ALGORITHMS
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More informationDual methods for the minimization of the total variation
1 / 30 Dual methods for the minimization of the total variation Rémy Abergel supervisor Lionel Moisan MAP5 - CNRS UMR 8145 Different Learning Seminar, LTCI Thursday 21st April 2016 2 / 30 Plan 1 Introduction
More informationSparse Optimization Lecture: Basic Sparse Optimization Models
Sparse Optimization Lecture: Basic Sparse Optimization Models Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know basic l 1, l 2,1, and nuclear-norm
More informationCSC Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming
CSC2411 - Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming Notes taken by Mike Jamieson March 28, 2005 Summary: In this lecture, we introduce semidefinite programming
More informationConvex Optimization Notes
Convex Optimization Notes Jonathan Siegel January 2017 1 Convex Analysis This section is devoted to the study of convex functions f : B R {+ } and convex sets U B, for B a Banach space. The case of B =
More informationOn nonexpansive and accretive operators in Banach spaces
Available online at www.isr-publications.com/jnsa J. Nonlinear Sci. Appl., 10 (2017), 3437 3446 Research Article Journal Homepage: www.tjnsa.com - www.isr-publications.com/jnsa On nonexpansive and accretive
More informationA SHORT INTRODUCTION TO BANACH LATTICES AND
CHAPTER A SHORT INTRODUCTION TO BANACH LATTICES AND POSITIVE OPERATORS In tis capter we give a brief introduction to Banac lattices and positive operators. Most results of tis capter can be found, e.g.,
More informationDS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.
DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1
More informationConvex Optimization. Lecture 12 - Equality Constrained Optimization. Instructor: Yuanzhang Xiao. Fall University of Hawaii at Manoa
Convex Optimization Lecture 12 - Equality Constrained Optimization Instructor: Yuanzhang Xiao University of Hawaii at Manoa Fall 2017 1 / 19 Today s Lecture 1 Basic Concepts 2 for Equality Constrained
More informationECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis
ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear
More informationSubgradient. Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes. definition. subgradient calculus
1/41 Subgradient Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes definition subgradient calculus duality and optimality conditions directional derivative Basic inequality
More informationConvex Feasibility Problems
Laureate Prof. Jonathan Borwein with Matthew Tam http://carma.newcastle.edu.au/drmethods/paseky.html Spring School on Variational Analysis VI Paseky nad Jizerou, April 19 25, 2015 Last Revised: May 6,
More informationNumerical Methods. Elena loli Piccolomini. Civil Engeneering. piccolom. Metodi Numerici M p. 1/??
Metodi Numerici M p. 1/?? Numerical Methods Elena loli Piccolomini Civil Engeneering http://www.dm.unibo.it/ piccolom elena.loli@unibo.it Metodi Numerici M p. 2/?? Least Squares Data Fitting Measurement
More information