A UNIFIED CHARACTERIZATION OF PROXIMAL ALGORITHMS VIA THE CONJUGATE OF REGULARIZATION TERM

KOUHEI HARADA†

September 19, 2018

Abstract. We consider proximal algorithms with a generalized regularization term. General requirements on the conjugate of the regularization term are proposed which make the corresponding proximal algorithm globally convergent. Applying the proposed technique to the case of ϕ-divergence, we prove that a ϕ-divergence with a nonsmooth kernel function is acceptable as a regularization term.

Key words. proximal point method, ϕ-divergence

AMS subject classifications. 90C22, 90C25

† NTT DATA Mathematical Systems Inc., 1F, Shinanomachi Renga-kan, 35, Shinanomachi, Shinjuku-ku, Tokyo, Japan, harada@msi.co.jp.

1. Introduction

In this paper, we are interested in the problem (P):

  minimize $f(x)$, $x \in X$,

where $f : \mathbb{R}^n \to \mathbb{R}$ is a convex, inf-compact, and possibly nonsmooth function and $X$ is a closed convex subset of $\mathbb{R}^n$. A typical approach to solving the problem is a proximal algorithm, where the subproblems

  minimize $f(x^k + d) + \frac{1}{2t_k}\|d\|^2$, $x^k + d \in X$,

are iteratively solved with a positive sequence $\{t_k\}$. Letting $d^k$ be an optimal solution of the subproblem, the point $x^k$ is updated to $x^k + d^k$. The procedure continues until $d^k$ becomes sufficiently small. The term $\|d\|^2/2t$ is called a regularization term. In this paper, we generalize the regularization term to a sequence of positive convex functions $\{P_k\}$ and iteratively solve the following subproblems:

  minimize $f(x^k + d) + P_k(d)$, $x^k + d \in X$.
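The classical quadratic case can be stated compactly in code. The following is a minimal numerical sketch of the iteration just described, with an illustrative nonsmooth objective, unconstrained $X = \mathbb{R}^n$, and a fixed $t_k$; none of these choices come from the paper.

```python
# Minimal sketch of the classical proximal point method with the quadratic
# regularization P_k(d) = ||d||^2 / (2 t_k).  Objective, step size, and
# stopping rule are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

def f(x):
    # convex, inf-compact, nonsmooth test objective: f(x) = |x_1| + x_2^2
    return abs(x[0]) + x[1] ** 2

def proximal_point(f, x0, t=1.0, tol=1e-6, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # subproblem: minimize f(x^k + d) + ||d||^2 / (2 t) over d
        sub = lambda d: f(x + d) + np.dot(d, d) / (2.0 * t)
        d_k = minimize(sub, np.zeros_like(x), method="Nelder-Mead").x
        x = x + d_k                      # x^{k+1} = x^k + d^k
        if np.linalg.norm(d_k) < tol:    # stop when d^k is sufficiently small
            break
    return x

print(proximal_point(f, x0=[2.0, -1.5]))  # approximately (0, 0), the minimizer
```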

The extension of the regularization term in proximal algorithms has been widely researched, although some of the results are given in the context of augmented Lagrangian methods or bundle methods. Power regularization, $P_k(d) = \|d\|^\alpha/(\alpha t_k)$, $\alpha > 1$, has been considered by [13]. The indicator function, $P_k(d) = I_{B(0,t_k)}(d)$, is covered by [10]. The variable metric function, $P_k(d) = d^T M_k d/2$, $M_k \succ 0$, has been considered by [4]. The application of the Bregman distance as a regularization term has been proposed by [5, 6]. The extension to nonsmooth Bregman functions has been considered by [14]. The ϕ-divergence is another candidate for the regularization term; its application has been proposed by [7, 12, 3, 17, 19]. A review of proximal point methods and augmented Lagrangian methods with Bregman distances and ϕ-divergences is given by [11]. A unified approach which covers both the Bregman distance and the ϕ-divergence has been proposed by [2]. The quasi-distance, where $P_k$ is not assumed to be convex, has been considered by [16].

To prove the global convergence of the algorithm, we focus on the structure of $P_k^*$, the conjugate function of $P_k$. This approach was originally given by Auslender [1]. However, he assumed that $P_k$ is invariable with respect to $k$ and that $P_k$ is of Legendre type [18, Section 26]. The approach has been essentially extended and developed by Frangioni [8] in the context of bundle methods, removing both assumptions mentioned above. However, he assumes that $P_k = Q_{t_k}$, where $Q_t$ is a parametric function with respect to $t$, which implies that the effect of the index $k$ is limited to $t_k$. We further extend this approach and show that a more general $P_k$ is acceptable for proximal algorithms. Although most of the convergence properties for the generalized $P_k$ proposed in this paper have already been researched independently, our result is meaningful because we can see the convergence properties from the unified viewpoint of $P_k^*$. Furthermore, in terms of the ϕ-divergence-based regularization term, we extend the previous result of [3]. We prove that a non-differentiable ϕ-divergence kernel is acceptable as a regularization term in proximal algorithms.

This paper is organized as follows. In section 2, we introduce the notation used in the paper and briefly give some preliminary results which are necessary for the later sections. In section 3, we give the fundamental requirements on the regularization term that make the corresponding generalized proximal algorithm converge globally. In particular, a new requirement is given from the viewpoint of $P_k^*$, and it plays an important role. We also introduce some examples of regularization terms and confirm their convergence. In section 4, we apply the result of section 3 to the Bregman distance. Unfortunately, our approach is not useful in this case, and no new results are proposed in this section. In section 5, we apply the result of section 3 to the ϕ-divergence. In this case, our approach works well and we can extend the previous results. It is proved that the kernel function ϕ need not be smooth except at its minimum point.

2. Preliminaries

We use the following standard notation, mainly derived from [18], [9], and [10]. For a closed convex set $S$, $I_S$ denotes the indicator function of $S$. For a proper convex and lower semicontinuous (lsc) function $F : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, its effective domain is defined by $\mathrm{dom}\,F = \{ x \mid F(x) < +\infty \}$. For a given set $S$, its interior, relative interior, and boundary are denoted by $\mathrm{int}(S)$, $\mathrm{ri}(S)$, and $\mathrm{bd}(S)$, respectively. A proper convex function $F$ is said to be essentially smooth if $F$ is smooth on $\mathrm{int}(\mathrm{dom}\,F)$ and $\|\nabla F(x^k)\| \to +\infty$ whenever $x^k \to x$ with $x^k \in \mathrm{int}(\mathrm{dom}\,F)$ and $x \in \mathrm{bd}(\mathrm{dom}\,F)$. A proper convex function $F$ is said to be of Legendre type if both $F$ and $F^*$ are essentially smooth.
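As a concrete illustration of the Legendre property (the example function is ours, not the paper's), $F(x) = x\log x - x$ is essentially smooth on $(0,+\infty)$ with conjugate $F^*(z) = e^z$, and the gradient maps of the pair are mutually inverse bijections; a quick numerical check:

```python
# Numerical sketch: for F(x) = x log x - x on (0, inf), a Legendre-type
# function, the conjugate is F*(z) = exp(z), and grad F = log and
# grad F* = exp are mutually inverse.  Grids are illustrative.
import numpy as np

x = np.linspace(1e-9, 60.0, 400001)
F = x * np.log(x) - x

for z in [-1.0, 0.0, 1.0, 3.0]:
    F_star_numeric = np.max(z * x - F)        # sup_x { z x - F(x) }
    print(np.isclose(F_star_numeric, np.exp(z), atol=1e-3))   # True

print(np.allclose(np.exp(np.log(np.array([0.5, 2.0]))), [0.5, 2.0]))
# grad F* (grad F (x)) = x: the correspondence is one-to-one
```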
For all $\epsilon \ge 0$, the $\epsilon$-subdifferential of a function $F$ at $x$ is defined by

  $\partial_\epsilon F(x) = \{ s \in \mathbb{R}^n \mid F(y) \ge F(x) + \langle s, y - x \rangle - \epsilon, \ \forall y \in \mathbb{R}^n \}$.

For all $\alpha \in \mathbb{R}$, the level set of $F$ with respect to $\alpha$ is defined by $L_\alpha(F) = \{ x \mid F(x) \le \alpha \}$. We say that a function $F$ is inf-compact if $L_\alpha(F)$ is compact for every $\alpha \in \mathbb{R}$. The domain of $\partial F$ is defined by $\mathrm{dom}\,\partial F = \{ x \mid \partial F(x) \ne \emptyset \}$. The set of $n$-vectors with nonnegative (positive) components is denoted by $\mathbb{R}^n_+$ ($\mathbb{R}^n_{++}$). $B(c, r) = \{ x \mid \|x - c\| \le r \}$ denotes the Euclidean ball of radius $r$ with center $c$. The recession function of $f$ with respect to a direction $d \ne 0$ is defined by

  $f^\infty(d) = \lim_{t \to +\infty} \dfrac{f(x + td) - f(x)}{t} = \sup_{t > 0} \dfrac{f(x + td) - f(x)}{t}$.

The distance between a point $x$ and a convex set $S$ is denoted by $\mathrm{dist}(x, S) = \inf_{s \in S} \|x - s\|$. The projection of $x$ onto a closed cone $K$ is denoted by $p_K(x) = \arg\min_{s \in K} \|x - s\|$. The following lemmas are necessary for the later sections.

Lemma 2.1. [10, Chapter X, Corollary 1.4.4] Let $f$ be a proper, lsc, and convex function; then the following holds:

  $z \in \partial f(x) \iff x \in \partial f^*(z) \iff f(x) + f^*(z) = \langle x, z \rangle$.

Lemma 2.2. [18, Theorem 23.8] [10, Chapter XI, Theorem 3.1.1] Let $f_1$ and $f_2$ be proper convex functions on $\mathbb{R}^n$ with $\mathrm{ri}(\mathrm{dom}\,f_1) \cap \mathrm{ri}(\mathrm{dom}\,f_2) \ne \emptyset$; then we have

  $\partial(f_1 + f_2)(x) = \partial f_1(x) + \partial f_2(x), \ \forall x$.

Lemma 2.3. Let $f$ be a proper, lsc, and convex function such that $f(0) = 0$; then we have $L_\epsilon(f^*) = \partial_\epsilon f(0)$ for all $\epsilon \ge 0$.

Proof. By the definitions of the $\epsilon$-subdifferential and the conjugate function, we have

  $z \in \partial_\epsilon f(0) \iff f(x) \ge \langle z, x \rangle - \epsilon, \ \forall x \iff \epsilon \ge f^*(z) \iff z \in L_\epsilon(f^*)$.

We have completed the proof.
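Lemma 2.3 can be illustrated numerically; the following sketch uses $f(x) = x^2/2$, so that $f^*(z) = z^2/2$ and both sets equal $[-\sqrt{2\epsilon}, \sqrt{2\epsilon}]$ (the function and the grids are our illustrative choices):

```python
# Numerical illustration of Lemma 2.3 with f(x) = x^2/2 (so f(0) = 0 and
# f*(z) = z^2/2): both L_eps(f*) and the eps-subdifferential of f at 0
# equal [-sqrt(2 eps), sqrt(2 eps)].
import numpy as np

eps = 0.5
z = np.linspace(-3, 3, 6001)
x = np.linspace(-10, 10, 20001)

level_set = z[z**2 / 2 <= eps]                               # L_eps(f*)
eps_sub = z[[np.all(x**2 / 2 >= zi * x - eps) for zi in z]]  # d_eps f(0)

print(level_set.min(), level_set.max())   # approx -1.0  1.0
print(eps_sub.min(), eps_sub.max())       # approx -1.0  1.0
```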

Lemma 2.4. [9, Chapter III, Theorem 3.2.5] Let $K$ be a closed convex cone and $K^\circ = \{ z \mid \langle x, z \rangle \le 0, \ \forall x \in K \}$ be its polar cone; then for all $x \in \mathbb{R}^n$ we have

  $x = p_K(x) + p_{K^\circ}(x)$, $\langle p_K(x), p_{K^\circ}(x) \rangle = 0$.
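A minimal numerical illustration of Lemma 2.4, with the example cone $K = \mathbb{R}^2_+$ (whose polar cone is $\mathbb{R}^2_-$; the cone and the test vector are our choices):

```python
# Moreau decomposition (Lemma 2.4) for K = R^2_+ and its polar R^2_-:
# x = p_K(x) + p_{K_polar}(x), with the two projections orthogonal.
import numpy as np

x = np.array([1.5, -2.0])
p_K     = np.maximum(x, 0.0)         # projection onto R^2_+
p_polar = np.minimum(x, 0.0)         # projection onto R^2_- (the polar cone)
print(np.allclose(x, p_K + p_polar), np.isclose(np.dot(p_K, p_polar), 0.0))
# True True
```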

3. Fundamentals

In this section, we introduce fundamental requirements on the regularization term $P_k$ that make the corresponding proximal algorithm converge globally. We first propose two basic requirements:

(P1) $P_k$ is convex, lower semicontinuous, and positive.
(P2) $P_k(0) = 0$.

From (P1) and (P2), the following symmetric conditions for $P_k^*$ clearly hold:

(P*1) $P_k^*$ is convex, lower semicontinuous, and positive.
(P*2) $P_k^*(0) = 0$.

In each iteration, we minimize $f(x^k + d) + P_k(d)$ subject to $x^k + d \in X$ and update $x^k$ by

(3.1) $x^{k+1} = x^k + d^k$,

where $d^k$ is a minimizer of $f(x^k + d) + P_k(d)$. To make the algorithm well-defined, we assume that

(PF1) $\mathrm{ri}(\mathrm{dom}\,[f + I_X]) \cap \mathrm{ri}(\mathrm{dom}\,P_k) \ne \emptyset$.

We use the symbol (PF) to stress that the assumption depends not only on $P_k$ but also on $f$. By means of Lemma 2.2, there exists a sequence $\{z^k\}$ which satisfies the following:

(3.2) $z^k \in \partial[f + I_X](x^k + d^k)$,
(3.3) $-z^k \in \partial P_k(d^k)$.

Proposition 3.1. Let $\{x^k\}$, $\{z^k\}$, and $\{d^k\}$ be sequences which enjoy (3.1), (3.2), and (3.3), and let $\{P_k\}$ be a sequence of positive functions which enjoy (P1), (P2), and (PF1); then we have the following:

(i) There exists a convergent subsequence $\{x^k\}_{k \in K}$ such that $x^k \to_K x^*$.
(ii) $\sum_{k=0}^{\infty} \left( P_k(d^k) + P_k^*(-z^k) \right) < +\infty$; in particular, $P_k^*(-z^k) \to 0$.

Proof. In view of (3.2), we have

  $[f + I_X](x^k) \ge [f + I_X](x^k + d^k) + \langle z^k, x^k - (x^k + d^k) \rangle$.

On the other hand, in view of (3.3) and Lemma 2.1, we have

  $P_k(d^k) + P_k^*(-z^k) = \langle d^k, -z^k \rangle$.

Therefore, we have

(3.4) $[f + I_X](x^k) \ge [f + I_X](x^k + d^k) + P_k(d^k) + P_k^*(-z^k)$.

By the positivity in (P1) and (P*1), the sequence $\{[f + I_X](x^k)\}$ is nonincreasing. Therefore, we have $x^k \in L_{[f + I_X](x^0)}(f + I_X)$ for all $k$, and hence, by inf-compactness, there exists a convergent subsequence $\{x^k\}_{k \in K}$ such that $x^k \to_K x^*$. We have proved item (i). Again, since $f + I_X$ is bounded below by inf-compactness, it follows from (3.4) that there exists a limit $\bar{f} \in \mathbb{R}$ such that $[f + I_X](x^k) \to \bar{f}$. Thus, by summing up (3.4) from $k = 0$ to $\infty$, we have

  $[f + I_X](x^0) - \bar{f} \ge \sum_{k=0}^{\infty} \left( P_k(d^k) + P_k^*(-z^k) \right)$.

Furthermore, by the positivity in (P1) and (P*1), we have $P_k^*(-z^k) \to 0$. Therefore, we have proved item (ii).

As pointed out by Frangioni [8], the role of $P_k^*(-z^k)$ has been overlooked. Much of the research so far has proved convergence properties not via $P_k^*(-z^k) \to 0$ but via $P_k(d^k) \to 0$. We consider the difference later in this section.

Lemma 3.2. Let $\{x^k\}$ and $\{z^k\}$ be sequences which satisfy $z^k \in \partial[f + I_X](x^{k+1})$, $x^k \to x^*$, and $z^k \to 0$; then $x^*$ minimizes $f$ on $X$.

Proof. $x^k \to x^*$, $z^k \to 0$, and $z^k \in \partial[f + I_X](x^{k+1})$ imply that $0 \in \partial[f + I_X](x^*)$; indeed, $z^k \in \partial[f + I_X](x^{k+1})$ is equivalent to

  $[f + I_X](x) \ge [f + I_X](x^{k+1}) + \langle z^k, x - x^{k+1} \rangle, \ \forall x \in \mathbb{R}^n$.

Taking $k \to \infty$, we have

  $[f + I_X](x) \ge [f + I_X](x^*) + \langle 0, x - x^* \rangle, \ \forall x \in \mathbb{R}^n$.

Therefore, we have completed the proof.
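The Fenchel equality invoked in the proof of Proposition 3.1 is easy to check numerically; a minimal sketch for the quadratic case (our illustrative choice of $P_k$ and $t$):

```python
# Numerical check of the identity used in the proof: if -z is a gradient of
# P at d, then P(d) + P*(-z) = <d, -z>  (Lemma 2.1), here with
# P(d) = ||d||^2 / (2 t) and P*(z) = t ||z||^2 / 2.
import numpy as np

t = 0.5
P      = lambda d: np.dot(d, d) / (2 * t)
P_star = lambda z: t * np.dot(z, z) / 2

d = np.array([0.3, -1.2])
z = -d / t                                            # -z = grad P(d) = d / t
print(np.isclose(P(d) + P_star(-z), np.dot(d, -z)))   # True
```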

Proposition 3.1 (i) and Lemma 3.2 suggest that any limit point $x^*$ generated by the proximal algorithm is an optimal solution of the original problem (P) if $z^k \to 0$ holds. Therefore, by means of Proposition 3.1 (ii), what remains is how to reach $z^k \to 0$ from $P_k^*(-z^k) \to 0$. In this sense, the structure of $P_k^*$ plays the key role. In addition to (P1) and (P2), we assume the following new requirement to prove $z^k \to 0$:

(P3) $\liminf_{k \to +\infty} P_k^*(z) > 0$ for all $z \ne 0$.

The following lemma clarifies the role of assumption (P3).

Lemma 3.3. Let $F_k : \mathbb{R}^n \to \mathbb{R}_+$ be a sequence of lsc convex functions such that $F_k(0) = 0$ for all $k$ and $L_0(F) = \{0\}$, where $F(y) = \liminf_{k \to +\infty} F_k(y)$, and suppose there exists a sequence $\{y^k\}$ such that $F_k(y^k) \to 0$; then $y^k \to 0$.

Proof. Assume, to arrive at a contradiction, that there exist an $\epsilon > 0$ and an infinite subsequence $\{y^k\}_{k \in K_0}$ such that $y^k \notin B(0, \epsilon)$ for all $k \in K_0$. For sufficiently large $k$, we have

  $F_k(y^k) \ge \dfrac{\|y^k\|}{\epsilon} F_k\!\left( \dfrac{\epsilon}{\|y^k\|} y^k \right) \ge \inf_{\|y\| = \epsilon} F_k(y) \ge \inf_{\|y\| = \epsilon} F(y)$.

The first inequality follows from the convexity of $F_k$ and $F_k(0) = 0$. Since $L_0(F) = \{0\}$, we have $\inf_{\|y\| = \epsilon} F(y) > 0$. Therefore, $F_k(y^k)$ does not converge to 0 on $K_0$, and this is a contradiction. We have completed the proof.

The structure of assumption (P3) is a bit confusing. If $P_k$ is invariable with respect to $k$, (P3) is equivalent to $L_0(P^*) = \{0\}$, where $P = P_k$ for all $k$. However, the following assumption is not equivalent to (P3):

(P3A) $L_0(P_k^*) = \{0\}$ for all $k$.

For example, $P_k^*(z) = t_k \|z\|^2 / 2$ with $t_k \to 0$ satisfies (P3A) but does not satisfy (P3). Intuitively, we need to prevent $P_k^*$ from becoming flat around 0 in order for $\{P_k\}$ to enjoy (P3). One way to meet (P3) is to add the following strong assumption:

(P3B) There exists a $\bar{k}$ such that $P_{\bar{k}}^* \le P_k^*$ for all $k$.

With this assumption, we can confirm that (P3) holds, since $P_{\bar{k}}^*(z) \le \liminf_{k \to +\infty} P_k^*(z)$ and, under (P3A), $L_0(P_{\bar{k}}^*) = \{0\}$. If $P_k$ depends only on $t_k$, there exists a positive, lsc, convex, parametric function $Q_t$ with respect to $t$ such that $P_k = Q_{t_k}$. In this case, (P3B) can be reduced to the following:

(P3C) $Q_t \le Q_\tau$ if $t \ge \tau$, and there exists a $t_L > 0$ such that $t_L \le t_k$ for all $k$.

By easy algebra, $Q_t \le Q_\tau$ is equivalent to $Q_t^* \ge Q_\tau^*$. Except for the requirements on $\{t_k\}$, (P3C) is equivalent to [8, (P4) and (P4*), Section 3].

Theorem 3.4. Let $\{x^k\}$ and $\{z^k\}$ be sequences which enjoy (3.1), (3.2), and (3.3); if (P1), (P2), (P3), and (PF1) hold, then there exists a convergent subsequence $x^k \to_K x^*$ and $0 \in \partial[f + I_X](x^*)$.

Proof. In view of Proposition 3.1 (i) and Lemma 3.2, we have $0 \in \partial[f + I_X](x^*)$ if $z^k \to 0$ holds. On the other hand, in view of Proposition 3.1 (ii) and Lemma 3.3, we have $z^k \to 0$. Therefore, we have completed the proof.
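Before going through the examples below, the gap between (P3A) and (P3) can be made concrete with the sequence $P_k^*(z) = t_k\|z\|^2/2$, $t_k \to 0$, mentioned above; a minimal sketch:

```python
# (P3A) vs (P3): with P_k*(z) = t_k ||z||^2 / 2 and t_k -> 0, each level set
# L_0(P_k*) is {0}, so (P3A) holds, yet for any fixed z != 0 we get
# liminf_k P_k*(z) = 0, so (P3) fails.
import numpy as np

z = 1.0                                   # any fixed z != 0
for k in [1, 10, 100, 1000]:
    t_k = 1.0 / k                         # t_k -> 0
    print(k, t_k * z**2 / 2)              # P_k*(z) -> 0: (P3) is violated
```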

Let us review some regularization terms whose convergence properties can be confirmed by our approach. In other words, in order to apply Theorem 3.4, let us confirm whether (P1), (P2), (PF1), and (P3) hold in each case.

1. Power regularization. This type of regularization term is defined by

  $P_k(d) = \dfrac{1}{\alpha t_k} \|d\|^\alpha$, $\alpha > 1$,

where $\{t_k\}$ is a positive sequence such that $0 < t_L \le t_k$ for all $k$. In this case, (P1), (P2), and (PF1) obviously hold. By means of [18], we have $P_k^*(z) = t_k^{\beta - 1} \|z\|^\beta / \beta$, where $1/\alpha + 1/\beta = 1$. Therefore, (P3A) and (P3C) hold, and hence (P3) holds. However, if $\alpha = 1$, we have $P_k^*(z) = I_{B(0, 1/t_k)}(z)$, which implies that (P3) does not hold.

2. Trust region. This type of regularization term is defined by $P_k(d) = I_{B(0, t_k)}(d)$. The confirmation of (P1), (P2), and (PF1) is obvious if there exists a $t_L > 0$ such that $t_k \ge t_L$. In this case, $P_k^*(z) = t_k \|z\|$. Thus, (P3A) and (P3C) hold, and hence (P3) holds.

3. Variable metric. This type of regularization term is defined by

  $P_k(d) = \dfrac{1}{2} d^T W_k d$,

where $\{W_k\}$ is a sequence of $n \times n$ positive definite matrices, which often depend on $x^k$. In this case, $P_k^*(z) = z^T (W_k)^{-1} z / 2$, and (P3A) obviously holds. However, for (P3) to hold, we need another assumption: there exists a $\lambda_L > 0$ such that $\lambda_{\min}((W_k)^{-1}) \ge \lambda_L$ for all $k$, that is, $\lambda_{\max}(W_k)$ is bounded above. Without this assumption, we can easily construct a counterexample: $W_k = I/t_k$ with $t_k \to 0$.

4. Indiscriminate variable. Surprisingly, assumption (P3) does not require that every $P_k$ be generated by a parametric function with respect to $t$ or $x$. All we need is just (P3): $\liminf_{k \to +\infty} P_k^*(z) > 0$ for all $z \ne 0$. Therefore, for example, the following absurd sequence of functions,

  $P_k(d) = \begin{cases} \dfrac{1}{\alpha t_k} \|d\|^\alpha, & k = 3m, \\ I_{B(0, t_k)}(d), & k = 3m + 1, \\ \dfrac{1}{2} d^T W_k d, & k = 3m + 2, \end{cases}$

where $\alpha > 1$ and there exists a $t_L > 0$ such that $t_k \ge t_L$, is acceptable.
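Returning to Example 1, the conjugate formula can be verified numerically; a small sketch with illustrative values of $\alpha$ and $t$:

```python
# Numerical check of the conjugate in Example 1: for P(d) = |d|^alpha/(alpha t),
# the conjugate is P*(z) = t^(beta-1) |z|^beta / beta with 1/alpha + 1/beta = 1.
import numpy as np
from scipy.optimize import minimize_scalar

alpha, t = 3.0, 2.0
beta = alpha / (alpha - 1.0)

def P_star_numeric(z):
    # P*(z) = sup_d { z d - P(d) }, computed by scalar minimization
    res = minimize_scalar(lambda d: -(z * d - abs(d) ** alpha / (alpha * t)))
    return -res.fun

for z in [0.3, 1.0, 2.5]:
    exact = t ** (beta - 1.0) * abs(z) ** beta / beta
    print(np.isclose(P_star_numeric(z), exact, atol=1e-6))   # True
```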

We have proved $z^k \to 0$ by means of $P_k^*(-z^k) \to 0$ together with (P3). However, Proposition 3.1 also implies that $P_k(d^k) \to 0$, and one may try to prove $z^k \to 0$ from $P_k(d^k) \to 0$ via $d^k \to 0$. To reach $d^k \to 0$ from $P_k(d^k) \to 0$, an assumption similar to (P3) is necessary for $P_k$. In addition, we need to prove $z^k \to 0$ from $d^k \to 0$, and hence some appropriate but not obvious assumptions are necessary. Unfortunately, even if these assumptions become clear, this approach is less attractive, because it does not cover all the regularization terms handled via $P_k^*(-z^k) \to 0$. For example, consider the trust-region case, $P_k(d) = I_{B(0, t_k)}(d)$. In this case, we have confirmed that (P3) holds and $z^k \to 0$. On the other hand, $P_k(d^k) = I_{B(0, t_k)}(d^k) \to 0$ does not imply $d^k \to 0$. Therefore, in this case, the convergence property cannot be proved by this approach. In short, $z^k \to 0$ and $d^k \to 0$ are not equivalent in general. If $P_k$ is invariable with respect to $k$ and the uniform function $P = P_k$ is of Legendre type, then the correspondence between $z$ and $d$ is one-to-one; thus $z^k \to 0$ and $d^k \to 0$ become equivalent. Further, the one-to-one relation does not need to hold on the entire domain; it suffices that it holds on $B(0, \epsilon)$ for sufficiently small $\epsilon > 0$.

Figure 1. Locally of Legendre type.

We have not considered the entire convergence of $\{x^k\}$ so far. In the case of $P_k(d) = \|d\|^2 / 2t_k$, the entire convergence can be shown through the following relations:

(3.5) $\|x^* - x^{k+1}\|^2 = \|x^* - x^k\|^2 - \|x^{k+1} - x^k\|^2 - 2 \langle x^{k+1} - x^k, x^* - x^{k+1} \rangle$,
(3.6) $0 \le \langle x^{k+1} - x^k, x^* - x^{k+1} \rangle$.

Equality (3.5) is just algebra. Inequality (3.6) is deduced from $[f + I_X](x^*) \le [f + I_X](x^{k+1})$ and $[f + I_X](x^*) \ge [f + I_X](x^{k+1}) + \langle z^k, x^* - x^{k+1} \rangle$. What is deficient when proving the entire convergence of $\{x^k\}$? For a general regularization term, we cannot prove (3.6). What we can prove in general is that

(3.7) $0 \ge \langle z^k, x^* - x^{k+1} \rangle$.

Thus, we wish to prove something which mimics (3.5). This desire is partially achievable as follows. However, it is still not sufficient to prove the entire convergence.

Lemma 3.5. Let $\{x^k\}$, $\{z^k\}$, and $\{d^k\}$ be sequences such that (3.1), (3.2), and (3.3) hold, and let $\{P_k\}$ be a sequence of positive convex functions such that (P1) and (P2) hold; then there exists a positive definite symmetric matrix $M^k$ such that $M^k(x^{k+1} - x^k) = -z^k$, for every $k$ such that $\langle z^k, d^k \rangle \ne 0$.

Proof. Let $W^k = (-z^k)(-z^k)^T / \langle -z^k, d^k \rangle$; then we can easily confirm that $W^k d^k = -z^k$. $W^k$ is positive semidefinite, since $\langle -z^k, d^k \rangle = P_k(d^k) + P_k^*(-z^k) \ge 0$ by the positivity in (P1) and (P*1), and we assume that $\langle z^k, d^k \rangle \ne 0$. To make $W^k$ positive definite, we prepare $n - 1$ orthonormal vectors $q_i$, $i = 1, \dots, n-1$, such that $\langle q_i, q_j \rangle = \delta_{ij}$ and $\langle q_i, d^k \rangle = 0$. These vectors indeed exist, since $d^k$ is a normal vector of an $(n-1)$-dimensional hyperplane. Note that $-z^k$ is linearly independent of $\{q_i\}_{i=1}^{n-1}$, since $\langle z^k, d^k \rangle \ne 0$ and $\langle q_i, d^k \rangle = 0$. Therefore, adding $\sum_{i=1}^{n-1} q_i q_i^T$ to $W^k$, we obtain a symmetric positive definite matrix $M^k = W^k + \sum_{i=1}^{n-1} q_i q_i^T$ which satisfies $M^k(x^{k+1} - x^k) = -z^k$. We have completed the proof.

By means of Lemma 3.5, we can obtain

  $\|x^* - x^{k+1}\|_{M^k}^2 = \|x^* - x^k\|_{M^k}^2 - \|x^{k+1} - x^k\|_{M^k}^2 - 2 \langle x^{k+1} - x^k, x^* - x^{k+1} \rangle_{M^k}$,
  $0 \le \langle x^{k+1} - x^k, x^* - x^{k+1} \rangle_{M^k}$.
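The construction in the proof of Lemma 3.5 is easy to reproduce numerically; the following sketch (with illustrative vectors $d^k$, $z^k$ satisfying $\langle -z^k, d^k \rangle > 0$) builds $M^k$ and verifies its properties.

```python
# Sketch of the construction in the proof of Lemma 3.5: given d and z with
# <-z, d> > 0, build a symmetric positive definite M with M d = -z.
import numpy as np

d = np.array([1.0, 2.0, -0.5])
z = np.array([-0.8, -0.4, 0.1])           # chosen so that <-z, d> > 0
assert np.dot(-z, d) > 0

W = np.outer(-z, -z) / np.dot(-z, d)      # rank-one PSD part, W d = -z
# q_1, ..., q_{n-1}: orthonormal basis of the hyperplane orthogonal to d
Q, _ = np.linalg.qr(np.column_stack([d, np.eye(3)]))
M = W + sum(np.outer(q, q) for q in Q.T[1:])

print(np.allclose(M @ d, -z))                          # True: M d = -z
print(np.all(np.linalg.eigvalsh((M + M.T) / 2) > 0))   # True: M is PD
```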

At first glance, this relation is similar to (3.5) and (3.6). The essential difference is that the positive definite matrices $M^k$ vary with $k$. Therefore, it is not sufficient to deduce $x^k \to x^*$. However, by preparing a sufficient condition for $f + I_X$, we can deduce $\mathrm{dist}(x^k, X^*) \to 0$, where $X^* = \{ x \in \mathbb{R}^n \mid [f + I_X](x) \le [f + I_X](y), \ \forall y \}$ is the optimal set of (P):

(F1) there exists $\tau > 0$ such that $\mathrm{dist}(x, X^*) \le \tau \|z\|$ for all $x$ and all $z \in \partial f(x)$.

Condition (F1) is known as an error bound, which has been proposed by Luo and Tseng [15]. By means of (F1), we can deduce $\mathrm{dist}(x^k, X^*) \to 0$ directly from $z^k \to 0$.

Theorem 3.6. Let $\{x^k\}$ and $\{z^k\}$ be sequences which enjoy (3.1), (3.2), and (3.3); then $\mathrm{dist}(x^k, X^*) \to 0$ if (P1), (P2), (P3), (PF1), and (F1) hold.

To attack the entire convergence, other ideas are necessary. An attempt on this subject can be seen in [2], where a distance-like function $H(x, y)$ is introduced for $\{P_k\}$ such that

  $H(x^*, x^{k+1}) = H(x^*, x^k) - H(x^{k+1}, x^k) - 2 \langle z^k, x^* - x^{k+1} \rangle$,
  $0 \ge \langle z^k, x^* - x^{k+1} \rangle$,

hold. Of course, much stronger assumptions are necessary for the family of $P_k$. For a more detailed discussion, see [2].

4. Bregman functions

In this section, we consider the case where the regularization term $P_k$ is defined as the D-function of a Bregman function. This approach has been developed by [5]. For any differentiable convex function $h$, the D-function of $h$ is defined by

  $D_h(x, y) = h(x) - h(y) - \langle \nabla h(y), x - y \rangle$.

For an open convex set $S$, a convex function $h$ is said to be a Bregman function with zone $S$ if the following conditions hold [7, Definition 1]:

(h1) $h$ is continuously differentiable on $S$.
(h2) $h$ is strictly convex and continuous on $S$.
(h3) $D_h(x, y)$ is inf-compact with respect to both $x$ and $y$.
(h4) If $x^k \to x$, then $D_h(x, x^k) \to 0$.
(h5) If $\{x^k\}$ and $\{y^k\}$ are sequences such that $y^k \to y \in S$, $\{x^k\}$ is bounded, and $D_h(x^k, y^k) \to 0$, then $x^k \to y$.

We show that $P_k(d) = (t_k)^{-1} D_h(x^k + d, x^k)$ is suitable as a regularization term. (P1) is obvious. (P2) follows from $D_h(x, x) = 0$. To make the algorithm well-defined, we need $\mathrm{ri}(\mathrm{dom}\,[f + I_X]) \cap S \ne \emptyset$; then condition (PF1) holds. In order for $\partial P_{t,x}(0) = \{0\}$ to hold, strict convexity of $h$ is necessary. We can confirm that (P3) also holds if $\partial P_{t,x}(0) = \{0\}$ and there exists a $t_L > 0$ such that $t_L \le t_k$ for all $k$. Therefore, we can apply Theorem 3.4 and conclude that every limit point $x^*$ is an optimal solution of problem (P).

In the case of Bregman functions, (P3) is not the most tractable route to prove $z^k \to 0$. By definition, we can see that $z^k = (t_k)^{-1} (\nabla h(x^k) - \nabla h(x^{k+1}))$, and $z^k \to 0$ directly if $t_k \ge t_L$ for all $k$. Furthermore, it is well known that a much stronger convergence result, the entire convergence $x^k \to x^*$, can be proved.
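Before turning to the convergence theorem, a concrete instance may help; the following sketch uses the negative entropy $h(x) = \sum_i x_i \log x_i$, a standard Bregman function with zone $\mathbb{R}^n_{++}$ (the choice is ours, not the paper's), and checks the basic D-function properties numerically.

```python
# Sketch of the Bregman D-function with h = negative entropy, giving the
# Kullback-Leibler-type distance used as P_k(d) = D_h(x^k + d, x^k) / t_k.
import numpy as np

def D_h(x, y):
    # D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>
    h = lambda u: np.sum(u * np.log(u))
    grad_h = lambda u: np.log(u) + 1.0
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])
print(D_h(x, y) >= 0, np.isclose(D_h(y, y), 0.0))   # True True
```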

Theorem 4.1. [7, Theorem 1] Let $\{x^k\}$ be a sequence which enjoys (3.2) and (3.3) with $P_k(d) = (t_k)^{-1} D_h(x^k + d, x^k)$ and which has a subsequence converging to $x^*$; then we have $x^k \to x^*$.

Proof. By definition, we have

(4.1) $D_h(x^*, x^{k+1}) = D_h(x^*, x^k) - D_h(x^{k+1}, x^k) - \langle \nabla h(x^{k+1}) - \nabla h(x^k), x^* - x^{k+1} \rangle$.

On the other hand, since $x^*$ is an optimal solution of (P) and $(t_k)^{-1} (\nabla h(x^k) - \nabla h(x^{k+1}))$ is a subgradient of $f + I_X$ at $x^{k+1}$, we have

  $[f + I_X](x^*) \le [f + I_X](x^{k+1})$, $[f + I_X](x^*) \ge [f + I_X](x^{k+1}) + (t_k)^{-1} \langle \nabla h(x^k) - \nabla h(x^{k+1}), x^* - x^{k+1} \rangle$.

Therefore, we have $\langle \nabla h(x^{k+1}) - \nabla h(x^k), x^* - x^{k+1} \rangle \ge 0$. Together with $D_h(x^{k+1}, x^k) \ge 0$ and (4.1), we have that the sequence $\{D_h(x^*, x^k)\}$ is nonincreasing and hence converges to some nonnegative value. On the other hand, we have $D_h(x^*, x^k) \to 0$ along the convergent subsequence, by means of $x^k \to_K x^*$ and (h4). This yields $D_h(x^*, x^k) \to 0$ for the whole sequence, and hence $x^k \to x^*$.

As a more general case, we can consider the case where $h$ is not assumed to be smooth. In this case, $P_k$ is defined by

  $P_k(d) = D_h^s(x^k + d, x^k) = h(x^k + d) - h(x^k) - \langle s, d \rangle$, $s \in \partial h(x^k)$.

The convergence property in this case has been proved by Kiwiel [14]. However, in our approach, it is hard to show $z^k \to 0$. Let us consider $n = 1$, $X = \mathbb{R}$, $f(x) = |x|$, and $h(x) = |x| + x^2/2$. In this case, if we reach the optimal point $x^{k+1} = 0$ at some $k$, the optimality condition implies not $z^k = 0$ but $z^k \in [-1, 1]$. Although we have reached a point such that $0 \in \partial f(x^{k+1})$, we cannot confirm it from $P_k^*(-z^k) \to 0$ alone.

5. ϕ-divergence

In this section, we consider the case where $P_k$ itself plays the role of $I_X$ if necessary. In this case, the subproblem is replaced with

  minimize $f(x^k + d) + P_k(d)$, $x^k + d \in \mathbb{R}^n$,

and hence condition (3.2) becomes

(5.1) $z^k \in \partial f(x^k + d^k)$.

It implies that $z^k \to 0$ does not always secure $0 \in \partial(f + I_X)(x^*)$. Similarly, assumption (PF1) should be replaced with

(PF1') $\mathrm{ri}(\mathrm{dom}\,f) \cap \mathrm{ri}(\mathrm{dom}\,P_k) \ne \emptyset$.

On the other hand, we can easily confirm that Proposition 3.1 holds by just replacing $f + I_X$ with $f$ in its proof. We first explore a sufficient condition for $\{z^k\}$ to achieve $0 \in \partial(f + I_X)(x^*)$. The situation differs depending on whether $x^* \in \mathrm{int}(X)$ or not. The normal cone $N_X(x^*)$ and its polar cone $T_X(x^*)$ play important roles. To simplify the description, we denote the projections of a vector $z$ onto $N_X(x^*)$ and $T_X(x^*)$ by $p_N(z)$ and $p_T(z)$, respectively.

Proposition 5.1. Let $\{x^k\}$ and $\{z^k\}$ be sequences such that $x^k \to x^*$ and $p_T(-z^k) \to 0$; then $0 \in \partial(f + I_X)(x^*)$.

Proof. In view of Lemma 2.4, we have $-z^k = p_T(-z^k) + p_N(-z^k)$. Note that $p_T(-z^k) \ne -p_T(z^k)$, in general. Since $-(-z^k) = z^k \in \partial f(x^{k+1})$, we have $-p_T(-z^k) - p_N(-z^k) \in \partial f(x^{k+1})$. By definition, $p_N(-z^k) \in N_X(x^*) = \partial I_X(x^*)$, and hence $-p_T(-z^k) \in \partial f(x^{k+1}) + \partial I_X(x^*)$. Therefore, the condition $p_T(-z^k) \to 0$ implies that $0 \in \partial(f + I_X)(x^*)$.

In particular, if $\{z^k\}$ converges to some $z^*$, $p_T(-z^k) \to 0$ is equivalent to $-z^* \in N_X(x^*)$. Moreover, $N_X(x^*) = \partial I_X(x^*) = \{0\}$ holds if $x^* \in \mathrm{int}(X)$; in this case, $p_T(-z^k) \to 0$ is equivalent to $z^k \to 0$. The problem is how to prepare $P_k$ so that $p_T(-z^k) \to 0$ is satisfied. Let us consider the special case $X = \mathbb{R}^n_+$. In this case, the ϕ-divergence defined in the following is well known:

(5.2) $P_k = Q_{t^k, x^k}$, $Q_{t,x}(d) = \sum_{i=1}^n t_i x_i \varphi\!\left( \dfrac{x_i + d_i}{x_i} \right)$,

where $Q_{t,x}$ is a parametric positive convex function with respect to $t \in \mathbb{R}^n_+$ and $x \in \mathbb{R}^n_{++}$, and $\varphi : \mathbb{R} \to \mathbb{R}_+ \cup \{+\infty\}$ is a positive, lsc, and convex function such that the following conditions hold:

(ϕ0) $\varphi$ is smooth on $(0, +\infty)$.
(ϕ1) $\varphi(1) = 0$.
(ϕ2) $\varphi'(1) = 0$.
(ϕ3) $(0, +\infty) \subseteq \mathrm{dom}\,\varphi \subseteq [0, +\infty)$.
(ϕ4) $\lim_{d \downarrow 0} \varphi'(d) = -\infty$, and there exists $b > 0$ such that $\varphi^\infty(1) = b$.

$\varphi$ is often said to be a kernel function of the ϕ-divergence $Q_{t,x}$. We can easily confirm that $P_k$ defined as (5.2) satisfies the requirements (P1), (P2), and (PF1') on the regularization term. However, in our setting, shifting $\varphi$ by 1 in the negative direction makes the function more tractable and symmetric with respect to the conjugate operator. Furthermore, the smoothness condition (ϕ0) is not necessary except at the minimum point. For that reason, we introduce another kernel function $\hat{\varphi}(d)$ which mimics $\varphi(1 + d)$. The regularization term is then defined as follows:

(5.3) $P_k = Q_{t^k, x^k}$, $Q_{t,x}(d) = \sum_{i=1}^n t_i x_i \hat{\varphi}\!\left( \dfrac{d_i}{x_i} \right)$,

where $\hat{\varphi} : \mathbb{R} \to \mathbb{R}_+ \cup \{+\infty\}$ is a positive, lsc, and convex function which satisfies the following requirements:

($\hat{\varphi}$1) $\hat{\varphi}(0) = 0$.
($\hat{\varphi}$2) $\partial\hat{\varphi}(0) = \{0\}$ (that is, $\hat{\varphi}$ is differentiable at 0 and $\hat{\varphi}'(0) = 0$).
($\hat{\varphi}$3) $(-1, +\infty) \subseteq \mathrm{dom}\,\hat{\varphi} \subseteq [-1, +\infty)$.
($\hat{\varphi}$4) $\hat{\varphi}'(d) \to -\infty$ as $d \downarrow -1$, and there exists $b > 0$ such that $\hat{\varphi}^\infty(1) = b$.

We can easily confirm that if a positive function $\varphi$ satisfies (ϕ1) to (ϕ4), then the shifted function $\hat{\varphi}(d) = \varphi(1 + d)$ satisfies ($\hat{\varphi}$1) to ($\hat{\varphi}$4). Let $\hat{\psi} = \hat{\varphi}^*$ be the conjugate function of $\hat{\varphi}$; the following properties hold:

($\hat{\psi}$1) $\hat{\psi}(0) = 0$.
($\hat{\psi}$2) $L_0(\hat{\psi}) = \{0\}$.
($\hat{\psi}$3) $(-\infty, b) \subseteq \mathrm{dom}\,\hat{\psi} \subseteq (-\infty, b]$.
($\hat{\psi}$4) $\hat{\psi}^\infty(-1) = 1$ and $\hat{\psi}^\infty(1) = +\infty$.
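These conditions can be checked numerically for the pair that appears as the typical example below, $\hat{\varphi}(d) = d - \log(1+d)$ with $\hat{\psi}(z) = -z - \log(1-z)$ and $b = 1$; a small sketch (grids are illustrative):

```python
# Sanity check: for phi_hat(d) = d - log(1 + d) on (-1, inf), the conjugate
# is psi_hat(z) = -z - log(1 - z) on (-inf, 1), consistent with b = 1.
import numpy as np

d = np.linspace(-1 + 1e-6, 50, 200001)            # grid over dom(phi_hat)
phi_hat = d - np.log1p(d)

def psi_hat_numeric(z):
    return np.max(z * d - phi_hat)                # sup_d { z d - phi_hat(d) }

for z in [-2.0, -0.5, 0.0, 0.5, 0.9]:
    exact = -z - np.log1p(-z)                     # -z - log(1 - z)
    print(np.isclose(psi_hat_numeric(z), exact, atol=1e-3))   # True
```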

Lemma 5.2. If a positive function $\hat{\varphi}$ satisfies ($\hat{\varphi}$1) to ($\hat{\varphi}$4), then its conjugate function $\hat{\psi} = \hat{\varphi}^*$ satisfies ($\hat{\psi}$1) to ($\hat{\psi}$4).

Proof. By definition, any conjugate function is convex and lsc. ($\hat{\psi}$1) is obvious: $\hat{\psi}(0) = \sup_d (-\hat{\varphi}(d)) = 0$. By means of Lemma 2.3, we have $L_0(\hat{\psi}) = \partial\hat{\varphi}(0) = \{0\}$; therefore ($\hat{\psi}$2) is proved. In view of ($\hat{\varphi}$4), for every $s \in (-\infty, b)$ there exists a $d \in \mathrm{dom}\,\hat{\varphi}$ such that $s \in \partial\hat{\varphi}(d)$, that is, $d \in \partial\hat{\psi}(s)$; therefore, we have $(-\infty, b) \subseteq \mathrm{dom}\,\hat{\psi}$. On the other hand, by means of $\hat{\varphi}^\infty(1) = b$, we have $\hat{\psi}(s) = \sup_d d\,(s - \hat{\varphi}(d)/d) \ge \sup_{d > 0} d\,(s - b) = +\infty$ for all $s > b$. It follows that $\mathrm{dom}\,\hat{\psi} \cap (b, +\infty) = \emptyset$, and hence we have $\mathrm{dom}\,\hat{\psi} \subseteq (-\infty, b]$. We have proved ($\hat{\psi}$3). In view of ($\hat{\varphi}$3), for every $d \in (-1, +\infty)$ there exists an $s$ such that $s \in \partial\hat{\varphi}(d)$, that is, $d \in \partial\hat{\psi}(s)$; it implies that $\hat{\psi}^\infty(-1) \ge 1$ and $\hat{\psi}^\infty(1) = +\infty$. On the other hand, $\mathrm{dom}\,\hat{\varphi} \subseteq [-1, +\infty)$ implies that $\mathrm{dom}\,\hat{\varphi} \cap (-\infty, -1) = \emptyset$, and hence $\hat{\psi}^\infty(-1) \le 1$. We have proved ($\hat{\psi}$4).

If we add $L_0(\hat{\varphi}) = \{0\}$ to ($\hat{\varphi}$2) and set $b = 1$ in ($\hat{\varphi}$4), then $\partial\hat{\psi}(0) = \{0\}$ is added to ($\hat{\psi}$2) and ($\hat{\psi}$3) becomes $(-\infty, 1) \subseteq \mathrm{dom}\,\hat{\psi} \subseteq (-\infty, 1]$. Therefore, the conditions on $\hat{\varphi}$ and $\hat{\psi}$ become symmetric with respect to the vertical line passing through the origin. A typical example is $\hat{\varphi}(d) = d - \log(1 + d)$ and $\hat{\psi}(z) = -z - \log(1 - z)$.

Figure 2. $\hat{\varphi}$ and $\hat{\psi}$.

In view of Propositions 3.1 and 5.1, our goal is to prove $p_T(-z^k) \to 0$. We begin with the following lemmas.

Lemma 5.3. The conjugate function of $Q_{t,x}$ is given as follows:

  $Q_{t,x}^*(z) = \sum_{i=1}^n \Psi_{t_i, x_i}(z_i)$, $\Psi_{t_i, x_i}(z_i) = t_i x_i \hat{\psi}\!\left( \dfrac{z_i}{t_i} \right)$.

Proof. By definition, we have

  $Q_{t,x}^*(z) = \sup_d \sum_{i=1}^n \left[ z_i d_i - t_i x_i \hat{\varphi}\!\left( \dfrac{d_i}{x_i} \right) \right] = \sum_{i=1}^n t_i x_i \left[ \sup_{d_i} \left\{ \dfrac{z_i}{t_i} \cdot \dfrac{d_i}{x_i} - \hat{\varphi}\!\left( \dfrac{d_i}{x_i} \right) \right\} \right] = \sum_{i=1}^n t_i x_i \hat{\psi}\!\left( \dfrac{z_i}{t_i} \right)$.

Therefore, we have completed the proof.

Lemma 5.4. Let $\{a^k\}$, $\{v^k\}$, and $\{w^k\}$ be sequences such that $\Psi_{a^k, v^k}(w^k) \to 0$ and $a^k \ge \bar{a} > 0$ for all $k$, and suppose that 0 is not a limit point of $\{v^k\}$; then $w^k \to 0$.

Proof. Since 0 is not a limit point of $\{v^k\}$, there exist $\bar{v} > 0$ and a sufficiently large $\bar{k}$ such that $v^k \ge \bar{v}$ for all $k \ge \bar{k}$. It follows that the sequence of functions $\{\Psi_{a^k, v^k}\}_{k \ge \bar{k}}$ and the sequence $\{w^k\}_{k \ge \bar{k}}$ satisfy the requirements of Lemma 3.3. Therefore, we have $w^k \to 0$ and have completed the proof.

Lemma 5.5. Let $\{a^k\}$, $\{v^k\}$, and $\{w^k\}$ be sequences such that $\Psi_{a^k, v^k}(w^k) \to 0$ and $a^k \to 0$; then $(w^k)_+ \to 0$.

Proof. By definition, $\Psi$ is a positive and convex function. In view of ($\hat{\psi}$3), $\mathrm{dom}\,\Psi_{a^k, v^k} \subseteq (-\infty, b\,a^k]$. Therefore $w^k \in (-\infty, b\,a^k]$. By means of $a^k \to 0$, it follows that $(w^k)_+ \to 0$.

To make use of Lemmas 5.4 and 5.5, we need to elaborate an update rule for $t^k$ which satisfies the following requirement:

(T1) $t_i^k \to 0$ if $x_i^* = 0$; otherwise, there exists a $\bar{t} > 0$ such that $t_i^k \ge \bar{t}$ for all $k$.

Under this requirement, we can establish the following proposition.

Proposition 5.6. Let the regularization term $P_k$ be defined by (5.3), and let $\{t^k\}$, $\{x^k\}$, and $\{z^k\}$ be sequences such that $P_k^*(-z^k) \to 0$, $x^k \to x^*$, and $\{t^k\}$ satisfies (T1); then $p_T(-z^k) \to 0$.

Proof. In view of Lemma 5.3, $P_k^*(-z^k) \to 0$ implies that $\Psi_{t_i^k, x_i^k}(-z_i^k) \to 0$ for every $i$. Let $I = \{ i \mid x_i^* = 0 \}$ be the set of active indices. For every $i \notin I$, we have $-z_i^k \to 0$ from Lemma 5.4. Similarly, for every $i \in I$, we have $(-z_i^k)_+ \to 0$ from Lemma 5.5. On the other hand, $N_{\mathbb{R}^n_+}(x^*) = \{ z \in \mathbb{R}^n \mid z_i \le 0 \ (i \in I), \ z_i = 0 \ (i \notin I) \}$, and hence $T_{\mathbb{R}^n_+}(x^*) = \{ d \in \mathbb{R}^n \mid d_i \ge 0 \ (i \in I) \}$. The projection $p_T(-z^k)$ therefore has components $(-z_i^k)_+$ for $i \in I$ and $-z_i^k$ for $i \notin I$, and we conclude that $p_T(-z^k) \to 0$. We have completed the proof.

Corollary 5.7. Let the regularization term $P_k$ be defined by (5.3), and let $\{t^k\}$, $\{x^k\}$, and $\{z^k\}$ be sequences such that $P_k^*(-z^k) \to 0$, $x^k \to x^* \in \mathrm{int}(\mathbb{R}^n_+)$, and $\{t^k\}$ satisfies (T1); then $z^k \to 0$.

What is left is how to construct an update rule for $t^k$ which satisfies (T1). A simple rule is to set $t^k = x^k$. Ben-Tal and Zibulevsky [3] proposed nondecreasing sublinear functions $\pi^k : \mathbb{R}_+ \to \mathbb{R}_+$ such that there exists a scalar $c > 0$ which satisfies $\pi^k(x) \ge cx$ for every positive scalar $x > 0$, and they update $t^k$ by means of $\pi^k$ such that $t_i^k = \pi^k(x_i^k)$. This rule enjoys (T1), to be sure. However, more generalized $\pi^k$ are acceptable:

(π1) $\pi^k$ is nondecreasing.
(π2) $\lim_{t \to 0} \pi^k(t) = 0$.
(π3) $L_0(\pi^k) = \{0\}$.

If $\pi^k$ satisfies (π1)-(π3), then (T1) holds. For example, $\pi^k(x) = \sqrt{x}$ is not sublinear, but satisfies (π1)-(π3), and hence (T1) holds. In view of Propositions 5.1 and 5.6, we conclude the following.

Theorem 5.8. Let $\{x^k\}$, $\{z^k\}$, and $\{d^k\}$ be sequences which enjoy (3.1), (3.3), and (5.1), let (PF1') hold, and let $\{t^k\}$ be updated by $t_i^k = \pi^k(x_i^k)$ such that (π1)-(π3) hold; then there exists a convergent subsequence $x^k \to_K x^*$, and every limit point is a minimizer of $f$ on $\mathbb{R}^n_+$.

The result obtained in this section can be seen as an extension of [3]. We do not assume that $\hat{\varphi}$ is essentially smooth, and the somewhat artificial evaluation of $t_i x_i \hat{\varphi}(d_i/x_i)$ in [3, Lemma 2] is not necessary in our approach. In addition, our requirements on $\pi^k$ are more general.
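To close the section, the following is a minimal, self-contained sketch of the ϕ-divergence proximal algorithm analyzed above, with illustrative choices throughout: the objective $f$, the kernel $\hat{\varphi}(d) = d - \log(1+d)$, and the update $t^k = \pi(x^k)$ with $\pi(s) = \sqrt{s}$, which satisfies (π1)-(π3). It is a demonstration under these assumptions, not the paper's implementation.

```python
# Sketch of the phi-divergence proximal algorithm on X = R^n_+, with the
# regularization (5.3); the kernel keeps all iterates strictly positive.
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 1.0) ** 2 + x[1]       # convex test objective on R^2_+
phi_hat = lambda u: u - np.log1p(u)          # kernel, dom = (-1, inf)

def Q(d, t, x):
    # Q_{t,x}(d) = sum_i t_i x_i phi_hat(d_i / x_i), +inf outside the domain
    u = d / x
    return np.inf if np.any(u <= -1) else np.sum(t * x * phi_hat(u))

x = np.array([2.0, 1.0])                     # x^0 in int(R^n_+)
for k in range(50):
    t = np.sqrt(x)                           # t^k = pi(x^k), pi(s) = sqrt(s)
    sub = lambda d: f(x + d) + Q(d, t, x)
    d_k = minimize(sub, np.zeros_like(x), method="Nelder-Mead").x
    x = x + d_k
print(x)                                     # approaches the minimizer (1, 0)
```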

References

[1] Alfred Auslender. Numerical methods for nondifferentiable convex optimization. In Nonlinear Analysis and Optimization. Springer, 1987.
[2] Alfred Auslender and Marc Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization, 16(3), 2006.
[3] Aharon Ben-Tal and Michael Zibulevsky. Penalty/barrier multiplier methods for convex programming problems. SIAM Journal on Optimization, 7(2), 1997.
[4] Joseph Frédéric Bonnans, Jean Charles Gilbert, Claude Lemaréchal, and Claudia Sagastizábal. A family of variable metric proximal methods. Mathematical Programming, 68(1-3):15-47, 1995.
[5] Yair Censor and Stavros Andrea Zenios. Proximal minimization algorithm with D-functions. Journal of Optimization Theory and Applications, 73(3), 1992.
[6] Gong Chen and Marc Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3), 1993.
[7] Jonathan Eckstein. Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming. Mathematics of Operations Research, 18(1), 1993.
[8] Antonio Frangioni. Generalized bundle methods. SIAM Journal on Optimization, 13(1), 2002.
[9] J.-B. Hiriart-Urruty and Claude Lemaréchal. Convex Analysis and Minimization Algorithms I: Fundamentals. Springer-Verlag, New York, 1993.
[10] J.-B. Hiriart-Urruty and Claude Lemaréchal. Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods. Springer-Verlag, New York, 1993.
[11] Alfredo N. Iusem. Augmented Lagrangian methods and proximal point methods for convex optimization. Investigación Operativa, 8:11-49, 1999.
[12] Alfredo N. Iusem, B. F. Svaiter, and Marc Teboulle. Entropy-like proximal methods in convex programming. Mathematics of Operations Research, 19(4), 1994.
[13] Sehun Kim, Kun-Nyeong Chang, and Jun-Yeon Lee. A descent method with linear programming subproblems for nondifferentiable convex optimization. Mathematical Programming, 71(1):17-28, 1995.
[14] Krzysztof C. Kiwiel. Proximal minimization methods with generalized Bregman functions. SIAM Journal on Control and Optimization, 35(4), 1997.
[15] Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46(1), 1993.
[16] Felipe Garcia Moreno, Paulo Roberto Oliveira, and Antoine Soubeyran. A proximal algorithm with quasi distance. Application to habit's formation. Optimization, 61(12), 2012.
[17] Roman Polyak and Marc Teboulle. Nonlinear rescaling and proximal-like methods in convex optimization. Mathematical Programming, 76(2), 1997.
[18] R. Tyrrell Rockafellar. Convex Analysis (Princeton Mathematical Series). Princeton University Press, 1970.
[19] Marc Teboulle. Convergence of proximal-like algorithms. SIAM Journal on Optimization, 7(4), 1997.


More information

A user s guide to Lojasiewicz/KL inequalities

A user s guide to Lojasiewicz/KL inequalities Other A user s guide to Lojasiewicz/KL inequalities Toulouse School of Economics, Université Toulouse I SLRA, Grenoble, 2015 Motivations behind KL f : R n R smooth ẋ(t) = f (x(t)) or x k+1 = x k λ k f

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

4. Algebra and Duality

4. Algebra and Duality 4-1 Algebra and Duality P. Parrilo and S. Lall, CDC 2003 2003.12.07.01 4. Algebra and Duality Example: non-convex polynomial optimization Weak duality and duality gap The dual is not intrinsic The cone

More information

Convex series of convex functions with applications to Statistical Mechanics

Convex series of convex functions with applications to Statistical Mechanics Convex series of convex functions with applications to Statistical Mechanics Constantin Zălinescu University Alexandru Ioan Cuza Iaşi Faculty of Mathematics zalinesc@uaic.ro Melbourne, MODU2016 Motivation

More information

Maximal monotone operators, convex functions and a special family of enlargements.

Maximal monotone operators, convex functions and a special family of enlargements. Maximal monotone operators, convex functions and a special family of enlargements. Regina Sandra Burachik Engenharia de Sistemas e Computação, COPPE UFRJ, CP 68511, Rio de Janeiro RJ, 21945 970, Brazil.

More information

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form.

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form. Stat 8112 Lecture Notes Asymptotics of Exponential Families Charles J. Geyer January 23, 2013 1 Exponential Families An exponential family of distributions is a parametric statistical model having densities

More information

Convex Analysis Notes. Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE

Convex Analysis Notes. Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE Convex Analysis Notes Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE These are notes from ORIE 6328, Convex Analysis, as taught by Prof. Adrian Lewis at Cornell University in the

More information

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 17

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 17 EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 17 Andre Tkacenko Signal Processing Research Group Jet Propulsion Laboratory May 29, 2012 Andre Tkacenko

More information

AN INEXACT HYBRID GENERALIZED PROXIMAL POINT ALGORITHM AND SOME NEW RESULTS ON THE THEORY OF BREGMAN FUNCTIONS. May 14, 1998 (Revised March 12, 1999)

AN INEXACT HYBRID GENERALIZED PROXIMAL POINT ALGORITHM AND SOME NEW RESULTS ON THE THEORY OF BREGMAN FUNCTIONS. May 14, 1998 (Revised March 12, 1999) AN INEXACT HYBRID GENERALIZED PROXIMAL POINT ALGORITHM AND SOME NEW RESULTS ON THE THEORY OF BREGMAN FUNCTIONS M. V. Solodov and B. F. Svaiter May 14, 1998 (Revised March 12, 1999) ABSTRACT We present

More information

Weak sharp minima on Riemannian manifolds 1

Weak sharp minima on Riemannian manifolds 1 1 Chong Li Department of Mathematics Zhejiang University Hangzhou, 310027, P R China cli@zju.edu.cn April. 2010 Outline 1 2 Extensions of some results for optimization problems on Banach spaces 3 4 Some

More information

Primal-dual Subgradient Method for Convex Problems with Functional Constraints

Primal-dual Subgradient Method for Convex Problems with Functional Constraints Primal-dual Subgradient Method for Convex Problems with Functional Constraints Yurii Nesterov, CORE/INMA (UCL) Workshop on embedded optimization EMBOPT2014 September 9, 2014 (Lucca) Yu. Nesterov Primal-dual

More information

An inexact subgradient algorithm for Equilibrium Problems

An inexact subgradient algorithm for Equilibrium Problems Volume 30, N. 1, pp. 91 107, 2011 Copyright 2011 SBMAC ISSN 0101-8205 www.scielo.br/cam An inexact subgradient algorithm for Equilibrium Problems PAULO SANTOS 1 and SUSANA SCHEIMBERG 2 1 DM, UFPI, Teresina,

More information

On the convergence properties of the projected gradient method for convex optimization

On the convergence properties of the projected gradient method for convex optimization Computational and Applied Mathematics Vol. 22, N. 1, pp. 37 52, 2003 Copyright 2003 SBMAC On the convergence properties of the projected gradient method for convex optimization A. N. IUSEM* Instituto de

More information

Week 3: Faces of convex sets

Week 3: Faces of convex sets Week 3: Faces of convex sets Conic Optimisation MATH515 Semester 018 Vera Roshchina School of Mathematics and Statistics, UNSW August 9, 018 Contents 1. Faces of convex sets 1. Minkowski theorem 3 3. Minimal

More information

Relationships between upper exhausters and the basic subdifferential in variational analysis

Relationships between upper exhausters and the basic subdifferential in variational analysis J. Math. Anal. Appl. 334 (2007) 261 272 www.elsevier.com/locate/jmaa Relationships between upper exhausters and the basic subdifferential in variational analysis Vera Roshchina City University of Hong

More information

Active sets, steepest descent, and smooth approximation of functions

Active sets, steepest descent, and smooth approximation of functions Active sets, steepest descent, and smooth approximation of functions Dmitriy Drusvyatskiy School of ORIE, Cornell University Joint work with Alex D. Ioffe (Technion), Martin Larsson (EPFL), and Adrian

More information

The sum of two maximal monotone operator is of type FPV

The sum of two maximal monotone operator is of type FPV CJMS. 5(1)(2016), 17-21 Caspian Journal of Mathematical Sciences (CJMS) University of Mazandaran, Iran http://cjms.journals.umz.ac.ir ISSN: 1735-0611 The sum of two maximal monotone operator is of type

More information

Stability of efficient solutions for semi-infinite vector optimization problems

Stability of efficient solutions for semi-infinite vector optimization problems Stability of efficient solutions for semi-infinite vector optimization problems Z. Y. Peng, J. T. Zhou February 6, 2016 Abstract This paper is devoted to the study of the stability of efficient solutions

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Convex Functions. Pontus Giselsson

Convex Functions. Pontus Giselsson Convex Functions Pontus Giselsson 1 Today s lecture lower semicontinuity, closure, convex hull convexity preserving operations precomposition with affine mapping infimal convolution image function supremum

More information

Methods for a Class of Convex. Functions. Stephen M. Robinson WP April 1996

Methods for a Class of Convex. Functions. Stephen M. Robinson WP April 1996 Working Paper Linear Convergence of Epsilon-Subgradient Descent Methods for a Class of Convex Functions Stephen M. Robinson WP-96-041 April 1996 IIASA International Institute for Applied Systems Analysis

More information

CONVEX ANALYSIS APPROACH TO D. C. PROGRAMMING: THEORY, ALGORITHMS AND APPLICATIONS

CONVEX ANALYSIS APPROACH TO D. C. PROGRAMMING: THEORY, ALGORITHMS AND APPLICATIONS ACTA MATHEMATICA VIETNAMICA Volume 22, Number 1, 1997, pp. 289 355 289 CONVEX ANALYSIS APPROACH TO D. C. PROGRAMMING: THEORY, ALGORITHMS AND APPLICATIONS PHAM DINH TAO AND LE THI HOAI AN Dedicated to Hoang

More information

The Trust Region Subproblem with Non-Intersecting Linear Constraints

The Trust Region Subproblem with Non-Intersecting Linear Constraints The Trust Region Subproblem with Non-Intersecting Linear Constraints Samuel Burer Boshi Yang February 21, 2013 Abstract This paper studies an extended trust region subproblem (etrs in which the trust region

More information

On the simplest expression of the perturbed Moore Penrose metric generalized inverse

On the simplest expression of the perturbed Moore Penrose metric generalized inverse Annals of the University of Bucharest (mathematical series) 4 (LXII) (2013), 433 446 On the simplest expression of the perturbed Moore Penrose metric generalized inverse Jianbing Cao and Yifeng Xue Communicated

More information

Maximal monotone operators are selfdual vector fields and vice-versa

Maximal monotone operators are selfdual vector fields and vice-versa Maximal monotone operators are selfdual vector fields and vice-versa Nassif Ghoussoub Department of Mathematics, University of British Columbia, Vancouver BC Canada V6T 1Z2 nassif@math.ubc.ca February

More information

Convexity, Duality, and Lagrange Multipliers

Convexity, Duality, and Lagrange Multipliers LECTURE NOTES Convexity, Duality, and Lagrange Multipliers Dimitri P. Bertsekas with assistance from Angelia Geary-Nedic and Asuman Koksal Massachusetts Institute of Technology Spring 2001 These notes

More information