ALGORITHMS FOR MINIMIZING DIFFERENCES OF CONVEX FUNCTIONS AND APPLICATIONS
1 ALGORITHMS FOR MINIMIZING DIFFERENCES OF CONVEX FUNCTIONS AND APPLICATIONS. Mau Nam Nguyen (joint work with D. Giles and R. B. Rector), Fariborz Maseeh Department of Mathematics and Statistics, Portland State University. SIAM Conference on Imaging Sciences, Albuquerque, New Mexico, May 2016.
2 Outline. 1. Minimizing Differences of Convex Functions: The DCA. 2. Nesterov's Smoothing Technique via Convex Analysis. 3. The DCA and Nesterov's Smoothing Technique for Weighted Fermat-Torricelli Problems. 4. The DCA and Nesterov's Smoothing Technique for Multifacility Location.
3 Differences of Convex Functions. Consider the problem: minimize $f(x) := g(x) - h(x)$, $x \in \mathbb{R}^n$, where $g : \mathbb{R}^n \to \mathbb{R}$ and $h : \mathbb{R}^n \to \mathbb{R}$ are convex. We call $g - h$ a DC decomposition of $f$. Example: $g(x) = x^4$, $h(x) = 3x^2 + x$, $f = g - h$.
4 Examples of DC Programming. Fermat-Torricelli: $f(x) = \sum_{i=1}^m c_i \|x - a_i\|$. Clustering: $f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{l=1,\dots,k} \|x^l - a_i\|^2$. Multifacility location: $f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{l=1,\dots,k} \|x^l - a_i\|$.
5 Subgradients and Fenchel Conjugates of Convex Functions. Definition. Let $f : \mathbb{R}^n \to (-\infty, \infty]$ be a convex function and let $\bar{x} \in \mathrm{dom}(f) := \{x \in \mathbb{R}^n \mid f(x) < \infty\}$. A subgradient of $f$ at $\bar{x}$ is any $v \in \mathbb{R}^n$ such that $\langle v, x - \bar{x} \rangle \le f(x) - f(\bar{x})$ for all $x \in \mathbb{R}^n$. The subdifferential $\partial f(\bar{x})$ of $f$ at $\bar{x}$ is the set of all subgradients of $f$ at $\bar{x}$. Definition. Let $f : \mathbb{R}^n \to (-\infty, \infty]$ be a function. The Fenchel conjugate of $f$ is defined by $f^*(x) = \sup_{u \in \mathbb{R}^n} \{\langle x, u \rangle - f(u)\}$, $x \in \mathbb{R}^n$.
6 DC Algorithm: The DCA. Consider the problem: minimize $f(x) := g(x) - h(x)$, $x \in \mathbb{R}^n$, where $g : \mathbb{R}^n \to \mathbb{R}$ and $h : \mathbb{R}^n \to \mathbb{R}$ are convex. The DCA.¹ INPUT: $x_1 \in \mathbb{R}^n$, $N \in \mathbb{N}$. for $k = 1, \dots, N$ do: find $y_k \in \partial h(x_k)$; find $x_{k+1} \in \partial g^*(y_k)$. end for. OUTPUT: $x_{N+1}$. ¹ P. D. Tao, L. T. H. An, A d.c. optimization algorithm for solving the trust-region subproblem, SIAM J. Optim. 8 (1998).
7 DC Algorithm: The DCA. Theorem. Let $g, h : \mathbb{R}^n \to (-\infty, \infty]$ be proper lower semicontinuous convex functions. Then $v \in \partial g^*(y)$ if and only if $v \in \operatorname{argmin}\{g(x) - \langle y, x \rangle \mid x \in \mathbb{R}^n\}$. Moreover, $w \in \partial h(x)$ if and only if $w \in \operatorname{argmin}\{h^*(y) - \langle y, x \rangle \mid y \in \mathbb{R}^n\}$.
8 DC Algorithm: The DCA. The DCA. INPUT: $x_1 \in \mathbb{R}^n$, $N \in \mathbb{N}$. for $k = 1, \dots, N$ do: find $y_k \in \partial h(x_k)$, or find $y_k$ approximately by solving: minimize $\psi_k(y) := h^*(y) - \langle x_k, y \rangle$, $y \in \mathbb{R}^n$. Find $x_{k+1} \in \partial g^*(y_k)$, or find $x_{k+1}$ approximately by solving: minimize $\phi_k(x) := g(x) - \langle x, y_k \rangle$, $x \in \mathbb{R}^n$. end for. OUTPUT: $x_{N+1}$.
9 An Example of the DCA. Minimize $f(x) = x^4 - 3x^2 - x$, $x \in \mathbb{R}$. Then $f(x) = g(x) - h(x)$, where $g(x) = x^4$ and $h(x) = 3x^2 + x$. We have $\partial h(x) = \{6x + 1\}$ and $\partial g^*(y) = \operatorname{argmin}\{x^4 - yx \mid x \in \mathbb{R}\} = \{\sqrt[3]{y/4}\}$, so the DCA iteration is $x_{k+1} = \sqrt[3]{(6x_k + 1)/4}$.
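As an illustrative sketch (not part of the slides), the closed-form update $x_{k+1} = \sqrt[3]{(6x_k + 1)/4}$ can be iterated directly; any fixed point satisfies $f'(x) = 4x^3 - 6x - 1 = 0$:

```python
# DCA iteration for f(x) = x^4 - 3x^2 - x with g(x) = x^4, h(x) = 3x^2 + x.
# Sketch of the closed-form update from the slide; starting point is illustrative.

def f(x):
    return x**4 - 3*x**2 - x

def dca_step(x):
    y = 6*x + 1                                    # y_k, the subgradient of h at x_k
    # Real cube root of y/4 (sign-safe): the unique minimizer of x^4 - y x.
    return (y / 4) ** (1 / 3) if y >= 0 else -((-y / 4) ** (1 / 3))

x = 2.0
for _ in range(50):
    assert f(dca_step(x)) <= f(x) + 1e-12          # DCA yields a decreasing sequence
    x = dca_step(x)
```

Starting from $x_1 = 2$, the iterates converge to the stationary point $x \approx 1.3008$, a root of $4x^3 - 6x - 1 = 0$.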
10 DC Algorithm: The DCA. Definition. A function $h : \mathbb{R}^n \to (-\infty, \infty]$ is called $\gamma$-convex ($\gamma \ge 0$) if the function $k(x) := h(x) - \frac{\gamma}{2}\|x\|^2$, $x \in \mathbb{R}^n$, is convex. If there exists $\gamma > 0$ such that $h$ is $\gamma$-convex, then $h$ is called strongly convex with parameter $\gamma$. Theorem. Consider the sequence $\{x_k\}$ generated by the DCA. Suppose that $g$ is $\gamma_1$-convex and $h$ is $\gamma_2$-convex. Then $f(x_k) - f(x_{k+1}) \ge \frac{\gamma_1 + \gamma_2}{2}\|x_{k+1} - x_k\|^2$ for all $k \in \mathbb{N}$.
11 DC Algorithm: The DCA. Definition. We say that $\bar{x} \in \mathbb{R}^n$ is a stationary point of the function $f = g - h$ if $\partial g(\bar{x}) \cap \partial h(\bar{x}) \ne \emptyset$. In the case where $g$ and $h$ are differentiable, $\bar{x}$ is a stationary point of $f$ if and only if $\nabla f(\bar{x}) = \nabla g(\bar{x}) - \nabla h(\bar{x}) = 0$. Theorem. Consider the sequence $\{x_k\}$ generated by the DCA. Then $\{f(x_k)\}$ is a decreasing sequence. Suppose further that $f$ is bounded from below and that $g$ is $\gamma_1$-convex and $h$ is $\gamma_2$-convex with $\gamma_1 + \gamma_2 > 0$. If $\{x_k\}$ is bounded, then every subsequential limit of the sequence $\{x_k\}$ is a stationary point of $f$.
12 The DCA for Clustering. Problem formulation²: Let $a_i$ for $i = 1, \dots, m$ be target points in $\mathbb{R}^n$. Minimize $f(x^1, \dots, x^k) := \sum_{i=1}^m \min\{\|x^l - a_i\|^2 : l = 1, \dots, k\}$ over $x^l \in \mathbb{R}^n$, $l = 1, \dots, k$. ² L. T. H. An, M. T. Belghiti, P. D. Tao, A new efficient algorithm based on DC programming and DCA for clustering, J. Glob. Optim. 27 (2007).
13 K-Means Clustering. Let $x_1, x_2, \dots, x_m$ be the data points and let $c_1, \dots, c_k$ denote the centers. Randomly select $k$ cluster centers. Assign each data point to the nearest center. Find the average of the data points assigned to each center. Repeat the second step with the new centers obtained in the third step until the centroids no longer move. Although k-means clustering is effective in many situations, it also has some disadvantages: the k-means algorithm does not necessarily find the optimal solution, and it is sensitive to the initially selected cluster centers.
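The loop described above (Lloyd's algorithm) can be sketched in a few lines; the toy data, the value of $k$, and the initialization scheme are illustrative assumptions, not from the slides:

```python
# Minimal k-means sketch (Lloyd's algorithm) as described above.
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly select k cluster centers among the data points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Assign each data point to the nearest center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        # Recompute each center as the mean of its assigned points
        # (keep the old center if a cluster becomes empty).
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):   # centroids no longer move
            break
        centers = new_centers
    return centers, labels

# Two point masses at (0, 0) and (10, 10): k-means recovers them exactly.
pts = np.vstack([np.zeros((20, 2)), np.full((20, 2), 10.0)])
centers, labels = kmeans(pts, 2)
```

Even with this clean data, a bad random initialization (both centers drawn from the same mass) forces Lloyd's algorithm to spend extra iterations recovering, which illustrates the sensitivity mentioned above.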
14 DCA for Clustering and K-Means.³ Both DCA1 and DCA2 are better than k-means: the objective values given by DCA1 and DCA2 are much smaller than those computed by k-means. DCA2 is the best among the three algorithms: it provides the best solution in the shortest time. DCA2 is very fast and can therefore handle large-scale problems. $f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{l=1,\dots,k} \|x^l - a_i\|$. This is a nonsmooth nonconvex program for which there are rarely efficient solution algorithms, especially in the large-scale setting. ³ L. T. H. An, L. H. Minh, P. D. Tao, New and efficient DCA based algorithms for minimum sum-of-squares clustering, Pattern Recognition 47 (2014).
15 Nesterov's Smoothing Technique via Convex Analysis. Consider the function $f_0(x) := \max\{\langle Ax, u \rangle - \varphi(u) \mid u \in Q\}$, where $A$ is an $m \times n$ matrix, $Q$ is a nonempty closed bounded convex subset of $\mathbb{R}^m$, and $\varphi : \mathbb{R}^m \to \mathbb{R}$ is a convex function. Define $\|A\| = \sup\{\|Ax\| \mid \|x\| \le 1\}$. For $\mu > 0$, define $f_\mu(x) := \max\{\langle Ax, u \rangle - \varphi(u) - \frac{\mu}{2}\|u - u_0\|^2 \mid u \in Q\}$, $u_0 \in Q$. Then $f_\mu$ is a $C^1$ function with $\ell$-Lipschitz gradient, where $\ell = \frac{\|A\|^2}{\mu}$ and $\nabla f_\mu(x) = A^T u_\mu(x)$. Here $u_\mu(x) \in Q$ is the element at which the maximum is attained in the definition of $f_\mu(x)$.⁴ ⁴ Nesterov: Smooth minimization of non-smooth functions. Math. Program., Ser. A 103 (2005).
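As a concrete sketch (my example, not from the slides): take $\varphi = 0$, $u_0 = 0$, and $Q$ the closed unit ball, so that $f_0(x) = \|Ax\|$. The maximizer $u_\mu(x)$ is then the projection of $Ax/\mu$ onto $Q$, and $f_\mu$ is a Huber-type smoothing of $\|Ax\|$:

```python
import numpy as np

def f0(A, x):
    # f_0(x) = max{<Ax, u> : ||u|| <= 1} = ||Ax||
    return np.linalg.norm(A @ x)

def f_mu(A, x, mu):
    z = A @ x
    n = np.linalg.norm(z)
    u = z / mu if n <= mu else z / n    # u_mu(x): projection of z/mu onto the unit ball
    val = z @ u - 0.5 * mu * (u @ u)    # smoothed value f_mu(x)
    grad = A.T @ u                      # gradient A^T u_mu(x)
    return val, grad

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
mu = 0.5
val, grad = f_mu(A, x, mu)
# Uniform bound: f_mu <= f_0 <= f_mu + (mu/2) * max_{u in Q} ||u - u_0||^2 = f_mu + mu/2.
```

The last comment is the standard approximation bound specialized to this choice of $Q$ and $u_0$, for which $\max_{u \in Q}\|u - u_0\| = 1$.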
16 Nesterov's Smoothing Technique via Convex Analysis. $f^*(x) = \sup\{\langle x, u \rangle - f(u) \mid u \in \mathbb{R}^n\}$. Theorem.ᵃ If $f$ is $\mu$-strongly convex, then $f^*$ has a Lipschitz continuous gradient with modulus $\frac{1}{\mu}$. Moreover, $\nabla f^*(x) = u(x)$, where $u(x)$ is the unique element of $\mathbb{R}^n$ at which the supremum is attained in the definition of $f^*(x)$. ᵃ J. Hiriart-Urruty and C. Lemaréchal: Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1993.
17 Nesterov's Smoothing Technique via Convex Analysis. Theorem. Let $A$ be an $m \times n$ matrix. Suppose that $\varphi : \mathbb{R}^m \to \mathbb{R}$ is a strongly convex function with parameter $\mu > 0$. Then the function $f : \mathbb{R}^n \to \mathbb{R}$ defined by $f(x) := \max\{\langle Ax, u \rangle - \varphi(u) \mid u \in Q\}$ is differentiable with $\nabla f(x) = A^T v(x)$, where $v(x)$ is the unique element at which the maximum is attained in the definition of $f(x)$. The gradient is Lipschitz continuous with constant $\ell = \frac{\|A\|^2}{\mu}$.
18 Nesterov's Smoothing Technique via Convex Analysis. We have $f(x) = \max\{\langle Ax, u \rangle - [\varphi(u) + \delta(u; Q)] \mid u \in \mathbb{R}^m\} = g^*(Ax)$, where $g(u) := \varphi(u) + \delta(u; Q)$. By the chain rule, $\nabla f(x) = A^T \nabla g^*(Ax) = A^T u(Ax) = A^T v(x)$. We also have $\|\nabla f(x_1) - \nabla f(x_2)\| = \|A^T u(Ax_1) - A^T u(Ax_2)\| \le \|A^T\| \, \|u(Ax_1) - u(Ax_2)\| \le \|A\| \frac{1}{\mu} \|Ax_1 - Ax_2\| \le \frac{\|A\|^2}{\mu} \|x_1 - x_2\|$.
19 The Minkowski Gauge. Let $F$ be a nonempty closed bounded convex set in $\mathbb{R}^n$ that contains the origin in its interior. Define the Minkowski gauge associated with $F$ by $\rho_F(x) := \inf\{t > 0 \mid x \in tF\}$. Note that if $F$ is the closed unit ball in $\mathbb{R}^n$, then $\rho_F(x) = \|x\|$. Given a nonempty bounded set $K$, the support function associated with $K$ is given by $\sigma_K(x) := \sup\{\langle x, y \rangle \mid y \in K\}$. It follows from the definition of the Minkowski gauge that $\rho_F(x) = \sigma_{F^*}(x)$, where $F^* := \{y \in \mathbb{R}^n \mid \langle x, y \rangle \le 1 \text{ for all } x \in F\}$.
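As a hypothetical illustration (not from the slides), $\rho_F(x)$ can be evaluated numerically from nothing more than a membership oracle for $F$, by bisection on $t$: for $t > 0$, $x \in tF$ if and only if $x/t \in F$, and membership is monotone in $t$ because $0 \in \operatorname{int} F$.

```python
# Sketch: evaluating the Minkowski gauge rho_F(x) = inf{t > 0 : x in tF}
# by bisection, given a membership oracle in_F for the convex set F (0 in int F).
# The upper bound t_hi is an illustrative assumption on the scale of the data.
import numpy as np

def gauge(x, in_F, t_hi=1e6, tol=1e-10):
    if np.allclose(x, 0.0):
        return 0.0
    t_lo = 0.0
    while t_hi - t_lo > tol:
        t = 0.5 * (t_lo + t_hi)
        if in_F(x / t):      # x in tF  <=>  x/t in F  (monotone in t)
            t_hi = t
        else:
            t_lo = t
    return t_hi

# With F the closed unit ball, the gauge reduces to the Euclidean norm.
ball = lambda y: np.linalg.norm(y) <= 1.0
x = np.array([3.0, 4.0])
r = gauge(x, ball)           # close to ||x|| = 5
```

About 53 bisection steps suffice here; in practice one would use the closed form of $\rho_F$ whenever $F$ has one.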
20 Weighted Fermat-Torricelli Problem. Let $a_i \in \mathbb{R}^n$ for $i = 1, \dots, m$ and let $c_i \ne 0$ for $i = 1, \dots, m$ be real numbers. In the remainder of this section, we study the following generalized version of the Fermat-Torricelli problem: minimize $f(x) := \sum_{i=1}^m c_i \rho_F(x - a_i)$, $x \in \mathbb{R}^n$. The function $f$ has the following obvious DC decomposition: $f(x) = \sum_{c_i > 0} c_i \rho_F(x - a_i) - \sum_{c_i < 0} (-c_i) \rho_F(x - a_i)$. Let $I := \{i \mid c_i > 0\}$ and $J := \{i \mid c_i < 0\}$ with $\alpha_i = c_i$ if $i \in I$ and $\beta_i = -c_i$ if $i \in J$. Then $f(x) = \sum_{i \in I} \alpha_i \rho_F(x - a_i) - \sum_{j \in J} \beta_j \rho_F(x - a_j)$.
22 Weighted Fermat-Torricelli Problem. Theorem. Let $\gamma_1 := \sup\{r > 0 \mid B(0; r) \subseteq F\}$ and $\gamma_2 := \inf\{r > 0 \mid F \subseteq B(0; r)\}$. Suppose that $\gamma_1 \sum_{i \in I} \alpha_i > \gamma_2 \sum_{j \in J} \beta_j$. Then the function $f$ and its approximation $f_\mu$ have absolute minima.
23 Smoothing the Minkowski Gauge. Given any $a \in \mathbb{R}^n$ and $\mu > 0$, a Nesterov smoothing approximation of $\varphi(x) := \rho_F(x - a)$ has the representation $\varphi_\mu(x) = \frac{1}{2\mu}\|x - a\|^2 - \frac{\mu}{2}\big[d\big(\frac{x - a}{\mu}; F^*\big)\big]^2$. Moreover, $\nabla \varphi_\mu(x) = P\big(\frac{x - a}{\mu}; F^*\big)$ and $\varphi_\mu(x) \le \varphi(x) \le \varphi_\mu(x) + \frac{\mu}{2}\|F^*\|^2$, where $\|F^*\| := \sup\{\|u\| \mid u \in F^*\}$.
24 Smoothing the Minkowski Gauge. Theorem. Given any $\mu > 0$, an approximation of the function $f$ is the DC function $f_\mu(x) := g_\mu(x) - h_\mu(x)$, $x \in \mathbb{R}^n$, where $g_\mu(x) := \sum_{i \in I} \frac{\alpha_i}{2\mu}\|x - a_i\|^2$ and $h_\mu(x) := \sum_{i \in I} \frac{\mu \alpha_i}{2}\big[d\big(\frac{x - a_i}{\mu}; F^*\big)\big]^2 + \sum_{j \in J} \beta_j \rho_F(x - a_j)$. Moreover, $f_\mu(x) \le f(x) \le f_\mu(x) + \frac{\mu \|F^*\|^2}{2}\sum_{i \in I} \alpha_i$ for all $x \in \mathbb{R}^n$.
26 The DCA for Weighted Fermat-Torricelli Problems. Algorithm 3. INPUT: $\mu > 0$, $x_1 \in \mathbb{R}^n$, $N \in \mathbb{N}$, $F$, $a_1, \dots, a_m \in \mathbb{R}^n$, $c_1, \dots, c_m \in \mathbb{R}$. for $k = 1, \dots, N$ do: find $y_k = u_k + v_k$, where $u_k := \sum_{i \in I} \alpha_i\big[\frac{x_k - a_i}{\mu} - P\big(\frac{x_k - a_i}{\mu}; F^*\big)\big]$ and $v_k \in \partial\big(\sum_{j \in J} \beta_j \rho_F(\cdot - a_j)\big)(x_k)$. Find $x_{k+1} = \big(y_k + \sum_{i \in I} \alpha_i a_i / \mu\big) \big/ \big(\sum_{i \in I} \alpha_i / \mu\big)$. end for. OUTPUT: $x_{N+1}$.
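A minimal sketch of Algorithm 3 for the special case $F = \mathbb{B}$ (so $\rho_F = \|\cdot\|$ and $F^* = \mathbb{B}$) with all weights $c_i > 0$, so $J = \emptyset$ and $v_k = 0$; the triangle data, the uniform weights, and the value of $\mu$ are illustrative assumptions:

```python
import numpy as np

def proj_ball(z):
    # Euclidean projection onto the closed unit ball.
    n = np.linalg.norm(z)
    return z if n <= 1.0 else z / n

def dca_fermat_torricelli(a, alpha, mu=0.1, iters=2000):
    x = a.mean(axis=0)                    # start from the centroid (a choice, not required)
    s = alpha.sum()
    for _ in range(iters):
        # y_k = u_k (v_k = 0 since all weights are positive).
        u = sum(al * ((x - ai) / mu - proj_ball((x - ai) / mu))
                for al, ai in zip(alpha, a))
        # x_{k+1} = (y_k + sum_i alpha_i a_i / mu) / (sum_i alpha_i / mu).
        x = (mu * u + alpha @ a) / s
    return x

a = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
alpha = np.ones(3)
x_star = dca_fermat_torricelli(a, alpha)
# x_star approximates the Fermat-Torricelli point (2, 2/sqrt(3)) of this triangle.
```

Here all distances at the solution are much larger than $\mu$, so the smoothed objective differs from the true one only by a constant and the smoothed minimizer coincides with the classical Fermat-Torricelli point.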
27 Multifacility Location. We now consider the multifacility location problem: given $m$ points $a_1, \dots, a_m \in \mathbb{R}^n$, minimize $f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{l=1,\dots,k} \rho_F(x^l - a_i)$, where $F$ is a nonempty closed bounded convex set containing the origin and $\rho_F(x) = \inf\{t > 0 \mid x \in tF\}$ is the Minkowski gauge. When $F$ is the closed unit ball $\mathbb{B}$, the problem becomes: minimize $f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{l=1,\dots,k} \|x^l - a_i\|$. It can be shown that a globally optimal solution exists.
28 DC Decomposition. $f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{l=1,\dots,k} \|x^l - a_i\|$. We will utilize the fact that $\min_{l=1,\dots,k} \|x^l - a_i\| = \sum_{l=1}^k \|x^l - a_i\| - \max_{r=1,\dots,k} \sum_{l \ne r} \|x^l - a_i\|$, so that $f(x^1, \dots, x^k) = \sum_{i=1}^m \sum_{l=1}^k \|x^l - a_i\| - \sum_{i=1}^m \max_{r=1,\dots,k} \sum_{l \ne r} \|x^l - a_i\|$.
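The identity behind this decomposition, $\min_l t_l = \sum_l t_l - \max_r \sum_{l \ne r} t_l$, is easy to sanity-check numerically (an illustrative check, not from the slides):

```python
import random

random.seed(0)
t = [random.uniform(0.0, 10.0) for _ in range(5)]
# The r-excluded sum is largest exactly when the smallest term is excluded,
# so sum(t) - max_r (sum(t) - t[r]) recovers min(t).
lhs = min(t)
rhs = sum(t) - max(sum(t) - t[r] for r in range(len(t)))
```

Both sides agree up to floating-point rounding, which is what lets the nonconvex $\min$ be written as a difference of two convex functions of $(x^1, \dots, x^k)$.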
29 DC Decomposition. We obtain the $\mu$-smoothing approximation $f_\mu = g_\mu - h_\mu$, where $g_\mu(x^1, \dots, x^k) = \frac{1}{2\mu}\sum_{i=1}^m \sum_{l=1}^k \|x^l - a_i\|^2$ and $h_\mu(x^1, \dots, x^k) = \frac{\mu}{2}\sum_{i=1}^m \sum_{l=1}^k \big[d\big(\frac{x^l - a_i}{\mu}; \mathbb{B}\big)\big]^2 + \sum_{i=1}^m \max_{r=1,\dots,k} \sum_{l \ne r} \Big(\frac{1}{2\mu}\|x^l - a_i\|^2 - \frac{\mu}{2}\big[d\big(\frac{x^l - a_i}{\mu}; \mathbb{B}\big)\big]^2\Big)$. To implement the DCA, we need $\nabla g_\mu^*$ and $\partial h_\mu$.
30 $g_\mu$. Using the Frobenius norm on a space of matrices, we express $g_\mu$ as $G_\mu(X) = \frac{m}{2\mu}\|X\|^2 - \frac{1}{\mu}\langle X, B \rangle + \frac{k}{2\mu}\|A\|^2$, with the inner product $\langle A, B \rangle = \sum_{l=1}^k \sum_{j=1}^n a_{lj} b_{lj}$, where $X$ is the $k \times n$ matrix with rows $x^1, \dots, x^k$, $A$ is the $m \times n$ matrix with rows $a_1, \dots, a_m$, and $B$ is the $k \times n$ matrix whose every row is the sum $a_1 + \dots + a_m$. Then $\nabla G_\mu(X) = \frac{m}{\mu}X - \frac{1}{\mu}B$ and $\nabla G_\mu^*(Y) = \frac{1}{m}(B + \mu Y)$, using that $X \in \partial G^*(Y)$ iff $Y \in \partial G(X)$.
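A quick numerical sketch (dimensions and data are illustrative assumptions) that $\nabla G_\mu^*$ inverts $\nabla G_\mu$, which is exactly what the DCA update exploits:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, mu = 6, 2, 3, 0.5
A = rng.standard_normal((m, n))              # rows a_1, ..., a_m
B = np.tile(A.sum(axis=0), (k, 1))           # every row of B equals a_1 + ... + a_m

def grad_G(X):
    return (m * X - B) / mu                  # grad G_mu(X) = (m/mu) X - (1/mu) B

def grad_G_star(Y):
    return (B + mu * Y) / m                  # grad G_mu*(Y) = (1/m)(B + mu Y)

X = rng.standard_normal((k, n))
X_back = grad_G_star(grad_G(X))              # should recover X exactly
```

Since $G_\mu$ is a strongly convex quadratic, its gradient is an invertible affine map and the conjugate's gradient is its inverse, so the DCA subproblem for $X_{k+1}$ is solved in closed form.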
31 $h_\mu$. $h_\mu(x^1, \dots, x^k) = \frac{\mu}{2}\sum_{i=1}^m \sum_{l=1}^k \big[d\big(\frac{x^l - a_i}{\mu}; \mathbb{B}\big)\big]^2 + \sum_{i=1}^m \max_{r=1,\dots,k} \sum_{l \ne r} \Big(\frac{1}{2\mu}\|x^l - a_i\|^2 - \frac{\mu}{2}\big[d\big(\frac{x^l - a_i}{\mu}; \mathbb{B}\big)\big]^2\Big) =: H_1(X) + H_2(X)$.
32 $h_\mu$. $\nabla H_1(X) = \sum_{i=1}^m M_i$, where $M_i$ is the $k \times n$ matrix whose $l$-th row is $\frac{x^l - a_i}{\mu} - P\big(\frac{x^l - a_i}{\mu}; \mathbb{B}\big)$. For each $i = 1, \dots, m$, there is some $R_i$ such that the $R_i$-excluded sum is maximal. If we call this sum $F_{R_i}$, then $\nabla F_{R_i}$ is the $k \times n$ matrix whose $l$-th row is $P\big(\frac{x^l - a_i}{\mu}; \mathbb{B}\big)$ for $l \ne R_i$ and whose $R_i$-th row is $0$. Then $V \in \partial H_2(X)$ if $V = \sum_{i=1}^m \nabla F_{R_i}$.
33 Multifacility Location Algorithm. INPUT: $X_1 \in \operatorname{dom} g$, $N \in \mathbb{N}$. for $k = 1, \dots, N$ do: compute $Y_k = \nabla H_1(X_k) + V_k$ with $V_k \in \partial H_2(X_k)$; compute $X_{k+1} = \frac{1}{m}(B + \mu Y_k)$. end for. OUTPUT: $X_{N+1}$.
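A minimal sketch of this algorithm for $F = \mathbb{B}$. The two-cluster toy data, the initialization, and the choice of $\mu$ are illustrative assumptions; $R_i$ is chosen by dropping the smallest smoothed-distance term, which maximizes the excluded sum:

```python
import numpy as np

def proj_rows(Z):
    # Row-wise Euclidean projection onto the closed unit ball.
    nrm = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(nrm, 1.0)

def dca_multifacility(A, X0, mu=0.1, iters=500):
    m, n = A.shape
    k = X0.shape[0]
    B = np.tile(A.sum(axis=0), (k, 1))       # every row of B is a_1 + ... + a_m
    X = X0.copy()
    for _ in range(iters):
        Y = np.zeros((k, n))
        for i in range(m):
            D = (X - A[i]) / mu              # l-th row: (x^l - a_i)/mu
            P = proj_rows(D)
            Y += D - P                       # contribution to grad H_1
            # Smoothed distance terms (up to the factor mu); the maximal
            # excluded sum drops the smallest one.
            vals = np.sum(D * P, axis=1) - 0.5 * np.sum(P * P, axis=1)
            G = P.copy()
            G[np.argmin(vals)] = 0.0         # grad F_{R_i}: R_i-th row zeroed
            Y += G                           # contribution to V_k in dH_2
        X = (B + mu * Y) / m                 # X_{k+1} = (1/m)(B + mu Y_k)
    return X

A = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
X0 = A[[0, -1]] + 0.3                        # perturbed two-point initialization
X = dca_multifacility(A, X0)
# The two rows of X approach the cluster centers (0, 0) and (5, 5).
```

On this separable toy instance the iteration contracts toward the two point masses; for harder data the algorithm, like k-means, only guarantees convergence to a stationary point.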
34-36 Clustering. [Figures: clustering test instances; no recoverable text.]
37 Results. [Figures: plots of the objective value $f(x_k)$ versus iteration for five test problems: Fermat-Torricelli in $\mathbb{R}^2$, Eight Rings in $\mathbb{R}^2$, Multifacility in $\mathbb{R}^2$, Multifacility in $\mathbb{R}^{10}$, and Multifacility Sets in $\mathbb{R}^2$.]
38 References. [1] An, Belghiti, Tao: A new efficient algorithm based on DC programming and DCA for clustering. J. Glob. Optim. 27 (2007). [2] Nam, Rector, Giles: Minimizing differences of convex functions and applications to facility location and clustering. arXiv:1511.07595 (2015). [3] Nesterov: Smooth minimization of non-smooth functions. Math. Program., Ser. A 103 (2005). [4] Rockafellar: Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
More informationFrom error bounds to the complexity of first-order descent methods for convex functions
From error bounds to the complexity of first-order descent methods for convex functions Nguyen Trong Phong-TSE Joint work with Jérôme Bolte, Juan Peypouquet, Bruce Suter. Toulouse, 23-25, March, 2016 Journées
More informationA user s guide to Lojasiewicz/KL inequalities
Other A user s guide to Lojasiewicz/KL inequalities Toulouse School of Economics, Université Toulouse I SLRA, Grenoble, 2015 Motivations behind KL f : R n R smooth ẋ(t) = f (x(t)) or x k+1 = x k λ k f
More informationIntroduction. New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems
New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems Z. Akbari 1, R. Yousefpour 2, M. R. Peyghami 3 1 Department of Mathematics, K.N. Toosi University of Technology,
More informationLecture: Duality of LP, SOCP and SDP
1/33 Lecture: Duality of LP, SOCP and SDP Zaiwen Wen Beijing International Center For Mathematical Research Peking University http://bicmr.pku.edu.cn/~wenzw/bigdata2017.html wenzw@pku.edu.cn Acknowledgement:
More informationDual Decomposition.
1/34 Dual Decomposition http://bicmr.pku.edu.cn/~wenzw/opt-2017-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/34 1 Conjugate function 2 introduction:
More informationOptimization and Optimal Control in Banach Spaces
Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,
More informationCoordinate Update Algorithm Short Course Subgradients and Subgradient Methods
Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n
More information5. Subgradient method
L. Vandenberghe EE236C (Spring 2016) 5. Subgradient method subgradient method convergence analysis optimal step size when f is known alternating projections optimality 5-1 Subgradient method to minimize
More informationBISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption
BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption Daniel Reem (joint work with Simeon Reich and Alvaro De Pierro) Department of Mathematics, The Technion,
More informationA Dykstra-like algorithm for two monotone operators
A Dykstra-like algorithm for two monotone operators Heinz H. Bauschke and Patrick L. Combettes Abstract Dykstra s algorithm employs the projectors onto two closed convex sets in a Hilbert space to construct
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationExtended Monotropic Programming and Duality 1
March 2006 (Revised February 2010) Report LIDS - 2692 Extended Monotropic Programming and Duality 1 by Dimitri P. Bertsekas 2 Abstract We consider the problem minimize f i (x i ) subject to x S, where
More informationConvex Sets and Functions (part III)
Convex Sets and Functions (part III) Prof. Dan A. Simovici UMB 1 / 14 Outline 1 The epigraph and hypograph of functions 2 Constructing Convex Functions 2 / 14 The epigraph and hypograph of functions Definition
More informationOn DC. optimization algorithms for solving minmax flow problems
On DC. optimization algorithms for solving minmax flow problems Le Dung Muu Institute of Mathematics, Hanoi Tel.: +84 4 37563474 Fax: +84 4 37564303 Email: ldmuu@math.ac.vn Le Quang Thuy Faculty of Applied
More informationGeneralized Pattern Search Algorithms : unconstrained and constrained cases
IMA workshop Optimization in simulation based models Generalized Pattern Search Algorithms : unconstrained and constrained cases Mark A. Abramson Air Force Institute of Technology Charles Audet École Polytechnique
More informationA New Trust Region Algorithm Using Radial Basis Function Models
A New Trust Region Algorithm Using Radial Basis Function Models Seppo Pulkkinen University of Turku Department of Mathematics July 14, 2010 Outline 1 Introduction 2 Background Taylor series approximations
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationEE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 17
EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 17 Andre Tkacenko Signal Processing Research Group Jet Propulsion Laboratory May 29, 2012 Andre Tkacenko
More informationAn Optimal Affine Invariant Smooth Minimization Algorithm.
An Optimal Affine Invariant Smooth Minimization Algorithm. Alexandre d Aspremont, CNRS & École Polytechnique. Joint work with Martin Jaggi. Support from ERC SIPA. A. d Aspremont IWSL, Moscow, June 2013,
More informationConvex Analysis and Economic Theory Fall 2018 Winter 2019
Division of the Humanities and Social Sciences Ec 181 KC Border Convex Analysis and Economic Theory Fall 2018 Winter 2019 Topic 18: Differentiability 18.1 Differentiable functions In this section I want
More informationConvex analysis and profit/cost/support functions
Division of the Humanities and Social Sciences Convex analysis and profit/cost/support functions KC Border October 2004 Revised January 2009 Let A be a subset of R m Convex analysts may give one of two
More informationOn the convergence rate of a forward-backward type primal-dual splitting algorithm for convex optimization problems
On the convergence rate of a forward-backward type primal-dual splitting algorithm for convex optimization problems Radu Ioan Boţ Ernö Robert Csetnek August 5, 014 Abstract. In this paper we analyze the
More informationSmoothing Proximal Gradient Method. General Structured Sparse Regression
for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:
More informationDUALIZATION OF SUBGRADIENT CONDITIONS FOR OPTIMALITY
DUALIZATION OF SUBGRADIENT CONDITIONS FOR OPTIMALITY R. T. Rockafellar* Abstract. A basic relationship is derived between generalized subgradients of a given function, possibly nonsmooth and nonconvex,
More informationOn the iterate convergence of descent methods for convex optimization
On the iterate convergence of descent methods for convex optimization Clovis C. Gonzaga March 1, 2014 Abstract We study the iterate convergence of strong descent algorithms applied to convex functions.
More informationRelationships between upper exhausters and the basic subdifferential in variational analysis
J. Math. Anal. Appl. 334 (2007) 261 272 www.elsevier.com/locate/jmaa Relationships between upper exhausters and the basic subdifferential in variational analysis Vera Roshchina City University of Hong
More information