Majorization Minimization (MM) and Block Coordinate Descent (BCD)
Wing-Kin (Ken) Ma, Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong
ELEG5481, Lecture 15
Acknowledgment: Qiang Li for helping prepare the slides.
Outline
- Majorization Minimization (MM)
  - Convergence
  - Applications
- Block Coordinate Descent (BCD)
  - Applications
  - Convergence
- Summary
Majorization Minimization

Consider the following problem

  min_x f(x)  s.t. x ∈ X,   (1)

where X is a closed convex set, and f(·) may be nonconvex and/or nonsmooth.

Challenge: for a general f(·), problem (1) can be difficult to solve.

Majorization Minimization: iteratively generate {x^r} as follows

  x^r ∈ arg min_{x ∈ X} u(x, x^{r-1}),   (2)

where u(x, x^{r-1}) is a surrogate function of f(x), satisfying

1. u(x, x^r) ≥ f(x), for all x ∈ X and all x^r;
2. u(x^r, x^r) = f(x^r).
Figure 1: A pictorial illustration of the MM algorithm: the majorants u(x; x^0) and u(x; x^1) touch f at x^0 and x^1, and their minimizers give x^1 and x^2.

Property 1. {f(x^r)} is nonincreasing, i.e., f(x^r) ≤ f(x^{r-1}), r = 1, 2, ....

Proof. f(x^r) ≤ u(x^r, x^{r-1}) ≤ u(x^{r-1}, x^{r-1}) = f(x^{r-1}).

The nonincreasing property of {f(x^r)} implies that {f(x^r)} converges (provided f is bounded below). But how about the convergence of the iterates {x^r}?
Technical Preliminaries

Limit point: x̄ is a limit point of {x^k} if there exists a subsequence of {x^k} that converges to x̄. Note that every bounded sequence in R^n has a limit point (or convergent subsequence).

Directional derivative: let f : D → R be a function, where D ⊆ R^m is a convex set. The directional derivative of f at point x in direction d is defined by

  f'(x; d) = lim inf_{λ ↓ 0} [f(x + λd) − f(x)] / λ.

If f is differentiable, then f'(x; d) = d^T ∇f(x).

Stationary point: x ∈ D is a stationary point of f(·) if

  f'(x; d) ≥ 0, for all d such that x + d ∈ D.   (3)

A stationary point may be a local minimum, a local maximum, or a saddle point. If D = R^n and f is differentiable, then (3) ⇔ ∇f(x) = 0.
Convergence of MM

Assumption 1. u(·, ·) satisfies the following conditions:

  u(y, y) = f(y), for all y ∈ X,   (4a)
  u(x, y) ≥ f(x), for all x, y ∈ X,   (4b)
  u'(x, y; d)|_{x=y} = f'(y; d), for all d with y + d ∈ X,   (4c)
  u(x, y) is continuous in (x, y).   (4d)

Condition (4c) means that the first-order local behavior of u(·, x^{r-1}) is the same as that of f(·).
Convergence of MM

Theorem 1 [Razaviyayn-Hong-Luo]. Assume that Assumption 1 is satisfied. Then every limit point of the iterates generated by the MM algorithm is a stationary point of problem (1).

Proof. From Property 1, we know that

  f(x^{r+1}) ≤ u(x^{r+1}, x^r) ≤ u(x, x^r), for all x ∈ X.

Now assume that there exists a subsequence {x^{r_j}} of {x^r} converging to a limit point z, i.e., lim_{j→∞} x^{r_j} = z. Then

  u(x^{r_{j+1}}, x^{r_{j+1}}) = f(x^{r_{j+1}}) ≤ f(x^{r_j + 1}) ≤ u(x^{r_j + 1}, x^{r_j}) ≤ u(x, x^{r_j}), for all x ∈ X.

Letting j → ∞, we obtain u(z, z) ≤ u(x, z) for all x ∈ X, which implies that

  u'(x, z; d)|_{x=z} ≥ 0, for all d with z + d ∈ X.

Combining the above inequality with (4c) (i.e., u'(x, y; d)|_{x=y} = f'(y; d) for all d with y + d ∈ X), we have f'(z; d) ≥ 0 for all d with z + d ∈ X, i.e., z is a stationary point of (1).
Applications: Nonnegative Least Squares

In many engineering applications, we encounter the following problem

  (NLS)  min_{x ≥ 0} ||Ax − b||_2^2,   (5)

where b ∈ R^m_+, b ≠ 0, and A ∈ R^{m×n}_{++}. It is an LS problem with nonnegativity constraints, so the conventional LS solution may not be feasible for (5).

A simple multiplicative updating algorithm:

  x_l^r = c_l^r x_l^{r-1}, l = 1, ..., n,   (6)

where x_l^r is the l-th component of x^r, and

  c_l^r = [A^T b]_l / [A^T A x^{r-1}]_l.

Starting with x^0 > 0, all iterates x^r generated by (6) are nonnegative.
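As a quick illustration, the update (6) can be sketched in a few lines (NumPy assumed; the iteration count, the all-ones initialization, and the toy data are arbitrary choices for this sketch, not from the slides):

```python
import numpy as np

def nls_multiplicative(A, b, iters=2000):
    # Multiplicative update (6): x_l <- x_l * [A^T b]_l / [A^T A x]_l.
    # Assumes A > 0 elementwise and b >= 0, as in (5), so that
    # starting from x^0 > 0 every iterate stays strictly positive.
    x = np.ones(A.shape[1])          # x^0 > 0
    Atb = A.T @ b
    for _ in range(iters):
        x = x * Atb / (A.T @ (A @ x))
    return x

# Toy usage: b lies in the nonnegative range of A, so the residual shrinks.
rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, (20, 5))
b = A @ rng.uniform(0.0, 1.0, 5)
x = nls_multiplicative(A, b)
```

Note how the elementwise ratio keeps the iterate in the positive orthant without any projection step.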
Figure 2: ||Ax^r − b||_2 vs. the number of iterations. Usually the multiplicative update converges within a few tens of iterations.
MM interpretation: let f(x) = ||Ax − b||_2^2. The multiplicative update essentially solves the problem

  min_{x ≥ 0} u(x, x^{r-1}),

where

  u(x, x^{r-1}) = f(x^{r-1}) + (x − x^{r-1})^T ∇f(x^{r-1}) + (1/2)(x − x^{r-1})^T Φ(x^{r-1})(x − x^{r-1}),

  Φ(x^{r-1}) = Diag( [A^T A x^{r-1}]_1 / x_1^{r-1}, ..., [A^T A x^{r-1}]_n / x_n^{r-1} ).

Observations:
- u(x, x^{r-1}) is a quadratic approximation of f(x) with Φ(x^{r-1}) ⪰ A^T A, which implies u(x, x^{r-1}) ≥ f(x) for all x ∈ R^n and u(x^{r-1}, x^{r-1}) = f(x^{r-1}).
- The multiplicative update converges to an optimal solution of NLS (by the MM convergence in Theorem 1 and the convexity of NLS).
Applications: Convex-Concave Procedure / DC Programming

Suppose that f(x) has the following form

  f(x) = g(x) − h(x),

where g(x) and h(x) are convex and differentiable. Thus, f(x) is in general nonconvex.

DC Programming: construct u(·, ·) as

  u(x, x^r) = g(x) − ( h(x^r) + ∇h(x^r)^T (x − x^r) ),

where the subtracted term is the linearization of h at x^r. By the first-order condition of the convex function h(x), it is easy to show that

  u(x, x^r) ≥ f(x), for all x ∈ X,  and  u(x^r, x^r) = f(x^r).
Sparse Signal Recovery by DC Programming

  min_x ||x||_0  s.t. y = Ax   (7)

Apart from the popular l1 approximation, consider the following concave approximation:

  min_{x ∈ R^n} Σ_{i=1}^n log(1 + |x_i|/ε)  s.t. y = Ax.

Figure 3: log(1 + |x|/ε) promotes more sparsity than l1.
The problem

  min_{x ∈ R^n} Σ_{i=1}^n log(1 + |x_i|/ε)  s.t. y = Ax

can be equivalently written as

  min_{x, z ∈ R^n} Σ_{i=1}^n log(z_i + ε)  s.t. y = Ax, |x_i| ≤ z_i, i = 1, ..., n.   (8)

Problem (8) minimizes a concave objective, so it is a special case of DC programming (with g(x) = 0). Linearizing the concave objective at (x^r, z^r) yields

  (x^{r+1}, z^{r+1}) = arg min Σ_{i=1}^n z_i / (z_i^r + ε)  s.t. y = Ax, |x_i| ≤ z_i, i = 1, ..., n.

We solve a sequence of reweighted l1 problems.
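Each reweighted l1 subproblem is an LP in the stacked variable [x; z], so one possible sketch uses SciPy's LP solver (SciPy, the number of outer iterations, and the toy data are my assumptions here, not part of the slides):

```python
import numpy as np
from scipy.optimize import linprog

def reweighted_l1(A, y, eps=1e-2, outer=5):
    # Each outer iteration solves the linearized problem
    #   min sum_i z_i / (z_i^r + eps)  s.t.  y = Ax,  |x_i| <= z_i,
    # as an LP in v = [x; z].
    m, n = A.shape
    z = np.ones(n)                                  # z^0
    A_eq = np.hstack([A, np.zeros((m, n))])         # y = Ax
    A_ub = np.block([[ np.eye(n), -np.eye(n)],      #  x - z <= 0
                     [-np.eye(n), -np.eye(n)]])     # -x - z <= 0
    b_ub = np.zeros(2 * n)
    bounds = [(None, None)] * n + [(0, None)] * n   # x free, z >= 0
    for _ in range(outer):
        c = np.concatenate([np.zeros(n), 1.0 / (z + eps)])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                      bounds=bounds, method="highs")
        x, z = res.x[:n], res.x[n:]
    return x

# Toy usage: recover a 2-sparse signal from 8 random measurements.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 20))
x0 = np.zeros(20)
x0[[3, 11]] = [1.5, -2.0]
x = reweighted_l1(A, A @ x0)
```

The first outer iteration (uniform weights) is plain l1 minimization; subsequent iterations down-weight large entries and sharpen sparsity.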
Figure (from J. Fourier Anal. Appl. (2008) 14: 877-905, Fig. 2): sparse signal recovery through reweighted l1 iterations. (a) Original length n = 512 signal x_0 with 130 spikes. (b) Scatter plot, coefficient-by-coefficient, of x_0 versus its reconstruction x^(0) using unweighted l1 minimization. (c) Reconstruction x^(1) after the first reweighted iteration. (d) Reconstruction x^(2) after the second reweighted iteration.
Applications: l2-lp Optimization

Many problems involve solving the following problem (e.g., basis-pursuit denoising)

  min_x f(x) = (1/2)||y − Ax||_2^2 + μ||x||_p,   (9)

where p ≥ 1. If A = I or A is unitary, the optimal x is computed in closed form as

  x* = A^T y − Proj_C(A^T y),

where C = {x : ||x||_{p*} ≤ μ}, p* is the dual norm of p, and Proj_C denotes the projection onto C. In particular, for p = 1 and A = I,

  x_i* = soft(y_i, μ), i = 1, ..., n,

where soft(u, a) = sign(u) max{|u| − a, 0} denotes a soft-thresholding operation (for unitary A, replace y by A^T y). For general A, there is no simple closed-form solution for (9).
MM for the l2-lp Problem: consider a modified l2-lp problem

  min_x u(x, x^r) = f(x) + dist(x, x^r),   (10)

where dist(x, x^r) = (c/2)||x − x^r||_2^2 − (1/2)||Ax − Ax^r||_2^2 and c > λ_max(A^T A).

- dist(x, x^r) ≥ 0 for all x, with equality at x = x^r, so u(x, x^r) majorizes f(x).
- u(x, x^r) can be re-expressed as

  u(x, x^r) = (c/2)||x − x̄^r||_2^2 + μ||x||_p + const.,  where x̄^r = (1/c)A^T(y − Ax^r) + x^r.

- The modified l2-lp problem (10) has a simple soft-thresholding solution.
- Repeatedly solving problem (10) leads to an optimal solution of the l2-lp problem (by the MM convergence in Theorem 1 and convexity).
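For p = 1 this MM scheme is the classical iterative soft-thresholding iteration; a minimal sketch (NumPy, the step count, and the 1.01 safety factor on c are my choices, not from the slides):

```python
import numpy as np

def soft(u, a):
    # soft(u, a) = sign(u) * max(|u| - a, 0)
    return np.sign(u) * np.maximum(np.abs(u) - a, 0.0)

def ista(A, y, mu, iters=1000):
    # MM iteration for min 0.5||y - Ax||^2 + mu||x||_1:
    # c > lambda_max(A^T A) makes u(x, x^r) a majorizer of f,
    # and the minimizer of u(., x^r) is soft(xbar^r, mu/c).
    c = 1.01 * np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        xbar = x + A.T @ (y - A @ x) / c     # \bar{x}^r from the slide
        x = soft(xbar, mu / c)
    return x

# Toy usage on a random underdetermined problem.
rng = np.random.default_rng(2)
A = rng.standard_normal((30, 50))
y = rng.standard_normal(30)
mu = 1.0
x = ista(A, y, mu)
```

Each iteration costs only two matrix-vector products plus an elementwise threshold, which is the appeal of the MM construction.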
Applications: Expectation Maximization (EM)

Consider an ML estimate of θ, given the random observation w:

  θ̂_ML = arg min_θ − ln p(w|θ).

Suppose that there are some missing data or hidden variables z in the model. Then the EM algorithm iteratively computes an ML estimate θ̂ as follows:

  E-step: g(θ, θ^r) = E_{z|w,θ^r}{ln p(w, z|θ)};
  M-step: θ^{r+1} = arg max_θ g(θ, θ^r);

repeat the above two steps until convergence.

- The EM algorithm generates a nonincreasing sequence {− ln p(w|θ^r)}.
- The EM algorithm can be interpreted by MM.
MM interpretation of the EM algorithm:

  − ln p(w|θ) = − ln E_{z|θ}[ p(w|z, θ) ]
             = − ln E_{z|θ}[ p(z|w, θ^r) p(w|z, θ) / p(z|w, θ^r) ]
             = − ln E_{z|w,θ^r}[ p(z|θ) p(w|z, θ) / p(z|w, θ^r) ]   (interchange the integrations)
             ≤ − E_{z|w,θ^r} ln [ p(z|θ) p(w|z, θ) / p(z|w, θ^r) ]   (Jensen's inequality)
             = − E_{z|w,θ^r} ln p(w, z|θ) + E_{z|w,θ^r} ln p(z|w, θ^r)   (11a)
             =: u(θ, θ^r).

- u(θ, θ^r) majorizes − ln p(w|θ), and − ln p(w|θ^r) = u(θ^r, θ^r);
- the E-step essentially constructs u(θ, θ^r);
- the M-step minimizes u(θ, θ^r) (note that θ appears in the first term of (11a) only).
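A concrete instance, not from the slides: EM for a 1-D mixture of two unit-variance Gaussians, where z is the hidden component label and θ = (π, μ1, μ2). The min/max initialization and iteration count are my choices for this sketch:

```python
import numpy as np

def em_gmm2(w, iters=200):
    # EM for w_i ~ pi*N(mu1, 1) + (1-pi)*N(mu2, 1); z = component label.
    pi, mu1, mu2 = 0.5, np.min(w), np.max(w)        # crude initialization
    for _ in range(iters):
        # E-step: responsibilities p(z = 1 | w_i, theta^r),
        # which define the surrogate g(theta, theta^r).
        p1 = pi * np.exp(-0.5 * (w - mu1) ** 2)
        p2 = (1.0 - pi) * np.exp(-0.5 * (w - mu2) ** 2)
        g = p1 / (p1 + p2)
        # M-step: maximize g(theta, theta^r) in closed form.
        pi = g.mean()
        mu1 = (g * w).sum() / g.sum()
        mu2 = ((1.0 - g) * w).sum() / (1.0 - g).sum()
    return pi, mu1, mu2

# Toy usage: a 40/60 mixture centered at -2 and 3.
rng = np.random.default_rng(3)
lbl = rng.random(2000) < 0.4
w = np.where(lbl, rng.normal(-2.0, 1.0, 2000), rng.normal(3.0, 1.0, 2000))
pi, mu1, mu2 = em_gmm2(w)
```

By the MM view above, each pass cannot increase − ln p(w|θ), which is why the iteration is so stable.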
Outline
- Majorization Minimization (MM)
  - Convergence
  - Applications
- Block Coordinate Descent (BCD)
  - Applications
  - Convergence
- Summary
Block Coordinate Descent

Consider the following problem

  min_x f(x)  s.t. x ∈ X = X_1 × X_2 × ... × X_m ⊆ R^n,   (12)

where each X_i ⊆ R^{n_i} is closed, nonempty and convex.

BCD Algorithm:
1: find a feasible point x^0 ∈ X and set r = 0
2: repeat
3:   r = r + 1, i = ((r − 1) mod m) + 1
4:   let x̄_i ∈ arg min_{x ∈ X_i} f(x_1^{r-1}, ..., x_{i-1}^{r-1}, x, x_{i+1}^{r-1}, ..., x_m^{r-1})
5:   set x_i^r = x̄_i and x_k^r = x_k^{r-1} for all k ≠ i
6: until some convergence criterion is met

Merits of BCD:
1. each subproblem is much easier to solve, or even has a closed-form solution;
2. the objective value is nonincreasing along the BCD updates;
3. it allows parallel or distributed implementations.
Applications: l2-l1 Optimization Problem

Let us revisit the l2-l1 problem

  min_{x ∈ R^n} f(x) = (1/2)||y − Ax||_2^2 + μ||x||_1.   (13)

Apart from MM, BCD is another efficient approach to solve (13):

- Optimize x_k while fixing x_j = x_j^r, j ≠ k:

  min_{x_k} f_k(x_k) = (1/2)||ȳ − a_k x_k||_2^2 + μ|x_k|,  where ȳ = y − Σ_{j≠k} a_j x_j^r.

- The optimal x_k has a closed form:

  x̄_k = soft( a_k^T ȳ / ||a_k||^2, μ / ||a_k||^2 ).

- Cyclically update x_k, k = 1, ..., n, until convergence.
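This cyclic update can be sketched with a running residual so each scalar step costs O(m) (NumPy, the sweep count, and the toy data are my assumptions, not from the slides):

```python
import numpy as np

def soft(u, a):
    return np.sign(u) * np.maximum(np.abs(u) - a, 0.0)

def bcd_lasso(A, y, mu, sweeps=200):
    # Cyclic BCD: each scalar subproblem
    #   min_{x_k} 0.5||ybar - a_k x_k||^2 + mu|x_k|
    # has the closed form soft(a_k^T ybar / ||a_k||^2, mu / ||a_k||^2).
    n = A.shape[1]
    x = np.zeros(n)
    sq = (A ** 2).sum(axis=0)        # ||a_k||^2
    r = y - A @ x                    # running residual y - Ax
    for _ in range(sweeps):
        for k in range(n):
            ybar = r + A[:, k] * x[k]          # y - sum_{j != k} a_j x_j
            x[k] = soft(A[:, k] @ ybar / sq[k], mu / sq[k])
            r = ybar - A[:, k] * x[k]
    return x

# Toy usage on a small overdetermined problem.
rng = np.random.default_rng(4)
A = rng.standard_normal((40, 15))
y = rng.standard_normal(40)
mu = 0.5
x = bcd_lasso(A, y, mu)
```

Maintaining the residual r avoids recomputing ȳ from scratch at every coordinate, which is what makes coordinate descent competitive with MM here.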
Applications: Iterative Water-filling for MIMO MAC Sum-Capacity Maximization

MIMO channel capacity maximization. MIMO received signal model:

  y(t) = Hx(t) + n(t),

where x(t) ∈ C^N is the Tx signal, H ∈ C^{N×N} is the MIMO channel matrix, and n(t) ∈ C^N is standard additive Gaussian noise, i.e., n(t) ~ CN(0, I).

Figure 4: MIMO system model (Tx to Rx).
MIMO channel capacity:

  C(Q) = log det( I + HQH^H ),

where Q = E{x(t)x(t)^H} is the covariance of the Tx signal. MIMO channel capacity maximization:

  max_{Q ⪰ 0} log det( I + HQH^H )  s.t. Tr(Q) ≤ P,

where P > 0 is the transmit power budget. The optimal Q is given by the well-known water-filling solution, i.e.,

  Q* = V Diag(p*) V^H,

where H = U Diag(σ_1, ..., σ_N) V^H is the SVD of H, and p* = [p_1*, ..., p_N*] is the power allocation with p_i* = max(0, μ − 1/σ_i^2) and μ ≥ 0 being the water level such that Σ_i p_i* = P.
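Finding the water level μ is a 1-D root-finding problem, since Σ_i max(0, μ − 1/σ_i^2) is nondecreasing in μ; a bisection sketch (NumPy and the bracket/iteration choices are mine, not from the slides):

```python
import numpy as np

def waterfill(sigma2, P, iters=100):
    # Bisection on the water level mu so that
    # p_i = max(0, mu - 1/sigma_i^2) satisfies sum_i p_i = P.
    inv = 1.0 / sigma2
    lo, hi = 0.0, inv.min() + P      # mu* lies in this bracket
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        if np.maximum(0.0, mu - inv).sum() > P:
            hi = mu
        else:
            lo = mu
    return np.maximum(0.0, 0.5 * (lo + hi) - inv)

# Toy usage: four eigen-channels sigma_i^2, total power P = 3.
p = waterfill(np.array([2.0, 1.0, 0.5, 0.1]), 3.0)
```

Strong channels (large σ_i^2) sit below the water level and get most of the power; sufficiently weak ones receive none.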
MIMO Multiple-Access Channel (MAC) Sum-Capacity Maximization

Multiple transmitters simultaneously communicate with one receiver.

Figure 5: MIMO multiple-access channel (MAC), with inputs x_1(t), ..., x_K(t) through channels H_1, ..., H_K and noise n(t).

Received signal model:  y(t) = Σ_{k=1}^K H_k x_k(t) + n(t).

MAC sum capacity:  C_MAC({Q_k}_{k=1}^K) = log det( Σ_{k=1}^K H_k Q_k H_k^H + I ).
MAC sum-capacity maximization:

  max_{{Q_k}_{k=1}^K} log det( Σ_{k=1}^K H_k Q_k H_k^H + I )  s.t. Tr(Q_k) ≤ P_k, Q_k ⪰ 0, k = 1, ..., K.   (14)

Problem (14) is convex w.r.t. {Q_k}, but it has no simple closed-form solution. Alternatively, we can apply BCD to (14) and cyclically update Q_k while fixing Q_j for j ≠ k:

  (*)  max_{Q_k} log det( H_k Q_k H_k^H + Φ )  s.t. Tr(Q_k) ≤ P_k, Q_k ⪰ 0,

where Φ = Σ_{j≠k} H_j Q_j H_j^H + I. Subproblem (*) has a closed-form water-filling solution, just like the previous single-user MIMO case.
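The whole BCD loop, iterative water-filling, can be sketched as follows: whiten each subproblem by the effective noise Φ, then water-fill over the whitened channel's eigenvalues. This is a real-valued NumPy sketch; the dimensions, sweep count, and bisection helper are my choices, not from the slides:

```python
import numpy as np

def waterfill(lam, P, iters=100):
    # Single-user water-filling over eigen-channel gains lam.
    inv = 1.0 / np.maximum(lam, 1e-12)
    lo, hi = 0.0, inv.min() + P
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        if np.maximum(0.0, mu - inv).sum() > P:
            hi = mu
        else:
            lo = mu
    return np.maximum(0.0, 0.5 * (lo + hi) - inv)

def iwf_mac(H, P, sweeps=30):
    # BCD on (14): update Q_k by water-filling against the
    # effective noise Phi = I + sum_{j != k} H_j Q_j H_j^T.
    K, N = len(H), H[0].shape[0]
    Q = [np.zeros((Hk.shape[1], Hk.shape[1])) for Hk in H]
    for _ in range(sweeps):
        for k in range(K):
            Phi = np.eye(N) + sum(H[j] @ Q[j] @ H[j].T
                                  for j in range(K) if j != k)
            L = np.linalg.cholesky(Phi)
            G = np.linalg.solve(L, H[k])          # whitened channel
            lam, V = np.linalg.eigh(G.T @ G)
            Q[k] = V @ np.diag(waterfill(lam, P[k])) @ V.T
    return Q

# Toy usage: K = 3 users, N = 4 receive antennas.
rng = np.random.default_rng(5)
H = [rng.standard_normal((4, 3)) for _ in range(3)]
P = [1.0, 2.0, 1.5]
Q = iwf_mac(H, P)
```

The Cholesky whitening works because log det(H_k Q_k H_k^T + Φ) = log det Φ + log det(I + G Q_k G^T) with G = L^{-1} H_k, reducing (*) to the single-user case.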
Applications: Low-Rank Matrix Completion

In a previous lecture, we introduced the low-rank matrix completion problem, which has huge potential in sales recommendation. For example, we would like to predict how much someone is going to like a movie based on their movie preferences. The users-by-movies rating matrix M is only partially observed, e.g.,

      2  3  1  ?  ?  5  5
      1  ?  4  2  ?  ?  ?
  M = ?  3  1  ?  2  2  2
      ?  ?  ?  3  ?  1  5
      2  ?  4  ?  ?  5  3

where rows index users, columns index movies, and '?' marks an unobserved rating. M is assumed to be of low rank, as only a few factors affect users' preferences.

  min_{W ∈ R^{m×n}} rank(W)  s.t. W_ij = M_ij, (i, j) ∈ Ω
An alternative low-rank matrix completion formulation [Wen-Yin-Zhang]:

  (*)  min_{X,Y,Z} (1/2)||XY − Z||_F^2  s.t. Z_ij = M_ij, (i, j) ∈ Ω,

where X ∈ R^{M×L}, Y ∈ R^{L×N}, Z ∈ R^{M×N}, and L is an estimate of the minimum rank.

Advantage of adopting (*): when BCD is applied, each subproblem of (*) has a closed-form solution:

  X^{r+1} = Z^r Y^{rT} (Y^r Y^{rT})^†,
  Y^{r+1} = (X^{(r+1)T} X^{r+1})^† X^{(r+1)T} Z^r,
  [Z^{r+1}]_{ij} = [X^{r+1} Y^{r+1}]_{ij} for (i, j) ∉ Ω,  and  [Z^{r+1}]_{ij} = M_{ij} for (i, j) ∈ Ω.
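A minimal sketch of this three-block BCD (NumPy; the random initialization, iteration count, and toy data are my assumptions, not from the slides):

```python
import numpy as np

def mc_bcd(M, mask, L, iters=500):
    # BCD on (*): X, Y, Z are updated cyclically with the closed forms
    # X = Z Y^T (Y Y^T)^+, Y = (X^T X)^+ X^T Z,
    # Z = XY off the observed set and Z = M on it.
    rng = np.random.default_rng(0)
    Y = rng.standard_normal((L, M.shape[1]))
    Z = np.where(mask, M, 0.0)
    for _ in range(iters):
        X = Z @ Y.T @ np.linalg.pinv(Y @ Y.T)
        Y = np.linalg.pinv(X.T @ X) @ (X.T @ Z)
        Z = np.where(mask, M, X @ Y)     # clamp observed entries to M
    return X @ Y

# Toy usage: rank-2 ground truth with 60% of entries observed.
rng = np.random.default_rng(6)
M = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))
mask = rng.random((30, 30)) < 0.6
W = mc_bcd(M, mask, L=2)
```

Each update is an exact block minimizer, so the objective of (*) is nonincreasing across the loop.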
Applications: Maximizing a Convex Quadratic Function

Consider maximizing a convex quadratic:

  (*)  max_x (1/2)x^T Q x + c^T x  s.t. x ∈ X,

where X is a polyhedral set and Q ⪰ 0. (*) is equivalent to the following problem (1):

  (**)  max_{x_1, x_2} (1/2)x_1^T Q x_2 + (1/2)c^T x_1 + (1/2)c^T x_2  s.t. (x_1, x_2) ∈ X × X.

When fixing either x_1 or x_2, problem (**) is an LP, thereby efficiently solvable.

(1) The equivalence is in the following sense: if x* is an optimal solution of (*), then (x*, x*) is optimal for (**); conversely, if (x_1*, x_2*) is an optimal solution of (**), then both x_1* and x_2* are optimal for (*).
Applications: Nonnegative Matrix Factorization (NMF)

NMF is concerned with the following problem [Lee-Seung]:

  min_{U ∈ R^{m×k}, V ∈ R^{k×n}} ||M − UV||_F^2  s.t. U ≥ 0, V ≥ 0,   (15)

where M ≥ 0. Usually k ≪ min(m, n), or mk + nk ≪ mn, so NMF can be seen as a linear dimensionality-reduction technique for nonnegative data.
NMF Example: Image Processing

- U ≥ 0 constrains the basis elements to be nonnegative.
- V ≥ 0 imposes an additive reconstruction.
- The basis elements extract facial features such as eyes, noses and lips; a whole face is generated by additively combining these parts.

[Figure 1 of Lee-Seung: NMF learns sparse, parts-based representations of faces, whereas vector quantization (VQ) learns holistic representations.]
NMF Example: Text Mining

- Basis elements allow one to recover the different topics.
- Weights allow one to assign each text to its corresponding topics.
NMF Example: Hyperspectral Unmixing

- Basis elements U represent different materials.
- Weights V indicate which pixel contains which material.
Let us turn back to the NMF problem:

  min_{U ∈ R^{m×k}, V ∈ R^{k×n}} ||M − UV||_F^2  s.t. U ≥ 0, V ≥ 0.   (16)

- Without the ≥ 0 constraints, the optimal U and V can be obtained by the SVD.
- With the ≥ 0 constraints, problem (16) is generally NP-hard.
- When fixing U (resp. V), problem (16) is convex w.r.t. V (resp. U). For example, for a given U, the i-th column of V is updated by solving the following NLS problem:

  min_{V(:,i) ∈ R^k} ||M(:, i) − U V(:, i)||_2^2  s.t. V(:, i) ≥ 0.   (17)
BCD Algorithm for NMF:
1: initialize U = U^0, V = V^0 and r = 0
2: repeat
3:   solve the NLS problem V̄ ∈ arg min_{V ∈ R^{k×n}} ||M − U^r V||_F^2 s.t. V ≥ 0
4:   V^{r+1} = V̄
5:   solve the NLS problem Ū ∈ arg min_{U ∈ R^{m×k}} ||M − U V^{r+1}||_F^2 s.t. U ≥ 0
6:   U^{r+1} = Ū
7:   r = r + 1
8: until some convergence criterion is met
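Since each block update separates into per-column NLS problems of the form (17), one possible sketch uses SciPy's nnls solver column by column (SciPy, the uniform initialization, and the toy data are my assumptions, not from the slides):

```python
import numpy as np
from scipy.optimize import nnls

def nmf_bcd(M, k, iters=50):
    # Alternating NLS: each block update is a batch of NLS problems
    # of the form (17), solved here column-by-column with nnls.
    m, n = M.shape
    rng = np.random.default_rng(0)
    U = rng.uniform(0.1, 1.0, (m, k))
    V = np.zeros((k, n))
    for _ in range(iters):
        for i in range(n):                    # update V(:, i) given U
            V[:, i], _ = nnls(U, M[:, i])
        for i in range(m):                    # update rows of U given V
            U[i, :], _ = nnls(V.T, M[i, :])
    return U, V

# Toy usage: M has an exact nonnegative rank-3 factorization.
rng = np.random.default_rng(7)
M = rng.uniform(0, 1, (12, 3)) @ rng.uniform(0, 1, (3, 15))
U, V = nmf_bcd(M, k=3)
```

The row updates of U reuse the same solver via the transposed problem M^T ≈ V^T U^T.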
Outline
- Majorization Minimization (MM)
  - Convergence
  - Applications
- Block Coordinate Descent (BCD)
  - Applications
  - Convergence
- Summary
BCD Convergence

The idea of BCD is to divide and conquer. However, there is no free lunch: BCD may get stuck or converge to some point of no interest.

Figure 6: BCD for smooth/non-smooth minimization.
BCD Convergence

  min_x f(x)  s.t. x ∈ X = X_1 × X_2 × ... × X_m ⊆ R^n   (18)

A well-known BCD convergence result is due to Bertsekas:

Theorem 2 [Bertsekas]. Suppose that f is continuously differentiable over the convex closed set X. Furthermore, suppose that for each i,

  g_i(ξ) = f(x_1, ..., x_{i-1}, ξ, x_{i+1}, ..., x_m)

is strictly convex. Let {x^r} be the sequence generated by the BCD method. Then every limit point of {x^r} is a stationary point of problem (18).

If X is (convex) compact, i.e., closed and bounded, then strict convexity of g_i(ξ) can be relaxed to g_i having a unique optimal solution.
Application: iterative water-filling for MIMO MAC sum-capacity maximization:

  (*)  max_{{Q_k}_{k=1}^K} log det( Σ_{k=1}^K H_k Q_k H_k^H + I )  s.t. Tr(Q_k) ≤ P_k, Q_k ⪰ 0, for all k.

Iterative water-filling converges to a globally optimal solution of (*), because
- each BCD subproblem is strictly convex (assuming full column rank of H_k);
- each X_k = {Q_k : Tr(Q_k) ≤ P_k, Q_k ⪰ 0} is a convex closed set;
- (*) is a convex problem, so a stationary point is a globally optimal solution.
Generalization of Bertsekas' Convergence Result

Generalization 1: relax strict convexity to strict quasiconvexity (2) [Grippo-Sciandrone].

Theorem 3. Suppose that the function f is continuously differentiable and strictly quasiconvex with respect to x_i on X for each i = 1, ..., m − 2, and that the sequence {x^r} generated by the BCD method has limit points. Then every limit point is a stationary point of problem (18).

Application: low-rank matrix completion

  (*)  min_{X,Y,Z} (1/2)||XY − Z||_F^2  s.t. Z_ij = M_ij, (i, j) ∈ Ω.

Here m = 3 and (*) is strictly convex w.r.t. Z, so BCD converges to a stationary point.

(2) f is strictly quasiconvex w.r.t. x_i ∈ X_i on X if for every x ∈ X and y_i ∈ X_i with y_i ≠ x_i we have

  f(x_1, ..., t x_i + (1 − t) y_i, ..., x_m) < max{ f(x), f(x_1, ..., y_i, ..., x_m) }, for all t ∈ (0, 1).
Generalization 2: without solution uniqueness.

Theorem 4. Suppose that f is pseudoconvex (3) on X and that L^0_X := {x ∈ X : f(x) ≤ f(x^0)} is compact. Then the sequence generated by the BCD method has limit points, and every limit point is a global minimizer of f.

Application: iterative water-filling for MIMO MAC sum-capacity maximization

  max_{{Q_k}_{k=1}^K} log det( Σ_{k=1}^K H_k Q_k H_k^H + I )  s.t. Tr(Q_k) ≤ P_k, Q_k ⪰ 0, k = 1, ..., K.

- f is convex, thus pseudoconvex;
- {Q_k : Tr(Q_k) ≤ P_k, Q_k ⪰ 0} is compact;
- hence iterative water-filling converges to a globally optimal solution.

(3) f is pseudoconvex if for all x, y ∈ X such that ∇f(x)^T(y − x) ≥ 0, we have f(y) ≥ f(x). Notice that convex ⇒ pseudoconvex ⇒ quasiconvex.
Generalization 3: without solution uniqueness, pseudoconvexity and compactness.

Theorem 5. Suppose that f is continuously differentiable and that X is convex and closed. Moreover, suppose that there are only two blocks, i.e., m = 2. Then every limit point of the sequence generated by BCD is a stationary point of f.

Application: NMF

  min_{U ∈ R^{m×k}, V ∈ R^{k×n}} ||M − UV||_F^2  s.t. U ≥ 0, V ≥ 0.

Alternating NLS converges to a stationary point of the NMF problem, since
- the objective is continuously differentiable;
- the feasible set is convex and closed;
- m = 2.
Summary

- MM and BCD have great potential in handling nonconvex problems and in realizing fast/distributed implementations for large-scale convex problems.
- Many well-known algorithms can be interpreted as special cases of MM and BCD.
- Under some conditions, convergence to a stationary point is guaranteed for MM and BCD.
References

M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," submitted to SIAM Journal on Optimization, available online at http://arxiv.org/abs/1209.2385.

L. Grippo and M. Sciandrone, "On the convergence of the block nonlinear Gauss-Seidel method under convex constraints," Operations Research Letters, vol. 26, pp. 127-136, 2000.

E. J. Candes, M. B. Wakin, and S. P. Boyd, "Enhancing sparsity by reweighted l1 minimization," J. Fourier Anal. Appl., vol. 14, pp. 877-905, 2008.

M. Zibulevsky and M. Elad, "l1-l2 optimization in signal and image processing," IEEE Signal Process. Magazine, May 2010, pp. 76-88.

D. P. Bertsekas, Nonlinear Programming, Athena Scientific, 1st Ed., 1995.

W. Yu and J. M. Cioffi, "Sum capacity of a Gaussian vector broadcast channel," IEEE Trans. Inf. Theory, vol. 50, no. 1, pp. 145-152, Jan. 2004.
Z. Wen, W. Yin, and Y. Zhang, "Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm," Rice CAAM Tech. Report 10-07.

D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference, MIT Press, pp. 556-562, 2001.