
Majorization Minimization (MM) and Block Coordinate Descent (BCD)
Wing-Kin (Ken) Ma
Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong
ELEG5481, Lecture 15
Acknowledgment: Qiang Li for helping prepare the slides.

Outline
- Majorization Minimization (MM)
  - Convergence
  - Applications
- Block Coordinate Descent (BCD)
  - Applications
  - Convergence
- Summary

Majorization Minimization

Consider the following problem
    $\min_x\; f(x) \quad \text{s.t.}\; x \in \mathcal{X}$   (1)
where $\mathcal{X}$ is a closed convex set, and $f(\cdot)$ may be non-convex and/or nonsmooth.

Challenge: for a general $f(\cdot)$, problem (1) can be difficult to solve.

Majorization Minimization: iteratively generate $\{x^r\}$ as follows:
    $x^r \in \arg\min_x\; u(x, x^{r-1}) \quad \text{s.t.}\; x \in \mathcal{X}$   (2)
where $u(x, x^{r-1})$ is a surrogate function of $f(x)$, satisfying
1. $u(x, x^r) \ge f(x)$ for all $x^r, x \in \mathcal{X}$;
2. $u(x^r, x^r) = f(x^r)$.
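To make the recursion (2) concrete, here is a minimal generic MM loop in Python/NumPy (not from the slides); the objective `f` and the surrogate minimizer `solve_surrogate` are placeholders that each application below instantiates.

```python
import numpy as np

def mm(x0, f, solve_surrogate, max_iter=100, tol=1e-8):
    """Generic majorization-minimization loop.

    f               : objective f(x) to be minimized
    solve_surrogate : maps the previous iterate x^{r-1} to
                      argmin_{x in X} u(x, x^{r-1})
    """
    x = x0
    obj = [f(x)]
    for _ in range(max_iter):
        x = solve_surrogate(x)          # x^r = argmin_x u(x, x^{r-1})
        obj.append(f(x))                # nonincreasing by Property 1
        if obj[-2] - obj[-1] < tol:     # stop when progress stalls
            break
    return x, obj
```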

Figure 1: A pictorial illustration of the MM algorithm (the surrogates $u(x; x^0), u(x; x^1), \ldots$ touch $f$ at the iterates $x^0, x^1, \ldots$ and lie above $f$ elsewhere).

Property 1. $\{f(x^r)\}$ is nonincreasing, i.e., $f(x^r) \le f(x^{r-1})$, $r = 1, 2, \ldots$.

Proof. $f(x^r) \le u(x^r, x^{r-1}) \le u(x^{r-1}, x^{r-1}) = f(x^{r-1})$.

The nonincreasing property of $\{f(x^r)\}$ implies that $f(x^r) \to \bar{f}$ for some limit $\bar{f}$ (provided $f$ is bounded below). But how about the convergence of the iterates $\{x^r\}$?

Technical Preliminaries

- Limit point: $x$ is a limit point of $\{x^k\}$ if there exists a subsequence of $\{x^k\}$ that converges to $x$. Note that every bounded sequence in $\mathbb{R}^n$ has a limit point (or convergent subsequence).
- Directional derivative: let $f : \mathcal{D} \to \mathbb{R}$ be a function, where $\mathcal{D} \subseteq \mathbb{R}^m$ is a convex set. The directional derivative of $f$ at point $x$ in direction $d$ is defined by
    $f'(x; d) \triangleq \liminf_{\lambda \downarrow 0} \dfrac{f(x + \lambda d) - f(x)}{\lambda}.$
  If $f$ is differentiable, then $f'(x; d) = d^T \nabla f(x)$.
- Stationary point: $x \in \mathcal{D}$ is a stationary point of $f(\cdot)$ if
    $f'(x; d) \ge 0$ for all $d$ such that $x + d \in \mathcal{D}$.   (3)
  A stationary point may be a local min., a local max., or a saddle point. If $\mathcal{D} = \mathbb{R}^n$ and $f$ is differentiable, then (3) $\Leftrightarrow \nabla f(x) = 0$.

Convergence of MM

Assumption 1. $u(\cdot, \cdot)$ satisfies the following conditions:
    $u(y, y) = f(y)$, for all $y \in \mathcal{X}$,   (4a)
    $u(x, y) \ge f(x)$, for all $x, y \in \mathcal{X}$,   (4b)
    $u'(x, y; d)\big|_{x=y} = f'(y; d)$, for all $d$ with $y + d \in \mathcal{X}$,   (4c)
    $u(x, y)$ is continuous in $(x, y)$.   (4d)

Condition (4c) means that the first-order local behavior of $u(\cdot, x^{r-1})$ is the same as that of $f(\cdot)$.

Convergence of MM

Theorem 1 [Razaviyayn-Hong-Luo]. Assume that Assumption 1 is satisfied. Then every limit point of the iterates generated by the MM algorithm is a stationary point of problem (1).

Proof. From Property 1, we know that
    $f(x^{r+1}) \le u(x^{r+1}, x^r) \le u(x, x^r)$, for all $x \in \mathcal{X}$.
Now assume that there exists a subsequence $\{x^{r_j}\}$ of $\{x^r\}$ converging to a limit point $z$, i.e., $\lim_{j \to \infty} x^{r_j} = z$. Then
    $u(x^{r_{j+1}}, x^{r_{j+1}}) = f(x^{r_{j+1}}) \le f(x^{r_j + 1}) \le u(x^{r_j + 1}, x^{r_j}) \le u(x, x^{r_j})$, for all $x \in \mathcal{X}$.
Letting $j \to \infty$, we obtain
    $u(z, z) \le u(x, z)$, for all $x \in \mathcal{X}$,
which implies that $u'(x, z; d)\big|_{x=z} \ge 0$ for all $d$ with $z + d \in \mathcal{X}$. Combining the above inequality with (4c) (i.e., $u'(x, y; d)\big|_{x=y} = f'(y; d)$ for all $d$ with $y + d \in \mathcal{X}$), we have
    $f'(z; d) \ge 0$, for all $d$ with $z + d \in \mathcal{X}$,
i.e., $z$ is a stationary point of problem (1).

Applications: Nonnegative Least Squares

In many engineering applications, we encounter the following problem:
    (NLS) $\min_{x \ge 0}\; \|Ax - b\|_2^2$   (5)
where $b \in \mathbb{R}^m_+$, $b \ne 0$, and $A \in \mathbb{R}^{m \times n}_{++}$. It is an LS problem with nonnegativity constraints, so the conventional LS solution may not be feasible for (5).

A simple multiplicative updating algorithm:
    $x^r_\ell = c^r_\ell\, x^{r-1}_\ell$, $\ell = 1, \ldots, n$   (6)
where $x^r_\ell$ is the $\ell$th component of $x^r$, and
    $c^r_\ell = \dfrac{[A^T b]_\ell}{[A^T A x^{r-1}]_\ell}.$
Starting with $x^0 > 0$, all $x^r$ generated by (6) are nonnegative.
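A minimal NumPy sketch of the multiplicative update (6), assuming strictly positive data and a positive starting point $x^0$ as required above; the function name, stopping rule, and toy example are illustrative, not from the slides.

```python
import numpy as np

def nls_multiplicative(A, b, x0, max_iter=200, tol=1e-10):
    """Multiplicative update x^r_l = x^{r-1}_l * [A^T b]_l / [A^T A x^{r-1}]_l."""
    x = x0.copy()                      # requires x0 > 0 elementwise
    Atb = A.T @ b
    AtA = A.T @ A
    for _ in range(max_iter):
        x_new = x * Atb / (AtA @ x)    # stays nonnegative when x > 0, A > 0, b >= 0
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# toy example with positive A and b, as assumed in (5)
rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(20, 10))
b = rng.uniform(0.1, 1.0, size=20)
x = nls_multiplicative(A, b, x0=np.ones(10))
```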

Figure 2: $\|Ax^r - b\|_2$ vs. the number of iterations. Usually the multiplicative update converges within a few tens of iterations.

MM interpretation: let $f(x) \triangleq \|Ax - b\|_2^2$. The multiplicative update essentially solves the following problem:
    $\min_{x \ge 0}\; u(x, x^{r-1})$
where
    $u(x, x^{r-1}) \triangleq f(x^{r-1}) + (x - x^{r-1})^T \nabla f(x^{r-1}) + \tfrac{1}{2}(x - x^{r-1})^T \Phi(x^{r-1}) (x - x^{r-1})$,
    $\Phi(x^{r-1}) = \mathrm{Diag}\!\left( \dfrac{[A^T A x^{r-1}]_1}{x^{r-1}_1}, \ldots, \dfrac{[A^T A x^{r-1}]_n}{x^{r-1}_n} \right).$

Observations:
- $u(x, x^{r-1})$ is a quadratic approximation of $f(x)$ with $\Phi(x^{r-1}) \succeq A^T A$, which implies $u(x, x^{r-1}) \ge f(x)$ for all $x \in \mathbb{R}^n$ and $u(x^{r-1}, x^{r-1}) = f(x^{r-1})$.
- The multiplicative update converges to an optimal solution of NLS (by the MM convergence in Theorem 1 and the convexity of NLS).

Applications: Convex-Concave Procedure / DC Programming

Suppose that $f(x)$ has the following form:
    $f(x) = g(x) - h(x)$,
where $g(x)$ and $h(x)$ are convex and differentiable. Thus, $f(x)$ is in general nonconvex.

DC Programming: construct $u(\cdot, \cdot)$ as
    $u(x, x^r) = g(x) - \big( h(x^r) + \nabla_x h(x^r)^T (x - x^r) \big)$,
where the term in parentheses is the linearization of $h$ at $x^r$. By the first-order condition of the convex function $h(x)$, it is easy to show that
    $u(x, x^r) \ge f(x)$ for all $x \in \mathcal{X}$, and $u(x^r, x^r) = f(x^r)$.

Sparse Signal Recovery by DC Programming

    $\min_x\; \|x\|_0 \quad \text{s.t.}\; y = Ax$   (7)

Apart from the popular $\ell_1$ approximation, consider the following concave approximation:
    $\min_{x \in \mathbb{R}^n}\; \sum_{i=1}^n \log(1 + |x_i|/\epsilon) \quad \text{s.t.}\; y = Ax.$

Figure 3: $\log(1 + |x|/\epsilon)$ promotes more sparsity than $\ell_1$.

    $\min_{x \in \mathbb{R}^n}\; \sum_{i=1}^n \log(1 + |x_i|/\epsilon) \quad \text{s.t.}\; y = Ax,$
which can be equivalently written as
    $\min_{x, z \in \mathbb{R}^n}\; \sum_{i=1}^n \log(z_i + \epsilon) \quad \text{s.t.}\; y = Ax,\; |x_i| \le z_i,\; i = 1, \ldots, n.$   (8)

Problem (8) minimizes a concave objective, so it is a special case of DC programming (with $g(x) = 0$). Linearizing the concave function at $(x^r, z^r)$ yields
    $(x^{r+1}, z^{r+1}) = \arg\min_{x, z}\; \sum_{i=1}^n \dfrac{z_i}{z_i^r + \epsilon} \quad \text{s.t.}\; y = Ax,\; |x_i| \le z_i,\; i = 1, \ldots, n.$
We thus solve a sequence of reweighted $\ell_1$ problems (see the sketch below).
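One possible way to implement this reweighted $\ell_1$ iteration is to solve each weighted $\ell_1$ subproblem as an LP in the variables $(x, z)$. The sketch below uses `scipy.optimize.linprog` for that purpose; the solver choice, function name, and parameters are my own assumptions, not part of the slides.

```python
import numpy as np
from scipy.optimize import linprog

def reweighted_l1(A, y, eps=0.1, n_reweight=5):
    """Reweighted l1 minimization for y = Ax (MM applied to the log-sum penalty)."""
    m, n = A.shape
    w = np.ones(n)                                   # initial weights -> plain l1
    A_eq = np.hstack([A, np.zeros((m, n))])          # [A 0][x; z] = y
    A_ub = np.block([[ np.eye(n), -np.eye(n)],       #  x - z <= 0
                     [-np.eye(n), -np.eye(n)]])      # -x - z <= 0
    b_ub = np.zeros(2 * n)
    bounds = [(None, None)] * n + [(0, None)] * n    # x free, z >= 0
    x = np.zeros(n)
    for _ in range(n_reweight):
        c = np.concatenate([np.zeros(n), w])         # minimize sum_i w_i z_i
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                      bounds=bounds, method="highs")
        x, z = res.x[:n], res.x[n:]
        w = 1.0 / (z + eps)                          # new weights from the linearization
    return x
```

The first pass (with unit weights) is exactly the unweighted $\ell_1$ problem; subsequent passes down-weight the coordinates that are already large, which is the reweighting behavior described above.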

Figure (reproduced from Candès-Wakin-Boyd, J. Fourier Anal. Appl., vol. 14, pp. 877-905, 2008): sparse signal recovery through reweighted $\ell_1$ iterations. (a) Original length $n = 512$ signal $x_0$ with 130 spikes. (b) Scatter plot, coefficient-by-coefficient, of $x_0$ versus its reconstruction $x^{(0)}$ using unweighted $\ell_1$ minimization. (c) Reconstruction $x^{(1)}$ after the first reweighted iteration. (d) Reconstruction $x^{(2)}$ after the second reweighted iteration.

Applications: $\ell_2$-$\ell_p$ Optimization

Many problems involve solving the following problem (e.g., basis-pursuit denoising):
    $\min_x\; f(x) \triangleq \tfrac{1}{2}\|y - Ax\|_2^2 + \mu \|x\|_p$   (9)
where $p \ge 1$.

If $A = I$ or $A$ is unitary, the optimal $x$ is computed in closed form as
    $x^\star = A^T y - \mathrm{Proj}_{\mathcal{C}}(A^T y)$
where $\mathcal{C} \triangleq \{x : \|x\|_{p^\ast} \le \mu\}$, $p^\ast$ is the dual norm of $p$, and $\mathrm{Proj}_{\mathcal{C}}$ denotes the projection operator. In particular, for $p = 1$,
    $x^\star_i = \mathrm{soft}([A^T y]_i, \mu)$, $i = 1, \ldots, n$,
where $\mathrm{soft}(u, a) \triangleq \mathrm{sign}(u) \max\{|u| - a, 0\}$ denotes a soft-thresholding operation.

For a general $A$, there is no simple closed-form solution for (9).

MM for the $\ell_2$-$\ell_p$ Problem: consider a modified $\ell_2$-$\ell_p$ problem
    $\min_x\; u(x, x^r) \triangleq f(x) + \mathrm{dist}(x, x^r)$   (10)
where $\mathrm{dist}(x, x^r) \triangleq \tfrac{c}{2}\|x - x^r\|_2^2 - \tfrac{1}{2}\|Ax - Ax^r\|_2^2$ and $c > \lambda_{\max}(A^T A)$.

- $\mathrm{dist}(x, x^r) \ge 0$ for all $x$ $\Longrightarrow$ $u(x, x^r)$ majorizes $f(x)$.
- $u(x, x^r)$ can be re-expressed as
    $u(x, x^r) = \tfrac{c}{2}\|x - \bar{x}^r\|_2^2 + \mu\|x\|_p + \text{const.}$,
  where $\bar{x}^r = \tfrac{1}{c} A^T (y - Ax^r) + x^r$.
- The modified $\ell_2$-$\ell_p$ problem (10) has a simple closed-form solution (soft thresholding for $p = 1$).
- Repeatedly solving problem (10) leads to an optimal solution of the $\ell_2$-$\ell_p$ problem (by the MM convergence in Theorem 1).
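For $p = 1$, repeatedly solving (10) gives the familiar iterative soft-thresholding scheme. Below is a small NumPy sketch, assuming $c$ is set slightly above $\lambda_{\max}(A^T A)$; the function names and stopping rule are illustrative only.

```python
import numpy as np

def soft(u, a):
    """Soft-thresholding: sign(u) * max(|u| - a, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - a, 0.0)

def l2l1_mm(A, y, mu, max_iter=500, tol=1e-8):
    """MM (iterative soft thresholding) for min 0.5||y - Ax||^2 + mu*||x||_1."""
    c = 1.01 * np.linalg.norm(A, 2) ** 2       # c > lambda_max(A^T A)
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        x_bar = x + A.T @ (y - A @ x) / c      # \bar{x}^r in the surrogate
        x_new = soft(x_bar, mu / c)            # minimizer of u(x, x^r)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```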

Applications: Expectation Maximization (EM)

Consider the ML estimate of $\theta$, given the random observation $w$:
    $\hat{\theta}_{\mathrm{ML}} = \arg\min_\theta\; -\ln p(w \mid \theta).$
Suppose that there are some missing data or hidden variables $z$ in the model. Then the EM algorithm iteratively computes an ML estimate $\hat{\theta}$ as follows:
- E-step: $g(\theta, \theta^r) \triangleq \mathbb{E}_{z \mid w, \theta^r}\{\ln p(w, z \mid \theta)\}$
- M-step: $\theta^{r+1} = \arg\max_\theta\; g(\theta, \theta^r)$
Repeat the above two steps until convergence.

- The EM algorithm generates a nonincreasing sequence of $\{-\ln p(w \mid \theta^r)\}$.
- The EM algorithm can be interpreted by MM.

MM interpretation of the EM algorithm:
    $\ln p(w \mid \theta) = \ln \mathbb{E}_{z \mid \theta}\, p(w \mid z, \theta)$
    $= \ln \mathbb{E}_{z \mid \theta}\!\left[ \dfrac{p(z \mid w, \theta^r)\, p(w \mid z, \theta)}{p(z \mid w, \theta^r)} \right]$
    $= \ln \mathbb{E}_{z \mid w, \theta^r}\!\left[ \dfrac{p(z \mid \theta)\, p(w \mid z, \theta)}{p(z \mid w, \theta^r)} \right]$  (interchange the integrations)
    $\ge \mathbb{E}_{z \mid w, \theta^r} \ln\!\left[ \dfrac{p(z \mid \theta)\, p(w \mid z, \theta)}{p(z \mid w, \theta^r)} \right]$  (Jensen's inequality)
    $= \mathbb{E}_{z \mid w, \theta^r} \ln p(w, z \mid \theta) - \mathbb{E}_{z \mid w, \theta^r} \ln p(z \mid w, \theta^r) \;\triangleq\; -u(\theta, \theta^r)$   (11a)

- $u(\theta, \theta^r)$ majorizes $-\ln p(w \mid \theta)$, and $-\ln p(w \mid \theta^r) = u(\theta^r, \theta^r)$;
- the E-step essentially constructs $u(\theta, \theta^r)$;
- the M-step minimizes $u(\theta, \theta^r)$ (note that $\theta$ appears only in the first term of (11a)).
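As a small illustration (not from the slides), here is a compact EM iteration for a two-component 1-D Gaussian mixture, where $z$ is the hidden component label: the E-step forms the posterior $p(z \mid w, \theta^r)$ that defines the surrogate, and the M-step updates $\theta = (\pi, \mu, \sigma^2)$ in closed form. All names and the initialization are illustrative assumptions.

```python
import numpy as np

def em_gmm2(w, n_iter=100):
    """EM for a 2-component 1-D Gaussian mixture fitted to samples w."""
    pi = 0.5
    mu = np.array([w.min(), w.max()])
    var = np.array([w.var(), w.var()])
    for _ in range(n_iter):
        # E-step: responsibilities p(z = k | w_i, theta^r)
        lik = np.stack([
            np.exp(-(w - mu[k]) ** 2 / (2 * var[k])) / np.sqrt(2 * np.pi * var[k])
            for k in range(2)
        ], axis=1)                                   # shape (n, 2)
        resp = lik * np.array([pi, 1 - pi])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete log-likelihood g(theta, theta^r)
        nk = resp.sum(axis=0)
        pi = nk[0] / len(w)
        mu = (resp * w[:, None]).sum(axis=0) / nk
        var = (resp * (w[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```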

Outline
- Majorization Minimization (MM)
  - Convergence
  - Applications
- Block Coordinate Descent (BCD)
  - Applications
  - Convergence
- Summary

Block Coordinate Descent

Consider the following problem:
    $\min_x\; f(x) \quad \text{s.t.}\; x \in \mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \ldots \times \mathcal{X}_m \subseteq \mathbb{R}^n$   (12)
where each $\mathcal{X}_i \subseteq \mathbb{R}^{n_i}$ is closed, nonempty, and convex.

BCD Algorithm:
1: Find a feasible point $x^0 \in \mathcal{X}$ and set $r = 0$
2: repeat
3:   $r = r + 1$, $i = ((r - 1) \bmod m) + 1$
4:   Let $x_i^\star \in \arg\min_{x \in \mathcal{X}_i} f(x_1^{r-1}, \ldots, x_{i-1}^{r-1}, x, x_{i+1}^{r-1}, \ldots, x_m^{r-1})$
5:   Set $x_i^r = x_i^\star$ and $x_k^r = x_k^{r-1}$ for all $k \ne i$
6: until some convergence criterion is met

Merits of BCD:
1. Each subproblem is much easier to solve, or even has a closed-form solution;
2. the objective value is nonincreasing along the BCD updates;
3. it allows parallel or distributed implementations.

Applications: $\ell_2$-$\ell_1$ Optimization Problem

Let us revisit the $\ell_2$-$\ell_1$ problem:
    $\min_{x \in \mathbb{R}^n}\; f(x) \triangleq \tfrac{1}{2}\|y - Ax\|_2^2 + \mu\|x\|_1$   (13)

Apart from MM, BCD is another efficient approach to solving (13). Optimize $x_k$ while fixing $x_j = x_j^r$, $j \ne k$:
    $\min_{x_k}\; f_k(x_k) \triangleq \tfrac{1}{2}\|\bar{y} - a_k x_k\|_2^2 + \mu |x_k|$, where $\bar{y} \triangleq y - \sum_{j \ne k} a_j x_j^r$.
The optimal $x_k$ has a closed form:
    $x_k^\star = \mathrm{soft}\big( a_k^T \bar{y} / \|a_k\|^2,\; \mu / \|a_k\|^2 \big).$
Cyclically update $x_k$, $k = 1, \ldots, n$, until convergence.
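A direct NumPy transcription of this cyclic update (a sketch; the variable names, residual bookkeeping, and sweep count are my own choices):

```python
import numpy as np

def soft(u, a):
    return np.sign(u) * np.maximum(np.abs(u) - a, 0.0)

def lasso_bcd(A, y, mu, n_sweeps=100):
    """Cyclic coordinate descent for min 0.5||y - Ax||^2 + mu*||x||_1."""
    m, n = A.shape
    x = np.zeros(n)
    r = y - A @ x                              # running residual y - Ax
    col_sq = (A ** 2).sum(axis=0)              # ||a_k||^2 for each column
    for _ in range(n_sweeps):
        for k in range(n):
            y_bar = r + A[:, k] * x[k]         # y - sum_{j != k} a_j x_j
            x_new = soft(A[:, k] @ y_bar / col_sq[k], mu / col_sq[k])
            r += A[:, k] * (x[k] - x_new)      # keep the residual consistent
            x[k] = x_new
    return x
```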

Applications: Iterative Water-Filling for MIMO MAC Sum-Capacity Maximization

MIMO channel capacity maximization. MIMO received signal model:
    $y(t) = H x(t) + n(t)$
where $x(t) \in \mathbb{C}^N$ is the Tx signal, $H \in \mathbb{C}^{N \times N}$ is the MIMO channel matrix, and $n(t) \in \mathbb{C}^N$ is standard additive Gaussian noise, i.e., $n(t) \sim \mathcal{CN}(0, I)$.

Figure 4: MIMO system model (Tx to Rx).

MIMO channel capacity:
    $C(Q) = \log\det\big( I + H Q H^H \big)$
where $Q = \mathbb{E}\{x(t) x(t)^H\}$ is the covariance of the Tx signal.

MIMO channel capacity maximization:
    $\max_{Q \succeq 0}\; \log\det\big( I + H Q H^H \big) \quad \text{s.t.}\; \mathrm{Tr}(Q) \le P$
where $P > 0$ is the transmit power budget.

The optimal $Q$ is given by the well-known water-filling solution, i.e.,
    $Q^\star = V \mathrm{Diag}(p^\star) V^H$
where $H = U \mathrm{Diag}(\sigma_1, \ldots, \sigma_N) V^H$ is the SVD of $H$, and $p^\star = [p_1^\star, \ldots, p_N^\star]$ is the power allocation with $p_i^\star = \max(0, \mu - 1/\sigma_i^2)$, where $\mu \ge 0$ is the water level such that $\sum_i p_i^\star = P$.
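A small NumPy sketch of this water-filling solution, using bisection on the water level $\mu$ (the implementation details and names are mine, not from the slides; it assumes all singular values are strictly positive):

```python
import numpy as np

def waterfill(sigma, P, tol=1e-10):
    """Power allocation p_i = max(0, mu - 1/sigma_i^2) with sum_i p_i = P."""
    inv = 1.0 / sigma ** 2                      # assumes sigma_i > 0
    lo, hi = 0.0, P + inv.max()                 # bracket for the water level mu
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(mu - inv, 0.0).sum() > P:
            hi = mu
        else:
            lo = mu
    return np.maximum(0.5 * (lo + hi) - inv, 0.0)

def mimo_capacity_opt(H, P):
    """Optimal Tx covariance Q = V diag(p) V^H for max log det(I + H Q H^H)."""
    U, s, Vh = np.linalg.svd(H)
    p = waterfill(s, P)
    V = Vh.conj().T
    return V @ np.diag(p) @ V.conj().T
```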

MIMO Multiple-Access Channel (MAC) Sum-Capacity Maximization

Multiple transmitters simultaneously communicate with one receiver.

Figure 5: MIMO multiple-access channel (MAC), with Tx signals $x_1(t), \ldots, x_K(t)$, channels $H_1, \ldots, H_K$, noise $n(t)$, and received signal $y(t)$.

Received signal model:
    $y(t) = \sum_{k=1}^K H_k x_k(t) + n(t)$

MAC sum capacity:
    $C_{\mathrm{MAC}}\big(\{Q_k\}_{k=1}^K\big) = \log\det\big( \textstyle\sum_{k=1}^K H_k Q_k H_k^H + I \big)$

MAC sum-capacity maximization:
    $\max_{\{Q_k\}_{k=1}^K}\; \log\det\big( \textstyle\sum_{k=1}^K H_k Q_k H_k^H + I \big) \quad \text{s.t.}\; \mathrm{Tr}(Q_k) \le P_k,\; Q_k \succeq 0,\; k = 1, \ldots, K$   (14)

Problem (14) is convex w.r.t. $\{Q_k\}$, but it has no simple closed-form solution. Alternatively, we can apply BCD to (14) and cyclically update $Q_k$ while fixing $Q_j$, $j \ne k$:
    $(\star)\quad \max_{Q_k}\; \log\det\big( H_k Q_k H_k^H + \Phi \big) \quad \text{s.t.}\; \mathrm{Tr}(Q_k) \le P_k,\; Q_k \succeq 0,$
where $\Phi = \sum_{j \ne k} H_j Q_j H_j^H + I$. Problem $(\star)$ has a closed-form water-filling solution, just like the previous single-user MIMO case.

Applications: Low-Rank Matrix Completion

In a previous lecture, we introduced the low-rank matrix completion problem, which has huge potential in sales recommendation. For example, we would like to predict how much someone is going to like a movie based on his or her movie preferences:

                movies
    M = [ 2  3  1  ?  ?  5 ]
        [ 5  1  ?  4  2  ? ]
        [ ?  ?  ?  3  1  ? ]   users
        [ 2  2  2  ?  ?  ? ]
        [ 3  ?  1  5  2  ? ]
        [ ?  4  ?  ?  5  3 ]

$M$ is assumed to be of low rank, as only a few factors affect users' preferences.
    $\min_{W \in \mathbb{R}^{m \times n}}\; \mathrm{rank}(W) \quad \text{s.t.}\; W_{ij} = M_{ij},\; \forall (i, j) \in \Omega$

An alternative low-rank matrix completion formulation [Wen-Yin-Zhang]:
    $(\star)\quad \min_{X, Y, Z}\; \tfrac{1}{2}\|XY - Z\|_F^2 \quad \text{s.t.}\; Z_{ij} = M_{ij},\; \forall (i, j) \in \Omega$
where $X \in \mathbb{R}^{M \times L}$, $Y \in \mathbb{R}^{L \times N}$, $Z \in \mathbb{R}^{M \times N}$, and $L$ is an estimate of the minimum rank.

Advantage of adopting $(\star)$: when BCD is applied, each subproblem of $(\star)$ has a closed-form solution:
    $X^{r+1} = Z^r (Y^r)^T \big( Y^r (Y^r)^T \big)^\dagger,$
    $Y^{r+1} = \big( (X^{r+1})^T X^{r+1} \big)^\dagger (X^{r+1})^T Z^r,$
    $[Z^{r+1}]_{ij} = [X^{r+1} Y^{r+1}]_{ij}$ for $(i, j) \notin \Omega$, and $[Z^{r+1}]_{ij} = M_{ij}$ for $(i, j) \in \Omega.$
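A NumPy sketch of these closed-form BCD updates (using pseudo-inverses for the least-squares steps, as in the formulas above); the initialization, iteration count, and names are illustrative assumptions.

```python
import numpy as np

def lrmc_bcd(M, mask, L, n_iter=200, seed=0):
    """BCD for min 0.5||XY - Z||_F^2 s.t. Z_ij = M_ij on observed entries (mask)."""
    m, n = M.shape
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, L))
    Y = rng.standard_normal((L, n))
    Z = np.where(mask, M, 0.0)                    # start Z from the observed data
    for _ in range(n_iter):
        X = Z @ Y.T @ np.linalg.pinv(Y @ Y.T)     # X-update (least squares)
        Y = np.linalg.pinv(X.T @ X) @ (X.T @ Z)   # Y-update (least squares)
        Z = np.where(mask, M, X @ Y)              # Z-update (keep observed entries)
    return X, Y, Z
```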

Applications: Maximizing a Convex Quadratic Function

Consider maximizing a convex quadratic function:
    $(\star)\quad \max_x\; \tfrac{1}{2} x^T Q x + c^T x \quad \text{s.t.}\; x \in \mathcal{X}$
where $\mathcal{X}$ is a polyhedral set and $Q \succeq 0$.

$(\star)$ is equivalent to the following problem:¹
    $(\star\star)\quad \max_{x_1, x_2}\; \tfrac{1}{2} x_1^T Q x_2 + \tfrac{1}{2} c^T x_1 + \tfrac{1}{2} c^T x_2 \quad \text{s.t.}\; (x_1, x_2) \in \mathcal{X} \times \mathcal{X}$
When fixing either $x_1$ or $x_2$, problem $(\star\star)$ is an LP, thereby efficiently solvable.

¹ The equivalence is in the following sense: if $x^\star$ is an optimal solution of $(\star)$, then $(x^\star, x^\star)$ is optimal for $(\star\star)$; conversely, if $(x_1^\star, x_2^\star)$ is an optimal solution of $(\star\star)$, then both $x_1^\star$ and $x_2^\star$ are optimal for $(\star)$.

Applications: Nonnegative Matrix Factorization (NMF)

NMF is concerned with the following problem [Lee-Seung]:
    $\min_{U \in \mathbb{R}^{m \times k},\, V \in \mathbb{R}^{k \times n}}\; \|M - UV\|_F^2 \quad \text{s.t.}\; U \ge 0,\; V \ge 0$   (15)
where $M \ge 0$.

Usually $k \ll \min(m, n)$, or $mk + nk \ll mn$, so NMF can be seen as a linear dimensionality reduction technique for nonnegative data.

Examples: Image Processing

- $U \ge 0$ constrains the basis elements to be nonnegative;
- $V \ge 0$ imposes an additive reconstruction;
- the basis elements extract facial features such as eyes, nose and lips.

As can be seen from the figure, the NMF basis and encodings contain a large fraction of vanishing coefficients, so both the basis images and the image encodings are sparse. The basis images are sparse because they are non-global and contain several versions of mouths, noses and other facial parts, where the various versions are in different locations or forms; the variability of a whole face is generated by combining these different parts.

Figure (from Lee-Seung, Nature 1999): NMF learns a parts-based representation of faces, whereas VQ learns a holistic representation of the original images.

Examples: Text Mining

- Basis elements allow us to recover the different topics;
- weights allow us to assign each text to its corresponding topics.

Examples: Hyperspectral Unmixing

- Basis elements $U$ represent different materials;
- weights $V$ allow us to know which pixel contains which material.

Let us turn back to the NMF problem:
    $\min_{U \in \mathbb{R}^{m \times k},\, V \in \mathbb{R}^{k \times n}}\; \|M - UV\|_F^2 \quad \text{s.t.}\; U \ge 0,\; V \ge 0$   (16)

- Without the nonnegativity constraints, the optimal $U$ and $V$ can be obtained by the SVD.
- With the nonnegativity constraints, problem (16) is generally NP-hard.
- When fixing $U$ (resp. $V$), problem (16) is convex w.r.t. $V$ (resp. $U$). For example, for a given $U$, the $i$th column of $V$ is updated by solving the following NLS problem:
    $\min_{V(:,i) \in \mathbb{R}^k}\; \|M(:, i) - U V(:, i)\|_2^2 \quad \text{s.t.}\; V(:, i) \ge 0.$   (17)

BCD Algorithm for NMF:
1: Initialize $U = U^0$, $V = V^0$, and $r = 0$;
2: repeat
3:   solve the NLS problem $V^\star \in \arg\min_{V \in \mathbb{R}^{k \times n}} \|M - U^r V\|_F^2$ s.t. $V \ge 0$;
4:   $V^{r+1} = V^\star$;
5:   solve the NLS problem $U^\star \in \arg\min_{U \in \mathbb{R}^{m \times k}} \|M - U V^{r+1}\|_F^2$ s.t. $U \ge 0$;
6:   $U^{r+1} = U^\star$;
7:   $r = r + 1$;
8: until some convergence criterion is met
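A sketch of this alternating NLS scheme using `scipy.optimize.nnls`, solving problem (17) column by column (simple but not especially fast; the helper name, initialization, and iteration count are my own choices):

```python
import numpy as np
from scipy.optimize import nnls

def nmf_anls(M, k, n_iter=50, seed=0):
    """Alternating nonnegative least squares for min ||M - UV||_F^2, U, V >= 0."""
    m, n = M.shape
    rng = np.random.default_rng(seed)
    U = rng.uniform(size=(m, k))
    V = rng.uniform(size=(k, n))
    for _ in range(n_iter):
        # V-update: one NLS problem (17) per column of M
        for i in range(n):
            V[:, i], _ = nnls(U, M[:, i])
        # U-update: by symmetry, one NLS problem per row of M (since M^T ~ V^T U^T)
        for j in range(m):
            U[j, :], _ = nnls(V.T, M[j, :])
    return U, V
```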

Outline
- Majorization Minimization (MM)
  - Convergence
  - Applications
- Block Coordinate Descent (BCD)
  - Applications
  - Convergence
- Summary

BCD Convergence

The idea of BCD is to divide and conquer. However, there is no free lunch: BCD may get stuck or converge to some point of no interest.

Figure 6: BCD for smooth/non-smooth minimization.

BCD Convergence

    $\min_x\; f(x) \quad \text{s.t.}\; x \in \mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \ldots \times \mathcal{X}_m \subseteq \mathbb{R}^n$   (18)

A well-known BCD convergence result, due to Bertsekas:

Theorem 2 ([Bertsekas]). Suppose that $f$ is continuously differentiable over the convex closed set $\mathcal{X}$. Furthermore, suppose that for each $i$,
    $g_i(\xi) \triangleq f(x_1, x_2, \ldots, x_{i-1}, \xi, x_{i+1}, \ldots, x_m)$
is strictly convex. Let $\{x^r\}$ be the sequence generated by the BCD method. Then every limit point of $\{x^r\}$ is a stationary point of problem (18).

If $\mathcal{X}$ is (convex) compact, i.e., closed and bounded, then the strict convexity of $g_i(\xi)$ can be relaxed to $g_i$ having a unique optimal solution.

Application: iterative water-filling for MIMO MAC sum-capacity maximization:
    $(\star)\quad \max_{\{Q_k\}_{k=1}^K}\; \log\det\big( \textstyle\sum_{k=1}^K H_k Q_k H_k^H + I \big) \quad \text{s.t.}\; \mathrm{Tr}(Q_k) \le P_k,\; Q_k \succeq 0,\; \forall k$

Iterative water-filling converges to a globally optimal solution of $(\star)$, because
- each BCD subproblem is strictly convex (assuming that each $H_k$ has full column rank);
- each $\mathcal{X}_k$ is a convex closed subset;
- $(\star)$ is a convex problem, so a stationary point is a globally optimal solution.

Generalization of Bertsekas' Convergence Result

Generalization 1: relax strict convexity to strict quasiconvexity² [Grippo-Sciandrone]

Theorem 3. Suppose that the function $f$ is continuously differentiable and strictly quasiconvex with respect to $x_i$ on $\mathcal{X}$ for each $i = 1, \ldots, m - 2$, and that the sequence $\{x^r\}$ generated by the BCD method has limit points. Then every limit point is a stationary point of problem (18).

Application: low-rank matrix completion
    $(\star)\quad \min_{X, Y, Z}\; \tfrac{1}{2}\|XY - Z\|_F^2 \quad \text{s.t.}\; Z_{ij} = M_{ij},\; \forall (i, j) \in \Omega$
Here $m = 3$ and $(\star)$ is strictly convex w.r.t. $Z$ $\Longrightarrow$ BCD converges to a stationary point.

² $f$ is strictly quasiconvex w.r.t. $x_i \in \mathcal{X}_i$ on $\mathcal{X}$ if for every $x \in \mathcal{X}$ and $y_i \in \mathcal{X}_i$ with $y_i \ne x_i$ we have
    $f(x_1, \ldots, t x_i + (1 - t) y_i, \ldots, x_m) < \max\{ f(x),\; f(x_1, \ldots, y_i, \ldots, x_m) \}$, for all $t \in (0, 1)$.

Generalization 2: without solution uniqueness

Theorem 4. Suppose that $f$ is pseudoconvex³ on $\mathcal{X}$ and that $\mathcal{L}^0_{\mathcal{X}} := \{x \in \mathcal{X} : f(x) \le f(x^0)\}$ is compact. Then the sequence generated by the BCD method has limit points, and every limit point is a global minimizer of $f$.

Application: iterative water-filling for MIMO MAC sum-capacity maximization
    $\max_{\{Q_k\}_{k=1}^K}\; \log\det\big( \textstyle\sum_{k=1}^K H_k Q_k H_k^H + I \big) \quad \text{s.t.}\; \mathrm{Tr}(Q_k) \le P_k,\; Q_k \succeq 0,\; k = 1, \ldots, K$
- $f$ is convex, thus pseudoconvex;
- $\{Q_k : \mathrm{Tr}(Q_k) \le P_k,\; Q_k \succeq 0\}$ is compact;
- iterative water-filling converges to a globally optimal solution.

³ $f$ is pseudoconvex if for all $x, y \in \mathcal{X}$ such that $\nabla f(x)^T (y - x) \ge 0$, we have $f(y) \ge f(x)$. Notice that convex $\Rightarrow$ pseudoconvex $\Rightarrow$ quasiconvex.

Generalization 3: without solution uniqueness, pseudoconvexity, and compactness

Theorem 5. Suppose that $f$ is continuously differentiable, and that $\mathcal{X}$ is convex and closed. Moreover, suppose that there are only two blocks, i.e., $m = 2$. Then every limit point of the sequence generated by BCD is a stationary point of problem (18).

Application: NMF
    $\min_{U \in \mathbb{R}^{m \times k},\, V \in \mathbb{R}^{k \times n}}\; \|M - UV\|_F^2 \quad \text{s.t.}\; U \ge 0,\; V \ge 0$
Alternating NLS converges to a stationary point of the NMF problem, since
- the objective is continuously differentiable;
- the feasible set is convex and closed;
- $m = 2$.

Summary

- MM and BCD have great potential in handling nonconvex problems and in realizing fast/distributed implementations for large-scale convex problems;
- many well-known algorithms can be interpreted as special cases of MM and BCD;
- under some conditions, convergence to a stationary point can be guaranteed for MM and BCD.

References

- M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," submitted to SIAM Journal on Optimization, available online at http://arxiv.org/abs/1209.2385.
- L. Grippo and M. Sciandrone, "On the convergence of the block nonlinear Gauss-Seidel method under convex constraints," Operations Research Letters, vol. 26, pp. 127-136, 2000.
- E. J. Candes, M. B. Wakin, and S. P. Boyd, "Enhancing sparsity by reweighted l1 minimization," J. Fourier Anal. Appl., vol. 14, pp. 877-905, 2008.
- M. Zibulevsky and M. Elad, "L1-L2 optimization in signal and image processing," IEEE Signal Processing Magazine, pp. 76-88, May 2010.
- D. P. Bertsekas, Nonlinear Programming, Athena Scientific, 1st ed., 1995.
- W. Yu and J. M. Cioffi, "Sum capacity of a Gaussian vector broadcast channel," IEEE Trans. Inf. Theory, vol. 50, no. 1, pp. 145-152, Jan. 2004.
- Z. Wen, W. Yin, and Y. Zhang, "Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm," Rice CAAM Tech. Report TR10-07.
- D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems 13, MIT Press, pp. 556-562, 2001.