2886 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 61, NO. 5, MAY 2015


Simultaneously Structured Models With Application to Sparse and Low-Rank Matrices

Samet Oymak, Student Member, IEEE, Amin Jalali, Student Member, IEEE, Maryam Fazel, Yonina C. Eldar, Fellow, IEEE, and Babak Hassibi, Member, IEEE

Abstract: Recovering structured models (e.g., sparse or group-sparse vectors, low-rank matrices) given a few linear observations has been well studied recently. In various applications in signal processing and machine learning, the model of interest is structured in several ways, for example, a matrix that is simultaneously sparse and low rank. Often norms that promote the individual structures are known, and allow for recovery using an order-wise optimal number of measurements (e.g., the ℓ₁ norm for sparsity, the nuclear norm for matrix rank). Hence, it is reasonable to minimize a combination of such norms. We show that, surprisingly, using multiobjective optimization with these norms can do no better, order-wise, than exploiting only one of the structures, thus revealing a fundamental limitation in sample complexity. This result suggests that to fully exploit the multiple structures, we need an entirely new convex relaxation. Further, specializing our results to the case of sparse and low-rank matrices, we show that a nonconvex formulation recovers the model from very few measurements (on the order of the degrees of freedom), whereas the convex problem combining the ℓ₁ and nuclear norms requires many more measurements, illustrating a gap between the performance of the convex and nonconvex recovery problems. Our framework applies to arbitrary structure-inducing norms as well as to a wide range of measurement ensembles. This allows us to give sample complexity bounds for problems such as sparse phase retrieval and low-rank tensor completion.

Index Terms: Compressed sensing, convex relaxation, regularization, sample complexity.

Manuscript received February 11, 2013; revised July 1, 2014; accepted October 9, 2014. Date of publication February 9, 2015; date of current version April 17, 2015. A. Jalali and M. Fazel were supported by the National Science Foundation under Grant ECCS and Grant CCF. Y. C. Eldar was supported in part by the Israel Science Foundation under Grant 170/10, in part by SRC, and in part by the Intel Collaborative Research Institute for Computational Intelligence. B. Hassibi was supported in part by the National Science Foundation under Grant CNS-09348 and Grants CCF, in part by the Office of Naval Research through the MURI under Grant N, in part by Qualcomm Inc., San Diego, CA, USA, and in part by King Abdulaziz University, Jeddah, Saudi Arabia.

S. Oymak is with the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA 94720 USA (e-mail: oymak@eecs.berkeley.edu). A. Jalali and M. Fazel are with the Department of Electrical Engineering, University of Washington, Seattle, WA USA (e-mail: amjalali@uw.edu; mfazel@uw.edu). Y. C. Eldar is with the Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel (e-mail: yonina@ee.technion.ac.il). B. Hassibi is with the Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125 USA (e-mail: hassibi@systems.caltech.edu). Communicated by V. Saligrama, Associate Editor for Signal Processing.

I. INTRODUCTION

RECOVERY of a structured model (signal) given a small number of linear observations has been the focus of many studies recently. Examples include recovering sparse or group-sparse vectors (which gave rise to the area of compressed sensing) [1]-[4], low-rank matrices [5], [6], and the sum of sparse and low-rank matrices [7], [8], among others.
More generally, the recovery of a signal that can be expressed as the sum of a few atoms out of an appropriate atomic set has been studied in [9]. Canonical questions in this area include: How many linear measurements are enough to recover the model by any means? How many measurements are enough for a tractable approach? In the statistics literature, these questions are often posed in terms of error rates for estimators minimizing the sum of a quadratic loss function and a regularizer that reflects the desired structure [10].

There are many applications where the model of interest is known to have several structures at the same time (Section I-B). We then seek a signal that lies in the intersection of several sets defining the individual structures (in a sense that we will later make precise). The most common convex regularizer used to promote all structures together is a linear combination of well-known regularizers for each structure. However, there is currently no general analysis and understanding of how well such regularization performs in terms of the number of observations required for successful recovery of the desired model. This paper addresses this ubiquitous yet unexplored problem, i.e., the recovery of simultaneously structured models.

An example of a simultaneously structured model is a matrix that is simultaneously sparse and low rank. One would like to come up with algorithms that exploit both types of structures to minimize the number of measurements required for recovery. An n×n matrix with rank r ≪ n can be described by O(rn) parameters, and can be recovered using O(rn) generic measurements via nuclear norm minimization [5], [11]. On the other hand, a block-sparse matrix with a k×k nonzero block where k ≪ n can be described by k² parameters and can be recovered given O(k² log(n/k)) generic measurements using ℓ₁ minimization. However, a matrix that is both rank r and block-sparse can be described by O(rk) parameters. The question is whether we can exploit this joint structure to efficiently recover such a matrix with O(rk) measurements. In this paper we give a negative answer to this question in the following sense: if we use multi-objective optimization with the ℓ₁ and nuclear norms (used for sparse signals and low-rank matrices, respectively), then the number of measurements required is lower bounded by O(min{k², rn}). In other words, we need at least this number of observations for the desired signal to be recoverable by a combination of the ℓ₁ norm and the nuclear norm. This means we can do no better than an algorithm that exploits only one of the two structures.
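The combined convex program discussed here is easy to write down with an off-the-shelf solver. The following is a minimal sketch, not taken from the paper; the sizes, the weight lam, and the planted model are illustrative assumptions. It minimizes a weighted sum of the entrywise ℓ₁ norm and the nuclear norm subject to random Gaussian measurements:

```python
# Sketch (assumed sizes and weight) of the combined l1 + nuclear norm program.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, k, m = 20, 4, 150
a = np.zeros(n); a[:k] = 1.0
X0 = np.outer(a, a)                     # simultaneously sparse and rank one
As = rng.standard_normal((m, n, n))     # Gaussian measurement matrices A_i
b = np.einsum('mij,ij->m', As, X0)      # b_i = <A_i, X0>

lam = 1.0                               # trade-off between the two norms
X = cp.Variable((n, n))
objective = cp.Minimize(cp.sum(cp.abs(X)) + lam * cp.normNuc(X))
constraints = [cp.sum(cp.multiply(As[i], X)) == b[i] for i in range(m)]
cp.Problem(objective, constraints).solve()
print("recovery error:", np.linalg.norm(X.value - X0) / np.linalg.norm(X0))
```

Varying lam trades off the two penalties; the results below state that, order-wise, no choice of lam can beat the better of the two individual penalties.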

TABLE I: SUMMARY OF RESULTS IN RECOVERY OF STRUCTURED SIGNALS. THIS PAPER SHOWS A GAP BETWEEN THE PERFORMANCE OF CONVEX AND NONCONVEX RECOVERY PROGRAMS FOR SIMULTANEOUSLY STRUCTURED MATRICES (LAST ROW).

We introduce a framework to express general simultaneously structured models, and as our main result, we prove that the same phenomenon happens for a general set of structures. We analyze a wide range of measurement ensembles, including the subsampled standard basis (i.e., matrix completion), Gaussian and subgaussian measurements, and quadratic measurements.

Table I summarizes known results on recovery of some common structured models, along with a result of this paper specialized to the problem of low-rank and sparse matrix recovery. The first column gives the number of parameters needed to describe the model (often referred to as its degrees of freedom), while the second and third columns show how many generic measurements are needed for successful recovery. In nonconvex recovery, we assume we are able to find the global minimum of a nonconvex problem. This is clearly intractable in general, and not a practical recovery method; we consider it as a benchmark for theoretical comparison with the (tractable) convex relaxation in order to determine how powerful the relaxation is.

The first and second rows are the results on k-sparse vectors in Rⁿ and rank-r matrices in R^{n×n}, respectively [11], [12]. The third row considers the recovery of low-rank plus sparse matrices. Consider a matrix X ∈ R^{n×n} that can be decomposed as X = X_L + X_S, where X_L is a rank-r matrix and X_S is a matrix with only k nonzero entries. The degrees of freedom of X are O(rn + k). Minimizing the combination of the ℓ₁ norm and the nuclear norm, i.e., f(X) = min_Y ‖Y‖⋆ + λ‖X − Y‖₁, subject to random Gaussian measurements on X, gives a convex approach for recovering X. It has been shown that, under reasonable incoherence assumptions, X can be recovered from O((rn + k) log² n) measurements, which is suboptimal only by a logarithmic factor [13].

Finally, the last row of Table I shows one of the results in this paper. Let X ∈ R^{n×n} be a rank-r matrix whose entries are zero outside a k₁×k₂ submatrix. The degrees of freedom of X are O((k₁ + k₂)r). We consider both convex and nonconvex programs for the recovery of this type of matrix. The nonconvex method involves minimizing the number of nonzero rows, the number of nonzero columns, and the rank of the matrix jointly, as discussed in Section III-B. As shown later, O((k₁ + k₂)r log n) measurements suffice for this program to successfully recover the original matrix. The convex method minimizes any convex combination of the individual structure-inducing norms, namely the nuclear norm and the ℓ₁,₂ norm of the matrix, which encourage low-rank and column/row-sparse solutions, respectively. We show that with high probability this program cannot recover the original matrix with fewer than Ω(rn) measurements. In summary, while the nonconvex method is only slightly suboptimal, the convex method performs poorly as the number of measurements scales with n rather than k₁ + k₂.
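The decomposition penalty f(X) = min_Y ‖Y‖⋆ + λ‖X − Y‖₁ from the third row of Table I can also be sketched directly. The following is a minimal illustration under assumed sizes and weight; it is not the authors' experiment, just the convex program the row refers to:

```python
# Sketch of low-rank-plus-sparse recovery from Gaussian measurements.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, k, m, lam = 15, 10, 120, 0.2
L0 = np.outer(rng.standard_normal(n), rng.standard_normal(n))   # rank-1 part
S0 = np.zeros((n, n)); S0.flat[rng.choice(n * n, k, replace=False)] = 5.0
X0 = L0 + S0

As = rng.standard_normal((m, n, n))
b = np.einsum('mij,ij->m', As, X0)

Y = cp.Variable((n, n))                 # low-rank component
Z = cp.Variable((n, n))                 # sparse component
constraints = [cp.sum(cp.multiply(As[i], Y + Z)) == b[i] for i in range(m)]
cp.Problem(cp.Minimize(cp.normNuc(Y) + lam * cp.sum(cp.abs(Z))),
           constraints).solve()
print("error:", np.linalg.norm(Y.value + Z.value - X0) / np.linalg.norm(X0))
```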
A. Contributions

This paper presents an analysis for the recovery of models with more than one structure, by combining penalties corresponding to each structure. The setting considered is very broad: any number of structures can be combined, any (convex) combination of the corresponding norms can be analyzed, and the results hold for a variety of popular measurement ensembles. The proposed framework includes special cases that are of interest in their own right, e.g., sparse and low-rank matrix recovery, and low-rank tensor completion [14], [15]. More specifically, our contributions can be summarized as follows.

1) Poor Performance of Convex Relaxations: We consider a model with several structures and associated structure-inducing norms. For recovery, we consider a multi-objective optimization problem to minimize the individual norms simultaneously. Given the convexity of the problem, we know that minimizing a weighted sum of the norms and varying the weights traces out all points of the Pareto-optimal front (Section II). We obtain a lower bound on the number of measurements for any convex function combining the individual norms. A sketch of our main result is as follows. Given a model x₀ with τ simultaneous structures, the number of measurements required for recovery with high probability using any linear combination of the individual norms satisfies the lower bound

m ≥ c · min_{i=1,…,τ} m_i,

where m_i is an intrinsic lower bound on the required number of measurements when minimizing the ith norm only. The term c depends on the measurement ensemble. For the norms of interest, m_i is proportional to the degrees of freedom of the ith structure. With min_i m_i as the bottleneck, this result indicates that the combination of norms performs no better than using only one of the norms, even though the target model has small degrees of freedom.

2) Different Measurement Ensembles: Our characterization of recovery failure is easy to interpret and deterministic in nature. We show that it can be used to obtain probabilistic failure results for various random measurement ensembles. In particular, our results hold for measurement matrices with i.i.d. subgaussian rows, quadratic measurements, and matrix completion type measurements.

3) Understanding the Effect of Weighting: We characterize the sample complexity of the multi-objective function as a function of the weights associated with the individual norms. Our upper and lower bounds reveal that the sample complexity of the multi-objective function is related to a certain convex combination of the sample complexities associated with the individual norms. We provide formulas for this combination as a function of the weights.

4) Incorporating General Cone Constraints: In addition, we can incorporate side information on x₀, expressed as convex cone constraints. This additional information helps in recovery; however, quantifying how much the cone constraints help is not trivial. Our analysis explicitly determines the role of the cone constraint: geometric properties of the cone, such as its Gaussian width, determine the constant factors in the bound on the number of measurements.

5) Illustrating a Gap for the Recovery of Sparse and Low-Rank Matrices: As a special case, we consider the recovery of simultaneously sparse and low-rank matrices and prove that there is a significant gap between the performance of convex and nonconvex recovery programs. This gap is surprising when one considers similar results on low-dimensional model recovery, as shown in Table I.

B. Applications

We survey several applications where simultaneous structures arise, as well as existing results specific to these applications.

1) Sparse Signal Recovery From Quadratic Measurements: Sparsity has long been exploited in signal processing, applied mathematics, statistics, and computer science for tasks such as compression, denoising, model selection, image processing, and more. Despite the great interest in exploiting sparsity in various applications, most of the work to date has focused on recovering sparse or low-rank data from linear measurements. Recently, the basic sparse recovery problem has been generalized to the case in which the measurements are given by nonlinear transforms of the unknown input [16]. A special case of this more general setting is quadratic compressed sensing [17], in which the goal is to recover a sparse vector x from quadratic measurements b_i = xᵀA_i x. This problem can be linearized by lifting, where we wish to recover a low-rank and sparse matrix X = xxᵀ subject to measurements b_i = ⟨A_i, X⟩.

Sparse recovery problems from quadratic measurements arise in a variety of problems in optics. One example is sub-wavelength optical imaging [17], [18], in which the goal is to recover a sparse image from its far-field measurements, where due to the laws of physics the relationship between the (clean) measurement and the unknown image is quadratic. In [17] the quadratic relationship is a result of using partially incoherent light. The quadratic behavior of the measurements in [18] arises from coherent diffractive imaging, in which the image is recovered from its intensity pattern. Under an appropriate experimental setup, this problem amounts to reconstruction of a sparse signal from the magnitude of its Fourier transform.
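The lifting step above is a one-line identity that is easy to verify numerically. The following check uses assumed sizes and a hypothetical sparse test vector:

```python
# Numeric check that lifting linearizes a quadratic measurement:
# b = x^T A x equals <A, X> with X = x x^T.
import numpy as np

rng = np.random.default_rng(2)
n = 8
x = np.zeros(n); x[[1, 4]] = [1.0, -2.0]    # a sparse test vector
A = rng.standard_normal((n, n))

b_quadratic = x @ A @ x                      # nonlinear in x
X = np.outer(x, x)                           # lifted variable: rank one and sparse
b_linear = np.sum(A * X)                     # <A, X>, linear in X

print(np.isclose(b_quadratic, b_linear))     # True
```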
A related and notable problem involving sparse and low-rank matrices is Sparse Principal Component Analysis (SPCA), mentioned in Section IX.

2) Sparse Phase Retrieval: Quadratic measurements appear in phase retrieval problems, in which a signal is to be recovered from the magnitude of its measurements b_i = |a_iᵀx|, where each measurement is a linear transform of the input x ∈ Rⁿ and the a_i's are arbitrary, possibly complex-valued measurement vectors. An important case is when a_iᵀx is the Fourier transform and b_i² is the power spectral density. Phase retrieval is of great interest in many applications such as optical imaging [19], [20], crystallography [21], and more [22]-[24]. The problem becomes linear when x is lifted and we consider the recovery of X = xxᵀ, where each measurement takes the form b_i² = ⟨a_i a_iᵀ, X⟩.

In [17], an algorithm was developed to treat phase retrieval problems with sparse x, based on a semidefinite relaxation, and low-rank matrix recovery combined with a row-sparsity constraint on the resulting matrix. More recent works also propose the use of semidefinite relaxation together with sparsity constraints for phase retrieval [25]-[28]. An alternative algorithm was recently designed in [29] based on a greedy search. In [27], the authors consider sparse signal recovery based on combinatorial and probabilistic approaches and provide uniqueness results under certain conditions. Stable uniqueness in phase retrieval problems is studied in [30]. The results of [31] and [32] apply to general (non-sparse) signals, where in some cases masked versions of the signal are required.

3) Fused Lasso: Suppose the signal of interest x₀ is sparse and its entries vary slowly, i.e., the signal can be approximated by a piecewise constant function. To encourage sparsity, one can use the ℓ₁ norm. To promote the piecewise constant structure, the discrete total variation can be used, defined as

‖x‖_TV = Σ_{i=1}^{n−1} |x_{i+1} − x_i|,

where ‖·‖_TV is basically the ℓ₁ norm of the gradient, which is approximately sparse. The resulting optimization problem that estimates x₀ from its samples Ax₀ is known as the fused lasso [33], and is given as

min_x ‖x‖₁ + λ‖x‖_TV subject to Ax = Ax₀. (1)

To the best of our knowledge, the sample complexity of the fused lasso has not been analyzed from a compressed sensing point of view. However, there is a series of recent works [34], [35] on total variation minimization, which may lead to an analysis of (1).
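Program (1) is straightforward to prototype. Below is a minimal sketch with assumed sizes, data, and weight lam; it simply instantiates (1) for a signal that is both sparse and piecewise constant:

```python
# Sketch of the fused lasso program (1): min ||x||_1 + lam * ||x||_TV, Ax = b.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, m, lam = 100, 40, 1.0
x0 = np.zeros(n); x0[20:30] = 1.0            # sparse and piecewise constant
A = rng.standard_normal((m, n))
b = A @ x0

x = cp.Variable(n)
tv = cp.sum(cp.abs(cp.diff(x)))              # discrete total variation
cp.Problem(cp.Minimize(cp.norm1(x) + lam * tv), [A @ x == b]).solve()
print("relative error:", np.linalg.norm(x.value - x0) / np.linalg.norm(x0))
```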

We remark that TV regularization is also used together with the nuclear norm to encourage a low-rank and smooth (i.e., slowly varying entries) solution. This regularization finds applications in imaging and physics [36], [37].

4) Low-Rank Tensors: Tensors with small Tucker rank can be seen as a generalization of low-rank matrices [38]. In this setup, the signal of interest is a tensor X₀ ∈ R^{n₁×⋯×n_τ}, where X₀ is low-rank along its unfoldings, which are obtained by reshaping X₀ as a matrix of size n_i × (n̄/n_i), with n̄ = Π_{i=1}^τ n_i. Denoting the ith unfolding by U_i(X₀), a standard approach to estimate X₀ from y = A(X₀) is minimizing the weighted nuclear norms of the unfoldings,

min_X Σ_{i=1}^τ λ_i ‖U_i(X)‖⋆ subject to y = A(X). (2)

Low-rank tensors have applications in machine learning, physics, computational finance, and high-dimensional PDEs [15]. Problem (2) has been investigated in several papers [14], [39]. Closest to us, [40] recently showed that the convex relaxation (2) performs poorly compared to information-theoretically optimal bounds for Gaussian measurements. Our results extend those to the more applicable tensor completion setup, where we observe a subset of the entries of the tensor.

Other applications of simultaneously structured signals include collaborative hierarchical sparse modeling [41], where sparsity is considered within the nonzero blocks of a block-sparse vector, and the recovery of hyperspectral images, where we aim to recover a simultaneously block-sparse and low-rank matrix from compressed observations [42].

C. Outline of the Paper

The paper is structured as follows. Background and definitions are given in Section II. An overview of the main results is provided in Section III. Section IV discusses some measurement ensembles for which our results apply. Section V derives upper bounds for the convex relaxations assuming a Gaussian measurement ensemble. Proofs of the general results are presented in Section VI. The proofs for the special case of simultaneously sparse and low-rank matrices are given in Section VII, where we compare corollaries of the general results with the results on nonconvex recovery approaches, and illustrate a gap. Numerical simulations in Section VIII empirically support the results on sparse and low-rank matrices. Future directions of research and a discussion of the results are presented in Section IX.

II. PROBLEM SETUP

We begin by reviewing some basic definitions. Our results will be on structure-inducing norms; examples include the ℓ₁ norm, the ℓ₁,₂ norm, and the nuclear norm.

A. Definitions and Signal Model

The nuclear norm of a matrix is denoted by ‖·‖⋆ and is the sum of the singular values of the matrix. The ℓ₁,₂ norm is the sum of the ℓ₂ norms of the columns of a matrix.

Fig. 1. Depiction of the correlation between a vector x and a set S. s* achieves the largest angle with x, hence s* has the minimum correlation with x.
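As a quick illustration of the two matrix norms just defined, the following computes both for a small assumed example matrix:

```python
# Nuclear norm (sum of singular values) and l_{1,2} norm (sum of column l2 norms).
import numpy as np

X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 0.0],
              [3.0, 0.0, 0.0]])

nuclear = np.linalg.svd(X, compute_uv=False).sum()
l12 = np.linalg.norm(X, axis=0).sum()        # column l2 norms, then sum
print(nuclear, l12)
```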
Minimizing the ℓ₁ norm encourages sparse solutions, while the ℓ₁,₂ norm and the nuclear norm encourage column-sparse and low-rank solutions, respectively [5], [6], [43]-[45]; see Section VI-D for a more detailed discussion of these norms and their subdifferentials. The Euclidean norm is denoted by ‖·‖₂, i.e., the ℓ₂ norm for vectors and the Frobenius norm ‖·‖_F for matrices. Overlines denote normalization, i.e., for a vector x and a matrix X, x̄ = x/‖x‖₂ and X̄ = X/‖X‖_F. The minimum and maximum singular values of a matrix A are denoted by σ_min(A) and σ_max(A). The sets of n×n positive semidefinite (PSD) and symmetric matrices are denoted by S₊ⁿ and Sⁿ, respectively. cone(S) denotes the conic hull of a given set S.

A(·): Rⁿ → Rᵐ is a linear measurement operator if A(x) is equivalent to the matrix multiplication Ax, where A ∈ R^{m×n}. If x is a matrix, A(x) will be a matrix multiplication with a suitably vectorized x. In some of our results, we consider Gaussian measurements, in which case A has independent N(0, 1) entries.

For a vector x ∈ Rⁿ, ‖x‖ denotes a general norm and ‖x‖* = sup_{‖z‖≤1} ⟨x, z⟩ is the corresponding dual norm. A subgradient of the norm at x is a vector g for which ‖z‖ ≥ ‖x‖ + ⟨g, z − x⟩ holds for any z. The set of all subgradients is called the subdifferential and is denoted by ∂‖x‖. The Lipschitz constant of the norm is defined as

L = sup_{z₁ ≠ z₂ ∈ Rⁿ} (‖z₁‖ − ‖z₂‖) / ‖z₁ − z₂‖₂.

Definition 1 (Correlation): Given a nonzero vector x and a set S, ρ(x, S) is defined as

ρ(x, S) := inf_{0 ≠ s ∈ S} |xᵀs| / (‖x‖₂‖s‖₂).

ρ(x, S) corresponds to the minimum absolute-value correlation between the vector x and the elements of S. Let x̄ = x/‖x‖₂. The correlation between x and the associated subdifferential has a simple form:

ρ(x, ∂‖x‖) = inf_{g ∈ ∂‖x‖} (x̄ᵀg)/‖g‖₂ = ‖x‖ / (‖x‖₂ sup_{g ∈ ∂‖x‖} ‖g‖₂).

Here, we used the fact that, for norms, subgradients g ∈ ∂‖x‖ satisfy xᵀg = ‖x‖ [46]. The denominator of the right-hand side is the local Lipschitz constant of ‖·‖ at x and is upper bounded by L. Consequently, ρ(x, ∂‖x‖) ≥ ‖x̄‖/L. We will denote ‖x̄‖/L by κ. Recently, this quantity has been studied by Mu et al. to analyze simultaneously structured signals in a similar spirit to us for Gaussian measurements [40].¹

¹The work [40] was submitted after our initial manuscript, in which we projected the subdifferential onto a carefully chosen subspace to obtain bounds on the sample complexity (see Proposition 6). Inspired by [40], projection onto x̄₀ and the use of κ led to the simplification of the notation and improvement of the results in the current manuscript, in particular, Section IV.
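The quantities ρ and κ are concrete enough to compute directly. The following sketch, under assumed sizes, does this for the ℓ₁ norm; for a k-sparse vector with ±1 nonzero entries the bound ρ(x, ∂‖x‖₁) ≥ κ is tight and κ² = k/n:

```python
# kappa and rho(x, subdifferential) for the l1 norm on a k-sparse +/-1 vector.
import numpy as np

n, k = 100, 10
x = np.zeros(n); x[:k] = 1.0
x_bar = x / np.linalg.norm(x)

L = np.sqrt(n)                               # Lipschitz constant of l1 in R^n
kappa = np.linalg.norm(x_bar, 1) / L
print(kappa**2, k / n)                       # both equal 0.1

# Subgradients are g = sign(x) on the support with |g_i| <= 1 elsewhere;
# ||g||_2 is maximized by +/-1 off-support entries, giving ||g||_2 = sqrt(n).
g_max = np.ones(n)
rho = np.linalg.norm(x, 1) / (np.linalg.norm(x) * np.linalg.norm(g_max))
print(rho, kappa)                            # the lower bound is tight here
```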

Fig. 2. For a scaled norm ball passing through x₀, κ = ‖p‖₂/‖x₀‖₂, where p is any of the closest points on the scaled norm ball to the origin.

Similar calculations as above give an alternative interpretation for κ, which is illustrated in Figure 2. κ is a measure of alignment between the vector x and the subdifferential. For the norms of interest, it is associated with the model complexity. For instance, for a k-sparse vector x, ‖x̄‖₁ lies between 1 and √k, depending on how spiky the nonzero entries are. Also, the Lipschitz constant for the ℓ₁ norm in Rⁿ is L = √n. Hence, when the nonzero entries are ±1, we find κ² = k/n. Similarly, given a d×d rank-r matrix X, ‖X̄‖⋆ lies between 1 and √r. If the singular values are spread (i.e., of comparable magnitude), we find κ² = r/d. In these cases, κ² is proportional to the model complexity normalized by the ambient dimension.

Simultaneously Structured Models: We consider a signal x₀ which has several low-dimensional structures S₁, S₂, …, S_τ (e.g., sparsity, group sparsity, low rank). Suppose each structure i corresponds to a norm denoted by ‖·‖_(i) which promotes that structure (e.g., ℓ₁, ℓ₁,₂, nuclear norm). We refer to such an x₀ as a simultaneously structured model.

B. Convex Recovery Program

We investigate the recovery of the simultaneously structured x₀ from its linear measurements A(x₀). To recover x₀, we would like to simultaneously minimize the norms ‖·‖_(i), i = 1, …, τ, which leads to a multi-objective (vector-valued) optimization problem. For all feasible points x satisfying A(x) = A(x₀) and the side information x ∈ C, consider the set of achievable norms {‖x‖_(i)}_{i=1}^τ, denoted as points in R^τ. The minimal points of this set with respect to the positive orthant R₊^τ form the Pareto-optimal front, as illustrated in Figure 3. Since the problem is convex, one can alternatively consider the set

{v ∈ R^τ : ∃ x ∈ Rⁿ s.t. x ∈ C, A(x) = A(x₀), v_i ≥ ‖x‖_(i) for i = 1, …, τ},

which is convex and has the same Pareto-optimal points as the original set (see [47, Ch. 4]).

Fig. 3. Consider a point x₀, represented in the figure by the dot. We need at least m measurements for x₀ to be recoverable, since for any m′ < m this point is not on the Pareto-optimal front.

Definition 2 (Recoverability): We call x₀ recoverable if it is a Pareto-optimal point; i.e., there does not exist a feasible x′ ≠ x₀ satisfying A(x′) = A(x₀) and x′ ∈ C, with ‖x′‖_(i) ≤ ‖x₀‖_(i) for i = 1, …, τ.

The vector-valued convex recovery program can be turned into a scalar optimization problem as

minimize_{x ∈ C} f(x) = h(‖x‖_(1), …, ‖x‖_(τ)) subject to A(x) = A(x₀), (3)

where h : R₊^τ → R₊ is convex and non-decreasing in each argument (i.e., non-decreasing, and strictly increasing in at least one coordinate).

For convex problems with strong duality, it is known that we can recover all of the Pareto-optimal points by optimizing weighted sums f(x) = Σ_{i=1}^τ λ_i‖x‖_(i), with positive weights λ_i, among all possible functions f(x) = h(‖x‖_(1), …, ‖x‖_(τ)). For each x₀ on the Pareto front, the coefficients of such a recovering function are given by the hyperplane supporting the Pareto front at x₀ [47, Ch. 4]. In Figure 3, consider the smallest m that makes x₀ recoverable.
Then one can choose a function h and recover x₀ by (3) using the m measurements. If the number of measurements is any smaller, then no function can recover x₀. Our goal is to provide lower bounds on m.

In [9], Chandrasekaran et al. propose a general theory for constructing a suitable penalty, called an atomic norm, given a single set of atoms that describes the structure of the target object. In the case of simultaneous structures, this construction requires defining new atoms, and then ensuring that the resulting atomic norm can be minimized in a computationally tractable way, which is nontrivial and often intractable. We briefly discuss such constructions as a future research direction in Section IX.
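The weighted-sum scalarization of (3) can be explored numerically: sweeping the weight in f(x) = ‖x‖₁ + λ‖x‖⋆ and solving (3) for each λ yields Pareto-optimal pairs (‖x‖₁, ‖x‖⋆) over the feasible set. A minimal sketch, with assumed sizes and an assumed planted model:

```python
# Tracing points of the Pareto-optimal front by sweeping the weight lam.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, m = 10, 30
a = np.zeros(n); a[:3] = 1.0
X0 = np.outer(a, a)
As = rng.standard_normal((m, n, n))
b = np.einsum('mij,ij->m', As, X0)

X = cp.Variable((n, n))
constraints = [cp.sum(cp.multiply(As[i], X)) == b[i] for i in range(m)]
for lam in [0.1, 1.0, 10.0]:
    cp.Problem(cp.Minimize(cp.sum(cp.abs(X)) + lam * cp.normNuc(X)),
               constraints).solve()
    l1 = np.abs(X.value).sum()
    nuc = np.linalg.svd(X.value, compute_uv=False).sum()
    print(f"lam={lam}: (l1, nuclear) = ({l1:.3f}, {nuc:.3f})")
```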

III. MAIN RESULTS

In this section, we state our main theorems that aim to characterize the number of measurements needed to recover a simultaneously structured signal by convex or nonconvex programs. We first present our general results, followed by results for simultaneously sparse and low-rank matrices as a specific but important instance of the general case. The proofs are given in Sections VI and VII. All of our statements will implicitly assume x₀ ≠ 0. This ensures that x₀ is not a trivial minimizer and that 0 is not in the subdifferentials.

A. General Simultaneously Structured Signals

Consider the recovery of a signal x₀ that is simultaneously structured with S₁, S₂, …, S_τ, as described in Section II-A. We provide a lower bound on the required number of measurements, using the geometric properties of the individual norms.

Theorem 1 (Deterministic Failure): Suppose C = Rⁿ and

ρ(x₀, ∂f(x₀)) := inf_{g ∈ ∂f(x₀)} ḡᵀx̄₀ > ‖Ax̄₀‖₂ / σ_min(Aᵀ). (4)

Then, x₀ is not a minimizer of (3).

Theorem 1 is deterministic in nature. However, it can easily be specialized to specific random measurement ensembles. The left-hand side of (4) depends only on the vector x₀ and the subdifferential ∂f(x₀); hence it is independent of the measurement matrix A. For simultaneously structured models, we will argue that the left-hand side cannot be made too small, as the subgradients are aligned with the signal. On the other hand, the right-hand side depends only on A and x₀ and is independent of the subdifferential. In linear inverse problems, A is often assumed to be random. For a large class of random matrices, we will argue that the right-hand side is approximately √(m/n), which yields a lower bound on the number of required measurements. Typical measurement ensembles include the following.

Sampling entries: In low-rank matrix and tensor completion problems, we observe the entries of x₀ uniformly at random. In this case, the rows of A are chosen from the standard basis in Rⁿ. We remark that, instead of the standard basis, one can consider other orthonormal bases such as the Fourier basis.

Matrices with i.i.d. rows: A has independent and identically distributed rows satisfying certain moment conditions. This is a widely used setup in compressed sensing, as each measurement we make is associated with the corresponding row of A [48].

Quadratic measurements: These arise in the phase retrieval problem, as discussed in Section I-B.

In Section IV, we find upper bounds on the right-hand side of (4) for these ensembles. As discussed in Section IV, we can modify the rows of A to get better bounds as long as this does not affect its null space. For instance, one can discard identical rows to improve conditioning. However, as m increases and A has more linearly independent rows, σ_min(Aᵀ) will naturally decrease, and (4) will no longer hold after a certain point. In particular, (4) cannot hold beyond m ≥ n, as σ_min(Aᵀ) = 0. This is indeed natural, as the system becomes overdetermined.

The following proposition lower bounds the left-hand side of (4) in an interpretable manner. In particular, the correlation ρ(x₀, ∂f(x₀)) can be lower bounded by the smallest individual correlation.

Proposition 1: Let L_i be the Lipschitz constant of the ith norm and κ_i = ‖x̄₀‖_(i)/L_i for 1 ≤ i ≤ τ. Set κ_min = min{κ_i : i = 1, …, τ}. Then:

All functions f(·) in (3) satisfy ρ(x₀, ∂f(x₀)) ≥ κ_min.

Suppose f(·) is a weighted linear combination f(x) = Σ_{i=1}^τ λ_i‖x‖_(i) for nonnegative {λ_i}_{i=1}^τ. Let λ̄_i = λ_iL_i / (Σ_{j=1}^τ λ_jL_j) for 1 ≤ i ≤ τ. Then, ρ(x₀, ∂f(x₀)) ≥ Σ_{i=1}^τ λ̄_iκ_i.

Proof: From Lemma 3, any subgradient of f(·) can be written as g = Σ_{i=1}^τ w_ig_i for some nonnegative w_i's, where g_i ∈ ∂‖x₀‖_(i). On the other hand, from [46], ⟨x̄₀, g_i⟩ = ‖x̄₀‖_(i). Combining these results,

gᵀx̄₀ = Σ_{i=1}^τ w_i‖x̄₀‖_(i).

From the triangle inequality, ‖g‖₂ ≤ Σ_{i=1}^τ w_iL_i. Therefore,

gᵀx̄₀ / ‖g‖₂ ≥ (Σ_{i=1}^τ w_i‖x̄₀‖_(i)) / (Σ_{i=1}^τ w_iL_i) ≥ min_{1≤i≤τ} ‖x̄₀‖_(i)/L_i = κ_min. (5)
To prove the second part, we use the fact that for weighted sums of norms, w_i = λ_i, and the subgradients have the form g = Σ_{i=1}^τ λ_ig_i [47]. Then, substituting λ̄_i in the left-hand side of (5) gives the result. ∎

Before stating the next result, let us give a relevant definition regarding the average distance between a set and a random vector.

Definition 3 (Gaussian Distance): Let M be a closed convex set in Rⁿ and let h ∈ Rⁿ be a vector with independent standard normal entries. Then, the Gaussian distance of M is defined as

D(M) = E[ inf_{v ∈ M} ‖h − v‖₂ ].

When M is a cone, we have 0 ≤ D(M) ≤ √n. Similar definitions have been used extensively in the literature, such as the Gaussian width [9], the statistical dimension [49], and the mean width [50]. For notational simplicity, let the normalized distance be D̄(M) = D(M)/√n.

We will now state our result for Gaussian measurements, which can additionally include cone constraints in the lower bound. Other ensembles are considered in Section IV.

Theorem 2 (Gaussian Lower Bound): Suppose A has independent N(0, 1) entries. Whenever m ≤ m_low, x₀ will not be a minimizer of any of the recovery programs in (3) with probability at least 1 − 10 exp(−(1/16) min{m_low, (1 − D̄(C))²n}), where

m_low = (1 − D̄(C))² n κ_min² / 100.

Remark: When C = Rⁿ, D̄(C) = 0; hence the lower bound simplifies to m_low = nκ_min²/100.² Note that D̄(C) depends only on C and can be viewed as a constant. For instance, for the positive semidefinite cone, we show that D̄(S₊ⁿ) is upper bounded by an absolute constant strictly less than 1. Observe that for a smaller cone C, it is reasonable to expect a smaller lower bound on the required number of measurements; indeed, as C gets smaller, D̄(C) increases.

²The lower bound κ_min² is directly comparable to [40, Th. 5]. Indeed, our lower bounds on the sample complexity will have the form O(κ_min² n).
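The claim that the right-hand side of (4) behaves like √(m/n) for Gaussian matrices is easy to check by simulation. A small Monte Carlo sketch, with assumed sizes chosen so that m ≪ n:

```python
# For Gaussian A in R^{m x n}, ||A x_bar||_2 / sigma_min(A^T) is close to
# sqrt(m/n) when m << n, which is how Theorem 1 yields sample-complexity bounds.
import numpy as np

rng = np.random.default_rng(5)
n, m, trials = 2500, 25, 20
x_bar = np.zeros(n); x_bar[0] = 1.0      # any unit vector, by rotation invariance

ratios = []
for _ in range(trials):
    A = rng.standard_normal((m, n))
    sigma_min = np.linalg.svd(A.T, compute_uv=False).min()
    ratios.append(np.linalg.norm(A @ x_bar) / sigma_min)

print(np.mean(ratios), np.sqrt(m / n))   # close for m << n
```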

TABLE II: SUMMARY OF THE PARAMETERS DISCUSSED IN THIS SECTION. THE LAST THREE LINES ARE FOR AN S&L (k, k, r) MATRIX, WHERE n = d². IN THE FOURTH COLUMN, THE CORRESPONDING ENTRY FOR S&L IS κ_min² = min{κ_ℓ₁², κ_⋆²}.

As discussed before, there are various options for the scalarizing function in (3), one choice being the weighted sum of norms. In fact, for a recoverable point x₀ there always exists a weighted sum of norms which recovers it. This function is also often the choice in applications, where the space of positive weights is searched for a good combination. Thus, we can state the following corollary as a general result.

Corollary 1 (Weighted Lower Bound): Suppose A has i.i.d. N(0, 1) entries and f(x) = Σ_{i=1}^τ λ_i‖x‖_(i) for nonnegative weights {λ_i}_{i=1}^τ. Whenever m ≤ m_low, x₀ will not be a minimizer of the recovery program (3) with probability at least 1 − 10 exp(−(1/16) min{m_low, (1 − D̄(C))²n}), where

m_low = n(1 − D̄(C))² (Σ_{i=1}^τ λ̄_iκ_i)² / 100, and λ̄_i = λ_iL_i / (Σ_{j=1}^τ λ_jL_j).

Observe that Theorem 2 is stronger than stating that a particular function h(‖x‖_(1), …, ‖x‖_(τ)) will not work. Instead, our result states that, with high probability, none of the programs in the class (3) can return x₀ as optimal unless the number of measurements is sufficiently large.

To understand the result better, note that the required number of measurements is proportional to κ_min²n, which is often proportional to the sample complexity of the best individual norm. As we argued in Section II-A, κ_i²n corresponds to how structured the signal is. For sparse signals it is equal to the sparsity, and for a rank-r matrix, it is equal to the degrees of freedom of the set of rank-r matrices. Consequently, Theorem 2 suggests that even if the signal satisfies multiple structures, the required number of measurements is effectively determined by a single dominant structure. Intuitively, the degrees of freedom of a simultaneously structured signal can be much lower, which is provable for simultaneously sparse and low-rank (S&L) matrices. Hence, there is a considerable gap between the expected number of measurements based on model complexity and the number of measurements needed for recovery via (3), namely κ_min²n.

B. Simultaneously Sparse and Low-Rank Matrices

We now focus on a special case, namely simultaneously sparse and low-rank (S&L) matrices. We consider matrices whose nonzero entries are contained in a small submatrix, where the submatrix itself is low rank. Here, the norms of interest are ‖·‖₁,₂, ‖·‖₁ and ‖·‖⋆, and the cone of interest is the PSD cone. We also consider nonconvex approaches and contrast the results with convex approaches. For the nonconvex problem, we replace the norms ‖·‖₁, ‖·‖₁,₂, ‖·‖⋆ with the functions ‖·‖₀, ‖·‖₀,₂, rank(·), which give the number of nonzero entries, the number of nonzero columns, and the rank of a matrix, respectively, and use the same cone constraint as in the convex method. We show that the convex methods perform poorly, as predicted by the general result in Theorem 2, while the nonconvex methods require an optimal number of measurements (up to a logarithmic factor). Proofs are given in Section VII.

Definition 4: We say X₀ ∈ R^{d₁×d₂} is an S&L matrix with (k₁, k₂, r) if the smallest submatrix that contains the nonzero entries of X₀ has size k₁×k₂ and rank(X₀) = r. When X₀ is symmetric, let d = d₁ = d₂ and k = k₁ = k₂. We consider the following cases:

(a) General: X₀ ∈ R^{d₁×d₂} is S&L with (k₁, k₂, r).
(b) PSD model: X₀ ∈ R^{d×d} is PSD and S&L with (k, k, r).

We are interested in S&L matrices with k₁ ≪ d₁ and k₂ ≪ d₂, so that the matrix is sparse, and r ≪ min{k₁, k₂}, so that the submatrix containing the nonzero entries is low rank.
Recall from Section II-B that our goal is to recover X₀ from linear observations A(X₀) via convex or nonconvex optimization programs. The measurements can be equivalently written as A vec(X₀), where A ∈ R^{m×d₁d₂} and vec(X₀) ∈ R^{d₁d₂} denotes the vector obtained by stacking the columns of X₀.

Based on the results in Section III-A, we obtain lower bounds on the number of measurements for convex recovery. We additionally show that significantly fewer measurements are sufficient for nonconvex programs to uniquely recover X₀, thus proving a performance gap between convex and nonconvex approaches. The following theorem summarizes the results.

Theorem 3 (Performance of S&L Matrix Recovery): Suppose A(·) is an i.i.d. Gaussian map and consider recovering X₀ ∈ R^{d₁×d₂} via

minimize_{X ∈ C} f(X) subject to A(X) = A(X₀). (6)

For the cases given in Definition 4, the following convex and nonconvex recovery results hold for some positive constants c₁, c₂.

(a) General model:
(a1) Let f(X) = ‖X‖₁,₂ + λ₁‖Xᵀ‖₁,₂ + λ₂‖X‖⋆, where λ₁, λ₂ ≥ 0 and C = R^{d₁×d₂}. Then, (6) will fail to recover X₀ with probability 1 − exp(−c₁m₀) whenever m ≤ c₂m₀, where m₀ = min{d₁k₂, d₂k₁, (d₁ + d₂)r}.
(a2) Let f(X) = (1/k₂)‖X‖₀,₂ + (1/k₁)‖Xᵀ‖₀,₂ + (1/r) rank(X) and C = R^{d₁×d₂}. Then, (6) will uniquely recover X₀ with probability 1 − exp(−c₁m) whenever m ≥ c₂ max{(k₁ + k₂)r, k₁ log(d₁/k₁), k₂ log(d₂/k₂)}.

TABLE III: SUMMARY OF RECOVERY RESULTS FOR THE MODELS IN DEFINITION 4, ASSUMING d₁ = d₂ = d AND k₁ = k₂ = k. FOR THE PSD WITH ℓ₁ CASE, WE ASSUME ‖X̄₀‖₁ ≈ k AND ‖X̄₀‖⋆ ≈ √r FOR THE SAKE OF SIMPLICITY. NONCONVEX APPROACHES ARE OPTIMAL UP TO A LOGARITHMIC FACTOR, WHILE CONVEX APPROACHES PERFORM POORLY.

(b) PSD with ℓ₁,₂:
(b1) Let f(X) = ‖X‖₁,₂ + λ‖X‖⋆, where λ ≥ 0 and C = S₊^d. Then, (6) will fail to recover X₀ with probability 1 − exp(−c₁dr) whenever m ≤ c₂dr.
(b2) Let f(X) = (1/k)‖X‖₀,₂ + (1/r) rank(X) and C = S^d. Then, (6) will uniquely recover X₀ with probability 1 − exp(−c₁m) whenever m ≥ c₂ max{rk, k log(d/k)}.

(c) PSD with ℓ₁:
(c1) Let f(X) = ‖X‖₁ + λ‖X‖⋆ and C = S₊^d. Then, (6) will fail to recover X₀ with probability 1 − exp(−c₁m₀) for all possible λ ≥ 0 whenever m ≤ c₂m₀, where m₀ = min{‖X̄₀‖₁², d‖X̄₀‖⋆²}.
(c2) Suppose rank(X₀) = 1. Let f(X) = (1/k²)‖X‖₀ + rank(X) and C = S^d. Then, (6) will uniquely recover X₀ with probability 1 − exp(−c₁m) whenever m ≥ c₂k log(d/k).

Remark on PSD with ℓ₁: In the special case X₀ = aaᵀ for a k-sparse vector a, we have m₀ = min{‖ā‖₁⁴, d}. When the nonzero entries of a are ±1, we have m₀ = min{k², d}.

The nonconvex programs require almost the same number of measurements as the degrees of freedom (the number of parameters) of the underlying model. For instance, it is known that the degrees of freedom of a rank-r matrix of size k₁×k₂ is simply r(k₁ + k₂ − r), which is O((k₁ + k₂)r). Hence, the nonconvex results are optimal up to a logarithmic factor. On the other hand, our results on the convex programs, which follow from Theorem 2, show that the required number of measurements is significantly larger. Table III provides a quick comparison of the results on S&L matrices.

For the S&L (k, k, r) model, from standard results one can easily deduce that [3], [5], [43]:

ℓ₁ penalty only: requires at least k² measurements;
ℓ₁,₂ penalty only: requires at least kd measurements;
nuclear norm penalty only: requires at least rd measurements.

These follow from the model complexity of sparse, column-sparse, and low-rank matrices. Theorem 2 shows that a combination of norms requires at least as many measurements as the best individual norm. For instance, a combination of the ℓ₁ and nuclear norm penalizations yields the lower bound O(min{k², rd}) for S&L matrices whose singular values and nonzero entries are spread. This is indeed what we would expect from the interpretation that κ²n is often proportional to the sample complexity of the corresponding norm, so that the lower bound κ_min²n is proportional to that of the best individual norm.

As we saw in Section III-A, adding a cone constraint to the recovery program does not help in reducing the lower bound by more than a constant factor. In particular, we discuss the positive semidefiniteness assumption that is beneficial in the sparse phase retrieval problem, and show that the number of measurements remains high even when we include this extra information. On the other hand, the nonconvex recovery programs perform well even without the PSD constraint.
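To make the quantities in part (c1) concrete, the following sketch, with assumed sizes, builds X₀ = aaᵀ for a k-sparse ±1 vector a and evaluates m₀ = min{‖X̄₀‖₁², d‖X̄₀‖⋆²}:

```python
# Evaluating the failure threshold m0 of Theorem 3(c1) for X0 = a a^T.
import numpy as np

d, k = 1000, 20
a = np.zeros(d); a[:k] = 1.0
X0 = np.outer(a, a)                              # S&L: k x k block, rank one, PSD

X0_bar = X0 / np.linalg.norm(X0)                 # Frobenius normalization
l1 = np.abs(X0_bar).sum()                        # ||X0_bar||_1 = k here
nuc = np.linalg.svd(X0_bar, compute_uv=False).sum()   # = 1 (rank one)

m0 = min(l1**2, d * nuc**2)
print(l1**2, d * nuc**2, m0)                     # k^2 = 400, d = 1000, m0 = 400
```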
We remark that we could have stated Theorem 3 for the more general measurements given in Section IV, without the cone constraint. For instance, the following result holds for a weighted linear combination of the individual norms and for the subgaussian ensemble.

Corollary 2: Suppose X₀ ∈ R^{d×d} obeys the general model with k₁ = k₂ = k, and A is a linear subgaussian map as described in Proposition 2. Choose f(X) = λ_ℓ₁‖X‖₁ + λ_⋆‖X‖⋆, where λ_ℓ₁ = β/d, λ_⋆ = (1 − β)/√d, and 0 ≤ β ≤ 1. Then, whenever m ≤ min{m_low, c₁n}, where

m_low = (β‖X̄₀‖₁ + (1 − β)√d‖X̄₀‖⋆)²,

(6) fails with probability 1 − 4 exp(−c₂m_low). Here c₁, c₂ > 0 are constants as described in Proposition 2.

Remark: Choosing X₀ = aaᵀ, where the nonzero entries of a are ±1, yields m_low = (βk + (1 − β)√d)². An explicit construction of an S&L matrix with maximal ‖X̄‖₁, ‖X̄‖⋆ is provided in Section VII-C.

This corollary compares well with the upper bound obtained in Corollary 5 of Section V. In particular, both the bounds and the penalty parameters match up to logarithmic factors. Hence, together, they sandwich the sample complexity of the combined cost f(X).

IV. MEASUREMENT ENSEMBLES

This section makes use of standard results on subgaussian random variables and random matrix theory to obtain probabilistic statements. We explain how one can analyze the right-hand side of (4) for matrices with subgaussian rows, for a subsampled standard basis (as in matrix completion), and for quadratic measurements arising in phase retrieval.

A. Sub-Gaussian Measurements

We first consider measurement maps with subgaussian entries. The following definitions are borrowed from [51].

Definition 5 (Sub-Gaussian Random Variable): A random variable x is subgaussian if there exists a constant K > 0 such that for all p ≥ 1, (E|x|^p)^{1/p} ≤ K√p. The smallest such K is called the subgaussian norm of x and is denoted by ‖x‖_Ψ₂. A sub-exponential random variable y is one for which there exists a constant K′ such that P(|y| > t) ≤ exp(1 − t/K′). The variable x is subgaussian if and only if x² is sub-exponential.

Definition 6 (Isotropic Sub-Gaussian Vector): A random vector x ∈ Rⁿ is subgaussian if the one-dimensional marginals xᵀv are subgaussian random variables for all v ∈ Rⁿ. The subgaussian norm of x is defined as

‖x‖_Ψ₂ = sup_{‖v‖₂=1} ‖xᵀv‖_Ψ₂.

The vector x is also isotropic if its covariance is equal to the identity, i.e., E[xxᵀ] = Iₙ.

Proposition 2 (Sub-Gaussian Measurements): Suppose A has i.i.d. rows in either of the following forms: (i) each row is a copy of a zero-mean isotropic subgaussian vector a ∈ Rⁿ, where ‖a‖₂ = √n almost surely; (ii) the rows consist of i.i.d. zero-mean, unit-variance subgaussian entries. Then, there exist constants c₁, c₂ depending only on the subgaussian norm of the rows such that, whenever m ≤ c₁n, with probability 1 − 4 exp(−c₂m), we have

‖Ax̄₀‖₂ / σ_min(Aᵀ) ≤ 2√(m/n).

Proof: Using [51, Th. 5.58], there exist constants c, C depending only on the subgaussian norm of a such that for any t ≥ 0, with probability 1 − 2 exp(−ct²),

σ_min(Aᵀ) ≥ √n − C√m − t.

Choosing t = C√m and m ≤ n/(100C²) ensures that σ_min(Aᵀ) ≥ 4√n/5. Next, we estimate ‖Ax̄₀‖₂², which is a sum of i.i.d. sub-exponential random variables identically distributed as |aᵀx̄₀|². Note that E[|aᵀx̄₀|²] = 1. Hence, [51, Proposition 5.16] gives

P(‖Ax̄₀‖₂² ≥ m + t) ≤ exp(−c′ min{t²/m, t}).

Choosing t = 7m/5, we find that P(‖Ax̄₀‖₂² ≥ 12m/5) ≤ exp(−c′m). Combining the two, we obtain

P( ‖Ax̄₀‖₂ / σ_min(Aᵀ) ≤ 2√(m/n) ) ≥ 1 − 4 exp(−c₂m).

The second statement can be proved in exactly the same manner by using [51, Th. 5.39] instead of Theorem 5.58. ∎

Remark: While Proposition 2 assumes that a has a fixed ℓ₂ norm, this can be ensured by properly normalizing the rows of A (assuming they remain subgaussian). For instance, if the ℓ₂ norms of the rows are larger than c√n for a positive constant c, normalization will not affect subgaussianity. Note that scaling the rows of a matrix does not change its null space.

B. Randomly Sampling Entries

We now consider the scenario where each row of A is chosen from the standard basis uniformly at random. Note that when m is comparable to n, there is a nonnegligible probability that A will have duplicate rows. Theorem 1 does not take this situation into account, which would make σ_min(Aᵀ) = 0. In this case, one can discard the copies, as they don't affect the recoverability of x₀. This gets rid of the ill-conditioning, since the new matrix is well-conditioned with exactly the same null space as the original. This corresponds to a sampling-without-replacement scheme in which we ensure each row is different.

Similar to achievability results in matrix completion [6], the following failure result requires the true signal to be incoherent with the standard basis, where incoherence is characterized by ‖x̄₀‖_∞, which lies between 1/√n and 1.

Proposition 3 (Sampling Entries): Let {e_i}_{i=1}^n be the standard basis in Rⁿ, and suppose each row of A is chosen from {e_i}_{i=1}^n uniformly at random. Let Â be the matrix obtained by removing the duplicate rows of A. Then, with probability 1 − exp(−m/(4n‖x̄₀‖_∞²)), we have

‖Âx̄₀‖₂ / σ_min(Âᵀ) ≤ √(2m/n).

Proof: Let Â be the matrix obtained by discarding the rows of A that occur multiple times, keeping one copy of each. Clearly Null(Â) = Null(A); hence the two matrices are equivalent for the purpose of recovering x₀. Furthermore, σ_min(Âᵀ) = 1. Therefore, we are interested in upper bounding ‖Âx̄₀‖₂. Clearly ‖Âx̄₀‖₂ ≤ ‖Ax̄₀‖₂; hence we will bound ‖Ax̄₀‖₂ probabilistically. Let a be the first row of A. |aᵀx̄₀|² is a random variable with mean 1/n, upper bounded by ‖x̄₀‖_∞². Applying the Chernoff bound yields

P( ‖Ax̄₀‖₂² ≥ (m/n)(1 + δ) ) ≤ exp(−δ²m / (2(1 + δ)n‖x̄₀‖_∞²)).
Setting δ = 1, we find that, with probability 1 − exp(−m/(4n‖x̄₀‖_∞²)), we have

‖Âx̄₀‖₂ / σ_min(Âᵀ) ≤ ‖Ax̄₀‖₂ / σ_min(Âᵀ) ≤ √(2m/n),

completing the proof. ∎

A significant application of this result is the low-rank tensor completion problem, where we randomly observe some entries of a low-rank tensor and try to reconstruct it. A promising approach for this problem uses a weighted linear combination of the nuclear norms of the unfoldings of the tensor to induce the low-rank tensor structure, as described in (2) [14], [15]. Related work [40] shows the poor performance of (2) for the special case of Gaussian measurements. The combination of Theorem 1 and Proposition 3 immediately extends the results of [40] to the more applicable tensor completion setup (under proper incoherence conditions that bound ‖x̄₀‖_∞).

Remark: In Propositions 2 and 3, we can make the upper bound on the ratio ‖Ax̄₀‖₂/σ_min(Aᵀ) arbitrarily close to √(m/n) by changing the proof parameters. Combined with Proposition 1, this suggests that failure happens when m < nκ_min².
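For the tensor completion application just mentioned, the unfoldings U_i used in (2) are simple reshapes. A sketch under assumed shapes, building a low-Tucker-rank tensor from a hypothetical small core and checking that every unfolding has rank r:

```python
# Mode-i unfoldings of a tensor, as used in program (2).
import numpy as np

def unfolding(X, i):
    """Mode-i unfolding: an n_i x (product of the other dimensions) matrix."""
    return np.moveaxis(X, i, 0).reshape(X.shape[i], -1)

rng = np.random.default_rng(6)
n1, n2, n3, r = 6, 7, 8, 2
core = rng.standard_normal((r, r, r))                 # small Tucker core
U = [rng.standard_normal((n, r)) for n in (n1, n2, n3)]
X = np.einsum('abc,ia,jb,kc->ijk', core, *U)          # low Tucker rank tensor

for i in range(3):
    M = unfolding(X, i)
    print(f"unfolding {i}: shape {M.shape}, rank {np.linalg.matrix_rank(M)}")
```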

C. Quadratic Measurements

As mentioned for the phase retrieval problem, quadratic measurements |vᵀa|² of a vector a ∈ R^d can be linearized by the change of variable a → X₀ = aaᵀ and using V = vvᵀ. The following proposition can be used to obtain a lower bound for such ensembles when combined with Theorem 1.

Proposition 4: Suppose we observe quadratic measurements A(X₀) ∈ R^m of a matrix X₀ = aaᵀ ∈ R^{d×d}. Here, assume that the ith entry of A(X₀) is equal to |v_iᵀa|², where {v_i}_{i=1}^m are independent vectors, either with N(0, 1) entries or uniformly distributed over the sphere of radius √d. Then, there exist absolute constants c₁, c₂ > 0 such that whenever m ≤ c₁d/log d, with probability 1 − 2e^{−2 log d},

‖A(X̄₀)‖₂ / σ_min(Aᵀ) ≤ c₂√m (log d)/d.

Proof: Let V_i = v_iv_iᵀ. Without loss of generality, assume the v_i's are uniformly distributed over the sphere of radius √d. To lower bound σ_min(Aᵀ), we estimate the coherence of its columns, defined by

μ(Aᵀ) = max_{i≠j} ⟨V_i, V_j⟩ / (‖V_i‖_F‖V_j‖_F) = max_{i≠j} (v_iᵀv_j)² / d².

[51, Sec. 5.2.5] states that the subgaussian norm of v_i is bounded by an absolute constant. Hence, conditioned on v_j, (v_iᵀv_j)²/d is a sub-exponential random variable with mean 1. Using Definition 5, there exists a constant c > 0 such that

P( (v_iᵀv_j)²/d > c log d ) ≤ e^{−4 log d}.

Union bounding over all i, j pairs ensures that with probability 1 − e^{−2 log d} we have μ(Aᵀ) ≤ c(log d)/d. Next, we use the standard result that for a matrix with columns of equal length ℓ, σ_min(Aᵀ)² ≥ ℓ²(1 − (m − 1)μ); the reader is referred to [52, Proposition 1]. Here the columns of Aᵀ have length ‖V_i‖_F = d. Hence, m ≤ c₁d/log d gives σ_min(Aᵀ) ≥ d/√2.

It remains to upper bound ‖A(X̄₀)‖₂. The ith entry of A(X̄₀) is equal to |v_iᵀā|², hence it is sub-exponential. Consequently, there exists a constant c′ such that each entry is upper bounded by c′ log d with probability 1 − e^{−3 log d}. Union bounding, and using m ≤ d, we find that ‖A(X̄₀)‖₂ ≤ c′√m log d with probability 1 − e^{−2 log d}. Combining with the estimate of σ_min(Aᵀ) completes the proof. ∎

Comparison to Existing Literature: Proposition 4 is useful for estimating the performance of the sparse phase retrieval problem, in which a is a k-sparse vector and we minimize a combination of the ℓ₁ norm and the nuclear norm to recover X₀. Combined with Theorem 1, Proposition 4 gives that whenever m ≤ c₁d/log d and c₂√m log d ≤ min{‖X̄₀‖₁, √d‖X̄₀‖⋆}, recovery fails with high probability. Since ‖X̄₀‖⋆ = 1 and ‖X̄₀‖₁ = ‖ā‖₁², the failure condition reduces to

m ≤ (c/log² d) min{‖ā‖₁⁴, d}.

When ā is a k-sparse vector with ±1 entries, in a similar flavor to Theorem 3, the right-hand side takes the form (c/log² d) min{k², d}.

We emphasize that the lower bound provided in [26] is directly comparable to our results. The authors in [26] consider the same problem and give two results: first, if m ≥ O(‖ā‖₁²k log d), then minimizing ‖X‖₁ + λ tr(X) for a suitable value of λ over the set of PSD matrices will exactly recover X₀ with high probability. Second, their Theorem 1.3 provides a necessary condition (lower bound) on the number of measurements, under which the recovery program fails to recover X₀ with high probability. In particular, their failure condition is m ≤ min{m₀, d/(40 log d)}, where m₀ = max(‖ā‖₁² − k/2, 0)²/(500 log² d). Observe that both results require m ≤ O(d/log d). Focusing on the sparsity requirements, when the nonzero entries are sufficiently diffuse (i.e., ‖ā‖₁⁴ ≈ k²), both results yield O(k²/log² d) as a lower bound. On the other hand, if ‖ā‖₁² ≤ k/2, their lower bound becomes trivial, while our lower bound still requires O(‖ā‖₁⁴/log² d) measurements. ‖ā‖₁² ≤ k/2 can happen as soon as the nonzero entries are spiky, i.e., some of the entries are much larger than the rest. In this sense, our bounds are tighter. On the other hand, their lower bound includes the PSD constraint, unlike ours.
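The coherence argument in the proof of Proposition 4 can be checked by simulation. A small sketch, with assumed sizes, estimating μ(Aᵀ) for vectors scaled to the sphere of radius √d:

```python
# Coherence mu = max_{i != j} (v_i^T v_j)^2 / d^2 for random sphere vectors;
# it behaves like (log d)/d, as used in the proof of Proposition 4.
import numpy as np

rng = np.random.default_rng(7)
d, m = 500, 50
V = rng.standard_normal((m, d))
V *= np.sqrt(d) / np.linalg.norm(V, axis=1, keepdims=True)   # radius sqrt(d)

G = (V @ V.T) ** 2 / d**2                    # pairwise (v_i^T v_j)^2 / d^2
np.fill_diagonal(G, 0.0)
print(G.max(), np.log(d) / d)                # same order of magnitude
```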
D. Asymptotic Regime

While we discussed two cases in the nonasymptotic setup, we believe significantly more general results can be stated asymptotically (m, n → ∞). For instance, under a finite fourth-moment constraint, thanks to the Bai-Yin law [53], the smallest singular value of a matrix with i.i.d. unit-variance entries asymptotically concentrates around √n − √m. Similarly, ‖Ax̄₀‖₂² is a sum of independent variables; hence, thanks to the law of large numbers, we will have ‖Ax̄₀‖₂²/m → 1. Together, these yield

‖Ax̄₀‖₂ / σ_min(Aᵀ) → √m / (√n − √m).

V. UPPER BOUNDS

We now state an upper bound on the simultaneous optimization for the Gaussian measurement ensemble. Our upper bound will be in terms of distances to the dilated subdifferentials. To accomplish this, we make use of the recent theory on the sample complexity of linear inverse problems. It has been recently shown that (3) exhibits a phase transition from failure with high probability to success with high probability when the number of Gaussian measurements is around the quantity m_PT = D(cone(∂f(x₀)))² [9], [49]. This phenomenon was first observed by Donoho and Tanner, who calculated the phase transitions for ℓ₁ minimization and showed that D(cone(∂‖x₀‖₁))² ≈ 2k log(en/(2k)) for a k-sparse vector in Rⁿ [54]. All of these works focus on signals with a single structure and do not study properties of a penalty that is a combination of norms. The next theorem relates the phase transition point of the joint optimization (3) to the individual subdifferentials.

Theorem 4: Suppose A has i.i.d. N(0, 1) entries and let f(x) = Σ_{i=1}^τ λ_i‖x‖_(i). For positive scalars {α_i}_{i=1}^τ, let λ̄_i = λ_iα_i^{−1} / (Σ_{j=1}^τ λ_jα_j^{−1}), and define

m_up({α_i}_{i=1}^τ) := ( Σ_i λ̄_i D(α_i ∂‖x₀‖_(i)) )².
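The Gaussian distance to a dilated subdifferential, as it appears in Theorem 4, is straightforward to estimate by Monte Carlo for the ℓ₁ norm: at x₀, the subdifferential fixes g_i = sign(x₀,i) on the support and allows |g_i| ≤ 1 elsewhere, so the distance from h to its dilation by α has a coordinate-wise closed form. A sketch under assumed sizes and a hypothetical dilation α:

```python
# Monte Carlo estimate of D(alpha * subdifferential of ||x0||_1).
import numpy as np

rng = np.random.default_rng(8)
n, k, alpha, trials = 200, 10, 3.0, 2000
x0 = np.zeros(n); x0[:k] = 1.0
support = x0 != 0

def dist_to_dilated_subdiff(h):
    # On the support, the dilated set fixes g_i = alpha * sign(x0_i);
    # off the support, g_i ranges over [-alpha, alpha].
    on = h[support] - alpha * np.sign(x0[support])
    off = np.maximum(np.abs(h[~support]) - alpha, 0.0)
    return np.sqrt(np.sum(on**2) + np.sum(off**2))

D = np.mean([dist_to_dilated_subdiff(rng.standard_normal(n))
             for _ in range(trials)])
print("estimated D:", D, " D^2 for this alpha:", D**2)
```

Minimizing the estimate over α gives the kind of per-norm distance term that enters m_up above.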


More information

Some Examples. Uniform motion. Poisson processes on the real line

Some Examples. Uniform motion. Poisson processes on the real line Some Examples Our immeiate goal is to see some examples of Lévy processes, an/or infinitely-ivisible laws on. Uniform motion Choose an fix a nonranom an efine X := for all (1) Then, {X } is a [nonranom]

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

Separation of Variables

Separation of Variables Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical

More information

Agmon Kolmogorov Inequalities on l 2 (Z d )

Agmon Kolmogorov Inequalities on l 2 (Z d ) Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,

More information

arxiv: v1 [math.mg] 10 Apr 2018

arxiv: v1 [math.mg] 10 Apr 2018 ON THE VOLUME BOUND IN THE DVORETZKY ROGERS LEMMA FERENC FODOR, MÁRTON NASZÓDI, AND TAMÁS ZARNÓCZ arxiv:1804.03444v1 [math.mg] 10 Apr 2018 Abstract. The classical Dvoretzky Rogers lemma provies a eterministic

More information

Introduction to the Vlasov-Poisson system

Introduction to the Vlasov-Poisson system Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its

More information

arxiv: v1 [cs.lg] 22 Mar 2014

arxiv: v1 [cs.lg] 22 Mar 2014 CUR lgorithm with Incomplete Matrix Observation Rong Jin an Shenghuo Zhu Dept. of Computer Science an Engineering, Michigan State University, rongjin@msu.eu NEC Laboratories merica, Inc., zsh@nec-labs.com

More information

Analytic Scaling Formulas for Crossed Laser Acceleration in Vacuum

Analytic Scaling Formulas for Crossed Laser Acceleration in Vacuum October 6, 4 ARDB Note Analytic Scaling Formulas for Crosse Laser Acceleration in Vacuum Robert J. Noble Stanfor Linear Accelerator Center, Stanfor University 575 San Hill Roa, Menlo Park, California 945

More information

Logarithmic spurious regressions

Logarithmic spurious regressions Logarithmic spurious regressions Robert M. e Jong Michigan State University February 5, 22 Abstract Spurious regressions, i.e. regressions in which an integrate process is regresse on another integrate

More information

LECTURE NOTES ON DVORETZKY S THEOREM

LECTURE NOTES ON DVORETZKY S THEOREM LECTURE NOTES ON DVORETZKY S THEOREM STEVEN HEILMAN Abstract. We present the first half of the paper [S]. In particular, the results below, unless otherwise state, shoul be attribute to G. Schechtman.

More information

Ramsey numbers of some bipartite graphs versus complete graphs

Ramsey numbers of some bipartite graphs versus complete graphs Ramsey numbers of some bipartite graphs versus complete graphs Tao Jiang, Michael Salerno Miami University, Oxfor, OH 45056, USA Abstract. The Ramsey number r(h, K n ) is the smallest positive integer

More information

Discrete Mathematics

Discrete Mathematics Discrete Mathematics 309 (009) 86 869 Contents lists available at ScienceDirect Discrete Mathematics journal homepage: wwwelseviercom/locate/isc Profile vectors in the lattice of subspaces Dániel Gerbner

More information

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations Diophantine Approximations: Examining the Farey Process an its Metho on Proucing Best Approximations Kelly Bowen Introuction When a person hears the phrase irrational number, one oes not think of anything

More information

arxiv: v2 [cs.ds] 11 May 2016

arxiv: v2 [cs.ds] 11 May 2016 Optimizing Star-Convex Functions Jasper C.H. Lee Paul Valiant arxiv:5.04466v2 [cs.ds] May 206 Department of Computer Science Brown University {jasperchlee,paul_valiant}@brown.eu May 3, 206 Abstract We

More information

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7.

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7. Lectures Nine an Ten The WKB Approximation The WKB metho is a powerful tool to obtain solutions for many physical problems It is generally applicable to problems of wave propagation in which the frequency

More information

Counting Lattice Points in Polytopes: The Ehrhart Theory

Counting Lattice Points in Polytopes: The Ehrhart Theory 3 Counting Lattice Points in Polytopes: The Ehrhart Theory Ubi materia, ibi geometria. Johannes Kepler (1571 1630) Given the profusion of examples that gave rise to the polynomial behavior of the integer-point

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

REAL ANALYSIS I HOMEWORK 5

REAL ANALYSIS I HOMEWORK 5 REAL ANALYSIS I HOMEWORK 5 CİHAN BAHRAN The questions are from Stein an Shakarchi s text, Chapter 3. 1. Suppose ϕ is an integrable function on R with R ϕ(x)x = 1. Let K δ(x) = δ ϕ(x/δ), δ > 0. (a) Prove

More information

Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection and System Identification

Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection and System Identification Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection an System Ientification Borhan M Sananaji, Tyrone L Vincent, an Michael B Wakin Abstract In this paper,

More information

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions Working Paper 2013:5 Department of Statistics Computing Exact Confience Coefficients of Simultaneous Confience Intervals for Multinomial Proportions an their Functions Shaobo Jin Working Paper 2013:5

More information

Capacity Analysis of MIMO Systems with Unknown Channel State Information

Capacity Analysis of MIMO Systems with Unknown Channel State Information Capacity Analysis of MIMO Systems with Unknown Channel State Information Jun Zheng an Bhaskar D. Rao Dept. of Electrical an Computer Engineering University of California at San Diego e-mail: juzheng@ucs.eu,

More information

Schrödinger s equation.

Schrödinger s equation. Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of

More information

PDE Notes, Lecture #11

PDE Notes, Lecture #11 PDE Notes, Lecture # from Professor Jalal Shatah s Lectures Febuary 9th, 2009 Sobolev Spaces Recall that for u L loc we can efine the weak erivative Du by Du, φ := udφ φ C0 If v L loc such that Du, φ =

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information

Approximate Constraint Satisfaction Requires Large LP Relaxations

Approximate Constraint Satisfaction Requires Large LP Relaxations Approximate Constraint Satisfaction Requires Large LP Relaxations oah Fleming April 19, 2018 Linear programming is a very powerful tool for attacking optimization problems. Techniques such as the ellipsoi

More information

The Principle of Least Action

The Principle of Least Action Chapter 7. The Principle of Least Action 7.1 Force Methos vs. Energy Methos We have so far stuie two istinct ways of analyzing physics problems: force methos, basically consisting of the application of

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

Slide10 Haykin Chapter 14: Neurodynamics (3rd Ed. Chapter 13)

Slide10 Haykin Chapter 14: Neurodynamics (3rd Ed. Chapter 13) Slie10 Haykin Chapter 14: Neuroynamics (3r E. Chapter 13) CPSC 636-600 Instructor: Yoonsuck Choe Spring 2012 Neural Networks with Temporal Behavior Inclusion of feeback gives temporal characteristics to

More information

arxiv: v4 [cs.ds] 7 Mar 2014

arxiv: v4 [cs.ds] 7 Mar 2014 Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning

More information

SINGULAR PERTURBATION AND STATIONARY SOLUTIONS OF PARABOLIC EQUATIONS IN GAUSS-SOBOLEV SPACES

SINGULAR PERTURBATION AND STATIONARY SOLUTIONS OF PARABOLIC EQUATIONS IN GAUSS-SOBOLEV SPACES Communications on Stochastic Analysis Vol. 2, No. 2 (28) 289-36 Serials Publications www.serialspublications.com SINGULAR PERTURBATION AND STATIONARY SOLUTIONS OF PARABOLIC EQUATIONS IN GAUSS-SOBOLEV SPACES

More information

6 General properties of an autonomous system of two first order ODE

6 General properties of an autonomous system of two first order ODE 6 General properties of an autonomous system of two first orer ODE Here we embark on stuying the autonomous system of two first orer ifferential equations of the form ẋ 1 = f 1 (, x 2 ), ẋ 2 = f 2 (, x

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

Quantum Mechanics in Three Dimensions

Quantum Mechanics in Three Dimensions Physics 342 Lecture 20 Quantum Mechanics in Three Dimensions Lecture 20 Physics 342 Quantum Mechanics I Monay, March 24th, 2008 We begin our spherical solutions with the simplest possible case zero potential.

More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz A note on asymptotic formulae for one-imensional network flow problems Carlos F. Daganzo an Karen R. Smilowitz (to appear in Annals of Operations Research) Abstract This note evelops asymptotic formulae

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

ELECTRON DIFFRACTION

ELECTRON DIFFRACTION ELECTRON DIFFRACTION Electrons : wave or quanta? Measurement of wavelength an momentum of electrons. Introuction Electrons isplay both wave an particle properties. What is the relationship between the

More information

The Press-Schechter mass function

The Press-Schechter mass function The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for

More information

Lecture 6 : Dimensionality Reduction

Lecture 6 : Dimensionality Reduction CPS290: Algorithmic Founations of Data Science February 3, 207 Lecture 6 : Dimensionality Reuction Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will consier the roblem of maing

More information

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks A PAC-Bayesian Approach to Spectrally-Normalize Margin Bouns for Neural Networks Behnam Neyshabur, Srinah Bhojanapalli, Davi McAllester, Nathan Srebro Toyota Technological Institute at Chicago {bneyshabur,

More information

On conditional moments of high-dimensional random vectors given lower-dimensional projections

On conditional moments of high-dimensional random vectors given lower-dimensional projections Submitte to the Bernoulli arxiv:1405.2183v2 [math.st] 6 Sep 2016 On conitional moments of high-imensional ranom vectors given lower-imensional projections LUKAS STEINBERGER an HANNES LEEB Department of

More information

Vectors in two dimensions

Vectors in two dimensions Vectors in two imensions Until now, we have been working in one imension only The main reason for this is to become familiar with the main physical ieas like Newton s secon law, without the aitional complication

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

Chapter 4. Electrostatics of Macroscopic Media

Chapter 4. Electrostatics of Macroscopic Media Chapter 4. Electrostatics of Macroscopic Meia 4.1 Multipole Expansion Approximate potentials at large istances 3 x' x' (x') x x' x x Fig 4.1 We consier the potential in the far-fiel region (see Fig. 4.1

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Sparse Reconstruction of Systems of Ordinary Differential Equations

Sparse Reconstruction of Systems of Ordinary Differential Equations Sparse Reconstruction of Systems of Orinary Differential Equations Manuel Mai a, Mark D. Shattuck b,c, Corey S. O Hern c,a,,e, a Department of Physics, Yale University, New Haven, Connecticut 06520, USA

More information

PHYS 414 Problem Set 2: Turtles all the way down

PHYS 414 Problem Set 2: Turtles all the way down PHYS 414 Problem Set 2: Turtles all the way own This problem set explores the common structure of ynamical theories in statistical physics as you pass from one length an time scale to another. Brownian

More information

Calculus of Variations

Calculus of Variations 16.323 Lecture 5 Calculus of Variations Calculus of Variations Most books cover this material well, but Kirk Chapter 4 oes a particularly nice job. x(t) x* x*+ αδx (1) x*- αδx (1) αδx (1) αδx (1) t f t

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

Tractability results for weighted Banach spaces of smooth functions

Tractability results for weighted Banach spaces of smooth functions Tractability results for weighte Banach spaces of smooth functions Markus Weimar Mathematisches Institut, Universität Jena Ernst-Abbe-Platz 2, 07740 Jena, Germany email: markus.weimar@uni-jena.e March

More information

Chaos, Solitons and Fractals Nonlinear Science, and Nonequilibrium and Complex Phenomena

Chaos, Solitons and Fractals Nonlinear Science, and Nonequilibrium and Complex Phenomena Chaos, Solitons an Fractals (7 64 73 Contents lists available at ScienceDirect Chaos, Solitons an Fractals onlinear Science, an onequilibrium an Complex Phenomena journal homepage: www.elsevier.com/locate/chaos

More information

Conservation Laws. Chapter Conservation of Energy

Conservation Laws. Chapter Conservation of Energy 20 Chapter 3 Conservation Laws In orer to check the physical consistency of the above set of equations governing Maxwell-Lorentz electroynamics [(2.10) an (2.12) or (1.65) an (1.68)], we examine the action

More information

NOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy,

NOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy, NOTES ON EULER-BOOLE SUMMATION JONATHAN M BORWEIN, NEIL J CALKIN, AND DANTE MANNA Abstract We stuy a connection between Euler-MacLaurin Summation an Boole Summation suggeste in an AMM note from 196, which

More information

Iterated Point-Line Configurations Grow Doubly-Exponentially

Iterated Point-Line Configurations Grow Doubly-Exponentially Iterate Point-Line Configurations Grow Doubly-Exponentially Joshua Cooper an Mark Walters July 9, 008 Abstract Begin with a set of four points in the real plane in general position. A to this collection

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

Function Spaces. 1 Hilbert Spaces

Function Spaces. 1 Hilbert Spaces Function Spaces A function space is a set of functions F that has some structure. Often a nonparametric regression function or classifier is chosen to lie in some function space, where the assume structure

More information

LeChatelier Dynamics

LeChatelier Dynamics LeChatelier Dynamics Robert Gilmore Physics Department, Drexel University, Philaelphia, Pennsylvania 1914, USA (Date: June 12, 28, Levine Birthay Party: To be submitte.) Dynamics of the relaxation of a

More information

Closed and Open Loop Optimal Control of Buffer and Energy of a Wireless Device

Closed and Open Loop Optimal Control of Buffer and Energy of a Wireless Device Close an Open Loop Optimal Control of Buffer an Energy of a Wireless Device V. S. Borkar School of Technology an Computer Science TIFR, umbai, Inia. borkar@tifr.res.in A. A. Kherani B. J. Prabhu INRIA

More information

05 The Continuum Limit and the Wave Equation

05 The Continuum Limit and the Wave Equation Utah State University DigitalCommons@USU Founations of Wave Phenomena Physics, Department of 1-1-2004 05 The Continuum Limit an the Wave Equation Charles G. Torre Department of Physics, Utah State University,

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

inflow outflow Part I. Regular tasks for MAE598/494 Task 1

inflow outflow Part I. Regular tasks for MAE598/494 Task 1 MAE 494/598, Fall 2016 Project #1 (Regular tasks = 20 points) Har copy of report is ue at the start of class on the ue ate. The rules on collaboration will be release separately. Please always follow the

More information

Analyzing Tensor Power Method Dynamics in Overcomplete Regime

Analyzing Tensor Power Method Dynamics in Overcomplete Regime Journal of Machine Learning Research 18 (2017) 1-40 Submitte 9/15; Revise 11/16; Publishe 4/17 Analyzing Tensor Power Metho Dynamics in Overcomplete Regime Animashree Ananumar Department of Electrical

More information

3 The variational formulation of elliptic PDEs

3 The variational formulation of elliptic PDEs Chapter 3 The variational formulation of elliptic PDEs We now begin the theoretical stuy of elliptic partial ifferential equations an bounary value problems. We will focus on one approach, which is calle

More information

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -

More information

The total derivative. Chapter Lagrangian and Eulerian approaches

The total derivative. Chapter Lagrangian and Eulerian approaches Chapter 5 The total erivative 51 Lagrangian an Eulerian approaches The representation of a flui through scalar or vector fiels means that each physical quantity uner consieration is escribe as a function

More information

Entanglement is not very useful for estimating multiple phases

Entanglement is not very useful for estimating multiple phases PHYSICAL REVIEW A 70, 032310 (2004) Entanglement is not very useful for estimating multiple phases Manuel A. Ballester* Department of Mathematics, University of Utrecht, Box 80010, 3508 TA Utrecht, The

More information

arxiv: v2 [cond-mat.stat-mech] 11 Nov 2016

arxiv: v2 [cond-mat.stat-mech] 11 Nov 2016 Noname manuscript No. (will be inserte by the eitor) Scaling properties of the number of ranom sequential asorption iterations neee to generate saturate ranom packing arxiv:607.06668v2 [con-mat.stat-mech]

More information

Convergence rates of moment-sum-of-squares hierarchies for optimal control problems

Convergence rates of moment-sum-of-squares hierarchies for optimal control problems Convergence rates of moment-sum-of-squares hierarchies for optimal control problems Milan Kora 1, Diier Henrion 2,3,4, Colin N. Jones 1 Draft of September 8, 2016 Abstract We stuy the convergence rate

More information

Hyperbolic Moment Equations Using Quadrature-Based Projection Methods

Hyperbolic Moment Equations Using Quadrature-Based Projection Methods Hyperbolic Moment Equations Using Quarature-Base Projection Methos J. Koellermeier an M. Torrilhon Department of Mathematics, RWTH Aachen University, Aachen, Germany Abstract. Kinetic equations like the

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information